CN116010619A - Knowledge extraction method in complex equipment knowledge graph construction process - Google Patents


Publication number
CN116010619A
Authority
CN
China
Prior art keywords
data
knowledge
extraction
model
entity
Prior art date
Legal status
Pending
Application number
CN202211738751.4A
Other languages
Chinese (zh)
Inventor
Zhang Haizhu (张海柱)
Ding Guofu (丁国富)
Wang Shuying (王淑营)
Zheng Qing (郑庆)
Ma Zili (马自力)
Current Assignee
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202211738751.4A
Publication of CN116010619A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a knowledge extraction method for the process of constructing a knowledge graph of complex equipment. By classifying data into different types and extracting each type with a dedicated method, the method greatly reduces the time spent on knowledge extraction. For unstructured data, the most difficult to process, several deep learning models and neural networks are employed: using several different training sets, massive heterogeneous data are labeled, faults are classified, entities are recognized and relations are extracted. Knowledge extraction from unstructured data thereby becomes more convenient and more accurate, and queries on the fault-maintenance knowledge of complex equipment become more comprehensive and convenient.

Description

Knowledge extraction method in complex equipment knowledge graph construction process
Technical Field
The invention belongs to the technical field of complex equipment design, and particularly relates to a knowledge extraction method in a complex equipment knowledge graph construction process.
Background
Products in fields such as electric power, engineering machinery and rail transit are typical complex equipment: they must serve safely and reliably over a long life cycle under complex working conditions, and their research and development is a complex systems-engineering task. Each stage of the development period generates various data such as models, business records and sensor perceptions, and these data grow explosively as the manufacturing and operation-and-maintenance processes are continuously sensed and analyzed. A digital twin of the equipment is data-driven, so fusing multi-source data is the first problem to be solved; a purely data-driven analysis, however, ignores experiential knowledge. How to form the "brain" of complex equipment, fuse multi-source data and mine the latent relations within it is critical to constructing the digital twin of complex equipment.
As an important branch of the new generation of artificial-intelligence technology, the knowledge graph, known as "learned AI", is often used to fuse multi-source data into large-scale knowledge bases. A knowledge graph is a semantic network that expresses semantic relations among entities; it is well suited to describing the complex spatial structure of the multi-level systems, subsystems, components and parts of complex equipment, and by accessing, analyzing and mining multi-source heterogeneous time-series sensor data it can yield more comprehensive spatio-temporal correlation knowledge.
However, the structure of a complex equipment system is multi-layered; at the same time, the data of a complex product come from different sources, in different structural forms, at each stage of the full life cycle, so the data are multi-source and heterogeneous. Knowledge data of complex products at the various stages of the full life cycle include documents, pictures, tables, models, files and so on: not only structured data but also semi-structured and unstructured data. Such data originate not only from the stages of the full life cycle but also from the various levels (systems, subsystems, components, parts) of the complex equipment system architecture. Knowledge extraction is therefore very difficult: the amount of data to process is large, and omissions occur easily.
Disclosure of Invention
In view of the above, the invention provides a method that classifies data accurately, has a high degree of automation, can extract knowledge from the massive heterogeneous data of each stage of the full life cycle, and lays a data foundation for building a knowledge graph of complex equipment.
A knowledge data extraction method in a complex equipment knowledge graph construction process comprises the following steps:
performing data mining to obtain knowledge data types, and performing knowledge extraction based on the data types;
if the data type is structured data, retrieving the data stored in the corresponding tables of the relational database to obtain corresponding first knowledge data;
if the data type is semi-structured data, extracting corresponding second knowledge data through a network data extraction template and a non-network data extraction template;
and if the data type is unstructured data, performing labeling, entity recognition, text classification and relation extraction on the data through deep learning models, so as to obtain corresponding third knowledge data.
Preferably, the data mining specifically includes:
classifying according to the types of the data of each stage of the whole life cycle of the product;
the data of each stage is divided into structured data, semi-structured data and unstructured data according to data structuring.
Preferably, retrieving the data stored in the tables of the relational database is specifically:
using a D2R tool for conversion, mapping the table names of the database directly to classes in the Resource Description Framework, mapping the fields to attributes of those classes, and deriving the relations among the classes from the tables that represent relations.
Preferably, the extracting the corresponding knowledge data through the network data extraction template and the non-network data extraction template specifically includes:
inputting the network data into a wrapper, and extracting knowledge data in the network data; the wrapper is encapsulated by a markup language and a path language;
and inputting the non-network data into a template library, and performing template matching to extract knowledge data in the non-network data.
Preferably, the template library is a structured template library created by generalizing the structure of the tables or process data in the complex equipment data.
Preferably, the method further comprises the step of inputting plain text data in second knowledge data obtained after extraction of the network data and the non-network data into an unstructured text knowledge extraction model for entity extraction.
Preferably, labeling the data through the deep learning model specifically includes:
and labeling the entity, the relation, the sequence and the fault type of the data through a text labeling tool to obtain labeled training data.
Preferably, the entity identification is specifically:
inputting the labeled training data into a model, and training by using a BERT model to obtain word vectors containing semantic information;
inputting the word vector into a bi-directional neural network;
and carrying out entity identification on the unlabeled test set data by using the deep learning model.
Preferably, the text classification specifically includes:
inputting the labeled training data into a word2vec model and training it to obtain a word-vector model, then using the word vectors as the input of a textCNN model;
performing feature extraction in the convolution layer and pooling in the pooling layer;
and outputting the probability of each category through a fully connected layer and softmax, and classifying the unlabeled test set with the resulting model.
Preferably, the relation extraction is specifically:
acquiring OWL files of the ontology;
analyzing to obtain concept, relation and concept triples;
acquiring a CSV file of an entity identification result;
reading the concept corresponding to the CSV header according to the concept, the relation and the concept triplet;
reading an entity corresponding to the concept;
and establishing entity, relation and entity triples, storing the entity, relation and entity triples into a database, and completing relation extraction.
The invention provides a knowledge data extraction method in the process of constructing a knowledge graph of complex equipment, comprising the following steps: performing data mining to obtain the knowledge data types, and performing knowledge extraction based on those types; if the data type is structured data, retrieving the data stored in the corresponding tables of the relational database to obtain corresponding first knowledge data; if the data type is semi-structured data, extracting corresponding second knowledge data through a network data extraction template and a non-network data extraction template; and if the data type is unstructured data, performing labeling, entity recognition, text classification and relation extraction on the data through deep learning models to obtain corresponding third knowledge data. By classifying data into different types and extracting each type with a dedicated method, the provided method greatly reduces the time spent on knowledge extraction. For unstructured data, the most difficult to process, several deep learning models and neural networks are employed: using several different training sets, the massive heterogeneous data are labeled, faults classified, entities recognized and relations extracted, making knowledge extraction from unstructured data more convenient and accurate and making queries on the fault-maintenance knowledge of complex equipment more comprehensive and convenient.
Drawings
FIG. 1 is a flow chart of a knowledge extraction process disclosed in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data annotation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an entity recognition model framework according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of BILSTM layer in the entity recognition model according to the embodiment of the present invention;
FIG. 5 is a schematic diagram of a classification training model according to an embodiment of the present invention;
FIG. 6 is a graph showing the effect of entity recognition under different learning rates of the recognition model according to the embodiment of the present invention;
FIG. 7 is a graph showing the entity recognition effect of the recognition model under different dropout parameters according to the embodiment of the present invention;
fig. 8 is a schematic diagram of a relationship extraction flow disclosed in an embodiment of the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the following specific embodiments.
The invention solves the problems in the prior art by providing a knowledge data extraction method for the process of constructing a knowledge graph of complex equipment, comprising the following steps: performing data mining to obtain the knowledge data types, and performing knowledge extraction based on those types; if the data type is structured data, retrieving the data stored in the corresponding tables of the relational database to obtain corresponding first knowledge data; if the data type is semi-structured data, extracting corresponding second knowledge data through a network data extraction template and a non-network data extraction template; and if the data type is unstructured data, performing labeling, entity recognition, text classification and relation extraction on the data through deep learning models to obtain corresponding third knowledge data.
The structure of a complex equipment system is multi-layered; at the same time, complex products draw on different data sources, in different structural forms, at each stage of the full life cycle, so the data are multi-source and heterogeneous. Knowledge data of complex products at the various stages of the full life cycle include documents, pictures, tables, models, files and so on: not only structured data but also semi-structured and unstructured data. Such data originate not only from the stages of the full life cycle but also from the various levels (systems, subsystems, components, parts) of the complex equipment system architecture. Before knowledge extraction, the data sources are therefore first classified and identified.
the invention aims at a multi-stage multi-source heterogeneous knowledge data research structuring, semi-structuring and unstructured knowledge extraction method of complex equipment, wherein structured data is preferably derived from table storage data in a database, semi-structured data is derived from tables, web pages and the like, and unstructured data is derived from various knowledge data in a text form. A complex equipment multi-source heterogeneous data knowledge extraction framework is shown in fig. 1. The knowledge extraction overall framework is divided into two parts: and (5) multi-source heterogeneous data mining and knowledge extraction models.
According to the invention, the multi-source heterogeneous data mining is specifically: classifying according to the types of the data of each stage in the whole life cycle of the product, and dividing the data of each stage into structured data, semi-structured data and unstructured data according to the data structure.
Different knowledge extraction methods are adopted for different data types. If the data type is structured data, a D2R tool is preferably used for conversion; D2R (Database to RDF) is one way of extracting knowledge from a relational database. Database table names are mapped directly to classes in RDF (the Resource Description Framework), and fields are mapped to attributes of those classes. The relations between classes can be derived from the tables that represent relations; the use of the tool can be appreciated by those skilled in the art.
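The D2R-style mapping described above (table name to class, fields to properties) can be sketched without any RDF library; the `ex:` namespace, table name and row contents below are invented for illustration, not taken from the patent:

```python
# Minimal sketch of the D2R idea: a relational table maps to an RDF class,
# its columns map to properties, and each row becomes a set of triples.
def table_to_triples(table_name, rows, namespace="ex:"):
    cls = namespace + table_name.capitalize()  # table name -> RDF class
    triples = []
    for row in rows:
        subject = f"{namespace}{table_name}/{row['id']}"
        triples.append((subject, "rdf:type", cls))
        for field, value in row.items():       # fields -> class attributes
            if field != "id":
                triples.append((subject, namespace + field, value))
    return triples

rows = [{"id": 1, "name": "bogie frame", "material": "steel"}]
triples = table_to_triples("component", rows)
for t in triples:
    print(t)
```

A real D2R deployment would additionally read foreign-key tables to emit relations between classes, as the description notes.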
Apart from structured data, most knowledge data of complex equipment come from unstructured and semi-structured text: technical standards, fault text records and the like are unstructured text data, while process routes, manufacturing processes, web-page data and the like are semi-structured text data. According to structure and characteristics, semi-structured text data can be divided into three types: web pages, scientific and technical literature, and other types. Web-page data and scientific and technical literature exist in web-page form, while the other types, such as data stored in tables or in text with a certain structure within complex equipment data, have certain structural characteristics: their content follows a text format that can be generalized into a set of terms, combining fixed format with free text. What these kinds of knowledge have in common is a certain framework, with the content stored as text. For data with such structure, the invention acquires knowledge points by template matching, while for text requiring further entity extraction it applies a deep learning method. The flow framework of the method is shown in figs. 2-6. The process comprises two parts, a web-page data extraction template and a non-web-page data extraction template; plain-text data that still needs knowledge extraction is input into the unstructured-text knowledge extraction model for entity extraction.
According to the invention, the extraction of the corresponding knowledge data through the network data extraction template and the non-network data extraction template is specifically as follows: inputting the network data into a wrapper, and extracting knowledge data in the network data; the wrapper is encapsulated by a markup language and a path language; and inputting the non-network data into a template library, and performing template matching to extract knowledge data in the non-network data. Further the template library shape is a structured template library created by generalized total structuring of tables or process data in complex equipment data.
If further extraction is required, the plain-text data in the second knowledge data obtained after extraction of the network data and the non-network data is input into the unstructured-text knowledge extraction model for entity extraction.
More preferably, when selecting the extraction template, the method analyzes the specific situation of the obtained semi-structured data. For example, unstructured data in web pages exists in forms such as HTML and XML; since these are built with markup languages, which are regular in structure, XPath and CSS-selector expressions can be written, in the manner of regular expressions, to extract elements from the page. These expressions are generalized and encapsulated into a wrapper. XPath (XML Path Language) is a language for addressing parts of an XML document; it is based on the tree structure of XML and provides the ability to find nodes in the document tree. The original purpose of XPath was to serve as a common syntax model between XPointer and XSL, but it was quickly adopted by developers as a small query language, and the required knowledge data are parsed from the web page by the XPath method.
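As a rough illustration of wrapper-style extraction, the limited XPath subset of Python's standard-library `ElementTree` can stand in for a full XPath engine (a real crawler would use something like lxml); the page fragment and its class names below are invented:

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment standing in for a crawled web page.
page = """<html><body>
<div class="fault"><span>gearbox oil leak</span></div>
<div class="fault"><span>axle box temperature rise</span></div>
</body></html>"""

root = ET.fromstring(page)
# XPath-like expression: every <span> under a <div class="fault">.
faults = [span.text for span in root.findall(".//div[@class='fault']/span")]
print(faults)
```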
The non-web-page data extraction template targets data stored with a certain structure in complex equipment data. Such data carries structural information, which is generalized into a structured template library; knowledge is then obtained by matching templates from the library, written as regular expressions, against the data.
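A small sketch of such a template library: each entry pairs a structural template (a regular expression) with the knowledge slots it captures. The "process step" pattern and the process-route text are invented examples, not templates from the patent:

```python
import re

# Template library: name -> compiled regular expression with named slots.
templates = {
    "process_step": re.compile(r"step\s*(?P<no>\d+)[::]\s*(?P<operation>\S+)"),
}

text = "step 1: milling step 2: drilling step 3: grinding"
# Template matching: every occurrence yields one knowledge point.
steps = [m.groupdict() for m in templates["process_step"].finditer(text)]
print(steps)
```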
If the data type is unstructured data, the knowledge extraction process is relatively more complex and more cases must be considered. Entity extraction, also called named entity recognition, aims to extract entity information elements from text and is the basis for solving many NLP tasks. Entity extraction methods include rule-based, statistical-learning-based and deep-learning-based methods. The invention performs entity extraction from the fault text data of the operation-and-maintenance stage based on a BERT+BiLSTM-CRF deep learning model.
Labeling the data for the deep learning model specifically comprises the following steps:
and labeling the entity, the relation, the sequence and the fault type of the data through a text labeling tool to obtain labeled training data.
Preferably, the entity identification is specifically:
inputting the labeled training data into a model, and training by using a BERT model to obtain word vectors containing semantic information;
inputting the word vector into a bi-directional neural network;
and carrying out entity identification on the unlabeled test set data by using the deep learning model.
In the fault stage, different failure modes call for different maintenance methods and strategies, while the same failure mode usually shares the same ones. To acquire faults more accurately, let faults guide maintenance, and classify failure modes from fault text data, the invention performs text classification with a word2vec+textCNN deep learning model. The method is specifically: inputting the labeled training data into a word2vec model and training it to obtain a word-vector model, then using the word vectors as the input of a textCNN model; performing feature extraction in the convolution layer and pooling in the pooling layer; and outputting the probability of each category through a fully connected layer and softmax, then classifying the unlabeled test set with the resulting model.
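The word2vec+textCNN pipeline just described can be sketched, dependency-free, with scalar stand-ins: a toy word-vector table replaces the trained word2vec model, one convolution filter slides over word windows, max-over-time pooling follows, and a fully connected softmax outputs class probabilities. All vectors and weights are random placeholders, not trained values:

```python
import math
import random

random.seed(0)
DIM, CLASSES, KERNEL = 4, 3, 2  # embedding dim, fault classes, filter width

# word2vec stand-in: random word vectors for a tiny vocabulary.
vocab = {w: [random.uniform(-1, 1) for _ in range(DIM)]
         for w in ["gearbox", "oil", "leak", "crack"]}
conv_w = [random.uniform(-1, 1) for _ in range(DIM * KERNEL)]  # one filter
fc_w = [random.uniform(-1, 1) for _ in range(CLASSES)]         # FC layer

def classify(tokens):
    vecs = [vocab[t] for t in tokens]
    feats = []
    for i in range(len(vecs) - KERNEL + 1):       # convolution over windows
        window = [x for v in vecs[i:i + KERNEL] for x in v]
        feats.append(sum(w * x for w, x in zip(conv_w, window)))
    pooled = max(feats)                           # max-over-time pooling
    logits = [w * pooled for w in fc_w]           # fully connected layer
    exps = [math.exp(z) for z in logits]          # softmax
    total = sum(exps)
    return [e / total for e in exps]

probs = classify(["gearbox", "oil", "leak"])
print(probs)
```

A trained textCNN would use many filters of several widths and learned weights; the arithmetic per filter is exactly as above.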
After the fault text data have been classified by failure mode, each failure mode can be regarded as an entity or a concept, and the recognized entities and the failure-mode classes can then be connected through relations. Entity recognition, fault classification and relation extraction must all be completed to solve the prior-art problems.
Relationship extraction is one of important subtasks of knowledge extraction, and is oriented to unstructured text data, and relationship extraction is to extract semantic relationships between two or more entities from text. The relationship extraction is closely related to the entity extraction of unstructured text data, and the possible relationship between entities is extracted after the entities in the text are identified.
At present, relation extraction methods can be classified into template-based methods, supervised-learning methods and weakly-supervised-learning methods. The invention uses an ontology to formally express the concepts and semantic relations of the schema layer.
The ontology is the basis of the knowledge graph's semantic model; it normalizes and constrains the data filling of the knowledge graph at the semantic level, so that the knowledge graph can be constructed and supported at scale. By reading the OWL file (ontology model file) of the ontology, (concept, relationship, concept) triples are parsed out; meanwhile, each entity recognition result is of the form (concept, entity). By matching the entity concepts in the recognition results against the parsed concepts, relations are established and relation acquisition is realized.
the knowledge graph serves as a semantic network and can establish the connection among a plurality of entities, wherein the semantic relationship plays an important role. In the construction of the knowledge graph data pattern layer, semantic relations such as "belongs to", "occurs", "phenomenon", "failure source", "failure time", "travel kilometers" and the like are defined.
The complex-equipment knowledge-graph relation extraction method is realized on the basis of named entity recognition and ontology modeling, with the flow shown in fig. 8. First the ontology model file and the entities obtained by entity recognition are acquired; then the corresponding entities are retrieved according to the concepts in the ontology; finally, a relation between two entities is established according to the relation between their two concepts in the ontology, connecting entities that were originally recognized separately and associating entities classified from different texts.
The relation extraction specifically comprises the following steps: acquiring the OWL file of the ontology; parsing it to obtain (concept, relation, concept) triples; acquiring the CSV file of the entity recognition result; reading the concepts corresponding to the CSV headers according to the (concept, relation, concept) triples; reading the entities corresponding to each concept; and establishing (entity, relation, entity) triples and storing them into the database, completing the relation extraction.
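These steps condense to a few lines once the OWL file has been parsed; below, a hard-coded schema triple stands in for the parsed ontology and an in-memory CSV stands in for the entity-recognition result whose headers are concepts and whose cells are entities. All concept, relation and entity names are invented (modeled on the document's "failure source"/"failure phenomenon" example):

```python
import csv
import io

# Stand-in for (concept, relation, concept) triples parsed from the OWL file.
schema = [("FailureSource", "occurs", "FailurePhenomenon")]

# Stand-in for the entity-recognition CSV: headers are concepts,
# cells are the recognized entities.
csv_text = "FailureSource,FailurePhenomenon\n01 car gearbox,oil leak\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

triples = []
for head, rel, tail in schema:
    for row in rows:
        if row.get(head) and row.get(tail):
            # (entity, relation, entity) triple, ready to store.
            triples.append((row[head], rel, row[tail]))
print(triples)
```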
The following explains the specific embodiment of the present invention by taking the knowledge extraction process at the operation and maintenance stage of the bogie of the high-speed train as an example.
Data acquisition: the data sources of the knowledge graph constructed by the invention comprise fault text data and Baidu Baike encyclopedia, thesis and patent data about the operation-and-maintenance stage of the high-speed train bogie. Knowledge crawling takes the multi-level structural names of the high-speed train's bogie system, subsystems, components and parts as keywords; the semi-structured web-page data are crawled with the scrapy framework and stored in a MySQL relational database.
For the full texts of journal papers and patents, full-text index links are stored. The fault text data exists in text form and requires NLP (natural language processing) technology for knowledge extraction.
For unstructured data, the invention uses NLP technology to accomplish two tasks, entity recognition and text classification. Since the models are supervised deep learning models, a large amount of labeled data is needed. For the entity recognition model, the brat labeling tool is used to label entities and relations; the label codes defined for entities are shown in Table 1. Sequence labeling is performed in the BIO mode.
Table 1 entity annotation coding
(Table 1 is reproduced as an image in the original document.)
The brat labeling tool is a text annotation tool that runs on a web server under Linux. It can annotate entities, relations, events and attributes; it requires no installation, supports multi-person collaboration and offers a visual operating interface, for which it has received wide attention. Because the data-labeling workload was huge, the entities and relations were labeled collaboratively by several people; part of the labeling result is shown in fig. 2.
BIO sequence labeling: sequence labeling is one of the basic problems often encountered in NLP. In sequence labeling, each element of a sequence is assigned a label; generally the sequence is a sentence and the elements are the words in it. Information extraction, such as extracting a meeting's time and place, can for example be treated as a sequence labeling problem. BIO labeling tags each element as "B-X", "I-X" or "O": "B-X" means the element's fragment is of type X and the element is at the beginning of the fragment, "I-X" means the fragment is of type X and the element is in the middle of the fragment, and "O" means the element belongs to no type.
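The BIO scheme can be shown in a few lines: labeled entity spans become B-X/I-X tags and everything else gets O. The character-level tokens (as in Chinese NER), the example span, and the `mal` type (following the document's code for "failure phenomenon") are all hypothetical here:

```python
def bio_tag(tokens, spans):
    """spans: list of (start, end, type) with end exclusive."""
    tags = ["O"] * len(tokens)          # default: outside any entity
    for start, end, etype in spans:
        tags[start] = "B-" + etype      # beginning of the fragment
        for i in range(start + 1, end):
            tags[i] = "I-" + etype      # inside the fragment
    return tags

tokens = list("gearbox oil leak")       # character-level, as in Chinese NER
# hypothetical span: characters 8-16 ("oil leak") as a failure phenomenon
tags = bio_tag(tokens, [(8, 16, "mal")])
print(tags[8], tags[9], tags[0])
```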
Fault type labeling: the invention sorted the fault text, studied the related data and, drawing on expert experience, organized 14 failure modes; the fault texts were labeled accordingly, and part of the labeling result is shown in Table 2. The failure modes include: deformation, fracture, bulge, lateral acceleration overrun, cracking, oil leakage, oil seepage, loosening, damage, temperature rise, foreign matter and axle temperature measurement failure.
Table 2 failure mode labels
(Table 2 is reproduced as an image in the original document.)
After labeling, named entity recognition is performed on the data.
The traditional word2vec language model cannot resolve polysemy. To address the polysemy that named entities may exhibit, text features are extracted with the BERT language model to obtain a character-granularity vector matrix, context information is then extracted with BiLSTM, and the globally optimal label sequence is obtained in combination with a CRF model. The model framework is shown in fig. 3.
The model mainly comprises three layers: a BERT pre-trained language model layer, a bidirectional LSTM layer and a CRF layer.
(1) BERT model
BERT is a pre-trained language representation model for natural language processing. BERT takes the word vectors obtained by initializing the input text as a sequence X = (x_1, x_2, x_3, ..., x_n); the resulting character vectors exploit the interrelations among words to extract the features in the text effectively. BERT pre-trains with a self-attention-based architecture, fusing the left and right context on the basis of all layers to pre-train deep bidirectional representations. Compared with earlier pre-training models, its text vectors capture contextual information in the true sense and can learn the relations between consecutive text fragments.
(2) BILSTM layer
This layer automatically extracts sentence features; its specific structure is shown in fig. 4. The n-dimensional word vectors obtained from the BERT layer are fed as the input of each time step of the bidirectional long short-term memory network, and the output is a score for every label of every word of the sentence. The recurrent neural network (RNN) performs well on text, but its weak access to context information causes the gradient-vanishing problem. The long short-term memory network (LSTM) solves this by introducing a memory cell and a gate mechanism: the gates update the information in the memory cell, and the memory cell stores historical information.
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)  (1)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)  (2)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)  (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_c x_t + U_c h_{t-1} + b_c)  (4)
h_t = o_t ⊙ tanh(c_t)  (5)
where g is an activation function; i, f and o denote the input gate, the forget gate and the output gate; and c is the memory cell. A unidirectional LSTM, however, shares a drawback with the RNN: information still propagates in one direction only, so forward and backward features cannot be considered at the same time and context information is partly ignored. The bidirectional LSTM (BiLSTM) addresses this; its main idea is to also feed the observed sequence into an LSTM in reverse, which amounts to introducing two LSTM networks, applying one forward LSTM and one backward LSTM to each sequence, and combining the two outputs into the final result. The forward LSTM network receives the inputs x_1 through x_t in temporal order and produces the forward hidden-state sequence h→_t; the backward LSTM network receives the inputs x_t through x_1 and produces the backward hidden-state sequence h←_t.
The two state sequences are concatenated into a state sequence h_t = [h→_t ; h←_t], which is then mapped to a k-dimensional vector, where k is the number of labels in the tag set. After normalization this yields a vector of k predicted values, each representing the probability that the current word carries a particular label.
The bidirectional LSTM thus combines the forward and backward features into a single bidirectional representation; compared with a unidirectional LSTM, these additional features improve labeling accuracy. It captures context information efficiently and also has a marked advantage in fitting nonlinearities.
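A minimal sketch of the two-pass idea, assuming a simplified scalar `step` function in place of a full LSTM cell; the `softmax` helper shows the normalization over the k label scores mentioned above.

```python
import math

def softmax(scores):
    """Normalize k label scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def bilstm_states(seq, step):
    """Pair each position's forward hidden state with its backward one.
    `step(x, h) -> h'` is a toy stand-in for a full LSTM cell update."""
    fwd, h = [], 0.0
    for x in seq:                 # x_1 ... x_t, forward pass
        h = step(x, h)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(seq):       # x_t ... x_1, backward pass
        h = step(x, h)
        bwd.append(h)
    bwd.reverse()                 # realign with forward positions
    return list(zip(fwd, bwd))    # concatenated state h_t = [forward_t; backward_t]
```

Each position's pair carries features from both directions, which is exactly what the unidirectional LSTM cannot provide.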
(3) CRF layer
Entity recognition starts by part-of-speech tagging the corpus and marking word boundaries, after which the acquired language data are further annotated with the BIO tagging scheme, where "B" marks the beginning of a noun phrase, "I" its interior, and "O" a token outside any noun phrase. Taking the sentence "remote supervision finds 01 car gearbox oil leak" as an example, Table 3 shows the labeling process, where "source", "BJ" and "mal" denote the entity types "failure source", "component" and "failure phenomenon", respectively.
TABLE 3 BIO sequence tags
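The BIO labeling procedure can be sketched as follows; the token spans and the type names `source`, `BJ`, `mal` mirror the example of Table 3, and the helper itself is illustrative rather than the invention's annotation tool.

```python
def bio_tags(tokens, entities):
    """Label a token list with the BIO scheme.
    `entities` holds (start, end, type) token spans, end exclusive."""
    tags = ["O"] * len(tokens)                 # default: outside any entity
    for start, end, etype in entities:
        tags[start] = "B-" + etype             # beginning of the phrase
        for i in range(start + 1, end):
            tags[i] = "I-" + etype             # interior of the phrase
    return tags
```

Applied to the example sentence, the spans for "remote supervision", "gearbox" and "oil leak" yield the B-/I-/O pattern of Table 3.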
The CRF layer completes the sequence labeling of the sentence. The BiLSTM layer outputs a predicted score for each label; this layer takes those scores as input and predicts the labeling sequence with the maximum probability. For an observation sequence X, the score of its predicted sequence y = (y_1, y_2, ..., y_n) is defined as:
s(X, y) = Σ_(i=0..n) A_(y_i, y_(i+1)) + Σ_(i=1..n) P_(i, y_i)    (6)
where P is the score matrix output by the bidirectional LSTM layer, P_(i,j) being the score of the j-th label for the i-th word of the sentence, and A is the transition-probability matrix, A_(i,j) being the transition probability from label i to label j.
For each training sample, the scores of all possible labeling sequences can be derived and normalized with the following equation:
p(y|X) = exp(s(X, y)) / Σ_(ỹ∈Y_X) exp(s(X, ỹ))    (7)
The parameters of the model are continually adjusted to maximize the following log-likelihood function.
log p(y|X) = s(X, y) − log Σ_(ỹ∈Y_X) exp(s(X, ỹ))    (8)
After model training is completed, the labeling sequence that maximizes the following function is taken as the optimal labeling sequence.
y* = argmax_(ỹ∈Y_X) s(X, ỹ)    (9)
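A toy sketch of the CRF scoring and decoding described above, assuming small emission and transition matrices `P` and `A`; a practical implementation would use Viterbi decoding rather than this brute-force argmax.

```python
from itertools import product

def crf_score(P, A, y):
    """Score s(X, y): emission scores P[i][y_i] for each position plus
    transition scores A[y_i][y_{i+1}] between consecutive labels."""
    emit = sum(P[i][tag] for i, tag in enumerate(y))
    trans = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def best_sequence(P, A):
    """Optimal labeling sequence by exhaustive search over all k^n sequences."""
    n, k = len(P), len(P[0])
    return max(product(range(k), repeat=n), key=lambda y: crf_score(P, A, y))
```

With zero transition scores the best sequence just follows the per-word emission maxima; a strong transition score can override them, which is exactly how the CRF enforces valid label orderings.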
This layer is not strictly necessary for named entity recognition, but the CRF guarantees valid label transitions and yields the globally optimal labeling sequence, thereby improving recognition accuracy.
Model training: the invention inputs the labeled training data into the model, trains the BERT model to obtain word-vector representations containing semantic information, feeds the word vectors output by BERT into the bidirectional neural network, and selects suitable model parameters during training according to given evaluation criteria. Finally, the selected model performs entity recognition on the unlabeled test-set data.
After entity recognition, the fault types are classified and the knowledge graph of the high-speed train bogie fault domain is constructed, which not only provides knowledge support for fault queries but also guidance for fault maintenance. For the same fault, the fault phenomena may differ while the maintenance method is the same: for example, the phenomena "gearbox oil leakage" and "damper oil leakage" share the maintenance method "procedure 1: flaw detection of the gearbox", and both phenomena belong to the same fault mode "oil leak". To better guide maintenance, the fault text is classified by fault mode: a classification model based on the deep learning combination word2vec+TextCNN is trained and classifies the input text by fault mode. The model framework is shown in fig. 5.
The model mainly has two layers, namely a word embedding (embedding) layer and a TextCNN (convolutional neural network) layer.
Word embedding (embedding) layer
When training with neural networks, text must first be converted into a form the model can understand. This layer represents the text as low-dimensional, dense word vectors; such vectors carry semantic information and contain features useful for classification. The Word2vec model is constructed as follows:
1) Word segmentation and stop-word removal: sentences are segmented with a word segmenter for the high-speed train bogie domain, obtained by training a segmenter based on BiLSTM-CRF-TextCNN. The training process and effect of this segmenter are described in detail in the third section.
2) Constructing a dictionary: the whole text is traversed and the occurrence frequency of every word in the text is counted.
3) Constructing a Huffman tree: a Huffman tree is built according to the occurrence probabilities of the words, with all classes located at the leaf nodes of the tree.
4) Training word vectors: binary codes are generated for the leaf nodes of the Huffman tree, and the word vectors in each non-leaf node and leaf node are initialized and then trained.
5) Testing the generated model.
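Step 3) can be sketched with a standard heap-based Huffman construction; the `huffman_codes` helper and its tie-breaking counter are illustrative assumptions, not the patent's word2vec code.

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Build a Huffman tree from word frequencies and return the binary
    code of each word; words sit on the leaves, as described in step 3)."""
    tick = count()                       # tie-breaker so the heap never compares nodes
    heap = [(f, next(tick), w) for w, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                 # repeatedly merge the two rarest nodes
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: recurse into both subtrees
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: record the word's binary code
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

Frequent words get short codes, which is what makes the hierarchical-softmax training in step 4) efficient.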
The textCNN (convolutional neural network) layer takes the word vectors output by the embedding layer as model input, extracts features with convolution kernels of different sizes, fixes the vector dimensionality with a pooling layer, and finally outputs the probability of each category through a fully connected softmax layer.
1) Feature extraction: features are extracted from the data with different convolution kernels; the sliding windows, i.e. the convolution kernels, have sizes 3, 4 and 5.
2) Pooling layer: after feature extraction, a max-pooling layer fixes the dimensionality of the input vector for convenient computation.
3) Fully connected layer: after feature extraction and max pooling, the probability of each category is output through a fully connected softmax layer.
Model training: the labeled training data are input into the model; a word2vec model is trained to obtain word vectors, which are fed into the textCNN for convolution-layer feature extraction and pooling; finally the trained model outputs the probability of each class through Softmax (Softmax does not single out one hard maximum but assigns each output class a probability value expressing how likely the sample belongs to it). The classification model is likewise a neural network whose parameters must be tuned during training according to the corresponding evaluation criteria.
Specific effect experiment and data analysis:
the experiments of the invention mainly use the TensorFlow deep learning framework. First, the model parameters are set: the number of hidden units of the bidirectional LSTM layer is set to 100 per direction, so the concatenated output vector of the model has dimension 200. The Adam algorithm is used as the optimizer.
Some parameters of the neural network must be tuned manually and repeatedly during model training, such as the learning rate. A suitable learning rate lets the model fit efficiently without hurting accuracy: too large a learning rate makes the model converge too fast or even fail to converge, while too small a rate slows convergence and hurts training efficiency. In this invention the learning rate was adjusted repeatedly during training and finally set to 0.001, which meets the experimental requirements well. Fig. 6 shows the recognition effect (F1 value) of the algorithm at learning rates of 0.001 and 0.01, respectively.
To avoid overfitting during model training, the Dropout method is used; the parameter was adjusted repeatedly according to the recognition effect during training and finally set to 0.5, which gives good results. Fig. 7 shows the recognition effect (F1 value) of the algorithm with the dropout parameter set to 0.5 and 0.8, respectively.
The entity recognition model and fault text classification model parameter settings are shown in table 4.
Table 4 entity identification and fault classification parameter settings
The evaluation indexes adopted for the experimental results are mainly precision (Precision), recall (Recall) and the F1 value. Precision is the ratio of correctly recognized entities to all recognized entities and reflects the accuracy of entity recognition; recall is the ratio of correctly recognized entities to the total number of entities and reflects the completeness of entity recognition. The F1 value is the harmonic mean of precision and recall and serves as a comprehensive index of both. The formulas are:
Precision = TP / (TP + FP)    (10)

Recall = TP / (TP + FN)    (11)

F1 = 2 × Precision × Recall / (Precision + Recall)    (12)
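The three metrics can be computed over gold and predicted entity sets as follows (a generic sketch, not tied to the invention's evaluation code):

```python
def prf1(gold, predicted):
    """Precision, recall and F1 over sets of gold vs. predicted entities."""
    tp = len(gold & predicted)                       # correctly recognized entities
    p = tp / len(predicted) if predicted else 0.0    # precision: correct / recognized
    r = tp / len(gold) if gold else 0.0              # recall: correct / total
    f1 = 2 * p * r / (p + r) if p + r else 0.0       # harmonic mean of p and r
    return p, r, f1
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two values, which is why it is used as the comprehensive index here.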
(1) Results and analysis
The invention takes the word vectors output by the BERT pre-training model as input to BiLSTM-CRF to build the entity recognition model and compares the entity recognition effect of different pre-trained word vectors. A dedicated word segmenter is trained according to the text characteristics of the high-speed train bogie fault domain and combined with word2vec+TextCNN to build the fault text classification model, whose classification effect is compared with that of a classification model based on conventional jieba segmentation from common NLP practice. The results are shown in Table 5 and Table 6.
Table 5 comparison of model results
Table 6 comparison of model results
In the comparison of named-entity recognition results, recognition based on BERT pre-trained word vectors outperforms recognition based on the CBOW word-vector model. The precision obtained for subsystem entities is lower because their labeled data are scarce, but their recall is higher. After analysis and comparison, the BERT+BiLSTM-CRF model is adopted as the final model for named entity recognition.
Comparing the classification effect of the two classification models shows that the model based on the domain word-segmentation model outperforms the one based on jieba segmentation. This is because, in the domain-segmenter-based model, domain terms are recognized better during segmentation and word-frequency counting; the domain dictionary built from them therefore carries more domain-related semantic features, which yields better classification results. After analysis and comparison, the invention adopts the word2vec+TextCNN model based on the domain word-segmentation model as the final fault classification model.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that the above-mentioned preferred embodiment should not be construed as limiting the invention, and the scope of the invention should be defined by the appended claims. It will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the spirit and scope of the invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (10)

1. The knowledge extraction method in the complex equipment knowledge graph construction process is characterized by comprising the following steps of:
performing data mining to obtain knowledge data types, and performing knowledge extraction based on the data types;
if the data type is structured data, retrieving the data correspondingly stored in related database tables of the relational database to obtain corresponding first knowledge data;
if the data type is semi-structured data, extracting corresponding second knowledge data through a network data extraction template and a non-network data extraction template;
and if the data type is unstructured data, marking, entity identification, text classification and relation extraction are carried out on the data through a deep learning model, so that corresponding third knowledge data is obtained.
2. The method according to claim 1, wherein the data mining is specifically:
classifying according to the types of the data of each stage of the whole life cycle of the product;
the data of each stage is divided into structured data, semi-structured data and unstructured data according to data structuring.
3. The method according to claim 1, wherein the data stored in correspondence with the related data form in the relational database is specifically:
the D2R tool is used for conversion, the table names of the database are directly mapped to the classes in the resource description framework, the fields are mapped to the attributes of the classes, and the relations among the classes are obtained from the table representing the relations.
4. The method according to claim 1, wherein extracting the corresponding knowledge data through the network data extraction template and the non-network data extraction template is specifically:
inputting the network data into a wrapper, and extracting knowledge data in the network data; the wrapper is encapsulated by a markup language and a path language;
and inputting the non-network data into a template library, and performing template matching to extract knowledge data in the non-network data.
5. The method of claim 4, wherein the template library is a structured template library created by generalizing and summarizing the structure of tables or process data in the complex equipment data.
6. The method of claim 4, further comprising inputting plain text data in the second knowledge data obtained after the extraction of the network data and the non-network data into an unstructured text knowledge extraction model for entity extraction.
7. The method according to claim 1, wherein the labeling of the data by the deep learning model is specifically:
and labeling the entities, the relations, the sequences and the fault types of the data based on the text labeling tool to obtain labeled training data.
8. The method according to claim 7, wherein the entity identification is in particular:
inputting the labeled training data into a model, and training by using a BERT model to obtain word vectors containing semantic information;
inputting the word vector into a bi-directional neural network;
and carrying out entity identification on the unlabeled test set data by using the deep learning model.
9. The method according to claim 8, wherein the text classification is specifically:
inputting the labeled training data into a word2vec model, training to obtain a word vector model, then taking the word vector model as the input of a textCNN model,
feature extraction of the convolution layer and pooling of the pooling layer are carried out;
and outputting the probability of each category through a fully connected softmax layer, and classifying the unlabeled test set by using the obtained model.
10. The method according to claim 1, wherein the relation extraction is specifically:
acquiring OWL files of the ontology;
analyzing to obtain (concept, relation, concept) triples;
acquiring a CSV file of an entity identification result;
reading the concept corresponding to the CSV header and the entities corresponding to the concept according to the triples; establishing (entity, relation, entity) triples and storing them into a database to complete relation extraction.
CN202211738751.4A 2022-12-31 2022-12-31 Knowledge extraction method in complex equipment knowledge graph construction process Pending CN116010619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211738751.4A CN116010619A (en) 2022-12-31 2022-12-31 Knowledge extraction method in complex equipment knowledge graph construction process


Publications (1)

Publication Number Publication Date
CN116010619A true CN116010619A (en) 2023-04-25

Family

ID=86026797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211738751.4A Pending CN116010619A (en) 2022-12-31 2022-12-31 Knowledge extraction method in complex equipment knowledge graph construction process

Country Status (1)

Country Link
CN (1) CN116010619A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116467468A (en) * 2023-05-05 2023-07-21 国网浙江省电力有限公司 Power management system abnormal information handling method based on knowledge graph technology
CN116467468B (en) * 2023-05-05 2024-01-05 国网浙江省电力有限公司 Power management system abnormal information handling method based on knowledge graph technology
CN116893924A (en) * 2023-09-11 2023-10-17 江西南昌济生制药有限责任公司 Equipment fault processing method, device, electronic equipment and storage medium
CN116893924B (en) * 2023-09-11 2023-12-01 江西南昌济生制药有限责任公司 Equipment fault processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112199511A (en) Cross-language multi-source vertical domain knowledge graph construction method
CN112732934B (en) Power grid equipment word segmentation dictionary and fault case library construction method
CN111737496A (en) Power equipment fault knowledge map construction method
CN110929030A (en) Text abstract and emotion classification combined training method
CN116010619A (en) Knowledge extraction method in complex equipment knowledge graph construction process
CN110298403B (en) Emotion analysis method and system for enterprise main body in financial news
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN109871955A (en) A kind of aviation safety accident causality abstracting method
CN115099338A (en) Power grid master equipment-oriented multi-source heterogeneous quality information fusion processing method and system
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN112463981A (en) Enterprise internal operation management risk identification and extraction method and system based on deep learning
CN112883286A (en) BERT-based method, equipment and medium for analyzing microblog emotion of new coronary pneumonia epidemic situation
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
Li et al. A method for resume information extraction using bert-bilstm-crf
CN115292490A (en) Analysis algorithm for policy interpretation semantics
CN115757695A (en) Log language model training method and system
CN111428502A (en) Named entity labeling method for military corpus
Qu et al. Knowledge-driven recognition methodology for electricity safety hazard scenarios
CN117474010A (en) Power grid language model-oriented power transmission and transformation equipment defect corpus construction method
CN112632978A (en) End-to-end-based substation multi-event relation extraction method
CN117094390A (en) Knowledge graph construction and intelligent search method oriented to ocean engineering field
CN117473054A (en) Knowledge graph-based general intelligent question-answering method and device
CN113361259B (en) Service flow extraction method
CN115858807A (en) Question-answering system based on aviation equipment fault knowledge map
CN115098687A (en) Alarm checking method and device for scheduling operation of electric power SDH optical transmission system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination