CN116010619A - Knowledge extraction method in complex equipment knowledge graph construction process - Google Patents


Publication number
CN116010619A
Authority
CN
China
Prior art keywords
data
knowledge
extraction
model
entity
Prior art date
Legal status
Pending
Application number
CN202211738751.4A
Other languages
Chinese (zh)
Inventor
Zhang Haizhu (张海柱)
Ding Guofu (丁国富)
Wang Shuying (王淑营)
Zheng Qing (郑庆)
Ma Zili (马自力)
Current Assignee
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202211738751.4A
Publication of CN116010619A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a knowledge extraction method for the process of constructing a knowledge graph of complex equipment. By classifying data into different types and extracting each type with a dedicated method, the method greatly reduces the time spent on knowledge extraction. For unstructured data, the most difficult to process, several deep learning models and neural networks are employed: using several different training sets, massive heterogeneous data are labeled, faults are classified, entities are recognized and relations are extracted. Knowledge extraction from unstructured data thereby becomes more convenient and more accurate, and queries on the fault-maintenance knowledge of complex equipment become more comprehensive and convenient.

Description

Knowledge extraction method in complex equipment knowledge graph construction process
Technical Field
The invention belongs to the technical field of complex equipment design, and particularly relates to a knowledge extraction method in a complex equipment knowledge graph construction process.
Background
Products in fields such as electric power, engineering machinery and rail transit are typical complex equipment: they must serve safely and reliably over a long life cycle under complex working conditions, and their research and development is a complex systems-engineering task. Each stage of the development period generates various data such as models, business records and sensor perceptions, and these data grow explosively as the manufacturing and operation-and-maintenance processes are continuously sensed and analyzed. A digital twin of the equipment is data-driven, so fusing multi-source data is the first problem to be solved; a purely data-driven analysis, however, ignores experiential knowledge. How to form the "brain" of complex equipment, fuse multi-source data and mine the latent relations within it is critical to constructing the digital twin of complex equipment.
As an important branch of the new generation of artificial-intelligence technology, the knowledge graph, known as "learned AI", is often used to fuse multi-source data into large-scale knowledge bases. A knowledge graph is a semantic network that expresses semantic relations among entities; it is well suited to describing the complex spatial structure of the multi-level systems, subsystems, components and parts of complex equipment, and by accessing, analyzing and mining multi-source heterogeneous time-series sensor data it can yield more comprehensive spatio-temporal correlation knowledge.
However, the structure of a complex equipment system is multi-layered; at the same time, the data of a complex product come from different sources, in different structural forms, at each stage of the full life cycle, so the data are multi-source and heterogeneous. Knowledge data of complex products at the various stages of the full life cycle include documents, pictures, tables, models, files and so on: not only structured data but also semi-structured and unstructured data. Such data originate not only from the stages of the full life cycle but also from the various levels (systems, subsystems, components, parts) of the complex equipment system architecture. Knowledge extraction is therefore very difficult: the amount of data to process is large, and omissions occur easily.
Disclosure of Invention
In view of the above, the invention provides a method that classifies data accurately, has a high degree of automation, can extract knowledge from the massive heterogeneous data of each stage of the full life cycle, and lays a data foundation for building a knowledge graph of complex equipment.
A knowledge data extraction method in a complex equipment knowledge graph construction process comprises the following steps:
performing data mining to obtain knowledge data types, and performing knowledge extraction based on the data types;
if the data type is structured data, retrieving the data stored in the corresponding tables of the relational database to obtain corresponding first knowledge data;
if the data type is semi-structured data, extracting corresponding second knowledge data through a network data extraction template and a non-network data extraction template;
and if the data type is unstructured data, performing labeling, entity recognition, text classification and relation extraction on the data through deep learning models, so as to obtain corresponding third knowledge data.
Preferably, the data mining specifically includes:
classifying according to the types of the data of each stage of the whole life cycle of the product;
the data of each stage is divided into structured data, semi-structured data and unstructured data according to data structuring.
Preferably, retrieving the data stored in the tables of the relational database is specifically:
using a D2R tool for conversion, mapping the table names of the database directly to classes in the Resource Description Framework, mapping the fields to attributes of those classes, and deriving the relations among the classes from the tables that represent relations.
Preferably, the extracting the corresponding knowledge data through the network data extraction template and the non-network data extraction template specifically includes:
inputting the network data into a wrapper, and extracting knowledge data in the network data; the wrapper is encapsulated by a markup language and a path language;
and inputting the non-network data into a template library, and performing template matching to extract knowledge data in the non-network data.
Preferably, the template library is a structured template library created by generalizing the structure of the tables or process data in the complex equipment data.
Preferably, the method further comprises the step of inputting plain text data in second knowledge data obtained after extraction of the network data and the non-network data into an unstructured text knowledge extraction model for entity extraction.
Preferably, labeling the data through the deep learning model specifically includes:
and labeling the entity, the relation, the sequence and the fault type of the data through a text labeling tool to obtain labeled training data.
Preferably, the entity identification is specifically:
inputting the labeled training data into a model, and training by using a BERT model to obtain word vectors containing semantic information;
inputting the word vector into a bi-directional neural network;
and carrying out entity identification on the unlabeled test set data by using the deep learning model.
Preferably, the text classification specifically includes:
inputting the labeled training data into a word2vec model and training it to obtain a word-vector model, then using the word vectors as the input of a textCNN model;
performing feature extraction in the convolution layer and pooling in the pooling layer;
and outputting the probability of each category through a fully connected layer and softmax, and classifying the unlabeled test set with the resulting model.
Preferably, the relation extraction is specifically:
acquiring OWL files of the ontology;
analyzing to obtain concept, relation and concept triples;
acquiring a CSV file of an entity identification result;
reading the concept corresponding to the CSV header according to the concept, the relation and the concept triplet;
reading an entity corresponding to the concept;
and establishing entity, relation and entity triples, storing the entity, relation and entity triples into a database, and completing relation extraction.
The invention provides a knowledge data extraction method in the process of constructing a knowledge graph of complex equipment, comprising the following steps: performing data mining to obtain the knowledge data types, and performing knowledge extraction based on those types; if the data type is structured data, retrieving the data stored in the corresponding tables of the relational database to obtain corresponding first knowledge data; if the data type is semi-structured data, extracting corresponding second knowledge data through a network data extraction template and a non-network data extraction template; and if the data type is unstructured data, performing labeling, entity recognition, text classification and relation extraction on the data through deep learning models to obtain corresponding third knowledge data. By classifying data into different types and extracting each type with a dedicated method, the provided method greatly reduces the time spent on knowledge extraction. For unstructured data, the most difficult to process, several deep learning models and neural networks are employed: using several different training sets, the massive heterogeneous data are labeled, faults classified, entities recognized and relations extracted, making knowledge extraction from unstructured data more convenient and accurate and making queries on the fault-maintenance knowledge of complex equipment more comprehensive and convenient.
Drawings
FIG. 1 is a flow chart of a knowledge extraction process disclosed in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data annotation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an entity recognition model framework according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of BILSTM layer in the entity recognition model according to the embodiment of the present invention;
FIG. 5 is a schematic diagram of a classification training model according to an embodiment of the present invention;
FIG. 6 is a graph showing the effect of entity recognition under different learning rates of the recognition model according to the embodiment of the present invention;
FIG. 7 is a graph showing the entity recognition effect of the recognition model under different dropout parameters according to the embodiment of the present invention;
fig. 8 is a schematic diagram of a relationship extraction flow disclosed in an embodiment of the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the following specific embodiments.
The invention solves the problems in the prior art by providing a knowledge data extraction method for the process of constructing a knowledge graph of complex equipment, comprising the following steps: performing data mining to obtain the knowledge data types, and performing knowledge extraction based on those types; if the data type is structured data, retrieving the data stored in the corresponding tables of the relational database to obtain corresponding first knowledge data; if the data type is semi-structured data, extracting corresponding second knowledge data through a network data extraction template and a non-network data extraction template; and if the data type is unstructured data, performing labeling, entity recognition, text classification and relation extraction on the data through deep learning models to obtain corresponding third knowledge data.
The structure of a complex equipment system is multi-layered; at the same time, complex products draw on different data sources, in different structural forms, at each stage of the full life cycle, so the data are multi-source and heterogeneous. Knowledge data of complex products at the various stages of the full life cycle include documents, pictures, tables, models, files and so on: not only structured data but also semi-structured and unstructured data. Such data originate not only from the stages of the full life cycle but also from the various levels (systems, subsystems, components, parts) of the complex equipment system architecture. Before knowledge extraction, the data sources are therefore first classified and identified.
the invention aims at a multi-stage multi-source heterogeneous knowledge data research structuring, semi-structuring and unstructured knowledge extraction method of complex equipment, wherein structured data is preferably derived from table storage data in a database, semi-structured data is derived from tables, web pages and the like, and unstructured data is derived from various knowledge data in a text form. A complex equipment multi-source heterogeneous data knowledge extraction framework is shown in fig. 1. The knowledge extraction overall framework is divided into two parts: and (5) multi-source heterogeneous data mining and knowledge extraction models.
According to the invention, the multi-source heterogeneous data mining is specifically: classifying according to the types of the data of each stage in the whole life cycle of the product, and dividing the data of each stage into structured data, semi-structured data and unstructured data according to the data structure.
Different knowledge extraction methods are adopted for different data types. If the data type is structured data, a D2R tool is preferably used for conversion; D2R (Database to RDF) is one way of extracting knowledge from a relational database. Database table names are mapped directly to classes in RDF (the Resource Description Framework), and fields are mapped to attributes of those classes. The relations between classes can be derived from the tables that represent relations; the use of the tool can be appreciated by those skilled in the art.
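The D2R-style mapping described above (table name to class, fields to properties) can be sketched without any RDF library; the `ex:` namespace, table name and row contents below are invented for illustration, not taken from the patent:

```python
# Minimal sketch of the D2R idea: a relational table maps to an RDF class,
# its columns map to properties, and each row becomes a set of triples.
def table_to_triples(table_name, rows, namespace="ex:"):
    cls = namespace + table_name.capitalize()  # table name -> RDF class
    triples = []
    for row in rows:
        subject = f"{namespace}{table_name}/{row['id']}"
        triples.append((subject, "rdf:type", cls))
        for field, value in row.items():       # fields -> class attributes
            if field != "id":
                triples.append((subject, namespace + field, value))
    return triples

rows = [{"id": 1, "name": "bogie frame", "material": "steel"}]
triples = table_to_triples("component", rows)
for t in triples:
    print(t)
```

A real D2R deployment would additionally read foreign-key tables to emit relations between classes, as the description notes.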
Apart from structured data, most knowledge data of complex equipment come from unstructured and semi-structured text: technical standards, fault text records and the like are unstructured text data, while process routes, manufacturing processes, web-page data and the like are semi-structured text data. According to structure and characteristics, semi-structured text data can be divided into three types: web pages, scientific and technical literature, and other types. Web-page data and scientific and technical literature exist in web-page form, while the other types, such as data stored in tables or in text with a certain structure within complex equipment data, have certain structural characteristics: their content follows a text format that can be generalized into a set of terms, combining fixed format with free text. What these kinds of knowledge have in common is a certain framework, with the content stored as text. For data with such structure, the invention acquires knowledge points by template matching, while for text requiring further entity extraction it applies a deep learning method. The flow framework of the method is shown in figs. 2-6. The process comprises two parts, a web-page data extraction template and a non-web-page data extraction template; plain-text data that still needs knowledge extraction is input into the unstructured-text knowledge extraction model for entity extraction.
According to the invention, the extraction of the corresponding knowledge data through the network data extraction template and the non-network data extraction template is specifically as follows: inputting the network data into a wrapper, and extracting knowledge data in the network data; the wrapper is encapsulated by a markup language and a path language; and inputting the non-network data into a template library, and performing template matching to extract knowledge data in the non-network data. Further the template library shape is a structured template library created by generalized total structuring of tables or process data in complex equipment data.
If further extraction is required, the plain-text data in the second knowledge data obtained after extraction of the network data and the non-network data is input into the unstructured-text knowledge extraction model for entity extraction.
More preferably, when selecting the extraction template, the method analyzes the specific situation of the obtained semi-structured data. For example, unstructured data in web pages exists in forms such as HTML and XML; since these are built with markup languages, which are regular in structure, XPath and CSS-selector expressions can be written, in the manner of regular expressions, to extract elements from the page. These expressions are generalized and encapsulated into a wrapper. XPath (XML Path Language) is a language for addressing parts of an XML document; it is based on the tree structure of XML and provides the ability to find nodes in the document tree. The original purpose of XPath was to serve as a common syntax model between XPointer and XSL, but it was quickly adopted by developers as a small query language, and the required knowledge data are parsed from the web page by the XPath method.
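As a rough illustration of wrapper-style extraction, the limited XPath subset of Python's standard-library `ElementTree` can stand in for a full XPath engine (a real crawler would use something like lxml); the page fragment and its class names below are invented:

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment standing in for a crawled web page.
page = """<html><body>
<div class="fault"><span>gearbox oil leak</span></div>
<div class="fault"><span>axle box temperature rise</span></div>
</body></html>"""

root = ET.fromstring(page)
# XPath-like expression: every <span> under a <div class="fault">.
faults = [span.text for span in root.findall(".//div[@class='fault']/span")]
print(faults)
```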
The non-web-page data extraction template targets data stored with a certain structure in complex equipment data. Such data carries structural information, which is generalized into a structured template library; knowledge is then obtained by matching templates from the library, written as regular expressions, against the data.
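A small sketch of such a template library: each entry pairs a structural template (a regular expression) with the knowledge slots it captures. The "process step" pattern and the process-route text are invented examples, not templates from the patent:

```python
import re

# Template library: name -> compiled regular expression with named slots.
templates = {
    "process_step": re.compile(r"step\s*(?P<no>\d+)[::]\s*(?P<operation>\S+)"),
}

text = "step 1: milling step 2: drilling step 3: grinding"
# Template matching: every occurrence yields one knowledge point.
steps = [m.groupdict() for m in templates["process_step"].finditer(text)]
print(steps)
```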
If the data type is unstructured data, the knowledge extraction process is relatively more complex and more cases must be considered. Entity extraction, also called named entity recognition, aims to extract entity information elements from text and is the basis for solving many NLP tasks. Entity extraction methods include rule-based, statistical-learning-based and deep-learning-based methods. The invention performs entity extraction from the fault text data of the operation-and-maintenance stage based on a BERT+BiLSTM-CRF deep learning model.
Labeling the data for the deep learning model specifically comprises the following steps:
and labeling the entity, the relation, the sequence and the fault type of the data through a text labeling tool to obtain labeled training data.
Preferably, the entity identification is specifically:
inputting the labeled training data into a model, and training by using a BERT model to obtain word vectors containing semantic information;
inputting the word vector into a bi-directional neural network;
and carrying out entity identification on the unlabeled test set data by using the deep learning model.
In the fault stage, different failure modes call for different maintenance methods and strategies, while the same failure mode usually shares the same ones. To acquire faults more accurately, let faults guide maintenance, and classify failure modes from fault text data, the invention performs text classification with a word2vec+textCNN deep learning model. The method is specifically: inputting the labeled training data into a word2vec model and training it to obtain a word-vector model, then using the word vectors as the input of a textCNN model; performing feature extraction in the convolution layer and pooling in the pooling layer; and outputting the probability of each category through a fully connected layer and softmax, then classifying the unlabeled test set with the resulting model.
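The word2vec+textCNN pipeline just described can be sketched, dependency-free, with scalar stand-ins: a toy word-vector table replaces the trained word2vec model, one convolution filter slides over word windows, max-over-time pooling follows, and a fully connected softmax outputs class probabilities. All vectors and weights are random placeholders, not trained values:

```python
import math
import random

random.seed(0)
DIM, CLASSES, KERNEL = 4, 3, 2  # embedding dim, fault classes, filter width

# word2vec stand-in: random word vectors for a tiny vocabulary.
vocab = {w: [random.uniform(-1, 1) for _ in range(DIM)]
         for w in ["gearbox", "oil", "leak", "crack"]}
conv_w = [random.uniform(-1, 1) for _ in range(DIM * KERNEL)]  # one filter
fc_w = [random.uniform(-1, 1) for _ in range(CLASSES)]         # FC layer

def classify(tokens):
    vecs = [vocab[t] for t in tokens]
    feats = []
    for i in range(len(vecs) - KERNEL + 1):       # convolution over windows
        window = [x for v in vecs[i:i + KERNEL] for x in v]
        feats.append(sum(w * x for w, x in zip(conv_w, window)))
    pooled = max(feats)                           # max-over-time pooling
    logits = [w * pooled for w in fc_w]           # fully connected layer
    exps = [math.exp(z) for z in logits]          # softmax
    total = sum(exps)
    return [e / total for e in exps]

probs = classify(["gearbox", "oil", "leak"])
print(probs)
```

A trained textCNN would use many filters of several widths and learned weights; the arithmetic per filter is exactly as above.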
After the fault text data have been classified by failure mode, each failure mode can be regarded as an entity or a concept, and the recognized entities and the failure-mode classes can then be connected through relations. Entity recognition, fault classification and relation extraction must all be completed to solve the prior-art problems.
Relationship extraction is one of important subtasks of knowledge extraction, and is oriented to unstructured text data, and relationship extraction is to extract semantic relationships between two or more entities from text. The relationship extraction is closely related to the entity extraction of unstructured text data, and the possible relationship between entities is extracted after the entities in the text are identified.
At present, relation extraction methods can be classified into template-based methods, supervised-learning methods and weakly-supervised-learning methods. The invention uses an ontology to formally express the concepts and semantic relations of the schema layer.
The ontology is the basis of the knowledge graph's semantic model; it normalizes and constrains the data filling of the knowledge graph at the semantic level, so that the knowledge graph can be constructed and supported at scale. By reading the OWL file (ontology model file) of the ontology, (concept, relationship, concept) triples are parsed out; meanwhile, each entity recognition result is of the form (concept, entity). By matching the entity concepts in the recognition results against the parsed concepts, relations are established and relation acquisition is realized.
the knowledge graph serves as a semantic network and can establish the connection among a plurality of entities, wherein the semantic relationship plays an important role. In the construction of the knowledge graph data pattern layer, semantic relations such as "belongs to", "occurs", "phenomenon", "failure source", "failure time", "travel kilometers" and the like are defined.
The complex-equipment knowledge-graph relation extraction method is realized on the basis of named entity recognition and ontology modeling, with the flow shown in fig. 8. First the ontology model file and the entities obtained by entity recognition are acquired; then the corresponding entities are retrieved according to the concepts in the ontology; finally, a relation between two entities is established according to the relation between their two concepts in the ontology, connecting entities that were originally recognized separately and associating entities classified from different texts.
The relation extraction specifically comprises the following steps: acquiring the OWL file of the ontology; parsing it to obtain (concept, relation, concept) triples; acquiring the CSV file of the entity recognition result; reading the concepts corresponding to the CSV headers according to the (concept, relation, concept) triples; reading the entities corresponding to each concept; and establishing (entity, relation, entity) triples and storing them into the database, completing the relation extraction.
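These steps condense to a few lines once the OWL file has been parsed; below, a hard-coded schema triple stands in for the parsed ontology and an in-memory CSV stands in for the entity-recognition result whose headers are concepts and whose cells are entities. All concept, relation and entity names are invented (modeled on the document's "failure source"/"failure phenomenon" example):

```python
import csv
import io

# Stand-in for (concept, relation, concept) triples parsed from the OWL file.
schema = [("FailureSource", "occurs", "FailurePhenomenon")]

# Stand-in for the entity-recognition CSV: headers are concepts,
# cells are the recognized entities.
csv_text = "FailureSource,FailurePhenomenon\n01 car gearbox,oil leak\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

triples = []
for head, rel, tail in schema:
    for row in rows:
        if row.get(head) and row.get(tail):
            # (entity, relation, entity) triple, ready to store.
            triples.append((row[head], rel, row[tail]))
print(triples)
```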
The following explains the specific embodiment of the present invention by taking the knowledge extraction process at the operation and maintenance stage of the bogie of the high-speed train as an example.
Data acquisition: the data sources of the knowledge graph constructed by the invention comprise fault text data and Baidu Baike encyclopedia, thesis and patent data about the operation-and-maintenance stage of the high-speed train bogie. Knowledge crawling takes the multi-level structural names of the high-speed train's bogie system, subsystems, components and parts as keywords; the semi-structured web-page data are crawled with the scrapy framework and stored in a MySQL relational database.
For the full texts of journal papers and patents, full-text index links are stored. The fault text data exists in text form and requires NLP (natural language processing) technology for knowledge extraction.
For unstructured data, the invention uses NLP technology to accomplish two tasks, entity recognition and text classification. Since the models are supervised deep learning models, a large amount of labeled data is needed. For the entity recognition model, the brat labeling tool is used to label entities and relations; the label codes defined for entities are shown in Table 1. Sequence labeling is performed in the BIO mode.
Table 1 entity annotation coding
(Table 1 is reproduced as an image in the original document.)
The brat labeling tool is a text annotation tool that runs on a web server under Linux. It can annotate entities, relations, events and attributes; it requires no installation, supports multi-person collaboration and offers a visual operating interface, for which it has received wide attention. Because the data-labeling workload was huge, the entities and relations were labeled collaboratively by several people; part of the labeling result is shown in fig. 2.
BIO sequence labeling: sequence labeling is one of the basic problems often encountered in NLP. In sequence labeling, each element of a sequence is assigned a label; generally the sequence is a sentence and the elements are the words in it. Information extraction, such as extracting a meeting's time and place, can for example be treated as a sequence labeling problem. BIO labeling tags each element as "B-X", "I-X" or "O": "B-X" means the element's fragment is of type X and the element is at the beginning of the fragment, "I-X" means the fragment is of type X and the element is in the middle of the fragment, and "O" means the element belongs to no type.
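The BIO scheme can be shown in a few lines: labeled entity spans become B-X/I-X tags and everything else gets O. The character-level tokens (as in Chinese NER), the example span, and the `mal` type (following the document's code for "failure phenomenon") are all hypothetical here:

```python
def bio_tag(tokens, spans):
    """spans: list of (start, end, type) with end exclusive."""
    tags = ["O"] * len(tokens)          # default: outside any entity
    for start, end, etype in spans:
        tags[start] = "B-" + etype      # beginning of the fragment
        for i in range(start + 1, end):
            tags[i] = "I-" + etype      # inside the fragment
    return tags

tokens = list("gearbox oil leak")       # character-level, as in Chinese NER
# hypothetical span: characters 8-16 ("oil leak") as a failure phenomenon
tags = bio_tag(tokens, [(8, 16, "mal")])
print(tags[8], tags[9], tags[0])
```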
Fault type labeling: the invention sorted the fault text, studied the related data and, drawing on expert experience, organized 14 failure modes; the fault texts were labeled accordingly, and part of the labeling result is shown in Table 2. The failure modes include: deformation, fracture, bulge, lateral acceleration overrun, cracking, oil leakage, oil seepage, loosening, damage, temperature rise, foreign matter and axle temperature measurement failure.
Table 2 failure mode labels
(Table 2 is reproduced as an image in the original document.)
After labeling, named entity recognition is performed on the data.
The traditional word2vec language model cannot resolve polysemy. To address the polysemy that named entities may exhibit, text features are extracted with the BERT language model to obtain a character-granularity vector matrix, context information is then extracted with BiLSTM, and the globally optimal label sequence is obtained in combination with a CRF model. The model framework is shown in fig. 3.
The model mainly comprises three layers: a BERT pre-trained language model layer, a bidirectional LSTM layer and a CRF layer.
(1) BERT model
BERT is a pre-trained language representation model for natural language processing. BERT takes the word vectors obtained by initializing the input text as a sequence X = (x_1, x_2, x_3, ..., x_n); the resulting character vectors exploit the interrelations among words to extract the features in the text effectively. BERT pre-trains with a self-attention-based architecture, fusing the left and right context on the basis of all layers to pre-train deep bidirectional representations. Compared with earlier pre-training models, its text vectors capture contextual information in the true sense and can learn the relations between consecutive text fragments.
(2) BILSTM layer
This layer automatically extracts sentence features; its specific structure is shown in fig. 4. The n-dimensional word vectors obtained from the BERT layer are fed as the input of each time step of the bidirectional long short-term memory network, and the output is a score for every label of every word of the sentence. The recurrent neural network (RNN) performs well on text, but its weak access to context information causes the gradient-vanishing problem. The long short-term memory network (LSTM) solves this by introducing a memory cell and a gate mechanism: the gates update the information in the memory cell, and the memory cell stores historical information.
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)  (1)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)  (2)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)  (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g(W_c x_t + U_c h_{t-1} + b_c)  (4)
h_t = o_t ⊙ tanh(c_t)  (5)
where g is an activation function; i, f and o denote the input gate, the forget gate and the output gate; and c is the memory cell. A unidirectional LSTM, however, shares a drawback with the RNN: information still propagates in one direction only, so forward and backward features cannot be considered at the same time and context information is partly ignored. The bidirectional LSTM (BiLSTM) addresses this; its main idea is to also feed the observed sequence into an LSTM in reverse, which amounts to introducing two LSTM networks, applying one forward LSTM and one backward LSTM to each sequence, and combining the two outputs into the final result. The forward LSTM network receives the inputs x_1 through x_t in temporal order and produces the forward hidden-state sequence h→_t; the backward LSTM network receives the inputs x_t through x_1 and produces the backward hidden-state sequence h←_t.
The two state sequences are concatenated into a state sequence h_t = [h→_t ; h←_t], which is then mapped to a k-dimensional vector, where k is the number of labels in the tag set. After normalization this yields a vector of k predicted values, each representing the probability that the current word carries a particular label.
The bidirectional LSTM thus combines the forward and backward features into a single bidirectional representation; compared with a unidirectional LSTM, these additional features improve labeling accuracy. It captures context information efficiently and also has a marked advantage in fitting nonlinearities.
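A minimal sketch of the two-pass idea, assuming a simplified scalar `step` function in place of a full LSTM cell; the `softmax` helper shows the normalization over the k label scores mentioned above.

```python
import math

def softmax(scores):
    """Normalize k label scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def bilstm_states(seq, step):
    """Pair each position's forward hidden state with its backward one.
    `step(x, h) -> h'` is a toy stand-in for a full LSTM cell update."""
    fwd, h = [], 0.0
    for x in seq:                 # x_1 ... x_t, forward pass
        h = step(x, h)
        fwd.append(h)
    bwd, h = [], 0.0
    for x in reversed(seq):       # x_t ... x_1, backward pass
        h = step(x, h)
        bwd.append(h)
    bwd.reverse()                 # realign with forward positions
    return list(zip(fwd, bwd))    # concatenated state h_t = [forward_t; backward_t]
```

Each position's pair carries features from both directions, which is exactly what the unidirectional LSTM cannot provide.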
(3) CRF layer
Entity recognition starts by part-of-speech tagging the corpus and marking word boundaries, after which the acquired language data are further annotated with the BIO tagging scheme, where "B" marks the beginning of a noun phrase, "I" its interior, and "O" a token outside any noun phrase. Taking the sentence "remote supervision finds 01 car gearbox oil leak" as an example, Table 3 shows the labeling process, where "source", "BJ" and "mal" denote the entity types "failure source", "component" and "failure phenomenon", respectively.
TABLE 3 BIO sequence tags
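The BIO labeling procedure can be sketched as follows; the token spans and the type names `source`, `BJ`, `mal` mirror the example of Table 3, and the helper itself is illustrative rather than the invention's annotation tool.

```python
def bio_tags(tokens, entities):
    """Label a token list with the BIO scheme.
    `entities` holds (start, end, type) token spans, end exclusive."""
    tags = ["O"] * len(tokens)                 # default: outside any entity
    for start, end, etype in entities:
        tags[start] = "B-" + etype             # beginning of the phrase
        for i in range(start + 1, end):
            tags[i] = "I-" + etype             # interior of the phrase
    return tags
```

Applied to the example sentence, the spans for "remote supervision", "gearbox" and "oil leak" yield the B-/I-/O pattern of Table 3.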
The CRF layer completes the sequence labeling of the sentence. The BiLSTM layer outputs a predicted score for each label; this layer takes those scores as input and predicts the labeling sequence with the maximum probability. For an observation sequence X, the score of its predicted sequence y = (y_1, y_2, ..., y_n) is defined as:
s(X, y) = Σ_(i=0..n) A_(y_i, y_(i+1)) + Σ_(i=1..n) P_(i, y_i)    (6)
where P is the score matrix output by the bidirectional LSTM layer, P_(i,j) being the score of the j-th label for the i-th word of the sentence, and A is the transition-probability matrix, A_(i,j) being the transition probability from label i to label j.
For each training sample, the scores of all possible labeling sequences can be derived and normalized with the following equation:
p(y|X) = exp(s(X, y)) / Σ_(ỹ∈Y_X) exp(s(X, ỹ))    (7)
The parameters of the model are continually adjusted to maximize the following log-likelihood function.
log p(y|X) = s(X, y) − log Σ_(ỹ∈Y_X) exp(s(X, ỹ))    (8)
After model training is completed, the labeling sequence that maximizes the following function is taken as the optimal labeling sequence.
y* = argmax_(ỹ∈Y_X) s(X, ỹ)    (9)
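A toy sketch of the CRF scoring and decoding described above, assuming small emission and transition matrices `P` and `A`; a practical implementation would use Viterbi decoding rather than this brute-force argmax.

```python
from itertools import product

def crf_score(P, A, y):
    """Score s(X, y): emission scores P[i][y_i] for each position plus
    transition scores A[y_i][y_{i+1}] between consecutive labels."""
    emit = sum(P[i][tag] for i, tag in enumerate(y))
    trans = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def best_sequence(P, A):
    """Optimal labeling sequence by exhaustive search over all k^n sequences."""
    n, k = len(P), len(P[0])
    return max(product(range(k), repeat=n), key=lambda y: crf_score(P, A, y))
```

With zero transition scores the best sequence just follows the per-word emission maxima; a strong transition score can override them, which is exactly how the CRF enforces valid label orderings.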
This layer is not strictly necessary for named entity recognition, but the CRF guarantees valid label transitions and yields the globally optimal labeling sequence, thereby improving recognition accuracy.
Model training: the invention inputs the labeled training data into the model, trains the BERT model to obtain word-vector representations containing semantic information, feeds the word vectors output by BERT into the bidirectional neural network, and selects suitable model parameters during training according to given evaluation criteria. Finally, the selected model performs entity recognition on the unlabeled test-set data.
After entity recognition, the fault types are classified and the knowledge graph of the high-speed train bogie fault domain is constructed, which not only provides knowledge support for fault queries but also guidance for fault maintenance. For the same fault, the fault phenomena may differ while the maintenance method is the same: for example, the phenomena "gearbox oil leakage" and "damper oil leakage" share the maintenance method "procedure 1: flaw detection of the gearbox", and both phenomena belong to the same fault mode "oil leak". To better guide maintenance, the fault text is classified by fault mode: a classification model based on the deep learning combination word2vec+TextCNN is trained and classifies the input text by fault mode. The model framework is shown in fig. 5.
The model mainly has two layers, namely a word embedding (embedding) layer and a TextCNN (convolutional neural network) layer.
Word embedding (embedding) layer
When training with neural networks, text must first be converted into a form the model can understand. This layer represents the text as low-dimensional, dense word vectors; such vectors carry semantic information and contain features useful for classification. The Word2vec model is constructed as follows:
1) Word segmentation and stop-word removal: sentences are segmented with a word segmenter for the high-speed train bogie domain, obtained by training a segmenter based on BiLSTM-CRF-TextCNN. The training process and effect of this segmenter are described in detail in the third section.
2) Constructing a dictionary: the whole text is traversed and the occurrence frequency of every word in the text is counted.
3) Constructing a Huffman tree: a Huffman tree is built according to the occurrence probabilities of the words, with all classes located at the leaf nodes of the tree.
4) Training word vectors: binary codes are generated for the leaf nodes of the Huffman tree, and the word vectors in each non-leaf node and leaf node are initialized and then trained.
5) Testing the generated model.
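Step 3) can be sketched with a standard heap-based Huffman construction; the `huffman_codes` helper and its tie-breaking counter are illustrative assumptions, not the patent's word2vec code.

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Build a Huffman tree from word frequencies and return the binary
    code of each word; words sit on the leaves, as described in step 3)."""
    tick = count()                       # tie-breaker so the heap never compares nodes
    heap = [(f, next(tick), w) for w, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                 # repeatedly merge the two rarest nodes
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: recurse into both subtrees
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: record the word's binary code
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes
```

Frequent words get short codes, which is what makes the hierarchical-softmax training in step 4) efficient.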
The textCNN (convolutional neural network) layer takes the word vectors output by the embedding layer as model input, extracts features with convolution kernels of different sizes, fixes the vector dimensionality with a pooling layer, and finally outputs the probability of each category through a fully connected softmax layer.
1) Feature extraction: features are extracted from the data with different convolution kernels; the sliding windows, i.e. the convolution kernels, have sizes 3, 4 and 5.
2) Pooling layer: after feature extraction, a max-pooling layer fixes the dimensionality of the input vector for convenient computation.
3) Fully connected layer: after feature extraction and max pooling, the probability of each category is output through a fully connected softmax layer.
Model training: the labeled training data are input into the model; a word2vec model is trained to obtain word vectors, which are fed into the textCNN for convolution-layer feature extraction and pooling; finally the trained model outputs the probability of each class through Softmax (Softmax does not single out one hard maximum but assigns each output class a probability value expressing how likely the sample belongs to it). The classification model is likewise a neural network whose parameters must be tuned during training according to the corresponding evaluation criteria.
Specific effect experiment and data analysis:
the experiments of the invention mainly use the TensorFlow deep learning framework. First, the model parameters are set: the number of hidden units of the bidirectional LSTM layer is set to 100 per direction, so the concatenated output vector of the model has dimension 200. The Adam algorithm is used as the optimizer.
Some parameters of the neural network must be tuned manually and repeatedly during model training, such as the learning rate. A suitable learning rate lets the model fit efficiently without hurting accuracy: too large a learning rate makes the model converge too fast or even fail to converge, while too small a rate slows convergence and hurts training efficiency. In this invention the learning rate was adjusted repeatedly during training and finally set to 0.001, which meets the experimental requirements well. Fig. 6 shows the recognition effect (F1 value) of the algorithm at learning rates of 0.001 and 0.01, respectively.
To avoid overfitting during model training, the Dropout method is used; the parameter was adjusted repeatedly according to the recognition effect during training and finally set to 0.5, which gives good results. Fig. 7 shows the recognition effect (F1 value) of the algorithm with the dropout parameter set to 0.5 and 0.8, respectively.
The entity recognition model and fault text classification model parameter settings are shown in table 4.
Table 4 entity identification and fault classification parameter settings
The evaluation indexes adopted for the experimental results are mainly precision (Precision), recall (Recall) and the F1 value. Precision is the ratio of correctly recognized entities to all recognized entities and reflects the accuracy of entity recognition; recall is the ratio of correctly recognized entities to the total number of entities and reflects the completeness of entity recognition. The F1 value is the harmonic mean of precision and recall and serves as a comprehensive index of both. The formulas are:
Precision = TP / (TP + FP)    (10)

Recall = TP / (TP + FN)    (11)

F1 = 2 × Precision × Recall / (Precision + Recall)    (12)
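The three metrics can be computed over gold and predicted entity sets as follows (a generic sketch, not tied to the invention's evaluation code):

```python
def prf1(gold, predicted):
    """Precision, recall and F1 over sets of gold vs. predicted entities."""
    tp = len(gold & predicted)                       # correctly recognized entities
    p = tp / len(predicted) if predicted else 0.0    # precision: correct / recognized
    r = tp / len(gold) if gold else 0.0              # recall: correct / total
    f1 = 2 * p * r / (p + r) if p + r else 0.0       # harmonic mean of p and r
    return p, r, f1
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two values, which is why it is used as the comprehensive index here.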
(1) Results and analysis
The invention takes the word vectors output by the BERT pre-training model as input to BiLSTM-CRF to build the entity recognition model and compares the entity recognition effect of different pre-trained word vectors. A dedicated word segmenter is trained according to the text characteristics of the high-speed train bogie fault domain and combined with word2vec+TextCNN to build the fault text classification model, whose classification effect is compared with that of a classification model based on conventional jieba segmentation from common NLP practice. The results are shown in Table 5 and Table 6.
Table 5 comparison of model results
Table 6 comparison of model results
In the comparison of named-entity recognition results, recognition based on BERT pre-trained word vectors outperforms recognition based on the CBOW word-vector model. The precision obtained for subsystem entities is lower because their labeled data are scarce, but their recall is higher. After analysis and comparison, the BERT+BiLSTM-CRF model is adopted as the final model for named entity recognition.
Comparing the classification effect of the two classification models shows that the model based on the domain word-segmentation model outperforms the one based on jieba segmentation. This is because, in the domain-segmenter-based model, domain terms are recognized better during segmentation and word-frequency counting; the domain dictionary built from them therefore carries more domain-related semantic features, which yields better classification results. After analysis and comparison, the invention adopts the word2vec+TextCNN model based on the domain word-segmentation model as the final fault classification model.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that the above-mentioned preferred embodiment should not be construed as limiting the invention, and the scope of the invention should be defined by the appended claims. It will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the spirit and scope of the invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (10)

1. The knowledge extraction method in the complex equipment knowledge graph construction process is characterized by comprising the following steps of:
performing data mining to obtain knowledge data types, and performing knowledge extraction based on the data types;
if the data type is structured data, retrieving the data correspondingly stored in related database tables of the relational database to obtain corresponding first knowledge data;
if the data type is semi-structured data, extracting corresponding second knowledge data through a network data extraction template and a non-network data extraction template;
and if the data type is unstructured data, marking, entity identification, text classification and relation extraction are carried out on the data through a deep learning model, so that corresponding third knowledge data is obtained.
2. The method according to claim 1, wherein the data mining is specifically:
classifying according to the types of the data of each stage of the whole life cycle of the product;
the data of each stage is divided into structured data, semi-structured data and unstructured data according to data structuring.
3. The method according to claim 1, wherein the data stored in correspondence with the related data form in the relational database is specifically:
the D2R tool is used for conversion, the table names of the database are directly mapped to the classes in the resource description framework, the fields are mapped to the attributes of the classes, and the relations among the classes are obtained from the table representing the relations.
4. The method according to claim 1, wherein extracting the corresponding knowledge data through the network data extraction template and the non-network data extraction template is specifically:
inputting the network data into a wrapper, and extracting knowledge data in the network data; the wrapper is encapsulated by a markup language and a path language;
and inputting the non-network data into a template library, and performing template matching to extract knowledge data in the non-network data.
5. The method of claim 4, wherein the template library is a structured template library created by generalizing and summarizing the structure of tables or process data in the complex equipment data.
6. The method of claim 4, further comprising inputting plain text data in the second knowledge data obtained after the extraction of the network data and the non-network data into an unstructured text knowledge extraction model for entity extraction.
7. The method according to claim 1, wherein the labeling of the data by the deep learning model is specifically:
and labeling the entities, the relations, the sequences and the fault types of the data based on the text labeling tool to obtain labeled training data.
8. The method according to claim 7, wherein the entity identification is in particular:
inputting the labeled training data into a model, and training by using a BERT model to obtain word vectors containing semantic information;
inputting the word vector into a bi-directional neural network;
and carrying out entity identification on the unlabeled test set data by using the deep learning model.
9. The method according to claim 8, wherein the text classification is specifically:
inputting the labeled training data into a word2vec model, training to obtain a word vector model, then taking the word vector model as the input of a textCNN model,
feature extraction of the convolution layer and pooling of the pooling layer are carried out;
and outputting the probability of each category through a fully connected softmax layer, and classifying the unlabeled test set by using the obtained model.
10. The method according to claim 1, wherein the relation extraction is specifically:
acquiring OWL files of the ontology;
analyzing to obtain (concept, relation, concept) triples;
acquiring a CSV file of an entity identification result;
reading the concept corresponding to the CSV header and the entities corresponding to the concept according to the triples; establishing (entity, relation, entity) triples and storing them into a database to complete relation extraction.
CN202211738751.4A 2022-12-31 2022-12-31 Knowledge extraction method in complex equipment knowledge graph construction process Pending CN116010619A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211738751.4A CN116010619A (en) 2022-12-31 2022-12-31 Knowledge extraction method in complex equipment knowledge graph construction process


Publications (1)

Publication Number Publication Date
CN116010619A true CN116010619A (en) 2023-04-25

Family

ID=86026797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211738751.4A Pending CN116010619A (en) 2022-12-31 2022-12-31 Knowledge extraction method in complex equipment knowledge graph construction process

Country Status (1)

Country Link
CN (1) CN116010619A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116467468A (en) * 2023-05-05 2023-07-21 国网浙江省电力有限公司 Power management system abnormal information handling method based on knowledge graph technology
CN116467468B (en) * 2023-05-05 2024-01-05 国网浙江省电力有限公司 Power management system abnormal information handling method based on knowledge graph technology
CN116893924A (en) * 2023-09-11 2023-10-17 江西南昌济生制药有限责任公司 Equipment fault processing method, device, electronic equipment and storage medium
CN116893924B (en) * 2023-09-11 2023-12-01 江西南昌济生制药有限责任公司 Equipment fault processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112199511A (en) Cross-language multi-source vertical domain knowledge graph construction method
CN112732934B (en) Power grid equipment word segmentation dictionary and fault case library construction method
CN111737496A (en) Power equipment fault knowledge map construction method
CN110929030A (en) Text abstract and emotion classification combined training method
CN116010619A (en) Knowledge extraction method in complex equipment knowledge graph construction process
CN110298403B (en) Emotion analysis method and system for enterprise main body in financial news
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN109871955A (en) A kind of aviation safety accident causality abstracting method
CN115099338A (en) Power grid master equipment-oriented multi-source heterogeneous quality information fusion processing method and system
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
CN112463981A (en) Enterprise internal operation management risk identification and extraction method and system based on deep learning
CN112883286A (en) BERT-based method, equipment and medium for analyzing microblog emotion of new coronary pneumonia epidemic situation
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
Li et al. A method for resume information extraction using bert-bilstm-crf
CN115292490A (en) Analysis algorithm for policy interpretation semantics
CN115757695A (en) Log language model training method and system
CN111428502A (en) Named entity labeling method for military corpus
Qu et al. Knowledge-driven recognition methodology for electricity safety hazard scenarios
CN117474010A (en) Power grid language model-oriented power transmission and transformation equipment defect corpus construction method
CN112632978A (en) End-to-end-based substation multi-event relation extraction method
CN117094390A (en) Knowledge graph construction and intelligent search method oriented to ocean engineering field
CN117473054A (en) Knowledge graph-based general intelligent question-answering method and device
CN113361259B (en) Service flow extraction method
CN115858807A (en) Question-answering system based on aviation equipment fault knowledge map
CN115098687A (en) Alarm checking method and device for scheduling operation of electric power SDH optical transmission system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination