CN112836509A

CN112836509A - Expert system knowledge base construction method and system

Info

Publication number: CN112836509A
Application number: CN202110197687.2A
Authority: CN
Inventors: 陈衡; 岳莹莹; 周诗坤; 史磊; 张兴军
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-02-22
Filing date: 2021-02-22
Publication date: 2021-05-25

Abstract

The invention discloses a method and a system for constructing an expert system knowledge base. A web front end collects design and manufacturing problems arising in the operation and maintenance process of a manufacturing enterprise, together with user feedback during use, and Chinese word segmentation training is performed on the collected text to obtain a word segmentation mark sequence. The word segmentation mark sequence is trained with a Word2vec model to generate word embedding vectors, and a text vector built from the word vectors and their weights represents the feature data. The text vectors are classified with a KNN algorithm, and the feature data corresponding to the classified text vectors are completed according to the knowledge representation rules of the expert system knowledge base and then stored in a feedback information database. The feedback information database is processed periodically with a clustering algorithm to construct and complete the expert system knowledge base, thereby optimizing the whole manufacturing process.

Description

Expert system knowledge base construction method and system

Technical Field

The invention belongs to the technical field of intersection of data mining, machine learning and natural language processing, and particularly relates to a method and a system for constructing an expert system knowledge base.

Background

Manufacturing enterprises face complicated data types and sources in their operation and maintenance processes, and the design, production and manufacturing problems that are discovered do not form an effective closed loop that feeds back into design, production and manufacturing. Based on big data modeling technology and architecture, structured and unstructured data such as drawings, models and documents covering the design and research and development of manufacturing enterprises can be provided, and text feature extraction and text mining can be applied to acquire knowledge automatically. Knowledge in business fields such as design, manufacturing and management is thereby obtained and applied effectively, the data of the whole life cycle of the manufacturing enterprise are standardized into an expert system knowledge base for manufacturing research, development and design, and digital value-added services based on operation and maintenance data, together with on-demand optimization of the design and manufacturing process, are realized.

At the present stage, under the limitation of natural language processing technology and related extraction technology, each semantic component and the corresponding relation thereof in the sentence cannot be well identified, so that accurate classification cannot be performed.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method and a system for constructing an expert system knowledge base aiming at the defects in the prior art, automatically acquire knowledge by applying a text feature extraction technology, classify the selected features by adopting a classification method based on machine learning, form a systematic manufacturing enterprise operation and maintenance big knowledge base, provide data support and scientific basis for subsequent design optimization and construction management decisions of the same type, and realize the whole-process optimization of the manufacturing industry.

The invention adopts the following technical scheme:

An expert system knowledge base construction method comprises the following steps:

S1, collecting, through a web front end, design problems and manufacturing problems arising in the operation and maintenance process of a manufacturing enterprise and feedback from users during use, and performing Chinese word segmentation training on the collected text with a deep-learning-based bidirectional long short-term memory conditional random field (BI-LSTM-CRF) model to obtain a word segmentation mark sequence;

S2, training the word segmentation mark sequence of step S1 with a Word2vec model to generate word embedding vectors, analyzing the weight of each word vector in the text with a TF-IDF algorithm improved on the basis of class frequency variance, and constructing a text vector from the word vectors and weights to represent the feature data;

S3, classifying the text vectors obtained in step S2 with a KNN algorithm, completing the feature data corresponding to the classified text vectors according to the knowledge representation rules of the expert system knowledge base, and storing the feature data in a feedback information database;

S4, periodically processing the feedback information database of step S3 with a clustering algorithm to construct and complete the expert system knowledge base.

Specifically, in step S1, the non-text portion of the data collected by the web front end is removed with Python regular expressions or Beautiful Soup, and the data are then used for training. Suppose that the character c_t is input at time t and the window size is k. The character sequence corresponds to k sub-vectors of dimension d in the word embedding layer, and the sub-vectors are concatenated into one long vector x_t ∈ R^{H_1}, with H_1 = k × d. The concatenated embedding vector x_t is used as the input of the BI-LSTM and CRF network, which transforms it into the output h_t. After a softmax transformation, a vector y_t ∈ R^D with the same dimension as the label set is obtained; y_t denotes the probability distribution over the lexeme labels of c_t, and D is the number of lexeme labels. Finally, the mark sequence with the maximum probability is output.
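The cleanup and window concatenation described for step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy two-dimensional embedding table, the padding symbol and the choice to centre the window on c_t are all assumptions, and only the regular-expression route is shown (Beautiful Soup is not used here).

```python
import re

def strip_markup(raw):
    """Drop HTML tags and collapse runs of whitespace (regex-only cleanup)."""
    without_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", without_tags).strip()

def window_vector(chars, t, k, embedding, pad="<PAD>"):
    """Concatenate the d-dimensional embeddings of the k characters around position t."""
    start = t - k // 2  # assumption: the window is centred on c_t
    x_t = []
    for pos in range(start, start + k):
        # Positions outside the sentence use a padding embedding (assumption).
        ch = chars[pos] if 0 <= pos < len(chars) else pad
        x_t.extend(embedding[ch])
    return x_t  # length k * d

# Toy embedding table with d = 2; real embeddings would be learned.
embedding = {
    "<PAD>": [0.0, 0.0],
    "油": [0.1, 0.2],
    "压": [0.3, 0.4],
    "低": [0.5, 0.6],
}
chars = list(strip_markup("<p>油压低</p>"))
x_0 = window_vector(chars, t=0, k=3, embedding=embedding)
print(len(x_0))  # 6, i.e. k * d
```

With k = 3 and d = 2 the concatenated vector has six components, which is the vector that would be fed to the sequence model.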

Specifically, step S2 specifically includes:

S201, training the mark sequence obtained by word segmentation of the text with a Word2vec model, and converting the segmented text into low-dimensional numerical vectors, wherein each word w_i is represented by a word vector and K is the dimension of the word vector;

S202, calculating the weight of each word vector in the text with an improved TF-IDF algorithm, and extracting the feature words by considering both the frequency of the feature words in the whole corpus and their distribution across different categories.

Further, in step S202, the feature vector vec(d_i) is expressed as:

wherein V_t is the word vector of the word w_i, tf is the frequency of occurrence of the word w_i in document d, idf is the inverse document frequency of the word w_i in document d, and τ_{t,i} is the class frequency variance of the word w_i in document d_j.
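One plausible reading of a text vector "based on word vectors and weights" is a weighted sum of the word vectors of the distinct words in the document. The sketch below assumes exactly that form; the expression itself is not reproduced in this text, so the summation, the function name and the toy inputs are assumptions. In practice the word vectors would come from a trained Word2vec model and the weights from the improved TF-IDF step.

```python
def text_vector(tokens, word_vectors, weights):
    """Weighted sum of word vectors over the distinct words of one document."""
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    # dict.fromkeys keeps first-seen order and drops repeated tokens,
    # because term frequency is already folded into the weight.
    for tok in dict.fromkeys(tokens):
        if tok in word_vectors and tok in weights:
            for j, component in enumerate(word_vectors[tok]):
                vec[j] += weights[tok] * component
    return vec

# Toy inputs (made-up values for illustration only).
word_vectors = {"pump": [1.0, 0.0], "leak": [0.0, 2.0]}
weights = {"pump": 0.5, "leak": 0.25}
print(text_vector(["pump", "leak", "pump"], word_vectors, weights))  # [0.5, 0.5]
```

Words missing from either table are skipped, so out-of-vocabulary tokens do not raise an error.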

Specifically, the improved TF-IDF algorithm is:

tf-idf-τ_{i,j} = tf-idf_{i,j} × τ_i

wherein the class frequency variance τ_i is introduced to measure the distribution of a term across different categories, as follows:

wherein df(d, w_i) is the number of documents in the text corpus d that contain the word w_i, the class-level document frequency is the number of documents of class c_j that contain the word w_i, N is the number of text categories, and τ_i is the class frequency variance of the word w_i.
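The weighting can be sketched as below. The product tf-idf × τ_i follows the formula given above, but the expression for τ_i is not reproduced in this text, so the sketch assumes it is the population variance of the per-class document frequencies of the word around their mean; the idf form log(total documents / df) and all names are likewise assumptions.

```python
import math
from collections import Counter

def class_frequency_variance(word, docs_by_class):
    """Assumed form of tau_i: variance of per-class document frequencies of `word`."""
    per_class = [sum(1 for doc in docs if word in doc) for docs in docs_by_class.values()]
    mean = sum(per_class) / len(per_class)
    return sum((count - mean) ** 2 for count in per_class) / len(per_class)

def tf_idf_tau(word, doc, docs_by_class):
    """tf-idf of `word` in `doc`, multiplied by the class frequency variance."""
    all_docs = [d for docs in docs_by_class.values() for d in docs]
    df = sum(1 for d in all_docs if word in d)
    if df == 0 or not doc:
        return 0.0
    tf = Counter(doc)[word] / len(doc)
    idf = math.log(len(all_docs) / df)  # assumed idf form
    return tf * idf * class_frequency_variance(word, docs_by_class)

# Toy corpus: class labels and tokens are illustrative.
corpus = {
    "lubrication": [["oil", "pressure", "low"], ["oil", "filter", "blocked"]],
    "cooling": [["water", "temperature", "high"], ["water", "pump", "low"]],
}
doc = corpus["lubrication"][0]
print(class_frequency_variance("oil", corpus))  # 1.0 (concentrated in one class)
print(class_frequency_variance("low", corpus))  # 0.0 (evenly spread)
```

Under this assumed form, a word spread evenly over the classes receives weight zero, while a word concentrated in one class keeps a positive weight, which matches the stated aim of favouring words that help category judgment.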

Specifically, step S3 specifically includes:

S301, after a new text arrives, determining the vector of the new text from its feature words; selecting the k texts in the training text set that are most similar to the new text, with similarity measured by the cosine of the angle between vectors; calculating in turn the weight of each class among the k neighbors of the new text, where the weight of a class equals the sum of the similarities between the test sample and the training samples of that class among the k neighbors; and comparing the class weights and assigning the text to the class with the largest weight;

S302, after the feature data are obtained by classification, representing them with the production-rule and frame knowledge representation of the expert system: the feature data are expressed as data with a condition-action structure, the frame serves as the main body of the representation and the rules are embedded in the frame, the frame finds its corresponding rules through a rule class, and the rule class finds its corresponding frame through the name of the frame to which it belongs; and finally storing the processed data in the feedback information database.
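The two-way link between frames and rules in step S302 can be illustrated with a small data structure. This is a hypothetical sketch: the class names Rule, Frame and RuleBase, the slot layout and the example strings are assumptions made for illustration and do not come from the source.

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    frame_name: str  # name of the frame this rule belongs to
    condition: str
    action: str

@dataclass
class Frame:
    name: str
    slots: dict = field(default_factory=dict)

class RuleBase:
    """Hypothetical store linking frames and condition-action rules in both directions."""

    def __init__(self):
        self.frames = {}
        self.rules_by_frame = {}

    def add_frame(self, frame):
        self.frames[frame.name] = frame
        self.rules_by_frame.setdefault(frame.name, [])

    def add_rule(self, rule):
        if rule.frame_name not in self.frames:
            raise KeyError(f"unknown frame: {rule.frame_name}")
        self.rules_by_frame[rule.frame_name].append(rule)

    def rules_for(self, frame_name):
        """Frame -> rules lookup."""
        return list(self.rules_by_frame[frame_name])

    def frame_of(self, rule):
        """Rule -> frame lookup through the frame name stored on the rule."""
        return self.frames[rule.frame_name]

kb = RuleBase()
kb.add_frame(Frame("lubricating system", {"symptom": "lubricating oil pressure is too low"}))
# The action text below is made up for the example.
rule = Rule("lubricating system", "lubricating oil filter is dirty and blocked",
            "clean or replace the filter")
kb.add_rule(rule)
print(kb.frame_of(rule).name)  # lubricating system
```

Storing only the frame name on each rule keeps the rule records flat, which suits row-by-row storage in a feedback database.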

Specifically, step S4 specifically includes:

S401, setting a period through a trigger and periodically processing the feedback information database with a clustering algorithm: randomly selecting k objects from the n data objects as initial cluster centers, assigning each remaining object to the cluster whose center it is most similar to, then recalculating the center of each new cluster, and repeating this process until the criterion function converges;

S402, setting a threshold A and a threshold B on cluster size to decide which data are stored in the knowledge base: for feedback data not yet stored in the knowledge base, data whose cluster size is larger than threshold A are stored in the knowledge base directly, and for data whose cluster size is smaller than threshold A and larger than threshold B, whether to store them in the knowledge base is decided by manual confirmation; and marking the records in the feedback information database that correspond to the newly added knowledge.
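The clustering loop of step S401 reads like standard k-means. A minimal sketch under that reading follows; squared Euclidean distance as the similarity measure, stopping when the centers no longer change, the fixed random seed and the function names are assumptions.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k randomly chosen objects as initial centers
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:  # assumed convergence test: centers are stable
            break
        centers = new_centers
    return centers, clusters

points = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

The cluster sizes returned here are the quantity that step S402 compares against thresholds A and B.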

Another technical solution of the present invention is an expert system knowledge base construction system, comprising:

a preprocessing module, which collects, through a web front end, design and manufacturing problems arising in the operation and maintenance process of a manufacturing enterprise and feedback from users during use, and performs Chinese word segmentation training on the collected text with a deep-learning-based bidirectional long short-term memory conditional random field model to obtain a word segmentation mark sequence;

an analysis module, which trains the word segmentation mark sequence with a Word2vec model to generate word embedding vectors, then analyzes the weight of each word vector in the text with the TF-IDF algorithm improved on the basis of class frequency variance, and constructs a text vector representation based on the word vectors and weights;

a classification module, which classifies with a KNN algorithm and, after the feature data are obtained by classification, completes the feature data according to the knowledge representation rules of the expert system knowledge base and stores them in a feedback information database;

and a construction module, which periodically processes the feedback information database with a machine learning clustering algorithm to construct and complete the expert system knowledge base.

Compared with the prior art, the invention has at least the following beneficial effects:

The expert system knowledge base construction method of the invention is based on big data modeling technology and architecture. It applies text feature extraction and a machine-learning-based classification method to acquire and classify knowledge automatically, constructs a big data knowledge base for the research, development and design of manufacturing enterprises, and provides digital value-added services for operation and maintenance optimization. In the text preprocessing stage, Chinese word segmentation is performed with a model combining BI-LSTM and CRF, which significantly improves the word segmentation result. In the feature extraction stage, a feature extraction method combining Word2Vec with the improved TF-IDF is adopted; it considers both the frequency of the feature words in the whole text corpus and their distribution across different categories, so the extracted feature data are more accurate. In the knowledge base construction stage, production-rule frame rules are used to represent the data and reason over them. In addition, a clustering algorithm periodically classifies, extracts and marks the data, which avoids repeated processing and storage, so that an expert knowledge base for the operation and maintenance of a manufacturing enterprise is established automatically and quickly.

Further, the collected text is subjected to Chinese word segmentation training with the deep-learning-based bidirectional long short-term memory conditional random field model to obtain the word segmentation mark sequence. The model learns text features automatically and models the context dependencies of the text; the CRF layer takes into account the label information both before and after each character of the sentence when inferring over the text. The model has good word segmentation performance and generalizes well to cross-domain data.

Further, the mark sequence of step S1 is trained with the Word2vec model to generate word embedding vectors, the weight of each word vector in the text is then analyzed with the TF-IDF algorithm improved on the basis of class frequency variance, and a text vector based on the word vectors and weights is constructed to represent the feature data.

Further, Word2vec is used for word representation. The resulting word vectors are low-dimensional, dense and real-valued and preserve semantic information well, but they cannot express the importance of a word, so the TF-IDF algorithm is introduced to calculate the weight of each word vector in the text. The TF-IDF algorithm, however, considers only the frequency of the feature words in the whole corpus and ignores their distribution across different categories, so some words that contribute to category judgment are lost. The TF-IDF algorithm improved on the basis of class frequency variance is therefore adopted to form a text vector representation based on word vectors and weights, which preserves the contextual relationships of words well.

Further, the KNN algorithm is used for prediction and multi-class classification; the feature data corresponding to the classified text vectors are completed according to the knowledge representation rules of the expert system knowledge base and stored in the feedback information database.
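The KNN classification can be sketched as follows: the k training texts most similar to the new text by cosine similarity are found, the similarities are summed per class, and the class with the largest summed weight wins. The function names, class labels and toy two-dimensional vectors are assumptions for illustration.

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_classify(query, training, k):
    """training: list of (vector, label) pairs; returns the label with the largest weight."""
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    class_weight = defaultdict(float)
    for similarity, label in scored[:k]:
        class_weight[label] += similarity  # class weight = sum of neighbor similarities
    return max(class_weight, key=class_weight.get)

training = [
    ([1.0, 0.0], "lubrication"),
    ([0.9, 0.1], "lubrication"),
    ([0.0, 1.0], "cooling"),
    ([0.1, 0.9], "cooling"),
]
print(knn_classify([0.8, 0.2], training, k=3))  # lubrication
```

Summing similarities rather than counting votes lets a few very close neighbors outweigh a larger number of distant ones.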

Further, the feedback information database is processed periodically by using a clustering algorithm so as to quickly construct an expert system knowledge base, and marks are added to corresponding records in the feedback information database so as to avoid repeated processing.

In conclusion, the invention can quickly and accurately construct the knowledge base of the expert system.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic representation of the structure of a BI-LSTM and CRF combination of the present invention;

FIG. 3 is a schematic diagram of the model of Word2Vec and improved TF-IDF binding according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.

The invention provides a method for constructing an expert system knowledge base, which aims at the problem of complicated data types and sources in the operation and maintenance process of manufacturing enterprises, adopts a Chinese word segmentation model based on the combination of a BI-LSTM and a CRF neural network in the word segmentation stage, not only keeps the characteristic that the LSTM can utilize context information, but also considers the front-back dependency relationship between output labels through a CRF layer, and obviously improves the Chinese word segmentation effect. And in the characteristic extraction stage, words are mapped to a vector space by using Word2Vec, the words are converted into Word vectors, and an improved TF-IDF characteristic extraction method is combined with the Word vectors, so that the semantic information of the words is considered, and the dimensionality of the Word vectors is controlled. The invention can better combine the requirements of enterprises and users deeply mined by big data of manufacturing enterprises and provide digital value-added service for operation and maintenance optimization.

Referring to fig. 1, the method for constructing an expert system knowledge base of the present invention includes the following steps:

S1, collecting, through a web front end, design and manufacturing problems arising in the operation and maintenance process of manufacturing enterprises and feedback from users during use, and performing Chinese word segmentation training on the collected text with a deep-learning-based bidirectional long short-term memory conditional random field model to obtain a word segmentation mark sequence;

The non-text part of the data collected by the web front end is removed with Python regular expressions or Beautiful Soup, and the data are then used for training. Suppose that the character c_t is input at time t and the window size is k. The character sequence corresponds to k sub-vectors of dimension d in the word embedding layer, and the sub-vectors are concatenated into one long vector x_t ∈ R^{H_1}, with H_1 = k × d. The concatenated embedding vector x_t is used as the input of the BI-LSTM and CRF network, which transforms it into the output h_t. After a softmax transformation, a vector y_t ∈ R^D with the same dimension as the label set is obtained; y_t denotes the probability distribution over the lexeme labels of c_t, and D is the number of lexeme labels. Finally, the mark sequence with the maximum probability is output.

S2, training the mark sequence of step S1 with a Word2vec model to generate word embedding vectors, analyzing the weight of each word vector in the text with the TF-IDF algorithm improved on the basis of class frequency variance, and constructing a text vector based on the word vectors and weights to represent the feature data;

Referring to fig. 2, the bidirectional long short-term memory conditional random field model learns text features automatically and models the context dependencies of the text; in addition, the conditional random field takes into account the label information before and after each character of the sentence when inferring over the text, so the model has good word segmentation performance and generalizes well to cross-domain data.
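How label information before and after each character influences the output can be illustrated with Viterbi decoding, the usual way to pick the highest-scoring label sequence from a CRF layer. The sketch below is an assumption-laden illustration: the B/M/E/S (begin, middle, end, single) tag set, the hand-written log-domain scores and the function names are not from the source, and no network is trained here.

```python
TAGS = ["B", "M", "E", "S"]
NEG_INF = float("-inf")

def viterbi(emissions, transitions):
    """emissions: one {tag: score} dict per character; transitions: {(prev, cur): score}.
    Scores are log-domain; transitions not listed are treated as forbidden."""
    best = [dict(emissions[0])]
    back = []
    for emission in emissions[1:]:
        scores, pointers = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transitions.get((p, cur), NEG_INF))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), NEG_INF) + emission[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

def tags_to_words(chars, tags):
    """Cut the character sequence after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

allowed = [("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")]
transitions = {pair: 0.0 for pair in allowed}
emissions = [
    {"B": -0.1, "M": -3.0, "E": -3.0, "S": -2.0},
    {"B": -3.0, "M": -2.0, "E": -0.2, "S": -1.0},
    {"B": -2.0, "M": -3.0, "E": -1.5, "S": -0.3},
]
tags = viterbi(emissions, transitions)
print(tags, tags_to_words("油压低", tags))  # ['B', 'E', 'S'] ['油压', '低']
```

The transition table is what rules out sequences such as B followed by S, which a purely per-character softmax could otherwise produce.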

Each word w_i is represented by a word vector, where K is the dimension of the word vector;

The feature vector vec(d_i) is expressed as:

wherein V_t is the word vector of the word w_i, tf is the frequency of occurrence of the word w_i in document d, idf is the inverse document frequency of the word w_i in document d, and τ_{t,i} is the class frequency variance of the word w_i in document d_j.

The improved TF-IDF algorithm is specifically:

tf-idf-τ_{i,j} = tf-idf_{i,j} × τ_i

wherein df(d, w_i) is the number of documents in the text corpus d that contain the word w_i;

S3, classifying the text vectors obtained in step S2 with the KNN algorithm, completing the feature data corresponding to the classified text vectors according to the knowledge representation rules of the expert system knowledge base, and storing the feature data in the feedback information database;

First, all word vectors in the sample are obtained by training a Word2vec model; then the weight of each word vector in the text is analyzed with the TF-IDF algorithm improved on the basis of class frequency variance, and a text vector representation based on the word vectors and weights is constructed. Compared with traditional machine learning text classification, the text classification model that combines the two achieves a better classification result.

Referring to fig. 3, the TF-IDF values of the words are calculated and used to extract the feature words, Word2vec word vectors are adopted, and dimensionality is reduced through feature selection and feature weighting on the basis of a bag-of-words model.

And S4, periodically processing the feedback information database in the step S3 by using a clustering algorithm to construct an expert system knowledge base. For feedback data which are not stored in the knowledge base, directly storing data with cluster size larger than a threshold value A into the knowledge base, and determining whether the data with cluster size smaller than the threshold value A and larger than a threshold value B are stored in the knowledge base again through manual confirmation; and marking the corresponding record of the newly added knowledge in the feedback information database so as to avoid repeated processing.

S402, setting thresholds A and B on cluster size to decide which data are stored in the knowledge base: for feedback data not yet stored in the knowledge base, data whose cluster size is larger than threshold A are stored in the knowledge base directly, and for data whose cluster size is smaller than threshold A and larger than threshold B, whether to store them in the knowledge base is decided by manual confirmation; and marking the records in the feedback information database that correspond to the newly added knowledge, so as to avoid repeated processing.

In another embodiment of the present invention, an expert system knowledge base construction system is provided, which can be used to implement the expert system knowledge base construction method described above, and specifically, the expert system knowledge base construction system includes a preprocessing module, a word segmentation module, an analysis construction module, and a classification module.

The analysis module is used for generating word embedding vectors through Word2vec model training, then analyzing the weight of each word vector in the text with the TF-IDF algorithm improved on the basis of class frequency variance, and constructing a text vector representation based on the word vectors and weights;

The classification module is used for classifying with the KNN algorithm; after the feature data are obtained by classification, they are completed according to the knowledge representation rules of the expert system knowledge base and stored in the feedback information database;

The construction module is used for periodically processing the feedback information database with a machine learning clustering algorithm; for feedback data not yet stored in the knowledge base, data whose cluster size is larger than threshold A are stored in the knowledge base directly, and for data whose cluster size is smaller than threshold A and larger than threshold B, whether to store them is decided by manual confirmation; the records in the feedback information database that correspond to the newly added knowledge are marked to avoid repeated processing.

For example, a user enters operation and maintenance information for a ship through the web client interface. After the text is collected, preprocessing and feature extraction yield the feature data "lubricating oil pressure is too low", which is classified into the "lubricating system" category. The associated items (lubricating oil pressure too low; lubricating oil pipe broken or containing air; lubricating oil filter dirty and blocked; lubricating oil viscosity too low; and the like) are used to complete the data with the production-rule frame knowledge representation:

After the feature data with the production-rule frame structure are stored one by one in the feedback information database, data are periodically extracted from the database, according to the cycle time set by a trigger, for cluster analysis and marking. For data not yet stored in the knowledge base, data whose cluster size is larger than threshold A are stored in the knowledge base, data whose cluster size is smaller than threshold A and larger than threshold B are stored after manual confirmation, and data that have already been marked are not processed again.
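The threshold decision and the marking that prevents repeated processing can be sketched as one pass over the feedback records. The record layout (dicts with "id", "cluster" and "marked"), the function name, counting cluster size over unmarked records only, and sending a cluster whose size equals threshold A to manual review are all assumptions; the source does not specify these details.

```python
from collections import Counter

def triage(records, threshold_a, threshold_b):
    """Return (ids stored directly, ids needing manual confirmation)."""
    pending = [r for r in records if not r["marked"]]  # marked records are skipped
    sizes = Counter(r["cluster"] for r in pending)
    stored, review = [], []
    for record in pending:
        size = sizes[record["cluster"]]
        if size > threshold_a:
            record["marked"] = True  # mark so the next cycle does not reprocess it
            stored.append(record["id"])
        elif size > threshold_b:
            review.append(record["id"])  # left unmarked until a person confirms
    return stored, review

records = (
    [{"id": i, "cluster": 0, "marked": False} for i in (1, 2, 3, 4)]
    + [{"id": i, "cluster": 1, "marked": False} for i in (5, 6)]
    + [{"id": 7, "cluster": 2, "marked": False}]
    + [{"id": 8, "cluster": 0, "marked": True}]
)
print(triage(records, threshold_a=3, threshold_b=1))  # ([1, 2, 3, 4], [5, 6])
print(triage(records, threshold_a=3, threshold_b=1))  # ([], [5, 6])
```

The second call shows the effect of marking: records stored in the first cycle are not stored again, while the unconfirmed cluster is offered for review once more.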

In summary, the expert system knowledge base construction method and system provided by the invention are oriented to the operation and maintenance big data of the manufacturing enterprise, the text feature extraction technology and the machine learning-based classification method are applied to automatically acquire and classify knowledge, the manufacturing enterprise research and development design big data knowledge base is constructed, and the digital value-added service is provided for the operation and maintenance optimization.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. A construction method of an expert system knowledge base is characterized by comprising the following steps:

2. The method according to claim 1, wherein in step S1, the non-text part of the data collected by the web front end is removed with Python regular expressions or Beautiful Soup, and the data are then used for training; suppose that the character c_t is input at time t and the window size is k; the character sequence corresponds to k sub-vectors of dimension d in the word embedding layer, and the sub-vectors are concatenated into one long vector x_t ∈ R^{H_1}, with H_1 = k × d; the concatenated embedding vector x_t is used as the input of the BI-LSTM and CRF network, which transforms it into the output h_t; after a softmax transformation, a vector y_t ∈ R^D with the same dimension as the label set is obtained, y_t denoting the probability distribution over the lexeme labels of c_t and D being the number of lexeme labels; and finally the mark sequence with the maximum probability is output.

3. The method according to claim 1, wherein step S2 is specifically:

each word w_i is represented by a word vector, where K is the dimension of the word vector;

4. The method of claim 3, wherein in step S202, the feature vector vec(d_i) is expressed as:

wherein V_t is the word vector of the word w_i, tf is the frequency of occurrence of the word w_i in document d, idf is the inverse document frequency of the word w_i in document d, and τ_{t,i} is the class frequency variance of the word w_i in document d_j.

5. The method of claim 1, wherein the improved TF-IDF algorithm is specifically:

tf-idf-τ_i,j＝tf-idf_i,j*τ_i

6. The method according to claim 1, wherein step S3 is specifically:

7. The method according to claim 1, wherein step S4 is specifically:

8. An expert system knowledge base construction system, comprising: