CN112836509A - Expert system knowledge base construction method and system - Google Patents

Expert system knowledge base construction method and system

Info

Publication number
CN112836509A
Authority
CN
China
Prior art keywords
word
data
text
vector
knowledge base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110197687.2A
Other languages
Chinese (zh)
Inventor
陈衡
岳莹莹
周诗坤
史磊
张兴军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202110197687.2A priority Critical patent/CN112836509A/en
Publication of CN112836509A publication Critical patent/CN112836509A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for constructing an expert system knowledge base. A web front end collects design and manufacturing problems arising in the operation and maintenance of a manufacturing enterprise, together with user feedback during use, and Chinese word segmentation training is performed on the collected text to obtain a word segmentation label sequence. The label sequence is trained with a Word2vec model to generate word embedding vectors, and a text vector built from the word vectors and their weights represents the feature data. The text vectors are classified with a KNN algorithm; the feature data corresponding to the classified text vectors are completed using the expert system knowledge base and stored in a feedback information database. The feedback information database is processed periodically with a clustering algorithm to construct and complete the expert system knowledge base, thereby supporting whole-process optimization in manufacturing.

Description

Expert system knowledge base construction method and system
Technical Field
The invention belongs to the technical field at the intersection of data mining, machine learning and natural language processing, and particularly relates to a method and a system for constructing an expert system knowledge base.
Background
The operation and maintenance of manufacturing enterprises involves data of many types from many sources, and the design, production and manufacturing problems that are discovered do not form an effective closed loop that feeds back into design, production and manufacturing. Big data modeling technology and architecture can address this: structured and unstructured data covering the design and research and development of a manufacturing enterprise (drawings, models, documents and the like) are supplied; text feature extraction and text mining are applied to acquire knowledge automatically; knowledge in business fields such as design, manufacturing and management is acquired and applied effectively; and the data of the enterprise's whole life cycle are standardized to form an expert system knowledge base for manufacturing research, development and design. This enables digital value-added services based on operation and maintenance data and on-demand optimization of the design and manufacturing process.
At present, owing to the limitations of natural language processing and related extraction techniques, the semantic components of a sentence and the relationships among them cannot be identified well, so accurate classification is not possible.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the above deficiencies in the prior art, a method and a system for constructing an expert system knowledge base. Knowledge is acquired automatically by applying text feature extraction, and the selected features are classified with a machine-learning-based classification method, forming a systematic large-scale operation and maintenance knowledge base for manufacturing enterprises. The knowledge base provides data support and a scientific basis for subsequent design optimization and construction management decisions of the same type, and supports whole-process optimization in manufacturing.
The invention adopts the following technical scheme:
An expert system knowledge base construction method comprises the following steps:
S1, collecting, through a web front end, design problems and manufacturing problems arising in the operation and maintenance of a manufacturing enterprise and user feedback during use, and performing Chinese word segmentation training on the collected text with a deep-learning-based bidirectional long short-term memory conditional random field (BI-LSTM-CRF) model to obtain a word segmentation label sequence;
S2, training the word segmentation label sequence of step S1 with a Word2vec model to generate word embedding vectors, analyzing the weight of each word vector in the text with a TF-IDF algorithm improved on the basis of class frequency variance, and constructing, from the word vectors and weights, a text vector that represents the feature data;
S3, classifying the text vectors obtained in step S2 with a KNN algorithm, completing the feature data corresponding to the classified text vectors using the expert system knowledge base, and storing the feature data in a feedback information database;
S4, periodically processing the feedback information database of step S3 with a clustering algorithm to construct and complete the expert system knowledge base.
Specifically, in step S1, the non-text portions of the data collected by the web front end are removed with Python regular expressions or BeautifulSoup, and the data are then used for training. Suppose the character $c_t$ is input at time $t$ and the window size is $k$. The character sequence corresponds to $k$ sub-vectors of dimension $d$ in the word embedding layer, and the sub-vectors are concatenated into a long vector $x_t \in \mathbb{R}^{H_1}$, with $H_1 = k \times d$. The concatenated vector $x_t$ is used as the input of the BI-LSTM and CRF network, which transforms it into the output $h_t$; a softmax transformation then yields a vector $y_t \in \mathbb{R}^{D}$ with the same dimension as the label set, representing the lexeme label probabilities of $c_t$, where $D$ is the number of lexeme labels. Finally, the label sequence with the maximum probability is output.
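As a rough illustration of the preprocessing and windowing described for step S1, the following Python sketch strips markup with a regular expression and concatenates the embeddings of a k-character window into the vector x_t. The function names, the padding token, the choice of a window centred on position t and the toy embedding table are assumptions made for illustration; the source does not specify them.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like an HTML/XML tag
SPACE_RE = re.compile(r"\s+")     # runs of whitespace


def strip_markup(raw):
    """Remove tags from front-end text and collapse whitespace (regex-only variant)."""
    text = TAG_RE.sub(" ", raw)
    return SPACE_RE.sub(" ", text).strip()


def window_vector(chars, t, k, embed, pad="<PAD>"):
    """Concatenate the embeddings of the k characters around position t.

    `embed` maps a character to a list of d floats, so the result has k * d
    entries. Centring the window on t and padding at the edges are assumptions.
    """
    start = t - k // 2
    x_t = []
    for pos in range(start, start + k):
        ch = chars[pos] if 0 <= pos < len(chars) else pad
        x_t.extend(embed[ch])
    return x_t


# Toy embedding table with d = 2 (hypothetical values).
EMBED = {"a": [1.0, 0.0], "b": [0.0, 1.0], "<PAD>": [0.0, 0.0]}
```

With k = 3 and d = 2, `window_vector(["a", "b"], 0, 3, EMBED)` has six entries, matching H_1 = k × d.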
Specifically, step S2 comprises:
S201, training the label sequence obtained by word segmentation of the text with a Word2vec model, converting the segmented text into low-dimensional numerical vectors
[Equation image BDA0002947689290000021 not reproduced]
which is the word vector of the word $w_i$, where $k$ is the dimension of the word vector;
S202, calculating the weight of each word vector in the text with the improved TF-IDF algorithm, and extracting feature words by considering both the frequency of the feature words in the whole corpus and their distribution across categories.
Further, in step S202, the feature word $\mathrm{vec}(d_i)$ is expressed as:
[Equation image BDA0002947689290000031 not reproduced]
where $V_t$ is the word vector of the word $w_i$, tf is the frequency of occurrence of the word $w_i$ in document $d$, idf is the inverse document frequency of the word $w_i$ in document $d$, and $\tau_{t,i}$ is the class frequency variance of the word $w_i$ in document $d_j$.
Specifically, the improved TF-IDF algorithm is:
$\text{tf-idf-}\tau_{i,j} = \text{tf-idf}_{i,j} \times \tau_i$
where the class frequency variance $\tau_i$ is introduced to measure the distribution of a term across categories:
[Equation image BDA0002947689290000032 not reproduced]
where $df(d, w_i)$ is the number of documents in the text corpus $d$ that contain the word $w_i$,
[Equation image BDA0002947689290000033 not reproduced]
is the number of documents in class $c_j$ that contain the word $w_i$, $n$ is the number of text categories, and $\tau_i$ is the class frequency variance of the word $w_i$.
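The class-frequency weighting can be sketched in Python as follows. Because the equation for the class frequency variance is not reproduced in this text, the sketch assumes it is the population variance of the per-class document frequencies of a word, and that the final weight is the product of tf, idf and that variance; both are assumptions, as are the function names and the toy corpus.

```python
import math
from collections import Counter


def class_frequency_variance(word, docs_by_class):
    """Variance of the per-class document frequencies of `word` (assumed form)."""
    dfs = [sum(1 for doc in docs if word in doc) for docs in docs_by_class.values()]
    mean = sum(dfs) / len(dfs)
    return sum((df - mean) ** 2 for df in dfs) / len(dfs)


def tf_idf_tau(word, doc, docs_by_class):
    """tf * idf * class frequency variance for `word` in the tokenised `doc`."""
    all_docs = [d for docs in docs_by_class.values() for d in docs]
    tf = Counter(doc)[word] / len(doc)
    df = sum(1 for d in all_docs if word in d)
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf * class_frequency_variance(word, docs_by_class)


# Hypothetical two-class corpus of already segmented documents.
CORPUS = {
    "lubrication": [["oil", "pressure", "low"], ["oil", "filter"]],
    "cooling": [["water", "pressure", "high"], ["water", "pump"]],
}
```

In this toy corpus "oil" occurs only in one class and receives a positive weight, while "pressure" is spread evenly over both classes and receives a weight of zero, which is the behaviour the class-distribution term is meant to capture.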
Specifically, step S3 comprises:
S301, when a new text arrives, determining its vector from the feature words; selecting the k texts in the training set most similar to the new text, with similarity measured by the cosine of the angle between vectors; computing in turn the weight of each class among the k neighbors of the new text, where the weight of a class equals the sum of the similarities between the new text and the training samples of that class among the k neighbors; and comparing the class weights and assigning the text to the class with the largest weight;
S302, after classification yields the feature data, representing it with the expert system's production-rule and frame knowledge representation: the feature data are expressed as data with a condition-action structure, the frame serves as the main body of the representation and the data are embedded in it, a frame finds its corresponding rules through the rule class, and a rule class finds its corresponding frame through the name of the frame it belongs to; finally, the processed data are stored in the feedback information database.
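The neighbor weighting of step S301 can be sketched as below: cosine similarity selects the k nearest training texts, and each class is scored by the sum of the similarities of its members among those neighbors. The function names and the two-dimensional toy vectors are illustrative assumptions.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(new_vec, training, k):
    """Assign `new_vec` to the class with the largest summed similarity.

    `training` is a list of (vector, label) pairs.
    """
    scored = sorted(
        ((cosine(new_vec, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weights = defaultdict(float)
    for similarity, label in scored:
        weights[label] += similarity
    return max(weights, key=weights.get)


# Hypothetical training set of text vectors.
TRAINING = [
    ([1.0, 0.0], "lubrication"),
    ([0.9, 0.1], "lubrication"),
    ([0.0, 1.0], "cooling"),
    ([0.1, 0.9], "cooling"),
]
```

Summing similarities rather than counting votes lets a few very close neighbors outweigh a larger number of distant ones.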
Specifically, step S4 comprises:
S401, setting a period through a trigger and periodically processing the feedback information database with a clustering algorithm: k objects are randomly selected from the n data objects as initial cluster centers; each remaining object is assigned to a cluster according to its similarity to the cluster centers; the center of each resulting cluster is then recomputed; and this process is repeated until the standard measure function converges;
S402, setting thresholds A and B on cluster size to decide which data are stored in the knowledge base: for feedback data not yet stored in the knowledge base, data in clusters larger than threshold A are stored in the knowledge base directly, and data in clusters smaller than threshold A but larger than threshold B are stored only after manual confirmation; the records in the feedback information database corresponding to newly added knowledge are marked.
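The threshold logic of step S402 can be sketched as a small triage function. The decision names, the record layout and the handling of sizes exactly equal to a threshold are assumptions, since the source only distinguishes "larger than A" from "smaller than A and larger than B".

```python
def triage_clusters(cluster_sizes, threshold_a, threshold_b):
    """Map each cluster id to 'store', 'review' or 'hold' based on its size."""
    if threshold_a <= threshold_b:
        raise ValueError("threshold A must be larger than threshold B")
    decisions = {}
    for cluster_id, size in cluster_sizes.items():
        if size > threshold_a:
            decisions[cluster_id] = "store"    # goes straight into the knowledge base
        elif size > threshold_b:
            decisions[cluster_id] = "review"   # needs manual confirmation first
        else:
            decisions[cluster_id] = "hold"     # stays in the feedback database only
    return decisions


def mark_stored(records, stored_clusters):
    """Flag feedback records whose cluster was stored, to avoid reprocessing."""
    for record in records:
        if record["cluster"] in stored_clusters:
            record["in_knowledge_base"] = True
    return records
```

For example, with A = 30 and B = 5, clusters of size 50, 12 and 2 would be stored, sent for review and held back, respectively.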
Another technical solution of the present invention is a system for constructing an expert system knowledge base, comprising:
a preprocessing module, which collects, through a web front end, design and manufacturing problems arising in the operation and maintenance of a manufacturing enterprise and user feedback during use, and performs Chinese word segmentation training on the collected text with a deep-learning-based bidirectional long short-term memory conditional random field model to obtain a word segmentation label sequence;
an analysis module, which trains the word segmentation label sequence with a Word2vec model to generate word embedding vectors, analyzes the weight of each word vector in the text with the TF-IDF algorithm improved on the basis of class frequency variance, and constructs a text vector representation from the word vectors and weights;
a classification module, which classifies with a KNN algorithm and, after classification yields the feature data, completes the feature data using the knowledge representation rules of the expert system knowledge base and stores them in a feedback information database;
and a construction module, which periodically processes the feedback information database with a machine learning clustering algorithm to construct and complete the expert system knowledge base.
Compared with the prior art, the invention has at least the following beneficial effects:
The expert system knowledge base construction method of the invention builds on big data modeling technology and architecture. It acquires and classifies knowledge automatically by applying text feature extraction and a machine-learning-based classification method, constructs a big data knowledge base for the research, development and design of manufacturing enterprises, and provides digital value-added services for operation and maintenance optimization. In the text preprocessing stage, Chinese word segmentation uses a model combining BI-LSTM and CRF, which significantly improves segmentation. In the feature extraction stage, a feature extraction method combining Word2Vec with the improved TF-IDF considers both the frequency of feature words in the whole text corpus and their distribution across categories, so the extracted feature data are more accurate. In the knowledge base construction stage, production-rule frame representation is used to express and reason over the data. In addition, a clustering algorithm periodically classifies, extracts and marks the data, which avoids repeated processing and storage and allows an expert knowledge base for the operation and maintenance of manufacturing enterprises to be built automatically and quickly.
Furthermore, a deep-learning-based bidirectional long short-term memory conditional random field model performs Chinese word segmentation training on the collected text to obtain the word segmentation label sequence. The model learns text features automatically and models the contextual dependencies of text, while the CRF layer takes into account the labels before and after each character of a sentence when inferring over the text; the model therefore segments words well and generalizes well to cross-domain data.
Further, the label sequence of step S1 is trained with the Word2vec model to generate word embedding vectors, the weight of each word vector in the text is analyzed with the TF-IDF algorithm improved on the basis of class frequency variance, and a text vector built from the word vectors and weights represents the feature data.
Furthermore, Word2vec is used for word representation. The resulting word vectors are low-dimensional, dense and real-valued and preserve semantic information well, but they cannot express the importance of a word, so the TF-IDF algorithm is introduced to calculate the weight of each word vector in the text. The TF-IDF algorithm, however, considers only the frequency of feature words in the whole corpus and ignores their distribution across categories, so some words that contribute to category judgment are lost. The TF-IDF algorithm improved on the basis of class frequency variance is therefore adopted to form a text vector representation based on word vectors and weights, which preserves word context relationships well.
Further, a KNN algorithm is used for prediction and multi-class classification; the feature data corresponding to the classified text vectors are completed using the knowledge representation rules of the expert system knowledge base and stored in the feedback information database.
Further, the feedback information database is processed periodically with a clustering algorithm so that the expert system knowledge base can be constructed quickly, and the corresponding records in the feedback information database are marked to avoid repeated processing.
In conclusion, the invention can construct the expert system knowledge base quickly and accurately.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the combined BI-LSTM and CRF structure of the present invention;
FIG. 3 is a schematic diagram of the combined Word2Vec and improved TF-IDF model of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.
To address the variety of data types and sources in the operation and maintenance of manufacturing enterprises, the invention provides a method for constructing an expert system knowledge base. In the word segmentation stage, a Chinese word segmentation model combining BI-LSTM with a CRF layer is adopted; it retains the LSTM's ability to use context information while the CRF layer accounts for the dependencies between adjacent output labels, which significantly improves Chinese word segmentation. In the feature extraction stage, Word2Vec maps words into a vector space, converting them into word vectors, and the improved TF-IDF feature extraction method is combined with the word vectors, so that the semantic information of words is taken into account while the dimensionality of the word vectors is kept under control. The invention can thus better combine the needs of enterprises and users, as mined from the big data of manufacturing enterprises, and provide digital value-added services for operation and maintenance optimization.
Referring to fig. 1, the method for constructing an expert system knowledge base of the present invention includes the following steps:
S1, collecting, through a web front end, design and manufacturing problems arising in the operation and maintenance of a manufacturing enterprise and user feedback during use, and performing Chinese word segmentation training on the collected text with a deep-learning-based bidirectional long short-term memory conditional random field model to obtain a word segmentation label sequence;
The non-text portions of the data collected by the web front end are removed with Python regular expressions or BeautifulSoup, and the data are then used for training. Suppose the character $c_t$ is input at time $t$ and the window size is $k$. The character sequence corresponds to $k$ sub-vectors of dimension $d$ in the word embedding layer, and the sub-vectors are concatenated into a long vector $x_t \in \mathbb{R}^{H_1}$, with $H_1 = k \times d$. The concatenated vector $x_t$ is used as the input of the BI-LSTM and CRF network, which transforms it into the output $h_t$; a softmax transformation then yields a vector $y_t \in \mathbb{R}^{D}$ with the same dimension as the label set, representing the lexeme label probabilities of $c_t$, where $D$ is the number of lexeme labels. Finally, the label sequence with the maximum probability is output.
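Selecting the label sequence with the maximum probability is commonly done with Viterbi decoding over per-character label scores and label-to-label transition scores. The sketch below assumes a four-label B/M/E/S lexeme scheme, log-domain scores and hand-picked toy numbers; none of these details are given in the source, and the network that would produce the scores is not shown.

```python
NEG_INF = -1e9  # score used for transitions the tagging scheme forbids

TAGS = ["B", "M", "E", "S"]  # begin, middle, end, single-character word
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
TRANSITIONS = {pair: 0.0 for pair in ALLOWED}

# Hypothetical per-character label scores for the four characters of "滑油压力".
EMISSIONS = [
    {"B": -0.2, "M": -3.0, "E": -3.0, "S": -1.5},
    {"B": -2.0, "M": -2.0, "E": -0.3, "S": -1.0},
    {"B": -0.4, "M": -2.0, "E": -2.0, "S": -0.9},
    {"B": -1.0, "M": -3.0, "E": -0.8, "S": -0.7},
]


def viterbi(emissions, transitions, tags):
    """Return the highest-scoring label path for one sentence."""
    best = [dict(emissions[0])]
    back = []
    for scores_t in emissions[1:]:
        current, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), NEG_INF))
            current[cur] = best[-1][prev] + transitions.get((prev, cur), NEG_INF) + scores_t[cur]
            pointers[cur] = prev
        best.append(current)
        back.append(pointers)
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def tags_to_words(chars, path):
    """Cut a character sequence into words at every E or S label."""
    words, current = [], ""
    for ch, tag in zip(chars, path):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words
```

In the toy scores the last character slightly prefers the label S on its own, but S cannot follow B, so the decoder returns B, E, B, E and the text is cut into "滑油" and "压力"; this is the kind of label dependency the CRF layer is described as capturing.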
S2, training the label sequence of step S1 with a Word2vec model to generate word embedding vectors, analyzing the weight of each word vector in the text with the TF-IDF algorithm improved on the basis of class frequency variance, and constructing, from the word vectors and weights, a text vector that represents the feature data;
Referring to fig. 2, the bidirectional long short-term memory conditional random field model can automatically learn text features and model the contextual dependencies of text; in addition, the conditional random field takes into account the labels before and after each character of a sentence when inferring over the text, so the model segments words well and generalizes well to cross-domain data.
S201, training the label sequence obtained by word segmentation of the text with a Word2vec model, converting the segmented text into low-dimensional numerical vectors
[Equation image BDA0002947689290000081 not reproduced]
which is the word vector of the word $w_i$, where $k$ is the dimension of the word vector;
S202, calculating the weight of each word vector in the text with the improved TF-IDF algorithm, and extracting feature words by considering both the frequency of the feature words in the whole corpus and their distribution across categories.
The feature word $\mathrm{vec}(d_i)$ is expressed as:
[Equation image BDA0002947689290000082 not reproduced]
where $V_t$ is the word vector of the word $w_i$, tf is the frequency of occurrence of the word $w_i$ in document $d$, idf is the inverse document frequency of the word $w_i$ in document $d$, and $\tau_{t,i}$ is the class frequency variance of the word $w_i$ in document $d_j$.
The improved TF-IDF algorithm is:
$\text{tf-idf-}\tau_{i,j} = \text{tf-idf}_{i,j} \times \tau_i$
where the class frequency variance $\tau_i$ is introduced to measure the distribution of a term across categories:
[Equation image BDA0002947689290000091 not reproduced]
where $df(d, w_i)$ is the number of documents in the text corpus $d$ that contain the word $w_i$,
[Equation image BDA0002947689290000092 not reproduced]
is the number of documents in class $c_j$ that contain the word $w_i$, $n$ is the number of text categories, and $\tau_i$ is the class frequency variance of the word $w_i$.
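One plausible way to combine word vectors and weights into a text vector is a weighted sum of the word vectors of the words in the document. The equation for the text vector is not reproduced in this text, so the weighted-sum form, the function name and the toy values below are assumptions for illustration only.

```python
def text_vector(tokens, word_vectors, weights):
    """Weighted sum of word vectors over the tokens of one document.

    Tokens without a word vector are skipped; tokens without a weight
    contribute nothing.
    """
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for token in tokens:
        if token not in word_vectors:
            continue
        weight = weights.get(token, 0.0)
        for idx, component in enumerate(word_vectors[token]):
            vec[idx] += weight * component
    return vec


# Hypothetical 2-dimensional word vectors and per-word weights.
WORD_VECTORS = {"oil": [1.0, 0.0], "pressure": [0.0, 1.0]}
WEIGHTS = {"oil": 0.5, "pressure": 0.25}
```

The resulting vector keeps the dimension of the word vectors regardless of vocabulary size, which is consistent with the dimensionality control the description attributes to combining Word2vec with the improved TF-IDF weights.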
S3, classifying the text vectors obtained in step S2 with a KNN algorithm, completing the feature data corresponding to the classified text vectors using the knowledge representation rules of the expert system knowledge base, and storing the feature data in a feedback information database;
First, all word vectors in the sample are obtained by training a Word2vec model; then the weight of each word vector in the text is analyzed with the TF-IDF algorithm improved on the basis of class frequency variance, and a text vector representation is constructed from the word vectors and weights. Compared with traditional machine-learning text classification, a text classification model that combines the two achieves a better classification effect.
Referring to fig. 3, the TF-IDF value of each word is calculated and used to extract feature words; Word2vec word vectors are adopted, and, starting from the bag-of-words model, dimensionality is reduced through feature selection and feature weighting.
S301, when a new text arrives, determining its vector from the feature words; selecting the k texts in the training set most similar to the new text, with similarity measured by the cosine of the angle between vectors; computing in turn the weight of each class among the k neighbors of the new text, where the weight of a class equals the sum of the similarities between the new text and the training samples of that class among the k neighbors; and comparing the class weights and assigning the text to the class with the largest weight;
S302, after classification yields the feature data, representing it with the expert system's production-rule and frame knowledge representation: the feature data are expressed as data with a condition-action structure, the frame serves as the main body of the representation and the data are embedded in it, a frame finds its corresponding rules through the rule class, and a rule class finds its corresponding frame through the name of the frame it belongs to; finally, the processed data are stored in the feedback information database.
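The two-way lookup between frames and rules in step S302 can be sketched with two small record types: a frame keeps the names of its rules, and each rule keeps the name of the frame it belongs to. All class, field and rule names are hypothetical, and the sample content borrows the lubricating oil example from this description.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    frame_name: str   # name of the frame this rule belongs to
    condition: str    # the "IF" part of the condition-action structure
    actions: list     # the "THEN" part


@dataclass
class Frame:
    name: str
    slots: dict = field(default_factory=dict)
    rule_names: list = field(default_factory=list)


class KnowledgeStore:
    """Holds frames and rules and resolves the links between them."""

    def __init__(self):
        self.frames = {}
        self.rules = {}

    def add_frame(self, frame):
        self.frames[frame.name] = frame

    def add_rule(self, rule):
        self.rules[rule.name] = rule
        self.frames[rule.frame_name].rule_names.append(rule.name)

    def rules_for(self, frame_name):
        """Frame -> rules lookup."""
        return [self.rules[name] for name in self.frames[frame_name].rule_names]

    def frame_of(self, rule_name):
        """Rule -> frame lookup via the stored frame name."""
        return self.frames[self.rules[rule_name].frame_name]


store = KnowledgeStore()
store.add_frame(Frame("lubrication system", slots={"equipment": "ship"}))
store.add_rule(Rule(
    name="low_oil_pressure",
    frame_name="lubrication system",
    condition="lubricating oil pressure is too low",
    actions=["check oil pipe for breaks or air", "check oil filter for clogging",
             "check oil viscosity"],
))
```

Storing only names on each side keeps frames and rules independently serialisable, for example as rows in the feedback information database.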
S4, periodically processing the feedback information database of step S3 with a clustering algorithm to construct the expert system knowledge base. For feedback data not yet stored in the knowledge base, data in clusters larger than threshold A are stored in the knowledge base directly, and data in clusters smaller than threshold A but larger than threshold B are stored only after manual confirmation; the records in the feedback information database corresponding to newly added knowledge are marked to avoid repeated processing.
S401, setting a period through a trigger and periodically processing the feedback information database with a clustering algorithm: k objects are randomly selected from the n data objects as initial cluster centers; each remaining object is assigned to a cluster according to its similarity to the cluster centers; the center of each resulting cluster is then recomputed; and this process is repeated until the standard measure function converges;
S402, setting thresholds A and B on cluster size to decide which data are stored in the knowledge base: for feedback data not yet stored in the knowledge base, data in clusters larger than threshold A are stored in the knowledge base directly, and data in clusters smaller than threshold A but larger than threshold B are stored only after manual confirmation; the records in the feedback information database corresponding to newly added knowledge are marked to avoid repeated processing.
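The clustering loop of step S401 matches the usual k-means procedure and can be sketched as follows. The sketch assumes squared Euclidean distance as the (dis)similarity measure and stops when cluster assignments no longer change; the source speaks only of "similarity" and of a measure function converging, so both choices, along with the function names, are assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` into k groups; returns (assignment, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # k random objects as initial centers
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:     # no change: treated as converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Hypothetical feedback vectors forming two well-separated groups.
POINTS = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
```

The number of points assigned to each cluster gives the cluster size that is then compared against thresholds A and B in step S402.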
In another embodiment of the present invention, an expert system knowledge base construction system is provided, which can be used to implement the expert system knowledge base construction method described above, and specifically, the expert system knowledge base construction system includes a preprocessing module, a word segmentation module, an analysis construction module, and a classification module.
The preprocessing module collects, through a web front end, design and manufacturing problems arising during the operation and maintenance of a manufacturing enterprise together with user feedback during use, and performs Chinese word segmentation training on the collected text using a deep-learning-based bidirectional long short-term memory conditional random field (BI-LSTM-CRF) model to obtain a word segmentation tag sequence;
the analysis module generates word embedding vectors by training a Word2vec model, then analyzes the weight of each word vector in the text using a TF-IDF algorithm improved with class frequency variance, and constructs a text vector representation from the word vectors and weights;
the classification module classifies the text vectors using a KNN algorithm; after the feature data are obtained through classification, the feature data are completed according to the knowledge representation rules of the expert system knowledge base and stored in the feedback information database;
the construction module periodically processes the feedback information database with a machine learning clustering algorithm; for feedback data not yet stored in the knowledge base, data in clusters whose size exceeds threshold A are stored in the knowledge base directly, while data in clusters whose size is smaller than threshold A and larger than threshold B are stored only after manual confirmation; the records corresponding to newly added knowledge are marked in the feedback information database to avoid repeated processing.
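The KNN step of the classification module can be sketched as similarity-weighted voting: take the k training texts with the highest cosine similarity to the new text, sum the similarities per class, and pick the class with the largest sum. The sketch below assumes text vectors are plain lists of floats; the function names are hypothetical, and any category other than "lubricating system" used with it would be illustrative only.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_classify(new_vec, training, k):
    """Classify new_vec given training = [(vector, label), ...].

    The weight of a class is the sum of the similarities between new_vec
    and those of its k nearest neighbours that belong to the class.
    """
    neighbours = sorted(training,
                        key=lambda item: cosine(new_vec, item[0]),
                        reverse=True)[:k]
    weights = defaultdict(float)
    for vec, label in neighbours:
        weights[label] += cosine(new_vec, vec)
    return max(weights, key=weights.get)
```

Summing similarities instead of counting votes lets a few very close neighbours outweigh a larger number of distant ones.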
In one example, a user enters operation and maintenance information for a ship through the web client interface. After the text is collected, preprocessing and feature extraction yield the feature data 'lubricating oil pressure is too low', which is classified into the 'lubricating system' category. Data such as low lubricating oil pressure, a lubricating oil pipe that is broken or contains air, a dirty and clogged lubricating oil filter, and lubricating oil viscosity that is too low are then completed using the production-rule frame knowledge representation:
Figure BDA0002947689290000111
Figure BDA0002947689290000121
After the feature data with the production-rule frame structure are stored one by one in the feedback information database, data are periodically extracted from the database, at the cycle time set by the trigger, for cluster analysis and marking. For data not yet stored in the knowledge base, data in clusters whose size exceeds threshold A are stored in the knowledge base, data in clusters whose size is smaller than threshold A and larger than threshold B are stored after manual confirmation, and data that are already marked are not processed again.
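One way to picture the production-rule frame structure in this example is a frame that owns condition-action rules, where each rule records the name of the frame it belongs to so that the frame can be looked up from the rule. The sketch below is hypothetical: the class and field names are invented for illustration, and how the listed items map onto conditions and actions is an assumption, since the actual layout is defined by the patent's figures.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    frame_name: str  # name of the owning frame, used to find the frame again
    condition: str   # IF part of the production rule
    action: str      # THEN part of the production rule


@dataclass
class Frame:
    name: str
    category: str
    rules: list = field(default_factory=list)
    in_knowledge_base: bool = False  # mark that prevents repeated processing

    def add_rule(self, condition, action):
        rule = Rule(self.name, condition, action)
        self.rules.append(rule)
        return rule


def build_example():
    """Build the lubricating-oil example as one frame with three rules."""
    frame = Frame("lubricating oil pressure is too low", "lubricating system")
    # Assumed mapping: condition = observed symptom, action = item to check.
    for item in ("lubricating oil pipe is broken or contains air",
                 "lubricating oil filter is dirty and clogged",
                 "lubricating oil viscosity is too low"):
        frame.add_rule(condition=frame.name, action="check: " + item)
    return frame


frames_by_name = {}
example = build_example()
frames_by_name[example.name] = example
# A rule finds its frame through the owning frame name.
owner = frames_by_name[example.rules[0].frame_name]
```

The `in_knowledge_base` flag stands in for the marking of records in the feedback information database.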
In summary, the expert system knowledge base construction method and system provided by the invention target the operation and maintenance big data of manufacturing enterprises. They apply text feature extraction and machine-learning-based classification to acquire and classify knowledge automatically, construct a big data knowledge base for manufacturing enterprise research, development, and design, and provide digital value-added services for operation and maintenance optimization.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above content merely illustrates the technical idea of the present invention and does not limit its scope of protection; any modification made on the basis of this technical idea falls within the scope of protection of the claims of the present invention.

Claims (8)

1. A method for constructing an expert system knowledge base, characterized by comprising the following steps:
S1, collecting, through a web front end, design problems and manufacturing problems arising during the operation and maintenance of a manufacturing enterprise and user feedback during use, and performing Chinese word segmentation training on the collected text using a deep-learning-based bidirectional long short-term memory conditional random field model to obtain a word segmentation tag sequence;
S2, training a Word2vec model on the word segmentation tag sequence from step S1 to generate word embedding vectors, analyzing the weight of each word vector in the text using a TF-IDF algorithm improved with class frequency variance, and constructing text vector feature data from the word vectors and weights;
S3, classifying the text vectors obtained in step S2 using a KNN algorithm, completing the feature data corresponding to the classified text vectors according to the expert system knowledge base, and storing them in a feedback information database;
S4, periodically processing the feedback information database from step S3 with a clustering algorithm to complete construction of the expert system knowledge base.
2. The method according to claim 1, wherein in step S1, the non-text parts of the data collected by the web front end are removed using Python regular expressions or BeautifulSoup before training; suppose the character c_t is input at time t and the window size is k; the character sequence corresponds to k sub-vectors of dimension d in the word embedding layer, and the sub-vectors are concatenated into a long vector x_t ∈ R^{H_1}, where H_1 = k × d; the concatenated embedding vector x_t is used as the input of the BI-LSTM and CRF network, whose transformation yields the output h_t; a softmax transformation then gives a vector y_t ∈ R^D with the same dimension as the tag set, representing the word-position tag probabilities of c_t, where D is the number of word-position tags; finally, the tag sequence with the maximum probability is output.
3. The method according to claim 1, wherein step S2 specifically comprises:
S201, training a Word2vec model on the tag sequence obtained by word segmentation of the text, and converting the segmented text into low-dimensional numerical vectors
Figure FDA0002947689280000011
Figure FDA0002947689280000012
is the word vector of the word w_i, and k is the dimension of the word vector;
S202, calculating the weight of each word vector in the text using the improved TF-IDF algorithm, and extracting feature words by considering both the frequency of a feature word across the whole corpus and its distribution across different categories.
4. The method of claim 3, wherein in step S202, the feature word vec(d_i) is expressed as:
Figure FDA0002947689280000021
wherein V_t is the word vector of the word w_i, tf is the frequency of occurrence of the word w_i in document d, idf is the inverse document frequency of the word w_i in document d, and τ_{t,i} is the class frequency variance of the word w_i in document d_j.
5. The method of claim 1, wherein the improved TF-IDF algorithm is specifically:
tf-idf-τ_{i,j} = tf-idf_{i,j} × τ_i
wherein the class frequency variance τ_i is introduced to measure the distribution of terms across different categories, as follows:
Figure FDA0002947689280000022
wherein df(d, w_i) is the number of documents in the text corpus d that contain the word w_i,
Figure FDA0002947689280000023
is the number of documents in class c_j that contain the word w_i, N is the number of text categories, and τ_i is the class frequency variance of the word w_i.
6. The method according to claim 1, wherein step S3 specifically comprises:
S301, when a new text arrives, determining the vector of the new text according to the feature words; selecting from the training text set the k texts most similar to the new text, with similarity measured by the cosine of the angle between vectors; computing in turn the weight of each class among the k neighbors of the new text, where the weight of a class equals the sum of the similarities between the new text and the training samples among the k neighbors that belong to that class; and comparing the class weights and assigning the text to the class with the largest weight;
S302, after the feature data are obtained through classification, using the production-rule frame knowledge representation of the expert system to express the feature data as data with a condition-action structure, with the frame representation as the main body and the condition-action data embedded in the frame, so that the frame finds its rules through the rule class and the rule class finds its frame through the owning frame name; and finally storing the processed data in the feedback information database.
7. The method according to claim 1, wherein step S4 specifically comprises:
S401, setting a period through a trigger and periodically processing the feedback information database with a clustering algorithm: randomly selecting k objects from the n data objects as initial cluster centers; assigning each remaining object to the cluster whose center it is most similar to; recomputing the center of each resulting cluster; and repeating this process until the standard measure function converges;
S402, setting thresholds A and B on cluster size to decide which data are stored in the knowledge base; for feedback data not yet stored in the knowledge base, storing data in clusters whose size exceeds threshold A in the knowledge base directly, and storing data in clusters whose size is smaller than threshold A and larger than threshold B only after manual confirmation; and marking the records corresponding to newly added knowledge in the feedback information database.
8. An expert system knowledge base construction system, comprising:
a preprocessing module, which collects, through a web front end, design and manufacturing problems arising during the operation and maintenance of a manufacturing enterprise together with user feedback during use, and performs Chinese word segmentation training on the collected text using a deep-learning-based bidirectional long short-term memory conditional random field model to obtain a word segmentation tag sequence;
an analysis module, which trains a Word2vec model on the word segmentation tag sequence to generate word embedding vectors, then analyzes the weight of each word vector in the text using a TF-IDF algorithm improved with class frequency variance, and constructs a text vector representation from the word vectors and weights;
a classification module, which classifies using a KNN algorithm and, after the feature data are obtained through classification, completes the feature data according to the knowledge representation rules of the expert system knowledge base and stores them in a feedback information database;
and a construction module, which periodically processes the feedback information database with a machine learning clustering algorithm to complete construction of the expert system knowledge base.
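The class-frequency-variance weighting of claims 4 and 5 can be illustrated with a short sketch. The exact formulas are given in figures that are not reproduced here, so the sketch rests on stated assumptions: τ_i is taken to be the variance of a word's per-class document frequencies around their mean df(d, w_i)/N, every document belongs to exactly one class, idf is the unsmoothed log(total documents / df), and the text vector is the weighted sum of word vectors. All function names are hypothetical.

```python
import math


def class_frequency_variance(df_per_class):
    """Assumed tau_i: variance of per-class document frequencies."""
    n = len(df_per_class)            # N, the number of text categories
    mean = sum(df_per_class) / n     # df(d, w_i) / N under the one-class assumption
    return sum((x - mean) ** 2 for x in df_per_class) / n


def improved_weight(tf, total_docs, df_per_class):
    """tf-idf multiplied by the class frequency variance."""
    df = sum(df_per_class)           # documents in the corpus containing the word
    idf = math.log(total_docs / df)  # unsmoothed idf; the variant is an assumption
    return tf * idf * class_frequency_variance(df_per_class)


def text_vector(word_vectors, weights):
    """Assumed vec(d_i): sum of word vectors scaled by their weights."""
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for word, weight in weights.items():
        for i, x in enumerate(word_vectors[word]):
            vec[i] += weight * x
    return vec
```

Under these assumptions, a word spread evenly over all categories gets weight zero, while a word concentrated in one category is weighted up, which matches the stated aim of accounting for how feature words are distributed across categories.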
CN202110197687.2A 2021-02-22 2021-02-22 Expert system knowledge base construction method and system Pending CN112836509A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110197687.2A CN112836509A (en) 2021-02-22 2021-02-22 Expert system knowledge base construction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110197687.2A CN112836509A (en) 2021-02-22 2021-02-22 Expert system knowledge base construction method and system

Publications (1)

Publication Number Publication Date
CN112836509A true CN112836509A (en) 2021-05-25

Family

ID=75934199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110197687.2A Pending CN112836509A (en) 2021-02-22 2021-02-22 Expert system knowledge base construction method and system

Country Status (1)

Country Link
CN (1) CN112836509A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415789A (en) * 2018-01-24 2018-08-17 西安交通大学 Node failure forecasting system and method towards extensive mixing heterogeneous storage system
CN108460119A (en) * 2018-02-13 2018-08-28 南京途牛科技有限公司 A kind of system for supporting efficiency using machine learning lift technique
CN109359302A (en) * 2018-10-26 2019-02-19 重庆大学 A kind of optimization method of field term vector and fusion sort method based on it
WO2019214133A1 (en) * 2018-05-08 2019-11-14 华南理工大学 Method for automatically categorizing large-scale customer complaint data
US20190370394A1 (en) * 2018-05-31 2019-12-05 Fmr Llc Automated computer text classification and routing using artificial intelligence transfer learning
CN111767741A (en) * 2020-06-30 2020-10-13 福建农林大学 Text emotion analysis method based on deep learning and TFIDF algorithm
CN111767394A (en) * 2020-06-24 2020-10-13 中国工商银行股份有限公司 Abstract extraction method and device based on artificial intelligence expert system
CN112101028A (en) * 2020-08-17 2020-12-18 淮阴工学院 Multi-feature bidirectional gating field expert entity extraction method and system
CN112232079A (en) * 2020-10-15 2021-01-15 燕山大学 Microblog comment data classification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
乔宇峰: "基于专家知识库的继电保护定值智能辅助审核系统应用", 《内蒙古电力技术》, vol. 37, no. 1, 31 January 2019 (2019-01-31), pages 2 - 5 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022257201A1 (en) * 2021-06-09 2022-12-15 山东交通学院 Urban traffic safety early-warning method and system based on human-machine hybrid-augmented intelligence
CN114969348A (en) * 2022-07-27 2022-08-30 杭州电子科技大学 Electronic file classification method and system based on inversion regulation knowledge base
CN114969348B (en) * 2022-07-27 2023-10-27 杭州电子科技大学 Electronic file hierarchical classification method and system based on inversion adjustment knowledge base
TWI813448B (en) * 2022-09-20 2023-08-21 世界先進積體電路股份有限公司 Expert system and expert method
CN116069760A (en) * 2023-01-09 2023-05-05 青岛中投创新技术转移有限公司 Patent management data processing system, device and method
CN116069760B (en) * 2023-01-09 2023-12-15 青岛华慧泽知识产权代理有限公司 Patent management data processing system, device and method
CN116721730A (en) * 2023-06-15 2023-09-08 医途(杭州)科技有限公司 Whole-course patient management system based on digital therapy
CN116721730B (en) * 2023-06-15 2024-03-08 医途(杭州)科技有限公司 Whole-course patient management system based on digital therapy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination