CN112836509A - Expert system knowledge base construction method and system - Google Patents
Expert system knowledge base construction method and system
- Publication number
- CN112836509A (application CN202110197687.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- data
- text
- vector
- knowledge base
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a method and a system for constructing an expert system knowledge base. A web front end collects design and manufacturing problems arising during the operation and maintenance of a manufacturing enterprise, together with user feedback during use, and Chinese word segmentation training is performed on the collected text to obtain word segmentation tag sequences. The tag sequences are trained with a Word2vec model to generate word embedding vectors, and a text vector built from the word vectors and their weights represents the feature data. The text vectors are classified with a KNN algorithm; the feature data corresponding to the classified text vectors is completed according to the expert system knowledge base and stored in a feedback information database. The feedback information database is processed periodically with a clustering algorithm to build and complete the expert system knowledge base, thereby optimizing the whole manufacturing process.
Description
Technical Field
The invention lies at the intersection of data mining, machine learning, and natural language processing, and particularly relates to a method and a system for constructing an expert system knowledge base.
Background
The operation and maintenance of manufacturing enterprises involves data of complicated types and sources, and the design, production, and manufacturing problems discovered there are not fed back to design, production, and manufacturing in an effective closed loop. Building on big data modeling technology and system architecture, structured and unstructured data covering the enterprise's design and research and development, such as drawings, models, and documents, can be provided, and text feature extraction and text mining can be applied to acquire knowledge automatically. Knowledge in business fields such as design, manufacturing, and management can then be acquired and applied effectively, and data across the whole life cycle of the enterprise can be standardized to form an expert system knowledge base for manufacturing research and development design, enabling digital value-added services based on operation and maintenance data and on-demand optimization of the design and manufacturing process.
At present, owing to the limitations of natural language processing and related extraction technology, the semantic components of a sentence and the relations among them cannot be identified well, so accurate classification is not possible.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a method and a system for constructing an expert system knowledge base that automatically acquire knowledge by applying text feature extraction, classify the selected features with a machine-learning-based classification method, and form a systematic knowledge base for the operation and maintenance of manufacturing enterprises. The knowledge base provides data support and a scientific basis for subsequent design optimization and construction management decisions of the same type, and enables optimization of the whole manufacturing process.
The invention adopts the following technical scheme:
An expert system knowledge base construction method comprises the following steps:
S1, collecting, through a web front end, design problems and manufacturing problems arising during the operation and maintenance of a manufacturing enterprise and user feedback during use, and performing Chinese word segmentation training on the collected text with a deep-learning-based bidirectional long short-term memory conditional random field (BI-LSTM-CRF) model to obtain word segmentation tag sequences;
S2, training the word segmentation tag sequences of step S1 with a Word2vec model to generate word embedding vectors, analyzing the weight of each word vector in the text with an improved TF-IDF algorithm based on class frequency variance, and constructing a text vector from the word vectors and weights to represent the feature data;
S3, classifying the text vectors obtained in step S2 with a KNN algorithm, completing the feature data corresponding to the classified text vectors according to the expert system knowledge base, and storing it in a feedback information database;
S4, periodically processing the feedback information database of step S3 with a clustering algorithm to build and complete the expert system knowledge base.
Specifically, in step S1, the non-text parts of the data collected by the web front end are removed with Python regular expressions or BeautifulSoup, and the data is then used for training. Suppose the character $c_t$ is input at time $t$ and the window size is $k$. The character sequence corresponds to $k$ sub-vectors of dimension $d$ in the word embedding layer, and the sub-vectors are concatenated into a long vector $x_t \in \mathbb{R}^{H_1}$ with $H_1 = k \times d$. The concatenated vector $x_t$ is the input of the BI-LSTM and CRF network, whose transformation yields the output $h_t$. A softmax transformation then gives a vector $y_t \in \mathbb{R}^{D}$ with the same dimension as the label set, representing the label distribution of $c_t$, where $D$ is the number of lexeme labels. Finally, the tag sequence with the maximum probability is output.
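The cleaning step mentioned above can be sketched with the Python standard library alone. This is a minimal illustration, not the patent's implementation: the function name `strip_non_text` and the choice to drop `script` and `style` blocks are assumptions, and a real deployment might prefer an HTML parser such as BeautifulSoup for malformed markup.

```python
import html
import re

# Assumed patterns: script/style blocks first, then any remaining tag.
SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def strip_non_text(raw: str) -> str:
    """Return the visible text of a raw web submission (hypothetical helper)."""
    without_scripts = SCRIPT_RE.sub(" ", raw)    # drop embedded code and styles
    without_tags = TAG_RE.sub(" ", without_scripts)  # drop the remaining markup
    unescaped = html.unescape(without_tags)      # turn &amp; etc. into characters
    return SPACE_RE.sub(" ", unescaped).strip()  # collapse runs of whitespace


print(strip_non_text("<div>Bearing <b>overheats</b> &amp; leaks</div>"))
```

The cleaned string is what would then be passed to the word segmentation model.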
Specifically, step S2 includes:
S201, training the tag sequence obtained from word segmentation of the text with a Word2vec model, converting the segmented text into low-dimensional numerical vectors, where each word $w_i$ has a word vector of dimension $k$;
S202, calculating the weight of each word vector in the text with the improved TF-IDF algorithm, which extracts feature words by considering both their frequency in the whole corpus and their distribution across categories.
Further, in step S202, the feature word vector $vec(d_i)$ is expressed in terms of the following quantities:
$V_t$ is the word vector of the word $w_i$, $tf$ is the frequency of $w_i$ in document $d$, $idf$ is the inverse document frequency of $w_i$, and $\tau_{t,i}$ is the class frequency variance of $w_i$ in document $d_j$.
Specifically, the improved TF-IDF algorithm is:
$\text{tf-idf-}\tau_{i,j} = \text{tf-idf}_{i,j} \times \tau_i$
wherein the class frequency variance $\tau_i$ is introduced to measure the distribution of a term across categories. It is defined in terms of the following quantities:
$df(d, w_i)$ is the number of documents in the text corpus $d$ that contain the word $w_i$; the per-class document frequency is the number of documents of class $c_j$ that contain $w_i$; $n$ is the number of text categories; and $\tau_i$ is the class frequency variance of the word $w_i$.
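As a rough illustration of the weighting, the sketch below multiplies an ordinary TF-IDF value by a class frequency variance. The exact variance formula is not legible in this text, so the form used here is an assumption: the variance of the per-class document counts around their mean $df(d, w_i)/n$. The function names and the idf variant $\log(N/df)$ are likewise illustrative choices.

```python
import math


def class_frequency_variance(word, docs_by_class):
    """Assumed form: variance of the per-class document frequency of `word`."""
    per_class = [sum(1 for doc in docs if word in doc)
                 for docs in docs_by_class.values()]
    n = len(per_class)                 # number of text categories
    mean = sum(per_class) / n          # equals df(d, w_i) / n
    return sum((df - mean) ** 2 for df in per_class) / n


def tf_idf(word, doc, all_docs):
    """One common TF-IDF variant; `word` is assumed to occur in `doc`."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in all_docs if word in d)
    return tf * math.log(len(all_docs) / df)


def improved_weight(word, doc, docs_by_class):
    """tf-idf multiplied by the class frequency variance of the word."""
    all_docs = [d for docs in docs_by_class.values() for d in docs]
    return tf_idf(word, doc, all_docs) * class_frequency_variance(word, docs_by_class)


corpus = {
    "design": [["gear", "crack"], ["gear", "noise"]],
    "usage": [["oil", "leak"], ["noise", "oil"]],
}
print(improved_weight("gear", corpus["design"][0], corpus))
print(improved_weight("noise", corpus["design"][1], corpus))
```

Under this assumed definition, a word spread evenly over the classes ("noise" here) gets variance zero and therefore zero weight, while a word concentrated in one class ("gear") keeps its TF-IDF value scaled by a positive variance.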
Specifically, step S3 includes:
S301, when a new text arrives, determining its vector from the feature words; selecting the k texts in the training set most similar to the new text, with similarity measured by the cosine of the angle between vectors; computing the weight of each class among the k neighbors, where a class's weight equals the sum of the similarities between the new text and the neighbors belonging to that class; and comparing the class weights and assigning the text to the class with the largest weight;
S302, after the feature data is obtained by classification, representing it with the production-rule and frame knowledge representation of the expert system: the feature data is expressed as data with a condition-action structure, the frame serves as the main body with the data embedded in it, a frame finds its rules through the rule class, and a rule finds its frame through the name of the frame it belongs to; the processed data is finally stored in the feedback information database.
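The similarity-weighted vote of step S301 can be sketched as follows. This is a minimal pure-Python illustration; the function names, the default k, and the toy two-dimensional vectors are assumptions made for the example.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Classify `query` given `training`, a list of (vector, label) pairs."""
    scored = sorted(((cosine(query, vec), label) for vec, label in training),
                    key=lambda pair: pair[0], reverse=True)[:k]
    class_weight = defaultdict(float)
    for similarity, label in scored:
        class_weight[label] += similarity   # class weight = sum of similarities
    return max(class_weight, key=class_weight.get)


training = [
    ([1.0, 0.0], "design"),
    ([0.9, 0.1], "design"),
    ([0.0, 1.0], "manufacturing"),
    ([0.1, 0.9], "manufacturing"),
]
print(knn_classify([1.0, 0.2], training, k=3))
```

Summing similarities rather than counting neighbors means that one very close neighbor can outweigh several distant ones, which matches the class-weight description in S301.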
Specifically, step S4 includes:
S401, setting a period with a trigger and periodically processing the feedback information database with a clustering algorithm: k objects are randomly selected from the n data objects as initial cluster centers; each remaining object is assigned to the cluster whose center it is most similar to; the center of each resulting cluster is recomputed; and the process is repeated until the criterion function converges;
S402, setting thresholds A and B on cluster size (A greater than B) to decide which data is stored in the knowledge base: for feedback data not yet in the knowledge base, data in clusters larger than threshold A is stored directly, while data in clusters smaller than threshold A but larger than threshold B is stored only after manual confirmation; the records in the feedback information database that correspond to newly added knowledge are marked.
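The iteration described in S401 has the shape of the k-means procedure, which can be sketched as below. Several details are assumptions: squared Euclidean distance stands in for the unspecified similarity measure, convergence is taken to mean that the centers stop moving, and the function names and seed are illustrative.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assign each point to nearest center
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:           # centers stopped moving
            break
        centers = new_centers
    return centers, clusters


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = k_means(data, k=2)
print(sorted(len(c) for c in clusters))
```

In practice the points would be the text vectors stored in the feedback information database, and the periodic trigger would call such a routine on the records not yet marked as processed.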
Another technical solution of the present invention is a system for constructing an expert system knowledge base, comprising:
a preprocessing module, which collects, through a web front end, design and manufacturing problems arising during the operation and maintenance of a manufacturing enterprise and user feedback during use, and performs Chinese word segmentation training on the collected text with a deep-learning-based bidirectional long short-term memory conditional random field (BI-LSTM-CRF) model to obtain word segmentation tag sequences;
an analysis module, which trains the word segmentation tag sequences with a Word2vec model to generate word embedding vectors, analyzes the weight of each word vector in the text with an improved TF-IDF algorithm based on class frequency variance, and constructs a text vector representation from the word vectors and weights;
a classification module, which classifies with a KNN algorithm and, after the feature data is obtained by classification, completes it according to the knowledge representation rules of the expert system knowledge base and stores it in a feedback information database;
and a construction module, which periodically processes the feedback information database with a machine learning clustering algorithm to build and complete the expert system knowledge base.
Compared with the prior art, the invention has at least the following beneficial effects:
The expert system knowledge base construction method of the invention builds on big data modeling technology and system architecture, automatically acquires and classifies knowledge by applying text feature extraction and a machine-learning-based classification method, constructs a big data knowledge base for the research and development design of manufacturing enterprises, and provides digital value-added services for operation and maintenance optimization. In the text preprocessing stage, Chinese word segmentation uses a model combining BI-LSTM and CRF, which markedly improves segmentation quality. In the feature extraction stage, a method combining Word2Vec with the improved TF-IDF is adopted; it considers both the frequency of feature words in the whole corpus and their distribution across categories, so the extracted feature data is more accurate. In the knowledge base construction stage, the production-rule and frame representation is used to express and reason over the data. A clustering algorithm periodically classifies, extracts, and marks the data, which avoids repeated processing and storage, so an expert knowledge base for the operation and maintenance of a manufacturing enterprise is established automatically and quickly.
Furthermore, a deep-learning-based bidirectional long short-term memory conditional random field model performs Chinese word segmentation training on the collected text to obtain word segmentation tag sequences. The model learns text features automatically and models context-dependent information in the text; the CRF layer considers the labels before and after each character of a sentence when reasoning over the text, so the model segments well and generalizes well to cross-domain data.
Further, the tag sequences of step S1 are trained with the Word2vec model to generate word embedding vectors, and the weight of each word vector in the text is then analyzed with the improved TF-IDF algorithm based on class frequency variance, so that a text vector built from the word vectors and weights represents the feature data.
Furthermore, Word2vec is used for word representation. The resulting word vectors are low-dimensional, dense, and real-valued and retain semantic information well, but they do not express the importance of a word, so the TF-IDF algorithm is introduced to calculate the weight of each word vector in the text. The plain TF-IDF algorithm considers only the frequency of feature words in the whole corpus and ignores their distribution across categories, so some words that contribute to category discrimination are lost. The improved TF-IDF algorithm based on class frequency variance is therefore adopted to form a text vector representation from word vectors and weights, which preserves the contextual relations of words well.
Further, a KNN algorithm performs prediction and multi-class classification; the feature data corresponding to the classified text vectors is completed according to the knowledge representation rules of the expert system knowledge base and stored in the feedback information database.
Further, the feedback information database is processed periodically with a clustering algorithm so that the expert system knowledge base is built quickly, and the corresponding records in the feedback information database are marked to avoid repeated processing.
In conclusion, the invention can construct the expert system knowledge base quickly and accurately.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the structure of the combined BI-LSTM and CRF model of the present invention;
FIG. 3 is a schematic diagram of the model combining Word2Vec and the improved TF-IDF according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.
The invention provides a method for constructing an expert system knowledge base that addresses the complicated data types and sources encountered in the operation and maintenance of manufacturing enterprises. In the word segmentation stage, a Chinese word segmentation model combining a BI-LSTM with a CRF layer is adopted; it retains the LSTM's ability to use context information while the CRF layer accounts for the dependencies between adjacent output labels, which markedly improves Chinese word segmentation. In the feature extraction stage, Word2Vec maps words into a vector space as word vectors, and the improved TF-IDF feature extraction method is combined with the word vectors, so that the semantic information of words is taken into account and the dimensionality of the word vectors is controlled. The invention thus makes better use of big data from manufacturing enterprises to mine the needs of enterprises and users in depth, and provides digital value-added services for operation and maintenance optimization.
Referring to FIG. 1, the method for constructing an expert system knowledge base of the present invention includes the following steps:
S1, collecting, through a web front end, design and manufacturing problems arising during the operation and maintenance of a manufacturing enterprise and user feedback during use, and performing Chinese word segmentation training on the collected text with a deep-learning-based bidirectional long short-term memory conditional random field (BI-LSTM-CRF) model to obtain word segmentation tag sequences;
The non-text parts of the data collected by the web front end are removed with Python regular expressions or BeautifulSoup, and the data is then used for training. Suppose the character $c_t$ is input at time $t$ and the window size is $k$. The character sequence corresponds to $k$ sub-vectors of dimension $d$ in the word embedding layer, and the sub-vectors are concatenated into a long vector $x_t \in \mathbb{R}^{H_1}$ with $H_1 = k \times d$. The concatenated vector $x_t$ is the input of the BI-LSTM and CRF network, whose transformation yields the output $h_t$. A softmax transformation then gives a vector $y_t \in \mathbb{R}^{D}$ with the same dimension as the label set, representing the label distribution of $c_t$, where $D$ is the number of lexeme labels. Finally, the tag sequence with the maximum probability is output.
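The shapes involved in this step can be made concrete with a small sketch. It only illustrates the data flow around the network, not the BI-LSTM and CRF layers themselves. The B/M/E/S label set (begin, middle, end, single), the centred zero-padded window, and all function names are assumptions, since the text does not name the labels or the padding scheme.

```python
import math

TAGS = ("B", "M", "E", "S")   # assumed lexeme label set, so D = 4


def window_vector(embeddings, t, k):
    """Concatenate the k embeddings centred on position t (zero-padded)."""
    d = len(embeddings[0])
    start = t - k // 2
    parts = []
    for pos in range(start, start + k):
        inside = 0 <= pos < len(embeddings)
        parts.extend(embeddings[pos] if inside else [0.0] * d)
    return parts              # length H1 = k * d


def softmax(scores):
    """Map D raw scores to a probability distribution over the labels."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tags_to_words(chars, tags):
    """Turn a per-character tag sequence into segmented words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E (end) or S (single)
            words.append(current)
            current = ""
    if current:                # flush a trailing, unterminated word
        words.append(current)
    return words


x_t = window_vector([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], t=0, k=3)
print(len(x_t))
print(tags_to_words("轴承温度过高", ["B", "E", "B", "E", "B", "E"]))
```

With $k = 3$ and $d = 2$ the concatenated input has $H_1 = 6$ components, and the tag sequence produced by the network is what `tags_to_words` would convert into the segmented text passed to step S2.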
S2, training the tag sequences of step S1 with a Word2vec model to generate word embedding vectors, analyzing the weight of each word vector in the text with an improved TF-IDF algorithm based on class frequency variance, and constructing a text vector from the word vectors and weights to represent the feature data;
Referring to FIG. 2, the bidirectional long short-term memory conditional random field model learns text features automatically and models context-dependent information in the text; the conditional random field considers the labels before and after each character of a sentence when reasoning over the text, so the model segments well and generalizes well to cross-domain data.
S201, training the tag sequence obtained from word segmentation of the text with a Word2vec model, converting the segmented text into low-dimensional numerical vectors, where each word $w_i$ has a word vector of dimension $k$;
S202, calculating the weight of each word vector in the text with the improved TF-IDF algorithm, which extracts feature words by considering both their frequency in the whole corpus and their distribution across categories.
The feature word vector $vec(d_i)$ is expressed in terms of the following quantities:
$V_t$ is the word vector of the word $w_i$, $tf$ is the frequency of $w_i$ in document $d$, $idf$ is the inverse document frequency of $w_i$, and $\tau_{t,i}$ is the class frequency variance of $w_i$ in document $d_j$.
The improved TF-IDF algorithm is:
$\text{tf-idf-}\tau_{i,j} = \text{tf-idf}_{i,j} \times \tau_i$
wherein the class frequency variance $\tau_i$ is introduced to measure the distribution of a term across categories. It is defined in terms of the following quantities:
$df(d, w_i)$ is the number of documents in the text corpus $d$ that contain the word $w_i$; the per-class document frequency is the number of documents of class $c_j$ that contain $w_i$; $n$ is the number of text categories; and $\tau_i$ is the class frequency variance of the word $w_i$.
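One plausible way to combine word vectors and weights into a text vector is a weighted sum, sketched below. The combination rule is an assumption, because the formula for $vec(d_i)$ is not legible in this text; the function name, the toy vectors, and the weight values are illustrative only.

```python
def text_vector(tokens, word_vectors, weights):
    """Weighted sum of the word vectors of `tokens` (assumed combination rule).

    word_vectors: dict mapping word -> list of floats (e.g. from Word2vec)
    weights:      dict mapping word -> weight (e.g. the improved TF-IDF value)
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        if token not in word_vectors:      # skip out-of-vocabulary words
            continue
        weight = weights.get(token, 0.0)
        for i, component in enumerate(word_vectors[token]):
            total[i] += weight * component
    return total


vectors = {"gear": [1.0, 0.0], "noise": [0.0, 1.0]}
weights = {"gear": 0.5, "noise": 2.0}
print(text_vector(["gear", "noise", "unknown"], vectors, weights))
```

The result keeps the dimensionality of the word vectors regardless of document length, which is consistent with the stated goal of controlling dimensionality while retaining semantic information.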
S3, classifying the text vectors obtained in step S2 with a KNN algorithm, completing the feature data corresponding to the classified text vectors according to the knowledge representation rules of the expert system knowledge base, and storing it in a feedback information database;
First, all word vectors in the sample are obtained by training a Word2vec model; then the weight of each word vector in the text is analyzed with the improved TF-IDF algorithm based on class frequency variance, and a text vector representation is built from the word vectors and weights. Compared with traditional machine learning text classification, the text classification model combining the two classifies better.
Referring to FIG. 3, the TF-IDF value of each word is calculated and used to extract feature words, which are represented by Word2vec word vectors; on top of the bag-of-words model, feature selection and feature weighting reduce the dimensionality.
S301, when a new text arrives, determining its vector from the feature words; selecting the k texts in the training set most similar to the new text, with similarity measured by the cosine of the angle between vectors; computing the weight of each class among the k neighbors, where a class's weight equals the sum of the similarities between the new text and the neighbors belonging to that class; and comparing the class weights and assigning the text to the class with the largest weight;
S302, after the feature data is obtained by classification, representing it with the production-rule and frame knowledge representation of the expert system: the feature data is expressed as data with a condition-action structure, the frame serves as the main body with the data embedded in it, a frame finds its rules through the rule class, and a rule finds its frame through the name of the frame it belongs to; the processed data is finally stored in the feedback information database.
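The two-way lookup between frames and rules described in S302 might be modelled as follows. Every class, field, and example value here is hypothetical: the text does not specify a schema, so this is only one way to show a frame locating its rules and a rule locating its frame by name.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Frame:
    """Main body of a knowledge item: a named set of slots (hypothetical)."""
    name: str
    slots: Dict[str, str] = field(default_factory=dict)


@dataclass
class Rule:
    """Condition-action rule attached to a frame by the frame's name."""
    name: str
    frame_name: str
    condition: Callable[[Frame], bool]
    action: str


class RuleBase:
    def __init__(self) -> None:
        self.frames: Dict[str, Frame] = {}
        self.rules: List[Rule] = []

    def add_frame(self, frame: Frame) -> None:
        self.frames[frame.name] = frame

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)

    def rules_for(self, frame: Frame) -> List[Rule]:
        """Frame -> rules, through the rule collection."""
        return [r for r in self.rules if r.frame_name == frame.name]

    def frame_for(self, rule: Rule) -> Frame:
        """Rule -> frame, through the name of the frame it belongs to."""
        return self.frames[rule.frame_name]

    def fire(self, frame: Frame) -> List[str]:
        """Return the actions of all rules whose condition holds."""
        return [r.action for r in self.rules_for(frame) if r.condition(frame)]


kb = RuleBase()
feedback = Frame("bearing_feedback", {"symptom": "overheating"})
kb.add_frame(feedback)
kb.add_rule(Rule("r1", "bearing_feedback",
                 lambda f: f.slots.get("symptom") == "overheating",
                 "check lubrication"))
print(kb.fire(feedback))
```

Storing the frame name on each rule, rather than a direct object reference, keeps the records easy to serialize into a database table, which fits the final step of writing the processed data to the feedback information database.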
S4: periodically process the feedback information database of step S3 with a clustering algorithm to construct the expert system knowledge base. For feedback data not yet stored in the knowledge base, data whose cluster size is larger than threshold A are stored in the knowledge base directly, while data whose cluster size is smaller than threshold A and larger than threshold B are stored only after manual confirmation; the records in the feedback information database that correspond to newly added knowledge are marked to avoid repeated processing.
S401: set a period through a trigger and process the feedback information database periodically with a clustering algorithm: randomly select k objects from the n data objects as initial cluster centers; assign each remaining object to the cluster whose center it is most similar to; recompute the center of each resulting cluster; and repeat this process until the criterion function converges;
S402: set thresholds A and B on cluster size and decide which data to store in the knowledge base; for feedback data not yet stored in the knowledge base, data whose cluster size is larger than threshold A are stored in the knowledge base directly, while data whose cluster size is smaller than threshold A and larger than threshold B are stored only after manual confirmation; the records in the feedback information database that correspond to newly added knowledge are marked to avoid repeated processing.
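Steps S401 and S402 might be sketched as follows. This is a hedged illustration: squared Euclidean distance is assumed as the similarity measure, convergence is assumed to mean that assignments stop changing, and the function names and decision labels ("store", "manual_review", "hold") are hypothetical. How a cluster whose size exactly equals threshold A is handled is not specified; here it falls into manual review.

```python
import random
from collections import Counter

def kmeans(points, k, iterations=100, seed=0):
    """Minimal k-means; returns the cluster index assigned to each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k randomly chosen objects as initial centers
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        if new_assignment == assignment:  # assignments are stable: converged
            break
        assignment = new_assignment
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

def route_by_cluster_size(assignment, threshold_a, threshold_b):
    """Decide, per record, what to do based on the size of its cluster."""
    sizes = Counter(assignment)
    decisions = []
    for cluster in assignment:
        size = sizes[cluster]
        if size > threshold_a:
            decisions.append("store")          # large cluster: store directly
        elif size > threshold_b:
            decisions.append("manual_review")  # medium cluster: confirm manually
        else:
            decisions.append("hold")           # small cluster: leave in feedback database
    return decisions

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0]]
assignment = kmeans(points, k=2)
decisions = route_by_cluster_size(assignment, threshold_a=2, threshold_b=1)
```

In a real deployment the records marked "store" would also be flagged in the feedback information database so that the next periodic run skips them.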
In another embodiment of the present invention, an expert system knowledge base construction system is provided, which can be used to implement the expert system knowledge base construction method described above. Specifically, the expert system knowledge base construction system includes a preprocessing module, an analysis module, a classification module, and a construction module.
The preprocessing module collects, through a web front end, design and manufacturing problems arising in the operation and maintenance process of a manufacturing enterprise as well as user feedback during use, and performs Chinese word segmentation training on the collected text using a deep-learning-based bidirectional long short-term memory conditional random field (Bi-LSTM-CRF) model to obtain a word segmentation label sequence;
the analysis module generates word embedding vectors by training a Word2vec model, then analyzes the weight of each word vector in the text with an improved TF-IDF algorithm based on class frequency variance, and constructs a text vector representation from the word vectors and their weights;
the classification module classifies with a KNN algorithm; after the feature data are obtained through classification, it completes the feature data using the knowledge representation rules of the expert system knowledge base and stores them in a feedback information database;
the construction module periodically processes the feedback information database with a machine-learning clustering algorithm; for feedback data not yet stored in the knowledge base, data whose cluster size is larger than threshold A are stored in the knowledge base directly, while data whose cluster size is smaller than threshold A and larger than threshold B are stored only after manual confirmation; the records in the feedback information database that correspond to newly added knowledge are marked to avoid repeated processing;
For example, a user enters operation and maintenance information for a ship through a web client interface. After the text is collected, preprocessing and feature extraction yield the feature data "lubricating oil pressure is too low", which is classified into the "lubricating system" category. The associated items include: the lubricating oil pressure is too low; the lubricating oil pipe is broken or contains air; the lubricating oil filter is dirty and blocked; the viscosity of the lubricating oil is too low; and so on. The production-frame knowledge representation rule is then used to complete the data.
After the feature data with the production-frame structure are stored one by one in the feedback information database, data are periodically extracted from the database, according to the cycle time set by a trigger, for cluster analysis and marking. For data not yet stored in the knowledge base, data whose cluster size is larger than threshold A are stored in the knowledge base directly, data whose cluster size is smaller than threshold A and larger than threshold B are stored after manual confirmation, and data that have already been marked are not processed again.
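To make the production-frame idea more concrete, the following Python sketch shows one possible data structure in which a frame holds condition-action rules and each rule records the name of the frame it belongs to. The class names, field names, and the choice to model the listed items as candidate causes are all assumptions for illustration; the actual frame layout used in the embodiment is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    frame_name: str    # name of the frame this rule belongs to (hypothetical field)
    condition: str     # IF part of the production rule
    conclusions: list  # THEN part, modeled here as candidate causes (assumption)

@dataclass
class Frame:
    name: str
    category: str
    rules: list = field(default_factory=list)

    def add_rule(self, condition, conclusions):
        """Embed a condition-action rule in this frame and return it."""
        rule = Rule(self.name, condition, list(conclusions))
        self.rules.append(rule)
        return rule

def find_frame(frames, rule):
    """Look up the frame a rule belongs to by its recorded frame name."""
    return next(f for f in frames if f.name == rule.frame_name)

# Values taken from the ship example in the text.
frame = Frame(name="lubricating oil pressure is too low", category="lubricating system")
rule = frame.add_rule(
    condition="lubricating oil pressure is too low",
    conclusions=[
        "lubricating oil pipe is broken or contains air",
        "lubricating oil filter is dirty and blocked",
        "viscosity of the lubricating oil is too low",
    ],
)
owner = find_frame([frame], rule)
```

The two-way lookup (frame to rules through `rules`, rule to frame through `frame_name`) mirrors the description that a frame finds its rules through a rule class and a rule finds its frame through the frame name.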
In summary, the expert system knowledge base construction method and system provided by the invention target the operation and maintenance big data of manufacturing enterprises: text feature extraction and machine-learning-based classification are applied to acquire and classify knowledge automatically, a big-data knowledge base for the research, development, and design of manufacturing enterprises is constructed, and digital value-added services are provided for operation and maintenance optimization.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.
Claims (8)
1. A construction method of an expert system knowledge base is characterized by comprising the following steps:
S1, collecting, through a web front end, design problems and manufacturing problems arising in the operation and maintenance process of a manufacturing enterprise and user feedback during use, and performing Chinese word segmentation training on the collected text using a deep-learning-based bidirectional long short-term memory conditional random field (Bi-LSTM-CRF) model to obtain a word segmentation label sequence;
S2, training a Word2vec model on the word segmentation label sequence of step S1 to generate word embedding vectors, analyzing the weight of each word vector in the text with an improved TF-IDF algorithm based on class frequency variance, and constructing text vector representation feature data from the word vectors and their weights;
S3, classifying the text vectors obtained in step S2 with a KNN algorithm, completing the feature data corresponding to the classified text vectors through the expert system knowledge base, and storing the feature data in a feedback information database;
S4, periodically processing the feedback information database of step S3 with a clustering algorithm to complete construction of the expert system knowledge base.
2. The method according to claim 1, wherein in step S1, the non-text parts of the data collected by the web front end are removed with Python regular expressions or BeautifulSoup before training; suppose that the character $c_t$ is input at time $t$ and the window size is $k$; the character sequence corresponds to $k$ sub-vectors of dimension $d$ in the word embedding layer, and the sub-vectors are concatenated into a long vector $x_t \in \mathbb{R}^{H_1}$, where $H_1 = k \times d$; the concatenated embedding vector $x_t$ is used as the input of the Bi-LSTM-CRF network, and the output $h_t$ is obtained after transformation by the network; then, after a softmax transformation, a vector $y_t \in \mathbb{R}^{D}$ with the same dimension as the label set is obtained, which represents the distribution over lexeme labels for $c_t$, where $D$ is the number of lexeme labels; finally, the label sequence with the maximum probability is output.
3. The method according to claim 1, wherein step S2 is specifically:
S201, training a Word2vec model on the label sequence obtained by word segmentation of the text, and converting the segmented text into low-dimensional numerical vectors, each being the word vector of a word $w_i$, where K is the dimension of the word vector;
S202, calculating the weight of each word vector in the text with the improved TF-IDF algorithm, and extracting the feature words by considering both the frequency with which they occur in the whole corpus and their distribution across different categories.
4. The method of claim 3, wherein in step S202, the feature vector $\mathrm{vec}(d_i)$ is expressed as:
where $V_t$ is the word vector of the word $w_i$, tf is the frequency with which the word $w_i$ occurs in document $d$, idf is the inverse document frequency of the word $w_i$ with respect to document $d$, and $\tau_{t,i}$ is the class frequency variance of the word $w_i$ in document $d_j$.
5. The method of claim 1, wherein the improved TF-IDF algorithm is specifically:
$$\text{tf-idf-}\tau_{i,j} = \text{tf-idf}_{i,j} \times \tau_i$$
where the class frequency variance $\tau_i$ is introduced to measure the distribution of terms across different categories, as follows:
6. The method according to claim 1, wherein step S3 is specifically:
S301, when a new text arrives, determining its vector from the feature words; selecting the k texts in the training text set that are most similar to the new text, measuring similarity by the cosine of the angle between vectors; calculating in turn the weight of each class among the k neighbors of the new text, where the weight of a class equals the sum of the similarities between the new text and the training samples of that class among the k neighbors; comparing the class weights and assigning the text to the class with the largest weight;
S302, after the feature data are obtained through classification, applying the production-frame knowledge representation rules of the expert system: the feature data are expressed as data with a condition-action structure; the frame serves as the main body of the representation and the data are embedded in it; a frame finds its corresponding rules through a rule class, and a rule class finds its corresponding frame through the name of the frame it belongs to; finally, the processed data are stored in the feedback information database.
7. The method according to claim 1, wherein step S4 is specifically:
S401, setting a period through a trigger and processing the feedback information database periodically with a clustering algorithm: randomly selecting k objects from the n data objects as initial cluster centers; assigning each remaining object to the cluster whose center it is most similar to; recomputing the center of each resulting cluster; and repeating this process until the criterion function converges;
S402, setting thresholds A and B on cluster size and deciding which data to store in the knowledge base; for feedback data not yet stored in the knowledge base, data whose cluster size is larger than threshold A are stored in the knowledge base directly, while data whose cluster size is smaller than threshold A and larger than threshold B are stored only after manual confirmation; the records in the feedback information database that correspond to newly added knowledge are marked.
8. An expert system knowledge base construction system, comprising:
a preprocessing module, which collects, through a web front end, design and manufacturing problems arising in the operation and maintenance process of a manufacturing enterprise as well as user feedback during use, and performs Chinese word segmentation training on the collected text using a deep-learning-based bidirectional long short-term memory conditional random field (Bi-LSTM-CRF) model to obtain a word segmentation label sequence;
an analysis module, which trains a Word2vec model on the word segmentation label sequence to generate word embedding vectors, then analyzes the weight of each word vector in the text with an improved TF-IDF algorithm based on class frequency variance, and constructs a text vector representation from the word vectors and their weights;
a classification module, which classifies with a KNN algorithm and, after the feature data are obtained through classification, completes the feature data using the knowledge representation rules of the expert system knowledge base and stores them in a feedback information database;
and a construction module, which periodically processes the feedback information database with a machine-learning clustering algorithm to complete construction of the expert system knowledge base.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110197687.2A CN112836509A (en) | 2021-02-22 | 2021-02-22 | Expert system knowledge base construction method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112836509A true CN112836509A (en) | 2021-05-25 |
Family
ID=75934199
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110197687.2A Pending CN112836509A (en) | 2021-02-22 | 2021-02-22 | Expert system knowledge base construction method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112836509A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||