CN114238653B - Method for constructing programming education knowledge graph, completing and intelligently asking and answering - Google Patents

Info

Publication number
CN114238653B
Authority
CN
China
Prior art keywords
knowledge
question
knowledge graph
template
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111491707.3A
Other languages
Chinese (zh)
Other versions
CN114238653A (en
Inventor
冯博
王丽苹
宋培东
李逸飞
周琪丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202111491707.3A
Publication of CN114238653A
Application granted
Publication of CN114238653B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance

Abstract

The invention discloses a method for constructing and completing a programming education knowledge graph and for intelligent question answering, comprising the following steps: a) processing knowledge in the programming field, building a semi-automatic entity recognition tool by combining BiLSTM-CRF with a manually constructed dictionary of knowledge point terms, and using the tool to assist the manual construction of a knowledge graph for the programming field; b) locating the positions where the knowledge graph needs completion with a knowledge graph quality evaluation algorithm based on node centrality, and completing the graph either directly or through crowdsourcing; c) building an intelligent question-answering system for the programming education field on top of the programming-field knowledge graph. Unlike traditional related methods, the question-answering system relies on the purpose-built programming education knowledge graph. The invention thereby addresses the practical problem of carrying a knowledge graph from construction through completion to application in a question-answering system.

Description

Method for constructing and completing a programming education knowledge graph and for intelligent question answering
Technical Field
The invention belongs to the field of computer science, and in particular relates to a method for constructing and completing a programming education knowledge graph and for intelligent question answering.
Background
A knowledge graph organizes knowledge intelligently and efficiently so that the required information can be queried accurately and quickly, and it offers a new approach to knowledge management. By integrating information such as hypernym-hyponym relationships and attributes, it can support data mining and question-answering systems, and it has become one of the core driving forces in the development of artificial intelligence. In recent years, as knowledge graph technology has developed, the focus of application and research has gradually shifted from open-domain knowledge graphs to domain knowledge graphs, and deep integration of knowledge graphs with individual industries has become a trend. Because a domain knowledge graph concentrates on the knowledge of one specific field, it is more targeted and more specialized, and it often has higher application value.
In the Freebase dataset, 75% of people have no nationality information and 71% have no birthplace information; missing entity type information of this kind can seriously reduce accuracy and recall in applications. Knowledge graph completion is one of the hot topics in knowledge graph construction. When a knowledge graph is built manually, it is unavoidable that the knowledge is incomplete, that many items are missing, and that hidden relationships among entities are not fully mined, so a method is needed to address the incompleteness and sparsity of the knowledge graph.
Intelligent question answering is an important research direction in the currently very active field of natural language processing. It has attracted wide attention across industries, and applications have been attempted in fields such as the internet, healthcare and finance.
Investigation shows that no Chinese knowledge graph currently focuses on the vertical field of computer science. Other Chinese open-domain knowledge graphs that touch on computing suffer from low data quality and thin content, and no suitable Chinese question-answering system is available. The invention constructs a knowledge graph for the vertical field of computer science and provides a knowledge graph evaluation algorithm together with a solution for updating the graph stably over the long term, and it is pioneering and original in this respect.
Research also shows that current question-answering systems are mostly text-based, with knowledge stored as plain text or hypertext markup language, such as encyclopedia text. Studying how to answer questions through a knowledge graph therefore has practical significance and room for development.
Disclosure of Invention
The invention aims to provide a method for constructing and completing a programming education knowledge graph and for intelligent question answering. The method constructs a knowledge graph for the field of computer science and provides a knowledge graph quality evaluation algorithm and a solution for updating the graph stably over the long term. A question-answering system built on the dynamically updated knowledge graph can organize massive knowledge resources in a more structured and interlinked way and answer questions more efficiently.
The specific technical scheme for realizing the aim of the invention is as follows:
A method for constructing and completing a programming education knowledge graph and for intelligent question answering comprises the following specific steps:
Step 1: constructing a knowledge graph of the programming vertical domain that covers programming fundamentals, data structures and algorithm course knowledge, specifically comprising:
A1: extracting the ontology schema and knowledge points from structured knowledge sources in books and on websites, and obtaining the ontology constraints in a top-down manner using a five-step ontology construction method: determining the professional domain and scope of the ontology, listing the important terms of the ontology, defining the classes and the class hierarchy, defining the attributes of the classes, and defining the relationships among the classes;
A2: taking each knowledge point sentence in the text corpus of programming books and websites as one piece of corpus data and manually labeling the entities it contains, the pieces of corpus data together forming a corpus dataset; the data are labeled with the BIO scheme, that is, each word in a sentence is labeled as the beginning of a knowledge point term entity, the inside of a knowledge point term, or another word that is not part of a knowledge point term; the knowledge point term entities collected from the corpus are consolidated into a knowledge point term dictionary;
A3: performing entity recognition with a BiLSTM-CRF model combined with the knowledge point term dictionary built in step A2, that is, using a bidirectional LSTM network as the feature extractor and the sequence labeling algorithm CRF to output the recognized named entities;
A4: manually extracting entities and relationships from books and website knowledge bases in the programming field and combining them with the entities recognized in step A3 to form structured text in the form (head entity, relationship, tail entity);
A5: storing the structured text obtained in step A4 into a Neo4j database as the data layer, in accordance with the ontology constraints built in step A1, the data layer yielding the programming vertical-domain knowledge graph covering programming fundamentals, data structures and algorithm course knowledge;
Step 2: searching for imperfect parts of the knowledge graph constructed in step 1 with a knowledge graph quality evaluation algorithm based on node centrality, the algorithm specifically comprising:
B1: the importance of a node, that is, its weight, is represented by an NI value; if node A points to node B, the NI value of node B is added to the NI value of node A; each computed NI value is used in the next iteration until the error between two successive iterations is smaller than a threshold;
B2: for each node of the knowledge graph, counting whether a definition exists, the number of related operations of a data structure, whether operation code exists, and the number of questions related to the knowledge point, and calculating a score from the counts;
B3: multiplying the node NI value calculated in step B1 by the statistical score of step B2 to obtain the final node perfection score, an entity with a low score being an imperfect part of the knowledge graph;
Step 3: completing the imperfect parts of the knowledge graph found in step 2, specifically comprising:
C1: for the nodes with low perfection scores calculated in step B3, the data administrator preferentially completes them in the Neo4j graph database;
C2: for complex missing items and very large completion tasks, a crowdsourcing scheme is used: the data administrator publishes four kinds of tasks (adding questions, adding nodes, modifying nodes, and other) on a purpose-built crowdsourcing platform for users to solve, and users may also proactively submit completion proposals on the crowdsourcing platform while using it;
C3: an administrator reviews and approves, on the crowdsourcing platform, the completion proposals submitted by users in step C2 and completes the knowledge graph according to the approved proposals, yielding a complete programming vertical-domain knowledge graph covering programming fundamentals, data structures and algorithm course knowledge;
Step 4: constructing the template library that supports the intelligent question-answering system, specifically comprising:
D1: crawling raw question corpus data in natural language form from the Baidu Zhidao (Baidu Knows) question platform, cleaning the raw data, and removing the parts of each question that carry no practical meaning, including special symbols, repeated punctuation, polite phrases and modal particles;
D2: questions of the same kind share semantics and structure and differ only in their keywords; the keywords of the questions preprocessed in step D1 are masked, the masked question data are converted into fixed-length vectors with a Bert model to extract features, and text clustering is performed with the K-means algorithm and the affinity propagation clustering algorithm so that questions with similar structure fall into one class;
D3: for each class of questions obtained by the text clustering in step D2, a template is defined whose format comprises a question template and an answer template; according to the characteristics of the class, the question is converted into a Cypher statement that queries the knowledge graph built in step C3, and the answer template is filled with the result returned by the Cypher statement, completing the template library;
Step 5: verifying the question-answering system, specifically comprising:
E1: manually extracting the knowledge points in the knowledge graph to build a knowledge point entity dictionary, and using the HanLP word segmentation tool to perform word segmentation, sequence labeling and part-of-speech prediction on the question, the knowledge point entities in the question that match the entity dictionary being marked with a special part of speech;
E2: vectorizing the entities in the knowledge point entity dictionary of step E1 and the words obtained by segmenting the question in step E1, and calculating their similarity to perform entity linking;
E3: calculating the similarity between the entity-linked question and the question templates in the template library built in step D3 to perform template matching; if the matching degree is greater than a preset threshold, the template match succeeds, otherwise a default template is selected as the match result; the knowledge graph is then queried with the Cypher statement given by the template, and the returned result is filled into the answer template to obtain the answer to the question.
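The template matching of step E3 can be sketched as follows. This is only an illustration under stated assumptions: the vectors are taken as already computed, cosine similarity stands in for the unspecified similarity measure, and the `Template` structure, the function names and the threshold of 0.8 are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Template:
    question_vec: list   # vector of the question template (assumed precomputed)
    cypher: str          # Cypher statement with a $name parameter
    answer: str          # answer template with a {result} slot


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_template(question_vec, templates, default, threshold=0.8):
    """Return the most similar template, or `default` if none exceeds the threshold."""
    best = max(templates, key=lambda t: cosine(question_vec, t.question_vec), default=None)
    if best is not None and cosine(question_vec, best.question_vec) > threshold:
        return best
    return default


# Hypothetical templates: one "definition" template and the default fallback.
definition = Template([1.0, 0.0], "MATCH (n {name: $name}) RETURN n.property", "{result}")
fallback = Template([0.0, 0.0], "", "No matching template was found.")
chosen = match_template([0.9, 0.1], [definition], fallback)
print(chosen.cypher)
```

The Cypher statement of the chosen template would then be run against the graph and its result substituted into the `{result}` slot of the answer template.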
The invention has the beneficial effects that:
The invention constructs a knowledge graph for the field of computer science and provides a knowledge graph quality evaluation algorithm and a solution for updating the graph stably over the long term. A question-answering system built on the dynamically updated knowledge graph can organize massive knowledge resources in a more structured and interlinked way and answer questions more efficiently. The dynamic framework provided by the invention covers the complete process from constructing a knowledge graph for the vertical field of computer science to building the question-answering system, and it is continuously and dynamically updated to help programming learners solve problems in the field.
Drawings
FIG. 1 is a logic structure diagram of a programming education knowledge graph;
FIG. 2 is a schematic diagram of the hierarchical relationship of the ontology of the programming education knowledge graph;
FIG. 3 is a hierarchical schematic of relationships in a programming education knowledge graph;
FIG. 4 is a schematic view of the scope of relationships in a programming education knowledge graph;
FIG. 5 is a flow chart of knowledge point term named entity recognition;
FIG. 6 is a BiLSTM-CRF model diagram for named entity recognition;
FIG. 7 is a schematic diagram of the training effect of BiLSTM-CRF model;
FIG. 8 is an activity diagram of an administrator publishing a knowledge graph completion task;
FIG. 9 is a user complement knowledge graph activity diagram;
FIG. 10 is a schematic diagram of the PCA visualization of each Bert layer on the UCI-News Aggregator dataset;
FIG. 11 is a schematic diagram of an example template annotation;
FIG. 12 is a data flow diagram of a question-answering system;
FIG. 13 is a diagram of the answer interface of the intelligent question-answering system.
Detailed Description
1) Construction of knowledge graph
First, a knowledge graph for the programming field is initially constructed: the required knowledge point entities and relationships are manually extracted from teaching materials in the programming education field, and the data are stored into the graph in batches. To facilitate subsequent construction and completion, the invention also includes a semi-automatic knowledge graph construction tool.
The invention builds the programming education knowledge graph manually and top-down: the ontology schema and knowledge point information are extracted from structured knowledge sources in books and on the web, and knowledge is then added to the knowledge base under the constraints of the constructed ontology. The logical structure of the resulting knowledge graph is shown in FIG. 1. The Neo4j graph database serves as the data layer and stores knowledge as a graph; the ontology model serves as the schema layer above the data layer, manages the data layer, controls insertions and deletions on it, and plays a role similar to a mold.
Because no relevant ontology can be reused, and because the invention uses the ontology model as a higher-level abstraction and guidance layer, the data are stored uniformly in the Neo4j graph database. Following some of the steps of the seven-step method, the ontology is therefore constructed in five steps: determining the professional domain and scope of the ontology, listing the important terms of the ontology, defining the classes and the class hierarchy, defining the attributes of the classes, and defining the relationships among the classes.
For the programming education field, knowledge points in computer science are organized into modules by subject, which gives good reusability and extensibility, and the related terms are compiled. The hierarchy of the domain ontology is built top-down, that is, upper-level concepts are established first and lower-level concepts are then refined. The resulting hierarchy is shown in FIG. 2.
Because the knowledge graph is to serve an intelligent system, and the invention builds a question-answering system on it, the number and names of node attributes are unified to reduce the complexity of handling each node in the question-answering system. As shown in Table 1, every node is defined to have four internal attributes: the "label" attribute stores the node's label as a string, namely the name of the ontology class to which the node instance belongs; the "id" attribute stores an int value that uniquely identifies the node; the "name" attribute stores the node's name as a string; and the "property" attribute stores a string whose specific meaning is determined jointly by the node's label and name.
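As a minimal sketch, the four unified node attributes could be modeled as below. The class name `KGNode` and the sample values (including the label "DataStructure") are hypothetical and serve only to illustrate the attribute layout described above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class KGNode:
    label: str      # name of the ontology class the node instance belongs to
    id: int         # value that uniquely identifies the node
    name: str       # name of the node
    property: str   # free text; its meaning depends on label and name together


# Hypothetical example node for the knowledge point "queue".
node = KGNode(label="DataStructure", id=1, name="queue",
              property="A linear list with first-in, first-out access.")
print(asdict(node))
```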
Table 1 Property definitions
The relationships between classes carry rich meanings and properties. Like classes, relationships can be divided into multiple levels; the benefit of layering is that a child relationship inherits the scope of its parent relationship, which keeps the granularity of relationships in the knowledge graph moderate. Eighteen relationships are defined in total, and their hierarchy is shown in FIG. 3.
The scope definitions are shown in FIG. 4. It can be seen that the "related description" relationship connects knowledge points with descriptions, and its sub-relationships correspond one-to-one to the subclasses of the description class.
To simplify the use of the knowledge graph management tool Neo4j and to ease the construction process, the invention provides a knowledge graph construction tool with which the graph can be manipulated by editing triples.
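How a triple-editing tool might translate one triple into a graph update can be sketched as follows. This is an assumption for illustration, not the tool described in the patent: the function name is hypothetical, nodes are matched on `name` only, and the statement is merely built as text rather than sent to a database.

```python
def triple_to_cypher(head, relation, tail):
    """Build a parameterised Cypher MERGE statement for one (head, relation, tail) triple."""
    # Relationship types cannot be supplied as query parameters, so the type is
    # back-quoted into the statement text; names containing a backtick are rejected.
    if "`" in relation or not relation:
        raise ValueError("invalid relationship type: %r" % relation)
    query = (
        "MERGE (h {name: $head}) "
        "MERGE (t {name: $tail}) "
        f"MERGE (h)-[:`{relation}`]->(t)"
    )
    return query, {"head": head, "tail": tail}


# Hypothetical triple: the knowledge point "queue" has the related operation "enqueue".
query, params = triple_to_cypher("queue", "related_operation", "enqueue")
print(query)
print(params)
```

Passing the entity names as parameters rather than formatting them into the text keeps user-edited triples from altering the statement itself.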
The invention further includes a semi-automatic knowledge graph construction tool that performs named entity recognition. Because no public dataset is available for the programming education field addressed by the invention, the training dataset must be labeled manually and the knowledge points added to a knowledge point term dictionary. After knowledge point term entities have been extracted by the deep learning method, the input text corpus is additionally matched once against the knowledge point term dictionary to improve recognition. The invention adopts the BiLSTM-CRF model; the detailed flow of the named entity recognition method is shown in FIG. 5.
Data preprocessing mainly consists of sequence labeling and feature extraction. For sequence labeling, the invention adopts the BIO scheme: B-KNO marks the beginning of a knowledge point term entity, I-KNO marks the inside of a knowledge point term, and O marks other words in the text that are not part of a knowledge point term. After sequence labeling, text features are extracted: word2vec is used for word embedding, trained with the CBOW model.
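The BIO scheme with the B-KNO, I-KNO and O tags can be illustrated with a short sketch. The function name, the token list and the span format (start and end token indices, end exclusive) are assumptions made for the example.

```python
def bio_tags(tokens, spans):
    """Tag tokens with B-KNO / I-KNO / O given entity spans as (start, end), end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B-KNO"                 # first token of the knowledge point term
        for i in range(start + 1, end):
            tags[i] = "I-KNO"                 # remaining tokens of the same term
    return tags


# Hypothetical sentence in which "bubble sort" is the knowledge point term.
tokens = ["the", "definition", "of", "bubble", "sort"]
print(list(zip(tokens, bio_tags(tokens, [(3, 5)]))))
```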
Named entity recognition is performed with the BiLSTM-CRF model: a bidirectional LSTM serves as the feature extractor and is followed by a CRF layer as the output layer. The model structure is shown in FIG. 6, and the accuracy on the training and validation sets is shown in FIG. 7.
To improve recognition, after the model has extracted entities, the text corpus is matched once against the constructed knowledge point term dictionary using the bidirectional maximum matching algorithm; the final recognition result is the deduplicated union of the model output and the dictionary matches. With dictionary matching added, recognition improves markedly: precision rises substantially, recall rises slightly, and the overall F1 score improves clearly. The results with and without dictionary matching are compared in Table 2.
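A minimal sketch of dictionary matching by bidirectional maximum matching, followed by the deduplicated merge with the model output, is given below. The tie-breaking rule (fewer pieces, then fewer single characters, then the backward result) is a common heuristic assumed here rather than taken from the patent, and all function names are hypothetical.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right segmentation, always taking the longest dictionary term."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:      # single characters always match
                out.append(piece)
                i += size
                break
    return out


def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left segmentation, always taking the longest dictionary term."""
    out, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                out.append(piece)
                j -= size
                break
    return out[::-1]


def dictionary_entities(text, vocab):
    """Dictionary terms found in `text` by bidirectional maximum matching."""
    max_len = max(map(len, vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(seg):
        # Assumed heuristic: fewer pieces first, then fewer single characters.
        return (len(seg), sum(len(p) == 1 for p in seg))

    best = fwd if cost(fwd) < cost(bwd) else bwd
    return {p for p in best if p in vocab}


def merge_entities(model_entities, text, vocab):
    """Deduplicated union of model-extracted entities and dictionary matches."""
    return set(model_entities) | dictionary_entities(text, vocab)


# "双向链表的定义" = "definition of a doubly linked list";
# "链表" = "linked list", "双向链表" = "doubly linked list".
print(merge_entities({"定义"}, "双向链表的定义", {"链表", "双向链表"}))
```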
Table 2 comparison of results
For ease of use, entity recognition is packaged and made available through a purpose-designed platform.
2) Knowledge graph completion
The knowledge graph is constructed by its administrators, so oversights and an insufficiently rich question bank are hard to avoid.
The knowledge graph quality evaluation algorithm of the invention first calculates the degree centrality of each node in the knowledge graph, that is, the node's importance weight. Statistical rules for the missing content of each kind of node are then formulated, the statistics of each node are computed under those rules, and a missing-content score is given for the node by combining the weight with the statistics.
The degree centrality of a node, that is, its importance, is represented by the NI value. If node A points to node B, node A is the more important of the two. For example, the knowledge point "queue" points to its operation node "initialization of a queue"; the latter is only one operation of the former and serves it. Therefore, when node A points to node B, the NI value of B is added to the NI value of A. The node importance value of each node in the knowledge graph is calculated as follows:
where N is the number of neighbors of the node. Each computed NI value is used to update the next iteration, and the computation stops once the error between two successive iterations is smaller than the threshold. Written in matrix form, the computation for all nodes is expressed as:
which simplifies to:
NI_2 = C · NI_1
where α is the damping coefficient, Matrix is the distribution proportion matrix obtained by statistics, e is the unit matrix, e^T is the transpose of the unit matrix, NI_1 is the current NI value matrix, and NI_2 is the NI value matrix after the next iteration.
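One way such an iteration could look is sketched below. It is an assumption, not the patented formula: the description names a damping coefficient α, a distribution proportion matrix and a stopping threshold, which suggests a PageRank-style update, and here importance flows from B back to A for every edge A -> B. The function name, the default α of 0.85 and the handling of nodes that nothing points to are all illustrative choices.

```python
def node_importance(edges, alpha=0.85, tol=1e-8, max_iter=200):
    """PageRank-style NI values where, for each edge A -> B, B passes value to A."""
    nodes = sorted({n for edge in edges for n in edge})
    sources = {n: [] for n in nodes}              # sources[b] = nodes pointing to b
    for a, b in edges:
        sources[b].append(a)
    ni = {n: 1.0 / len(nodes) for n in nodes}     # uniform start
    for _ in range(max_iter):
        nxt = {n: (1.0 - alpha) / len(nodes) for n in nodes}
        for b in nodes:
            if sources[b]:
                # Split B's value evenly among the nodes that point to it.
                share = alpha * ni[b] / len(sources[b])
                for a in sources[b]:
                    nxt[a] += share
            else:
                # Nothing points to b: spread its value evenly so the total stays 1.
                for n in nodes:
                    nxt[n] += alpha * ni[b] / len(nodes)
        error = sum(abs(nxt[n] - ni[n]) for n in nodes)
        ni = nxt
        if error < tol:                           # stop once successive iterations agree
            break
    return ni


# Hypothetical fragment: "queue" points to three of its operations.
ni = node_importance([("queue", "init"), ("queue", "enqueue"), ("queue", "dequeue")])
print(max(ni, key=ni.get))
```

In this toy graph the knowledge point "queue" ends up with the largest NI value, matching the intuition that a node pointing to many operation nodes is the more important one.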
Rules are formulated to count the actual state of every node in the knowledge graph; the more complete a node, the higher its score. For example, for related operations such as initialization, insertion and deletion, each related-operation relationship adds 2 points, up to a limit of 15 points; for related implementations such as a linked implementation or a queue, each such relationship likewise adds 2 points, up to a limit of 15 points.
The larger the centrality value NI of a node, the higher its weight and the greater the need to supplement it; the higher the statistical score, the more complete the node and the smaller the need to supplement it. To unify these two relationships, the final node missing-content score is calculated as:
score(x)=NI(x)×(full(x)-has(x))
where x is the target node, full(x) is the full score under the rules for the node class to which x belongs, and has(x) is the statistical score of node x under the rules. A higher score indicates that the node is more important and less complete, and therefore more in need of completion.
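A small worked sketch of score(x) = NI(x) × (full(x) - has(x)) follows. It assumes, purely for illustration, that only the two example rules apply (2 points per related operation and per related implementation, each capped at 15), so that full(x) is 30; the function names and the sample NI value are hypothetical.

```python
def capped_points(count, per_item=2, cap=15):
    """Points earned under one rule: `per_item` per relationship, limited to `cap`."""
    return min(count * per_item, cap)


def missing_score(ni, n_operations, n_implementations):
    """score(x) = NI(x) * (full(x) - has(x)) under the two assumed rules."""
    full = 15 + 15                                   # assumed full score for this node class
    has = capped_points(n_operations) + capped_points(n_implementations)
    return ni * (full - has)


# A node with NI = 0.5, three related operations and one related implementation:
# has = 6 + 2 = 8, so the score is 0.5 * (30 - 8).
print(missing_score(0.5, 3, 1))
```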
For the parts found to need completion, an administrator can complete them directly; for complex tasks, crowdsourcing tasks can be published so that the public completes them, as shown in FIG. 8. Users may also discover, while using a knowledge graph application, places where the graph needs completion, so they can proactively complete the graph through the crowdsourcing platform, as shown in FIG. 9. The platform also provides a closed-loop points system to motivate users to contribute to knowledge graph completion.
3) Knowledge graph-based Chinese question-answering system
The invention builds an intelligent question-answering system on the programming education knowledge graph. Starting from a raw corpus, templates are generated through unsupervised clustering and manual labeling. A crawler is built to collect questions from Baidu Zhidao (Baidu Knows) as the raw corpus, with the node names in the knowledge graph used as search keywords. The core workflow of the crawler is as follows:
(1) Take a keyword from the keyword list;
(2) Convert the keyword into a query URL and access it;
(3) Extract the first 20 questions by regular-expression matching;
(4) If nothing is retrieved, sleep for a random period, increase the sleep counter and the upper limit of the random sleep time, and return to step (1);
(5) If the result is not empty, save the questions as a document named after the keyword, and delete the keyword from the keyword list;
(6) Sleep for a random period, then return to step (1).
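The control flow of steps (1) to (6) can be sketched as below. Fetching and parsing are delegated to a caller-supplied function, so no real URL or page structure is assumed; the function and parameter names, the way the sleep limit grows, and the `max_rounds` safety limit are hypothetical additions for the example.

```python
import random
import time


def crawl(keywords, fetch_questions, save, sleep=time.sleep, base_wait=1.0, max_rounds=1000):
    """Run the keyword loop; returns the keywords that could not be crawled."""
    pending = list(keywords)
    failures = 0                                          # sleep counter of step (4)
    for _ in range(max_rounds):                           # assumed guard against endless retries
        if not pending:
            break
        keyword = pending[0]                              # step (1)
        questions = fetch_questions(keyword)[:20]         # steps (2)-(3), done by the caller
        if not questions:                                 # step (4): back off, then retry
            failures += 1
            sleep(random.uniform(0, base_wait * (failures + 1)))
            continue
        save(keyword, questions)                          # step (5): store under the keyword
        pending.pop(0)
        sleep(random.uniform(0, base_wait))               # step (6)
    return pending


# Usage with stand-in functions: the first request returns nothing, later ones succeed.
calls = {"n": 0}

def fake_fetch(keyword):
    calls["n"] += 1
    return [] if calls["n"] == 1 else [f"{keyword} question {i}" for i in range(25)]

saved = {}
left = crawl(["stack", "queue"], fake_fetch, saved.__setitem__, sleep=lambda seconds: None)
print(sorted(saved), left)
```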
The crawled data are then cleaned and preprocessed. First, low-quality questions are eliminated: overlong questions that were truncated with a trailing "…" marker are deleted by matching from the end; ambiguous keywords are screened so that questions unrelated to the field can be filtered manually; and non-genuine questions such as multiple-choice and exercise items are screened out by pattern matching, for example on the "()" of multiple-choice questions.
Next, meaningless segments are removed from the questions, in the following ways: special symbols and repeated punctuation are deleted by pattern matching; a dictionary containing phrases such as "thank you" and "may I ask" is built to recognize and remove such wording; stop words are deleted with a Chinese stop-word dictionary; and the few remaining semantically irrelevant segments are removed manually.
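The pattern-matching parts of this cleaning can be sketched as follows. The phrase list uses English stand-ins for the Chinese polite-phrase dictionary, and the symbol set, function names and sample questions are hypothetical; the manual screening steps are not modeled.

```python
import re

POLITE = ("thank you", "thanks", "please")   # hypothetical stand-ins for the phrase dictionary


def is_low_quality(question):
    """True for truncated questions (trailing ellipsis) and stems containing an empty '( )'."""
    q = question.strip()
    return q.endswith(("…", "...")) or re.search(r"[(（]\s*[)）]", q) is not None


def clean(question):
    """Remove polite phrases, a few special symbols and repeated punctuation."""
    q = question
    for phrase in POLITE:
        q = re.sub(r"\b" + re.escape(phrase) + r"\b", "", q, flags=re.IGNORECASE)
    q = re.sub(r"[★☆◆■~^|]", "", q)                       # assumed set of special symbols
    q = re.sub(r"([?!.,;:？！。，；：])\1+", r"\1", q)       # repeated punctuation -> one mark
    return re.sub(r"\s+", " ", q).strip()


print(clean("please what is a stack??? thanks"))
print(is_low_quality("The time complexity of bubble sort is ( )"))
```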
A template captures the semantic and structural commonality among questions. For example, the questions "definition of bubble sort" and "definition of doubly linked list" differ only in their subject words; both ask for the definition of a knowledge point and can be generalized into the same template. Masking the subject words of each question according to the keyword list extracted above therefore increases the similarity between potentially similar questions, so that they are more easily grouped together.
Analysis of the preprocessed corpus shows that almost all questions contain only one or two subject words, and questions containing three or more subject words account for only 1%. Because the number of keywords is a key field in template handling, the corpus can first be divided into three classes according to the number of keywords, and unsupervised clustering is performed on each class separately. Similar questions are then more likely to fall into the same cluster, which makes it easier to refine templates.
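Subject-word masking and the split by subject-word count might look like the sketch below. The `<T1>`, `<T2>` placeholder style follows the `<T1>` slot used later in the query template; the function names, the "three or more" bucket and the handling of questions with no keyword (bucket 0) are assumptions.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def mask_topics(question: str, keywords: List[str]) -> Tuple[str, List[str]]:
    """Replace known subject words with <T1>, <T2>, ... in order of first appearance."""
    if not keywords:
        return question, []
    found: List[str] = []
    # Longer keywords first, so "doubly linked list" wins over "linked list".
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))

    def replace(match: "re.Match[str]") -> str:
        word = match.group(0)
        if word not in found:
            found.append(word)
        return "<T%d>" % (found.index(word) + 1)

    return pattern.sub(replace, question), found


def bucket_by_topic_count(questions: List[str], keywords: List[str]) -> Dict[int, List[str]]:
    """Group masked questions by number of subject words (3 means three or more)."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for question in questions:
        masked, found = mask_topics(question, keywords)
        buckets[min(len(found), 3)].append(masked)
    return dict(buckets)
```

After masking, the two example questions from the text become the same string, which is what makes them cluster together.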
Text clustering is performed on the preprocessed question data to divide it into subsets such that elements within a subset are as similar as possible and elements of different subsets are as dissimilar as possible; the subsets serve as the basis for subsequent template generation. The preprocessed question corpus is a series of variable-length lists of Chinese and English words and punctuation. To apply a clustering algorithm to it, relevant features must be extracted and converted into fixed-length vectors.
The invention adopts the Bert-wwm model, which has 12 Transformer layers, a hidden_size of 768, and 12 attention heads. In theory, the output of each Transformer layer can be regarded as a sentence vector. As shown in fig. 10, a visualization of the Bert layers on the UCI-News Aggregator dataset (pool_strategy=reduce_mean), the last layer (pooling_layer=-1) is too close to the training objective and tends to overfit it, which biases the semantic representation when the model is transferred without fine-tuning. The first layer (pooling_layer=-12) is too close to the original word embeddings and lacks contextual semantic information. In general, taking the penultimate layer as the sentence vector gives better results. The invention therefore uses the penultimate layer as the feature and converts each question into a 768-dimensional sentence vector.
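The pooling step itself is simple and can be illustrated without a model. In the sketch below, `hidden_states` is a hypothetical nested list with one entry per Transformer layer, each entry holding one vector per token; in practice these values would come from the Bert encoder, and padding tokens (ignored here) would have to be masked out.

```python
from typing import List

Layer = List[List[float]]  # tokens x hidden_size


def sentence_vector(hidden_states: List[Layer], pooling_layer: int = -2) -> List[float]:
    """Mean-pool the token vectors of one layer (reduce_mean); -2 is the penultimate layer."""
    layer = hidden_states[pooling_layer]
    token_count = len(layer)
    hidden_size = len(layer[0])
    return [sum(token[d] for token in layer) / token_count for d in range(hidden_size)]
```

With the real model each returned vector would have 768 components, one per hidden unit.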
For text clustering, a simple and practical algorithm is the k-means algorithm. Its basic idea is: given n vectors and an integer z, find z clusters and their respective centers, introducing a parameter m, such that the following formula is minimized:
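The formula is not reproduced in this text. As a point of reference only, the standard k-means objective can be written as below. The symbols are assumptions: $x_1, \dots, x_n$ for the n vectors, $C_1, \dots, C_z$ for the clusters, $\mu_1, \dots, \mu_z$ for their centers, and m read as a 0/1 assignment indicator $m_{ij}$; the source names only n, z and m.

```latex
J = \sum_{i=1}^{n} \sum_{j=1}^{z} m_{ij} \, \lVert x_i - \mu_j \rVert^2 ,
\qquad
m_{ij} =
\begin{cases}
1, & x_i \in C_j \\
0, & \text{otherwise}
\end{cases}
```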
The algorithm comprises the following steps:
(1) Select k points as the initial centroids;
(2) Assign every data point to the cluster of its nearest centroid;
(3) Recompute the centroid of each cluster;
(4) Repeat (2) and (3) until the centroids no longer change.
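The four steps can be sketched in plain Python as follows. This is a teaching sketch rather than the implementation used in the experiments: the random initialization, the `max_iter` safeguard and the handling of empty clusters are assumptions.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Vector], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Return (labels, centroids) for the given points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]            # step (1)
    labels = [0] * len(points)
    for _ in range(max_iter):
        labels = [                                                   # step (2)
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        new_centroids = []
        for j in range(k):                                           # step (3)
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(centroids[j])   # keep the centroid of an empty cluster
        if new_centroids == centroids:                               # step (4)
            break
        centroids = new_centroids
    return labels, centroids
```

For 768-dimensional sentence vectors a vectorized library implementation would normally be preferred; the loop form is kept here only to mirror the listed steps.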
Another clustering method is the affinity propagation clustering algorithm (Affinity Propagation). Compared with the k-means algorithm, it has the following advantages:
(1) The number of final clusters does not need to be specified manually;
(2) The generated cluster centers are existing data points rather than newly generated points;
(3) It is insensitive to the initial values of the data;
(4) Compared with the k-means clustering method, the squared error of the result is smaller.
The invention mainly uses the k-means clustering algorithm and the affinity propagation clustering algorithm for clustering experiments. The silhouette coefficient of k-means clustering is computed for different values of k; when k is 17 the silhouette coefficient is about 0.6 and the clustering effect is best.
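For readers unfamiliar with the silhouette coefficient, the sketch below computes its mean over all points for a given labeling, using Euclidean distance. The distance measure and the convention that singleton clusters score 0 are assumptions; the source does not state which variant was used.

```python
import math
from typing import Dict, List, Sequence


def silhouette_score(points: List[Sequence[float]], labels: List[int]) -> float:
    """Mean silhouette coefficient; values near 1 indicate compact, well-separated clusters."""
    clusters: Dict[int, List[Sequence[float]]] = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue                     # singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Running a clustering for several values of k and keeping the k with the highest score corresponds to the selection procedure described above.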
The questions in each cluster obtained by clustering have similar semantic structures. The template format is therefore defined first, templates are then labeled manually for each class of questions, and finally the template library is built. The invention uses the JSON format for templates: each class of template can be loaded into the question-answering system as a JSON object, and its structure is defined as follows:
The template priority determines which template is used when several templates match at the same time: the template with higher priority (priority value closer to 1) is matched first. The number of template subject words acts as a constraint that prevents questions with few subject words from being matched against templates with many subject words, which effectively improves matching efficiency. The template description distinguishes templates and improves their readability, making them easier for annotators and maintainers to manage. The template example sentences are usually the sentences at the center of the corresponding cluster; several sentences may be given to widen the matching range. After the subject words are filled in, the query template becomes an executable Cypher statement that can run a database query. Finally, the answer template may contain several answer sentence patterns for generating varied answers.
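A template with the fields just described might look like the sketch below. Only `TopicNum` and `Examples` are names that appear in the text; `Priority`, `Description`, `Query`, `Answers`, the `<R>` result slot, and the node and relationship names in the Cypher string are hypothetical and chosen purely for illustration.

```python
import json
from typing import List, Optional

TEMPLATE_JSON = """
{
  "Priority": 1,
  "TopicNum": 1,
  "Description": "definition of a single knowledge point",
  "Examples": ["what is the definition of <T1>", "what is <T1>"],
  "Query": "MATCH p = (n {name: '<T1>'})-[:HAS_DEFINITION]->(d) RETURN d.content AS result, p",
  "Answers": ["The definition of <T1> is: <R>", "<T1> is defined as <R>"]
}
"""


def fill(text: str, topics: List[str], result: Optional[str] = None) -> str:
    """Fill <T1>, <T2>, ... with subject words and, optionally, <R> with the query result."""
    for index, topic in enumerate(topics, start=1):
        text = text.replace("<T%d>" % index, topic)
    if result is not None:
        text = text.replace("<R>", result)
    return text


template = json.loads(TEMPLATE_JSON)
query = fill(template["Query"], ["bubble sort"])
answer = fill(template["Answers"][0], ["bubble sort"], "a simple comparison sort")
```

Plain string substitution mirrors the filling step described in the text; when user input reaches the database, passing the subject word as a query parameter instead would be the safer design.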
The labeling process is described using one cluster as an example, as shown in fig. 11. The original cluster is a single-subject-entity cluster whose questions are basic and very frequent, so it is given a high priority. Analysis of the cluster shows that its questions mainly concern the definition of the subject entity. Two sentences close to the cluster center are selected as example sentences, and one more example sentence with a markedly different sentence structure is added to make matching more robust. Next, a Cypher query statement is constructed manually for the template, in which <T1> is filled with the corresponding entity of the matched question; in addition to the query result, the query path is always returned for knowledge visualization. Finally, two answer templates are constructed; filling the subject entity and the query result into the corresponding positions generates the answer sentences returned to the user.
The work above completes the generation of the template library. Subsequent upgrades of the template library follow the same flow, so the flow is built into a subsystem that serves as the update system of the template library. Its original questions no longer come from questions crawled from Baidu but from user questions actually collected by the question-answering system. After receiving a question, the subsystem only needs to mask the identified subject entities before storing it in the question library. When the collected questions reach a certain scale, the cluster analysis and template labeling steps are run on the question library to update the template library. The subsystem is also responsible for managing the template library: it provides the question-answering system with a read interface, through which the system can directly call the relevant methods to read template files and convert them into template class instances for template matching. The advantage of this design is that the question-answering system can keep serving while the template library is being updated and simply loads the new template file afterwards, which improves system stability.
The data flow of the system is shown in fig. 12. A natural language question entered by the user first undergoes subject entity recognition to extract the subject words, and template matching is then performed. After a successful match, the Cypher query statement is executed against the knowledge graph, and the query result is assembled into a natural language answer and a query subgraph, both of which are displayed to the user.
As the first step of the natural language question processing flow, the subject entity recognition task consists mainly of sequence labeling and entity linking.
The sequence labeling task is: given a sequence $x = x_1 x_2 \dots x_n$, find the label sequence $y = y_1 y_2 \dots y_n$ corresponding to each element, where the set of all possible values of y is called the label set. It is used here to extract key information strings from the question. A lexical analysis system based on a CRF model is adopted. For nouns that have the same name as knowledge point nodes in the knowledge graph, a dictionary can be used to assist the CRF in word segmentation. Take the example sentence "the time complexity of merge sort": once the dictionary is set, the term "merge sort" is recognized well in the segmentation result. A long concept noun such as "time complexity" is not counted as a subject entity, because the knowledge graph has no node for that specific noun; however, to reduce the workload of subsequent entity linking, consecutive non-subject nouns are merged. The final labeled sequence is: "merge sort/nto of/u [time/n complexity/n]/n".
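The merging of consecutive non-subject nouns can be sketched as a single pass over the tagged tokens. The `(word, tag)` tuple representation and the `joiner` parameter are assumptions made for this illustration; for Chinese text the joiner would be the empty string.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, tag)


def merge_consecutive_nouns(tokens: List[Token], noun_tag: str = "n",
                            joiner: str = " ") -> List[Token]:
    """Merge runs of adjacent common nouns (tag "n") into a single noun token."""
    merged: List[Token] = []
    run: List[str] = []

    def flush() -> None:
        # Emit the pending run of nouns, if any, as one token.
        if run:
            merged.append((joiner.join(run), noun_tag))
            run.clear()

    for word, tag in tokens:
        if tag == noun_tag:
            run.append(word)
        else:
            flush()
            merged.append((word, tag))
    flush()
    return merged
```

Subject entities tagged "nto" are never merged, so they stay available as separate candidates for template matching.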
Some subject entities have been tagged "nto" above by means of the dictionary, but the dictionary cannot recognize synonyms effectively. For example, the short form of "quick sort" is a common synonym of the full term: in the word segmentation result the short form is tagged "n" as a common noun, whereas the full term "quick sort" is tagged "nto" as a subject entity through dictionary matching. The short form therefore has to be linked to the "quick sort" node in the knowledge graph, and the short form in the original sentence is converted into the subject entity "quick sort".
The invention performs entity linking with a vector-based similarity method. First, all keywords extracted from TopicLoader are encoded with Bert, each word into a 768-dimensional vector; 496 keywords are collected in total, giving a 496×768 keyword matrix. Each candidate noun is encoded in the same way, and its Pearson correlation coefficient with every row of the keyword matrix is computed. The coefficient measures the linear correlation between two vectors and ranges from -1 to 1. It is calculated as follows:
$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$
where $X_i$ is the i-th dimension of the first word vector, $Y_i$ is the i-th dimension of the second word vector, $\bar{X}$ and $\bar{Y}$ are the means of the first and second word vectors respectively, and n is the vector dimension (768 here). According to experiments, setting the correlation coefficient threshold to about 0.8 gives good accuracy.
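Entity linking by Pearson correlation can be sketched as follows. The vectors would come from the Bert encoder in practice; here they are plain lists, and `link_entity`, its dictionary argument and the tie-breaking rule (first best match wins) are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0          # a constant vector has no defined correlation
    return covariance / (spread_x * spread_y)


def link_entity(candidate: Sequence[float],
                keyword_vectors: Dict[str, Sequence[float]],
                threshold: float = 0.8) -> Optional[str]:
    """Return the keyword most correlated with the candidate, or None below the threshold."""
    best_name, best_score = None, -1.0
    for name, vector in keyword_vectors.items():
        score = pearson(candidate, vector)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

The default threshold of 0.8 follows the value reported in the text; returning `None` leaves the candidate as an ordinary noun.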
After subject entity recognition, the number and names of the subject entities contained in the original question are known, and template matching follows. The template matching steps adopted by the invention are as follows:
(1) Select a candidate template set T, in which the TopicNum of every template is less than or equal to the number of subject entities in the question;
(2) For each template $T_i$ in T, fill the subject entities of the question, in order, into the example sentence templates in the Examples of $T_i$ to generate an example sentence matrix $S_i$, each row of which is a generated natural language example sentence. Combine all example sentence matrices into an example sentence set S;
(3) For each $S_i$ in S, compute the similarity between the entity-linked user question and every example sentence, and take the highest similarity as the matching degree of the template, denoted $c_i$;
(4) Select the template $T_j$ with the highest matching degree. If its matching degree is greater than a preset threshold ρ, template matching succeeds; otherwise template matching fails and the default template is selected as the matching result.
When matching fails, the default template returns the subgraph related to the subject entities of the user question in the knowledge graph and does not generate a natural language answer.
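The matching steps can be sketched as below. Several points are assumptions: the similarity measure is not specified in the text, so a simple word-overlap (Jaccard) score stands in for it; filling "in order" is interpreted as trying every ordered selection of subject entities; and the threshold value 0.5 and the `Name` field used in the usage example are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, List, Optional, Tuple

Template = Dict[str, object]


def jaccard(a: str, b: str) -> float:
    """Stand-in similarity: word overlap between two sentences."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def match_template(question: str, topics: List[str], templates: List[Template],
                   threshold: float = 0.5,
                   similarity: Callable[[str, str], float] = jaccard
                   ) -> Tuple[Optional[Template], float]:
    """Return (best template, matching degree), or (None, degree) if matching fails."""
    best, best_score = None, 0.0
    for template in templates:
        topic_num = int(template["TopicNum"])
        if topic_num > len(topics):                       # step (1): candidate filter
            continue
        for combo in permutations(topics, topic_num):     # step (2): fill example sentences
            for example in template["Examples"]:
                sentence = str(example)
                for index, topic in enumerate(combo, start=1):
                    sentence = sentence.replace("<T%d>" % index, topic)
                score = similarity(question, sentence)    # step (3): matching degree
                if score > best_score:
                    best, best_score = template, score
    if best is not None and best_score > threshold:       # step (4): threshold check
        return best, best_score
    return None, best_score    # the caller falls back to the default template
```

A call such as `match_template("what is the definition of bubble sort", ["bubble sort"], templates)` returns the definition template when its example sentence, once filled, coincides with the question.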
Examples
Several sets of test data were designed and constructed according to real question-answering scenarios to test the accuracy of the template matching algorithm; the results are shown in table 3. The first and second sets come from a questionnaire that collected questions respondents actually wanted to ask the question-answering system. The third and fourth sets come from the previously crawled Baidu Zhidao questions; about 50% of the third set are single-subject-entity questions, and about 70% of the fourth set are single- or dual-subject-entity questions.
Table 3 accuracy test
Analysis of the test results shows that the accuracies of the first and second sets are similar, while the accuracies of the third and fourth sets are slightly higher because they share the same corpus source as the template library. In the first and second sets, the proportion of simple and complex questions matches the proportion found in practice, so their results are closer to real-world performance. Part of the test data and results are shown in table 4.
Table 4 partial test data and results
Overall, the accuracy of the current template matching algorithm is basically adequate for the question-answering task. Because the template library is pluggable and easy to update, matching accuracy can be improved gradually by iteratively upgrading the template library.
The final display of the intelligent question-answering system is shown in fig. 13. The natural language answer appears in the answer box in the middle; the matched template and subject words are shown in small light-gray text above the answer, and the knowledge graph subgraph is displayed below it.
System performance was evaluated mainly by answer response time. Five sets of test questions were designed in total, and the tests were run on an Intel Core i7-8750H CPU (2.20 GHz) platform; the results are shown in table 5. When the repetition rate of input questions is high, caching considerably improves the response speed of the system. Keyword extraction and the Neo4j connection and query have a large impact on performance; by increasing the number of concurrent tasks and extending the cache duration, the response time can be kept within an acceptable range.
TABLE 5 Performance test results

Claims (1)

1. A method for constructing and completing a programming education knowledge graph and for intelligent question answering, characterized by comprising the following specific steps:
Step 1: constructing a programming vertical-domain knowledge graph containing programming fundamentals, data structure and algorithm course knowledge, which specifically comprises:
A1: extracting the ontology schema and knowledge points from the structured knowledge sources of books and websites, and obtaining the ontology constraints in a top-down manner using a five-step method: determining the professional domain and scope of the ontology, listing the important terms of the ontology, defining the classes and the class hierarchy, defining the attributes of the classes, and defining the relations among the classes;
A2: taking each knowledge point sentence in the knowledge text corpus of books and websites in the programming field as one piece of corpus data, manually labeling the entities contained in each piece, the pieces forming a corpus data set; labeling the data with the BIO scheme, that is, marking each word in a sentence as the beginning of a knowledge point term entity, the inside of a knowledge point term entity, or other non-term vocabulary; and collecting the knowledge point term entities labeled in the corpus to obtain a knowledge point term dictionary;
A3: performing entity matching and recognition with the BiLSTM-CRF model in combination with the knowledge point term dictionary constructed in step A2, that is, using a bidirectional LSTM network as the feature extractor and the sequence labeling algorithm CRF to output the named entity recognition result;
A4: manually extracting entities and relations from books and website knowledge bases in the programming field and combining them with the entities recognized in step A3 to form structured text in the form (head entity, relation, tail entity);
A5: storing the structured text obtained in step A4 into a Neo4j database as the data layer according to the ontology constraints constructed in step A1, the data layer yielding the programming vertical-domain knowledge graph containing programming fundamentals, data structure and algorithm course knowledge;
Step 2: searching for the imperfect parts of the knowledge graph constructed in step 1 using a knowledge graph quality evaluation algorithm based on node centrality, which specifically comprises:
B1: representing node importance, that is, weight, by an NI value; if node A points to node B, the NI value of node B is added to the NI value of node A; the NI values computed in each round are used in the next iteration until the error between two successive iterations is smaller than a threshold;
B2: for each node of the knowledge graph, counting whether a definition exists, the number of related operations of the data structure, whether operation code exists, and the number of questions related to the knowledge point, and calculating a score from the statistics;
B3: multiplying the node NI value calculated in step B1 by the statistical score of step B2 to obtain the final node completeness score, entities with low scores being the imperfect parts of the knowledge graph;
Step 3: completing the imperfect parts of the knowledge graph found in step 2, which specifically comprises:
C1: for the nodes with low completeness scores calculated in step B3, the data manager preferentially completes them in the Neo4j graph database;
C2: handling complicated missing cases and large completion workloads through a crowdsourcing scheme, that is, the data manager publishes four kinds of tasks, such as adding questions, adding nodes and modifying nodes, on the constructed crowdsourcing platform for users to solve, and users may also actively publish completion schemes on the crowdsourcing platform while using it;
C3: an administrator reviews and approves the completion schemes published by users in step C2 on the crowdsourcing platform and completes the knowledge graph according to the approved schemes, obtaining a complete programming vertical-domain knowledge graph containing programming fundamentals, data structure and algorithm course knowledge;
Step 4: constructing the template library that supports the intelligent question-answering system, which specifically comprises:
D1: crawling original question corpus data from the Baidu Zhidao platform, where questions are asked in natural language, cleaning the original data, and removing the parts of questions that have no practical meaning, including special symbols, repeated punctuation, polite phrases and modal particles;
D2: since similar questions share semantics and structure and differ only in their keywords, masking the keywords of the questions preprocessed in step D1, converting the masked question text into fixed-length vectors with the Bert model to extract features, and performing text clustering with the K-means algorithm and the affinity propagation clustering algorithm so that questions of similar structure are grouped into one class;
D3: defining a template format for each class of questions obtained by the text clustering of step D2, the format comprising a question template and an answer template; converting the question, according to the characteristics of the class, into a Cypher statement that retrieves from the knowledge graph constructed in step C3; and filling the answer template with the result returned by the Cypher statement, thereby completing the construction of the template library;
Step 5: verifying the question-answering system, which specifically comprises:
E1: manually extracting the knowledge points in the knowledge graph to construct a knowledge point entity dictionary, and using the HanLP word segmentation tool to perform word segmentation, sequence labeling and part-of-speech prediction on the question, wherein knowledge point entities in the question that match the entity dictionary are tagged with a special part of speech;
E2: vectorizing the entities in the knowledge point entity dictionary of step E1 and the words obtained by segmenting the question in step E1, and calculating their similarity to perform entity link matching;
E3: calculating the similarity between the entity-linked question and the question templates in the template library constructed in step D3 to perform template matching; if the matching degree is greater than a preset threshold, template matching succeeds, otherwise the default template is selected as the matching result; retrieving from the knowledge graph with the Cypher statement given by the template, and filling the returned result into the answer template to obtain the answer to the question.
CN202111491707.3A 2021-12-08 2021-12-08 Method for constructing programming education knowledge graph, completing and intelligently asking and answering Active CN114238653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111491707.3A CN114238653B (en) 2021-12-08 2021-12-08 Method for constructing programming education knowledge graph, completing and intelligently asking and answering

Publications (2)

Publication Number Publication Date
CN114238653A CN114238653A (en) 2022-03-25
CN114238653B CN114238653B (en) 2024-05-24

Family

ID=80753974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111491707.3A Active CN114238653B (en) 2021-12-08 2021-12-08 Method for constructing programming education knowledge graph, completing and intelligently asking and answering

Country Status (1)

Country Link
CN (1) CN114238653B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115309915B (en) * 2022-09-29 2022-12-09 北京如炬科技有限公司 Knowledge graph construction method, device, equipment and storage medium
CN115795056B (en) * 2023-01-04 2024-08-02 中国电子科技集团公司第十五研究所 Method, server and storage medium for constructing knowledge graph by unstructured information
CN117194633B (en) * 2023-09-12 2024-07-26 河海大学 Dam emergency response knowledge question-answering system based on multi-level multipath and implementation method
CN117390200A (en) * 2023-11-06 2024-01-12 南京题谱思信息科技有限公司 Method for identifying solution ideas of questions in knowledge field

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920556A (en) * 2018-06-20 2018-11-30 华东师范大学 Recommendation expert method based on subject knowledge map
CN109241290A (en) * 2017-07-10 2019-01-18 华东师范大学 A kind of knowledge mapping complementing method, device and storage medium
CN110275947A (en) * 2019-05-23 2019-09-24 中国人民解放军战略支援部队信息工程大学 Domain-specific knowledge map natural language querying method and device based on name Entity recognition
CN110347843A (en) * 2019-07-10 2019-10-18 陕西师范大学 A kind of Chinese tour field Knowledge Service Platform construction method of knowledge based map
CN111143539A (en) * 2019-12-31 2020-05-12 重庆和贯科技有限公司 Knowledge graph-based question-answering method in teaching field
WO2020228416A1 (en) * 2019-05-14 2020-11-19 京东数字科技控股有限公司 Responding method and device
CN112100351A (en) * 2020-09-11 2020-12-18 陕西师范大学 Method and equipment for constructing intelligent question-answering system through question generation data set
CN112417100A (en) * 2020-11-20 2021-02-26 大连民族大学 Knowledge graph in Liaodai historical culture field and construction method of intelligent question-answering system thereof
CN113312501A (en) * 2021-06-29 2021-08-27 中新国际联合研究院 Construction method and device of safety knowledge self-service query system based on knowledge graph
WO2021189921A1 (en) * 2020-10-19 2021-09-30 平安科技(深圳)有限公司 Intelligent question answering method and apparatus for multi-round dialog in medical field, and computer device
CN113535917A (en) * 2021-06-30 2021-10-22 山东师范大学 Intelligent question-answering method and system based on travel knowledge map
CN113704499A (en) * 2020-09-24 2021-11-26 广东昭阳信息技术有限公司 Accurate and efficient intelligent education knowledge map construction method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
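Knowledge graph completion method based on joint representation of structure and text; 鲍开放; 顾君忠; 杨静; 计算机工程; 20170815 (No. 07); 211-217 *
Relation extraction techniques based on distant supervision; 王嘉宁; 何怡; 朱仁煜; 刘婷婷; 高明; 华东师范大学学报(自然科学版); 20200925 (No. 05); 122-139 *
End-to-end joint extraction of knowledge triples incorporating adversarial training; 黄培馨; 赵翔; 方阳; 朱慧明; 肖卫东; 计算机研究与发展; 20191215 (No. 12); 20-32 *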

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant