CN114238653B

CN114238653B - Method for constructing and completing a programming education knowledge graph and for intelligent question answering

Info

Publication number: CN114238653B
Application number: CN202111491707.3A
Authority: CN
Inventors: 冯博; 王丽苹; 宋培东; 李逸飞; 周琪丰
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2021-12-08
Filing date: 2021-12-08
Publication date: 2024-05-24
Anticipated expiration: 2041-12-08
Also published as: CN114238653A

Abstract

The invention discloses a method for constructing and completing a programming education knowledge graph and for intelligent question answering, comprising the following steps: a) processing knowledge in the programming field, building a semi-automatic entity recognition tool by combining BiLSTM-CRF with a manually constructed dictionary of knowledge point terms, and using the tool to assist the manual construction of a knowledge graph for the programming field; b) locating the positions where the knowledge graph needs completion by means of a knowledge graph quality evaluation algorithm based on node centrality, and completing the knowledge graph either directly or through crowdsourcing; c) building an intelligent question-answering system for the programming education field on top of the programming-field knowledge graph. Compared with conventional related methods, the intelligent question-answering system relies on the constructed programming education knowledge graph, which solves the practical problem of applying a knowledge graph to a question-answering system across the whole process from construction to completion.

Description

Method for constructing and completing a programming education knowledge graph and for intelligent question answering

Technical Field

The invention belongs to the field of computer science, and particularly relates to a method for constructing and completing a programming education knowledge graph and for intelligent question answering.

Background

By organizing knowledge intelligently and efficiently, a knowledge graph allows the required information to be queried accurately and quickly and offers a new approach to knowledge management. Because it integrates information such as hypernym-hyponym relationships and attributes, it can support data mining and question-answering systems, and it has become one of the core driving forces of artificial intelligence. In recent years, as knowledge graph technology has developed, the focus of application and research has gradually shifted from open-domain knowledge graphs to domain knowledge graphs, and deep integration of knowledge graphs with individual industries has become a trend. Because a domain knowledge graph concentrates on the knowledge of a specific domain, it is more targeted and specialized and often has higher application value.

In the Freebase dataset, 75% of people have no nationality information and 71% have no birthplace information; the absence of such entity type information can seriously affect the precision and recall of applications. Knowledge graph completion is one of the hot topics in knowledge graph construction. When a knowledge graph is built manually, it is unavoidable that the knowledge is incomplete, that many gaps exist, and that hidden relations among entities are not fully mined, so a method is needed to address incompleteness and sparsity in the knowledge graph.

Intelligent question answering is an important research direction in natural language processing. It has attracted wide attention across industries, and applications have been attempted in fields such as the internet, medicine, and finance.

Our investigation found no Chinese knowledge graph dedicated to the vertical field of computer science. Existing open-domain Chinese knowledge graphs that touch on computing suffer from low data quality and thin content, and no suitable Chinese question-answering system is available. The method constructs a knowledge graph in the vertical field of computer science and provides a knowledge graph evaluation algorithm and a solution for long-term, stable updating of the knowledge graph, and is pioneering and creative in this respect.

Research shows that current question-answering systems are mostly text-based, with knowledge typically stored as plain text or hypertext markup language, such as encyclopedia text. Research on answering questions through a knowledge graph therefore has practical significance and room for development.

Disclosure of Invention

The invention aims to provide a method for constructing and completing a programming education knowledge graph and for intelligent question answering. The method constructs a knowledge graph in the field of computer science and provides a knowledge graph quality assessment algorithm together with a solution for long-term, stable updating of the graph. A question-answering system built on the dynamically updated knowledge graph can organize massive knowledge resources in a more structured and interlinked way and answer questions more efficiently.

The specific technical scheme for realizing the aim of the invention is as follows:

A method for constructing and completing a programming education knowledge graph and for intelligent question answering comprises the following specific steps:

Step 1: constructing a programming vertical-domain knowledge graph containing knowledge from programming fundamentals, data structure, and algorithm courses, which specifically comprises:

A1: extracting ontology patterns and knowledge points from structured knowledge sources in books and websites, and obtaining ontology constraints in a top-down manner using a five-step ontology construction method: determining the professional field and scope of the ontology, listing the important terms in the ontology, defining the classes and the class hierarchy, defining the attributes of the classes, and defining the relationships among the classes;

A2: taking each knowledge-point sentence in the knowledge text corpus of programming books and websites as one piece of corpus data, manually labeling the entities contained in each piece, and assembling the pieces into a corpus data set; the data are labeled with the BIO scheme, that is, each word in a sentence is marked as the beginning of a knowledge point term entity, the inside of a knowledge point term, or another non-term word; the knowledge point term entities collected from the corpus are compiled into a knowledge point term dictionary;

A3: performing entity recognition with the BiLSTM-CRF model combined with the knowledge point term dictionary constructed in step A2, that is, using a bidirectional LSTM network as the feature extractor and the CRF sequence labeling algorithm to output the named entity recognition result;

A4: manually extracting entities and relations from books and website knowledge bases in the programming field, and combining the entities identified in the step A3 to form a structured text in the form of a head entity, a relation and a tail entity;

A5: storing the structured text obtained in step A4 in a Neo4j database as the data layer, subject to the ontology constraints constructed in step A1; the data layer yields the programming vertical-domain knowledge graph containing knowledge from programming fundamentals, data structure, and algorithm courses;

Step 2: locating the imperfect parts of the knowledge graph constructed in step 1 with a knowledge graph quality evaluation algorithm based on node centrality, which specifically comprises:

B1: representing node importance, i.e. the node weight, by an NI value; if node A points to node B, the NI value of node B is added to the NI value of node A; each computed NI value is used in the next iteration until the error between two successive iterations is smaller than a threshold;

B2: for each node of the knowledge graph, counting whether a definition exists, the number of related operations of the data structure, whether operation code exists, and the number of exercise questions related to the knowledge point, and computing a score from these counts;

B3: multiplying the node NI value calculated in the step B1 and the statistical score of the step B2 to obtain a final node perfection score, wherein an entity with a low score is an imperfect place in the knowledge graph;

Step 3: completing the imperfect parts of the knowledge graph found in step 2, which specifically comprises:

C1: for the nodes with low perfection scores calculated in step B3, the data manager completes them directly in the Neo4j graph database;

C2: for complex omissions and large completion workloads, the problem is solved through crowdsourcing: the data manager publishes tasks of four types (adding questions, adding nodes, modifying nodes, and others) on a purpose-built crowdsourcing platform for users to solve, and users may also proactively submit completion proposals on the crowdsourcing platform while using it;

C3: an administrator reviews and approves the completion proposals submitted by users in step C2 on the crowdsourcing platform and completes the knowledge graph according to the approved proposals, obtaining a complete programming vertical-domain knowledge graph containing knowledge from programming fundamentals, data structure, and algorithm courses;

Step 4: constructing the template library that supports the intelligent question-answering system, which specifically comprises:

D1: crawling raw question corpus data from Baidu Zhidao, a platform where questions are asked in natural language, cleaning the raw data, and removing the parts of each question that carry no practical meaning, including special symbols, repeated punctuation, polite phrases, and modal particles;

D2: similar questions share semantics and structure and differ only in their keywords; the keywords in the questions preprocessed in step D1 are masked, the masked question data are converted into fixed-length vectors with the BERT model to extract features, and text clustering is performed with the K-means algorithm and the affinity propagation clustering algorithm so that questions with similar structure fall into one class;

D3: defining a template format comprising a question template and an answer template for each class of questions obtained by the text clustering in step D2; according to the characteristics of the class, the question is converted into a Cypher statement that is executed against the knowledge graph constructed in step C3, and the answer template is filled with the result returned by the Cypher statement, completing the construction of the template library;

Step 5: verifying the question-answering system, which specifically comprises:

E1: manually extracting the knowledge points in the knowledge graph to build a knowledge point entity dictionary, and using the HanLP word segmentation tool to perform word segmentation, sequence labeling, and part-of-speech prediction on the question, wherein knowledge point entities in the question that match the entity dictionary are tagged with a special part of speech;

E2: vectorizing the entities in the knowledge point entity dictionary of step E1 and the words obtained by segmenting the question in step E1, and computing the similarity between them to perform entity linking;

E3: computing the similarity between the entity-linked question and the question templates in the template library constructed in step D3 to perform template matching; if the matching degree is greater than a preset threshold, the template matching succeeds, otherwise a default template is selected as the matching result; the knowledge graph is then searched with the Cypher statement given by the template, and the returned result is filled into the answer template to obtain the answer to the question.
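The threshold test in step E3 can be illustrated with a minimal sketch. The text does not say which similarity measure is used, so cosine similarity over sentence vectors is an assumption here, as are the threshold value 0.8, the template identifiers, and the function names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_template(question_vec, templates, threshold=0.8, default="default"):
    """Return the id of the best template above the threshold, else the default.

    templates: {template_id: [example sentence vectors]} (hypothetical layout).
    """
    best_id, best_sim = default, threshold
    for template_id, examples in templates.items():
        sim = max(cosine(question_vec, e) for e in examples)
        if sim > best_sim:
            best_id, best_sim = template_id, sim
    return best_id

# Toy two-dimensional vectors stand in for real sentence vectors.
templates = {"definition": [[1.0, 0.0]], "comparison": [[0.0, 1.0]]}
matched = match_template([0.9, 0.1], templates)
fallback = match_template([1.0, 1.0], templates)
```

In this toy run the first question is close to the "definition" example and matches it, while the second is equally far from both examples and falls back to the default template.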

The invention has the beneficial effects that:

The invention constructs a knowledge graph in the field of computer science and provides a knowledge graph quality evaluation algorithm together with a solution for long-term, stable updating of the graph. A question-answering system built on the dynamically updated knowledge graph can organize massive knowledge resources in a more structured and interlinked way and answer questions more efficiently. The dynamic framework provided by the invention carries out the complete process from constructing the knowledge graph for the vertical field of computer science to building the question-answering system, and is continuously and dynamically updated to help programming learners solve problems in the related field.

Drawings

FIG. 1 is a logic structure diagram of a programming education knowledge graph;

FIG. 2 is a schematic diagram of the hierarchical relationship of the ontology of the programming education knowledge graph;

FIG. 3 is a hierarchical schematic of relationships in a programming education knowledge graph;

FIG. 4 is a schematic view of the scope of relationships in a programming education knowledge graph;

FIG. 5 is a flow chart of knowledge point term named entity recognition;

FIG. 6 is a BiLSTM-CRF model diagram for named entity recognition;

FIG. 7 is a schematic diagram of the training effect of BiLSTM-CRF model;

FIG. 8 is an activity diagram of an administrator publishing a knowledge graph completion task;

FIG. 9 is an activity diagram of a user completing the knowledge graph;

FIG. 10 is a PCA visualization of the BERT layers on the UCI-News Aggregator dataset;

FIG. 11 is a schematic diagram of an example template annotation;

FIG. 12 is a data flow diagram of a question-answering system;

FIG. 13 is a diagram of the answer interface of the intelligent question-answering system.

Detailed Description

1) Construction of knowledge graph

First, a knowledge graph for the programming field is initially constructed: the required knowledge point entities and relations are manually extracted from teaching materials in the programming education field, and the data are stored into the graph in batches. To facilitate subsequent construction and completion, the invention also includes a semi-automatic knowledge graph construction tool.

The invention builds the programming education knowledge graph manually and top-down: the ontology schema and knowledge point information are extracted from structured knowledge sources in books and on the web, and knowledge is then added to the knowledge base with the constructed ontology as a constraint. The logical structure of the resulting knowledge graph is shown in FIG. 1. The Neo4j graph database serves as the data layer and stores knowledge in graph form; the ontology model serves as the schema layer above the data layer, managing it, controlling add and delete operations on it, and playing a role similar to a mold.

No relevant ontology is available for reuse, and the invention uses the ontology model as a higher-level abstraction and guidance layer while the data are stored uniformly in the Neo4j graph database. The ontology is therefore constructed in five steps drawn from the seven-step method: determining the professional field and scope of the ontology, listing the important terms in the ontology, defining the classes and the class hierarchy, defining the attributes of the classes, and defining the relationships between the classes.

For the programming education field, knowledge points in computer science are organized into modules by subject, which gives good reusability and extensibility, and the related terms are compiled. The hierarchy of the domain ontology is built top-down, that is, upper-level concepts are established first and lower-level concepts are then refined. The resulting hierarchy is shown in FIG. 2.

Because the knowledge graph is to serve an intelligent system, and the invention builds a question-answering system on it, the number and names of node attributes are unified to reduce the complexity of handling each node in the question-answering system. As shown in Table 1, every node is defined to have four internal attributes: the "label" attribute stores the node's label as a string, i.e. the name of the ontology class to which the node instance belongs; the "id" attribute stores an int value that uniquely identifies the node; the "name" attribute stores the node's name as a string; and the "property" attribute stores a string whose specific meaning is determined jointly by the node's label and name.

Table 1. Property definition

Relationships between classes carry rich meanings and properties. Like classes, relationships can be organized into multiple levels; the benefit of layering is that a child relationship inherits the scope of its parent relationship, which keeps the granularity of relationships expressed in the knowledge graph moderate. Eighteen relationships are defined in total, and their hierarchy is shown in FIG. 3.

As the scope definitions in FIG. 4 show, the "related description" relationship connects knowledge points with descriptions, and its sub-relationships correspond one-to-one to the subclasses of the description class.

To simplify the use of the knowledge graph management tool Neo4j and ease the construction process, the invention provides a knowledge graph construction tool with which the graph can be manipulated by editing triples.
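The triple-editing idea can be sketched as a function that turns one (head entity, relation, tail entity) triple into a parameterized Cypher `MERGE` statement. The tool's internals are not disclosed, so the generic `Entity` node label, keying nodes by `name`, the relation-name check, and the function name are all assumptions for illustration.

```python
import re

def triple_to_cypher(head, relation, tail):
    """Build a parameterized Cypher statement that upserts one triple.

    Assumption: nodes carry a generic `Entity` label and are keyed by `name`.
    """
    # Relationship types cannot be passed as query parameters in Cypher,
    # so the name is validated and then backtick-quoted.
    if not re.fullmatch(r"\w+", relation):
        raise ValueError(f"unsupported relation name: {relation!r}")
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:`{relation}`]->(t)"
    )
    return query, {"head": head, "tail": tail}

query, params = triple_to_cypher("queue", "related_operation", "queue initialization")
```

The returned query and parameter dictionary could then be handed to a Neo4j driver session; using `MERGE` rather than `CREATE` keeps repeated edits of the same triple from duplicating nodes or relationships.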

The invention further includes a semi-automatic knowledge graph construction tool that performs named entity recognition. Because no public data set is currently available for the programming education field addressed by the invention, the training data set must be labeled manually and the knowledge points added to a knowledge point term dictionary. After knowledge point term entities are extracted by the deep learning method, the input text corpus is additionally matched once, by word segmentation, against the knowledge point term dictionary to improve recognition. The invention adopts the BiLSTM-CRF model; the detailed flow of the named entity recognition method is shown in FIG. 5.

Data preprocessing consists mainly of sequence labeling and feature extraction. For sequence labeling, the invention uses the BIO scheme: B-KNO marks the beginning of a knowledge point term entity, I-KNO marks the inside of a knowledge point term, and O marks all other words in the text, i.e. non-term vocabulary. After sequence labeling, text features are extracted: word2vec is used for word embedding, trained with the CBOW model.
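The BIO scheme with the B-KNO, I-KNO, and O tags can be illustrated with a short sketch. The English tokens, the span format (start and end token indices, end exclusive), and the function name are assumptions chosen for readability; the patent's corpus is Chinese.

```python
def bio_tags(tokens, entity_spans):
    """Label tokens with B-KNO / I-KNO / O.

    entity_spans: list of (start, end) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B-KNO"            # first token of the term
        for i in range(start + 1, end):
            tags[i] = "I-KNO"            # remaining tokens of the term
    return tags

tokens = ["a", "linked", "list", "stores", "nodes"]
tags = bio_tags(tokens, [(1, 3)])        # "linked list" is the knowledge point term
```

Here "linked" receives B-KNO, "list" receives I-KNO, and every other token receives O.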

Named entity recognition is performed with the BiLSTM-CRF model, which uses a bidirectional LSTM as the feature extractor followed by a CRF layer as the output layer. The model structure is shown in FIG. 6, and the accuracy on the training and validation sets is shown in FIG. 7.

To improve recognition, after the model extracts entities, the text corpus is additionally matched once against the constructed knowledge point term dictionary using the bidirectional maximum matching algorithm; the final recognition result is the de-duplicated union of the model's extraction result and the text matching result. With dictionary matching added, recognition improves markedly: precision rises substantially, recall rises slightly, and the overall F1 score improves significantly. The results of the two approaches are shown in Table 2.
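A minimal sketch of dictionary matching by bidirectional maximum matching, followed by the merge with model output, is given below. The tie-breaking rule (fewer segments, then fewer single-character segments, otherwise the backward result) is a common heuristic and an assumption here, as are the sample dictionary and the function names.

```python
def forward_max_match(text, dictionary, max_len):
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, dictionary, max_len):
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def bidirectional_max_match(text, dictionary):
    max_len = max(map(len, dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    # Assumed heuristic: fewer segments, then fewer single-character segments.
    key = lambda ws: (len(ws), sum(len(w) == 1 for w in ws))
    return fwd if key(fwd) < key(bwd) else bwd

def recognize(text, dictionary, model_entities):
    """De-duplicated union of dictionary matches and model-extracted entities."""
    matched = {w for w in bidirectional_max_match(text, dictionary) if w in dictionary}
    return matched | set(model_entities)

# "双向链表" = doubly linked list, "冒泡排序" = bubble sort, "时间复杂度" = time complexity
dictionary = {"双向链表", "链表", "冒泡排序"}
segments = bidirectional_max_match("双向链表和冒泡排序", dictionary)
entities = recognize("双向链表和冒泡排序", dictionary, ["冒泡排序", "时间复杂度"])
```

Because the longest dictionary entry wins, "双向链表" is matched as one term rather than as "链表" preceded by two stray characters.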

Table 2 comparison of results

For ease of use, the entity recognition function is packaged and provided through a purpose-built platform.

2) Knowledge graph completion

The knowledge graph is constructed by its managers, so problems such as incomplete coverage and an insufficiently rich question bank are unavoidable.

The knowledge graph quality assessment algorithm provided by the invention first computes the degree centrality of each node in the knowledge graph, i.e. the node's importance weight in the graph; it then formulates statistical rules for the missing content of each type of node, computes each node's statistics under these rules, and combines the weight with the statistics to score how much each node is missing.

Node degree centrality, i.e. importance, is represented by the NI value. If node A points to node B, node A is regarded as the more important of the two. For example, the knowledge point "queue" points to its operation node "queue initialization"; the latter is merely one operation of the former and serves it. Therefore, when node A points to node B, the NI value of B is added to the NI value of A. The node importance value of each node in the knowledge graph is calculated as follows:

where N is the number of neighbors of the node. Each computed NI value is used to update the next iteration, and the computation stops once the error between two successive iterations is smaller than the threshold. Writing the computation for all nodes in matrix form gives:

which simplifies to:

NI₂ = C·NI₁

where α is the damping coefficient, Matrix is the distribution-proportion matrix obtained by statistics, e is a unit matrix, eᵀ is the transpose of the unit matrix, NI₁ is the current NI value matrix, and NI₂ is the NI value matrix after the next iteration.
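Because the NI formulas themselves are not reproduced in this text, the following is only a sketch of one iteration scheme consistent with the description: a PageRank-style update with damping coefficient α in which importance flows from a pointed-to node back to the node that points to it. The even redistribution for nodes that nothing points to, the default α = 0.85, the L1 error measure, and the function name are assumptions, not the patent's formula.

```python
def node_importance(edges, alpha=0.85, tol=1e-8, max_iter=1000):
    """Iterate NI values until the change between two iterations is below tol.

    edges: list of (a, b) pairs meaning "a points to b", so NI flows from b to a.
    """
    nodes = sorted({n for edge in edges for n in edge})
    receivers = {n: [] for n in nodes}   # receivers[b]: nodes a with an edge a -> b
    for a, b in edges:
        receivers[b].append(a)
    ni = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        nxt = {n: (1.0 - alpha) / len(nodes) for n in nodes}
        for b in nodes:
            targets = receivers[b] or nodes   # nothing points to b: spread evenly (assumed)
            share = alpha * ni[b] / len(targets)
            for a in targets:
                nxt[a] += share
        error = sum(abs(nxt[n] - ni[n]) for n in nodes)
        ni = nxt
        if error < tol:
            break
    return ni

edges = [("queue", "enqueue"), ("queue", "dequeue"), ("queue", "queue initialization")]
ni = node_importance(edges)
```

In this toy graph the "queue" node collects importance from its three operation nodes and ends up with the largest NI value, which matches the intuition given for the knowledge point "queue".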

The actual state of every node in the knowledge graph is counted according to predefined rules; the more complete a node is, the higher its score. Taking related operations (such as initialization, insertion, and deletion) as an example, each related-operation relationship adds 2 points, up to a maximum of 15 points; for related implementations (such as chained implementation and queue implementation), each such relationship likewise adds 2 points, up to a maximum of 15 points.

The larger the centrality value NI of a node, the higher its weight and the greater the need to supplement it; the higher the statistical score, the more complete the node and the smaller the need. To combine the two, the final node missing-condition score is calculated as:

score(x) = NI(x) × (full(x) − has(x))

where x is the target node, full(x) is the full score under the rules for the node class to which x belongs, and has(x) is the statistical score of node x under the rules. A higher score indicates that the node is both more important and less complete, so completing it is more necessary.
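As a worked illustration of score(x) = NI(x) × (full(x) − has(x)), the sketch below applies the two rules quoted in the text (2 points per relationship, capped at 15). Treating full(x) as the sum of the caps, the rule names, the relationship counts, and the NI value 0.4 are assumptions made up for the example.

```python
def statistic_score(counts, rules):
    """has(x): points earned under the rules, each rule capped at its upper limit.

    rules: {item: (points_each, cap)}; counts: {item: number found on the node}.
    """
    return sum(min(points * counts.get(item, 0), cap)
               for item, (points, cap) in rules.items())

def missing_score(ni, full, has):
    """score(x) = NI(x) * (full(x) - has(x)); higher means completion is more urgent."""
    return ni * (full - has)

rules = {"related_operation": (2, 15), "related_implementation": (2, 15)}
full = sum(cap for _, cap in rules.values())       # assumed full score for this class
has = statistic_score({"related_operation": 3, "related_implementation": 1}, rules)
score = missing_score(0.4, full, has)              # 0.4 * (30 - 8)
```

A node with the same NI value but more recorded operations would earn a larger has(x) and therefore a smaller missing-condition score.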

For the parts found to need completion, an administrator can complete them directly, and for complex tasks a crowdsourcing task can be published so that the public completes them, as shown in FIG. 8. Users may also discover, while using a knowledge graph application, places where the graph needs completion, so they can proactively complete the knowledge graph through the crowdsourcing platform, as shown in FIG. 9. The platform also establishes a closed-loop points system to motivate and encourage users to carry out knowledge graph completion work.

3) Knowledge graph-based Chinese question-answering system

The invention builds an intelligent question-answering system on the programming education knowledge graph. Starting from a raw corpus, templates are generated by unsupervised clustering combined with manual labeling. A crawler is built to collect questions from Baidu Zhidao as the raw corpus, with the node names in the knowledge graph used as search keywords. The core workflow of the crawler is as follows:

(1) Take a keyword from the keyword list;

(2) Convert the keyword into a query URL and access it;

(3) Obtain the first 20 questions by regular-expression matching;

(4) If nothing is obtained, sleep for a random period, increase the sleep counter and the upper limit of the random sleep time, and go to step (1);

(5) If the result is not empty, store the questions as a document named after the keyword, and delete the keyword from the keyword list;

(6) After actively sleeping for a random period, go to step (1).
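The control flow of the six steps above can be sketched as a loop with a growing random back-off. The query URL and the regular expression are not given in the text, so `fetch_questions` and `save` are hypothetical callbacks standing in for steps (2), (3), and (5); the doubling of the sleep limit and the `max_failures` stop condition are likewise assumptions.

```python
import random
import time

def crawl(keywords, fetch_questions, save, sleep=time.sleep,
          base_wait=1.0, max_failures=5):
    """Collect up to 20 questions per keyword; return keywords left unprocessed.

    fetch_questions(keyword) -> list of question strings (hypothetical callback).
    save(keyword, questions) stores one document per keyword (hypothetical callback).
    """
    pending = list(keywords)
    failures = 0                                    # sleep counter of step (4)
    while pending and failures < max_failures:      # stop condition is assumed
        keyword = pending[0]                        # step (1)
        questions = fetch_questions(keyword)[:20]   # steps (2) and (3)
        if not questions:                           # step (4): back off, then retry
            failures += 1
            sleep(random.uniform(0, base_wait * 2 ** failures))
            continue
        save(keyword, questions)                    # step (5)
        pending.pop(0)
        sleep(random.uniform(0, base_wait))         # step (6)
    return pending

# Dry run with canned responses and no real waiting.
responses = {"queue": ["what is a queue?"], "stack": []}
saved = {}
remaining = crawl(["queue", "stack"], lambda k: responses[k],
                  saved.__setitem__, sleep=lambda seconds: None)
```

Passing the fetch, save, and sleep functions in as arguments keeps the loop testable without network access.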

The crawled data are then cleaned and preprocessed. First, low-quality questions are eliminated: overlong questions ending in the fixed '…' mark are deleted by matching from the end; ambiguous keywords are screened so that domain-irrelevant questions can be removed manually; and non-genuine questions such as multiple-choice or exam questions are filtered by pattern-matching markers such as the "()" of a multiple-choice item.

Next, meaningless segments are removed from the questions. They fall into the following classes: special symbols and repeated punctuation are deleted by pattern matching; a dictionary of polite phrases such as "thank you" and "may I ask" is built to recognize and remove such wording; modal particles are deleted using a Chinese stop-word dictionary; and the small number of remaining semantically irrelevant segments are removed manually.
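The pattern-matching and dictionary steps can be illustrated with a small sketch. The character classes, the two polite phrases (assumed Chinese renderings of "thank you" and "may I ask"), and the short particle list are assumptions; a real stop-word dictionary would be far larger, and removing particles by plain substring replacement is deliberately naive.

```python
import re

POLITE_PHRASES = ["谢谢", "请问"]            # "thank you", "may I ask" (assumed renderings)
MODAL_PARTICLES = ["啊", "呢", "吧", "呀"]    # assumed sample of a stop-word dictionary
PUNCTUATION = ",.?!，。？！、"

def clean_question(question):
    # Drop special symbols: keep word characters (including CJK), spaces, basic punctuation.
    text = re.sub(rf"[^\w\s{PUNCTUATION}]", "", question)
    # Collapse repeated punctuation such as "???" into a single mark.
    text = re.sub(rf"([{PUNCTUATION}])\1+", r"\1", text)
    # Remove polite phrases and modal particles by dictionary lookup.
    for phrase in POLITE_PHRASES + MODAL_PARTICLES:
        text = text.replace(phrase, "")
    return text.strip()

cleaned = clean_question("请问★什么是双向链表呢？？？谢谢")   # "may I ask, what is a doubly linked list???, thanks"
```

In practice the particle removal would more safely run after word segmentation, so that a particle character inside a longer word is left alone.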

A template captures the semantic and structural commonality between questions. For example, the questions "definition of bubble sort" and "definition of doubly linked list" differ only in their subject words; both ask for the definition of a knowledge point and can be generalized into the same template. Masking the subject words of each question according to the keyword list extracted above therefore increases the similarity between potentially similar questions, so that they are more easily grouped into one class.
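Subject-word masking can be sketched as a simple placeholder substitution. The numbered `<T1>`, `<T2>` placeholder style, the longest-keyword-first order, and the function name are assumptions for illustration.

```python
def mask_subject_words(question, keywords, token="<T{}>"):
    """Replace each known subject word with a numbered placeholder.

    Returns the masked question and the list of subject words that were found.
    """
    found = []
    # Try longer keywords first so "doubly linked list" wins over "linked list".
    for kw in sorted(keywords, key=len, reverse=True):
        if kw in question:
            found.append(kw)
            question = question.replace(kw, token.format(len(found)))
    return question, found

keywords = ["bubble sort", "linked list", "doubly linked list"]
masked_a, found_a = mask_subject_words("definition of bubble sort", keywords)
masked_b, found_b = mask_subject_words("definition of doubly linked list", keywords)
```

After masking, both example questions become the same string, which is what lets clustering place them in one class; the number of subject words found is also available for splitting the corpus by subject-word count.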

Analysis of the preprocessed corpus shows that almost all questions contain only one or two subject words, and questions containing more than three subject words account for only 1%. Because the number of subject words is one of the keys used in template processing, the corpus can first be divided into three classes by the number of subject words and each class clustered separately without supervision. Similar questions then fall into the same cluster more reliably, which makes it easier to distill templates.

Text clustering is performed on the preprocessed question data, dividing it into subsets such that elements within a subset are as similar as possible and elements of different subsets are as dissimilar as possible; these subsets are the basis for generating templates. The preprocessed question corpus is a series of variable-length lists of Chinese and English words and punctuation, so to apply a clustering algorithm the relevant features must be extracted and converted into fixed-length vectors.

The invention adopts the BERT-wwm model, which has 12 Transformer layers, a hidden_size of 768, and 12 attention_heads. The output of each Transformer layer can in theory be used as a sentence vector. As the visualization of the BERT layers on the UCI-News Aggregator dataset in FIG. 10 shows (pool_strategy=reduce_mean), the last layer (pooling_layer=-1) is too close to the training objective and tends to over-fit it, which biases the semantic representation when the model is transferred without fine-tuning. The first layer (pooling_layer=-12) is too close to the raw word embeddings and lacks contextual semantic information. In general, taking the penultimate layer as the sentence vector gives better results. The invention therefore uses the penultimate layer as the feature and converts each question into a 768-dimensional sentence vector.
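The pooling step can be sketched independently of any particular BERT implementation: take the token vectors of the penultimate layer and average them over the real (non-padding) tokens. Plain nested lists stand in for model tensors here, and the function name and toy dimensions are assumptions.

```python
def sentence_vector(layer_outputs, attention_mask, layer=-2):
    """Mean-pool one encoder layer over the unmasked tokens.

    layer_outputs: list over layers of [tokens][hidden] nested lists.
    attention_mask: 1 for real tokens, 0 for padding.
    layer=-2 selects the penultimate layer.
    """
    hidden = layer_outputs[layer]
    kept = [vec for vec, keep in zip(hidden, attention_mask) if keep]
    return [sum(column) / len(kept) for column in zip(*kept)]

# Toy example: 3 layers, 3 token positions (the last one is padding), hidden size 2.
layers = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # penultimate layer
    [[5.0, 5.0], [5.0, 5.0], [5.0, 5.0]],
]
vector = sentence_vector(layers, attention_mask=[1, 1, 0])
```

With a real 12-layer model and hidden_size 768, the same averaging would yield one 768-dimensional vector per question.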

For text clustering tasks, a simple and practical algorithm is the k-means algorithm. Its basic idea is: given n vectors $x_1, \dots, x_n$ and an integer $z$, find $z$ clusters $C_1, \dots, C_z$ and their respective centers $m_1, \dots, m_z$ such that the standard k-means objective, the within-cluster sum of squared distances, is minimized:

$\sum_{j=1}^{z} \sum_{x_i \in C_j} \lVert x_i - m_j \rVert^2$

The algorithm comprises the following steps:

(1) K points are selected as initial centroids;

(2) Assigning all data points to the cluster in which the nearest centroid is located;

(3) Recalculating a new centroid for each cluster;

(4) Repeating (2) and (3) until the centroid is no longer changing.
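The four steps above can be sketched in plain Python. Random initialization from the data points, the fixed seed, the iteration cap, and keeping the old centroid when a cluster becomes empty are assumptions of this sketch rather than details from the text.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                       # step (1)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                    # step (2)
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new_centroids = [                                   # step (3)
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:                      # step (4)
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = kmeans(points, k=2)
```

For the question corpus, each point would be a 768-dimensional sentence vector instead of a two-dimensional toy point.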

Another clustering method is the affinity propagation clustering algorithm (Affinity Propagation), which has the following advantages over the k-means algorithm:

(1) The number of final cluster clusters is not required to be manually specified;

(2) The generated cluster centers are existing data points and not newly generated;

(3) Insensitive to the initial value of the data;

(4) Compared with the k-means clustering method, the square error of the result is smaller.

The invention mainly uses the k-means clustering algorithm and the affinity propagation clustering algorithm for the clustering experiments. The silhouette coefficient of k-means clustering is computed for different values of k; at k = 17 the silhouette coefficient is about 0.6 and the clustering effect is best.
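For reference, the silhouette coefficient used to compare values of k can be computed as in the sketch below. Euclidean distance and the convention that a singleton cluster contributes 0 are assumptions; the function expects at least two clusters and no cluster made up entirely of identical points.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient; values near 1 mean tight, well-separated clusters."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)

    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)

    total = 0.0
    for p, lab in zip(points, labels):
        own = groups[lab]
        if len(own) == 1:
            continue                     # singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(mean_dist(p, g) for other, g in groups.items() if other != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

Selecting k then amounts to clustering once per candidate k, scoring each labeling this way, and keeping the k with the highest score.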

The questions in each cluster have similar semantic structures. A template format is therefore defined first, then each class of questions is manually annotated with a template, and finally the template library is assembled. The invention uses the JSON format for templates: each template can be loaded into the question-answering system as a JSON object, and its structure is defined as follows:

The template priority determines which template wins when several templates match at the same time: the template with the higher priority (a priority value closer to 1) is matched first. The template's number of subject words serves as a constraint that prevents a question with few subject words from being matched against a template with many, which effectively improves matching efficiency. The template description distinguishes templates from one another and improves readability, making the templates easier for annotators and maintainers to manage. The template example sentences are usually the center sentences of the corresponding cluster; several sentences may be given to widen the matching range. Once the subject words are filled in, the query template becomes an executable Cypher statement that can run the database query. Finally, the answer template may contain multiple answer sentence patterns for generating multiple answers.
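Since the JSON structure itself is not reproduced in this text, the following is only a hypothetical template carrying the six fields described above, together with a minimal placeholder-filling helper. Every field name, the `definition` relationship type, the `<A>` answer placeholder, and the example sentences are assumptions.

```python
import json

# Hypothetical template; all field names and values are illustrative only.
TEMPLATE_JSON = """
{
  "priority": 1,
  "subject_count": 1,
  "description": "definition of a knowledge point",
  "examples": ["what is <T1>", "definition of <T1>"],
  "query": "MATCH p = (n {name: '<T1>'})-[:definition]->(d) RETURN d.name AS answer, p",
  "answers": ["<T1> is defined as: <A>", "The definition of <T1> is <A>"]
}
"""

def fill(text, subject, answer=None):
    """Fill the subject placeholder and, if given, the answer placeholder."""
    text = text.replace("<T1>", subject)
    return text if answer is None else text.replace("<A>", answer)

template = json.loads(TEMPLATE_JSON)
cypher = fill(template["query"], "queue")
reply = fill(template["answers"][0], "queue", "a first-in, first-out linear list")
```

Plain string substitution is used here only to mirror the fill-in-the-blank description; in a deployed system, passing the subject word as a Cypher query parameter would avoid quoting and injection problems.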

The labeling of one cluster is described as an example; the template labeling process is shown in fig. 11. The original cluster belongs to the single-topic-entity class, and its questions are relatively basic and very frequent, so it is given a high priority. Analysis of the original cluster shows that its questions mainly concern the definition of the topic entity. Two sentences close to the cluster center are selected as example sentences, and one more sentence with a noticeably different sentence structure is added to make matching more stable. Next, a Cypher query statement is manually constructed for the template, in which <T1> is filled with the corresponding entity from the matched question; in addition to the query result, the query always returns the query path for knowledge visualization. Then two answer templates are constructed, and the topic entity and the query result are filled into the corresponding positions in the templates to generate the answer sentences returned to the user.
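The slot filling described above can be sketched as plain string substitution. The `<T1>` placeholder comes from the text; the `<R1>` result placeholder, the function name `fill_slots` and the sample query are assumptions made for this illustration.

```python
def fill_slots(text, topics, results=()):
    """Replace <T1>, <T2>, ... with topic entities and <R1>, <R2>, ...
    (a hypothetical result placeholder) with query results."""
    for i, topic in enumerate(topics, start=1):
        text = text.replace(f"<T{i}>", topic)
    for i, result in enumerate(results, start=1):
        text = text.replace(f"<R{i}>", str(result))
    return text

# Illustrative query template: returns the neighbour m and the path p,
# so that the path can be used for knowledge visualization.
query_template = "MATCH p=(n {name: '<T1>'})-[r]->(m) RETURN m, p"
answer_template = "<T1> is <R1>."

cypher = fill_slots(query_template, ["quick sort"])
answer = fill_slots(answer_template, ["quick sort"], ["a sorting algorithm"])
```

Direct substitution mirrors the description in the text; a production system would more likely pass the entity name as a Cypher query parameter so that user input cannot alter the query.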

The above work completes the generation of the template library. Subsequent upgrades of the template library follow the same flow, so it is necessary to build this flow into a subsystem that serves as the update system for the template library. The source of the original questions is then no longer questions crawled from Baidu but the user questions actually collected by the question-answering system. After receiving a question, the subsystem only needs to mask the identified topic entities before storing the question in the question library. When the collected questions reach a certain scale, cluster analysis and template labeling are carried out on the questions in the question library to update the template library. The subsystem is also responsible for managing the template library, so it provides the question-answering system with a reading interface; the question-answering system can directly call the related methods to read the template file and convert it into template class instances for template matching. The advantage of this design is that the question-answering system can continue to provide service while the template library is being updated and simply loads the new template file once the update is finished, which improves the stability of the system.

The data flow diagram of the system is shown in fig. 12. The natural language question entered by the user first undergoes topic entity recognition to extract the topic words, and template matching is then carried out. After a successful match, the Cypher query statement is executed against the knowledge graph, and the query result is assembled into a natural language answer and a query subgraph, both of which are displayed to the user.

As the first step in the natural language question processing flow, the topic entity recognition task can be divided into sequence labeling and entity linking.

The sequence labeling task is as follows: given a sequence x = x₁x₂…x_n, find the label sequence y = y₁y₂…y_n in which each label corresponds to one element of x; the set of all possible label values is called the label set. Here the task is used to extract key information strings from the question. The invention uses a lexical analysis system based on a CRF model. For nouns that have the same name as knowledge point nodes in the knowledge graph, a dictionary can be used to assist the CRF in word segmentation. For the example sentence "the time complexity of merge sort", once the dictionary is set, the term "merge sort" is recognized well. A long concept noun such as "time complexity" cannot be counted as a topic entity, because the knowledge graph has no specific node for it. However, to reduce the workload of the subsequent entity linking, consecutive non-topic nouns are merged, and the labeling sequence finally obtained is: "merge sort/nto of/u [time/n complexity/n]/n".
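The merging of consecutive non-topic nouns can be illustrated with a short sketch. The tags "nto" (topic entity matched by the dictionary), "n" (common noun) and "u" (particle) follow the labeling sequence in the text; the function name `merge_non_topic_nouns` and the `joiner` parameter are hypothetical.

```python
def merge_non_topic_nouns(tagged, joiner=" "):
    """Merge each run of consecutive common nouns (tag "n") into one noun.

    `tagged` is a list of (word, tag) pairs. Tokens with other tags,
    including topic entities tagged "nto", are kept unchanged.
    """
    merged = []
    run = []  # current run of consecutive "n" tokens

    def flush():
        if run:
            merged.append((joiner.join(word for word, _ in run), "n"))
            run.clear()

    for word, tag in tagged:
        if tag == "n":
            run.append((word, tag))
        else:
            flush()
            merged.append((word, tag))
    flush()
    return merged

tokens = [("merge sort", "nto"), ("of", "u"), ("time", "n"), ("complexity", "n")]
result = merge_non_topic_nouns(tokens)
```

For unsegmented Chinese text the pieces would be concatenated directly, which corresponds to calling the function with `joiner=""`.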

In the step above, some topic entities have already been tagged "nto" by means of the dictionary, but the dictionary cannot recognize synonyms effectively. For example, the abbreviated name of quick sort is a common synonym for the full term "quick sort": in the word segmentation result the abbreviation is tagged "n" as a common noun, whereas the full term is tagged "nto" as a topic entity through dictionary matching. It is therefore necessary to link the abbreviation to the "quick sort" node in the knowledge graph and to convert the abbreviation in the original sentence into the topic entity "quick sort".

The invention adopts a vector-based similarity method for entity linking. First, all the keywords extracted from TopicLoader are encoded with BERT, each word being encoded into a 768-dimensional vector. A total of 496 keywords are collected, which yields a 496×768 keyword matrix. Each candidate noun is encoded in the same way, and its Pearson correlation coefficient with each row of the keyword matrix is then calculated. This coefficient measures the linear correlation between two vectors and takes values between -1 and 1. The calculation formula is as follows:

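$$r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$

where X_i is the i-th dimension of the first word vector, Y_i is the i-th dimension of the second word vector, \bar{X} and \bar{Y} are the means of the first and second word vectors respectively, and n is the number of dimensions. According to experiments, setting the correlation coefficient threshold to about 0.8 gives better accuracy.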
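The linking step can be sketched in a few lines. This is an illustration under stated assumptions: short toy vectors stand in for the 768-dimensional BERT encodings, the function names `pearson` and `link_entity` are hypothetical, and the only values taken from the text are the Pearson coefficient itself and the threshold of about 0.8.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # undefined for constant vectors; treated as uncorrelated
    return cov / (sd_x * sd_y)

def link_entity(candidate_vec, keyword_vecs, threshold=0.8):
    """Return the keyword most correlated with the candidate, or None
    if no keyword reaches the threshold."""
    best_name, best_r = None, threshold
    for name, vec in keyword_vecs.items():
        r = pearson(candidate_vec, vec)
        if r >= best_r:
            best_name, best_r = name, r
    return best_name

# Toy 4-dimensional vectors in place of real BERT encodings.
keywords = {"quick sort": [1.0, 2.0, 3.0, 4.0], "merge sort": [4.0, 3.0, 2.0, 1.0]}
linked = link_entity([1.1, 2.0, 2.9, 4.2], keywords)
unlinked = link_entity([1.0, -1.0, 1.0, -1.0], keywords)
```

A candidate that falls below the threshold for every keyword is left unlinked, so it is not converted into a topic entity.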

After topic entity recognition has been performed on the original question, the number and names of the topic entities it contains are known, and template matching is carried out next. The template matching steps adopted by the invention are as follows:

(1) Select a candidate template set T such that the TopicNum of every template in T is less than or equal to the number of topic entities in the question;

(2) For each template T_i in T, fill the topic entities of the question, in order, into the example sentence templates in the Examples field of T_i to generate an example sentence matrix S_i, in which each row is a generated natural language example sentence. Combine all example sentence matrices into an example sentence set S;

(3) For each S_i in S, calculate the similarity between the entity-linked user question and each example sentence, and take the highest similarity value as the matching degree of the template, denoted c_i;

(4) Select the template T_j with the highest matching degree. If its matching degree is greater than a preset threshold ρ, the template is matched successfully; otherwise template matching fails, and a default template is selected as the matching result.

When matching fails, the default template returns the subgraphs of the knowledge graph related to the topic entities in the user's question and does not generate a natural language answer.
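Steps (1) to (4) and the fallback to the default template can be sketched as follows. Several points are assumptions, because the text does not fix them: the similarity measure (here `difflib.SequenceMatcher` from the Python standard library is used as a stand-in), the threshold value `rho=0.6`, the dictionary layout of a template, and the function names. Priority-based tie-breaking is omitted for brevity.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in measure; the text does not specify the similarity function.
    return SequenceMatcher(None, a, b).ratio()

def fill(example, topics):
    """Fill topic entities, in order, into the <T1>, <T2>, ... slots."""
    for i, topic in enumerate(topics, start=1):
        example = example.replace(f"<T{i}>", topic)
    return example

def match_template(question, topics, templates, default, rho=0.6):
    """Return (template, matching degree); falls back to `default`."""
    # (1) candidate set: TopicNum <= number of topic entities in the question
    candidates = [t for t in templates if t["TopicNum"] <= len(topics)]
    best, best_score = None, 0.0
    for t in candidates:
        # (2) generate the example sentences of this template
        sentences = [fill(ex, topics) for ex in t["Examples"]]
        # (3) matching degree c_i = highest similarity to any example sentence
        c = max((similarity(question, s) for s in sentences), default=0.0)
        if c > best_score:
            best, best_score = t, c
    # (4) accept only if the best matching degree exceeds the threshold rho
    if best is not None and best_score > rho:
        return best, best_score
    return default, best_score

definition = {"Description": "definition", "TopicNum": 1,
              "Examples": ["What is <T1>?"]}
comparison = {"Description": "comparison", "TopicNum": 2,
              "Examples": ["What is the difference between <T1> and <T2>?"]}
default = {"Description": "default", "TopicNum": 0, "Examples": []}

chosen, score = match_template("What is quick sort?", ["quick sort"],
                               [definition, comparison], default)
```

In this sketch the comparison template is excluded in step (1) because it requires two topic entities while the question contains only one.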

Examples

Several sets of test data were designed and constructed according to real question-answering scenarios to test the accuracy of the template matching algorithm; the results are shown in table 3. The first and second sets of test data consist of questions, collected through a questionnaire, that respondents genuinely wanted to ask the question-answering system. The third and fourth sets come from the related Baidu Zhidao questions crawled earlier; about 50% of the third set are single-topic-entity questions, and about 70% of the fourth set are single-topic-entity or dual-topic-entity questions.

Table 3 accuracy test

Analysis of the test results shows that the accuracies of the first and second groups are similar, while the accuracies of the third and fourth groups are slightly higher because these groups share the same corpus source as the template library. In the inputs of the first and second groups, the ratio of simple questions to complex questions matches the ratio found in practice, so their results are closer to real-world performance. Part of the test data and results is shown in table 4.

Table 4 partial test data and results

Overall, the accuracy of the current template matching algorithm is basically adequate for question-answering tasks. Because the template library is pluggable and easy to update, the matching accuracy can be improved gradually in the future by iteratively upgrading the template library.

The final display of the intelligent question-answering system is shown in fig. 13. The natural language answer is displayed in the answer box in the middle, the matched template and topic words are shown in small light-gray text above the answer, and the knowledge graph subgraph is displayed below the answer.

The performance of the system was evaluated mainly in terms of answer response time. Five sets of test questions were designed, and the tests were run on an Intel Core i7-8750H CPU (2.20 GHz) platform; the results are shown in table 5. When the repetition rate of the input questions is high, caching markedly improves the response speed of the system. Keyword extraction and the Neo4j connection and query stage have a large influence on performance; by increasing the number of concurrent tasks and extending the cache duration, the response time can ultimately be kept within an acceptable range.
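The effect of caching repeated questions for a configurable duration can be illustrated with a minimal time-to-live cache. The text does not describe how the cache of the invention is implemented, so the `TTLCache` class, its method names and the injectable clock are all assumptions made for this sketch.

```python
import time

class TTLCache:
    """Cache answers for a fixed duration, keyed by the question text."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._store = {}     # question -> (time stored, answer)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]   # cache hit: skip extraction and graph query
        value = compute(key)  # cache miss or expired entry: recompute
        self._store[key] = (now, value)
        return value

# Usage with a fake clock and a stand-in for the full answering pipeline.
fake_now = [0.0]
calls = []

def answer_question(question):
    calls.append(question)
    return question.upper()

cache = TTLCache(ttl_seconds=10, clock=lambda: fake_now[0])
first = cache.get_or_compute("what is a stack?", answer_question)
second = cache.get_or_compute("what is a stack?", answer_question)
```

A longer duration keeps more repeated questions on the fast path, at the cost of serving answers that may lag behind updates to the knowledge graph or template library.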

Table 5 performance test results

Claims

1. A method for constructing, completing and intelligently asking and answering a programming education knowledge graph, characterized by comprising the following specific steps:

step 1: constructing a programming vertical domain knowledge graph containing programming foundation, data structure and algorithm course knowledge, which specifically comprises the following steps:

c2: cases in which the missing content is complicated and the completion task is large are handled through a crowdsourcing scheme, namely, a data manager publishes four kinds of tasks, such as adding questions, adding nodes and modifying nodes, on the constructed crowdsourcing platform for users to solve, and users actively publish completion schemes on the crowdsourcing platform while using the platform;

e1: manually extracting knowledge points in the knowledge graph, constructing a knowledge point entity dictionary, and using the HanLP word segmentation tool to perform word segmentation, sequence labeling and part-of-speech prediction on questions, wherein knowledge point entities in a question that match the entity dictionary are tagged with a special part of speech;