CN114238653A - Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education - Google Patents

Info

Publication number
CN114238653A
Authority
CN
China
Prior art keywords
knowledge
question
knowledge graph
template
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111491707.3A
Other languages
Chinese (zh)
Inventor
冯博
王丽苹
宋培东
李逸飞
周琪丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202111491707.3A
Publication of CN114238653A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance

Abstract

The invention discloses a method for constructing and completing a programming education knowledge graph and for intelligent question answering based on it, comprising the following steps: a) processing programming domain knowledge, building a semi-automatic entity recognition tool by combining BiLSTM-CRF with a manually constructed knowledge point term dictionary, and using the tool to assist the manual construction of a programming domain knowledge graph; b) finding the parts of the knowledge graph that need to be supplemented through a knowledge graph quality evaluation algorithm based on node centrality, and completing the knowledge graph either directly or by crowdsourcing; c) building an intelligent question-answering system for the programming education field on top of the programming domain knowledge graph. Compared with traditional related methods, the intelligent question-answering system is built on the constructed programming education knowledge graph, which solves the practical problem of applying a knowledge graph to a question-answering system all the way from construction through completion.

Description

Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education
Technical Field
The invention belongs to the field of computer science, and particularly relates to a method for constructing and completing a programming education knowledge graph and for intelligent question answering based on it.
Background
A knowledge graph organizes knowledge intelligently and efficiently so that applications can query the information they need accurately and quickly, which offers a new approach to knowledge management. By integrating information such as hypernym-hyponym relations and attributes, it can support data mining and question-answering systems, and it has become one of the core driving forces of artificial intelligence. In recent years, as knowledge graph technology has developed, the focus of knowledge graph application and research has gradually shifted from open-domain knowledge graphs to domain knowledge graphs, and deep integration of knowledge graphs with individual industries has become a trend. Because a domain knowledge graph concentrates on knowledge in a specific field, it is more targeted and more specialized, and is often more valuable in application.
In the Freebase data set, 75% of people have no nationality information and 71% have no place-of-birth information; missing entity type information of this kind can seriously affect precision and recall in applications. Knowledge graph completion is one of the hot topics in knowledge graph construction. When a knowledge graph is constructed manually, it is inevitable that the knowledge is incomplete, that there are many gaps, and that implicit relations between entities are not fully mined, so a method is needed to address the incompleteness and sparsity of the knowledge graph.
Intelligent question answering is an important research direction in the currently very active field of natural language processing; it has attracted wide attention across industries, and application attempts have been made in fields such as the Internet, medical care and finance.
Investigation shows that no Chinese knowledge graph currently focuses on the vertical field of computer science, that the open-domain Chinese knowledge graphs that touch on computing suffer from low data quality and thin content, and that no suitable Chinese question-answering system is available. The method constructs a knowledge graph for the vertical field of computer science and provides a solution for evaluating the knowledge graph and updating it stably over the long term, and is pioneering and inventive.
Research also finds that current text-based question-answering systems usually store knowledge as plain text or HTML, such as encyclopedia text; research on answering questions through a knowledge graph therefore still has practical significance and room for development.
Disclosure of Invention
The invention aims to provide a method for constructing and completing a programming education knowledge graph and for intelligent question answering based on it. The method constructs a knowledge graph for the field of computer science and provides a knowledge graph quality evaluation algorithm together with a solution for updating the knowledge graph stably over the long term; a question-answering system is built on the dynamically updated knowledge graph, so that massive knowledge resources can be organized in a structured, interlinked way and questions can be answered more efficiently.
The specific technical scheme for realizing the purpose of the invention is as follows:
A method for constructing and completing a programming education knowledge graph and for intelligent question answering based on it comprises the following specific steps:
Step 1: constructing a programming vertical-domain knowledge graph containing programming fundamentals, data structure and algorithm course knowledge, which specifically comprises:
A1: extracting an ontology schema and knowledge points from structured knowledge sources in books and on websites, and obtaining the ontology constraints in a top-down manner using a five-step ontology construction method: determining the domain and scope of the ontology, listing the important terms in the ontology, defining the classes and the class hierarchy, defining the attributes of the classes, and defining the relationships between the classes;
A2: taking each sentence containing knowledge points in the text of programming books and websites as corpus data, manually labeling the entities it contains, and assembling the labeled corpus data into a corpus data set; the data are labeled with the BIO (Begin, Inside, Outside) scheme, i.e. each word in a sentence is marked as the beginning of a knowledge point term entity, the inside of a knowledge point term, or another non-knowledge-point word; the knowledge point term entities collected from the corpus are merged to obtain a knowledge point term dictionary;
A3: performing entity recognition with the BiLSTM-CRF model in combination with the knowledge point term dictionary constructed in step A2, i.e. a bidirectional LSTM network serves as the feature extractor and the sequence labeling algorithm CRF produces the named entity recognition output;
A4: manually extracting entities and relations from programming books and website knowledge bases, and combining them with the entities recognized in step A3 to form structured text in head entity-relation-tail entity form;
A5: storing the structured text obtained in step A4 into a Neo4j database as the data layer, subject to the ontology constraints constructed in step A1, thereby obtaining a programming vertical-domain knowledge graph containing programming fundamentals, data structure and algorithm course knowledge;
Step 2: finding the imperfect parts of the knowledge graph constructed in step 1 using a knowledge graph quality evaluation algorithm based on node centrality, which specifically comprises:
B1: the importance (weight) of a node is represented by its NI value; if node A points to node B, the NI value of node B is added to the NI value of node A; the NI values calculated in each round are used for the next iteration until the error between two successive iterations is smaller than a threshold;
B2: for each node of the knowledge graph, counting whether it has a definition and the number of related data-structure operations, operation codes and questions associated with the knowledge point, and calculating a score from the counts;
B3: multiplying the node NI value calculated in step B1 by the statistical score from step B2 to obtain the final node perfection score, wherein an entity with a low score is an imperfect part of the knowledge graph;
Step 3: completing the imperfect parts of the knowledge graph found in step 2, which specifically comprises:
C1: for the nodes with low perfection scores calculated in step B3, the data administrator first completes them directly in the Neo4j database;
C2: complicated gaps and very large completion workloads are handled through crowdsourcing: the data administrator publishes four types of tasks (adding questions, adding nodes, modifying nodes and other tasks) on the constructed crowdsourcing platform for users to solve, and users may also proactively submit completion proposals on the crowdsourcing platform while using it;
C3: the administrator reviews the completion proposals submitted by users on the crowdsourcing platform in step C2, approves those that pass, and completes the knowledge graph according to the approved proposals, obtaining a refined programming vertical-domain knowledge graph containing programming fundamentals, data structure and algorithm course knowledge;
Step 4: constructing the template library on which the intelligent question-answering system relies, which specifically comprises:
D1: crawling raw question corpus data, posed in natural language, from the Baidu Zhidao platform, cleaning the raw data and removing the parts of each question that carry no practical meaning, including special symbols, repeated punctuation, polite phrases and modal particles;
D2: similar questions share semantics and structure and differ only in their subject words; the subject words of the questions preprocessed in step D1 are masked, the masked question data are converted into fixed-length vectors by the BERT model to extract features, and the texts are clustered with the k-means algorithm and the affinity propagation clustering algorithm so that questions with similar structure fall into one class;
D3: defining a template format for each class of question obtained by the text clustering in step D2, the template comprising a question template and an answer template; according to the characteristics of the class, converting the question into a Cypher statement that queries the knowledge graph obtained in step C3, and filling the answer template with the results returned by the Cypher statement, thereby completing the construction of the template library;
Step 5: verification of the question-answering system, which specifically comprises:
E1: manually extracting the knowledge points in the knowledge graph to construct a knowledge point entity dictionary; performing word segmentation, sequence labeling and part-of-speech prediction on the question with the HanLP word segmentation tool, and labeling the knowledge point entities in the question that match the entity dictionary with a special part of speech;
E2: vectorizing the entities in the knowledge point entity dictionary of step E1 and the words obtained by segmenting the question in step E1, and calculating the similarity between them to perform entity linking;
E3: calculating the similarity between the entity-linked question and the question templates in the template library constructed in step D3 to match a template; if the matching degree is greater than a preset threshold, the template is matched successfully, otherwise a default template is selected as the matching result; searching the knowledge graph with the Cypher statement given by the template and filling the returned result into the answer template to obtain the answer to the question.
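The template matching with a threshold and a default fallback in step E3 can be sketched as follows. This is only an illustration: the text does not say which similarity measure is used, so cosine similarity over vectors is an assumption, and the names `match_template`, `threshold` and the template keys are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_template(question_vec, templates, threshold=0.8, default="default"):
    """Return the best-matching template name, or the default below the threshold.

    templates maps a template name to its vector (assumed representation).
    """
    best_name, best_sim = default, -1.0
    for name, vec in templates.items():
        sim = cosine(question_vec, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim > threshold else default

templates = {"definition": [1.0, 0.0], "comparison": [0.0, 1.0]}
print(match_template([0.9, 0.1], templates))
print(match_template([1.0, 1.0], templates))
```

The second call falls back to the default template because neither similarity exceeds the threshold.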
The invention has the beneficial effects that:
The invention constructs a knowledge graph for the field of computer science and provides a knowledge graph quality evaluation algorithm together with a solution for updating the knowledge graph stably over the long term; a question-answering system is built on the dynamically updated knowledge graph, so that massive knowledge resources can be organized in a structured, interlinked way and questions can be answered more efficiently. The dynamic framework provided by the invention covers the whole process from building the computer science vertical-domain knowledge graph to building the question-answering system, and is continuously updated to help programming learners solve problems in related fields.
Drawings
FIG. 1 is a diagram of a programming education knowledge-graph logic structure;
FIG. 2 is a diagram of a programming education knowledge graph ontology hierarchy;
FIG. 3 is a hierarchical diagram of relationships in a programming education knowledge graph;
FIG. 4 is a schematic diagram of the scope of relationships in a programming education knowledge graph;
FIG. 5 is a flow diagram of knowledge point term named entity recognition;
FIG. 6 is a BiLSTM-CRF model diagram for named entity recognition;
FIG. 7 is a schematic diagram of the training effect of the BiLSTM-CRF model;
FIG. 8 is a schematic diagram of the activity in which an administrator publishes knowledge graph completion tasks;
FIG. 9 is a diagram of the activity in which a user completes the knowledge graph;
FIG. 10 is a PCA visualization of the BERT layers on the UCI-News Aggregator dataset;
FIG. 11 is a diagram illustrating an exemplary labeling of a template;
FIG. 12 is a data flow diagram of a question-answering system;
FIG. 13 is a diagram of an intelligent question and answer system response interface.
Detailed Description
1) Construction of knowledge graph
First, a knowledge graph for the programming domain is constructed preliminarily: the required knowledge point entities and relations are extracted manually from teaching materials in the programming education field, and the data are stored into the graph in batches, which facilitates subsequent construction and completion.
The invention constructs the programming education knowledge graph manually and top-down: an ontology schema and knowledge point information are extracted from books and structured knowledge sources on the web, and knowledge is then added to the knowledge base under the constraint of the constructed ontology. The logical structure of the constructed knowledge graph is shown in fig. 1. A Neo4j graph database serves as the data layer and stores knowledge in graph form; the ontology model serves as a schema layer above the data layer, governs the add, delete and modify operations on the data layer, and acts much like a mold.
Because no existing ontology can be reused, the ontology model is built as a higher-level abstract and guiding layer, and the data are stored uniformly in the Neo4j database. With reference to some of the steps of the seven-step method, the ontology is constructed in five steps: determining the domain and scope of the ontology, listing the important terms in the ontology, defining the classes and the class hierarchy, defining the attributes of the classes, and defining the relationships between the classes.
Aiming at the field of programming education, knowledge points in the field of computer science are organized into modules according to subjects, so that the method has good reusability and expansibility, and related terms are arranged. And (3) constructing the hierarchical relationship of the domain ontology by adopting a top-down method, namely, firstly establishing an upper-layer concept and then subdividing a lower-layer concept. The constructed hierarchical relationship is shown in fig. 2.
Considering that the knowledge graph provides service for the intelligent system, the invention builds the question-answering system by relying on the knowledge graph, and the quantity and the name of the node attribute are unified for reducing the complexity of processing each node in the question-answering system. Thus, as shown in Table 1, defining that a node should have four internal attributes, the "label" attribute stores the node's label in the form of a string, i.e., the name of the ontology class to which the node instance belongs; the "id" attribute stores an int value to uniquely identify the node; the "name" attribute stores the name of the node in a string; the property attribute is used for storing a character string, and the specific meaning is jointly defined by label and name of the node.
Table 1 Attribute definitions
Figure BDA0003398612990000051
The inter-class relationship has rich meaning and characteristics. Similar to the hierarchy of the classes, the relationship can also be divided into a plurality of hierarchies, and the hierarchy has the advantages that the scope of the parent class relationship can be inherited as the subclass relationship, and the granularity of the relationship expression in the knowledge graph is moderate. Eighteen relationships are defined, and the hierarchy is shown in fig. 3.
The scope definitions are shown in fig. 4. The "related description" relation connects knowledge points to descriptions, and its sub-relations correspond one-to-one to the subclasses of the description class.
In order to simplify the use of a knowledge graph management tool Neo4j and facilitate the knowledge graph construction process, the invention designs a knowledge graph construction tool, and the knowledge graph can be manipulated by editing a triple format.
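As an illustration of how an edited triple could be turned into a graph update, the sketch below converts a head entity-relation-tail entity triple into a parameterized Cypher `MERGE` statement. The function name `triple_to_cypher`, the label names and the restriction to ASCII identifiers are assumptions made for this example; the construction tool itself is not described in that detail.

```python
import re

# Labels and relationship types cannot be passed as Cypher parameters,
# so they are validated before being placed into the query text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def triple_to_cypher(head, relation, tail, head_label, tail_label):
    """Build a Cypher MERGE query and its parameters for one triple."""
    for ident in (relation, head_label, tail_label):
        if not _IDENT.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    query = (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[:{relation}]->(t)"
    )
    return query, {"head": head, "tail": tail}

query, params = triple_to_cypher(
    "queue", "RELATED_OPERATION", "enqueue", "LogicalStructure", "Operation"
)
print(query)
print(params)
```

Using `MERGE` rather than `CREATE` keeps repeated imports of the same triple from duplicating nodes or relationships.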
At present the knowledge graph is preliminarily constructed by hand. The invention also includes a semi-automatic knowledge graph construction tool based on named entity recognition. Because no public data set is currently available for the programming education field, the training data set has to be labeled manually, and the knowledge points are added to a knowledge point term dictionary. After the knowledge point term entities are extracted by the deep learning method, the input text corpus is additionally matched against the knowledge point term dictionary by word segmentation to improve the recognition effect. The flow of the BiLSTM-CRF named entity recognition method adopted by the invention is shown in fig. 5.
The data preprocessing process is mainly divided into two parts of sequence labeling and feature extraction. For sequence labeling, the method adopts a BIO labeling mode, uses B-KNO to represent the beginning of a knowledge point term entity, I-KNO to represent the middle of the knowledge point term, and O to represent other non-knowledge point term vocabularies in the text. And after the sequence is labeled, text feature extraction is carried out, word-embedding is carried out by using word2vec, and CBOW model training is selected.
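A minimal sketch of the BIO labeling described above, assuming the entity spans are already known as token index ranges. The helper name `bio_tags` and the English sample sentence are illustrative only.

```python
def bio_tags(tokens, term_spans):
    """Label each token B-KNO, I-KNO or O.

    term_spans is a list of (start, end) token index pairs, end exclusive,
    each covering one knowledge point term.
    """
    tags = ["O"] * len(tokens)
    for start, end in term_spans:
        tags[start] = "B-KNO"            # beginning of a knowledge point term
        for i in range(start + 1, end):
            tags[i] = "I-KNO"            # inside the same term
    return tags

tokens = ["a", "doubly", "linked", "list", "stores", "two", "pointers"]
for token, tag in zip(tokens, bio_tags(tokens, [(1, 4)])):
    print(token, tag)
```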
A BiLSTM-CRF model is used for named entity recognition: a bidirectional LSTM serves as the feature extractor and a CRF layer is attached as the output layer. The structure of the model is shown in fig. 6, and the accuracy on the training set and the validation set is shown in fig. 7.
To improve the recognition effect, after entities are extracted by the model, the text corpus is matched once against the constructed knowledge point term dictionary using the bidirectional maximum matching algorithm; the final recognition result is the deduplicated union of the model extraction result and the text matching result. After dictionary matching is added, precision improves greatly, recall improves slightly, and the overall F1 value improves markedly; the two results are compared in Table 2.
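The dictionary matching and the merge with the model output can be sketched as below. This is a common textbook form of bidirectional maximum matching, not necessarily the exact variant used here: the tie-breaking heuristic (fewer pieces, then fewer single-character pieces, ties going to the backward result) and all function names are assumptions.

```python
def forward_max_match(text, dictionary, max_len):
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                out.append(piece)
                i += size
                break
    return out

def backward_max_match(text, dictionary, max_len):
    out, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                out.append(piece)
                j -= size
                break
    return out[::-1]

def bidirectional_max_match(text, dictionary):
    max_len = max(map(len, dictionary))
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    # Assumed heuristic: fewer pieces, then fewer single-character pieces.
    key = lambda seg: (len(seg), sum(len(p) == 1 for p in seg))
    return min(bwd, fwd, key=key)

def merge_entities(model_entities, text, dictionary):
    """Deduplicated union of model output and dictionary matches."""
    matched = [p for p in bidirectional_max_match(text, dictionary) if p in dictionary]
    return sorted(set(model_entities) | set(matched))

terms = {"二叉树", "遍历"}  # "binary tree", "traversal"
print(bidirectional_max_match("二叉树的遍历", terms))
print(merge_entities(["遍历"], "二叉树的遍历", terms))
```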
TABLE 2 comparison of results
Figure BDA0003398612990000061
For ease of use, the tool is packaged and a platform is designed for entity recognition.
2) Knowledge graph completion
The knowledge graph is constructed by its administrators, so oversights and an insufficiently rich question bank are inevitable.
The knowledge graph quality evaluation algorithm provided by the invention first calculates the degree centrality of each node in the knowledge graph, i.e. the importance weight of the node; it then formulates statistical rules for the various ways a node can be incomplete, counts each node against these rules, and combines the weight with the statistical result to give a missing-condition score for the node.
Node degree centrality, i.e. importance, is represented by the NI value. If node A points to node B, node A is the more important of the two. For example, when the knowledge point "queue" points to its operation node "initialization of the queue", the latter is only one of the former's operations and serves the former. Therefore, when node A points to node B, the NI value of B should be added to the NI value of A. The formula for calculating the node importance value of each node in the knowledge graph is as follows:
Figure BDA0003398612990000071
where N is the number of neighbors of the node. The NI values are updated in every round and used for the next iteration; the iteration stops once the error between two successive iterations is smaller than a threshold. Writing the calculation for every node in matrix form gives:
Figure BDA0003398612990000072
which simplifies to:
NI_2 = C · NI_1
Figure BDA0003398612990000073
where α is the damping coefficient, Matrix is the statistically derived distribution-ratio matrix, e is the unit matrix, e^T is the transpose of the unit matrix, NI_1 is the NI value matrix of the current iteration, and NI_2 is the NI value matrix after the next iteration.
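The exact update formula is only available in the figures, so the following is a sketch under stated assumptions rather than the patented formula: it uses a PageRank-style iteration with damping coefficient α in which importance flows against the edge direction (B passes its NI value to every A that points to B), B's value is split evenly among those nodes, and nodes that nothing points to spread their value over all nodes. The function name and default parameters are hypothetical.

```python
def node_importance(edges, alpha=0.85, tol=1e-8, max_iter=1000):
    """Iterate NI values until the change between two rounds is below tol.

    edges is a list of (a, b) pairs meaning "node a points to node b".
    """
    nodes = sorted({n for edge in edges for n in edge})
    n = len(nodes)
    pointed_by = {v: [] for v in nodes}      # pointed_by[b] = nodes that point to b
    for a, b in edges:
        pointed_by[b].append(a)
    ni = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        nxt = {v: (1.0 - alpha) / n for v in nodes}
        for b in nodes:
            targets = pointed_by[b] or nodes  # assumed handling of dangling nodes
            share = alpha * ni[b] / len(targets)
            for a in targets:
                nxt[a] += share               # B's value is added to A
        err = sum(abs(nxt[v] - ni[v]) for v in nodes)
        ni = nxt
        if err < tol:
            break
    return ni

edges = [("queue", "queue initialization"), ("queue", "enqueue"), ("queue", "dequeue")]
ni = node_importance(edges)
for name in sorted(ni, key=ni.get, reverse=True):
    print(name, round(ni[name], 4))
```

In this example "queue" ends up with the largest NI value because all three operation nodes pass their importance to it.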
The actual state of every node in the knowledge graph is counted according to formulated rules; the more complete a node, the higher its score. Taking the logical-structure class as an example: for related operations such as initialization, insertion and deletion, each "related operation" connection adds 2 points, up to a cap of 15 points; for related implementations such as chained implementation and queue, each "implementation" connection adds 2 points, up to a cap of 15 points.
The larger the centrality value NI of a node, the higher its weight and the greater the need to complete it; the higher the statistical score, the more complete the node and the smaller the need. To unify the two, the final node missing-condition score is calculated as:
score(x)=NI(x)×(full(x)-has(x))
where x is the target node, full(x) is the full score, under the formulated rules, of the node category to which x belongs, and has(x) is the statistical score of node x under those rules. A higher score indicates a node that is more important and less complete, and therefore more in need of completion.
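A small worked example of score(x) = NI(x) × (full(x) - has(x)), using only the two logical-structure rules quoted in the text (2 points per connection, capped at 15). The rule table `RULES`, the relation keys and the NI value 0.5 are illustrative assumptions; a real rule set would cover more categories.

```python
# (points per connection, cap) for each counted relation; hypothetical subset.
RULES = {"related_operation": (2, 15), "implementation": (2, 15)}

def has_score(counts):
    """Statistical score has(x): capped points for each counted relation."""
    return sum(min(points * counts.get(rel, 0), cap)
               for rel, (points, cap) in RULES.items())

def full_score():
    """Full score full(x): the sum of all caps."""
    return sum(cap for _, cap in RULES.values())

def missing_score(ni, counts):
    """score(x) = NI(x) * (full(x) - has(x)); higher means more in need of completion."""
    return ni * (full_score() - has_score(counts))

# A node with 3 related operations and 1 implementation: has = 6 + 2 = 8, full = 30.
print(missing_score(0.5, {"related_operation": 3, "implementation": 1}))
```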
For the parts found to need completion, the administrator can complete them directly; for complex and tedious tasks, crowdsourcing tasks can be published for the public to complete, as shown in fig. 8. Since users may themselves notice where the knowledge graph needs supplementing while using a knowledge graph application, they can also proactively supplement it through the crowdsourcing platform, as shown in fig. 9. The platform can additionally run a closed-loop points system to motivate users to help complete the knowledge graph.
3) Chinese question-answering system based on knowledge graph
The invention uses the programming education knowledge graph to build an intelligent question-answering system. Starting from a raw corpus, templates are generated by combining unsupervised clustering with manual labeling. A crawler is built to crawl questions from Baidu Zhidao as the raw corpus, with the node names in the knowledge graph used as search keywords. The core workflow of the crawler is as follows:
(1) taking out a keyword from the keyword list;
(2) converting the keyword into a query url and accessing;
(3) obtaining the first 20 questions by regular-expression matching;
(4) if the result is empty, sleeping for a random period, increasing the sleep counter and the upper limit of the random sleep time, and returning to step (1);
(5) if the result is not empty, saving the questions as a document named after the keyword, and deleting the keyword from the keyword list;
(6) after actively sleeping for a random period, returning to step (1).
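The control flow of steps (1) to (6) can be sketched as follows. To keep the example runnable without network access, the page fetch and the file write are passed in as functions; `fetch_questions`, `save`, `base_wait` and the `max_failures` stop condition are hypothetical names and assumptions, not part of the described crawler.

```python
import random
import time

def crawl(keywords, fetch_questions, save, sleep=time.sleep,
          base_wait=1.0, max_failures=5):
    """Run the keyword loop; return the keywords that were not crawled."""
    pending = list(keywords)
    failures = 0            # the sleep counter of step (4)
    wait_cap = base_wait    # upper limit of the random sleep time
    while pending and failures < max_failures:
        keyword = pending[0]                        # step (1)
        questions = fetch_questions(keyword)[:20]   # steps (2) and (3)
        if not questions:                           # step (4): back off and retry
            failures += 1
            wait_cap *= 2
            sleep(random.uniform(0, wait_cap))
            continue
        save(keyword, questions)                    # step (5)
        pending.remove(keyword)
        sleep(random.uniform(0, base_wait))         # step (6)
    return pending

# Demo with stand-ins: the first request returns nothing, later ones succeed.
calls = {"n": 0}
def fake_fetch(keyword):
    calls["n"] += 1
    return [] if calls["n"] == 1 else [f"what is {keyword}?"] * 25

saved = {}
left = crawl(["queue", "stack"], fake_fetch, saved.__setitem__, sleep=lambda s: None)
print(left, {k: len(v) for k, v in saved.items()})
```

Growing the upper limit of the random sleep after each empty result is a simple way to slow down when the site starts rejecting requests.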
The crawled data are then cleaned and preprocessed. First, low-quality questions are removed: overlong questions truncated with a trailing "…" are located by matching from the end and deleted; domain-irrelevant questions are screened out manually by checking ambiguous keywords; and non-factual questions such as multiple-choice and short-answer questions are filtered by pattern matching on markers such as the "()" of multiple-choice questions.
Next, the meaningless parts of each question are removed: special symbols and repeated punctuation are deleted by pattern matching; a polite-phrase dictionary containing "thank you", "may I ask" and the like is built, and such phrases are identified and removed; modal particles are deleted using a modal-particle stop-word dictionary; and a small number of other semantically irrelevant fragments are removed manually.
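A minimal sketch of this cleaning step for Chinese questions. The two word lists are tiny illustrative stand-ins for the dictionaries mentioned in the text, and the punctuation set is an assumption; plain substring removal is also naive, since a modal particle could in principle occur inside a longer word.

```python
import re

POLITE_PHRASES = ["谢谢", "请问"]            # "thank you", "may I ask"
MODAL_PARTICLES = ["啊", "呢", "吧", "呀"]   # illustrative stop list

def clean_question(q):
    # Collapse runs of the same punctuation mark into a single mark.
    q = re.sub(r"([!?,.;:!?,。;:、])\1+", r"\1", q)
    # Drop special symbols: anything that is not a word character,
    # whitespace, basic punctuation or a parenthesis.
    q = re.sub(r"[^\w\s!?,.;:!?,。;:、()()]", "", q)
    for word in POLITE_PHRASES + MODAL_PARTICLES:
        q = q.replace(word, "")
    return q.strip()

# "May I ask what bubble sort is??? Thanks" plus a decorative symbol
print(clean_question("请问什么是冒泡排序呢???谢谢★"))
```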
A template captures the semantic and structural commonality between questions. For example, the question "definition of bubble sort" and the question "definition of doubly linked list" differ only in their subject words; both ask for the definition of a knowledge point and can be generalized to the same template. Masking the subject words in the questions according to the keyword list described above therefore strengthens the similarity between potentially similar questions and makes them easier to group into one class.
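Subject-word masking can be sketched as a simple replacement against the keyword list. The mask token `[KNO]` and the function name are assumptions for illustration; the returned count is the number of subject words found in the question.

```python
def mask_subject_words(question, keywords, mask="[KNO]"):
    """Replace every keyword occurrence with a mask token and count the hits."""
    count = 0
    # Longest keywords first, so a long term is not split by a shorter one.
    for kw in sorted(keywords, key=len, reverse=True):
        if kw in question:
            count += question.count(kw)
            question = question.replace(kw, mask)
    return question, count

keywords = ["bubble sort", "doubly linked list", "linked list"]
print(mask_subject_words("definition of bubble sort", keywords))
print(mask_subject_words("definition of doubly linked list", keywords))
```

Both questions reduce to the same masked string, which is what lets the later clustering place them in one class.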
Analysis of the preprocessed corpus shows that almost all questions contain only one or two subject words; questions containing three or more subject words account for only 1%. The key to template processing is that the number of subject words must match, so the corpus is divided into three categories by number of subject words and unsupervised clustering is then performed within each category. This groups similar questions into the same cluster more reliably and simplifies template extraction.
The preprocessed question data are then clustered and divided into subsets such that elements within a subset are as similar as possible and elements in different subsets are as dissimilar as possible; the subsets are the basis for subsequent template generation. The preprocessed question corpus is a series of variable-length lists of Chinese and English words and punctuation, so before a clustering algorithm can be applied, features must be extracted and each question converted into a fixed-length vector.
The invention uses the BERT-wwm model with 12 Transformer layers, hidden_size 768 and 12 attention heads. In theory, the output of any Transformer layer can be used as a sentence vector. Fig. 10 visualizes the BERT layers on the UCI-News Aggregator dataset (mean pooling): the last layer (pooling_layer = -1) is too close to the training objective and tends to overfit, which biases the semantic representation when the model is transferred without fine-tuning; the first layer (pooling_layer = -12) is too close to the original word embeddings and lacks contextual semantic information. Taking the second-to-last layer as the sentence vector generally gives better results. The invention therefore uses the penultimate layer as the feature and converts each question into a 768-dimensional sentence vector.
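The layer selection can be illustrated without loading a model. The sketch below assumes the per-layer token vectors are already available as nested lists and assumes mean pooling over non-padding tokens, which the text does not state explicitly for the sentence vectors; `sentence_vector` and its arguments are hypothetical names. A real BERT-wwm output would have 12 layers and 768 dimensions.

```python
def sentence_vector(layer_outputs, attention_mask, layer=-2):
    """Mean-pool the token vectors of one Transformer layer into a sentence vector.

    layer_outputs: one entry per layer, each a list of token vectors.
    attention_mask: 1 for real tokens, 0 for padding tokens.
    layer: index of the layer to use; -2 selects the penultimate layer.
    """
    tokens = layer_outputs[layer]
    # Keep only the vectors of real (non-padding) tokens.
    kept = [vec for vec, keep in zip(tokens, attention_mask) if keep]
    dim = len(kept[0])
    # Average each dimension over the kept tokens.
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy input: 3 layers, 3 tokens (the last one is padding), 2 dimensions.
toy_layers = [
    [[0, 0], [0, 0], [9, 9]],
    [[1, 3], [3, 5], [9, 9]],
    [[7, 7], [7, 7], [9, 9]],
]
toy_vector = sentence_vector(toy_layers, [1, 1, 0])
```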
For text clustering, a simple and practical algorithm is k-means. The basic idea is: given a set of n vectors
Figure BDA0003398612990000091
and an integer z, find z clusters and their respective centers
Figure BDA0003398612990000092
A parameter m is introduced such that the following expression is minimized:
Figure BDA0003398612990000093
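For reference, the standard k-means objective can be written as below. This is the textbook form and is not a reconstruction of the formula image; here $k$ plays the role of the integer z in the text, and the symbols $C_j$ (the j-th cluster) and $\mu_j$ (its center) are notation assumed for this sketch.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```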
The algorithm steps are as follows:
(1) selecting k points as an initial centroid;
(2) assigning all data points to the cluster in which the closest centroid is located;
(3) recalculating a new centroid for each cluster;
(4) repeating (2) and (3) until the centroid is no longer changed.
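Steps (1) to (4) can be sketched in plain Python as follows. The random initialization, the `max_iter` safeguard and the handling of empty clusters are assumptions added for the sketch and are not specified in the text.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # (1) select k points as the initial centroids
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # (2) assign every point to the cluster of its closest centroid
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # (3) recompute the centroid of every cluster
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(mean_vector(members) if members else centroids[j])
        # (4) stop once the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

In the setting described here, `points` would be the 768-dimensional sentence vectors of the masked questions.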
Another clustering method is the affinity propagation (AP) clustering algorithm. Compared with the k-means algorithm, it has the following advantages:
(1) the number of final clusters does not need to be specified manually;
(2) the generated cluster centers are existing data points rather than newly generated ones;
(3) it is insensitive to the initial values;
(4) the squared error of the result is smaller than that of k-means clustering.
The invention mainly uses the k-means and affinity propagation clustering algorithms for the clustering experiments. The silhouette coefficient of k-means clustering was computed for different values of k; at k = 17 the silhouette coefficient is about 0.6, and the clustering effect is optimal.
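The silhouette coefficient used to compare values of k can be sketched as below. This follows the usual definition (for each point, `a` is the mean distance to its own cluster and `b` is the smallest mean distance to another cluster); the Euclidean distance and the score of 0 for singleton clusters are assumptions, since the text does not say which variant was used.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; requires at least two distinct labels."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        # Smallest mean distance from p to the points of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Running the clustering for a range of k values and keeping the k with the highest score mirrors the selection procedure described above.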
The questions in each cluster have similar semantic structures. The template format is therefore defined first, the questions in each cluster are then manually annotated with templates, and finally the template library is built. The invention uses the JSON format for templates: each template can be loaded into the question-answering system as a JSON object, with the structure defined as follows:
Figure BDA0003398612990000094
Figure BDA0003398612990000101
The template priority determines which template is chosen when several templates match at the same time: the template with the higher priority (a priority value closer to 1) is matched first. The number of subject words in the template serves as a constraint that prevents questions with few subject words from being matched against templates with more subject words, which effectively improves matching efficiency. The template description distinguishes templates from one another, improving readability and making the templates easier for annotators and maintainers to manage. The template example sentence is usually the center sentence of the corresponding cluster; several sentences may also be given to widen the matching range. Once the subject words are filled in, the query template becomes an executable Cypher statement that can run a database query. The answer template may contain several answer sentence patterns for generating multiple answers.
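Since the structure definition itself is only available as an image, the following is a hypothetical template object covering the fields described above. Apart from `TopicNum`, which appears in the matching steps, all field names, the `<R1>` result slot and the graph schema in the Cypher string are assumptions made for illustration.

```python
import json

# Hypothetical template; field names other than TopicNum are assumed.
TEMPLATE_JSON = """
{
  "Priority": 1,
  "TopicNum": 1,
  "Description": "definition of a single subject entity",
  "Examples": ["definition of <T1>", "what is <T1>"],
  "Query": "MATCH p=(n {name: '<T1>'})-[:definition]->(d) RETURN d.name, p",
  "Answers": ["The definition of <T1> is: <R1>."]
}
"""

template = json.loads(TEMPLATE_JSON)


def fill(pattern, topics, results=()):
    """Fill subject-word slots <T1>, <T2>, ... and result slots <R1>, <R2>, ..."""
    for i, topic in enumerate(topics, start=1):
        pattern = pattern.replace(f"<T{i}>", topic)
    for i, result in enumerate(results, start=1):
        pattern = pattern.replace(f"<R{i}>", result)
    return pattern
```

Filling `template["Query"]` with `["bubble sort"]` yields a Cypher string ready to run. In a production system, passing the entity as a query parameter would be safer than string substitution.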
The labeling of one cluster is described as an example; the template labeling process is shown in fig. 11. The original cluster belongs to the single-subject-entity class; such questions are basic and frequently asked, so the template is given a high priority. Analysis of the original cluster shows that the questions mainly concern the definition of a subject entity. Two sentences close to the cluster center are selected as example sentences, and one more example sentence with a larger structural difference is added to make matching more robust. A Cypher query statement is then constructed manually for the template; <T1> in the query statement is filled in with the corresponding entity from the matched question, and in addition to the query result the query always returns the query path for knowledge visualization. Finally, two answer templates are constructed; filling the subject entity and the query result into the corresponding positions of a template generates the answer sentence returned to the user.
The above work completes the generation of the template library. Subsequent upgrades of the template library follow the same process, so the process needs to be built into a subsystem that serves as the update system for the template library. In that case the original questions no longer come from questions crawled from Baidu Zhidao but from user questions actually collected by the question-answering system. After the subsystem receives a question, it only needs to mask the recognized subject entities before storing the question in the question bank. When the collected questions reach a certain scale, cluster analysis and template labeling are performed on the question bank to update the template library. The subsystem is also responsible for managing the template library: it provides the question-answering system with an interface for reading the template library, through which the question-answering system can read the template files and convert them into template class instances for template matching. The advantage of this design is that the question-answering system can continue to provide service while the template library is being updated and simply loads the new template files once the update is complete, which improves system stability.
The data flow of the system is shown in fig. 12. A natural language question entered by the user first undergoes subject entity recognition; after the subject words are extracted, template matching is performed; once matching succeeds, the Cypher query statement is executed against the knowledge graph, and the query results are assembled into a natural language answer and a query subgraph, both of which are displayed to the user.
As the first step of the natural language question processing flow, the subject entity recognition task can be divided into sequence labeling and entity linking.
The sequence labeling task is: given a sequence x = x1x2…xn, find the label sequence y = y1y2…yn corresponding to the elements of the sequence, where the set of all possible values of y is called the label set; in this way the key information strings are extracted from the question. A lexical analysis system based on a CRF model is used. For nouns that have the same name as a knowledge point node in the knowledge graph, a dictionary can assist the CRF in word segmentation. After the dictionary is set up, running the example sentence 'time complexity of merge sort' again gives the result "merge sort/nto/u time/n complexity/n", in which the term 'merge sort' is recognized correctly. A long conceptual noun such as 'time complexity', however, cannot be counted as a subject entity because no specific noun node for it exists in the knowledge graph. To reduce the workload of subsequent entity linking, consecutive non-subject nouns are merged in an additional processing step, and the final tagging sequence is: "merge sort/nto/u [time/n complexity/n]/n".
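The merging of consecutive non-subject nouns can be sketched as follows, assuming the segmenter returns `(word, tag)` pairs in which plain nouns carry the tag `n` and dictionary-matched subject entities carry `nto`. The function name and the space-joined output are illustrative choices.

```python
def merge_plain_nouns(tagged):
    """Merge each run of consecutive plain nouns (tag 'n') into one noun token."""
    merged, run = [], []

    def flush():
        # Emit the pending run: joined if it has several words, unchanged otherwise.
        if len(run) > 1:
            merged.append((" ".join(word for word, _ in run), "n"))
        elif run:
            merged.append(run[0])
        run.clear()

    for word, tag in tagged:
        if tag == "n":
            run.append((word, tag))
        else:
            flush()
            merged.append((word, tag))
    flush()
    return merged
```

Subject entities tagged `nto` are never merged, so they stay available for template matching.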
Some of the subject entities have been labeled above with the dictionary tag "nto", but the dictionary cannot recognize synonyms effectively. For example, the abbreviated form of 'quick sort' is a common synonym for the full term 'quick sort'; the abbreviated form is labeled as a common noun "n" in the segmentation result, whereas the full term is dictionary-matched as a subject entity and labeled "nto". A link between the abbreviated form and the full term in the knowledge graph must therefore be established, so that the abbreviation in the original sentence is converted into the corresponding subject entity.
The invention uses a vector-based similarity method for entity linking. First, all the subject words extracted by TopicLoader are encoded with BERT, each into a 768-dimensional vector; there are 496 subject words in total, giving a 496 × 768 subject-word matrix. Each candidate word is then encoded, and the Pearson correlation coefficient between it and every row of the subject-word matrix is calculated. The Pearson correlation coefficient measures the linear correlation between two vectors, and its value lies between -1 and 1. It is calculated as follows:
$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
where $X_i$ is the i-th dimension of the first word vector, $Y_i$ is the i-th dimension of the second word vector, $n$ is the vector dimension, and $\bar{X}$ and $\bar{Y}$ are the means of the first and second word vectors, respectively. According to experiments, setting the correlation coefficient threshold to about 0.8 gives good accuracy.
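A minimal sketch of Pearson-based entity linking with the 0.8 threshold. The function names and the dictionary of subject-word vectors are hypothetical; in the described system the vectors would be 768-dimensional BERT encodings.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        # The coefficient is undefined for constant vectors; treat as uncorrelated.
        return 0.0
    return cov / (norm_x * norm_y)


def link_entity(candidate_vec, topic_vectors, threshold=0.8):
    """Return the best-correlated subject word, or None if below the threshold."""
    best_name, best_score = None, -1.0
    for name, vec in topic_vectors.items():
        score = pearson(candidate_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```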
After subject entity recognition, the number and names of the subject entities contained in the original question are known, and template matching is then performed. The template matching steps adopted by the invention are as follows:
(1) select a candidate template set T, in which the TopicNum of every template is less than or equal to the number of subject entities in the question;
(2) for each template T_i in T, fill the subject entities of the question, in order, into the example sentence templates to generate an example sentence matrix S_i, each row of which is a generated natural language example sentence; combine all example sentence matrices into the example sentence set S;
(3) for each S_i in S, compute the similarity between the entity-linked user question and each example sentence, and take the highest similarity as the matching degree of the template, denoted c_i;
(4) select the template T_j with the highest matching degree; if its matching degree is greater than a preset threshold ρ, the template matching succeeds; otherwise the template matching fails and a default template is selected as the matching result.
When matching fails, the subgraph related to the subject entities of the user question is returned from the knowledge graph, and no natural language answer is generated.
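The four matching steps can be sketched as below. The similarity measure is not specified in this passage, so `difflib.SequenceMatcher` is used purely as a stand-in; the `Examples` field name, the value of `rho` and the dictionary-based template representation are likewise assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in string similarity; the actual measure is not given in the text.
    return SequenceMatcher(None, a, b).ratio()


def fill_topics(example, entities):
    for i, entity in enumerate(entities, start=1):
        example = example.replace(f"<T{i}>", entity)
    return example


def match_template(question, entities, templates, default_template, rho=0.8):
    # (1) candidates: templates whose TopicNum does not exceed the entity count
    candidates = [t for t in templates if t["TopicNum"] <= len(entities)]
    best, best_score = default_template, 0.0
    for template in candidates:
        # (2) fill the entities, in order, into the example sentences
        examples = [fill_topics(e, entities) for e in template["Examples"]]
        # (3) matching degree = highest similarity over the example sentences
        score = max((similarity(question, s) for s in examples), default=0.0)
        if score > best_score:
            best, best_score = template, score
    # (4) accept the best template only above the threshold rho
    if best_score > rho:
        return best, best_score
    return default_template, best_score
```

When the default template is returned, the caller would fall back to returning the related subgraph without a natural language answer.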
Examples
Several groups of test data were designed and constructed according to real question-answering scenarios to test the accuracy of the template matching algorithm; the results are shown in Table 3. The first and second groups of test data come from questions collected through questionnaires, which respondents actually wanted to ask the question-answering system. The third and fourth groups come from the previously crawled Baidu Zhidao questions; about 50% of the third group are single-subject-entity questions, and about 70% of the fourth group are single- or double-subject-entity questions.
TABLE 3 accuracy test
Figure BDA0003398612990000121
Analysis of the test results shows that the accuracy of the first and second groups is similar, and that the accuracy of the third and fourth groups is slightly higher because they share the same corpus source as the template library. In the first and second groups, the proportion of simple to complex questions matches the proportion found in practice, so their results are closer to real-world performance. Some of the test data and results are shown in Table 4.
Table 4 partial test data and results
Figure BDA0003398612990000131
Overall, the accuracy of the current template matching algorithm is adequate for the question-answering task. Because the template library is pluggable and easy to update, the matching accuracy can be improved step by step by iteratively upgrading the template library.
The final interface of the intelligent question-answering system is shown in fig. 13. The answer box in the middle displays the natural language answer; the matched template and subject words are shown in small light-grey text above the answer, and the knowledge graph subgraph is displayed below the answer.
System performance was evaluated mainly in terms of answer response time. Five groups of test questions were designed, and the tests were run on an Intel Core i7-8750H CPU (2.20 GHz) platform; the results are shown in Table 5. When the repetition rate of the input questions is high, caching substantially improves the response speed of the system. For subject-word extraction and Neo4j connection queries, which have the greatest impact on performance, the response time can be kept within an acceptable range by increasing the number of concurrent tasks and extending the cache duration.
TABLE 5 results of Performance test
Figure BDA0003398612990000141
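The effect of caching with a configurable duration can be illustrated with a small time-to-live cache. This is a generic sketch, not the system's cache: the class name, the injectable `clock` parameter and the `get_or_compute` method are all hypothetical.

```python
import time


class TTLCache:
    """Cache answers for repeated questions for a limited duration."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so that expiry can be tested
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        # Serve the cached value while it is younger than the TTL.
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        # Otherwise run the expensive step and remember the result.
        value = compute(key)
        self._store[key] = (now, value)
        return value
```

Here `compute` would stand for the expensive part of the pipeline, such as subject-word extraction followed by the Neo4j query; a longer `ttl_seconds` corresponds to extending the cache duration.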

Claims (1)

1. A method for programming-education knowledge graph construction, completion and intelligent question answering, characterized by comprising the following specific steps:
Step 1: constructing a programming vertical-domain knowledge graph containing knowledge from programming fundamentals, data structure and algorithm courses, specifically comprising:
A1: using structured knowledge sources from books and websites, extracting an ontology schema and knowledge points from the knowledge sources, and obtaining ontology constraints in a top-down manner, the ontology constraints being constructed by a five-step method: determining the professional domain and scope of the ontology, listing the important terms in the ontology, defining the classes and the class hierarchy, defining the attributes of the classes, and defining the relationships between the classes;
A2: taking each sentence describing knowledge points in the knowledge text corpora of books and websites in the programming field as corpus data, manually labeling the entities contained in the corpus data to form a corpus data set, the data being labeled with the BIO tagging scheme, that is, marking each word in a sentence as the beginning of a knowledge-point term entity, the inside of a knowledge-point term, or other non-knowledge-point vocabulary; and collecting the knowledge-point term entities sorted out of the corpus to obtain a knowledge-point term dictionary;
A3: performing entity matching and recognition with the BiLSTM-CRF model in combination with the knowledge-point term dictionary constructed in step A2, that is, using a bidirectional LSTM network as the feature extractor and the sequence labeling algorithm CRF to output the named entity recognition result;
A4: manually extracting entities and relations from the books and website knowledge bases in the programming field, and combining them with the entities recognized in step A3 to form structured text in the form head entity - relation - tail entity;
A5: storing the structured text obtained in step A4 in a Neo4j database as the data layer according to the ontology constraints constructed in step A1, thereby obtaining the programming vertical-domain knowledge graph containing knowledge from programming fundamentals, data structure and algorithm courses;
Step 2: finding the imperfect parts of the knowledge graph constructed in step 1 using a knowledge graph quality evaluation algorithm based on node centrality, specifically comprising:
B1: representing the importance, i.e. the weight, of a node by an NI value: if node A points to node B, the NI value of node B is added to the NI value of node A; the NI values calculated in each iteration are used for the next iteration until the error between two successive iterations is smaller than a threshold;
B2: for each node of the knowledge graph, recording whether a definition is present and counting the related data-structure operations, the operation code and the questions associated with the knowledge point, and calculating a score from these statistics;
B3: multiplying the node NI value calculated in step B1 by the statistical score from step B2 to obtain the final node completeness score, entities with low scores being the imperfect parts of the knowledge graph;
Step 3: completing the imperfect parts of the knowledge graph found in step 2, specifically comprising:
C1: for the nodes with low completeness scores calculated in step B3, the data administrator completes these nodes in the Neo4j database with priority;
C2: handling complex missing-data cases and large completion workloads through a crowdsourcing scheme, that is, the data administrator publishes four types of tasks, including adding topics, adding nodes and modifying nodes, on the crowdsourcing platform that has been built and hands them to users to solve, and users may also proactively submit completion proposals on the crowdsourcing platform while using the platform;
C3: the administrator reviews and approves the completion proposals submitted by users on the crowdsourcing platform in step C2 and completes the knowledge graph according to the approved proposals, obtaining a complete programming vertical-domain knowledge graph containing knowledge from programming fundamentals, data structure and algorithm courses;
Step 4: constructing the template library on which the intelligent question-answering system relies, specifically comprising:
D1: crawling original question corpus data in natural language form from the Baidu Zhidao question-and-answer platform, cleaning the raw data, and removing the parts of the questions that have no practical meaning, including special symbols, repeated punctuation, polite set phrases and modal particles;
D2: since similar questions share semantics and structure and differ only in their subject words, masking the subject words of the questions preprocessed in step D1, converting the masked question data into fixed-length vectors with the BERT model to extract features, and clustering the texts with the k-means algorithm and the affinity propagation clustering algorithm so that questions with similar structures fall into one class;
D3: defining a template format for each class of questions obtained by the text clustering in step D2, the template comprising a question template and an answer template; converting the question, according to the characteristics of the class, into a Cypher statement for searching the knowledge graph constructed in step C3; and filling the answer template with the results returned by the Cypher statement, thereby completing the construction of the template library;
Step 5: the verification of the question-answering system, specifically comprising:
E1: manually extracting the knowledge points in the knowledge graph to build a knowledge-point entity dictionary; performing word segmentation, sequence labeling and part-of-speech prediction on the question with the HanLP word segmentation tool; and labeling the knowledge-point entities in the question that match the entity dictionary with a special part of speech;
E2: vectorizing the entities in the knowledge-point entity dictionary of step E1 and the words obtained by segmenting the question in step E1, and calculating their similarity to perform entity linking;
E3: calculating the similarity between the entity-linked question and the question templates in the template library constructed in step D3 to perform template matching; if the matching degree is greater than a preset threshold, the template matching succeeds, otherwise a default template is selected as the matching result; searching the knowledge graph with the Cypher statement given by the template, and filling the returned results into the answer template to obtain the answer to the question.
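The node-importance iteration of steps B1 to B3 can be sketched as follows. The claim does not give a normalization or damping scheme, so this sketch assumes a PageRank-style update in which a node passes a share of its NI value along each outgoing edge; if the opposite direction is intended, the edge list can simply be reversed. The function names, the damping factor and the tolerance are assumptions.

```python
def node_importance(edges, damping=0.85, tol=1e-6, max_iter=200):
    """Iterate NI values until the change between two iterations is below tol."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    ni = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(max_iter):
        new = {node: (1 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node without outgoing edges spreads its value over all nodes.
            targets = out_links[src] or nodes
            share = damping * ni[src] / len(targets)
            for dst in targets:
                new[dst] += share
        error = sum(abs(new[node] - ni[node]) for node in nodes)
        ni = new
        if error < tol:
            break
    return ni


def completeness_scores(ni, stat_scores):
    """B3: multiply each node's NI value by its statistical score from B2."""
    return {node: ni[node] * stat_scores.get(node, 0.0) for node in ni}
```

Nodes with the lowest values in `completeness_scores` would be the first candidates for completion in step C1.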
CN202111491707.3A 2021-12-08 2021-12-08 Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education Pending CN114238653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111491707.3A CN114238653A (en) 2021-12-08 2021-12-08 Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111491707.3A CN114238653A (en) 2021-12-08 2021-12-08 Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education

Publications (1)

Publication Number Publication Date
CN114238653A true CN114238653A (en) 2022-03-25

Family

ID=80753974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111491707.3A Pending CN114238653A (en) 2021-12-08 2021-12-08 Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education

Country Status (1)

Country Link
CN (1) CN114238653A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115309915A (en) * 2022-09-29 2022-11-08 北京如炬科技有限公司 Knowledge graph construction method, device, equipment and storage medium
CN115309915B (en) * 2022-09-29 2022-12-09 北京如炬科技有限公司 Knowledge graph construction method, device, equipment and storage medium
CN115795056A (en) * 2023-01-04 2023-03-14 中国电子科技集团公司第十五研究所 Method, server and storage medium for constructing knowledge graph by unstructured information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination