CN110147436B - Education knowledge map and text-based hybrid automatic question-answering method - Google Patents

Info

Publication number
CN110147436B
Authority
CN
China
Prior art keywords
answer
question
subject
template
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910203301.7A
Other languages
Chinese (zh)
Other versions
CN110147436A (en
Inventor
许斌
刘阳
杨玉基
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201910203301.7A priority Critical patent/CN110147436B/en
Publication of CN110147436A publication Critical patent/CN110147436A/en
Application granted granted Critical
Publication of CN110147436B publication Critical patent/CN110147436B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of intelligent education question answering, and in particular relates to a hybrid automatic question-answering method based on an education knowledge graph and text, comprising the following steps: constructing a basic education knowledge graph by building a basic education ontology, performing semantic annotation, and extracting information; building general question templates by combining keywords with regular expressions; building a full-text search engine and preprocessing massive texts; training a deep text matching model on question-answer pairs until it converges; performing entity recognition on the user question to obtain a subject list, and assigning a confidence to each subject; performing template matching to obtain a predicate list, and assigning a confidence to each predicate; querying the knowledge graph with the subject and predicate lists to obtain an answer list, and assigning a confidence to each answer; obtaining keywords by part-of-speech tagging, and performing coarse-grained and then fine-grained matching to obtain and rank text-based answers; returning the knowledge-graph-based answer if its highest confidence exceeds a threshold; and otherwise returning the top-ranked text-based answer.

Description

Education knowledge map and text-based hybrid automatic question-answering method
Technical Field
The invention belongs to the technical field of intelligent education question answering, and in particular relates to a hybrid automatic question-answering method based on an education knowledge graph and text.
Background
Smart education has become an important form of development in the field of education in the information age. Its essence lies in using intelligent technology to build an intelligent environment in which students can acquire knowledge and have their questions answered faster and better. An automatic question-answering system is a very effective means to this end. On the one hand, it can resolve the questions and doubts of primary and secondary school students, so that they obtain answers promptly during everyday learning. On the other hand, good human-computer interaction can noticeably raise students' enthusiasm for learning. It is therefore necessary to build a question-answering system that accurately understands the questions students pose and quickly gives accurate answers.
Early question-answering systems were template-based "expert systems", in which rules were written by hand for a specific field to construct templates; their most obvious shortcoming is that they can handle only a small amount of data in that field. With the development of search technology, open-domain retrieval-based question answering (IE-QA) emerged, in which answers are extracted from a large body of text according to the keywords and semantic relations in the question, for example IBM's "Watson" and "TREC". This mode alleviates the problem of narrow coverage to some extent, but because the quality of the texts is uneven, the extracted answers are not always accurate. Later, as Internet communities emerged, many Internet companies developed community question answering such as "Zhihu" and "Stack Overflow"; this form essentially provides a platform where users gather, and the correctness of the answers has to be judged by the users themselves.
The concept of the "knowledge graph", proposed by Google, defines a new way of organizing knowledge. Starting from the data itself, it attempts to convert unstructured data into structured data and to link the data together into a graph model containing a large amount of structured data. Such structured graph data opens a new direction for question-answering systems, namely knowledge-graph-based question answering (KB-QA), which can make full use of the structured data in the knowledge graph to give users concise and accurate answers, and has therefore gradually become an important research direction. It can also provide effective support for the development of next-generation intelligent retrieval and humanoid robots.
Some work has already been done on question-answering systems in the basic education field, but it has the following problems. Existing systems answer from a single source, either a knowledge graph or text, and cannot combine the advantages of the two. Specifically, knowledge in a knowledge graph is accurate and highly structured, but its coverage is lower than that of text; text contains essentially all of the knowledge, but because it is unstructured, semantic analysis is difficult. If user questions are answered from the knowledge graph alone, many questions cannot be answered; if they are answered from text alone, many questions are answered incorrectly. Only by combining the knowledge of both sources and ranking their answers jointly can the advantages of both be fully used and the most comprehensive and accurate answer be returned for the user's question. In addition, in the basic education field, teaching materials and teaching aids are the most authoritative resources, yet existing basic education question-answering systems do not finely mine and process the knowledge in them; knowledge points in basic education also have many interdisciplinary associations, yet existing systems do not consider the knowledge of all disciplines together.
Disclosure of Invention
In view of the above technical problems, the invention provides a hybrid automatic question-answering method based on an education knowledge graph and text, comprising the following steps:
Step 1: constructing a basic education ontology, performing semantic annotation on the teaching materials of each discipline, and extracting information from the teaching materials and Internet encyclopedia text resources to construct an all-discipline basic education knowledge graph; constructing general question templates from keywords and regular expression syntax;
Step 2: building a full-text search engine, and preprocessing the massive texts of teaching materials and Internet encyclopedias to conform to the index format of the search engine; taking large-scale basic education test question-answer pairs as a training set, and training a deep text matching model until the model converges;
Step 3: performing entity recognition on the user question to obtain a subject list, and assigning a corresponding confidence to each subject; performing template matching on the user question to obtain a predicate list, and assigning a corresponding confidence to each predicate; querying the knowledge graph with the subject list and the predicate list to obtain an answer list based on the education knowledge graph, and assigning a corresponding confidence to each answer;
Step 4: obtaining keywords of different levels in the question by part-of-speech tagging, and inputting the keywords into the search engine for coarse-grained matching to obtain a text-based answer list; performing fine-grained matching on the text-based answer list with the pre-trained deep text matching model to obtain and rank the answers;
Step 5: returning the answer based on the education knowledge graph if its highest confidence exceeds a threshold; otherwise, returning the top-ranked text-based answer.
The basic education ontology is constructed through a semi-automatic ontology construction method.
The information extraction is used to augment instances, relationships, and attributes of knowledge.
Constructing the general question templates specifically includes:
forming general templates for a type of question by using a relation or attribute in the education knowledge graph as the keyword and combining it with regular expression syntax;
analyzing the questions in a large-scale education question-answer data set with a syntactic analysis tool, extracting keywords, and forming general templates for a type of question by combining them with regular expression syntax;
generating templates based on highly discriminative question words;
generating templates based on general question words.
The full-text search engine is Elasticsearch, a scalable open-source full-text search and analytics engine.
Assigning the corresponding confidence to each subject specifically includes:
an instance that exactly matches an entry in the instance table has a confidence of 1;
an instance obtained by template segmentation followed by stop-word removal has a confidence of 0.8;
an instance obtained by fuzzy matching, similarity calculation, or longest-common-substring matching has a confidence of 0.6.
Assigning the corresponding confidence to each predicate specifically includes (a smaller value denotes a more reliable template):
a template generated from a relation or attribute in the education knowledge graph has a confidence value of 1;
a template generated from keywords extracted by syntactic analysis has a confidence value of 1;
a template generated from highly discriminative question words has a confidence value of 2;
a template generated from general question words has a confidence value of 3.
Assigning the corresponding confidence to each answer specifically includes:
combining the subject list and the predicate list pairwise to generate SPARQL query statements;
querying the education knowledge graph to obtain an answer list;
assigning a confidence to each answer according to a preset rule, calculated as follows:
score = subjectScore × pscore, where pscore is the predicate score and subjectScore is the subject score;
pscore is determined by the template confidence: pscore = 1 / template confidence;
subjectScore is determined by the subject confidence: subjectScore = 20 × rate × subject confidence;
rate is determined by the longest common substring of the subject and the question:
rate = Math.sqrt(length of longest common substring / length of subject) × Math.pow(length of subject, 1.0/2).
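As a quick check on the formulas above, the two factors of rate can be combined algebraically. The symbols are introduced here only for illustration: l_lcs is the length of the longest common substring, l_s is the length of the subject (assumed greater than 0), c_s is the subject confidence, and c_t is the template confidence value (1, 2, or 3).

```latex
\begin{aligned}
\mathrm{rate} &= \sqrt{\frac{l_{\mathrm{lcs}}}{l_{s}}}\cdot l_{s}^{1/2} = \sqrt{l_{\mathrm{lcs}}},\\[4pt]
\mathrm{score} &= \mathrm{subjectScore}\times\mathrm{pscore}
  = \bigl(20\cdot\mathrm{rate}\cdot c_{s}\bigr)\cdot\frac{1}{c_{t}}
  = \frac{20\,\sqrt{l_{\mathrm{lcs}}}\;c_{s}}{c_{t}}.
\end{aligned}
```

Under the formulas as stated, the score therefore grows with the square root of the overlap between subject and question and shrinks as the template becomes more general.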
The part-of-speech tagging method specifically comprises:
setting words whose part of speech is noun (n), verb (v), person name (nr), or another tag that can serve as subject or predicate as primary keywords;
setting adverbs (d), numerals (m), noun morphemes (Ng), and other words that modify the subject or predicate as secondary keywords;
setting conjunctions (c), adverbial morphemes (Dg), interjections (e), locative words (f), and other words irrelevant to the keywords as tertiary keywords.
The coarse-grained matching specifically comprises:
issuing a strict phrase query for each primary keyword, combining all of these phrase queries logically, and requiring at least 50% of them to match;
issuing a strict phrase query for each secondary keyword, combining all of these phrase queries logically, with no minimum number of matches required;
issuing no query for tertiary keywords.
The beneficial effects of the invention are as follows:
The invention fully covers the nine basic education subjects of Chinese, mathematics, English, politics, history, geography, physics, chemistry, and biology. It relies mainly on teaching materials and teaching aids, supplemented by massive Internet resources, and combines the efficiency and accuracy of KB-QA with the wide coverage of IE-QA, so that the most accurate answer is returned for the user's question.
Drawings
FIG. 1 is a structure diagram of the hybrid question-answering system based on an education knowledge graph provided by an embodiment of the invention.
FIG. 2 is a structure diagram of the deep text matching model provided by an embodiment of the invention.
Detailed Description
The embodiments are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a hybrid automatic question-answering method based on an educational knowledge graph according to an embodiment of the present invention.
Referring to Fig. 1, the hybrid automatic question-answering method provided by an embodiment of the invention includes:
S1: constructing the education knowledge graph and the templates;
S2: digitizing paper teaching materials and teaching aids, and preprocessing Internet text;
S3: question answering and scoring based on the education knowledge graph;
S4: question answering and scoring based on text;
S5: answer selection based on both the education knowledge graph and text.
In this embodiment, the offline processing in step S1 further includes the following steps shown in Fig. 1:
S11: constructing the basic education knowledge graph by means of ontology construction, semantic annotation, information extraction, and the like, relying mainly on teaching materials and teaching aids and supplemented by Internet resources.
S12: establishing a template library from the existing basic education knowledge graph, mainly by building one-to-many regular expression templates for the relations (or attributes) in the knowledge graph.
In this embodiment, step S11 (constructing the basic education knowledge graph by ontology construction, semantic annotation, information extraction, and the like, relying mainly on teaching materials and teaching aids and supplemented by Internet resources) further includes the following steps not shown in the drawings:
processing the texts of teaching materials and teaching aids with the TF-IDF and TextRank algorithms to obtain candidate terms in the basic education field;
referring to knowledge graphs in general fields such as schema.
determining the concepts, the relations between them, and their constraints according to the infoboxes of encyclopedia websites;
inviting experts and teachers in the education field to review the result, completing the ontology construction;
annotating the knowledge list of each subject according to the ontology by crowdsourced semi-automatic semantic annotation, to obtain the core knowledge of each subject;
supplementing the required structured data from relevant Internet websites, for example obtaining Chinese administrative division information from the website of the National Bureau of Statistics, and adding it to the knowledge graph;
extracting information from text with machine learning methods, including entity set expansion, relation extraction, and the like.
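The candidate-term step in the list above names TF-IDF without giving details. The following is a minimal sketch of one common TF-IDF variant (relative term frequency times unsmoothed log inverse document frequency), assuming the texts have already been tokenized; the function name tfidf_candidates and the tie-breaking rule are illustrative choices, not taken from the patent.

```python
import math
from collections import Counter

def tfidf_candidates(docs, top_n=5):
    """Rank terms per document by TF-IDF.

    docs: list of token lists (tokenization is assumed to happen upstream).
    Returns, for each document, its top_n terms by descending TF-IDF,
    with alphabetical order used to break ties.
    """
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    n_docs = len(docs)

    results = []
    for tokens in docs:
        tf = Counter(tokens)
        scored = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        ranked = sorted(scored, key=lambda t: (-scored[t], t))
        results.append(ranked[:top_n])
    return results
```

Terms that occur in every document receive a weight of zero under this variant, which is why function words fall to the bottom of each ranking.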
In this embodiment, step S12 (establishing a template library from the existing basic education knowledge graph, mainly by building one-to-many regular expression templates for the relations or attributes in the knowledge graph) further includes the following steps not shown in the figure:
The templates are constructed with regular expressions and come from two main sources:
1. Corresponding templates are generated, in combination with regular expressions, from the relations and attributes contained in the education knowledge graph constructed in step S11.
2. Pre-collected questions are processed, and the corresponding keywords, mainly predicates, question words, and the like, are obtained by syntactic analysis; corresponding templates are then generated in combination with regular expression syntax.
In this embodiment, the templates are stored in a MySQL database. Besides the regular expression itself, each row of the template table has several fields describing the template, such as its attribute and priority; the specific structure is shown in Table 2.
Table 1 gives the part-of-speech priority information used for IE-QA in an embodiment of the invention.
[Table 1 appears as images in the original publication: Figure GDA0002716931470000061, Figure GDA0002716931470000071]
The use of each field of a template is described in detail below:
Content is the body of the template constructed in step S12, written as a regular expression. For example, there is a template "(? "; if a question matches this template, "geographic location" is taken as a possible predicate of the question. "(. For example, when the question "What is the geographic location of Mount Tai, the Eastern Peak?" matches this template, the captured subject is "Mount Tai, the Eastern Peak";
Subject indicates whether the subject of the template is known: it is false if the subject is unknown and defaults to true otherwise. For example, for "who is known to be sweaty in the day" the subject is unknown, so the value is false.
Value indicates whether the object is known;
Type gives the relation or attribute that the template corresponds to. A relation is an "edge" connecting two entities in the knowledge graph; for example, the entities "China" and "Beijing" are connected by the relation "capital". An attribute is a piece of knowledge about the entity itself; for example, the entity "Beijing" has the attribute "climate type", whose value is "warm temperate continental monsoon climate".
Class denotes the class of the subject of the question and is used to restrict the subject type for certain special questions. Its values mainly include time, person, and the like; it is empty for most templates and mainly identifies the subject type in a specific field;
Use identifies questions that require special processing because their results cannot be obtained through a SPARQL query.
Priority identifies the priority of the template and is mainly used to calculate the score of the predicate.
Table 2 is a schematic diagram of the question templates in the basic education field provided by an embodiment of the invention.
[Table 2 appears as images in the original publication: Figure GDA0002716931470000072, Figure GDA0002716931470000081]
Templates have three priorities:
the first priority is a template generated specifically from the predicate of the question, a relation or attribute in the knowledge graph, or a specific type of question; it has high confidence, for example "(? Condition (.? ", and is identified as "1" in the database;
the second priority is a template generated from a question word with distinctive features, mainly for attribute questions that first-priority templates cannot match, for example "(? "; its confidence is lower than that of a first-priority template, and it is identified as "2" in the database;
the third priority is used when neither first- nor second-priority templates match; the question is then matched against broader question words, for example "(? "; this class of template has the lowest confidence of the three and is identified as "3" in the database.
In this embodiment, step S2 (digitizing paper teaching materials and teaching aids, and preprocessing Internet text) further includes the following steps not shown in the figure:
S21: deploying Elasticsearch, a highly scalable open-source full-text search and analytics engine, to support near-real-time query and retrieval over massive texts.
S22: preprocessing massive texts such as teaching materials and encyclopedias, and adding them to the Elasticsearch index in the Elasticsearch index format.
S23: taking large-scale basic education question-answer pairs as a training set, and training a deep text matching model until the model converges;
In this embodiment, step S22 (preprocessing massive texts such as teaching materials and encyclopedias, and adding them to the Elasticsearch index in the Elasticsearch index format) further includes the following steps not shown in the drawings:
digitizing the teaching materials, and filtering out web page elements such as HTML tags as well as text unrelated to knowledge;
acquiring text resources from encyclopedia websites;
segmenting the texts by paragraph to form paragraph texts;
adding a segmented text to the Elasticsearch index if it can be linked to an entity in the knowledge base;
concatenating the triples in the knowledge base into text and adding them to the Elasticsearch index.
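The preprocessing steps above (strip markup, split into paragraphs, keep paragraphs that link to a known entity) can be sketched as follows. This is only an illustration under stated assumptions: entity linking is approximated by a plain substring check, and the document field names "text" and "entities" are hypothetical, not the patent's index schema.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")

def to_paragraphs(raw_html):
    """Strip HTML tags and split the remaining text into paragraphs."""
    # Treat </p> and <br> as paragraph boundaries before removing tags.
    text = re.sub(r"(?i)</p\s*>|<br\s*/?>", "\n", raw_html)
    text = unescape(TAG_RE.sub("", text))
    return [p.strip() for p in text.split("\n") if p.strip()]

def indexable_docs(raw_html, entities):
    """Keep only paragraphs that mention at least one known entity.

    Substring matching stands in for real entity linking here.
    """
    docs = []
    for para in to_paragraphs(raw_html):
        linked = [e for e in entities if e in para]
        if linked:
            docs.append({"text": para, "entities": linked})
    return docs
```

Each returned dictionary could then be sent to the search engine's indexing API; that call is omitted because the patent does not specify the index layout.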
In this embodiment, step S23 (taking large-scale basic education question-answer pairs as a training set, and training a deep text matching model until the model converges) further includes the following steps not shown in the figure:
digitizing the test questions and teaching aids, and filtering out web page elements such as HTML tags as well as text unrelated to knowledge;
selecting multiple-choice and fill-in-the-blank items from the test questions, replacing the blank in each item with the most appropriate question word to form a question, and using the item's correct answer as the answer, thereby generating a question-answer pair;
according to the following steps: 3, dividing the question-answer pairs into a training set and a verification set;
inputting the question-answer pairs into the deep text matching model shown in Fig. 2, and training until the model converges.
Referring to Fig. 2, the deep text matching model includes an embedding layer, several intermediate layers, and an output layer. The intermediate layers may be multi-layer perceptron or LSTM modules, and the output layer outputs a confidence indicating whether the input answer is a correct answer to the input question.
In this embodiment, the question answering and scoring based on the education knowledge graph in step S3 further includes the following steps shown in Fig. 1:
S31: performing entity recognition and entity linking on the user question to obtain a list of possible subjects, and assigning a corresponding confidence to each subject according to preset rules.
S32: matching the user question against the template library to obtain a list of possible predicates, and assigning a corresponding confidence to each predicate according to preset rules.
S33: generating SPARQL statements from the obtained subject list and predicate list, querying the knowledge graph to obtain an answer list, and assigning a corresponding confidence to each answer according to preset rules.
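The pairwise combination of subjects and predicates into SPARQL queries can be sketched as below. The namespace http://example.org/edukg/ and the use of bare local names are assumptions made for illustration; the patent does not disclose the IRIs of its knowledge graph, and its own query example survives only as an image.

```python
from itertools import product

# Hypothetical namespace; the real graph IRIs are not given in the source.
PREFIX = "http://example.org/edukg/"

def build_queries(subjects, predicates):
    """Pair every candidate subject with every candidate predicate.

    subjects:   list of (subject_local_name, subject_confidence)
    predicates: list of (predicate_local_name, template_priority)
    Returns a list of (sparql_string, subject_confidence, template_priority),
    so that each query carries the inputs needed for later scoring.
    """
    queries = []
    for (s, s_conf), (p, prio) in product(subjects, predicates):
        sparql = (
            f"SELECT ?answer WHERE {{ "
            f"<{PREFIX}{s}> <{PREFIX}{p}> ?answer . }}"
        )
        queries.append((sparql, s_conf, prio))
    return queries
```

Keeping the subject confidence and template priority next to each query string means the score of an answer can be computed as soon as the query returns.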
In this embodiment, step S31 (performing entity recognition and entity linking on the user question to obtain a list of possible subjects, and assigning a corresponding confidence to each subject according to preset rules) further includes the following steps not shown in the figure:
Entity recognition and entity linking are performed on the natural-language question input by the user. The methods mainly used are instance table matching, template segmentation, synonym thesaurus lookup, similarity calculation, and longest-common-substring matching; a priority is set according to the confidence of each method, yielding a candidate entity set. The priority rules are as follows:
Instance table matching: the mention exactly matches an entity in the knowledge graph; the confidence is 1;
Template segmentation matching: the subject is obtained through a template. For example, the question "Who is the author of 'Quiet Night Thoughts'?" is first matched to the template "(? ";
the capturing group "'Quiet Night Thoughts'" is obtained through the regular expression, and after stop words are removed the subject "Quiet Night Thoughts" is obtained; the confidence of this method is 0.8;
Synonym thesaurus lookup, similarity calculation, and longest-common-substring matching all follow a similar idea, so their confidence is set to 0.6.
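The three confidence tiers above can be sketched as a small function. This is an illustration only: it assumes the mention has already been extracted from the question, and it uses Python's difflib as a stand-in for the patent's synonym, similarity, and longest-common-substring methods, whose exact form is not disclosed. The name subject_candidates and the 0.6 similarity cutoff are hypothetical.

```python
import difflib

def subject_candidates(mention, instances, stop_words=()):
    """Map a mention to candidate instances with tiered confidences.

    mention:    text believed to name the subject (for example a template capture)
    instances:  collection of instance names from the knowledge graph
    stop_words: substrings removed before the second-tier lookup
    Returns {instance_name: confidence}, keeping the highest tier reached.
    """
    scores = {}
    # Tier 1: exact match against the instance table.
    if mention in instances:
        scores[mention] = 1.0
    # Tier 2: exact match after removing stop words.
    cleaned = mention
    for w in stop_words:
        cleaned = cleaned.replace(w, "")
    cleaned = cleaned.strip()
    if cleaned in instances:
        scores.setdefault(cleaned, 0.8)
    # Tier 3: fuzzy match (difflib is a substitute technique, see lead-in).
    for inst in difflib.get_close_matches(cleaned, list(instances), n=3, cutoff=0.6):
        scores.setdefault(inst, 0.6)
    return scores
```

Because setdefault never overwrites an existing entry, an instance found by a stricter tier keeps its higher confidence even if a looser tier finds it again.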
In this embodiment, step S32 (matching the user question against the template library to obtain a list of possible predicates, and assigning a corresponding confidence to each predicate according to preset rules) further includes the following steps not shown in the figure:
The predicates are determined by matching the templates one by one; if a template matches, the attribute corresponding to that template is taken as the predicate of the question. For example, the question "habitually, what mountains are used as a boundary to divide our country into monsoon regions and non-monsoon regions" is matched to the template "(? For border ", and its corresponding attribute is determined to be [borderline].
The corresponding confidence rules are as follows (a smaller value denotes a more reliable template):
for templates generated directly from a relation (or attribute) in the knowledge graph, and templates written for special types of questions, the confidence value is set to 1;
for templates generated from highly discriminative question words (such as "who" and "when"), the confidence value is set to 2;
for templates generated from ambiguous phrases or question words (such as "what"), the confidence value is set to 3.
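Template matching with the three priority levels can be sketched as follows. The regular expressions, predicate names, and English wording below are hypothetical examples written for this illustration; the patent's own templates are truncated in the source and are not reproduced here.

```python
import re

# Hypothetical templates: (compiled regex, predicate, priority).
# Priority 1: built from a knowledge-graph attribute; 2: distinctive
# question word; 3: general question word.
TEMPLATES = [
    (re.compile(r"^what is the geographic location of (?P<subject>.+?)\??$", re.I),
     "geographic location", 1),
    (re.compile(r"^who (?:wrote|is the author of) (?P<subject>.+?)\??$", re.I),
     "author", 2),
    (re.compile(r"^what is (?P<subject>.+?)\??$", re.I),
     "definition", 3),
]

def match_predicates(question):
    """Try every template and return the hits ordered by priority.

    Each hit is (predicate, captured_subject, priority).
    """
    hits = []
    q = question.strip()
    for pattern, predicate, priority in TEMPLATES:
        m = pattern.match(q)
        if m:
            hits.append((predicate, m.group("subject"), priority))
    return sorted(hits, key=lambda h: h[2])
```

A question may match several templates; all hits are kept so that the later scoring step, rather than the matcher, decides which predicate wins.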
In this embodiment, step S33 (generating SPARQL statements from the obtained subject list and predicate list, querying the knowledge graph to obtain an answer list, and assigning a corresponding confidence to each answer according to preset rules) further includes the following steps not shown in the figure. SPARQL statements are generated from the subject list and predicate list obtained in steps S31 and S32. Because there may be several subjects and several predicates, they are combined pairwise into triple patterns when the query statements are generated; one query statement is generated per combination, and a score is determined for each. For example, the query statement for "habitually, what mountains are used as a boundary to divide our country into monsoon regions and non-monsoon regions" is:
[The query statement appears as an image in the original publication: Figure GDA0002716931470000111]
The candidate answers in the candidate answer set are scored and ranked according to the confidences of the entities and predicates obtained in steps S31 and S32 and their respective types, and answers that reach the threshold are selected as correct answers. Query results are scored mainly according to the priorities of the subject and the template, with the formula score = subjectScore × pscore. Here pscore is the predicate score and is determined by the template priority:
pscore = 1 / template priority;
subjectScore is the subject score, calculated as subjectScore = 20 × rate × subject confidence;
rate is determined by the longest common substring of the subject and the question:
rate = Math.sqrt(length of longest common substring / length of subject) × Math.pow(length of subject, 1.0/2)
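The scoring formulas above translate directly into code. The sketch below follows them term by term; the function names and the guard for an empty subject are additions made for this illustration.

```python
import math

def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def answer_score(subject, question, subject_conf, template_priority):
    """score = subjectScore * pscore, following the formulas in the text."""
    if not subject:
        return 0.0  # guard added here; the source does not cover this case
    lcs = longest_common_substring(subject, question)
    rate = math.sqrt(lcs / len(subject)) * math.pow(len(subject), 1.0 / 2)
    subject_score = 20 * rate * subject_conf
    pscore = 1 / template_priority
    return subject_score * pscore
```

For a four-character subject that appears verbatim in the question, rate is 2, so an exact instance match (confidence 1) through a first-priority template scores 40, while a 0.8-confidence subject through a second-priority template scores 16.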
In this embodiment, the text-based question answering and scoring in step S4 further includes the following steps shown in Fig. 1:
S41: obtaining keywords of different levels in the question according to a preset strategy by part-of-speech tagging.
S42: inputting the keywords of different levels obtained in the semantic parsing step into the Elasticsearch engine, and performing coarse-grained matching over the massive index according to a preset query strategy to obtain a coarse-grained answer list.
S43: performing fine-grained matching on the coarse-grained answer list obtained in step S42 with the deep text matching model trained in step S23, obtaining and ranking the answers, and returning the top-ranked answer.
In this embodiment, step S41 (obtaining keywords of different levels in the question according to a preset strategy by part-of-speech tagging) further includes the following steps not shown in the figure:
first performing word segmentation and part-of-speech tagging on the question input by the user to obtain the part of speech of each word;
adding each word in the question to the list of the corresponding keyword level, using the keyword level of each part of speech given in Table 1.
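Assigning words to keyword levels can be sketched as a lookup on the part-of-speech tag. Since Table 1 survives only as an image, the tag sets below contain just the tags named in the text, and sending every unlisted tag to the third level is an assumption of this sketch.

```python
# Tag sets taken from the tags named in the text; Table 1 itself is not available.
PRIMARY = {"n", "v", "nr"}        # nouns, verbs, person names
SECONDARY = {"d", "m", "Ng"}      # adverbs, numerals, noun morphemes
TERTIARY = {"c", "Dg", "e", "f"}  # conjunctions, adverbial morphemes, interjections, locatives

def keyword_levels(tagged_words):
    """Group (word, pos_tag) pairs into keyword levels 1, 2, and 3.

    tagged_words can come from any word segmenter and tagger.
    Tags not listed above fall into level 3 (an assumption of this sketch).
    """
    levels = {1: [], 2: [], 3: []}
    for word, tag in tagged_words:
        if tag in PRIMARY:
            levels[1].append(word)
        elif tag in SECONDARY:
            levels[2].append(word)
        else:
            levels[3].append(word)
    return levels
```

The TERTIARY set is listed only for documentation; because level 3 is the fallback, the function never needs to consult it.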
In this embodiment, step S42, inputting the keywords of different levels from the semantic parsing step into the Elasticsearch engine and performing coarse-grained matching against the massive index according to a preset query strategy to obtain a coarse-grained answer list, further includes the following steps, not shown in the figure:
a strict phrase query is issued for each primary keyword, all of these phrase queries are logically combined, and at least 50% of them are required to match;
a strict phrase query is issued for each secondary keyword, all of these phrase queries are logically combined, and no minimum number of matching queries is set;
no query is issued for the tertiary keywords;
Elasticsearch returns the candidate answers and a confidence score for each candidate answer according to this strategy.
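One way to express this strategy in the Elasticsearch query DSL is sketched below as a plain Python dictionary. It is an assumption-laden illustration: the field name "content", the function name, and the choice to nest the primary clauses under `must` while leaving the secondary clauses as optional `should` clauses are not specified in the text. Only the standard `bool`, `match_phrase`, and `minimum_should_match` constructs are used.

```python
from typing import Any, Dict, List


def build_coarse_query(primary: List[str], secondary: List[str],
                       field: str = "content") -> Dict[str, Any]:
    """Build a request body for the coarse-grained matching step."""
    outer: Dict[str, Any] = {"bool": {}}
    if primary:
        # Phrase queries for primary keywords; at least 50% must match.
        outer["bool"]["must"] = [{
            "bool": {
                "should": [{"match_phrase": {field: kw}} for kw in primary],
                "minimum_should_match": "50%",
            }
        }]
    if secondary:
        # Phrase queries for secondary keywords; optional, they only
        # raise the relevance score of documents that contain them.
        outer["bool"]["should"] = [
            {"match_phrase": {field: kw}} for kw in secondary
        ]
    # Tertiary keywords are deliberately not queried.
    return {"query": outer}


body = build_coarse_query(["land", "sea", "ratio"], ["approximately"])
print(body)
```

The resulting dictionary could be passed as the search body to an Elasticsearch client; the relevance score returned for each hit would then play the role of the candidate's confidence score.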
In this embodiment, step S43, performing fine-grained matching on the coarse-grained answer list obtained in the previous step by using the deep text matching model trained in step S23, obtaining and ranking the answers, and returning the top-ranked answer, further includes the following steps, not shown in the figure:
taking the 10 candidate answers with the highest confidence scores among those obtained in step S42;
inputting each of these answers together with the question into the deep text matching model trained in step S23 to obtain a confidence score for each answer;
selecting the answer with the highest confidence score and returning it to the user.
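The re-ranking step can be sketched as follows. The trained deep text matching model is represented by an arbitrary scoring callable; the word-overlap scorer used here is only a hypothetical stand-in so that the example runs, and all function names are assumptions.

```python
from typing import Callable, List, Tuple


def rerank_top_k(question: str,
                 candidates: List[Tuple[str, float]],
                 match_score: Callable[[str, str], float],
                 k: int = 10) -> List[Tuple[str, float]]:
    """Keep the k best coarse candidates, then re-score and sort them."""
    top = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    rescored = [(answer, match_score(question, answer)) for answer, _ in top]
    rescored.sort(key=lambda c: c[1], reverse=True)
    return rescored


def toy_overlap_score(question: str, answer: str) -> float:
    """Stand-in for the trained model: fraction of question words in the answer."""
    q_words = set(question.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & set(answer.lower().split())) / len(q_words)


coarse = [("the ratio of land to sea is about 3 to 7", 4.2),
          ("the sea is deep", 5.0)]
ranked = rerank_top_k("land sea ratio", coarse, toy_overlap_score)
best_answer = ranked[0][0] if ranked else None
print(best_answer)
```

Restricting the expensive model to the top 10 coarse candidates keeps the fine-grained step cheap while still letting it override the search engine's ordering.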
In this embodiment, the answer selection based on both the educational knowledge graph and the text in step S5 further includes the following steps, not shown in the figure:
ranking the knowledge-graph-based answers by score;
ranking the text-based answers by score;
if the highest-scoring knowledge-graph-based answer exceeds a preset threshold, returning that answer;
if the highest-scoring knowledge-graph-based answer does not exceed the preset threshold, returning the highest-scoring text-based answer.
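The selection rule above can be sketched as a small function. The threshold value is not given in the text, so it is left as a parameter; the function name, the returned source label, and the behavior when both lists are empty are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

ScoredAnswer = Tuple[str, float]


def select_answer(kg_answers: List[ScoredAnswer],
                  text_answers: List[ScoredAnswer],
                  threshold: float) -> Tuple[Optional[str], str]:
    """Prefer the best knowledge-graph answer when its score exceeds the threshold."""
    kg_sorted = sorted(kg_answers, key=lambda a: a[1], reverse=True)
    text_sorted = sorted(text_answers, key=lambda a: a[1], reverse=True)
    if kg_sorted and kg_sorted[0][1] > threshold:
        return kg_sorted[0][0], "knowledge_graph"
    if text_sorted:
        return text_sorted[0][0], "text"
    return None, "none"  # assumption: no answer is available from either source


# Hypothetical scores and threshold, for illustration only.
print(select_answer([("KG answer", 40.0)], [("text answer", 0.9)], threshold=30.0))
print(select_answer([("KG answer", 12.0)], [("text answer", 0.9)], threshold=30.0))
```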
The system is a hybrid automatic question-answering system built on a basic education knowledge graph and a large body of electronic texts. The basic education knowledge graph comprises 2200 million triples, 162 million instances, 1000 concepts and 4000 attributes. The knowledge sources comprise an annotation library and an external source library: the annotation library is obtained by annotating knowledge points in teaching materials, and the external source library is extracted from encyclopedia and Internet data. Together they cover essentially all knowledge points of the nine subjects taught in primary and secondary school. The electronic texts mainly comprise 1300 basic education textbooks from the major Chinese basic education publishers and 10011 electronic extracurricular readings.
In the preparatory work, a large number of test questions were obtained by digitizing existing teaching materials and supplementary test papers, and a large number were also collected from the Internet. The question types mainly include fill-in-the-blank, multiple-choice, reading comprehension and composition questions. The KB-QA system cannot parse these questions directly, so they must be sampled and extracted, and their stems rewritten to convert them into questions that the system can parse. For example, "The world land-to-sea ratio is about ( )" is converted to "About how much is the world land-to-sea ratio?".
The details of the subject questions after rule transformation are shown in Table 3.
Table 3. Statistics of the test cases for the nine subjects in the basic education field, according to the example of the present invention.
[Table 3 appears as an image in the original publication: Figure GDA0002716931470000131]
Answer accuracy is used as the evaluation index. Test cases are designed separately for each subject's question bank; the subject questions are input into the question-answering system for testing, and the answers given by the system are recorded. The subjects comprise Chinese, mathematics, English, physics, chemistry, history, geography, biology and politics. A total of 9020 test cases were designed, and the test results are shown in Table 4.
Table 4. Test results of the test cases for the nine subjects in the basic education field, according to the example of the present invention.
Test subject   Total cases   Executed cases   Correct cases   Incorrect cases   Accuracy
Chinese        1007          1007             787             220               78.15%
Mathematics    926           926              862             64                93.09%
English        1033          1033             887             146               85.87%
Physics        1000          1000             911             89                88.40%
Chemistry      1001          1001             897             104               89.61%
History        1040          1040             904             136               83.17%
Geography      1017          1017             739             278               72.66%
Biology        1000          1000             860             140               85.5%
Politics       996           996              885             111               88.86%
Total          9020          9020             7732            1288              85.72%
Example:
In the politics subject, consider the question "What is the meaning of enterprise merger?". Because the knowledge graph contains the entity "enterprise merger" and this entity has the attribute "meaning", the KB-QA method can be used directly to obtain the accurate answer "the economic phenomenon in which well-managed, economically strong enterprises merge with enterprises at a relative disadvantage". For the question "Which national institution has the highest status in our country?", however, the knowledge graph lacks the related entities and relations, so the answer is obtained through the retrieval and screening of IE-QA: "The National People's Congress holds the highest position among the state organs of our country; the other central state organs are created by it, are responsible to it, and are supervised by it".
The present invention is not limited to the above embodiments, and any changes or substitutions that can be easily made by those skilled in the art within the technical scope of the present invention are also within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A mixed automatic question-answering method based on an educational knowledge graph and a text is characterized by comprising the following steps:
step 1: constructing a basic education ontology, performing semantic annotation on the teaching materials of each subject, and performing information extraction on the teaching materials and Internet encyclopedia text resources to construct an all-subject basic education knowledge graph; constructing general question templates from keywords and regular expression syntax;
step 2: building a full-text search engine, and preprocessing the massive texts of teaching materials and Internet encyclopedias to conform to the index format of the search engine; taking large-scale basic education test question-answer pairs as a training set, and training a deep text matching model until the model converges;
step 3: performing entity recognition on the user question to obtain a subject list, and assigning a corresponding confidence to each subject; performing template matching on the user question to obtain a predicate list, and assigning a corresponding confidence to each predicate; querying the knowledge graph according to the subject list and the predicate list to obtain an answer list based on the educational knowledge graph, and assigning a corresponding confidence to each answer;
step 4: obtaining keywords of different levels in the question by using a part-of-speech tagging method, and inputting the keywords into the search engine for coarse-grained matching to obtain a text-based answer list; performing fine-grained matching on the text-based answer list by using the pre-trained deep text matching model to obtain and rank the answers;
step 5: returning the answer based on the educational knowledge graph if the highest confidence among those answers exceeds a threshold; otherwise, returning the top-ranked text-based answer;
wherein assigning a corresponding confidence to each answer specifically comprises:
combining the subjects in the subject list and the predicates in the predicate list one by one to generate SPARQL query statements;
querying the educational knowledge graph to obtain an answer list;
assigning a corresponding confidence to each answer according to a preset rule, wherein the confidence is calculated as follows:
$$\text{score} = \text{subjectscore} \times \text{pscore}$$ where pscore is the score of the predicate and subjectscore is the score of the subject;
pscore is determined by the template confidence: $$\text{pscore} = \frac{1}{\text{template confidence}}$$
subjectscore is determined by the subject confidence: $$\text{subjectscore} = 20 \times \text{rate} \times \text{subject confidence}$$
rate is determined by the longest common substring of the subject and the question:
$$\text{rate} = \sqrt{\frac{\text{length of longest common substring}}{\text{length of subject}}} \times (\text{length of subject})^{1/2}$$
2. The automatic question-answering method according to claim 1, wherein the basic education ontology is constructed by a semi-automatic ontology construction method.
3. The automated question-answering method according to claim 1, wherein the information extraction is for augmenting instances, relationships and attributes of knowledge.
4. The automatic question-answering method according to claim 1, characterized in that said constructing general question templates specifically comprises:
forming general question templates by taking the relations or attributes in the educational knowledge graph as keywords and combining them with regular expression syntax;
analyzing the questions in a large-scale educational question-answer dataset with a syntactic analysis tool, extracting keywords, and forming general question templates by combining the keywords with regular expression syntax;
generating templates based on highly discriminative question words;
generating templates based on general question words.
5. The automatic question-answering method according to claim 1, wherein the full-text search engine is Elasticsearch, a scalable open-source full-text search and analytics engine.
6. The automatic question-answering method according to claim 1, characterized in that said assigning a corresponding confidence to each subject specifically comprises:
an instance that exactly matches an instance in the instance table has a confidence of 1;
an instance obtained through template segmentation and stop-word removal has a confidence of 0.8;
an instance obtained through fuzzy matching, by similarity calculation and longest common substring matching, has a confidence of 0.6.
7. The automatic question-answering method according to claim 1, wherein said assigning a corresponding confidence to each predicate specifically comprises:
a template generated from a relation or attribute in the educational knowledge graph has a confidence of 1;
a template generated from keywords extracted by syntactic analysis has a confidence of 1;
a template generated from highly discriminative question words has a confidence of 2;
a template generated from general question words has a confidence of 3.
8. The automatic question-answering method according to claim 1, wherein the part-of-speech tagging method specifically comprises:
setting nouns, verbs (v), person names (nr) and other words that act as subjects or predicates as primary keywords;
setting adverbs (d), numerals (m), noun morphemes (Ng) and other words that modify a subject or predicate as secondary keywords;
setting conjunctions (c), adverb morphemes (Dg), interjections (e), locality words (f) and other words unrelated to the keywords as tertiary keywords.
9. The automatic question-answering method according to claim 1, wherein the coarse-grained matching specifically comprises:
issuing a strict phrase query for each primary keyword, logically combining all of these phrase queries, and requiring at least 50% of them to match;
issuing a strict phrase query for each secondary keyword, logically combining all of these phrase queries, and setting no minimum number of matching queries;
issuing no query for the tertiary keywords.
CN201910203301.7A 2019-03-18 2019-03-18 Education knowledge map and text-based hybrid automatic question-answering method Active CN110147436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910203301.7A CN110147436B (en) 2019-03-18 2019-03-18 Education knowledge map and text-based hybrid automatic question-answering method

Publications (2)

Publication Number Publication Date
CN110147436A CN110147436A (en) 2019-08-20
CN110147436B true CN110147436B (en) 2021-02-26

Family

ID=67588923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910203301.7A Active CN110147436B (en) 2019-03-18 2019-03-18 Education knowledge map and text-based hybrid automatic question-answering method

Country Status (1)

Country Link
CN (1) CN110147436B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant