CN117540063A

CN117540063A - Method and device for optimizing education-domain knowledge base search based on question generation

Info

Publication number: CN117540063A
Application number: CN202311273406.2A
Authority: CN
Inventors: 王琪皓; 黄程韦; 朱晓明; 曹柳; 巨然
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-09-28
Filing date: 2023-09-28
Publication date: 2024-02-09

Abstract

The invention discloses a question-generation-based method for optimizing knowledge base search in the education domain. The method first acquires and parses domain corpora related to the education knowledge base to obtain education-domain text; performs transfer learning on a pre-trained language model with the education-domain text to obtain a semantic model; designs fixed question-answer templates from the structured text information already in the knowledge base to obtain knowledge base question-answer pairs; trains a question generation model on the knowledge base question-answer pairs together with open-source Chinese question-answer pairs, and deploys it as a question generation inference service; generates question-answer pairs to expand the knowledge base; uses the semantic model to encode both the entity node texts in the knowledge base's structured information and the question texts of the question-answer pairs, builds a vector library, and computes semantic similarity once a user query is entered; and recalls the best result through online semantic matching. The invention greatly improves the recall rate of the results returned for user searches, and thereby improves learning efficiency and user experience.

Description

Method and device for optimizing education-domain knowledge base search based on question generation

Technical Field

The invention relates to the technical fields of education knowledge bases, search, and question generation, and in particular to a method and device for optimizing education-domain knowledge base search based on question generation.

Background

Education is fundamental to the nation, and with the advance of education informatization, Internet-driven intelligent education has entered a period of rapid development. Education and knowledge are naturally connected, and the knowledge graph, as a key technical foundation of cognitive intelligence, plays a decisive role in intelligent education. Knowledge search and question answering, an intelligent application built on the education knowledge graph, is a key link in intelligent education and one of the functions on which educators and learners depend most. At present, search over an education-domain knowledge base built from knowledge graph triples is limited and cannot meet users' actual needs. For example, when a learner enters an unfamiliar field of knowledge, inaccurate search terms often make the relevant content hard to recall, so the user's immediate need goes unmet. Effectively improving knowledge base search would therefore greatly improve the user experience, help learners learn better, and make lesson preparation more convenient for educators.

Common knowledge base search optimization methods, such as directly applying fuzzy search, query suggestions, or semantic matching, have little effect in the education domain, which demands precision. Other methods combine existing search techniques with the knowledge base, for example: 1. string matching between the processed search terms and the knowledge base content; 2. encoding the search terms and the knowledge in the library with an optimized semantic model and computing their similarity; 3. a two-step method that uses the existing graph structure of the knowledge base for multi-hop assisted search. These schemes still improve recall only within the existing knowledge content, and have limited effect on ambiguity, vagueness, and sparsity problems.

Because of its standardized knowledge representation, an education-domain knowledge base often contains only concise, unified terminology. Current methods are limited to matching or similarity computation over the text contained in the knowledge base. They can handle precise term searches, but users' actual search terms, such as the everyday question-style queries an instructor enters, are often hard to match. Schemes that only optimize the matching effect therefore struggle to resolve this pain point and meet user needs. The present method optimizes the matching effect and, at the same time, uses question generation to expand the library to be searched, which greatly improves recall.

Some dynamic knowledge base schemes also exist that try to improve low recall based on user feedback. However, such schemes sacrifice the user experience in the early stage, require a long period of accumulation, and achieve little because users are reluctant to give feedback.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides a method and device for optimizing education-domain knowledge base search based on question generation, which uses question generation to address the low recall rate of knowledge base search results.

The object of the invention is achieved by the following technical solution: a question-generation-based method for optimizing education-domain knowledge base search, comprising the following steps:

(1) Acquire domain corpora related to the education knowledge base and parse them to obtain education-domain text;

(2) Perform transfer learning on a pre-trained language model with the education-domain text to obtain a semantic model;

(3) Design fixed question-answer templates from the structured text information in the knowledge base to obtain knowledge base question-answer pairs;

(4) Train a question generation model on the knowledge base question-answer pairs together with open-source Chinese question-answer pairs, and deploy it as a question generation inference service;

(5) Generate question-answer pairs to expand the knowledge base: process the education-domain text with the question generation inference service, generating questions and obtaining question-answer pair data; use the semantic model to semantically encode both the entity node texts in the knowledge base's structured text information and the question texts of the question-answer pairs, build a vector library, and compute semantic similarity once a user query is entered;

(6) Recall the best result through online semantic matching: encode the user query online with the semantic model, and use the query encoding to search the vector library for the result with the highest semantic similarity; if the result is a knowledge base entity node text, return the structured text information corresponding to that entity node; if it is a question text, return the corresponding answer text.

Further, step (1) is specifically: acquire the teaching materials, courseware, and professionally relevant web page content for the subject area; apply OCR or layout analysis to the teaching materials and courseware, and crawl and parse the web page content, to obtain the complete text content; then remove invalid characters, duplicate content, and ungrammatical text to obtain the education-domain text.

Further, step (2) is specifically: based on a natural-language-understanding pre-trained language model such as BERT, ELECTRA, or ALBERT, carry out the MLM and MPNet pre-training tasks with the education-domain text and run transfer learning experiments to obtain the semantic model.

Further, in step (3), the structured text information is triple-structured data.

Further, the fixed question-answer templates are designed from the relations between entities, or the attributes of entities, in the triple-structured data.

Further, step (3) also includes: designing questions from the prerequisite and successor relations between knowledge points to obtain knowledge base question-answer pairs.

Further, step (4) includes the following sub-steps:

(4.1) Survey open-source question-answer corpora, convert the different corpora into a query-answer format consistent with the knowledge base question-answer pairs, store them, and combine them with the knowledge base question-answer pairs to form the question-answer training samples.

(4.2) Augment the question-answer training samples: taking the training samples from step (4.1) as positive samples, construct negative samples, building hard negatives with a contrastive learning method and simple negatives by replacing keywords or changing the word order of a sample; increase the proportion of knowledge base question-answer pairs among the positive and negative samples to strengthen domain capability.

(4.3) Construct a prompt template and perform question generation task learning based on a generative pre-trained language model: divide the augmented training samples into a training set and an evaluation set; format the model input as SEP token, answer text, CLS token, and prompt template tokens, with the question text as the output; train the model on the training set and evaluate it on the evaluation set to obtain the best question generation model.

(4.4) Build the question generation service from the best question generation model combined with a score threshold. Specifically, the best question generation model outputs question texts with corresponding scores, which are screened against the score threshold: a question is retained if its score is greater than or equal to the threshold and removed otherwise. The score threshold is 0.7 to 0.9.

Further, in step (4.3), the generative pre-trained language model is a GPT, T5, or BART model.

A question-generation-based education-domain knowledge base search optimization device comprises one or more processors configured to implement the question-generation-based education-domain knowledge base search optimization method described above.

A computer-readable storage medium stores a program which, when executed by a processor, implements the question-generation-based education-domain knowledge base search optimization method described above.

The beneficial effects of the invention are as follows: the recall rate of the results returned for user searches is greatly improved, which in turn improves learning efficiency and user experience.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.

FIG. 1 is a schematic diagram of the overall framework of the method of the present invention;

FIG. 2 is a diagram of the construction and inference structure of the question generation service;

FIG. 3 is a schematic diagram of the internal structure of the question generation model;

FIG. 4 is a diagram of the overall architecture of the online search module;

FIG. 5 is a hardware configuration diagram of the present invention.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.

The present invention will be described in detail with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other without conflict.

The invention obtains domain text by processing domain corpora that are easy to acquire in the education domain. It uses question generation to expand the education-domain knowledge base, optimizes retrieval with a semantic model trained on the same data source, and adds strategies such as data augmentation and threshold screening, thereby addressing the low recall rate of education-domain knowledge base search.

The question-generation-based education-domain knowledge base search optimization method of the invention, shown in FIG. 1, comprises the following steps:

(1) Taking the robotics major as an example, mine the domain resources. Collect the teaching materials used in robotics courses: materials already in text format are used directly, while PDF teaching materials undergo text extraction with layout analysis or OCR. For other domain resources, such as professional robotics websites, encyclopedia introduction pages, and school intranet education resources, downloadable content (PPT files, videos, and the like) is parsed to extract text, and content that cannot be downloaded is crawled. Once the raw text is obtained, it is uniformly cleaned and parsed: invalid characters, duplicate content, ungrammatical text, and the like are removed, and the result is organized into the robotics-domain text.
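The cleaning pass described in this step can be sketched as follows. This is a minimal illustration only: the function name `clean_corpus` is hypothetical, "invalid characters" is assumed to mean control characters and the Unicode replacement character, and the removal of ungrammatical text is not covered because the description does not say how it is detected.

```python
import re
import unicodedata


def clean_corpus(lines):
    """Drop invalid characters, empty lines, and exact duplicates (illustrative rules)."""
    seen = set()
    cleaned = []
    for line in lines:
        # Collapse every whitespace run (tabs, newlines, spaces) to one space.
        text = re.sub(r"\s+", " ", line)
        # Assumption: "invalid" means control/format characters and U+FFFD.
        text = "".join(
            ch for ch in text
            if unicodedata.category(ch)[0] != "C" and ch != "\ufffd"
        ).strip()
        # Skip lines that are empty after cleaning or already seen.
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    raw = ["Rigid body\x00 dynamics", "Rigid body dynamics", "   "]
    print(clean_corpus(raw))
```

Exact-match deduplication is the simplest choice here; a production pipeline might prefer near-duplicate detection, which the description leaves open.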

(2) Based on a pre-trained language model, transfer learning is used to obtain a semantic model that performs better in the robotics domain. Using the robotics-domain text obtained above, a Chinese sense-bert model serves as the backbone model for fine-tuning. Full training is carried out with different pre-training tasks, such as MLM and MPNet, and with different hyperparameters, and the resulting models are tested to select the semantic model with the best semantic understanding in the robotics domain. The test uses a small number of manually prepared word-similarity and sentence-similarity test sets. For example, the relative sizes of the similarity scores between the entity term "robot dynamics" and the terms "robot statics" and "rigid body" are compared across the different trained semantic models.
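The relative-similarity test described above can be sketched as a small harness that checks whether a candidate encoder ranks term pairs in the expected order. Everything named here is an assumption for illustration: `ordering_accuracy` is a hypothetical helper, the toy vectors stand in for a real model's embeddings, and the expectation that "robot statics" should score closer to "robot dynamics" than "rigid body" does is chosen only to make the example concrete.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ordering_accuracy(encode, test_cases):
    """Fraction of (anchor, closer, farther) triples ranked in the expected order."""
    correct = 0
    for anchor, closer, farther in test_cases:
        a = encode(anchor)
        if cosine(a, encode(closer)) > cosine(a, encode(farther)):
            correct += 1
    return correct / len(test_cases)


# Hypothetical embeddings standing in for one trained semantic model.
TOY_VECTORS = {
    "robot dynamics": [0.9, 0.4, 0.1],
    "robot statics": [0.8, 0.5, 0.2],
    "rigid body": [0.2, 0.3, 0.9],
}


def toy_encode(text):
    return TOY_VECTORS[text]


if __name__ == "__main__":
    cases = [("robot dynamics", "robot statics", "rigid body")]
    print(ordering_accuracy(toy_encode, cases))
```

Running the same test cases against each trained model gives one comparable number per model, which is one way to pick between different pre-training tasks and hyperparameters.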

(3) The robotics knowledge base consists mainly of the constructed robotics knowledge graph triple data, which covers the relations between different entities and the attribute information of each entity. For example, a theoretical-knowledge-point entity has the attribute "definition", from which a question can be generated: "What is the definition of [entity]?" The corresponding attribute value is the answer, which yields attribute-based knowledge base question-answer data. Likewise, a theoretical-knowledge-point entity and a practical-skill knowledge point stand in a "supports" relation, from which a question can be generated: "What is the relationship between the entities?" The corresponding relation category is the answer, which yields relation-based knowledge base question-answer data. The prerequisite and successor relations between knowledge points are also used to obtain question patterns that ask about preceding and following knowledge points. For example, the entity "rigid body" is recorded as a prerequisite knowledge point of the entity "rigid body dynamics", and a question of the following form can be obtained: "Which prerequisite knowledge points does robot dynamics involve?" Answer: rigid body dynamics, robot motion. Other structured information can be turned into question-answer pairs in the same way, finally yielding the knowledge base question-answer pairs.
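The template-based construction of knowledge base question-answer pairs can be sketched as follows. The template wording, the relation label `prerequisite_of`, the triple layout, and the function name are all assumptions made for illustration; the description only specifies that attributes, relations, and prerequisite links are each mapped to fixed question patterns.

```python
from collections import defaultdict

# Hypothetical fixed templates, one per attribute or relation pattern.
ATTRIBUTE_TEMPLATES = {"definition": "What is the definition of {entity}?"}
RELATION_TEMPLATE = "What is the relationship between {head} and {tail}?"
PREREQ_TEMPLATE = "Which prerequisite knowledge points does {entity} involve?"


def build_kb_qa_pairs(attribute_triples, relation_triples,
                      prereq_relation="prerequisite_of"):
    """Turn (entity, attribute, value) and (head, relation, tail) triples into QA pairs."""
    pairs = []
    for entity, attribute, value in attribute_triples:
        template = ATTRIBUTE_TEMPLATES.get(attribute)
        if template:
            pairs.append({"query": template.format(entity=entity), "answer": value})

    prerequisites = defaultdict(list)
    for head, relation, tail in relation_triples:
        if relation == prereq_relation:
            # Collect all prerequisites of the same knowledge point into one answer.
            prerequisites[tail].append(head)
        else:
            pairs.append({
                "query": RELATION_TEMPLATE.format(head=head, tail=tail),
                "answer": relation,
            })
    for entity, heads in prerequisites.items():
        pairs.append({
            "query": PREREQ_TEMPLATE.format(entity=entity),
            "answer": ", ".join(heads),
        })
    return pairs


if __name__ == "__main__":
    relations = [
        ("rigid body dynamics", "prerequisite_of", "robot dynamics"),
        ("robot motion", "prerequisite_of", "robot dynamics"),
    ]
    for pair in build_kb_qa_pairs([], relations):
        print(pair)
```

The output uses the same query-answer format that the open-source corpora are later converted to, so both sources can be merged into one training set.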

(4) Open-source question-answer corpora are surveyed and selected to obtain higher-quality open-source Chinese question-answer corpus data close to the robotics domain. The SQuAD 2.0 dataset, the most widely used in question answering, is used to learn the semantic logic of question-answer pairs, and the WebQA dataset is used to learn how real users ask questions; these are combined with the constructed robotics knowledge base question-answer data to form the question-answer training samples. The training samples are then augmented. Taking the question-answer training samples as positive samples, negative samples are constructed: hard negatives are built with contrastive learning methods, for example adversarial training with PGD (Projected Gradient Descent), and simple negatives are built with rule-based methods, for example replacing key entity words in the question text or changing the order of phrases in the answer text. In addition, because the robotics-domain question-answer data is small relative to the open-source corpora, the proportion of knowledge base question-answer pairs among the positive and negative samples is increased to strengthen domain capability.
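The two rule-based ways of building simple negatives mentioned above can be sketched as follows. The function names and the comma-separated phrase convention are assumptions for illustration, and reversing the phrase order is just one deterministic way to "change the order"; the PGD-based hard negatives are outside the scope of this sketch.

```python
import random


def replace_key_entity(question, entity, candidates, rng):
    """Swap the key entity for a different one so the original answer no longer fits."""
    others = [c for c in candidates if c != entity]
    if entity not in question or not others:
        return None
    return question.replace(entity, rng.choice(others))


def reorder_answer_phrases(answer, sep=", "):
    """Reverse the phrase order of an answer; None if that changes nothing."""
    phrases = answer.split(sep)
    reordered = phrases[::-1]
    if reordered == phrases:
        return None
    return sep.join(reordered)


def build_simple_negatives(sample, entity, candidates, rng):
    """Return negative samples (label 0) derived from one positive sample."""
    negatives = []
    bad_query = replace_key_entity(sample["query"], entity, candidates, rng)
    if bad_query is not None:
        negatives.append({"query": bad_query, "answer": sample["answer"], "label": 0})
    bad_answer = reorder_answer_phrases(sample["answer"])
    if bad_answer is not None:
        negatives.append({"query": sample["query"], "answer": bad_answer, "label": 0})
    return negatives


if __name__ == "__main__":
    positive = {
        "query": "Which prerequisite knowledge points does robot dynamics involve?",
        "answer": "rigid body dynamics, robot motion",
    }
    entities = ["robot dynamics", "robot statics"]
    for neg in build_simple_negatives(positive, "robot dynamics", entities, random.Random(0)):
        print(neg)
```

Passing the random generator in explicitly keeps the augmentation reproducible across training runs.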

Preferably, the question generation model is trained with the P-tuning method. After a survey, the open-source generative pre-trained models GPT-3, T5, and BART are each tested as the backbone model. As shown in FIG. 2, a multi-layer Transformer-style encoder-decoder structure is used, with the answer text as input and the corresponding query as output text, so that the questioning habits of real users are learned; question generation task learning is thus carried out on a generative pre-trained large model. The augmented training samples are divided into a training set and an evaluation set. Following the prompt template method, the model input is formatted as SEP token, answer text, CLS token, and template tokens, and the output label is the SEP token followed by the question text. The model is trained on the training set, and several comparison experiments with different models and parameters yield several candidate question generation models. These are evaluated on the evaluation set by computing the BLEU and ROUGE metrics, and the best question generation model is selected.
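The input and label layout described above, together with the training and evaluation split, can be sketched as plain string formatting. This is a sketch under stated assumptions: the token spellings `[SEP]`, `[CLS]`, and `[P0]`, `[P1]`, ... are hypothetical placeholders, the function names are illustrative, and the 10% evaluation ratio is an arbitrary choice. In P-tuning the template positions are normally filled by learnable continuous embeddings, which the placeholder tokens merely stand in for.

```python
import random


def build_example(answer, question, n_prompt_tokens=4, sep="[SEP]", cls="[CLS]"):
    """Format one sample: SEP + answer + CLS + template tokens -> SEP + question."""
    prompt = [f"[P{i}]" for i in range(n_prompt_tokens)]
    source = " ".join([sep, answer, cls, *prompt])
    target = " ".join([sep, question])
    return {"source": source, "target": target}


def split_samples(samples, eval_ratio=0.1, seed=42):
    """Shuffle and split samples into a training set and an evaluation set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_ratio))
    return shuffled[n_eval:], shuffled[:n_eval]


if __name__ == "__main__":
    example = build_example(
        "rigid body dynamics, robot motion",
        "Which prerequisite knowledge points does robot dynamics involve?",
        n_prompt_tokens=2,
    )
    print(example["source"])
    print(example["target"])
```

A real setup would hand `source` and `target` to the tokenizer of the chosen backbone model and would score generated questions on the evaluation set with BLEU and ROUGE.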

A question generation inference service is built from the question generation model's inference capability together with rules. A model inference module is built and deployed, and its effect is tested on education-domain text to obtain multiple question texts with corresponding scores. The scores and the corresponding effect metrics are evaluated, and the generation rules are adjusted accordingly, for example by setting a score threshold (in the range 0.7 to 0.9, with the specific value chosen according to the scores and effects) and removing results below the threshold. The overall architecture of the question generation service is shown in FIG. 3 and comprises offline model training and online inference. Offline, samples are built from the open-source datasets and the knowledge base dataset for model training experiments, and the best model is obtained. Online, the service is deployed; the robotics-domain text and the knowledge base nodes are encoded to build a database, which takes effect as the retrieval database of the semantic engine.
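The score-threshold rule can be sketched in a few lines. The function name and the default of 0.8 are assumptions; the 0.7 to 0.9 range and the keep-if-greater-or-equal rule come from the description.

```python
def filter_generated_questions(candidates, threshold=0.8):
    """Keep (question, score) pairs whose score is at least the threshold."""
    if not 0.7 <= threshold <= 0.9:
        raise ValueError("threshold is expected to lie in the range 0.7 to 0.9")
    return [(question, score) for question, score in candidates if score >= threshold]


if __name__ == "__main__":
    generated = [
        ("What is a rigid body?", 0.92),
        ("What is it?", 0.55),
        ("What does robot dynamics study?", 0.8),
    ]
    print(filter_generated_questions(generated, threshold=0.8))
```

Raising the threshold trades coverage of the expanded knowledge base for higher question quality, which is why the description leaves the exact value to be tuned against observed scores and effects.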

(5) Generate question-answer pairs to expand the knowledge base: use the constructed question generation inference service to process the factual declarative sentences in the education-domain text, generating questions and obtaining question-answer pair data. Then use the semantic model obtained in step (2) to semantically encode both the entity node texts in the knowledge base's structured text information and the question texts of the question-answer pairs, and build a vector library. As shown in the overall online search architecture of FIG. 4, the offline part uses the semantic model to encode and build the library; the online part encodes the user's query once it is entered, performs high-dimensional dense vector retrieval over the complete vector library, and returns the result with the highest cosine similarity.
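The offline library build and the online cosine-similarity lookup can be sketched as follows. The `VectorLibrary` class and the word-count encoder `toy_encode` are hypothetical stand-ins: a real system would call the fine-tuned semantic model to produce dense vectors and would likely use a dedicated vector index rather than a linear scan.

```python
import math

# Hypothetical vocabulary for the toy encoder below.
VOCAB = ["rigid", "body", "robot", "dynamics", "prerequisite", "definition"]


def toy_encode(text):
    """Stand-in for the semantic model: counts of a few known words."""
    words = text.lower().replace("?", " ").split()
    return [float(words.count(w)) for w in VOCAB]


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class VectorLibrary:
    """Stores unit vectors for node texts and question texts side by side."""

    def __init__(self, encode):
        self.encode = encode
        self.entries = []

    def add(self, text, kind):
        # kind is "node" for entity node texts, "question" for QA question texts.
        self.entries.append(
            {"text": text, "kind": kind, "vector": normalize(self.encode(text))}
        )

    def search(self, query):
        """Return the entry with the highest cosine similarity to the query."""
        q = normalize(self.encode(query))

        def score(entry):
            return sum(a * b for a, b in zip(q, entry["vector"]))

        best = max(self.entries, key=score)
        return {"text": best["text"], "kind": best["kind"], "score": score(best)}


if __name__ == "__main__":
    library = VectorLibrary(toy_encode)
    library.add("rigid body", "node")
    library.add("Which prerequisite knowledge points does robot dynamics involve?", "question")
    print(library.search("prerequisite of robot dynamics"))
```

Storing node texts and generated question texts in one library is what lets a question-style query match a generated question even when it shares no terms with any entity node.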

(6) Recall the best result through online semantic matching: encode the user query online with the semantic model, and use the query encoding to search the vector knowledge base for the result with the highest semantic similarity. If the result is a knowledge base node text, return the structured text information corresponding to that node; if it is a question text, return the corresponding answer text field.
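The final dispatch on the type of the best match can be sketched as follows. The dictionary layout of a hit (`kind`, `text`), the lookup tables, and the function name are assumptions for illustration; the branching rule itself follows the step above.

```python
def resolve_hit(hit, node_info, qa_answers):
    """Map the best-matching library entry to what is returned to the user."""
    kind, text = hit["kind"], hit["text"]
    if kind == "node":
        # Knowledge base node text: return its structured text information.
        return node_info[text]
    if kind == "question":
        # Generated or template question text: return the paired answer text.
        return qa_answers[text]
    raise ValueError(f"unknown entry kind: {kind}")


if __name__ == "__main__":
    # Illustrative lookup tables keyed by the text stored in the vector library.
    node_info = {"rigid body": {"prerequisite_of": ["rigid body dynamics"]}}
    qa_answers = {
        "Which prerequisite knowledge points does robot dynamics involve?":
            "rigid body dynamics, robot motion",
    }
    print(resolve_hit({"kind": "node", "text": "rigid body"}, node_info, qa_answers))
```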

Corresponding to the foregoing embodiment of the question-generation-based education-domain knowledge base search optimization method, the invention also provides an embodiment of a question-generation-based education-domain knowledge base search optimization device.

Referring to FIG. 5, a question-generation-based education-domain knowledge base search optimization device according to an embodiment of the present invention includes one or more processors configured to implement the question-generation-based education-domain knowledge base search optimization method of the above embodiment.

The processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The embodiment of the question-generation-based education-domain knowledge base search optimization device can be applied to any device with data processing capability, such as a computer. The device embodiment may be implemented in software, in hardware, or in a combination of the two. Taking software implementation as an example, the device in the logical sense is formed when the processor of the data-processing device on which it resides reads the corresponding computer program instructions from nonvolatile memory into memory and runs them. In terms of hardware, FIG. 5 shows the hardware structure of the data-processing device on which the search optimization device resides. In addition to the processor, memory, network interface, and nonvolatile memory shown in FIG. 5, that device may generally include other hardware according to its actual function, which is not described further here.

The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.

For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

The embodiment of the invention also provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the question-generation-based education-domain knowledge base search optimization method of the above embodiment.

The computer-readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in any of the previous embodiments. The computer-readable storage medium may also be an external storage device of such a device, for example a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash memory card (Flash Card) provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used for storing the computer program and the other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.

The above embodiments are merely for illustrating the design concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications according to the principles and design ideas of the present invention are within the scope of the present invention.

Claims

1. A question-generation-based method for optimizing education-domain knowledge base search, characterized by comprising the following steps:

2. The question-generation-based method for optimizing education-domain knowledge base search according to claim 1, wherein step (1) is specifically: acquire the teaching materials, courseware, and professionally relevant web page content for the subject area; apply OCR or layout analysis to the teaching materials and courseware, and crawl and parse the web page content, to obtain the complete text content; then remove invalid characters, duplicate content, and ungrammatical text to obtain the education-domain text.

3. The question-generation-based method for optimizing education-domain knowledge base search according to claim 1, wherein step (2) is specifically: based on a natural-language-understanding pre-trained language model such as BERT, ELECTRA, or ALBERT, carry out the MLM and MPNet pre-training tasks with the education-domain text and run transfer learning experiments to obtain the semantic model.

4. The question-generation-based method for optimizing education-domain knowledge base search according to claim 1, wherein in step (3) the structured text information is triple-structured data.

5. The question-generation-based method for optimizing education-domain knowledge base search according to claim 4, wherein the fixed question-answer templates are designed from the relations between entities, or the attributes of entities, in the triple-structured data.

6. The question-generation-based method for optimizing education-domain knowledge base search according to claim 1, wherein step (3) further comprises: designing questions from the prerequisite and successor relations between knowledge points to obtain knowledge base question-answer pairs.

7. The question-generation-based method for optimizing education-domain knowledge base search according to claim 1, wherein step (4) comprises the following sub-steps:

(4.4) Build the question generation service from the best question generation model combined with a score threshold; specifically, the best question generation model outputs question texts with corresponding scores, which are screened against the score threshold: a question is retained if its score is greater than or equal to the threshold and removed otherwise; the score threshold is 0.7 to 0.9.

8. The question-generation-based method for optimizing education-domain knowledge base search according to claim 7, wherein in step (4.3) the generative pre-trained language model is a GPT, T5, or BART model.

9. A question-generation-based education-domain knowledge base search optimization device, characterized by comprising one or more processors configured to implement the question-generation-based method for optimizing education-domain knowledge base search according to any one of claims 1-8.

10. A computer-readable storage medium having a program stored thereon which, when executed by a processor, implements the question-generation-based method for optimizing education-domain knowledge base search according to any one of claims 1-8.