English exercise recommendation method based on large-scale knowledge point labeling results
Technical Field
The invention belongs to the field of personalized intelligent recommendation, and particularly relates to a knowledge point analysis and tracking method and a recommendation method based on knowledge point analysis and tracking results.
Background
Conventional recommendation techniques generally fall into two categories: behavior-based and content-based. Behavior-based recommendation is usually modeled with algorithms built around collaborative filtering. Such algorithms mine the behavior of user groups to find similar users or similar items, derive feature descriptions of the users or items from them, and recommend according to the similarity measure those features provide. This approach is well suited to interest-dominated scenarios such as e-commerce recommendation and news recommendation. Content-based recommendation first analyzes and describes the recommended items themselves, then builds a user portrait from the user's feedback on those items, and recommends suitable items accordingly. Because the result depends only on the user's own behavior, it is generally more interpretable; for the same reason, its diversity is generally poor. In teaching-related intelligent recommendation scenarios, students usually follow their class and study in a uniform way under a teacher's guidance, so collaborative filtering finds a large number of uninformative similar users and cannot perform effective personalized analysis. Moreover, personal ability improves gradually during learning and the knowledge being learned has prerequisite dependencies, so collaborative filtering cannot simply be applied.
Therefore, recommendation algorithms for teaching scenarios are generally studied from the content-analysis perspective. The core of content-based recommendation is the analysis of item content. In English teaching, the common analysis dimensions are relatively coarse-grained, such as exercise type, related vocabulary, and listening, speaking, reading and writing abilities. Some analysis methods, however, label knowledge points on exercises at large scale and produce fine-grained descriptions covering thousands of knowledge points. This growth in the number of knowledge points poses new challenges for the user portrait and association mining techniques used in recommendation.
In summary, existing knowledge-point-based recommendation uses a variety of feature analysis techniques. For example, some techniques model exercises by computing TF-IDF language features from the surface text of an English exercise, some model exercise types around the different question types and applicable scenarios, and some model exercise difficulty based on item response theory (IRT). After the knowledge point modeling is complete, the user portrait is described from the user's answer results, or the user's ability is characterized through IRT-based ability estimation, and English exercises are finally recommended by computing the relevance between the user features or ability portrait and the exercises. These approaches are similar to the present invention in that they are content-based recommendation techniques developed around the two dimensions of exercise modeling and user modeling.
In the prior art, recommendation methods that build the user portrait from features computed on the surface text of English exercises resemble interest-point recommendation for articles, and cannot effectively characterize or exploit the deeper knowledge in the exercises. Modeling methods based on measures such as IRT (item response theory) amount to difficulty-based recommendation, and characterizing the user by only one or a few listening, speaking, reading and writing abilities cannot provide sufficiently personalized and accurate recommendations. If the exercise analysis is simply replaced with the results of large-scale knowledge point analysis, the granularity of the exercise description changes, and the corresponding user portrait algorithms cannot transition naturally because necessary dimensions are lost or the data become sparse. A corresponding technical solution is therefore needed.
Disclosure of Invention
The invention aims to solve the problem of how to construct an effective recommendation system from large-scale knowledge point descriptions once exercises have been labeled with fine-grained knowledge points, thereby improving the recommendation effect.
In order to solve the problems, the technical scheme adopted by the invention is as follows:
a method for recommending English exercises based on large-scale knowledge point labeling results comprises the following steps:
step 1, marking fine-grained knowledge points for each exercise:
the labeling results may come from expert knowledge, a rule engine, or a natural language processing model; each exercise is finally labeled with several associated knowledge points, forming data of the corresponding type;
step 2, collecting and cleaning the user-exercise history results:
a system is built to collect, by means of logs or an API, the user-exercise history results over a past period of time, forming data of the corresponding type;
step 3, calculating the difficulty and the popularity of the knowledge points:
using the user-exercise history results, the difficulty of each knowledge point is computed and recorded as Difficulty, and the popularity of each knowledge point is computed and recorded as IDF, forming data of the corresponding type;
step 4, constructing a problem knowledge point model described by the knowledge points:
calculating the weight of each knowledge point contained in each exercise from the knowledge points labeled on the exercise and the difficulty and popularity of the knowledge points calculated in step 3, and recording the result as the exercise knowledge point model:
$$e_i = \{(T_1, w_1), (T_2, w_2), (T_3, w_3), \ldots\}$$
wherein,
$e_i$ is exercise $i$
$T_j$ is knowledge point $j$
$w_j$ is the weight of knowledge point $j$ in describing the exercise;
step 5, calculating the user portrait of the user knowledge point grasping condition:
modeling the user portrait from the user exercise history obtained in step 2 and the exercise knowledge point model obtained in step 4, wherein the modeling is computed with a Rocchio-based algorithm as follows:
wherein,
$I_r$ is the set of correctly answered exercises
$I_w$ is the set of incorrectly answered exercises
$t_0$ is the reference time, taken as the current time of calculation
$t_i$ is the time at which the record of the $i$-th exercise occurred
$\alpha$, $\beta$ and $\gamma$ are all hyperparameters
Specifically, a knowledge point portrait of the user is built from the knowledge point vectors of the correctly and incorrectly answered exercises in the exercise history; during processing, a decay is computed from the time at which each answer occurred and applied as a weight to the user portrait;
step 6, calculating the user portrait of the user knowledge point capability:
based on the user exercise history obtained in step 2 and the exercise labeling results obtained in step 1, selecting the set of users with sufficient exercise results, and modeling their degree of mastery based on item response theory:
wherein,
$p(\theta_i)$ is the probability that the user correctly answers an exercise on knowledge point $i$
$\theta_i$ is the user's degree of mastery of knowledge point $i$
$b_i$ is the difficulty of knowledge point $i$
The assumed distribution of each parameter is:
$\theta \sim N(0, 1)$
$\log(a) \sim N(0, 1)$
$b \sim N(0, 1)$
the ability parameters of the model are estimated with the EM algorithm, and the log-likelihood function to be maximized is as follows:
step 7, calculating a user knowledge point capability tracking model:
based on the user exercise history obtained in step 2 and the exercise labeling results obtained in step 1, selecting the users and exercise results with sufficient information, and training a Deep Knowledge Tracing (DKT) model:
$$h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = \sigma(W_{hy} h_t + b_y)$$
wherein,
$y_t$ is the user's degree of mastery of a given knowledge point at time $t$
$h_t$ is the hidden state of the model at time $t$
$x_t$ is the input, namely the one-hot encoded exercise result
The deep knowledge tracing model is an RNN model, and it is trained with a gradient descent algorithm;
step 8, calculating the matching degree between the user models and the exercises to obtain the coarse-ranked recommendation results:
calculating the matching degree between the user portrait information obtained in each of steps 5, 6 and 7 and the exercise model described by knowledge points, and obtaining the coarse ranking results of 3 branches:
the candidate exercise set retains only those exercises whose knowledge points are entirely contained in the user's knowledge points;
step 9, training an accuracy estimation model based on the user-exercise history results:
training a classification model as the accuracy estimation model, with features constructed from the user-exercise history results, the users' own attributes, the user models and the exercise knowledge point model; specifically, a GBDT model is selected, correct answer records in the history are taken as positive examples and wrong answer records as negative examples, the user-side and exercise-side information are used as features to train the model, and the model expression is as follows:
wherein,
$h_m(x)$ is a fixed-size decision tree used as a weak classifier; the GBDT model forms a regression (classification) tree ensemble with strong classification ability by combining several weak classifiers, and each tree is learned from the residuals of all previously trained trees;
step 10, estimating the accuracy of the coarse ranking results with the accuracy estimation model, and completing the fine ranking:
for the exercises in the coarse ranking results of the different branches in step 8, predicting with the accuracy estimation model trained in step 9 and completing the final ranking based on the predictions, wherein during ranking a hyperparameter keeps the average answer accuracy at about 90%, improving the user experience;
step 11, obtaining feedback on online users' answer results through streaming data collection, and updating the user knowledge point mastery models in real time:
after the user finishes an exercise, the collected result is fed back through streaming data and the user portrait results of steps 5, 6 and 7 are updated, forming a personalized and accurate exercise recommendation experience.
Compared with the prior art, the invention has the following implementation effects:
Compared with other technologies, the recommendation results obtained by this technology support accurate recommendation under finer-grained knowledge point descriptions, depict and track the knowledge mastered by the user more finely than common difficulty-matching or interest-point approaches, and provide personalized exercises better suited to the user. In addition, the model branch based on the variant Rocchio algorithm can depict knowledge point mastery from sparse data and respond quickly to changes in mastery, so it supports recommendation for cold (rarely seen) knowledge points and new users well.
The model branch based on item response theory provides an accurate portrait of knowledge point mastery for users with sufficient information, and supports recommendation for common knowledge points and existing users well. The DKT-based model fully mines the associations between knowledge points, depicts knowledge point mastery over both the short term and the long term, and supports the expansion and consolidation of knowledge points well. Finally, the fine-ranking model fuses the results of the various models, providing users with accurate recommendation service and a good user experience.
Drawings
FIG. 1 is a schematic diagram of the framework structure of the present invention.
Detailed Description
The present invention will be described with reference to specific examples.
A method for recommending English exercises based on large-scale knowledge point labeling results comprises the following steps:
step 1, marking fine-grained knowledge points for each exercise:
the labeling results may come from expert knowledge, a rule engine, a natural language processing model, or similar means; each exercise is finally labeled with several associated knowledge points, forming data of the following type:
| Exercise ID | Knowledge point |
|---|---|
| Exercise 1 | EN23g4v2 |
| Exercise 1 | EN7oer22 |
| Exercise 2 | EN23g4v2 |
Here EN23g4v2 and the like are knowledge point codes. Each code denotes a path in the knowledge point hierarchy, for example a vocabulary knowledge point such as "word/noun/adjective about a person/adjective/bill describing physiological characteristics of a person", or a grammar knowledge point such as "grammar/syntax/predicate/agreement/classification/subject-predicate agreement rule/usage/A + as well as + B: the predicate verb form agrees with part A".
Step 2, collecting and cleaning user and exercise history results:
a system is built to collect, by means of logs or an API, the user-exercise history results over a past period of time, forming data of the following type:
wherein each row of the data records that a student attempted an exercise at a certain time, together with the resulting score for that exercise.
Step 3, calculating difficulty and universality of knowledge points:
using the user-exercise history results, the difficulty of each knowledge point is computed and recorded as Difficulty, and the popularity of each knowledge point is computed and recorded as IDF, forming data of the following type:
| Knowledge point | Difficulty | IDF |
|---|---|---|
| EN23g4v2 | 0.854 | 0.5587 |
| EN7oer22 | 0.771 | 0.4478 |
Difficulty measures how hard a knowledge point is. It can be obtained by counting the accuracy on the exercises associated with the knowledge point, or by parameter estimation under item response theory using the historical behavior collected in step 2.
IDF (inverse document frequency) measures the popularity of a knowledge point. It can be obtained by counting how often the knowledge point occurs among the exercises labeled with knowledge points.
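The statistics of step 3 can be sketched as follows. This is only a minimal illustration under stated assumptions: Difficulty is taken as the error rate over all attempts on exercises carrying the knowledge point, and IDF as log(N / df), where N is the number of labeled exercises and df the number of exercises carrying the knowledge point. The function name `knowledge_point_stats` and the input shapes are hypothetical, and the text does not fix these exact formulas.

```python
import math
from collections import defaultdict


def knowledge_point_stats(labels, history):
    """Return {knowledge point: (difficulty, idf)}.

    labels:  dict exercise_id -> list of knowledge point codes (step 1 output).
    history: iterable of (user_id, exercise_id, correct), correct being 0 or 1
             (step 2 output, simplified).
    """
    attempts = defaultdict(int)
    errors = defaultdict(int)
    for _user, exercise, correct in history:
        for kp in labels.get(exercise, []):
            attempts[kp] += 1
            errors[kp] += 1 - correct

    # Document frequency: number of exercises labeled with each knowledge point.
    doc_freq = defaultdict(int)
    for kps in labels.values():
        for kp in set(kps):
            doc_freq[kp] += 1

    n_exercises = len(labels)
    stats = {}
    for kp, df in doc_freq.items():
        # Assumption: difficulty = error rate; 0.5 when there are no attempts.
        difficulty = errors[kp] / attempts[kp] if attempts[kp] else 0.5
        # Assumption: plain, unsmoothed IDF.
        idf = math.log(n_exercises / df)
        stats[kp] = (difficulty, idf)
    return stats
```

An IRT-based difficulty estimate, which the text also allows, would replace the error-rate line while leaving the rest unchanged.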
Step 4, constructing a problem knowledge point model described by the knowledge points:
calculating the weight of each knowledge point contained in each exercise from the knowledge points labeled on the exercise and the difficulty and popularity of the knowledge points calculated in step 3, and recording the result as the exercise knowledge point model:
$$e_i = \{(T_1, w_1), (T_2, w_2), (T_3, w_3), \ldots\}$$
wherein,
$e_i$ is exercise $i$
$T_j$ is knowledge point $j$
$w_j$ is the weight of knowledge point $j$ in describing the exercise.
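One possible way to derive the weights is sketched below. The text does not state how Difficulty and IDF are combined, so the product Difficulty × IDF, normalized to sum to 1 within an exercise, is purely an assumption made for illustration, and `exercise_model` is a hypothetical name.

```python
def exercise_model(knowledge_points, stats):
    """Return {knowledge point: weight} for one exercise.

    knowledge_points: codes labeled on the exercise.
    stats:            dict code -> (difficulty, idf) from step 3.
    """
    # Assumption: raw weight is difficulty * idf (a TF-IDF-like score).
    raw = {kp: stats[kp][0] * stats[kp][1] for kp in knowledge_points}
    total = sum(raw.values())
    if total == 0:
        # Fall back to uniform weights when every raw score is zero.
        return {kp: 1.0 / len(raw) for kp in raw}
    return {kp: w / total for kp, w in raw.items()}


# Values taken from the table in step 3.
stats = {"EN23g4v2": (0.854, 0.5587), "EN7oer22": (0.771, 0.4478)}
e1 = exercise_model(["EN23g4v2", "EN7oer22"], stats)
```

With the table values, the harder and rarer knowledge point EN23g4v2 receives the larger weight in exercise 1.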
Step 5, calculating the user portrait of the user knowledge point grasping condition:
modeling the user portrait from the user exercise history obtained in step 2 and the exercise knowledge point model obtained in step 4, wherein the modeling is computed with a Rocchio-based algorithm as follows:
wherein,
$I_r$ is the set of correctly answered exercises
$I_w$ is the set of incorrectly answered exercises
$t_0$ is the reference time, taken as the current time of calculation
$t_i$ is the time at which the record of the $i$-th exercise occurred
$\alpha$, $\beta$ and $\gamma$ are all hyperparameters
Specifically, a knowledge point portrait of the user is built from the knowledge point vectors of the correctly and incorrectly answered exercises in the exercise history; during processing, a decay is computed from the time at which each answer occurred and applied as a weight to the user portrait.
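A time-decayed Rocchio-style portrait might look like the sketch below. The exact formula is not reproduced in the text, so the roles assigned to the hyperparameters here are assumptions: `alpha` is used as an exponential decay rate on the age t_0 - t_i, `beta` weights the mean vector of correctly answered exercises, and `gamma` weights the mean vector of incorrectly answered ones. The name `rocchio_profile` is hypothetical.

```python
import math
from collections import defaultdict


def rocchio_profile(records, t0, alpha=0.01, beta=1.0, gamma=0.5):
    """Return {knowledge point: score} for one user.

    records: list of (exercise_vector, correct, t_i), where exercise_vector is
             the step-4 model {knowledge point: weight}, correct is a bool and
             t_i is the answer time in the same unit as t0 (e.g. days).
    """
    n_right = sum(1 for _, correct, _ in records if correct)
    n_wrong = len(records) - n_right
    profile = defaultdict(float)
    for vector, correct, t_i in records:
        # Assumption: older answers count less, decaying exponentially.
        decay = math.exp(-alpha * (t0 - t_i))
        # Correct answers pull the portrait up, wrong answers push it down.
        scale = beta / n_right if correct else -gamma / n_wrong
        for kp, weight in vector.items():
            profile[kp] += scale * decay * weight
    return dict(profile)
```

Because the portrait is a plain weighted sum over whatever records exist, it remains defined for users with only a handful of answers, which matches the sparse-data role the text assigns to this branch.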
Step 6, calculating the user portrait of the user knowledge point capability:
based on the user exercise history obtained in step 2 and the exercise labeling results obtained in step 1, selecting the set of users with sufficient exercise results, and modeling their degree of mastery based on item response theory:
wherein,
$p(\theta_i)$ is the probability that the user correctly answers an exercise on knowledge point $i$
$\theta_i$ is the user's degree of mastery of knowledge point $i$
$b_i$ is the difficulty of knowledge point $i$
The assumed distribution of each parameter is:
$\theta \sim N(0, 1)$
$\log(a) \sim N(0, 1)$
$b \sim N(0, 1)$
the ability parameters of the model are estimated with the EM algorithm, and the log-likelihood function to be maximized is as follows:
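The response model can be illustrated with the sketch below. Since the parameter list includes a distribution for log(a), the sketch assumes the standard two-parameter logistic (2PL) form, with `a` as a discrimination parameter; the text itself does not define `a`, so this reading is an assumption. The log-likelihood shown is the plain Bernoulli one for a single user with a single ability value; the priors and the EM procedure are not shown, and the function names are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct answer (assumed form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(responses, theta, items):
    """Log-likelihood of one user's answers.

    responses: list of (knowledge point, correct) pairs.
    theta:     the user's mastery (a single scalar in this simplification).
    items:     dict knowledge point -> (a, b).
    """
    total = 0.0
    for kp, correct in responses:
        a, b = items[kp]
        p = p_correct(theta, a, b)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total
```

When mastery equals difficulty the predicted probability is 0.5, and it rises toward 1 as mastery exceeds difficulty.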
Step 7, calculating a user knowledge point capability tracking model:
based on the user exercise history obtained in step 2 and the exercise labeling results obtained in step 1, selecting the users and exercise results with sufficient information, and training a Deep Knowledge Tracing (DKT) model:
$$h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = \sigma(W_{hy} h_t + b_y)$$
wherein,
$y_t$ is the user's degree of mastery of a given knowledge point at time $t$
$h_t$ is the hidden state of the model at time $t$
$x_t$ is the input, namely the one-hot encoded exercise result
The deep knowledge tracing model is an RNN model, and it is trained with a gradient descent algorithm.
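The forward pass of the two recurrences above can be written out directly, as in the sketch below. It covers inference only; training by gradient descent is omitted. The one-hot layout (a vector of length 2 × number of knowledge points, with the second half marking correct answers) is an assumption borrowed from common DKT practice, and the function names are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def one_hot(kp_index, correct, n_kp):
    """Encode (knowledge point, result) as a vector of length 2 * n_kp."""
    x = [0.0] * (2 * n_kp)
    x[kp_index + (n_kp if correct else 0)] = 1.0
    return x


def dkt_forward(inputs, W_hx, W_hh, b_h, W_hy, b_y):
    """Return the list of y_t vectors, one mastery estimate per knowledge point."""
    h = [0.0] * len(b_h)  # h_0 is the zero vector
    outputs = []
    for x in inputs:
        # h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h)
        pre = [p + q + b for p, q, b in zip(matvec(W_hx, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(v) for v in pre]
        # y_t = sigmoid(W_hy h_t + b_y)
        y = [1.0 / (1.0 + math.exp(-(v + b))) for v, b in zip(matvec(W_hy, h), b_y)]
        outputs.append(y)
    return outputs
```

In practice the weights would come from a trained model in a deep learning framework; plain lists are used here only to keep the sketch free of dependencies.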
Step 8, calculating the matching degree between the user models and the exercises to obtain the coarse-ranked recommendation results:
calculating the matching degree between the user portrait information obtained in each of steps 5, 6 and 7 and the exercise model described by knowledge points, and obtaining the coarse ranking results of 3 branches;
the candidate exercise set retains only those exercises whose knowledge points are entirely contained in the user's knowledge points.
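One branch of the coarse ranking could be sketched as follows. The matching function is not specified in the text, so cosine similarity between the user portrait and the exercise model is an assumption, as is treating the keys of the portrait as "the user's knowledge points" for the containment filter. The names `cosine` and `coarse_rank` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(kp, 0.0) for kp, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coarse_rank(profile, exercises, top_k=10):
    """Return up to top_k (exercise id, score) pairs, best match first.

    profile:   dict knowledge point -> score for one user (one branch).
    exercises: dict exercise id -> {knowledge point: weight} (step 4).
    """
    user_kps = set(profile)
    candidates = [
        (ex_id, cosine(profile, model))
        for ex_id, model in exercises.items()
        # Keep only exercises whose knowledge points the user fully covers.
        if set(model) <= user_kps
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_k]
```

Running the same function once per user model (steps 5, 6 and 7) would yield the three branch results that are passed on to the fine ranking.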
Step 9, training an accuracy estimation model based on the user-exercise history results:
training a classification model as the accuracy estimation model, with features constructed from the user-exercise history results, the users' own attributes, the user models and the exercise knowledge point model; specifically, a GBDT model is selected, correct answer records in the history are taken as positive examples and wrong answer records as negative examples, the user-side and exercise-side information are used as features to train the model, and the model expression is as follows:
wherein,
$h_m(x)$ is a fixed-size decision tree used as a weak classifier; the GBDT model forms a regression (classification) tree ensemble with strong classification ability by combining several weak classifiers, and each tree is learned from the residuals of all previously trained trees.
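For orientation, the usual additive form of a GBDT classifier is given below. This is the textbook formulation with log loss, offered as a sketch of what the expression above may look like rather than as the exact formula of the invention; the shrinkage factor $\nu$ and the initial score $F_0$ are symbols introduced here for illustration.

```latex
% Additive model built from M fixed-size trees h_m
F_M(x) = F_0 + \nu \sum_{m=1}^{M} h_m(x)

% Stage-wise construction: each tree is fitted to the residuals of the
% ensemble trained so far (the negative gradient of the log loss)
r_{i,m} = y_i - \sigma\bigl(F_{m-1}(x_i)\bigr), \qquad
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)

% Estimated probability that the user answers the exercise correctly
\hat{p}(x) = \sigma\bigl(F_M(x)\bigr) = \frac{1}{1 + e^{-F_M(x)}}
```

Here $x$ is the feature vector built from user-side and exercise-side information and $y_i \in \{0, 1\}$ marks a wrong or correct answer record.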
Step 10, estimating the accuracy of the coarse ranking results with the accuracy estimation model, and completing the fine ranking:
for the exercises in the coarse ranking results of the different branches in step 8, predicting with the accuracy estimation model trained in step 9, and completing the final ranking based on the predictions. During ranking, a hyperparameter keeps the average answer accuracy at about 90%, improving the user experience.
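The fine ranking can be sketched as below. How the hyperparameter controls the average accuracy is not detailed in the text; ordering candidates by the distance between their predicted accuracy and a target of 0.9 is one simple interpretation and is an assumption, as are the names `fine_rank`, `target` and `top_n`.

```python
def fine_rank(candidates, predict, target=0.9, top_n=5):
    """Return up to top_n (exercise id, predicted accuracy) pairs.

    candidates: exercise ids merged from the coarse ranking branches
                (duplicates across branches are allowed).
    predict:    callable exercise id -> predicted probability of a correct
                answer, e.g. a wrapper around the step-9 model.
    target:     desired average accuracy (hyperparameter, assumed 0.9).
    """
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    scored = [(ex_id, predict(ex_id)) for ex_id in unique]
    # Exercises whose predicted accuracy is closest to the target come first.
    scored.sort(key=lambda pair: abs(pair[1] - target))
    return scored[:top_n]
```

With this choice, exercises that are predicted to be far too easy or far too hard for the user fall to the bottom of the list.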
Step 11, obtaining feedback on online users' answer results through streaming data collection, and updating the user knowledge point mastery models in real time:
after the user finishes an exercise, the collected result is fed back through streaming data, and the user portrait results of steps 5, 6 and 7 are updated, forming a personalized and accurate exercise recommendation experience.
Compared with other technologies, the recommendation results obtained by this technology support accurate recommendation under fine-grained knowledge point descriptions, depict and track the knowledge mastered by the user more finely than common difficulty-matching or interest-point approaches, and provide personalized exercises better suited to the user.
The model branch based on the variant Rocchio algorithm can depict knowledge point mastery from sparse data and respond quickly to changes in mastery, so it supports recommendation for cold (rarely seen) knowledge points and new users well.
The model branch based on item response theory provides an accurate portrait of knowledge point mastery for users with sufficient information, and supports recommendation for common knowledge points and existing users well.
The DKT-based model fully mines the associations between knowledge points, depicts knowledge point mastery over both the short term and the long term, and supports the expansion and consolidation of knowledge points well.
Finally, the fine-ranking model fuses the results of the various models, providing users with accurate recommendation service and a good user experience.
The foregoing is a detailed description of the invention with reference to specific examples, which are not to be construed as limiting the practice of the invention. It will be apparent to those skilled in the art that various modifications and variations can be made without departing from the spirit of the invention.