CN113268611B - Learning path optimization method based on deep knowledge tracking and reinforcement learning - Google Patents


Info

Publication number
CN113268611B
CN113268611B · Application CN202110706088.9A
Authority
CN
China
Prior art keywords
learning
knowledge point
state
knowledge
current
Prior art date
Legal status
Active
Application number
CN202110706088.9A
Other languages
Chinese (zh)
Other versions
CN113268611A (en)
Inventor
李建伟
李领康
于玉杰
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing Sikai Technology Co ltd
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing Sikai Technology Co ltd, Beijing University of Posts and Telecommunications filed Critical Beijing Sikai Technology Co ltd
Priority to CN202110706088.9A priority Critical patent/CN113268611B/en
Publication of CN113268611A publication Critical patent/CN113268611A/en
Application granted granted Critical
Publication of CN113268611B publication Critical patent/CN113268611B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance

Abstract

The invention discloses a learning path optimization method based on deep knowledge tracking and reinforcement learning, belonging to the field of adaptive learning. The method specifically comprises the following steps: for a given student, select all unlearned discrete knowledge points and root knowledge points that have no prerequisite knowledge points as the candidate knowledge points; one-hot encode the knowledge points already learned, input the encoding into the DKT model, and output a predicted mastery level for each candidate knowledge point. Then select the knowledge point K with the highest prediction result and recommend it to the student for learning; the learning process is realized with the intra-knowledge-point learning path optimization algorithm. After the current knowledge point K has been learned, judge whether it has successor knowledge points; if so, add the successors to the candidate knowledge point set and remove the current knowledge point K; otherwise, directly remove the current knowledge point K. Then select the next knowledge point for prediction and learning, until the candidate knowledge point set is empty. The invention can greatly improve recommendation precision and improve efficiency while obtaining the same learning effect.

Description

Learning path optimization method based on deep knowledge tracking and reinforcement learning
Technical Field
The invention belongs to the field of adaptive learning, and particularly relates to a learning path optimization method based on deep knowledge tracking and reinforcement learning.
Background
In the adaptive learning process, one of the key problems to be solved is recommending an optimal learning path for each student according to the student's mastery level of the knowledge points, so as to obtain the best learning efficiency and effect.
Learning path recommendation includes learning path recommendation between knowledge points and learning path recommendation within knowledge points.
For learning path recommendation between knowledge points, probabilistic graphical model techniques are currently the most commonly used. The typical implementation uses a Markov network, a probabilistic graphical model, to track a single learner's mastery of a single knowledge point, and then uses a Bayesian network to predict the mastery of not-yet-learned knowledge points from the learner's mastery of already-learned knowledge points, thereby providing personalized learning path recommendation and predicting the learner's weak knowledge points. Most adaptive learning systems, such as Knewton, Squirrel AI or vipsid, use this technique to implement personalized learning path recommendation. However, this approach requires labeling domain knowledge (e.g., difficulty, discrimination, the knowledge points to which items belong), cannot comprehensively analyze the learner's current overall knowledge state and past learning performance, and its recommendation performance is mediocre.
For learning path recommendation within a knowledge point, collaborative filtering algorithms and genetic algorithms are currently the most commonly used. Collaborative filtering is the most widely used algorithm in personalized recommender systems; its basic idea is to find the nearest resources or users through a similarity computation over the learners' ratings of learning resources, predict scores for unrated target learning resources from those neighbors, and recommend more accurate learning resources to the learner according to the prediction results.
For example, Knewton uses a collaborative filtering algorithm to quickly locate the information the learner needs from the learner's learning objective, cognitive structure and learning engagement, so as to present the optimal learning content for the learner's future study. The genetic algorithm is an evolutionary algorithm that extracts the user's preference attribute values from an initial population through a series of operations and recommends learning resources accordingly. Squirrel AI uses a genetic algorithm, on the basis of tracking and analyzing learning data, to recommend suitable learning resources for the learner in a global scope. However, both algorithms take satisfying user preferences as the recommendation goal rather than obtaining the best learning efficiency and effect; learning is a painful process, and students can stay motivated to learn only if they obtain a high learning return after putting in effort.
Knowledge tracking models a student's knowledge over time in order to accurately predict how well the student will master each knowledge point at the next moment. The Deep Knowledge Tracking algorithm (DKT) is a knowledge tracking model built on the LSTM (Long Short-Term Memory) network: a knowledge point mastery prediction model is trained with the user's historical learning data, and the student's knowledge point mastery level is predicted and estimated with the trained model.
Reinforcement learning mainly comprises four elements: agent, environment state, action and reward. The agent selects an action to apply to the environment; after receiving the action, the environment's state changes, and at the same time a reinforcement signal (reward or punishment) is generated and fed back to the agent. The goal of reinforcement learning is to obtain the largest accumulated reward. Using a reinforcement learning algorithm, the student's knowledge point mastery state is updated according to the student's actions of answering test questions correctly or incorrectly, and a reward mechanism is established according to the target mastery state, so that a recommendation strategy for test questions and knowledge point learning content is built and the student can efficiently reach the target mastery level of the knowledge points.
In the prior art, deep knowledge tracking has strong perception capability for sensing the student's current learning state but lacks decision-making capability, while reinforcement learning has decision-making capability but lacks perception of the state. By combining the perception capability of deep knowledge tracking with the decision-making capability of reinforcement learning, deep knowledge tracking perceives the student's learning state, reinforcement learning makes decisions according to the perceived learning state guided by the goal of the best learning efficiency and effect, and a learning path recommendation with optimal performance can be obtained.
Disclosure of Invention
In order to find the knowledge point learning order and the learning order of the content within each knowledge point that best suit each student, the invention provides a learning path optimization method based on deep knowledge tracking and reinforcement learning, which recommends the most suitable and efficient learning path to the student so that the student can master the knowledge points efficiently.
The learning path optimization method comprises a learning path optimization process between knowledge points and a learning path optimization process within knowledge points, and specifically comprises the following steps:
The learning path optimization process between knowledge points specifically comprises the following steps:
Step one, for a given student, select all unlearned discrete knowledge points and root knowledge points that have no prerequisite knowledge points as the candidate knowledge point set;
Prerequisite knowledge points are the knowledge points that must be learned before the current knowledge point can be learned.
Step two, one-hot encode the knowledge points the student has already learned according to the historical learning data of the knowledge points, input the encoding into the trained DKT model, and output the student's predicted mastery level for each candidate knowledge point.
Step three, sort the prediction results of the candidate knowledge points from high to low, select the knowledge point K with the highest prediction result, and recommend it to the student for learning;
The learning process is realized with the intra-knowledge-point learning path optimization algorithm, which is divided into two stages:
The first stage: train with the reinforcement learning Q-Learning algorithm to obtain the Q matrix relating the mastery states of knowledge point K to the question-answering actions;
Step 301, initialize the parameters: learning rate α = 0.1, discount factor γ = 0.9, training round counter EPISODES = 0;
step 302, initializing a Q matrix of a reinforcement learning algorithm to be 0, and defining Reward feedback given by an environment;
Each element of the Q matrix is Q(s, a), which represents the expected return of taking action a in state s at a given moment;
the Q matrix is represented as follows:
Figure BDA0003132069660000031
where $Q(s_j, a_{i0})$ denotes the expected return of answering test question i incorrectly (action $a_{i0}$) when the student's knowledge point mastery level is in state $s_j$.
$Q(s_j, a_{i1})$ denotes the expected return of answering test question i correctly (action $a_{i1}$) when the student's knowledge point mastery level is in state $s_j$.
The row index represents the set of the student's mastery states of the knowledge point: the student's current mastery level of the knowledge point is estimated by the DKT algorithm as a decimal in the range [0, 1], denoted by s.
The column index is the set of answering actions, i.e., the student answers a given test question correctly or incorrectly; $a_{i0}$ denotes the action of answering the i-th test question incorrectly, and $a_{i1}$ denotes the action of answering the i-th test question correctly.
Suppose knowledge point K and its prerequisite knowledge points have n test questions in total; the corresponding action set A then has size 2n and contains the correct-answer and wrong-answer action for every question. The initialized matrix Q has 1 row and 2n columns with all element values 0, and initially each state corresponds to the full set of actions.
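As a concrete illustration of this data structure, a minimal sketch is given below; the class and method names and the column convention (column 2(i-1) for the wrong-answer action $a_{i0}$, column 2(i-1)+1 for the right-answer action $a_{i1}$) are assumptions for illustration, not taken from the patent.

```python
import numpy as np

class QTable:
    """Sketch of the Q matrix described above: one row of 2n Q values per mastery state s."""

    def __init__(self, n_questions, initial_state=0.5):
        self.n = n_questions
        self.states = [initial_state]              # row index: mastery states estimated by DKT, values in [0, 1]
        self.Q = np.zeros((1, 2 * n_questions))    # initialized as 1 row x 2n columns of zeros

    def row(self, s):
        """Row index of state s; an unseen state gets a new all-zero row appended (cf. step 307)."""
        if s not in self.states:
            self.states.append(s)
            self.Q = np.vstack([self.Q, np.zeros(2 * self.n)])
        return self.states.index(s)
```

With n = 5 questions, `QTable(5)` starts as a single 1 × 10 zero row for the initial state 0.5 and grows a row whenever DKT predicts a mastery value that has not been seen before.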
The Reward feedback is defined as follows: starting from the current state $s_j$ and performing some action $a_{i0}$ or $a_{i1}$, if the resulting state reaches the target mastery level of the knowledge point, i.e., the target state value $s_t$, the reward value Reward is 1; otherwise Reward is 0.
The concrete formula is as follows:

$$Reward(s,a)=\begin{cases}1, & \tilde{s}\ge s_t\\ 0, & \text{otherwise}\end{cases}$$

where (s, a) are the current state and action, and $\tilde{s}$ is the state reached from the current state $s_j$ after taking the action;
step 303, judging whether the Q matrix is converged, if so, stopping the training process, and outputting the current Q matrix for recommending a learning path in the knowledge point of the second stage; otherwise, initializing state s =0.5, and entering step 304;
step 304, judging whether the current state s of the current round reaches a target state value, if so, ending the current round, and entering step 311; otherwise, step 305 is entered.
Step 305, judging whether the current round has unexecuted actions, if so, selecting an action a according to the current state of the Q matrix, entering step 306, and if not, ending the current round, entering step 311;
the selection method for selecting action a according to the current state of the Q matrix is as follows:
firstly, selecting Q values of all actions which are not executed in the current turn under the state s from a Q matrix as a candidate action Q value set; then, judging whether all the action values in the candidate action set are 0, if so, selecting the action according to a non-greedy mode, namely, randomly selecting the action; otherwise, selecting the action according to a greedy mode with 90% probability, namely selecting the action with the maximum Q value; the actions are selected in a non-greedy mode with a 10% probability, i.e., randomly.
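A sketch of this selection rule, assuming the QTable layout above (function and variable names are illustrative):

```python
import random
import numpy as np

def choose_action(q_row, tried_actions, epsilon=0.1):
    """Pick an action index for the current state.

    q_row         : 1-D array of Q values for the current state (length 2n)
    tried_actions : set of action indices already executed in this round
    Returns None when every action has been tried, which ends the round (step 305)."""
    candidates = [a for a in range(len(q_row)) if a not in tried_actions]
    if not candidates:
        return None
    candidate_q = np.array([q_row[a] for a in candidates])
    if np.all(candidate_q == 0) or random.random() < epsilon:
        # non-greedy: all candidate Q values are 0, or the 10% random-exploration case
        return random.choice(candidates)
    # greedy choice (the remaining 90%): the untried action with the largest Q value
    return candidates[int(np.argmax(candidate_q))]
```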
Step 306, after action a is finished, one-hot encode the student's historical learning data, input it into the trained DKT model, and predict the latest mastery state value of the current knowledge point K as the next state $\tilde{s}$.
Step 307, judge whether the state $\tilde{s}$ is already in the state set of the Q matrix; if so, go to step 308; otherwise, add it to the state set, append a row of data to the Q matrix, and initialize each element of that row to 0.
Step 308, substitute the state $\tilde{s}$ into the Reward feedback to obtain the corresponding reward value R, count the round, i.e., increment EPISODES by 1, and save the time record of the round and the reward value R in a database;
step 309, updating the Q matrix by using the current Q matrix and Reward feedback;
The update formula is as follows:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R + \gamma \max_{\tilde{a}} Q(\tilde{s},\tilde{a}) - Q(s,a)\right]$$

where $(\tilde{s}, \tilde{a})$ denote the next state and the corresponding action, and $\max_{\tilde{a}} Q(\tilde{s},\tilde{a})$ denotes the maximum Q value over all actions in state $\tilde{s}$;
Step 310, return to step 304, continue to judge whether the next state $\tilde{s}$ has reached the manually set target value, and continue updating the Q matrix;
step 311, determining whether the number of rounds EPISODES completed currently is greater than or equal to the target number of rounds M, if yes, entering step 312; otherwise, the Q matrix is not converged and step 303 is entered.
Step 312, count, by time, the probability P of obtaining reward value R = 1 in the latest M rounds, and judge whether P is greater than or equal to 90%; if so, the Q matrix is sufficiently close to convergence, the algorithm terminates, and the trained Q matrix is saved; otherwise the Q matrix has not converged, and step 303 is entered to continue the next round of training.
The probability P is obtained by counting the number of rounds among the M rounds whose reward value is 1 and dividing by M.
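Putting steps 301 to 312 together, a compact sketch of the first-stage training loop could look as follows. It reuses the `QTable`, `choose_action` and `reward` sketches above, and `dkt_predict_next_state(history)` is an assumed stand-in for the trained DKT model that returns the new mastery value in [0, 1] given the question record `history`; it is a sketch, not the patent's exact implementation.

```python
import numpy as np

def train_q_matrix(qtable, dkt_predict_next_state, s_target,
                   alpha=0.1, gamma=0.9, M=1000):
    """First-stage Q-Learning training sketch (steps 301-312)."""
    episode_rewards = []                                  # reward value R of each finished round
    while True:
        s, history, tried = 0.5, [], set()                # step 303: every round starts from state 0.5
        while s < s_target:                               # step 304: the round ends when the target is reached
            a = choose_action(qtable.Q[qtable.row(s)], tried)   # step 305
            if a is None:                                 # no unexecuted action left: end the round
                break
            tried.add(a)
            history.append(a)
            s_next = dkt_predict_next_state(history)      # step 306: DKT predicts the next state
            r = reward(s_next, s_target)                  # step 308: Reward feedback
            i, j = qtable.row(s), qtable.row(s_next)      # step 307: an unseen state gets a new row
            # step 309: Q-Learning update  Q(s,a) += alpha * (R + gamma * max Q(s',a') - Q(s,a))
            qtable.Q[i, a] += alpha * (r + gamma * qtable.Q[j].max() - qtable.Q[i, a])
            s = s_next                                    # step 310
        episode_rewards.append(reward(s, s_target))       # R of this round: 1 only if the target was reached
        # steps 311-312: stop once at least M rounds exist and >= 90% of the last M reached the target
        if len(episode_rewards) >= M and np.mean(episode_rewards[-M:]) >= 0.9:
            return qtable
```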
The second stage: use the trained Q matrix for learning path recommendation within the knowledge point.
Step 3.1: set the target mastery state $s_t$; the student's current knowledge point mastery level s is initialized to 0.5.
Step 3.2: and selecting the action a with the maximum Q value in the state s according to the Q matrix trained in the first stage, and recommending the test question corresponding to the action to the student for learning.
Step 3.3: after the student finishes learning, calculate the student's next state $\tilde{s}$ with the trained DKT algorithm model according to the student's answer records on the current knowledge point, and update the student's current state;
step 3.4: judging whether the updated current state reaches a target state value stIf yes, finishing the learning of the knowledge point K; otherwise, returning to the step 3.2 to continue the study of the test question content corresponding to the next action of the current knowledge point K.
Step four, after the current knowledge point K has been learned, judge whether it has successor knowledge points; if so, go to step five; otherwise remove the current knowledge point from the candidate knowledge point set and go to step six;
Step five, add the successor knowledge points of the current knowledge point K to the candidate knowledge point set, and remove the current knowledge point K from the candidate knowledge point set;
Step six, judge whether the candidate knowledge point set is empty; if so, terminate the loop; otherwise return to step two and continue learning the next knowledge point.
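The outer inter-knowledge-point loop of steps one to six can be sketched as follows. The interfaces are assumptions for illustration: `knowledge_graph[k]` holds the prerequisite and successor sets from a knowledge graph such as FIG. 3, `dkt_predict_mastery(history, k)` returns the DKT prediction for candidate k, and `learn_knowledge_point(k)` runs the intra-knowledge-point stage sketched above and returns the answer records it produced; the explicit prerequisite check when promoting successors is an added safeguard for knowledge points with more than one prerequisite.

```python
def optimize_learning_path(knowledge_graph, dkt_predict_mastery, learn_knowledge_point):
    """Sketch of the inter-knowledge-point learning path optimization (steps one to six)."""
    history = []                                           # the student's accumulated answer records
    learned = set()
    path = []
    # step one: unlearned discrete / root knowledge points that have no prerequisite knowledge points
    candidates = {k for k, v in knowledge_graph.items() if not v["prerequisites"]}
    while candidates:                                      # step six: stop when the candidate set is empty
        # steps two and three: predict the mastery of every candidate and pick the highest one
        k = max(candidates, key=lambda c: dkt_predict_mastery(history, c))
        history += learn_knowledge_point(k)                # intra-knowledge-point learning (stage two)
        learned.add(k)
        path.append(k)
        candidates.remove(k)                               # steps four and five: update the candidate set
        for succ in knowledge_graph[k]["successors"]:
            if succ not in learned and knowledge_graph[succ]["prerequisites"] <= learned:
                candidates.add(succ)
    return path
```

For the knowledge graph of FIG. 3 the loop would start from the candidates {k1, k4}, and k2, k5 and then k3 would become candidates as their prerequisites are completed.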
The invention has the advantages that:
1) The deep knowledge tracking algorithm not only analyzes the learner's current knowledge state but also considers all of the learner's previous learning performance, so the accuracy of predicting the learner's future learning performance is higher. Compared with learning path recommendation methods based on probabilistic graphical models, the personalized learning path recommendation based on the knowledge graph and deep knowledge tracking can greatly improve recommendation precision.
2) The method combines the perception capability of deep knowledge tracking and the decision-making capability of reinforcement learning, and realizes recommendation aiming at obtaining the best learning efficiency and effect.
Drawings
FIG. 1 is a schematic diagram of the knowledge point mastery level prediction model constructed by the deep knowledge tracking algorithm according to the present invention;
FIG. 2 is a flowchart of a learning path optimization method based on deep knowledge tracking and reinforcement learning according to the present invention;
FIG. 3 is a schematic view of the knowledge graph used in the present invention, annotated with the prerequisite and successor relations between knowledge points and the knowledge point contents;
FIG. 4 is a graph comparing learning efficiency of the present invention compared to collaborative filtering and genetic algorithms.
Detailed Description
The following describes embodiments of the present invention in detail and clearly with reference to the examples and the accompanying drawings.
The Deep Knowledge Tracking algorithm (DKT) is a knowledge tracking model built on the deep neural network LSTM (Long Short-Term Memory network). A knowledge point mastery level prediction model is trained with this model and the user's historical learning data to predict the student's mastery state of unknown knowledge points; the predicted mastery level of a knowledge point takes values in the range [0, 1]. As shown in FIG. 1, the knowledge point mastery level prediction model maps an input vector sequence $x_1, \ldots, x_T$ to an output vector sequence $y_1, \ldots, y_T$ by computing a series of "hidden" states $h_1, \ldots, h_T$, which can be seen as a continuous encoding of the relevant information from past observations that is useful for future predictions.
The specific formulas are as follows:

$$h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h) \quad (1)$$

$$y_t = \sigma(W_{yh} h_t + b_y) \quad (2)$$

where $W_{hx}$ is the input weight, $W_{hh}$ is the state weight, $b_h$ is the bias term of the hidden unit, $\sigma$ is the sigmoid function, $W_{yh}$ is the readout weight, and $b_y$ is the bias term of the readout unit;
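A minimal numpy rendering of equations (1) and (2) is given below as a sketch; the full DKT model described in the patent uses an LSTM cell, while these two formulas describe the simpler recurrent form implemented here, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dkt_forward(xs, W_hx, W_hh, b_h, W_yh, b_y):
    """Apply equations (1)-(2) to an input sequence x_1..x_T and return y_1..y_T.

    Each x_t is the 2n-dimensional one-hot answer vector; each y_t contains predicted
    mastery probabilities in [0, 1]."""
    h = np.zeros(W_hh.shape[0])                  # h_0
    ys = []
    for x in xs:
        h = np.tanh(W_hx @ x + W_hh @ h + b_h)   # equation (1)
        ys.append(sigmoid(W_yh @ h + b_y))       # equation (2)
    return ys
```

Here $W_{hx}$ has shape (hidden_size, 2n), $W_{hh}$ is (hidden_size, hidden_size), and $W_{yh}$ maps the hidden state back to one prediction per question or knowledge point.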
the method comprises the steps of pre-processing cleaned data according to a one-hot coding format by collecting sample data of a user and removing samples with sequences less than 50% of the number of knowledge point databases in each sample data;
Assume the knowledge point has n questions; the one-hot encoding length is then 2n, where the front n bits represent incorrectly answered test questions and the rear n bits represent correctly answered test questions. For example, when the user answers the i-th test question, if the answer is wrong, the position at one-hot index i-1 is 1 and the other positions are 0; the one-hot encoding is shown in Table 1:
TABLE 1
Index:    0  1  ...  i-1  ...  n-1  n  ...  2n-1
Encoding: 0  0  ...   1   ...   0   0  ...    0
If the user answers correctly, the position at one-hot index n+(i-1) is 1 and the remaining positions are 0; the one-hot encoding is shown in Table 2:
TABLE 2
Index:    0  1  ...  n-1  n  ...  n+i-1  ...  2n-1
Encoding: 0  0  ...   0   0  ...    1    ...    0
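A small helper matching Tables 1 and 2 (a sketch; the 1-based question index mirrors the tables, and the function name is illustrative):

```python
import numpy as np

def one_hot_answer(question_index, correct, n_questions):
    """One-hot encode a single answer record: index (i-1) is set for a wrong answer to
    question i, and index n+(i-1) for a correct answer, as in Tables 1 and 2."""
    x = np.zeros(2 * n_questions)
    offset = n_questions if correct else 0
    x[offset + (question_index - 1)] = 1.0
    return x
```

For n = 5, a wrong answer to question 3 sets position 2 and a correct answer to question 3 sets position 7.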
The preprocessed data are used as the input data of the DKT model to train and save the knowledge point prediction model; the user's question-answering history is then input into the prediction model to predict the user's mastery level of unknown knowledge points in real time.
Combining theories and technologies such as knowledge graphs, deep learning and reinforcement learning, the invention provides a learning path optimization method based on deep knowledge tracking and reinforcement learning that covers both inter-knowledge-point and intra-knowledge-point learning paths. The method comprises an inter-knowledge-point learning path optimization process and an intra-knowledge-point learning path optimization process; as shown in FIG. 2, the specific process is as follows:
Step one, for a given student, select all unlearned discrete knowledge points and root knowledge points that have no prerequisite knowledge points as the candidate knowledge point set;
Prerequisite knowledge points are the knowledge points that must be learned before the current knowledge point can be learned. As shown in FIG. 3, a knowledge graph is established that annotates the prerequisite and successor relations between knowledge points and the content structure of each knowledge point. In the figure, k1, k2, k3, k4 and k5 represent knowledge points; k1 is a prerequisite knowledge point of k2 and k5, and k2 is a prerequisite knowledge point of k3; t1 and t2 represent the test questions and learning contents belonging to knowledge point k1 (the test questions may be embedded in the learning contents). k4 is a discrete knowledge point with no prerequisite knowledge point and no successor knowledge point.
Step two, one-hot encode the knowledge points the student has already learned according to the historical learning data of the knowledge points, input the encoding into the trained DKT model, and output the student's predicted mastery level for each candidate knowledge point.
Step three, sort the prediction results of the candidate knowledge points from high to low, select the knowledge point K with the highest prediction result, and recommend it to the student for learning;
the learning process is realized by using a learning path optimization algorithm in the knowledge points, and assuming that a student starts learning, firstly, a state value s of a mastery degree target is settThe student's initial mastery level (i.e., initial state s) is initialized to 0.5. Then, based on a learning path recommendation algorithm among knowledge points, selecting a knowledge point K to start learning; then, recommending test questions according to the Q matrix trained by the learning path recommendation algorithm in the knowledge points until the state stAnd when the target state value is reached, the learning of the knowledge point K is finished, and the learning of the next knowledge point is started based on the learning path recommendation algorithm among the knowledge points until all the knowledge points are completely learned.
The method is specifically divided into two stages:
the first stage is as follows: training by using a Q-Learning algorithm for reinforcement Learning to obtain a grasping state of the knowledge point K and a Q matrix corresponding to the question-making action;
Step 301, initialize the parameters: learning rate α = 0.1, discount factor γ = 0.9, and the counter EPISODES = 0 of training rounds that have reached the end state;
step 302, initializing a Q matrix of a reinforcement learning algorithm to be 0, and defining Reward feedback given by an environment;
Each element of the Q matrix is Q(s, a) (s ∈ S, a ∈ A), which represents the expected return of taking action a in state s at a given moment. The environment gives a corresponding reward according to the action feedback of the agent, so the main idea of the algorithm is to build a Q-table indexed by State and Action to store the Q values, and then select the action that can obtain the maximum return according to the Q values;
The Q matrix is represented as follows:

$$Q=\begin{bmatrix} Q(s_1,a_{10}) & Q(s_1,a_{11}) & \cdots & Q(s_1,a_{n0}) & Q(s_1,a_{n1})\\ \vdots & \vdots & & \vdots & \vdots\\ Q(s_m,a_{10}) & Q(s_m,a_{11}) & \cdots & Q(s_m,a_{n0}) & Q(s_m,a_{n1}) \end{bmatrix}$$
where $Q(s_j, a_{i0})$ denotes the expected return of answering test question i incorrectly (action $a_{i0}$) when the student's knowledge point mastery level is in state $s_j$.
$Q(s_j, a_{i1})$ denotes the expected return of answering test question i correctly (action $a_{i1}$) when the student's knowledge point mastery level is in state $s_j$.
The row index represents the set of the student's mastery states of the knowledge point: the student's current mastery level of the knowledge point is estimated by the DKT algorithm as a decimal in the range [0, 1], denoted by s.
The column index is the set of answering actions, i.e., the student answers a given test question correctly or incorrectly; $a_{i0}$ denotes the action of answering the i-th test question incorrectly, and $a_{i1}$ denotes the action of answering the i-th test question correctly.
Suppose knowledge point K and its prerequisite knowledge points have n test questions in total; the corresponding action set A then has size 2n and contains the correct-answer and wrong-answer action for every question. The initialized matrix Q has 1 row and 2n columns with all element values 0, and initially each state corresponds to the full set of actions.
The Reward, i.e., the reward feedback given by the environment, is defined as follows: the target mastery level of the knowledge point, i.e., the target state value, is set to $s_t$; starting from the current state $s_j$ and performing some action $a_{i0}$ or $a_{i1}$, if the resulting state reaches the target mastery level of the knowledge point, i.e., the target state value $s_t$, the reward value Reward is 1, otherwise Reward is 0.
The concrete formula is as follows:

$$Reward(s,a)=\begin{cases}1, & \tilde{s}\ge s_t\\ 0, & \text{otherwise}\end{cases}$$

where (s, a) are the current state and action, $\tilde{s}$ is the state reached from the current state $s_j$ after taking the action, and $s_t$ is the target state value.
Step 303, judging whether the Q matrix is converged, if so, stopping the training process, and outputting the current Q matrix for recommending a learning path in the knowledge point of the second stage; otherwise, initializing the state s =0.5, and entering step 304;
Assuming that the student's initial state is the middle ability level of 0.5, the state set S is initialized with 0.5;
step 304, judging whether the current state s of the current round reaches a target state value, if so, ending the current round, and entering step 311; otherwise, step 305 is entered.
Step 305, judging whether the current round has unexecuted actions, if so, selecting an action a according to the current state of the Q matrix, entering step 306, and if not, ending the current round, entering step 311;
the selection method for selecting action a according to the current state of the Q matrix is as follows:
firstly, selecting Q values of all actions which are not executed in the current round under the state s from a Q matrix as a candidate action Q value set; then, judging whether all the action values in the candidate action set are 0 or not, if so, selecting the action according to a non-greedy mode, namely, randomly selecting the action; otherwise, selecting the action according to a greedy mode with a probability of 90 percent, namely selecting the action with the maximum Q value; the actions are selected in a non-greedy mode with a 10% probability, i.e., randomly.
Step 306, after action a is finished, one-hot encode the student's historical learning data, input it into the trained DKT model, and predict the latest mastery state value of the current knowledge point K as the next state $\tilde{s}$.
Step 307, judge whether the state $\tilde{s}$ is already in the state set of the Q matrix; if so, go to step 308; otherwise, add it to the state set, append a row of data to the Q matrix, and initialize each element of that row to 0.
Step 308, substitute the state $\tilde{s}$ into the Reward feedback to obtain the corresponding reward value R, count the round, i.e., increment EPISODES by 1, and save the time record of the round and the reward value R in a database;
If R = 1, the round count and the reward value are stored; if R = 0, it is judged whether the current round still has unexecuted actions, and if not, the round count and the reward value are stored.
The agent continually transitions from one state to another to explore until the target state is reached. Each exploration of the agent is called a round (episode); in each round, the agent starts from some initial state and explores until it reaches the target state, after which the round ends and the next round begins. Whenever a new state is found during exploration, it is added to the Q table.
Step 309, updating the Q matrix by using the current Q matrix and Reward feedback;
The update formula is as follows:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R + \gamma \max_{\tilde{a}} Q(\tilde{s},\tilde{a}) - Q(s,a)\right]$$

where $(\tilde{s}, \tilde{a})$ denote the next state and the corresponding action; $\max_{\tilde{a}} Q(\tilde{s},\tilde{a})$ denotes the maximum Q value over all actions in state $\tilde{s}$; and $\arg\max_{\tilde{a}} Q(\tilde{s},\tilde{a})$ is the action corresponding to the maximum Q value among all Q values of state $\tilde{s}$.
Step 310, return to step 304, continue to judge whether the next state $\tilde{s}$ has reached the manually set target value, and continue updating the Q matrix;
step 311, determining whether the number of rounds EPISODES currently completed is greater than or equal to the target number of rounds M, if yes, entering step 312; otherwise, the Q matrix is not converged and step 303 is entered.
Step 312, count, by time, the probability P of obtaining reward value R = 1 in the last M rounds, and judge whether P is greater than or equal to 90%; if so, the Q matrix is sufficiently close to convergence, the algorithm stops, and the trained Q matrix is saved; otherwise the Q matrix has not converged, and step 303 is entered to continue the next round of training.
The probability P is obtained by counting the number of rounds among the last M (e.g., M = 1000) whose reward value is 1 and then dividing by M.
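This stopping rule can be written as a small check over the stored round rewards (a sketch; names are illustrative):

```python
def q_matrix_converged(episode_rewards, M=1000, threshold=0.9):
    """Step 312 sketch: P is the fraction of the last M rounds whose reward value R is 1;
    training stops once at least M rounds have been completed and P >= 90%."""
    if len(episode_rewards) < M:
        return False
    P = sum(1 for r in episode_rewards[-M:] if r == 1) / M
    return P >= threshold
```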
The second stage: use the trained Q matrix for learning path recommendation within the knowledge point.
Step 3.1: set the target mastery state $s_t$; the student's current knowledge point mastery level s is initialized to 0.5.
Step 3.2: and selecting the action a with the maximum Q value in the state s according to the Q matrix trained in the first stage, and recommending the test question corresponding to the action to the student for learning.
Step 3.3: after the student finishes learning, calculate the student's next state $\tilde{s}$ with the trained DKT algorithm model according to the student's answer records on the current knowledge point, and update the student's current state;
Step 3.4: judge whether the updated current state has reached the target state value $s_t$; if so, the learning of knowledge point K is finished; otherwise, return to step 3.2 and continue learning the test question content corresponding to the next action of the current knowledge point K.
Finally, a personalized optimal question-answering path is recommended for the student.
Step four, after the current knowledge point K has been learned, judge whether it has successor knowledge points; if so, go to step five; otherwise remove the current knowledge point from the candidate knowledge point set and go to step six;
Step five, add the successor knowledge points of the current knowledge point K to the candidate knowledge point set, and remove the current knowledge point K from the candidate knowledge point set;
Step six, judge whether the candidate knowledge point set is empty; if so, terminate the loop; otherwise return to step two and continue learning the next knowledge point.
The inter-knowledge-point learning path recommendation method based on deep knowledge tracking determines the learning order of knowledge points by combining the knowledge graph constructed by domain experts with deep knowledge tracking. Knowledge tracking based on the deep neural network not only analyzes the learner's current knowledge state but also considers all of the learner's previous learning performance, so the prediction of the learner's future learning ability is more accurate and the recommendation accuracy of the learning path between knowledge points can be greatly improved.
Secondly, the intra-knowledge-point learning path recommendation method based on reinforcement learning and deep knowledge tracking uses deep knowledge tracking to perceive the student's current learning state and uses reinforcement learning to decide the next learning content to be learned according to the current learning state.
As shown in fig. 4, compared with the collaborative filtering and genetic algorithm method, the learning efficiency of the method of the present invention is improved by more than 20% under the condition of obtaining the same learning effect.

Claims (4)

1. A learning path optimization method based on deep knowledge tracking and reinforcement learning is characterized by comprising a learning path optimization process among knowledge points and a learning path optimization process in the knowledge points; the method comprises the following specific steps:
firstly, for a given student, selecting all unlearned discrete knowledge points and root knowledge points that have no prerequisite knowledge points as the candidate knowledge point set; performing one-hot encoding on each knowledge point the student has learned according to the historical learning data, inputting the encoding into a trained DKT model, and outputting the student's predicted mastery level for each candidate knowledge point;
then, sorting the prediction results of the candidate knowledge points from high to low, and selecting the knowledge point K with the highest prediction result to recommend to the student for learning; the learning process is realized with the intra-knowledge-point learning path optimization algorithm;
the learning path optimization algorithm in the knowledge points is specifically divided into two stages:
the first stage is as follows: training with the reinforcement learning Q-Learning algorithm to obtain the Q matrix relating the mastery states of knowledge point K to the question-answering actions;
step 301, initializing a learning rate alpha, a discount factor gamma and a counter EPISODES =0 of a training round;
step 302, initializing a Q matrix of a reinforcement learning algorithm to be 0, and defining Reward feedback given by an environment;
the Q matrix has 1 row and 2n columns, and initially each state corresponds to the full set of actions; n is the number of test questions under knowledge point K and its prerequisite knowledge points, and 2n is the size of the corresponding action set;
the Reward feedback is: if the state reached after executing a certain action from the current state attains the knowledge point target state value $s_t$, the reward value Reward is 1, otherwise Reward is 0; the concrete formula is as follows:

$$Reward(s,a)=\begin{cases}1, & \tilde{s}\ge s_t\\ 0, & \text{otherwise}\end{cases}$$

where (s, a) are the current state and action, and $\tilde{s}$ is the state reached from the current state after taking the action;
step 303, judging whether the Q matrix is converged, if so, stopping the training process, and outputting the current Q matrix for recommending a learning path in the knowledge point of the second stage; otherwise, initializing state s =0.5, and entering step 304;
step 304, judging whether the current state s of the current round reaches a target state value, if so, ending the current round, and entering step 311; otherwise, go to step 305;
step 305, judging whether the current round has unexecuted actions, if so, selecting an action a according to the current state of the Q matrix, and entering step 306, otherwise, ending the current round, and entering step 311;
step 306, after action a is finished, performing one-hot encoding according to the student's historical learning data, inputting it into the trained DKT model, and predicting the latest mastery state value of the current knowledge point K as the next state $\tilde{s}$;
step 307, judging whether the state $\tilde{s}$ is in the state set of the Q matrix; if so, entering step 308; otherwise, adding it to the state set, adding a row of data to the Q matrix, and initializing each element of that row to 0;
step 308, substituting the state $\tilde{s}$ into the Reward feedback to obtain the corresponding reward value R, counting the round, i.e., incrementing EPISODES by 1, and saving the time record of the round and the reward value R in a database;
step 309, updating the Q matrix by using the current Q matrix and Reward feedback;
the update formula is as follows:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R + \gamma \max_{\tilde{a}} Q(\tilde{s},\tilde{a}) - Q(s,a)\right]$$

where $(\tilde{s}, \tilde{a})$ denote the next state and the corresponding action, and $\max_{\tilde{a}} Q(\tilde{s},\tilde{a})$ denotes the maximum Q value over all actions in state $\tilde{s}$;
step 310, returning to step 304, continuing to judge whether the next state $\tilde{s}$ has reached the manually set target value, and continuing to update the Q matrix;
step 311, determining whether the number of rounds EPISODES completed currently is greater than or equal to the target number of rounds M, if yes, entering step 312; otherwise, the Q matrix is not converged, and the step 303 is entered;
step 312, counting the probability P that the reward value R is 1 obtained in the latest M rounds according to time, judging whether the probability P is more than or equal to 90%, if so, the Q matrix is close to convergence enough, the algorithm is terminated, and the Q matrix is stored after the Q matrix is trained; otherwise, the Q matrix is not converged, and step 303 is entered to continue the next round of training;
and a second stage: using the trained Q matrix for recommending the learning path in the knowledge point; the method specifically comprises the following steps:
step 3.1: setting the target mastery state $s_t$; the student's current knowledge point mastery level s is initialized to 0.5;
step 3.2: selecting an action a with the maximum Q value in a state s according to the Q matrix trained in the first stage, and recommending test questions corresponding to the action to students for learning;
step 3.3: after the student finishes learning, calculating the student's next state $\tilde{s}$ with the trained DKT algorithm model according to the student's answer records on the current knowledge point, and updating the student's current state;
step 3.4: judging whether the updated current state reaches the target state value $s_t$; if so, the learning of knowledge point K is finished; otherwise, returning to step 3.2 to continue learning the test question content corresponding to the next action of the current knowledge point K;
finally, after the current knowledge point K has been learned, judging whether it has successor knowledge points; if so, adding the successor knowledge points of the current knowledge point K to the candidate knowledge point set and removing the current knowledge point K from the candidate knowledge point set; otherwise, directly removing the current knowledge point K from the candidate knowledge point set; then judging whether the candidate knowledge point set is empty; if so, terminating the loop; otherwise continuing to learn the next knowledge point.
2. The learning path optimization method based on deep knowledge tracking and reinforcement learning of claim 1, wherein the Q matrix is represented as follows:
$$Q=\begin{bmatrix} Q(s_1,a_{10}) & Q(s_1,a_{11}) & \cdots & Q(s_1,a_{n0}) & Q(s_1,a_{n1})\\ \vdots & \vdots & & \vdots & \vdots\\ Q(s_m,a_{10}) & Q(s_m,a_{11}) & \cdots & Q(s_m,a_{n0}) & Q(s_m,a_{n1}) \end{bmatrix}$$

wherein $Q(s_j, a_{i0})$ denotes the expected return of answering test question i incorrectly (action $a_{i0}$) when the student's knowledge point mastery level is in state $s_j$;
$Q(s_j, a_{i1})$ denotes the expected return of answering test question i correctly (action $a_{i1}$) when the student's knowledge point mastery level is in state $s_j$;
the row index represents the set of the student's mastery states of the knowledge point: the student's current mastery level of the knowledge point is estimated by the DKT algorithm as a decimal in the range [0, 1], denoted by s;
the column index is the set of answering actions, i.e., the student answers a given test question correctly or incorrectly; $a_{i0}$ denotes the action of answering the i-th test question incorrectly, and $a_{i1}$ denotes the action of answering the i-th test question correctly.
3. The learning path optimization method based on deep knowledge tracking and reinforcement learning of claim 1, wherein in step 305, the selection method of the action a according to the current state of the Q matrix is as follows:
firstly, selecting Q values of all actions which are not executed in the current turn under the state s from a Q matrix as a candidate action Q value set; then, judging whether all the action values in the candidate action set are 0, if so, selecting the action according to a non-greedy mode, namely, randomly selecting the action; otherwise, selecting the action according to a greedy mode with a probability of 90 percent, namely selecting the action with the maximum Q value; the actions are selected in a non-greedy mode with a 10% probability, i.e., randomly.
4. The method as claimed in claim 1, wherein the probability P in step 312 is obtained by counting the number of rounds among the M rounds whose reward value is 1 and then dividing by M.
CN202110706088.9A 2021-06-24 2021-06-24 Learning path optimization method based on deep knowledge tracking and reinforcement learning Active CN113268611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110706088.9A CN113268611B (en) 2021-06-24 2021-06-24 Learning path optimization method based on deep knowledge tracking and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110706088.9A CN113268611B (en) 2021-06-24 2021-06-24 Learning path optimization method based on deep knowledge tracking and reinforcement learning

Publications (2)

Publication Number Publication Date
CN113268611A CN113268611A (en) 2021-08-17
CN113268611B true CN113268611B (en) 2022-11-01

Family

ID=77235833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110706088.9A Active CN113268611B (en) 2021-06-24 2021-06-24 Learning path optimization method based on deep knowledge tracking and reinforcement learning

Country Status (1)

Country Link
CN (1) CN113268611B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155124B (en) * 2022-02-07 2022-07-12 山东建筑大学 Test question resource recommendation method and system
CN114461786B (en) * 2022-04-13 2022-10-21 北京东大正保科技有限公司 Learning path generation method and system
CN115640410B (en) * 2022-12-06 2023-03-14 南京航空航天大学 Knowledge map multi-hop question-answering method based on reinforcement learning path reasoning
CN116796041B (en) * 2023-05-15 2024-04-02 华南师范大学 Learning path recommendation method, system, device and medium based on knowledge tracking
CN117672027B (en) * 2024-02-01 2024-04-30 青岛培诺教育科技股份有限公司 VR teaching method, device, equipment and medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8366449B2 (en) * 2008-08-13 2013-02-05 Chi Wang Method and system for knowledge diagnosis and tutoring
CN109948054A (en) * 2019-03-11 2019-06-28 北京航空航天大学 A kind of adaptive learning path planning system based on intensified learning
CN110264091B (en) * 2019-06-24 2023-10-20 中国科学技术大学 Student Cognitive Diagnosis Method
CN110378818B (en) * 2019-07-22 2022-03-11 广西大学 Personalized exercise recommendation method, system and medium based on difficulty
CN110516116B (en) * 2019-08-27 2022-12-02 华中师范大学 Multi-step hierarchical learner cognitive level mining method and system
CN110991645B (en) * 2019-11-18 2024-03-29 广东宜学通教育科技有限公司 Self-adaptive learning method, system and storage medium based on knowledge model
CN111461442B (en) * 2020-04-07 2023-08-29 中国科学技术大学 Knowledge tracking method and system based on federal learning

Also Published As

Publication number Publication date
CN113268611A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN113268611B (en) Learning path optimization method based on deep knowledge tracking and reinforcement learning
CN111460249B (en) Personalized learning resource recommendation method based on learner preference modeling
CN110569443B (en) Self-adaptive learning path planning system based on reinforcement learning
CN111582694B (en) Learning evaluation method and device
US20200372362A1 (en) Method of continual-learning of data sets and apparatus thereof
CN110363282B (en) Network node label active learning method and system based on graph convolution network
Liu et al. Automated feature selection: A reinforcement learning perspective
CN112800323B (en) Intelligent teaching system based on deep learning
CN112116092B (en) Interpretable knowledge level tracking method, system and storage medium
CN109840595B (en) Knowledge tracking method based on group learning behavior characteristics
CN112529155B (en) Dynamic knowledge mastering modeling method, modeling system, storage medium and processing terminal
CN111860989A (en) Ant colony algorithm optimization-based LSTM neural network short-time traffic flow prediction method
CN111191722B (en) Method and device for training prediction model through computer
CN114021722A (en) Attention knowledge tracking method integrating cognitive portrayal
CN110110899A (en) Prediction technique, adaptive learning method and the electronic equipment of acquisition of knowledge degree
CN115249072A (en) Reinforced learning path planning method based on generation of confrontation user model
CN115618101A (en) Streaming media content recommendation method and device based on negative feedback and electronic equipment
CN110263136B (en) Method and device for pushing object to user based on reinforcement learning model
CN113449182B (en) Knowledge information personalized recommendation method and system
CN115238169A (en) Mu course interpretable recommendation method, terminal device and storage medium
CN117035074B (en) Multi-modal knowledge generation method and device based on feedback reinforcement
CN116595245A (en) Hierarchical reinforcement learning-based lesson admiring course recommendation system method
CN115688863A (en) Depth knowledge tracking method based on residual connection and student near-condition feature fusion
CN115757464A (en) Intelligent materialized view query method based on deep reinforcement learning
CN115272015A (en) Course recommendation method and system based on abnormal picture and cooperative attenuation attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230717

Address after: 100876 Beijing city Haidian District Xitucheng Road No. 10

Patentee after: Beijing University of Posts and Telecommunications

Address before: 100876 Beijing city Haidian District Xitucheng Road No. 10

Patentee before: Beijing University of Posts and Telecommunications

Patentee before: Beijing Sikai Technology Co.,Ltd.