CN113268611A - Learning path optimization method based on deep knowledge tracking and reinforcement learning - Google Patents
- Publication number
- CN113268611A (application CN202110706088.9A)
- Authority
- CN
- China
- Prior art keywords
- learning
- knowledge point
- state
- knowledge
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/367—Information retrieval; creation of semantic tools, e.g. ontology or thesauri; ontology
- G06F16/335—Information retrieval; querying; filtering based on additional data, e.g. user or group profiles
- G06N3/044—Computing arrangements based on biological models; neural networks; recurrent networks, e.g. Hopfield networks
- G06N3/08—Computing arrangements based on biological models; neural networks; learning methods
- G06Q10/04—Administration; forecasting or optimisation specially adapted for administrative or management purposes
- G06Q50/205—Education administration or guidance
Abstract
The invention discloses a learning path optimization method based on deep knowledge tracking and reinforcement learning, belonging to the field of adaptive learning. The method specifically comprises the following steps: for a given student, select all unlearned discrete knowledge points and all root knowledge points that have no prerequisite knowledge points as the candidate knowledge points; one-hot encode the student's historical learning records, input the encoding into the DKT model, and output a predicted mastery level for each candidate knowledge point. Then select the knowledge point K with the highest prediction and recommend it to the student for learning; within the knowledge point, the learning process is driven by an intra-knowledge-point learning path optimization algorithm. After the current knowledge point K has been mastered, check whether it has successor knowledge points; if so, add the successors to the candidate set and remove the current knowledge point K; otherwise remove the current knowledge point K directly. Then select the next knowledge point to predict and learn, until the candidate set is empty. The invention greatly improves recommendation accuracy and improves efficiency while achieving the same learning effect.
Description
Technical Field
The invention belongs to the field of adaptive learning, and particularly relates to a learning path optimization method based on deep knowledge tracking and reinforcement learning.
Background
In the adaptive learning process, a key problem is to recommend an optimal learning path for each student according to the student's mastery of knowledge points, so as to achieve the best learning efficiency and effect.
Learning path recommendation includes recommendation of learning paths between knowledge points and recommendation of learning paths within a knowledge point.
For learning path recommendation between knowledge points, probabilistic graphical model techniques are currently the most common. A Markov network is used to track a learner's mastery of a single knowledge point; a Bayesian network is then used to predict the learner's mastery of not-yet-learned knowledge points from the mastery of already-learned ones, so that personalized learning path recommendations can be given and the learner's weak knowledge points predicted. Most adaptive learning systems, such as Knewton, Squirrel AI and VIPKID, use this technology for personalized learning path recommendation. However, this approach requires labeled domain knowledge (e.g., question difficulty, discrimination, and associated knowledge points), cannot comprehensively analyze the learner's current overall knowledge state and past learning performance, and its recommendation performance is mediocre.
For learning path recommendation within a knowledge point, collaborative filtering and genetic algorithms are currently the most common. Collaborative filtering is the most widely used algorithm in personalized recommender systems; its basic idea is to find the nearest resources or users via a similarity measure over the learner's ratings of learning resources, predict ratings for unrated target resources from those neighbors, and recommend more suitable learning resources according to the prediction.
For example, Knewton uses collaborative filtering to quickly locate the information a learner needs, based on the learner's learning goals, cognitive structure, and degree of learning engagement, and presents the most suitable content for the learner's future study. The genetic algorithm is an evolutionary algorithm that extracts a user's preference attribute values from an initial population through a series of operations and recommends learning resources accordingly. Squirrel AI uses a genetic algorithm on top of tracked and analyzed learning data to recommend suitable learning resources for the learner globally. However, both algorithms aim to satisfy user preferences rather than to achieve the best learning efficiency and effect; learning is a demanding process, and students stay motivated only when their effort yields a high learning return.
Knowledge tracking models a student's knowledge over time in order to accurately predict the student's mastery of knowledge points at the next moment. The deep knowledge tracking algorithm (DKT) is a knowledge tracking model built on an LSTM (Long Short-Term Memory network): a knowledge point mastery prediction model is trained from the user's historical learning data, and the trained model is used to predict and estimate the student's mastery level of each knowledge point.
Reinforcement learning has four main elements: agent, environment state, action, and reward. The agent selects an action to apply to the environment; the environment's state changes after receiving the action and simultaneously produces a reinforcement signal (reward or punishment) that is fed back to the agent. The goal of reinforcement learning is to maximize the accumulated reward. Here, a reinforcement learning algorithm updates the student's knowledge-point mastery state according to the student's actions of answering test questions correctly or incorrectly, and a reward mechanism is built around the target mastery state, yielding a recommendation strategy for test questions and learning content that lets the student reach the target mastery level of each knowledge point efficiently.
In the prior art, deep knowledge tracking has strong perception capability for the student's current learning state but limited decision-making capability, while reinforcement learning has decision-making capability but cannot perceive state. Combining the perception capability of deep knowledge tracking with the decision-making capability of reinforcement learning, so that deep knowledge tracking perceives the student's learning state and reinforcement learning then makes decisions from that perceived state guided by the goal of the best learning efficiency and effect, yields the best-performing learning path recommendation.
Disclosure of Invention
In order to find the sequence of knowledge points, and the sequence of learning content within each knowledge point, that best suits each student, the invention provides a learning path optimization method based on deep knowledge tracking and reinforcement learning. It recommends to each student the learning path best suited to efficient study, so that the student can master the knowledge points efficiently.
The learning path optimization method comprises a learning path optimization process between knowledge points and a learning path optimization process within a knowledge point, and specifically includes the following steps:
The learning path optimization process between knowledge points specifically comprises:
Step one: for a given student, select all unlearned discrete knowledge points, together with all root knowledge points that have no prerequisite knowledge points, as the candidate knowledge point set;
A prerequisite knowledge point is a knowledge point that must be learned before the current knowledge point can be learned.
Step two: one-hot encode the knowledge points the student has already learned, based on the student's historical learning data, feed the encoding into the trained DKT model, and output the student's predicted mastery level for each candidate knowledge point.
Step three: sort the candidate knowledge points by predicted mastery from high to low, select the knowledge point K with the highest prediction, and recommend it to the student for learning;
The learning process within knowledge point K is realized with the intra-knowledge-point learning path optimization algorithm, which is divided into two stages:
The first stage: use the reinforcement learning Q-Learning algorithm to train a Q matrix over the mastery states of knowledge point K and the corresponding question-answering actions;
Step 301, initialize the parameters: learning rate α = 0.1, discount factor γ = 0.9, training episode counter EPISODES = 0;
step 302, initializing a Q matrix of a reinforcement learning algorithm to be 0, and defining Reward feedback given by an environment;
The element of the Q matrix is Q(s, a), the expected return of taking action a in state s at a given moment;
The Q matrix is represented as follows: each row corresponds to a mastery state s_j, and its entries are Q(s_j, a_i0) and Q(s_j, a_i1) for i = 1, ..., n.
Q(s_j, a_i0) is the expected return of answering test question i incorrectly when the student's knowledge point mastery level is in state s_j.
Q(s_j, a_i1) is the expected return of answering test question i correctly when the student's knowledge point mastery level is in state s_j.
The row index represents the set of mastery states of the knowledge point, i.e., the student's current mastery level of the knowledge point estimated by the DKT algorithm, a decimal in the range [0, 1], denoted s.
The column index is the set of answer actions, i.e., the student answering a given test question correctly or incorrectly; a_i0 denotes the action of answering the i-th test question incorrectly, and a_i1 denotes answering the i-th test question correctly.
The knowledge point K and its prerequisite knowledge points together have n test questions, so the corresponding action set A has size 2n (correct and incorrect answers to each question). The initialized matrix Q has 1 row and 2n columns with all elements 0, and initially every action is available in every state.
Reward feedback Reward is defined as follows: if the state reached after performing an action a_i0 or a_i1 from the current state s_j attains the target mastery level of the knowledge point, i.e., reaches the target state value s_t, then Reward = 1; otherwise Reward = 0.
The concrete formula is:

    Reward(s_j, a) = 1, if s' ≥ s_t;   Reward(s_j, a) = 0, otherwise,

where s' is the state reached from s_j after taking action a, and s_t is the target state value.
step 303, judging whether the Q matrix is converged, if so, stopping the training process, and outputting the current Q matrix for recommending a learning path in the knowledge point of the second stage; otherwise, the initialization state s is 0.5, and step 304 is entered;
step 304, judging whether the current state s of the current round reaches a target state value, if so, ending the current round, and entering step 311; otherwise, step 305 is entered.
305, judging whether the current round has unexecuted actions, if so, selecting an action a according to the current state of the Q matrix, entering step 306, and if not, ending the current round, entering step 311;
the selection method for selecting action a according to the current state of the Q matrix is as follows:
firstly, selecting Q values of all actions which are not executed in the current turn under the state s from a Q matrix as a candidate action Q value set; then, judging whether all the action values in the candidate action set are 0, if so, selecting the action according to a non-greedy mode, namely, randomly selecting the action; otherwise, selecting the action according to a greedy mode with a probability of 90 percent, namely selecting the action with the maximum Q value; the actions are selected in a non-greedy mode with a 10% probability, i.e., randomly.
Step 306, after action a is executed, one-hot encode the student's historical learning data, input it into the trained DKT model, and take the predicted latest mastery state value of the current knowledge point K as the next state s'.
Step 307, check whether state s' is already in the state set of the Q matrix; if so, go to step 308; otherwise add it to the state set, append a new row of data to the Q matrix, and initialize each of its elements to 0.
Step 308, substitute state s' into the Reward feedback definition to obtain the corresponding reward value R, increment the episode counter EPISODES by 1, and save the episode's time record and reward value R in the database;
Step 309, update the Q matrix using the current Q matrix and the Reward feedback;
The update formula is:

    Q(s, a) ← Q(s, a) + α · [ R + γ · max_a' Q(s', a') - Q(s, a) ]

where (s', a') denote the next state and the actions available in it, and max_a' Q(s', a') is the maximum Q value over all actions in state s';
Step 310, return to step 304 and continue to check whether the next state s' has reached the manually set target value, updating the Q matrix continuously;
step 311, determining whether the number of rounds EPISODES completed currently is greater than or equal to the target number of rounds M, if yes, entering step 312; otherwise, the Q matrix is not converged and step 303 is entered.
Step 312, counting the probability P that the reward value R is 1 obtained in the latest M rounds according to time, judging whether the probability P is more than or equal to 90%, if so, the Q matrix is close to convergence enough, the algorithm is terminated, and the Q matrix is stored after the Q matrix is trained; otherwise, the Q matrix is not converged, and step 303 is entered to continue the next training round.
The probability P is obtained by counting the number of awards 1 obtained in M rounds and dividing by M.
The second stage: use the trained Q matrix for learning path recommendation within the knowledge point.
Step 3.1: set the target mastery state s_t and initialize the student's current mastery level of the knowledge point to s = 0.5.
Step 3.2: according to the Q matrix trained in the first stage, select the action a with the largest Q value in state s and recommend the test question corresponding to that action to the student.
Step 3.3: after the student finishes that question, use the trained DKT model and the student's answer records on the current knowledge point to compute the student's next state s', and update the student's current state;
Step 3.4: check whether the updated current state has reached the target state value s_t; if so, the learning of knowledge point K is finished; otherwise return to step 3.2 and continue with the test question content corresponding to the next action for the current knowledge point K.
Step four, after the current knowledge point K passes the learning, judging whether the knowledge point has a subsequent knowledge point, if so, entering the step five; otherwise, moving out the current knowledge point from the knowledge point set to be selected, and entering the step six;
adding subsequent knowledge points of the current knowledge point K into a to-be-selected knowledge point set, and moving the current knowledge point K out of the to-be-selected knowledge point set;
step six, judging whether the knowledge point set to be selected is empty, and if so, terminating the circulation; otherwise, returning to the step two, and continuing to learn the next knowledge point.
The invention has the advantages that:
1) The deep knowledge tracking algorithm not only analyzes the learner's current knowledge state but also takes all of the learner's previous learning performance into account, so its prediction of the learner's future performance is more accurate. Compared with learning path recommendation based on probabilistic graphical models, personalized learning path recommendation based on the knowledge graph and deep knowledge tracking greatly improves recommendation accuracy.
2) The method combines the perception capability of deep knowledge tracking with the decision-making capability of reinforcement learning, so that recommendation is made with the goal of achieving the best learning efficiency and effect.
Drawings
FIG. 1 is a schematic diagram of the knowledge point mastery level prediction model built with the deep knowledge tracking algorithm in the present invention;
FIG. 2 is a flowchart of the learning path optimization method based on deep knowledge tracking and reinforcement learning of the present invention;
FIG. 3 is a schematic view of the knowledge graph used in the present invention, annotated with the prerequisite/successor relations between knowledge points and with the knowledge point content;
FIG. 4 is a graph comparing the learning efficiency of the present invention with collaborative filtering and genetic algorithms.
Detailed Description
The following describes embodiments of the present invention in detail and clearly with reference to the examples and the accompanying drawings.
The deep knowledge tracking algorithm (DKT) is a knowledge tracking model built on an LSTM (Long Short-Term Memory network). Using this model and the user's historical learning data, a knowledge point mastery level prediction model is trained to predict the student's mastery state of not-yet-learned knowledge points; the predicted mastery level lies in the range [0, 1]. As shown in FIG. 1, the prediction model maps an input vector sequence x_1 ... x_T to an output vector sequence y_1 ... y_T by computing a sequence of "hidden" states h_1 ... h_T, which can be viewed as a running encoding of the information from past observations that is useful for future prediction.
The specific formulas are:
    h_t = tanh(W_hx · x_t + W_hh · h_{t-1} + b_h)      (1)
    y_t = σ(W_yh · h_t + b_y)                          (2)
where W_hx is the input-to-state weight matrix; W_hh is the state-to-state weight matrix; b_h is the bias term of the hidden unit; σ is the sigmoid function; W_yh is the readout weight matrix; and b_y is the bias term of the readout unit.
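Equations (1) and (2) describe a simple recurrent cell; the patent's implementation uses an LSTM, but the recurrence as written can be run directly. The following NumPy sketch, with hypothetical dimensions and randomly initialized weights standing in for trained parameters, applies the recurrence to a short one-hot answer sequence:

```python
import numpy as np

def dkt_forward(x_seq, W_hx, W_hh, b_h, W_yh, b_y):
    """Run equations (1)-(2) over a sequence of one-hot answer vectors
    x_1..x_T and return the predicted mastery vectors y_1..y_T
    (one probability per test question of the knowledge point)."""
    h = np.zeros(W_hh.shape[0])                      # h_0
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)     # equation (1)
        y = 1.0 / (1.0 + np.exp(-(W_yh @ h + b_y)))  # equation (2), sigmoid readout
        outputs.append(y)
    return np.stack(outputs)

# toy setup: n = 10 questions -> 2n-dimensional one-hot inputs, 32 hidden units
n, hidden = 10, 32
rng = np.random.default_rng(0)
W_hx = rng.normal(scale=0.1, size=(hidden, 2 * n))
W_hh = rng.normal(scale=0.1, size=(hidden, hidden))
W_yh = rng.normal(scale=0.1, size=(n, hidden))
b_h, b_y = np.zeros(hidden), np.zeros(n)

x_seq = np.zeros((3, 2 * n))
x_seq[0, 2] = 1        # question 3 answered wrong
x_seq[1, n + 2] = 1    # question 3 answered right
x_seq[2, 5] = 1        # question 6 answered wrong
print(dkt_forward(x_seq, W_hx, W_hh, b_h, W_yh, b_y)[-1])  # mastery estimate after 3 answers
```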
the method comprises the steps of pre-processing cleaned data according to a one-hot coding format by collecting sample data of a user and removing samples with sequences less than 50% of the number of knowledge point databases in each sample data;
Assume the knowledge point has n questions; the one-hot code then has length 2n, where the first n bits represent incorrectly answered test questions and the last n bits represent correctly answered test questions. For example, if the user answers the i-th test question incorrectly, position i-1 of the one-hot code is 1 and all other positions are 0, as shown in Table 1:
TABLE 1

| Index | 0 | 1 | ... | i-1 | ... | n-1 | n | ... | 2n-1 |
|---|---|---|---|---|---|---|---|---|---|
| Value | 0 | 0 | ... | 1 | ... | 0 | 0 | ... | 0 |

If the user answers the i-th test question correctly, position n+(i-1) of the one-hot code is 1 and all other positions are 0, as shown in Table 2:
TABLE 2

| Index | 0 | 1 | ... | n-1 | n | ... | n+i-1 | ... | 2n-1 |
|---|---|---|---|---|---|---|---|---|---|
| Value | 0 | 0 | ... | 0 | 0 | ... | 1 | ... | 0 |
Using the preprocessed data as input data of a DKT model, training and storing a knowledge point prediction model; and inputting the history record of the user questions into the prediction model to predict the unknown knowledge point mastering level of the user in real time.
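As a small illustration of this encoding (the helper name is ours, not the patent's), the following sketch builds the 2n-dimensional one-hot vector of Tables 1 and 2:

```python
def encode_answer(question_idx, correct, n):
    """One-hot encode one answer record: a wrong answer to question i
    (0-based index question_idx = i-1) sets position i-1, a correct answer
    sets position n + (i-1); the vector length is 2n."""
    x = [0] * (2 * n)
    x[question_idx + (n if correct else 0)] = 1
    return x

# example with n = 5: the 3rd question (index 2) answered correctly
print(encode_answer(2, correct=True, n=5))   # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
```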
Combining knowledge graphs, deep learning, and reinforcement learning, the invention provides a learning path optimization method based on deep knowledge tracking and reinforcement learning that covers both inter-knowledge-point and intra-knowledge-point learning paths. It comprises an inter-knowledge-point learning path optimization process and an intra-knowledge-point learning path optimization process; as shown in FIG. 2, the specific flow is as follows:
Step one: for a given student, select all unlearned discrete knowledge points, together with all root knowledge points that have no prerequisite knowledge points, as the candidate knowledge point set;
A prerequisite knowledge point is a knowledge point that must be learned before the current knowledge point can be learned. As shown in FIG. 3, a knowledge graph is built and annotated with the prerequisite/successor relations between knowledge points and with the knowledge point content structure. In the figure, k1, k2, k3, k4 and k5 are knowledge points: k1 is a prerequisite of k2 and k5, k2 is a prerequisite of k3, and t1, t2 are the test questions and learning content belonging to knowledge point k1 (test questions may be embedded in the learning content). k4 is a discrete knowledge point with neither prerequisite nor successor knowledge points.
Step two: one-hot encode the knowledge points the student has already learned, based on the student's historical learning data, feed the encoding into the trained DKT model, and output the student's predicted mastery level for each candidate knowledge point.
Step three: sort the candidate knowledge points by predicted mastery from high to low, select the knowledge point K with the highest prediction, and recommend it to the student for learning;
The learning process is realized with the intra-knowledge-point learning path optimization algorithm. Suppose a student starts learning: first the target mastery state value s_t is set and the student's initial mastery level (i.e., initial state s) is initialized to 0.5. Then a knowledge point K is selected to start learning based on the inter-knowledge-point learning path recommendation algorithm; next, test questions are recommended according to the Q matrix trained by the intra-knowledge-point learning path recommendation algorithm until the state reaches the target state value s_t, at which point the learning of knowledge point K is finished. Learning of the next knowledge point then starts, again based on the inter-knowledge-point algorithm, until all knowledge points have been learned.
The method is specifically divided into two stages:
the first stage is as follows: training by using a Q-Learning algorithm for reinforcement Learning to obtain a grasping state of the knowledge point K and a Q matrix corresponding to the question-making action;
Step 301, initialize the parameters: learning rate α = 0.1, discount factor γ = 0.9, and the counter EPISODES of training episodes that reach the terminal state initialized to 0;
step 302, initializing a Q matrix of a reinforcement learning algorithm to be 0, and defining Reward feedback given by an environment;
The element of the Q matrix is Q(s, a), the expected return of taking action a (a ∈ A) in state s (s ∈ S) at a given moment. The agent receives a corresponding reward from the environment as feedback on its actions, so the main idea of the algorithm is to build a Q-table indexed by state and action to store the Q values, and then choose the action that yields the largest return according to these values;
The Q matrix is represented as follows: each row corresponds to a mastery state s_j, and its entries are Q(s_j, a_i0) and Q(s_j, a_i1) for i = 1, ..., n.
Q(s_j, a_i0) is the expected return of answering test question i incorrectly when the student's knowledge point mastery level is in state s_j.
Q(s_j, a_i1) is the expected return of answering test question i correctly when the student's knowledge point mastery level is in state s_j.
The row index represents the set of mastery states of the knowledge point, i.e., the student's current mastery level of the knowledge point estimated by the DKT algorithm, a decimal in the range [0, 1], denoted s.
The column index is the set of answer actions, i.e., the student answering a given test question correctly or incorrectly; a_i0 denotes answering the i-th test question incorrectly, and a_i1 denotes answering the i-th test question correctly.
The knowledge point K and its prerequisite knowledge points together have n test questions, so the corresponding action set A has size 2n (correct and incorrect answers to each question). The initialized matrix Q has 1 row and 2n columns with all elements 0, and initially every action is available in every state.
Reward, the feedback given by the environment, is defined as follows: set the target mastery level of the knowledge point, i.e., the target state value, to s_t; if the state reached after performing action a_i0 or a_i1 from the current state s_j reaches the target state value s_t, then Reward = 1, otherwise Reward = 0.
The concrete formula is:

    Reward(s_j, a) = 1, if s' ≥ s_t;   Reward(s_j, a) = 0, otherwise,

where (s_j, a) are the current state and action, s' is the state reached from s_j after taking the action, and s_t is the target state value.
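To make the Q matrix and the Reward definition concrete, here is a minimal Python sketch; the dictionary representation of the Q matrix, the number of questions, and all variable names are illustrative assumptions, while α = 0.1, γ = 0.9 and the initial state 0.5 come from the text above:

```python
ALPHA, GAMMA = 0.1, 0.9
N_QUESTIONS = 5                  # n test questions for K and its prerequisite knowledge points
# action index 2*i is a_i0 (answer question i wrong), 2*i + 1 is a_i1 (answer it right)
N_ACTIONS = 2 * N_QUESTIONS
# Q is kept as a dict keyed by the DKT-estimated mastery state (a decimal in [0, 1]);
# each row holds one Q value per action, so Q starts as one row of 2n zeros.
Q = {0.5: [0.0] * N_ACTIONS}

def reward(next_state, target_state):
    """Reward feedback of step 302: 1 when the post-action mastery reaches s_t, else 0."""
    return 1 if next_state >= target_state else 0
```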
Step 303, judging whether the Q matrix is converged, if so, stopping the training process, and outputting the current Q matrix for recommending a learning path in the knowledge point of the second stage; otherwise, the initialization state s is 0.5, and step 304 is entered;
Assuming the student's initial state is the medium ability level of 0.5, the state set S is initialized with 0.5;
step 304, judging whether the current state s of the current round reaches a target state value, if so, ending the current round, and entering step 311; otherwise, step 305 is entered.
305, judging whether the current round has unexecuted actions, if so, selecting an action a according to the current state of the Q matrix, entering step 306, and if not, ending the current round, entering step 311;
the selection method for selecting action a according to the current state of the Q matrix is as follows:
firstly, selecting Q values of all actions which are not executed in the current turn under the state s from a Q matrix as a candidate action Q value set; then, judging whether all the action values in the candidate action set are 0, if so, selecting the action according to a non-greedy mode, namely, randomly selecting the action; otherwise, selecting the action according to a greedy mode with a probability of 90 percent, namely selecting the action with the maximum Q value; the actions are selected in a non-greedy mode with a 10% probability, i.e., randomly.
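A minimal sketch of this action-selection rule (function and argument names are ours; the 90%/10% split corresponds to epsilon = 0.1):

```python
import random

def select_action(q_row, executed, epsilon=0.1):
    """Pick an action for the current state: among the actions not yet executed
    in this episode, choose randomly if all their Q values are still 0,
    otherwise act greedily with probability 1 - epsilon and randomly otherwise.
    q_row is the Q-matrix row for the current state; executed is the set of
    action indices already taken this episode."""
    candidates = [a for a in range(len(q_row)) if a not in executed]
    if not candidates:
        return None                                  # no unexecuted action: the episode ends
    if all(q_row[a] == 0 for a in candidates) or random.random() < epsilon:
        return random.choice(candidates)             # non-greedy (random) choice
    return max(candidates, key=lambda a: q_row[a])   # greedy choice: largest Q value
```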
Step 306, after action a is executed, one-hot encode the student's historical learning data, input it into the trained DKT model, and take the predicted latest mastery state value of the current knowledge point K as the next state s'.
Step 307, check whether state s' is already in the state set of the Q matrix; if so, go to step 308; otherwise add it to the state set, append a new row of data to the Q matrix, and initialize each of its elements to 0.
Step 308, substitute state s' into the Reward feedback definition to obtain the corresponding reward value R, increment the episode counter EPISODES by 1, and save the episode's time record and reward value R in the database;
If R = 1, store the episode count and the reward value 1; if R = 0, check whether the episode still has unexecuted actions, and if not, store the episode count and the reward value 0.
The agent keeps transitioning from one state to another to explore until the target state is reached. Each exploration by the agent is called an episode; in each episode the agent moves from some initial state toward the target state, the episode ends once the target state is reached, and the next episode begins. Whenever a new state is discovered during exploration, it is added to the Q table.
Step 309, updating the Q matrix by using the current Q matrix and Reward feedback;
the update formula is as follows:
representing the next state and corresponding behavior;means thatMaximum Q values corresponding to all actions in the state;is in a stateAmong all the Q values of (1), the operation corresponding to the maximum Q value.
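A one-step version of this update, as a sketch (the dict-of-rows Q representation matches the earlier sketch and is an assumption, not the patent's data structure):

```python
def q_update(Q, n_actions, state, action, r, next_state, alpha=0.1, gamma=0.9):
    """Q-Learning update of step 309:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Unseen states get a fresh all-zero row, as in step 307."""
    for s in (state, next_state):
        Q.setdefault(s, [0.0] * n_actions)
    best_next = max(Q[next_state])                   # max_a' Q(s', a')
    Q[state][action] += alpha * (r + gamma * best_next - Q[state][action])
```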
Step 310, returning to step 304, and continuing to judge the next stateWhether the target value is manually set or not is achieved, and the Q matrix is continuously updated;
step 311, determining whether the number of rounds EPISODES completed currently is greater than or equal to the target number of rounds M, if yes, entering step 312; otherwise, the Q matrix is not converged and step 303 is entered.
Step 312, counting the probability P that the reward value R is 1 obtained in the latest M rounds according to time, judging whether the probability P is more than or equal to 90%, if so, the Q matrix is close to convergence enough, the algorithm is terminated, and the Q matrix is stored after the Q matrix is trained; otherwise, the Q matrix is not converged, and step 303 is entered to continue the next training round.
The probability P is obtained by counting the number of rounds (e.g., M1000) in which the prize value is 1, and dividing the number by M.
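The convergence test of steps 311-312 can be sketched as follows (M = 1000 and the function name are illustrative):

```python
def q_converged(episode_rewards, m=1000, threshold=0.9):
    """Steps 311-312: once at least m episodes have finished, compute the
    fraction P of the most recent m episodes whose reward was 1 and declare
    convergence when P >= threshold."""
    if len(episode_rewards) < m:
        return False                    # fewer than M episodes: keep training
    p = sum(episode_rewards[-m:]) / m
    return p >= threshold
```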
The second stage: use the trained Q matrix for learning path recommendation within the knowledge point.
Step 3.1: set the target mastery state s_t and initialize the student's current mastery level of the knowledge point to s = 0.5.
Step 3.2: according to the Q matrix trained in the first stage, select the action a with the largest Q value in state s and recommend the test question corresponding to that action to the student.
Step 3.3: after the student finishes that question, use the trained DKT model and the student's answer records on the current knowledge point to compute the student's next state s', and update the student's current state;
Step 3.4: check whether the updated current state has reached the target state value s_t; if so, the learning of knowledge point K is finished; otherwise return to step 3.2 and continue with the test question content corresponding to the next action for the current knowledge point K.
Finally, a personalized optimal question making path is recommended for the students.
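Putting steps 3.1-3.4 together, here is a compact sketch of the intra-knowledge-point recommendation loop; the function and parameter names are illustrative, dkt_predict stands in for the trained DKT model, and in practice the DKT-predicted mastery values may need to be discretized so that they match the states stored in the Q matrix:

```python
def recommend_within_knowledge_point(Q, actions, dkt_predict, target_state, state=0.5):
    """Stage two: repeatedly recommend the test question whose action has the
    largest Q value in the current state, re-estimate mastery with the DKT model
    after the student answers, and stop once the target state s_t is reached."""
    answered = []
    while state < target_state:
        q_row = Q.get(state, [0.0] * len(actions))
        a = max(range(len(actions)), key=lambda i: q_row[i])   # step 3.2
        answered.append(actions[a])                            # student works the recommended question
        state = dkt_predict(answered)                          # steps 3.3-3.4: updated mastery state
    return answered                                            # the recommended question path
```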
Step four, after the current knowledge point K passes the learning, judging whether the knowledge point has a subsequent knowledge point, if so, entering the step five; otherwise, moving out the current knowledge point from the knowledge point set to be selected, and entering the step six;
adding subsequent knowledge points of the current knowledge point K into a to-be-selected knowledge point set, and moving the current knowledge point K out of the to-be-selected knowledge point set;
step six, judging whether the knowledge point set to be selected is empty, and if so, terminating the circulation; otherwise, returning to the step two, and continuing to learn the next knowledge point.
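A compact sketch of this outer, inter-knowledge-point loop (steps one to six); candidates, graph, dkt_predict_all and learn_knowledge_point are illustrative names, with learn_knowledge_point standing for the intra-knowledge-point procedure above:

```python
def optimize_learning_path(candidates, graph, dkt_predict_all, learn_knowledge_point):
    """Steps one to six: keep picking the candidate knowledge point with the
    highest DKT-predicted mastery, learn it with the intra-knowledge-point
    algorithm, then replace it in the candidate set by its successors."""
    path = []
    candidates = set(candidates)          # unlearned discrete / root knowledge points
    while candidates:                     # step six: stop when the candidate set is empty
        scores = dkt_predict_all(candidates)             # step two: DKT mastery predictions
        k = max(candidates, key=lambda kp: scores[kp])   # step three: highest prediction
        learn_knowledge_point(k)                         # intra-knowledge-point learning
        path.append(k)
        candidates.discard(k)                            # steps four/five: remove K ...
        candidates.update(graph.get(k, []))              # ... and add its successors
    return path
```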
The inter-knowledge-point learning path recommendation method based on deep knowledge tracking determines the learning order of knowledge points by combining the knowledge graph constructed by domain experts with deep knowledge tracking. Knowledge tracking based on a deep neural network not only analyzes the learner's current knowledge state but also takes all of the learner's previous learning performance into account, so its prediction of the learner's future performance is more accurate, which greatly improves the recommendation accuracy of learning paths between knowledge points.
Second, the intra-knowledge-point learning path recommendation method based on reinforcement learning and deep knowledge tracking uses deep knowledge tracking to perceive the student's current learning state and reinforcement learning to decide the next learning content from that state.
As shown in FIG. 4, compared with the collaborative filtering and genetic algorithm methods, the learning efficiency of the method of the present invention is improved by more than 20% while achieving the same learning effect.
Claims (5)
1. A learning path optimization method based on deep knowledge tracking and reinforcement learning is characterized by comprising a learning path optimization process among knowledge points and a learning path optimization process in the knowledge points; the method comprises the following specific steps:
firstly, for a given student, selecting all unlearned discrete knowledge points, together with all root knowledge points that have no prerequisite knowledge points, as the candidate knowledge point set; one-hot encoding the knowledge points the student has already learned according to the historical learning data, inputting the encoding into the trained DKT model, and outputting the student's predicted mastery level for each candidate knowledge point;
then, sorting the candidate knowledge points by predicted mastery from high to low, selecting the knowledge point K with the highest prediction, and recommending it to the student for learning; the learning process is realized with the intra-knowledge-point learning path optimization algorithm;
finally, after the current knowledge point K has been mastered, judging whether it has successor knowledge points; if so, adding the successor knowledge points of the current knowledge point K to the candidate knowledge point set and removing the current knowledge point K from the candidate set; otherwise, directly removing the current knowledge point K from the candidate set; then judging whether the candidate knowledge point set is empty; if so, terminating the loop; otherwise, continuing the learning of the next knowledge point.
2. The learning path optimization method based on deep knowledge tracking and reinforcement learning as claimed in claim 1, wherein the learning path optimization algorithm in the knowledge point is specifically divided into two stages:
the first stage is as follows: training by using a Q-Learning algorithm for reinforcement Learning to obtain a grasping state of the knowledge point K and a Q matrix corresponding to the question-making action;
step 301, initializing a learning rate α, setting a discount factor γ and a counter of training rounds EPISODES to 0;
step 302, initializing a Q matrix of a reinforcement learning algorithm to be 0, and defining Reward feedback given by an environment;
the Q matrix has 1 row and 2n columns, and initially every action is available in that state; 2n is the size of the action set corresponding to the n test questions of the knowledge point K and its prerequisite knowledge points;
Reward feedback Reward is: if the state reached after performing an action from the current state reaches the knowledge point target state value s_t, then the reward value Reward = 1, otherwise Reward = 0; the concrete formula is:

    Reward(s, a) = 1, if s' ≥ s_t;   Reward(s, a) = 0, otherwise,

where s' is the state reached from the current state s after taking action a;
step 303, judging whether the Q matrix is converged, if so, stopping the training process, and outputting the current Q matrix for recommending a learning path in the knowledge point of the second stage; otherwise, the initialization state s is 0.5, and step 304 is entered;
step 304, judging whether the current state s of the current round reaches a target state value, if so, ending the current round, and entering step 311; otherwise, go to step 305;
305, judging whether the current round has unexecuted actions, if so, selecting an action a according to the current state of the Q matrix, entering step 306, and if not, ending the current round, entering step 311;
step 306, after action a is executed, one-hot encoding the student's historical learning data, inputting it into the trained DKT model, and taking the predicted latest mastery state value of the current knowledge point K as the next state s';
step 307, judging whether state s' is in the state set of the Q matrix; if so, going to step 308; otherwise, adding it to the state set, appending a new row of data to the Q matrix, and initializing each of its elements to 0;
step 308, substituting state s' into the Reward feedback definition to obtain the corresponding reward value R, incrementing the episode counter EPISODES by 1, and saving the episode's time record and reward value R in the database;
step 309, updating the Q matrix by using the current Q matrix and Reward feedback;
the update formula is:

    Q(s, a) ← Q(s, a) + α · [ R + γ · max_a' Q(s', a') - Q(s, a) ]

where (s', a') denote the next state and the actions available in it, and max_a' Q(s', a') is the maximum Q value over all actions in state s';
step 310, returning to step 304, continuing to judge whether the next state s' has reached the manually set target value, and continuously updating the Q matrix;
step 311, determining whether the number of rounds EPISODES completed currently is greater than or equal to the target number of rounds M, if yes, entering step 312; otherwise, the Q matrix is not converged, and the step 303 is entered;
step 312, computing the probability P that the reward value R equals 1 over the most recent M episodes; if P ≥ 90%, the Q matrix is sufficiently close to convergence, the algorithm terminates, and the trained Q matrix is saved; otherwise the Q matrix has not converged, the process goes to step 303, and the next round of training continues;
and a second stage: using the trained Q matrix for recommending the learning path in the knowledge point; the method specifically comprises the following steps:
step 3.1: setting the target mastery state s_t and initializing the student's current mastery level of the knowledge point to s = 0.5;
step 3.2: selecting the action a with the largest Q value in state s according to the Q matrix trained in the first stage, and recommending the test question corresponding to that action to the student;
step 3.3: after the student finishes that question, computing the student's next state s' with the trained DKT model from the student's answer records on the current knowledge point, and updating the student's current state;
step 3.4: judging whether the updated current state has reached the target state value s_t; if so, the learning of knowledge point K is finished; otherwise returning to step 3.2 to continue with the test question content corresponding to the next action for the current knowledge point K.
3. The learning path optimization method based on deep knowledge tracking and reinforcement learning of claim 2, wherein the Q matrix is represented as follows:
wherein Q(s_j, a_i0) is the expected return of answering test question i incorrectly when the student's knowledge point mastery level is in state s_j;
Q(s_j, a_i1) is the expected return of answering test question i correctly when the student's knowledge point mastery level is in state s_j;
the row index represents the set of mastery states of the knowledge point, i.e., the student's current mastery level of the knowledge point estimated by the DKT algorithm, a decimal in the range [0, 1], denoted s;
the column index is the set of answer actions, i.e., the student answering a given test question correctly or incorrectly; a_i0 denotes answering the i-th test question incorrectly, and a_i1 denotes answering the i-th test question correctly.
4. The learning path optimization method based on deep knowledge tracking and reinforcement learning of claim 2, wherein in step 305, the selection method of the action a according to the current state of the Q matrix is as follows:
firstly, selecting Q values of all actions which are not executed in the current turn under the state s from a Q matrix as a candidate action Q value set; then, judging whether all the action values in the candidate action set are 0, if so, selecting the action according to a non-greedy mode, namely, randomly selecting the action; otherwise, selecting the action according to a greedy mode with a probability of 90 percent, namely selecting the action with the maximum Q value; the actions are selected in a non-greedy mode with a 10% probability, i.e., randomly.
5. The method as claimed in claim 2, wherein the probability P in step 312 is obtained by counting the number of episodes among the M episodes whose reward value is 1 and dividing by M.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110706088.9A CN113268611B (en) | 2021-06-24 | 2021-06-24 | Learning path optimization method based on deep knowledge tracking and reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110706088.9A CN113268611B (en) | 2021-06-24 | 2021-06-24 | Learning path optimization method based on deep knowledge tracking and reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113268611A true CN113268611A (en) | 2021-08-17 |
CN113268611B CN113268611B (en) | 2022-11-01 |
Family
ID=77235833
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110706088.9A Active CN113268611B (en) | 2021-06-24 | 2021-06-24 | Learning path optimization method based on deep knowledge tracking and reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113268611B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114155124A (en) * | 2022-02-07 | 2022-03-08 | 山东建筑大学 | Test question resource recommendation method and system |
CN114385910A (en) * | 2021-12-10 | 2022-04-22 | 山东师范大学 | Knowledge tracking based online learning content recommendation method and system |
CN114461786A (en) * | 2022-04-13 | 2022-05-10 | 北京东大正保科技有限公司 | Learning path generation method and system |
CN115640410A (en) * | 2022-12-06 | 2023-01-24 | 南京航空航天大学 | Knowledge graph multi-hop question-answering method based on reinforcement learning path reasoning |
CN116796041A (en) * | 2023-05-15 | 2023-09-22 | 华南师范大学 | Learning path recommendation method, system, device and medium based on knowledge tracking |
CN117672027A (en) * | 2024-02-01 | 2024-03-08 | 青岛培诺教育科技股份有限公司 | VR teaching method, device, equipment and medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100041007A1 (en) * | 2008-08-13 | 2010-02-18 | Chi Wang | Method and System for Knowledge Diagnosis and Tutoring |
CN110264091A (en) * | 2019-06-24 | 2019-09-20 | 中国科学技术大学 | Student's cognitive diagnosis method |
CN110378818A (en) * | 2019-07-22 | 2019-10-25 | 广西大学 | Personalized exercise recommended method, system and medium based on difficulty |
CN110516116A (en) * | 2019-08-27 | 2019-11-29 | 华中师范大学 | A kind of the learner's human-subject test method for digging and system of multistep layering |
CN110569443A (en) * | 2019-03-11 | 2019-12-13 | 北京航空航天大学 | Self-adaptive learning path planning system based on reinforcement learning |
CN110991645A (en) * | 2019-11-18 | 2020-04-10 | 广东宜学通教育科技有限公司 | Self-adaptive learning method, system and storage medium based on knowledge model |
CN111461442A (en) * | 2020-04-07 | 2020-07-28 | 中国科学技术大学 | Knowledge tracking method and system based on federal learning |
- 2021-06-24: Application CN202110706088.9A filed; granted as patent CN113268611B (status: Active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100041007A1 (en) * | 2008-08-13 | 2010-02-18 | Chi Wang | Method and System for Knowledge Diagnosis and Tutoring |
CN110569443A (en) * | 2019-03-11 | 2019-12-13 | 北京航空航天大学 | Self-adaptive learning path planning system based on reinforcement learning |
CN110264091A (en) * | 2019-06-24 | 2019-09-20 | 中国科学技术大学 | Student's cognitive diagnosis method |
CN110378818A (en) * | 2019-07-22 | 2019-10-25 | 广西大学 | Personalized exercise recommended method, system and medium based on difficulty |
CN110516116A (en) * | 2019-08-27 | 2019-11-29 | 华中师范大学 | A kind of the learner's human-subject test method for digging and system of multistep layering |
CN110991645A (en) * | 2019-11-18 | 2020-04-10 | 广东宜学通教育科技有限公司 | Self-adaptive learning method, system and storage medium based on knowledge model |
CN111461442A (en) * | 2020-04-07 | 2020-07-28 | 中国科学技术大学 | Knowledge tracking method and system based on federal learning |
Non-Patent Citations (1)
Title |
---|
SHAN Ruiting et al.: "Collaborative filtering test question recommendation based on cognitive diagnosis", Computer Systems & Applications *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114385910A (en) * | 2021-12-10 | 2022-04-22 | 山东师范大学 | Knowledge tracking based online learning content recommendation method and system |
CN114385910B (en) * | 2021-12-10 | 2024-09-06 | 山东师范大学 | Online learning content recommendation method and system based on knowledge tracking |
CN114155124A (en) * | 2022-02-07 | 2022-03-08 | 山东建筑大学 | Test question resource recommendation method and system |
CN114155124B (en) * | 2022-02-07 | 2022-07-12 | 山东建筑大学 | Test question resource recommendation method and system |
CN114461786A (en) * | 2022-04-13 | 2022-05-10 | 北京东大正保科技有限公司 | Learning path generation method and system |
CN115640410A (en) * | 2022-12-06 | 2023-01-24 | 南京航空航天大学 | Knowledge graph multi-hop question-answering method based on reinforcement learning path reasoning |
CN116796041A (en) * | 2023-05-15 | 2023-09-22 | 华南师范大学 | Learning path recommendation method, system, device and medium based on knowledge tracking |
CN116796041B (en) * | 2023-05-15 | 2024-04-02 | 华南师范大学 | Learning path recommendation method, system, device and medium based on knowledge tracking |
CN117672027A (en) * | 2024-02-01 | 2024-03-08 | 青岛培诺教育科技股份有限公司 | VR teaching method, device, equipment and medium |
CN117672027B (en) * | 2024-02-01 | 2024-04-30 | 青岛培诺教育科技股份有限公司 | VR teaching method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN113268611B (en) | 2022-11-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113268611B (en) | Learning path optimization method based on deep knowledge tracking and reinforcement learning | |
CN111460249B (en) | Personalized learning resource recommendation method based on learner preference modeling | |
CN110569443B (en) | Self-adaptive learning path planning system based on reinforcement learning | |
CN111582694B (en) | Learning evaluation method and device | |
CN110930274B (en) | Practice effect evaluation and learning path recommendation system and method based on cognitive diagnosis | |
Liu et al. | Automated feature selection: A reinforcement learning perspective | |
CN110363282B (en) | Network node label active learning method and system based on graph convolution network | |
CN112529155B (en) | Dynamic knowledge mastering modeling method, modeling system, storage medium and processing terminal | |
CN112800323B (en) | Intelligent teaching system based on deep learning | |
CN113360635B (en) | Intelligent teaching method and system based on self-attention and pre-training mechanism | |
CN111860989A (en) | Ant colony algorithm optimization-based LSTM neural network short-time traffic flow prediction method | |
CN117035074B (en) | Multi-modal knowledge generation method and device based on feedback reinforcement | |
CN115618101A (en) | Streaming media content recommendation method and device based on negative feedback and electronic equipment | |
CN111191722A (en) | Method and device for training prediction model through computer | |
CN115249072A (en) | Reinforced learning path planning method based on generation of confrontation user model | |
CN117635238A (en) | Commodity recommendation method, device, equipment and storage medium | |
CN113449182B (en) | Knowledge information personalized recommendation method and system | |
CN118193920A (en) | Knowledge tracking method of personalized forgetting mechanism based on concept driving | |
CN117422062A (en) | Test question generation method based on course knowledge network and reinforcement learning | |
CN116757282A (en) | Knowledge graph multi-hop reasoning method based on reinforced state modeling | |
CN116595245A (en) | Hierarchical reinforcement learning-based lesson admiring course recommendation system method | |
CN113095328B (en) | Semantic segmentation method guided by base index and based on self-training | |
CN115688863A (en) | Depth knowledge tracking method based on residual connection and student near-condition feature fusion | |
CN115272015A (en) | Course recommendation method and system based on abnormal picture and cooperative attenuation attention mechanism | |
CN113673773A (en) | Learning path recommendation method fusing knowledge background and learning time prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20230717 Address after: 100876 Beijing city Haidian District Xitucheng Road No. 10 Patentee after: Beijing University of Posts and Telecommunications Address before: 100876 Beijing city Haidian District Xitucheng Road No. 10 Patentee before: Beijing University of Posts and Telecommunications Patentee before: Beijing Sikai Technology Co.,Ltd. |
TR01 | Transfer of patent right |