CN113268611B - Learning path optimization method based on deep knowledge tracking and reinforcement learning - Google Patents


Info

Publication number
CN113268611B
CN113268611B · Application CN202110706088.9A
Authority
CN
China
Prior art keywords
learning
knowledge point
state
knowledge
current
Prior art date
Legal status
Active
Application number
CN202110706088.9A
Other languages
Chinese (zh)
Other versions
CN113268611A (en)
Inventor
李建伟
李领康
于玉杰
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing Sikai Technology Co ltd
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing Sikai Technology Co ltd, Beijing University of Posts and Telecommunications filed Critical Beijing Sikai Technology Co ltd
Priority to CN202110706088.9A priority Critical patent/CN113268611B/en
Publication of CN113268611A publication Critical patent/CN113268611A/en
Application granted granted Critical
Publication of CN113268611B publication Critical patent/CN113268611B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education
    • G06Q50/205 Education administration or guidance

Abstract

The invention discloses a learning path optimization method based on deep knowledge tracking and reinforcement learning, belonging to the field of adaptive learning. The method specifically comprises the following steps: for a given student, select all unlearned discrete knowledge points and root knowledge points that have no prerequisite knowledge points as the candidate knowledge points; one-hot encode the knowledge points already learned, input the encoding into the DKT model, and output a predicted mastery level for each candidate knowledge point. Then select the knowledge point K with the highest prediction result and recommend it to the student for learning; the learning process is realized with the intra-knowledge-point learning path optimization algorithm. After the current knowledge point K has been learned, judge whether it has successor knowledge points; if so, add the successors to the candidate knowledge point set and remove the current knowledge point K; otherwise, directly remove the current knowledge point K. Then select the next knowledge point for prediction and learning, until the candidate knowledge point set is empty. The invention can greatly improve recommendation precision and improve efficiency while obtaining the same learning effect.

Description

Learning path optimization method based on deep knowledge tracking and reinforcement learning
Technical Field
The invention belongs to the field of adaptive learning, and particularly relates to a learning path optimization method based on deep knowledge tracking and reinforcement learning.
Background
In the adaptive learning process, one of the key problems to be solved is recommending an optimal learning path for each student according to the student's mastery level of the knowledge points, so as to obtain the best learning efficiency and effect.
Learning path recommendation includes learning path recommendation between knowledge points and learning path recommendation within knowledge points.
For learning path recommendation between knowledge points, probabilistic graphical model techniques are currently the most commonly used. The typical implementation uses a Markov network, a probabilistic graphical model, to track a single learner's mastery of a single knowledge point, and then uses a Bayesian network to predict the mastery of not-yet-learned knowledge points from the learner's mastery of already-learned knowledge points, thereby providing personalized learning path recommendation and predicting the learner's weak knowledge points. Most adaptive learning systems, such as Knewton, Squirrel AI or vipsid, use this technique to implement personalized learning path recommendation. However, this approach requires labeling domain knowledge (e.g., difficulty, discrimination, the knowledge points to which items belong), cannot comprehensively analyze the learner's current overall knowledge state and past learning performance, and its recommendation performance is mediocre.
For learning path recommendation within a knowledge point, collaborative filtering algorithms and genetic algorithms are currently the most commonly used. Collaborative filtering is the most widely used algorithm in personalized recommender systems; its basic idea is to find the nearest resources or users through a similarity computation over the learners' ratings of learning resources, predict scores for unrated target learning resources from those neighbors, and recommend more accurate learning resources to the learner according to the prediction results.
For example, Knewton uses a collaborative filtering algorithm to quickly locate the information the learner needs from the learner's learning objective, cognitive structure and learning engagement, so as to present the optimal learning content for the learner's future study. The genetic algorithm is an evolutionary algorithm that extracts the user's preference attribute values from an initial population through a series of operations and recommends learning resources accordingly. Squirrel AI uses a genetic algorithm, on the basis of tracking and analyzing learning data, to recommend suitable learning resources for the learner in a global scope. However, both algorithms take satisfying user preferences as the recommendation goal rather than obtaining the best learning efficiency and effect; learning is a painful process, and students can stay motivated to learn only if they obtain a high learning return after putting in effort.
Knowledge tracking models a student's knowledge over time in order to accurately predict how well the student will master each knowledge point at the next moment. The Deep Knowledge Tracking algorithm (DKT) is a knowledge tracking model built on the LSTM (Long Short-Term Memory) network: a knowledge point mastery prediction model is trained with the user's historical learning data, and the student's knowledge point mastery level is predicted and estimated with the trained model.
Reinforcement learning mainly comprises four elements: agent, environment state, action and reward. The agent selects an action to apply to the environment; after receiving the action, the environment's state changes, and at the same time a reinforcement signal (reward or punishment) is generated and fed back to the agent. The goal of reinforcement learning is to obtain the largest accumulated reward. Using a reinforcement learning algorithm, the student's knowledge point mastery state is updated according to the student's actions of answering test questions correctly or incorrectly, and a reward mechanism is established according to the target mastery state, so that a recommendation strategy for test questions and knowledge point learning content is built and the student can efficiently reach the target mastery level of the knowledge points.
In the prior art, deep knowledge tracking has strong perception capability for sensing the student's current learning state but lacks decision-making capability, while reinforcement learning has decision-making capability but lacks perception of the state. By combining the perception capability of deep knowledge tracking with the decision-making capability of reinforcement learning, deep knowledge tracking perceives the student's learning state, reinforcement learning makes decisions according to the perceived learning state guided by the goal of the best learning efficiency and effect, and a learning path recommendation with optimal performance can be obtained.
Disclosure of Invention
In order to find the knowledge point learning order and the learning order of the content within each knowledge point that best suit each student, the invention provides a learning path optimization method based on deep knowledge tracking and reinforcement learning, which recommends the most suitable and efficient learning path to the student so that the student can master the knowledge points efficiently.
The learning path optimization method comprises a learning path optimization process between knowledge points and a learning path optimization process within knowledge points, and specifically comprises the following steps:
The learning path optimization process between knowledge points specifically comprises the following steps:
Step one, for a given student, select all unlearned discrete knowledge points and root knowledge points that have no prerequisite knowledge points as the candidate knowledge point set;
Prerequisite knowledge points are the knowledge points that must be learned before the current knowledge point can be learned.
Step two, one-hot encode the knowledge points the student has already learned according to the historical learning data of the knowledge points, input the encoding into the trained DKT model, and output the student's predicted mastery level for each candidate knowledge point.
Step three, sort the prediction results of the candidate knowledge points from high to low, select the knowledge point K with the highest prediction result, and recommend it to the student for learning;
The learning process is realized with the intra-knowledge-point learning path optimization algorithm, which is divided into two stages:
The first stage: train with the reinforcement learning Q-Learning algorithm to obtain the Q matrix relating the mastery states of knowledge point K to the question-answering actions;
Step 301, initialize the parameters: learning rate α = 0.1, discount factor γ = 0.9, training round counter EPISODES = 0;
step 302, initializing a Q matrix of a reinforcement learning algorithm to be 0, and defining Reward feedback given by an environment;
Each element of the Q matrix is Q(s, a), which represents the expected return of taking action a in state s at a given moment;
the Q matrix is represented as follows:
Figure BDA0003132069660000031
where $Q(s_j, a_{i0})$ denotes the expected return of answering test question i incorrectly (action $a_{i0}$) when the student's knowledge point mastery level is in state $s_j$.
$Q(s_j, a_{i1})$ denotes the expected return of answering test question i correctly (action $a_{i1}$) when the student's knowledge point mastery level is in state $s_j$.
The row index represents the set of the student's mastery states of the knowledge point: the student's current mastery level of the knowledge point is estimated by the DKT algorithm as a decimal in the range [0, 1], denoted by s.
The column index is the set of answering actions, i.e., the student answers a given test question correctly or incorrectly; $a_{i0}$ denotes the action of answering the i-th test question incorrectly, and $a_{i1}$ denotes the action of answering the i-th test question correctly.
Suppose knowledge point K and its prerequisite knowledge points have n test questions in total; the corresponding action set A then has size 2n and contains the correct-answer and wrong-answer action for every question. The initialized matrix Q has 1 row and 2n columns with all element values 0, and initially each state corresponds to the full set of actions.
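As a concrete illustration of this data structure, a minimal sketch is given below; the class and method names and the column convention (column 2(i-1) for the wrong-answer action $a_{i0}$, column 2(i-1)+1 for the right-answer action $a_{i1}$) are assumptions for illustration, not taken from the patent.

```python
import numpy as np

class QTable:
    """Sketch of the Q matrix described above: one row of 2n Q values per mastery state s."""

    def __init__(self, n_questions, initial_state=0.5):
        self.n = n_questions
        self.states = [initial_state]              # row index: mastery states estimated by DKT, values in [0, 1]
        self.Q = np.zeros((1, 2 * n_questions))    # initialized as 1 row x 2n columns of zeros

    def row(self, s):
        """Row index of state s; an unseen state gets a new all-zero row appended (cf. step 307)."""
        if s not in self.states:
            self.states.append(s)
            self.Q = np.vstack([self.Q, np.zeros(2 * self.n)])
        return self.states.index(s)
```

With n = 5 questions, `QTable(5)` starts as a single 1 × 10 zero row for the initial state 0.5 and grows a row whenever DKT predicts a mastery value that has not been seen before.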
The Reward feedback is defined as follows: starting from the current state $s_j$ and performing some action $a_{i0}$ or $a_{i1}$, if the resulting state reaches the target mastery level of the knowledge point, i.e., the target state value $s_t$, the reward value Reward is 1; otherwise Reward is 0.
The concrete formula is as follows:

$$Reward(s,a)=\begin{cases}1, & \tilde{s}\ge s_t\\ 0, & \text{otherwise}\end{cases}$$

where (s, a) are the current state and action, and $\tilde{s}$ is the state reached from the current state $s_j$ after taking the action;
step 303, judging whether the Q matrix is converged, if so, stopping the training process, and outputting the current Q matrix for recommending a learning path in the knowledge point of the second stage; otherwise, initializing state s =0.5, and entering step 304;
step 304, judging whether the current state s of the current round reaches a target state value, if so, ending the current round, and entering step 311; otherwise, step 305 is entered.
Step 305, judging whether the current round has unexecuted actions, if so, selecting an action a according to the current state of the Q matrix, entering step 306, and if not, ending the current round, entering step 311;
the selection method for selecting action a according to the current state of the Q matrix is as follows:
firstly, selecting Q values of all actions which are not executed in the current turn under the state s from a Q matrix as a candidate action Q value set; then, judging whether all the action values in the candidate action set are 0, if so, selecting the action according to a non-greedy mode, namely, randomly selecting the action; otherwise, selecting the action according to a greedy mode with 90% probability, namely selecting the action with the maximum Q value; the actions are selected in a non-greedy mode with a 10% probability, i.e., randomly.
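A sketch of this selection rule, assuming the QTable layout above (function and variable names are illustrative):

```python
import random
import numpy as np

def choose_action(q_row, tried_actions, epsilon=0.1):
    """Pick an action index for the current state.

    q_row         : 1-D array of Q values for the current state (length 2n)
    tried_actions : set of action indices already executed in this round
    Returns None when every action has been tried, which ends the round (step 305)."""
    candidates = [a for a in range(len(q_row)) if a not in tried_actions]
    if not candidates:
        return None
    candidate_q = np.array([q_row[a] for a in candidates])
    if np.all(candidate_q == 0) or random.random() < epsilon:
        # non-greedy: all candidate Q values are 0, or the 10% random-exploration case
        return random.choice(candidates)
    # greedy choice (the remaining 90%): the untried action with the largest Q value
    return candidates[int(np.argmax(candidate_q))]
```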
Step 306, after action a is finished, one-hot encode the student's historical learning data, input it into the trained DKT model, and predict the latest mastery state value of the current knowledge point K as the next state $\tilde{s}$.
Step 307, judge whether the state $\tilde{s}$ is already in the state set of the Q matrix; if so, go to step 308; otherwise, add it to the state set, append a row of data to the Q matrix, and initialize each element of that row to 0.
Step 308, substitute the state $\tilde{s}$ into the Reward feedback to obtain the corresponding reward value R, count the round, i.e., increment EPISODES by 1, and save the time record of the round and the reward value R in a database;
step 309, updating the Q matrix by using the current Q matrix and Reward feedback;
The update formula is as follows:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R + \gamma \max_{\tilde{a}} Q(\tilde{s},\tilde{a}) - Q(s,a)\right]$$

where $(\tilde{s}, \tilde{a})$ denote the next state and the corresponding action, and $\max_{\tilde{a}} Q(\tilde{s},\tilde{a})$ denotes the maximum Q value over all actions in state $\tilde{s}$;
Step 310, return to step 304, continue to judge whether the next state $\tilde{s}$ has reached the manually set target value, and continue updating the Q matrix;
step 311, determining whether the number of rounds EPISODES completed currently is greater than or equal to the target number of rounds M, if yes, entering step 312; otherwise, the Q matrix is not converged and step 303 is entered.
Step 312, count, by time, the probability P of obtaining reward value R = 1 in the latest M rounds, and judge whether P is greater than or equal to 90%; if so, the Q matrix is sufficiently close to convergence, the algorithm terminates, and the trained Q matrix is saved; otherwise the Q matrix has not converged, and step 303 is entered to continue the next round of training.
The probability P is obtained by counting the number of rounds among the M rounds whose reward value is 1 and dividing by M.
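Putting steps 301 to 312 together, a compact sketch of the first-stage training loop could look as follows. It reuses the `QTable`, `choose_action` and `reward` sketches above, and `dkt_predict_next_state(history)` is an assumed stand-in for the trained DKT model that returns the new mastery value in [0, 1] given the question record `history`; it is a sketch, not the patent's exact implementation.

```python
import numpy as np

def train_q_matrix(qtable, dkt_predict_next_state, s_target,
                   alpha=0.1, gamma=0.9, M=1000):
    """First-stage Q-Learning training sketch (steps 301-312)."""
    episode_rewards = []                                  # reward value R of each finished round
    while True:
        s, history, tried = 0.5, [], set()                # step 303: every round starts from state 0.5
        while s < s_target:                               # step 304: the round ends when the target is reached
            a = choose_action(qtable.Q[qtable.row(s)], tried)   # step 305
            if a is None:                                 # no unexecuted action left: end the round
                break
            tried.add(a)
            history.append(a)
            s_next = dkt_predict_next_state(history)      # step 306: DKT predicts the next state
            r = reward(s_next, s_target)                  # step 308: Reward feedback
            i, j = qtable.row(s), qtable.row(s_next)      # step 307: an unseen state gets a new row
            # step 309: Q-Learning update  Q(s,a) += alpha * (R + gamma * max Q(s',a') - Q(s,a))
            qtable.Q[i, a] += alpha * (r + gamma * qtable.Q[j].max() - qtable.Q[i, a])
            s = s_next                                    # step 310
        episode_rewards.append(reward(s, s_target))       # R of this round: 1 only if the target was reached
        # steps 311-312: stop once at least M rounds exist and >= 90% of the last M reached the target
        if len(episode_rewards) >= M and np.mean(episode_rewards[-M:]) >= 0.9:
            return qtable
```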
The second stage: use the trained Q matrix for learning path recommendation within the knowledge point.
Step 3.1: set the target mastery state $s_t$; the student's current knowledge point mastery level s is initialized to 0.5.
Step 3.2: and selecting the action a with the maximum Q value in the state s according to the Q matrix trained in the first stage, and recommending the test question corresponding to the action to the student for learning.
Step 3.3: after the student finishes learning, calculate the student's next state $\tilde{s}$ with the trained DKT algorithm model according to the student's answer records on the current knowledge point, and update the student's current state;
step 3.4: judging whether the updated current state reaches a target state value stIf yes, finishing the learning of the knowledge point K; otherwise, returning to the step 3.2 to continue the study of the test question content corresponding to the next action of the current knowledge point K.
Step four, after the current knowledge point K has been learned, judge whether it has successor knowledge points; if so, go to step five; otherwise remove the current knowledge point from the candidate knowledge point set and go to step six;
Step five, add the successor knowledge points of the current knowledge point K to the candidate knowledge point set, and remove the current knowledge point K from the candidate knowledge point set;
Step six, judge whether the candidate knowledge point set is empty; if so, terminate the loop; otherwise return to step two and continue learning the next knowledge point.
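The outer inter-knowledge-point loop of steps one to six can be sketched as follows. The interfaces are assumptions for illustration: `knowledge_graph[k]` holds the prerequisite and successor sets from a knowledge graph such as FIG. 3, `dkt_predict_mastery(history, k)` returns the DKT prediction for candidate k, and `learn_knowledge_point(k)` runs the intra-knowledge-point stage sketched above and returns the answer records it produced; the explicit prerequisite check when promoting successors is an added safeguard for knowledge points with more than one prerequisite.

```python
def optimize_learning_path(knowledge_graph, dkt_predict_mastery, learn_knowledge_point):
    """Sketch of the inter-knowledge-point learning path optimization (steps one to six)."""
    history = []                                           # the student's accumulated answer records
    learned = set()
    path = []
    # step one: unlearned discrete / root knowledge points that have no prerequisite knowledge points
    candidates = {k for k, v in knowledge_graph.items() if not v["prerequisites"]}
    while candidates:                                      # step six: stop when the candidate set is empty
        # steps two and three: predict the mastery of every candidate and pick the highest one
        k = max(candidates, key=lambda c: dkt_predict_mastery(history, c))
        history += learn_knowledge_point(k)                # intra-knowledge-point learning (stage two)
        learned.add(k)
        path.append(k)
        candidates.remove(k)                               # steps four and five: update the candidate set
        for succ in knowledge_graph[k]["successors"]:
            if succ not in learned and knowledge_graph[succ]["prerequisites"] <= learned:
                candidates.add(succ)
    return path
```

For the knowledge graph of FIG. 3 the loop would start from the candidates {k1, k4}, and k2, k5 and then k3 would become candidates as their prerequisites are completed.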
The invention has the advantages that:
1) The deep knowledge tracking algorithm not only analyzes the learner's current knowledge state but also considers all of the learner's previous learning performance, so the accuracy of predicting the learner's future learning performance is higher. Compared with learning path recommendation methods based on probabilistic graphical models, the personalized learning path recommendation based on the knowledge graph and deep knowledge tracking can greatly improve recommendation precision.
2) The method combines the perception capability of deep knowledge tracking and the decision-making capability of reinforcement learning, and realizes recommendation aiming at obtaining the best learning efficiency and effect.
Drawings
FIG. 1 is a schematic diagram of the knowledge point mastery level prediction model constructed by the deep knowledge tracking algorithm according to the present invention;
FIG. 2 is a flowchart of a learning path optimization method based on deep knowledge tracking and reinforcement learning according to the present invention;
FIG. 3 is a schematic view of the knowledge graph used in the present invention, annotated with the prerequisite and successor relations between knowledge points and the knowledge point contents;
FIG. 4 is a graph comparing learning efficiency of the present invention compared to collaborative filtering and genetic algorithms.
Detailed Description
The following describes embodiments of the present invention in detail and clearly with reference to the examples and the accompanying drawings.
The Deep Knowledge Tracking algorithm (DKT) is a knowledge tracking model built on the deep neural network LSTM (Long Short-Term Memory network). A knowledge point mastery level prediction model is trained with this model and the user's historical learning data to predict the student's mastery state of unknown knowledge points; the predicted mastery level of a knowledge point takes values in the range [0, 1]. As shown in FIG. 1, the knowledge point mastery level prediction model maps an input vector sequence $x_1, \ldots, x_T$ to an output vector sequence $y_1, \ldots, y_T$ by computing a series of "hidden" states $h_1, \ldots, h_T$, which can be seen as a continuous encoding of the relevant information from past observations that is useful for future predictions.
The specific formulas are as follows:

$$h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h) \quad (1)$$

$$y_t = \sigma(W_{yh} h_t + b_y) \quad (2)$$

where $W_{hx}$ is the input weight, $W_{hh}$ is the state weight, $b_h$ is the bias term of the hidden unit, $\sigma$ is the sigmoid function, $W_{yh}$ is the readout weight, and $b_y$ is the bias term of the readout unit;
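A minimal numpy rendering of equations (1) and (2) is given below as a sketch; the full DKT model described in the patent uses an LSTM cell, while these two formulas describe the simpler recurrent form implemented here, and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dkt_forward(xs, W_hx, W_hh, b_h, W_yh, b_y):
    """Apply equations (1)-(2) to an input sequence x_1..x_T and return y_1..y_T.

    Each x_t is the 2n-dimensional one-hot answer vector; each y_t contains predicted
    mastery probabilities in [0, 1]."""
    h = np.zeros(W_hh.shape[0])                  # h_0
    ys = []
    for x in xs:
        h = np.tanh(W_hx @ x + W_hh @ h + b_h)   # equation (1)
        ys.append(sigmoid(W_yh @ h + b_y))       # equation (2)
    return ys
```

Here $W_{hx}$ has shape (hidden_size, 2n), $W_{hh}$ is (hidden_size, hidden_size), and $W_{yh}$ maps the hidden state back to one prediction per question or knowledge point.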
the method comprises the steps of pre-processing cleaned data according to a one-hot coding format by collecting sample data of a user and removing samples with sequences less than 50% of the number of knowledge point databases in each sample data;
Assume the knowledge point has n questions; the one-hot encoding length is then 2n, where the front n bits represent incorrectly answered test questions and the rear n bits represent correctly answered test questions. For example, when the user answers the i-th test question, if the answer is wrong, the position at one-hot index i-1 is 1 and the other positions are 0; the one-hot encoding is shown in Table 1:
TABLE 1
Index:    0  1  ...  i-1  ...  n-1  n  ...  2n-1
Encoding: 0  0  ...   1   ...   0   0  ...    0
If the user answers correctly, the position at one-hot index n+(i-1) is 1 and the remaining positions are 0; the one-hot encoding is shown in Table 2:
TABLE 2
Index:    0  1  ...  n-1  n  ...  n+i-1  ...  2n-1
Encoding: 0  0  ...   0   0  ...    1    ...    0
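A small helper matching Tables 1 and 2 (a sketch; the 1-based question index mirrors the tables, and the function name is illustrative):

```python
import numpy as np

def one_hot_answer(question_index, correct, n_questions):
    """One-hot encode a single answer record: index (i-1) is set for a wrong answer to
    question i, and index n+(i-1) for a correct answer, as in Tables 1 and 2."""
    x = np.zeros(2 * n_questions)
    offset = n_questions if correct else 0
    x[offset + (question_index - 1)] = 1.0
    return x
```

For n = 5, a wrong answer to question 3 sets position 2 and a correct answer to question 3 sets position 7.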
The preprocessed data are used as the input data of the DKT model to train and save the knowledge point prediction model; the user's question-answering history is then input into the prediction model to predict the user's mastery level of unknown knowledge points in real time.
Combining theories and technologies such as knowledge graphs, deep learning and reinforcement learning, the invention provides a learning path optimization method based on deep knowledge tracking and reinforcement learning that covers both inter-knowledge-point and intra-knowledge-point learning paths. The method comprises an inter-knowledge-point learning path optimization process and an intra-knowledge-point learning path optimization process; as shown in FIG. 2, the specific process is as follows:
Step one, for a given student, select all unlearned discrete knowledge points and root knowledge points that have no prerequisite knowledge points as the candidate knowledge point set;
Prerequisite knowledge points are the knowledge points that must be learned before the current knowledge point can be learned. As shown in FIG. 3, a knowledge graph is established that annotates the prerequisite and successor relations between knowledge points and the content structure of each knowledge point. In the figure, k1, k2, k3, k4 and k5 represent knowledge points; k1 is a prerequisite knowledge point of k2 and k5, and k2 is a prerequisite knowledge point of k3; t1 and t2 represent the test questions and learning contents belonging to knowledge point k1 (the test questions may be embedded in the learning contents). k4 is a discrete knowledge point with no prerequisite knowledge point and no successor knowledge point.
Step two, one-hot encode the knowledge points the student has already learned according to the historical learning data of the knowledge points, input the encoding into the trained DKT model, and output the student's predicted mastery level for each candidate knowledge point.
Step three, sort the prediction results of the candidate knowledge points from high to low, select the knowledge point K with the highest prediction result, and recommend it to the student for learning;
the learning process is realized by using a learning path optimization algorithm in the knowledge points, and assuming that a student starts learning, firstly, a state value s of a mastery degree target is settThe student's initial mastery level (i.e., initial state s) is initialized to 0.5. Then, based on a learning path recommendation algorithm among knowledge points, selecting a knowledge point K to start learning; then, recommending test questions according to the Q matrix trained by the learning path recommendation algorithm in the knowledge points until the state stAnd when the target state value is reached, the learning of the knowledge point K is finished, and the learning of the next knowledge point is started based on the learning path recommendation algorithm among the knowledge points until all the knowledge points are completely learned.
The method is specifically divided into two stages:
the first stage is as follows: training by using a Q-Learning algorithm for reinforcement Learning to obtain a grasping state of the knowledge point K and a Q matrix corresponding to the question-making action;
Step 301, initialize the parameters: learning rate α = 0.1, discount factor γ = 0.9, and the counter EPISODES = 0 of training rounds that have reached the end state;
step 302, initializing a Q matrix of a reinforcement learning algorithm to be 0, and defining Reward feedback given by an environment;
Each element of the Q matrix is Q(s, a) (s ∈ S, a ∈ A), which represents the expected return of taking action a in state s at a given moment. The environment gives a corresponding reward according to the action feedback of the agent, so the main idea of the algorithm is to build a Q-table indexed by State and Action to store the Q values, and then select the action that can obtain the maximum return according to the Q values;
The Q matrix is represented as follows:

$$Q=\begin{bmatrix} Q(s_1,a_{10}) & Q(s_1,a_{11}) & \cdots & Q(s_1,a_{n0}) & Q(s_1,a_{n1})\\ \vdots & \vdots & & \vdots & \vdots\\ Q(s_m,a_{10}) & Q(s_m,a_{11}) & \cdots & Q(s_m,a_{n0}) & Q(s_m,a_{n1}) \end{bmatrix}$$
where $Q(s_j, a_{i0})$ denotes the expected return of answering test question i incorrectly (action $a_{i0}$) when the student's knowledge point mastery level is in state $s_j$.
$Q(s_j, a_{i1})$ denotes the expected return of answering test question i correctly (action $a_{i1}$) when the student's knowledge point mastery level is in state $s_j$.
The row index represents the set of the student's mastery states of the knowledge point: the student's current mastery level of the knowledge point is estimated by the DKT algorithm as a decimal in the range [0, 1], denoted by s.
The column index is the set of answering actions, i.e., the student answers a given test question correctly or incorrectly; $a_{i0}$ denotes the action of answering the i-th test question incorrectly, and $a_{i1}$ denotes the action of answering the i-th test question correctly.
Suppose knowledge point K and its prerequisite knowledge points have n test questions in total; the corresponding action set A then has size 2n and contains the correct-answer and wrong-answer action for every question. The initialized matrix Q has 1 row and 2n columns with all element values 0, and initially each state corresponds to the full set of actions.
The Reward, i.e., the reward feedback given by the environment, is defined as follows: the target mastery level of the knowledge point, i.e., the target state value, is set to $s_t$; starting from the current state $s_j$ and performing some action $a_{i0}$ or $a_{i1}$, if the resulting state reaches the target mastery level of the knowledge point, i.e., the target state value $s_t$, the reward value Reward is 1, otherwise Reward is 0.
The concrete formula is as follows:

$$Reward(s,a)=\begin{cases}1, & \tilde{s}\ge s_t\\ 0, & \text{otherwise}\end{cases}$$

where (s, a) are the current state and action, $\tilde{s}$ is the state reached from the current state $s_j$ after taking the action, and $s_t$ is the target state value.
Step 303, judging whether the Q matrix is converged, if so, stopping the training process, and outputting the current Q matrix for recommending a learning path in the knowledge point of the second stage; otherwise, initializing the state s =0.5, and entering step 304;
Assuming that the student's initial state is the middle ability level of 0.5, the state set S is initialized with 0.5;
step 304, judging whether the current state s of the current round reaches a target state value, if so, ending the current round, and entering step 311; otherwise, step 305 is entered.
Step 305, judging whether the current round has unexecuted actions, if so, selecting an action a according to the current state of the Q matrix, entering step 306, and if not, ending the current round, entering step 311;
the selection method for selecting action a according to the current state of the Q matrix is as follows:
firstly, selecting Q values of all actions which are not executed in the current round under the state s from a Q matrix as a candidate action Q value set; then, judging whether all the action values in the candidate action set are 0 or not, if so, selecting the action according to a non-greedy mode, namely, randomly selecting the action; otherwise, selecting the action according to a greedy mode with a probability of 90 percent, namely selecting the action with the maximum Q value; the actions are selected in a non-greedy mode with a 10% probability, i.e., randomly.
Step 306, after action a is finished, one-hot encode the student's historical learning data, input it into the trained DKT model, and predict the latest mastery state value of the current knowledge point K as the next state $\tilde{s}$.
Step 307, judge whether the state $\tilde{s}$ is already in the state set of the Q matrix; if so, go to step 308; otherwise, add it to the state set, append a row of data to the Q matrix, and initialize each element of that row to 0.
Step 308, substitute the state $\tilde{s}$ into the Reward feedback to obtain the corresponding reward value R, count the round, i.e., increment EPISODES by 1, and save the time record of the round and the reward value R in a database;
If R = 1, the round count and the reward value are stored; if R = 0, it is judged whether the current round still has unexecuted actions, and if not, the round count and the reward value are stored.
The agent continually transitions from one state to another to explore until the target state is reached. Each exploration of the agent is called a round (episode); in each round, the agent starts from some initial state and explores until it reaches the target state, after which the round ends and the next round begins. Whenever a new state is found during exploration, it is added to the Q table.
Step 309, updating the Q matrix by using the current Q matrix and Reward feedback;
The update formula is as follows:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R + \gamma \max_{\tilde{a}} Q(\tilde{s},\tilde{a}) - Q(s,a)\right]$$

where $(\tilde{s}, \tilde{a})$ denote the next state and the corresponding action; $\max_{\tilde{a}} Q(\tilde{s},\tilde{a})$ denotes the maximum Q value over all actions in state $\tilde{s}$; and $\arg\max_{\tilde{a}} Q(\tilde{s},\tilde{a})$ is the action corresponding to the maximum Q value among all Q values of state $\tilde{s}$.
Step 310, return to step 304, continue to judge whether the next state $\tilde{s}$ has reached the manually set target value, and continue updating the Q matrix;
step 311, determining whether the number of rounds EPISODES currently completed is greater than or equal to the target number of rounds M, if yes, entering step 312; otherwise, the Q matrix is not converged and step 303 is entered.
Step 312, count, by time, the probability P of obtaining reward value R = 1 in the last M rounds, and judge whether P is greater than or equal to 90%; if so, the Q matrix is sufficiently close to convergence, the algorithm stops, and the trained Q matrix is saved; otherwise the Q matrix has not converged, and step 303 is entered to continue the next round of training.
The probability P is obtained by counting the number of rounds among the last M (e.g., M = 1000) whose reward value is 1 and then dividing by M.
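This stopping rule can be written as a small check over the stored round rewards (a sketch; names are illustrative):

```python
def q_matrix_converged(episode_rewards, M=1000, threshold=0.9):
    """Step 312 sketch: P is the fraction of the last M rounds whose reward value R is 1;
    training stops once at least M rounds have been completed and P >= 90%."""
    if len(episode_rewards) < M:
        return False
    P = sum(1 for r in episode_rewards[-M:] if r == 1) / M
    return P >= threshold
```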
The second stage: use the trained Q matrix for learning path recommendation within the knowledge point.
Step 3.1: set the target mastery state $s_t$; the student's current knowledge point mastery level s is initialized to 0.5.
Step 3.2: and selecting the action a with the maximum Q value in the state s according to the Q matrix trained in the first stage, and recommending the test question corresponding to the action to the student for learning.
Step 3.3: after the student finishes learning, calculate the student's next state $\tilde{s}$ with the trained DKT algorithm model according to the student's answer records on the current knowledge point, and update the student's current state;
Step 3.4: judge whether the updated current state has reached the target state value $s_t$; if so, the learning of knowledge point K is finished; otherwise, return to step 3.2 and continue learning the test question content corresponding to the next action of the current knowledge point K.
Finally, a personalized optimal question-answering path is recommended for the student.
Step four, after the current knowledge point K has been learned, judge whether it has successor knowledge points; if so, go to step five; otherwise remove the current knowledge point from the candidate knowledge point set and go to step six;
Step five, add the successor knowledge points of the current knowledge point K to the candidate knowledge point set, and remove the current knowledge point K from the candidate knowledge point set;
Step six, judge whether the candidate knowledge point set is empty; if so, terminate the loop; otherwise return to step two and continue learning the next knowledge point.
The inter-knowledge-point learning path recommendation method based on deep knowledge tracking determines the learning order of knowledge points by combining the knowledge graph constructed by domain experts with deep knowledge tracking. Knowledge tracking based on the deep neural network not only analyzes the learner's current knowledge state but also considers all of the learner's previous learning performance, so the prediction of the learner's future learning ability is more accurate and the recommendation accuracy of the learning path between knowledge points can be greatly improved.
Secondly, the intra-knowledge-point learning path recommendation method based on reinforcement learning and deep knowledge tracking uses deep knowledge tracking to perceive the student's current learning state and uses reinforcement learning to decide the next learning content to be learned according to the current learning state.
As shown in fig. 4, compared with the collaborative filtering and genetic algorithm method, the learning efficiency of the method of the present invention is improved by more than 20% under the condition of obtaining the same learning effect.

Claims (4)

1. A learning path optimization method based on deep knowledge tracking and reinforcement learning is characterized by comprising a learning path optimization process among knowledge points and a learning path optimization process in the knowledge points; the method comprises the following specific steps:
firstly, for a given student, selecting all unlearned discrete knowledge points and root knowledge points that have no prerequisite knowledge points as the candidate knowledge point set; performing one-hot encoding on each knowledge point the student has learned according to the historical learning data, inputting the encoding into a trained DKT model, and outputting the student's predicted mastery level for each candidate knowledge point;
then, sorting the prediction results of the candidate knowledge points from high to low, and selecting the knowledge point K with the highest prediction result to recommend to the student for learning; the learning process is realized with the intra-knowledge-point learning path optimization algorithm;
the learning path optimization algorithm in the knowledge points is specifically divided into two stages:
the first stage is as follows: training with the reinforcement learning Q-Learning algorithm to obtain the Q matrix relating the mastery states of knowledge point K to the question-answering actions;
step 301, initializing a learning rate alpha, a discount factor gamma and a counter EPISODES =0 of a training round;
step 302, initializing a Q matrix of a reinforcement learning algorithm to be 0, and defining Reward feedback given by an environment;
the Q matrix has 1 row and 2n columns, and initially each state corresponds to the full set of actions; n is the number of test questions under knowledge point K and its prerequisite knowledge points, and 2n is the size of the corresponding action set;
the Reward feedback is: if the state reached after executing a certain action from the current state attains the knowledge point target state value $s_t$, the reward value Reward is 1, otherwise Reward is 0; the concrete formula is as follows:

$$Reward(s,a)=\begin{cases}1, & \tilde{s}\ge s_t\\ 0, & \text{otherwise}\end{cases}$$

where (s, a) are the current state and action, and $\tilde{s}$ is the state reached from the current state after taking the action;
step 303, judging whether the Q matrix is converged, if so, stopping the training process, and outputting the current Q matrix for recommending a learning path in the knowledge point of the second stage; otherwise, initializing state s =0.5, and entering step 304;
step 304, judging whether the current state s of the current round reaches a target state value, if so, ending the current round, and entering step 311; otherwise, go to step 305;
step 305, judging whether the current round has unexecuted actions, if so, selecting an action a according to the current state of the Q matrix, and entering step 306, otherwise, ending the current round, and entering step 311;
step 306, after action a is finished, performing one-hot encoding according to the student's historical learning data, inputting it into the trained DKT model, and predicting the latest mastery state value of the current knowledge point K as the next state $\tilde{s}$;
step 307, judging whether the state $\tilde{s}$ is in the state set of the Q matrix; if so, entering step 308; otherwise, adding it to the state set, adding a row of data to the Q matrix, and initializing each element of that row to 0;
step 308, substituting the state $\tilde{s}$ into the Reward feedback to obtain the corresponding reward value R, counting the round, i.e., incrementing EPISODES by 1, and saving the time record of the round and the reward value R in a database;
step 309, updating the Q matrix by using the current Q matrix and Reward feedback;
the update formula is as follows:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\left[R + \gamma \max_{\tilde{a}} Q(\tilde{s},\tilde{a}) - Q(s,a)\right]$$

where $(\tilde{s}, \tilde{a})$ denote the next state and the corresponding action, and $\max_{\tilde{a}} Q(\tilde{s},\tilde{a})$ denotes the maximum Q value over all actions in state $\tilde{s}$;
step 310, returning to step 304, continuing to judge whether the next state $\tilde{s}$ has reached the manually set target value, and continuing to update the Q matrix;
step 311, determining whether the number of rounds EPISODES completed currently is greater than or equal to the target number of rounds M, if yes, entering step 312; otherwise, the Q matrix is not converged, and the step 303 is entered;
step 312, counting the probability P that the reward value R is 1 obtained in the latest M rounds according to time, judging whether the probability P is more than or equal to 90%, if so, the Q matrix is close to convergence enough, the algorithm is terminated, and the Q matrix is stored after the Q matrix is trained; otherwise, the Q matrix is not converged, and step 303 is entered to continue the next round of training;
and a second stage: using the trained Q matrix for recommending the learning path in the knowledge point; the method specifically comprises the following steps:
step 3.1: setting the target mastery state $s_t$; the student's current knowledge point mastery level s is initialized to 0.5;
step 3.2: selecting an action a with the maximum Q value in a state s according to the Q matrix trained in the first stage, and recommending test questions corresponding to the action to students for learning;
step 3.3: after the student finishes learning, calculating the student's next state $\tilde{s}$ with the trained DKT algorithm model according to the student's answer records on the current knowledge point, and updating the student's current state;
step 3.4: judging whether the updated current state reaches the target state value $s_t$; if so, the learning of knowledge point K is finished; otherwise, returning to step 3.2 to continue learning the test question content corresponding to the next action of the current knowledge point K;
finally, after the current knowledge point K has been learned, judging whether it has successor knowledge points; if so, adding the successor knowledge points of the current knowledge point K to the candidate knowledge point set and removing the current knowledge point K from the candidate knowledge point set; otherwise, directly removing the current knowledge point K from the candidate knowledge point set; then judging whether the candidate knowledge point set is empty; if so, terminating the loop; otherwise continuing to learn the next knowledge point.
2. The learning path optimization method based on deep knowledge tracking and reinforcement learning of claim 1, wherein the Q matrix is represented as follows:
$$Q=\begin{bmatrix} Q(s_1,a_{10}) & Q(s_1,a_{11}) & \cdots & Q(s_1,a_{n0}) & Q(s_1,a_{n1})\\ \vdots & \vdots & & \vdots & \vdots\\ Q(s_m,a_{10}) & Q(s_m,a_{11}) & \cdots & Q(s_m,a_{n0}) & Q(s_m,a_{n1}) \end{bmatrix}$$

wherein $Q(s_j, a_{i0})$ denotes the expected return of answering test question i incorrectly (action $a_{i0}$) when the student's knowledge point mastery level is in state $s_j$;
$Q(s_j, a_{i1})$ denotes the expected return of answering test question i correctly (action $a_{i1}$) when the student's knowledge point mastery level is in state $s_j$;
the row index represents the set of the student's mastery states of the knowledge point: the student's current mastery level of the knowledge point is estimated by the DKT algorithm as a decimal in the range [0, 1], denoted by s;
the column index is the set of answering actions, i.e., the student answers a given test question correctly or incorrectly; $a_{i0}$ denotes the action of answering the i-th test question incorrectly, and $a_{i1}$ denotes the action of answering the i-th test question correctly.
3. The learning path optimization method based on deep knowledge tracking and reinforcement learning of claim 1, wherein in step 305, the selection method of the action a according to the current state of the Q matrix is as follows:
firstly, selecting Q values of all actions which are not executed in the current turn under the state s from a Q matrix as a candidate action Q value set; then, judging whether all the action values in the candidate action set are 0, if so, selecting the action according to a non-greedy mode, namely, randomly selecting the action; otherwise, selecting the action according to a greedy mode with a probability of 90 percent, namely selecting the action with the maximum Q value; the actions are selected in a non-greedy mode with a 10% probability, i.e., randomly.
4. The method as claimed in claim 1, wherein the probability P in step 312 is obtained by counting the number of rounds among the M rounds whose reward value is 1 and then dividing by M.
CN202110706088.9A 2021-06-24 2021-06-24 Learning path optimization method based on deep knowledge tracking and reinforcement learning Active CN113268611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110706088.9A CN113268611B (en) 2021-06-24 2021-06-24 Learning path optimization method based on deep knowledge tracking and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110706088.9A CN113268611B (en) 2021-06-24 2021-06-24 Learning path optimization method based on deep knowledge tracking and reinforcement learning

Publications (2)

Publication Number Publication Date
CN113268611A CN113268611A (en) 2021-08-17
CN113268611B true CN113268611B (en) 2022-11-01

Family

ID=77235833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110706088.9A Active CN113268611B (en) 2021-06-24 2021-06-24 Learning path optimization method based on deep knowledge tracking and reinforcement learning

Country Status (1)

Country Link
CN (1) CN113268611B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114155124B (en) * 2022-02-07 2022-07-12 山东建筑大学 Test question resource recommendation method and system
CN114461786B (en) * 2022-04-13 2022-10-21 北京东大正保科技有限公司 Learning path generation method and system
CN115640410B (en) * 2022-12-06 2023-03-14 南京航空航天大学 Knowledge map multi-hop question-answering method based on reinforcement learning path reasoning
CN116796041B (en) * 2023-05-15 2024-04-02 华南师范大学 Learning path recommendation method, system, device and medium based on knowledge tracking
CN117672027B (en) * 2024-02-01 2024-04-30 青岛培诺教育科技股份有限公司 VR teaching method, device, equipment and medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8366449B2 (en) * 2008-08-13 2013-02-05 Chi Wang Method and system for knowledge diagnosis and tutoring
CN109948054A (en) * 2019-03-11 2019-06-28 北京航空航天大学 A kind of adaptive learning path planning system based on intensified learning
CN110264091B (en) * 2019-06-24 2023-10-20 中国科学技术大学 Student Cognitive Diagnosis Method
CN110378818B (en) * 2019-07-22 2022-03-11 广西大学 Personalized exercise recommendation method, system and medium based on difficulty
CN110516116B (en) * 2019-08-27 2022-12-02 华中师范大学 Multi-step hierarchical learner cognitive level mining method and system
CN110991645B (en) * 2019-11-18 2024-03-29 广东宜学通教育科技有限公司 Self-adaptive learning method, system and storage medium based on knowledge model
CN111461442B (en) * 2020-04-07 2023-08-29 中国科学技术大学 Knowledge tracking method and system based on federal learning

Also Published As

Publication number Publication date
CN113268611A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN113268611B (en) Learning path optimization method based on deep knowledge tracking and reinforcement learning
CN111460249B (en) Personalized learning resource recommendation method based on learner preference modeling
CN110569443B (en) Self-adaptive learning path planning system based on reinforcement learning
CN111582694B (en) Learning evaluation method and device
US20200372362A1 (en) Method of continual-learning of data sets and apparatus thereof
CN110363282B (en) Network node label active learning method and system based on graph convolution network
Liu et al. Automated feature selection: A reinforcement learning perspective
CN112800323B (en) Intelligent teaching system based on deep learning
CN112116092B (en) Interpretable knowledge level tracking method, system and storage medium
CN109840595B (en) Knowledge tracking method based on group learning behavior characteristics
CN112529155B (en) Dynamic knowledge mastering modeling method, modeling system, storage medium and processing terminal
CN111860989A (en) Ant colony algorithm optimization-based LSTM neural network short-time traffic flow prediction method
CN111191722B (en) Method and device for training prediction model through computer
CN114021722A (en) Attention knowledge tracking method integrating cognitive portrayal
CN110110899A (en) Prediction technique, adaptive learning method and the electronic equipment of acquisition of knowledge degree
CN115249072A (en) Reinforced learning path planning method based on generation of confrontation user model
CN115618101A (en) Streaming media content recommendation method and device based on negative feedback and electronic equipment
CN110263136B (en) Method and device for pushing object to user based on reinforcement learning model
CN113449182B (en) Knowledge information personalized recommendation method and system
CN115238169A (en) Mu course interpretable recommendation method, terminal device and storage medium
CN117035074B (en) Multi-modal knowledge generation method and device based on feedback reinforcement
CN116595245A (en) Hierarchical reinforcement learning-based lesson admiring course recommendation system method
CN115688863A (en) Depth knowledge tracking method based on residual connection and student near-condition feature fusion
CN115757464A (en) Intelligent materialized view query method based on deep reinforcement learning
CN115272015A (en) Course recommendation method and system based on abnormal picture and cooperative attenuation attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230717

Address after: 100876 Beijing city Haidian District Xitucheng Road No. 10

Patentee after: Beijing University of Posts and Telecommunications

Address before: 100876 Beijing city Haidian District Xitucheng Road No. 10

Patentee before: Beijing University of Posts and Telecommunications

Patentee before: Beijing Sikai Technology Co.,Ltd.