CN113094495A - Learning path demonstration method, device, equipment and medium for deep reinforcement learning - Google Patents

Info

Publication number
CN113094495A
CN113094495A
Authority
CN
China
Prior art keywords
learning
knowledge point
path
demonstration
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110431018.7A
Other languages
Chinese (zh)
Inventor
王鑫 (Wang Xin)
许昭慧 (Xu Zhaohui)
Current Assignee
Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd
Original Assignee
Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd
Priority to CN202110431018.7A
Publication of CN113094495A

Classifications

    • G06F16/335 — Information retrieval; querying of unstructured textual data; filtering based on additional data, e.g. user or group profiles
    • G06F16/367 — Information retrieval; creation of semantic tools; ontology
    • G06Q50/205 — ICT specially adapted for education; education administration or guidance


Abstract

The embodiment of the invention discloses a learning path demonstration method, device, equipment and medium for deep reinforcement learning. The method comprises the following steps: receiving a learning path demonstration instruction for a target user; determining reinforcement learning factors of the target user's learning path according to the learning path demonstration instruction, where the reinforcement learning factors comprise an agent, a learning environment, a state space, an action space and a learning evaluation index; responding to the learning path demonstration instruction by performing reinforcement learning on the target user's learning path according to the reinforcement learning factors, obtaining the path generation process of the learning path; and visually demonstrating the path generation process. The technical scheme of the embodiment enriches the demonstration function of the intelligent adaptive learning demonstration system for the learning effect, thereby improving the intuitiveness and intelligence with which the system demonstrates that effect.

Description

Learning path demonstration method, device, equipment and medium for deep reinforcement learning
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence online education, in particular to a learning path demonstration method, device, equipment and medium for deep reinforcement learning.
Background
An intelligent adaptive learning system can customize learning modes and learning courses according to each student's learning strengths and weaknesses. When a student user enters the intelligent adaptive learning system, a round of testing first detects the student's weak points at the current level. Combining a fine-grained ("nanoscale") knowledge graph, the system locates in the least time the knowledge points consistent with the dynamic learning target. After the student learns, it dynamically builds a user portrait for each student from the mastery state of each knowledge point, tracks each student's learning state, raises early warnings for anomalies, adjusts the student's learning path and learning content in time, and selects the most appropriate, personalized learning path and content from the many available learning materials.
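The "dynamic user portrait" described above can be sketched minimally as a per-student map from knowledge point to latest mastery score, with weak points flagged for path adjustment. All names and the 0.8 mastery threshold below are illustrative assumptions, not details taken from the patent.

```python
def update_portrait(portrait, knowledge_point, score, threshold=0.8):
    """Record the latest mastery score and return the currently weak knowledge points."""
    portrait[knowledge_point] = score
    # Any knowledge point below the mastery threshold is a weak point
    return sorted(kp for kp, s in portrait.items() if s < threshold)

portrait = {}
update_portrait(portrait, "concept of quadratic radicals", 0.9)
weak = update_portrait(portrait, "rationalizing the denominator", 0.4)
# weak now lists the knowledge points needing learning-path adjustment
```

In a real system the scores would come from the system's post-learning tests rather than being assigned directly.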
The knowledge-point learning strategy of intelligent adaptive learning has many influencing factors, and personalized learning data for a student user can be produced from learning strategies such as logic-graph recommendation, sequence-graph recommendation, root tracing and strategy priority. However, each student's learning effect is a result accumulated over a long time, so the intelligent adaptive learning demonstration system needs to show the maximized learning effect within a limited time in order to exhibit the intelligent recommendation capability of the intelligent adaptive learning system, letting parents and students intuitively appreciate how intelligently the system recommends knowledge points.
At present, existing intelligent adaptive learning demonstration systems show the maximized learning effect in two ways. The first pushes a two-dimensional presentation form through a dynamic display processor and shows the maximized learning effect through that pushed form. This demonstration mode makes it difficult to intuitively display the complex factors that the artificial intelligence system weighs when selecting knowledge points in its teaching-strategy decisions. The second configures each user's interest-item rates and influence factors into a preset database for storage and demonstration output. In this mode, the knowledge-point recommendation algorithm of the intelligent adaptive learning system actually recommends dynamically, following the student's learning state and multiple attributes of the knowledge points such as their position in the knowledge graph, prerequisite relations, difficulty and examination frequency; hand-configuring influence factors therefore incurs high maintenance cost, is difficult to drive with data, and likewise cannot visually display the complex decision factors. Existing intelligent adaptive learning demonstration systems therefore cannot demonstrate intuitively, in a teaching scene, how the intelligent adaptive learning system emulates a famous teacher with years of teaching experience to deliver personalized teaching, and the resulting learning effect demonstration is unsatisfactory.
Disclosure of Invention
The embodiment of the invention provides a learning path demonstration method, a device, equipment and a medium for deep reinforcement learning, which can enrich the demonstration function of an intelligent adaptive learning demonstration system on the learning effect, thereby improving the intuitiveness and the intelligence of the intelligent adaptive learning demonstration system for demonstrating the learning effect.
In a first aspect, an embodiment of the present invention provides a learning path demonstration method for deep reinforcement learning, including:
receiving a learning path demonstration instruction of a target user;
determining a reinforcement learning factor of a learning path of a target user according to the learning path demonstration instruction; the reinforcement learning factors comprise a state space, an action space and a learning evaluation index;
responding to the learning path demonstration instruction, and performing reinforcement learning on the learning path of the target user according to the reinforcement learning factors to obtain a path generation process of the learning path of the target user;
and visually demonstrating the path generation process.
In a second aspect, an embodiment of the present invention further provides a learning path demonstration apparatus for deep reinforcement learning, including:
the learning path demonstration instruction receiving module is used for receiving a learning path demonstration instruction of a target user;
the reinforcement learning factor determining module is used for determining reinforcement learning factors of the target user learning path according to the learning path demonstration instruction; the reinforcement learning factors comprise a state space, an action space and a learning evaluation index;
the reinforcement learning module is used for responding to the learning path demonstration instruction, performing reinforcement learning on the learning path of the target user according to the reinforcement learning factors, and obtaining a path generation process of the learning path of the target user;
and the path generation process demonstration module is used for visually demonstrating the path generation process.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the learning path demonstration method for deep reinforcement learning provided by any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention further provides a computer storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the learning path demonstration method for deep reinforcement learning provided in any embodiment of the present invention.
According to the embodiment of the invention, the intelligent adaptive learning demonstration system determines reinforcement learning factors of the target user's learning path, such as the state space, the action space and the learning evaluation index, according to the received learning path demonstration instruction. It then responds to the instruction by performing reinforcement learning on the target user's learning path according to the determined reinforcement learning factors, obtains the path generation process of the learning path, and visually demonstrates that process. This solves the problem that existing intelligent adaptive learning demonstration systems demonstrate the learning effect poorly, enriches the demonstration function of the intelligent adaptive learning demonstration system for the learning effect, and improves the intuitiveness and intelligence with which it demonstrates that effect.
Drawings
Fig. 1 is a flowchart of a learning path demonstration method for deep reinforcement learning according to an embodiment of the present invention;
fig. 2 is a flowchart of a learning path demonstration method for deep reinforcement learning according to a second embodiment of the present invention;
FIG. 3 is a schematic flow chart of reinforcement learning in the prior art;
FIG. 4 is a flow chart illustrating the execution of reinforcement learning in the prior art;
fig. 5 is a schematic structural diagram of each functional module included in an intelligent adaptive learning demonstration system according to a second embodiment of the present invention;
fig. 6 is a schematic flowchart of intelligent agent reinforcement learning according to a second embodiment of the present invention;
FIG. 7 is a diagram illustrating the effect of the association relationship between knowledge points of a knowledge graph according to a second embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating a single interaction mode according to a second embodiment of the present invention;
fig. 9 is a schematic view of a learning path demonstration apparatus for deep reinforcement learning according to a third embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention.
It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The terms "first" and "second," and the like in the description and claims of embodiments of the invention and in the drawings, are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those listed steps or elements, but may include steps or elements not listed.
Example one
Fig. 1 is a flowchart of a learning path demonstration method for deep reinforcement learning according to an embodiment of the present invention. The embodiment is applicable to demonstrating a learning path to a user intuitively and intelligently. The method may be executed by a learning path demonstration apparatus for deep reinforcement learning, where the apparatus may be implemented in software and/or hardware and may generally be integrated in an electronic device capable of running an intelligent adaptive learning demonstration system. Accordingly, as shown in fig. 1, the method comprises the following operations:
and S110, receiving a learning path demonstration instruction of the target user.
The target user may be the learning user for whom the intelligent adaptive learning system generates a learning path, and the learning path demonstration instruction may be an instruction input by an operating user to the intelligent adaptive learning demonstration system to request a learning path demonstration for the target user. The operating user is the user who operates the intelligent adaptive learning demonstration system and may be a student user, a teacher user, or another type of user; the embodiment of the present invention does not limit the specific type of the operating user.
In the embodiment of the invention, when the operation user needs to preview the learning path of the target user through the intelligent adaptive learning demonstration system, the learning path demonstration instruction of the target user can be input into the intelligent adaptive learning demonstration system. Optionally, the intelligent adaptive learning demonstration system may be used as an independent system to interact with the intelligent adaptive learning system, so as to demonstrate the decision making process of the intelligent adaptive learning system to the operating user in real time. Or, the intelligent adaptive learning demonstration system may be integrated in the intelligent adaptive learning system, and used as a subsystem of the intelligent adaptive learning system to directly output the learning path decision process of the intelligent adaptive learning system, which is not limited in the embodiment of the present invention.
For example, when the operating user is a student user, the operating user may also be a target user, and the target user may input a learning path demonstration instruction of the target user to the intelligent adaptive learning demonstration system to request the intelligent adaptive learning demonstration system to perform the learning path demonstration of the target user. When the operation user is a teacher user, the operation user can input a learning path demonstration instruction of a certain student user to the intelligent adaptive learning demonstration system to request the intelligent adaptive learning demonstration system to perform learning path demonstration on the student user.
S120, determining a reinforcement learning factor of the learning path of the target user according to the learning path demonstration instruction; the reinforcement learning factors comprise an agent, a learning environment, a state space, an action space and a learning evaluation index.
The target user learning path is simply the learning path of the target user. It is understood that the learning path may include the learning process and learning content of the target user for each knowledge point. For example, the learning path of target user A may be: concept of quadratic radicals, conditions for a quadratic radical to be meaningful, simplification of quadratic radicals, rationalizing the denominator, multiplication of quadratic radicals, division of quadratic radicals, and multiplication and division of quadratic radicals. The learning paths of different target users may be the same or different, and are determined by relevant factors such as each target user's learning ability. The reinforcement learning factors are the relevant factors of reinforcement learning and may include, but are not limited to, the agent, the learning environment, the state space, the action space and the learning evaluation index.
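The five reinforcement learning factors named above can be bundled into a single configuration object. This is a hedged sketch: the field names, example state labels and example knowledge points are illustrative assumptions, not terms defined by the patent.

```python
from dataclasses import dataclass

@dataclass
class RLFactors:
    agent: str            # the learning agent (e.g. a simulated student)
    environment: str      # the knowledge-point recommendation environment
    state_space: list     # e.g. mastery states over knowledge points
    action_space: list    # e.g. candidate knowledge points to learn next
    evaluation_index: str # the learning evaluation (reward) metric

factors = RLFactors(
    agent="simulated student",
    environment="adaptive engine",
    state_space=["not learned", "learning", "mastered"],
    action_space=["concept of quadratic radicals", "rationalizing the denominator"],
    evaluation_index="mastery gain per unit time",
)
```

Determining the reinforcement learning factors then amounts to constructing such an object from the demonstration instruction before the reinforcement learning run begins.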
Correspondingly, after the intelligent adaptive learning demonstration system receives the learning path demonstration instruction of the target user, it can determine reinforcement learning factors such as the state space, the action space and the learning evaluation index of the target user's learning path according to the instruction, completing the initialization configuration for reinforcement learning of the learning path.
And S130, responding to the learning path demonstration instruction, and performing reinforcement learning on the learning path of the target user according to the reinforcement learning factors to obtain a path generation process of the learning path of the target user.
The path generation process may be the complete planning and generation process of a learning path.
After the intelligent adaptive learning demonstration system realizes the initialization configuration of reinforcement learning on the learning path, the intelligent adaptive learning demonstration system can start to respond to the learning path demonstration instruction, and the target user learning path is subjected to reinforcement learning by combining the intelligent adaptive learning system and the reinforcement learning model according to the configured reinforcement learning factors to obtain the path generation process of the target user learning path.
It should be noted that the path generation process in the embodiment of the present invention may embody a decision process and an effect of dynamically recommending each knowledge point by the intelligent adaptive learning system, and the whole path generation process may intuitively embody a complete learning state change of a target user and an intelligent decision inference manner of the intelligent adaptive learning system for a real-time learning state change process.
Optionally, the reinforcement learning model in the embodiment of the present invention may be a deep reinforcement learning model, and the embodiment of the present invention does not limit the type of the reinforcement learning model. It should be noted that the reinforcement learning model may be integrated inside the intelligent adaptive learning demonstration system, so as to be directly scheduled by the intelligent adaptive learning demonstration system for reinforcement learning, and generate a path generation process of the target user learning path. Or the reinforcement learning model can be executed independently of the intelligent adaptive learning demonstration system, and the intelligent adaptive learning demonstration system can send an instruction to a system or equipment where the reinforcement learning model is located to schedule the reinforcement learning model for reinforcement learning, so as to generate a path generation process of a target user learning path. The embodiment of the invention does not limit the integration mode between the reinforcement learning model and the intelligent adaptive learning demonstration system and the mode for dispatching the reinforcement learning model by the intelligent adaptive learning demonstration system.
And S140, visually demonstrating the path generation process.
Correspondingly, after the path generation process is obtained, the intelligent adaptive learning demonstration system can intuitively demonstrate the path generation process of the whole learning path of the target user in real time, namely the learning effect of the target user, so that the operating user can intuitively know the personalized learning dynamic process suitable for the target user.
It should be noted that, in the embodiment of the present invention, in addition to the operation user specifying the target user to the intelligent adaptive learning demonstration system by using the learning path demonstration instruction, the operation user may also specify a path generation process of different types of learning paths to the intelligent adaptive learning demonstration system by using the learning path demonstration instruction. Optionally, the intelligent adaptive learning demonstration system may perform visual demonstration on the path generation process in the form of a knowledge map or a path map, and the like, and the embodiment of the invention does not limit the demonstration mode of the intelligent adaptive learning demonstration system.
For example, the operating user can use the learning path demonstration instruction to put the intelligent adaptive learning demonstration system into an automatic mode: the demonstration of the path generation process involves no human-computer interaction, and the intelligent adaptive learning system automatically and intelligently decides what the target user should learn next, thereby showing the maximized learning effect. The operating user can also use the instruction to select a single-person interaction mode, in which the instruction designates the knowledge point at which the target user starts learning. Specifically, the intelligent adaptive learning system first determines the knowledge point the target user needs to learn, and the operating user can choose whether to accept it; when the operating user accepts, the system continues to decide the target user's learning path automatically. To further improve the user experience, the operating user can also use the instruction to select a multi-user interaction mode, in which the operating user chooses the next knowledge point for the target user to learn; the operating user can then observe how well the target user masters the knowledge points along a learning path arranged by the operating user, without intervention from the intelligent adaptive learning system.
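The three demonstration modes just described can be expressed as a simple dispatch over a mode enumeration. The mode names and function signature below are illustrative assumptions for the sketch, not identifiers from the patent.

```python
from enum import Enum

class DemoMode(Enum):
    AUTOMATIC = "automatic"            # no human-computer interaction
    SINGLE_INTERACTION = "single"      # user accepts or rejects each recommendation
    MULTI_USER = "multi"               # operating user picks the next knowledge point

def next_knowledge_point(mode, engine_suggestion, user_choice=None, accepted=True):
    """Decide the next knowledge point to demonstrate under the given mode."""
    if mode is DemoMode.AUTOMATIC:
        return engine_suggestion                       # engine decides alone
    if mode is DemoMode.SINGLE_INTERACTION:
        return engine_suggestion if accepted else user_choice
    return user_choice                                 # MULTI_USER: the user decides
```

The demonstration system would call this once per step of the path generation process, feeding the chosen knowledge point back into the adaptive engine.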
Therefore, the learning path demonstration method for deep reinforcement learning provided by the intelligent adaptive learning demonstration system in the embodiment of the invention not only lets the user intuitively follow the dynamic intelligent decision process of the intelligent adaptive learning system over the target user's whole learning path, improving the intuitiveness and intelligence of the learning effect demonstration, but also provides several different types of interactive demonstration modes, further enriching the system's demonstration function for the learning effect.
According to the embodiment of the invention, the intelligent adaptive learning demonstration system determines reinforcement learning factors of the target user's learning path, such as the state space, the action space and the learning evaluation index, according to the received learning path demonstration instruction. It then responds to the instruction by performing reinforcement learning on the target user's learning path according to the determined reinforcement learning factors, obtains the path generation process of the learning path, and visually demonstrates that process. This solves the problem that existing intelligent adaptive learning demonstration systems demonstrate the learning effect poorly, enriches the demonstration function of the intelligent adaptive learning demonstration system for the learning effect, and improves the intuitiveness and intelligence with which it demonstrates that effect.
Example two
Fig. 2 is a flowchart of a learning path demonstration method for deep reinforcement learning according to a second embodiment of the present invention, which is embodied on the basis of the second embodiment of the present invention. Correspondingly, as shown in fig. 2, the method of the present embodiment may include:
and S210, receiving a learning path demonstration instruction of the target user.
Reinforcement learning is a machine learning paradigm. Fig. 3 is a schematic flow diagram of reinforcement learning in the prior art. As shown in fig. 3, a reinforcement learning algorithm involves five major elements: Agent, Environment, Action, State and Reward. The agent interacts with the environment in real time: after observing the state of the environment, it outputs an action according to a policy model (Policy), and the action acts on the environment and changes its state. The environment also gives the agent a reward based on the action and the state, and the agent updates its action-selection policy model according to the state, the action and the reward. By continually trying in the environment to maximize the reward, the agent learns the mapping from states to actions, that is, the policy model (or simply the model), which is represented by a parameterized deep neural network.
Fig. 4 is a schematic diagram of the execution flow of reinforcement learning in the prior art. As shown in fig. 4, the flow is: the Agent observes the Environment and obtains a state; according to its Policy it takes an action on that state and obtains a reward; the Environment changes, so the Agent obtains a new state; execution continues in this way until learning succeeds.
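The Agent-Environment loop of figs. 3 and 4 can be sketched minimally as follows: observe the state, act according to the policy, receive a reward and a new state, and update the policy. A tabular epsilon-greedy Q-update stands in here for the parameterized deep neural network mentioned above; the toy environment and all parameter values are illustrative assumptions.

```python
import random

def run_episode(env_step, actions, q, state, steps=10, eps=0.1, lr=0.5, gamma=0.9):
    """Run one episode of the observe-act-reward-update loop over a Q-table."""
    for _ in range(steps):
        if random.random() < eps:                      # explore
            action = random.choice(actions)
        else:                                          # exploit the current policy
            action = max(actions, key=lambda a: q.get((state, a), 0.0))
        reward, next_state, done = env_step(state, action)
        best_next = max(q.get((next_state, a), 0.0) for a in actions)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + lr * (reward + gamma * best_next - old)
        state = next_state
        if done:
            break
    return q

def toy_env(state, action):
    """Toy stand-in environment: reward 1 for learning the next knowledge point in order."""
    if action == state + 1:
        return 1.0, action, action == 3
    return 0.0, state, False

q = run_episode(toy_env, actions=[1, 2, 3], q={}, state=0, eps=0.0)
```

In the patent's setting, `env_step` would be played by the intelligent adaptive learning system's knowledge point recommendation environment, and the sequence of chosen actions would form the demonstrated learning path.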
In the embodiment of the invention, reinforcement learning is applied to the scenario of learning path deduction. When determining the reinforcement learning factors of a target user's learning path, the five major elements of the reinforcement learning algorithm are configured respectively to obtain the agent, the learning environment (i.e. environment), the state space, the action space and the learning evaluation index (i.e. the reward function) corresponding to the intelligent adaptive learning demonstration system. The intelligent adaptive learning demonstration system uses reinforcement learning to obtain the path generation process of the target user's learning path; that process specifically involves the following operations.
And S220, determining the intelligent adaptive learning system as a knowledge point recommendation environment in the reinforcement learning model.
And S230, determining the knowledge point recommendation environment type matched with the state space according to the type of the learning path demonstration instruction.
Specifically, the intelligent adaptive learning system can be determined as a knowledge point recommendation environment in the reinforcement learning model, that is, the agent in the reinforcement learning model remains unchanged, the intelligent adaptive learning system is set as a learning environment factor in the reinforcement learning model, and the type of the knowledge point recommendation environment matched with the reinforcement learning state space is determined according to the type of the learning path demonstration instruction. It is to be appreciated that different types of learning path demonstration instructions can specify different types of knowledge point recommendation environment types, each of which can correspond to a reinforcement learning environment.
In an optional embodiment of the invention, the knowledge point recommendation environment type may comprise at least one of an adaptive engine recommendation knowledge point environment and an operating user feedback recommendation knowledge point environment; the adaptive engine knowledge point recommending environment is used for recommending knowledge points to the target user by adopting an adaptive engine; the operation user feedback recommendation knowledge point environment is used for recommending a knowledge point to the target user according to feedback information of an operation user; the operation user comprises a first operation user or a second operation user; the feedback information of the first operation user is used for confirming whether to accept the knowledge point recommended by the adaptive engine; and the feedback information of the second operation user is used for autonomously recommending the knowledge points to the target user.
The adaptive engine is the intelligent learning engine of the intelligent adaptive learning system. The adaptive engine recommendation knowledge point environment may be an environment in which the adaptive engine is adopted to recommend knowledge points to the target user. The operation user feedback recommendation knowledge point environment may be an environment in which a knowledge point is recommended to the target user according to feedback information of the operating user. The first operating user may be an operating user who interacts with the intelligent adaptive learning demonstration system in the single-person interaction mode; for example, the target user may feed back information to the intelligent adaptive learning demonstration system to confirm whether to accept the knowledge point recommended by the adaptive engine. The second operating user may be an operating user who interacts with the intelligent adaptive learning demonstration system in the multi-person interaction mode, such as a teacher user, and may feed back information to the intelligent adaptive learning demonstration system so as to autonomously determine the knowledge point recommended to the target user through the intelligent adaptive learning demonstration system.
That is, in the embodiment of the present invention, the intelligent adaptive learning demonstration system can simulate the generation modes of three different types of learning paths. By configuring the adaptive engine recommendation knowledge point environment, the learning path can be generated in the automatic mode, that is, the whole demonstration of the path generation process involves no human-computer interaction: the intelligent adaptive learning system automatically and intelligently judges the content to be learned by the target user, thereby showing the maximum learning effect. By configuring the operation user feedback recommendation knowledge point environment, the learning path can be generated in a human-computer interaction manner, supporting both the single-person interaction mode and the multi-person interaction mode. In the single-person interaction mode, the first operating user may use the learning path demonstration instruction to specify the knowledge point at which the target user starts learning. Specifically, the intelligent adaptive learning system initially determines the knowledge point the target user needs to learn, and the operating user can choose whether to accept it. When the operating user accepts, the intelligent adaptive learning system continues to automatically and intelligently determine the learning path of the target user. In the multi-person interaction mode, the second operating user can use the learning path demonstration instruction to specify that the second operating user autonomously selects the next knowledge point for the target user to learn, so as to observe whether the target user can master the knowledge points along the learning path arranged by the operating user, without intervention of the intelligent adaptive learning system.
The multi-person interaction mode thus allows an operating user to judge and guide the current learning state of the target user according to his or her own experience.
Therefore, the learning path demonstration method for deep reinforcement learning in the human-computer interaction mode enables an operating user to participate in the demonstration process, improving the interactivity and extensibility of the learning path demonstration. When a teacher user, as the second operating user, judges and guides the learning state of the target user according to his or her own experience, the learning path of the target user can be determined, and the teacher user can also compare this path with the intelligent learning strategy of the intelligent adaptive learning system. The teacher user can thereby see how learning path planning by the intelligent adaptive learning system differs from that of a real teacher for the same student under the same conditions, highlighting the advantages of the intelligent adaptive learning system.
S240, determining the state space according to student user attributes, knowledge point attributes of the knowledge graph, learning situation attributes and demonstration attributes.
Wherein the student user attributes may include the learning ability of the target user; the knowledge point attributes of the knowledge graph may include the logical relationships among knowledge points, the knowledge point reference frequency, the knowledge point importance and the knowledge point difficulty; the learning situation attributes may include the knowledge point mastery state; and the demonstration attributes may include the number of knowledge points to be mastered, the learning path demonstration duration, or the knowledge point learning range.
In the embodiment of the invention, the state space of reinforcement learning can be determined according to the attributes of student users, the attributes of knowledge points of a knowledge graph and the attributes of learning situations. Optionally, if a demonstration mode of the learning path needs to be set, such as setting a demonstration duration or a demonstration learning range, a demonstration attribute may also be added to the state space.
Alternatively, the student user attribute may be the learning ability of the target user, and may be a value set for the target user in the intelligent adaptive learning demonstration system, for example: excellent, medium, or poor. The student user attribute can also be determined from the user's historical learning data stored by the intelligent adaptive learning system, that is, from the ability value intervals into which the user's ability value at each moment, obtained according to Item Response Theory (IRT), is divided, for example excellent (ability value 0.7-1), medium (ability value 0.36-0.69) and poor (ability value 0-0.35). The operating user can set the initial student user attribute level of the target user with the learning path demonstration instruction; after the intelligent adaptive learning system simulates the target user learning a knowledge point, the target user's ability value in the intelligent adaptive learning system can serve as the updated student user attribute. The knowledge point attributes of the knowledge graph can be features of the knowledge point dimension, and can include but are not limited to the logical relationships between knowledge points (for example, prerequisite relationships between knowledge points), the knowledge point reference frequency, the knowledge point importance, the knowledge point difficulty, and the like. The learning situation attribute may include the knowledge point mastery state of the target user, that is, whether the target user, under his or her student user attribute, masters the knowledge point being learned.
The demonstration attributes can be conditions by which the operating user sets demonstration requirements, including but not limited to the maximum number of knowledge points to master, the demonstration duration (namely the time for the agent to achieve the task goal), or the learning range to be demonstrated.
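The mapping from IRT ability values to the three learning levels described above can be sketched as follows. The interval boundaries simply restate the intervals given in the text; the function name is illustrative and not part of the patent.

```python
def student_level(ability: float) -> str:
    """Map an IRT ability value in [0, 1] to the levels used in the text.

    Interval boundaries follow the description above:
    excellent (0.7-1), medium (0.36-0.69), poor (0-0.35).
    """
    if ability >= 0.7:
        return "excellent"
    if ability >= 0.36:
        return "medium"
    return "poor"
```

After each simulated learning step, the updated ability value can be passed through such a mapping to refresh the student user attribute in the state space.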
And S250, determining the action space according to the recommended learning knowledge points of the target user.
The recommended learning knowledge points are the to-be-learned knowledge points recommended to the target user.
Specifically, the next knowledge point to be learned by the target user may be set as the action output by the agent.
And S260, responding to the learning path demonstration instruction, and performing reinforcement learning on the learning path of the target user according to the reinforcement learning factors.
After the reinforcement learning factor configuration is completed, the intelligent adaptive learning demonstration system can respond to the learning path demonstration instruction and perform reinforcement learning on the learning path of the target user according to the reinforcement learning factor.
Fig. 5 is a schematic structural diagram of the functional modules included in an intelligent learning demonstration system according to the second embodiment of the present invention. In a specific example, as shown in fig. 5, the functional modules of the reinforcement-learning-based intelligent adaptive learning demonstration system can be subdivided into a state module, a decision module, an interaction module, a recommendation module, a learning simulation module and a demonstration module. The state module can provide the state attributes required by the agent, including student user attributes, knowledge point attributes of the knowledge graph, learning situation attributes, demonstration attributes and the like. The decision module can present the action output by the agent, i.e. the next knowledge point to be learned by the student. Specifically, the decision module can establish a deep reinforcement learning model for the agent; set the state space of the agent in the environment, the decision behavior space of the agent, and the behavior reward of the environment to the agent; and adopt a deep neural network to approximate the mapping function from state to action. By observing dynamic environment states such as the mastery state of each knowledge point on the knowledge graph and the learning level of the target student, the agent makes behavior decisions with this mapping function, i.e., dynamically plans the knowledge point recommendation. The interaction module can provide the various interaction modes of the intelligent adaptive learning demonstration system; through it, the operating user can set the demonstration attributes required by the state module and select the interaction mode to be adopted, which may include the automatic mode, the single-person interaction mode, the multi-person interaction mode and the like.
The recommendation module can access the knowledge point recommendation algorithm of the intelligent adaptive learning system and provide required data, such as exercises related to the current knowledge point, through the knowledge point recommendation algorithm interface. After the intelligent adaptive learning system deduces the target user's next recommended knowledge point according to the target user's mastery state of the current knowledge point, the recommendation module can continue to provide the required data for that next recommended knowledge point using the knowledge point recommendation algorithm. This repeats until the intelligent adaptive learning system completes the generation of the learning path, obtaining the complete learning path of the target user. Optionally, when the knowledge point recommendation algorithm has multiple question-pushing strategies, the names of the question-pushing strategies can be demonstrated in real time while the intelligent adaptive learning demonstration system demonstrates the learning path, so that the operating user can intuitively follow the decision process of the intelligent adaptive learning system. The learning simulation module can access the question-pushing algorithm of the intelligent adaptive learning system, receive learning data such as knowledge point exercises sent through the question-pushing algorithm interface, and perform simulated learning according to the student user attributes of the target user to obtain a state value indicating whether the target user can master the knowledge point in the given learning situation.
The learning simulation module can use the question-pushing algorithm to simulate the ability values of various student users, determine the probability of a wrong answer at each question difficulty, and judge whether the target user masters the knowledge point according to the mastery criteria of the intelligent adaptive learning system. The demonstration module can demonstrate the learning path of the target user in the intelligent adaptive learning system, generally in the form of a knowledge graph or a route map, and may run on a computer client, a web page, a television, a mobile terminal, an intelligent terminal, or various demonstration screens. The content demonstrated by the demonstration module can be designed according to user requirements, and may include, for example, user attributes, knowledge point attributes, learning strategies and the like, which are not limited in the embodiment of the invention.
Fig. 6 is a schematic flowchart of the intelligent agent reinforcement learning according to the second embodiment of the present invention, and in a specific example, as shown in fig. 6, step S260 may specifically include the following operations.
And S261, observing the knowledge point recommending environment through an agent in the reinforcement learning model to obtain a multi-dimensional vector state.
Wherein the multidimensional vector state can be an observation of the environment state by the agent represented by the high-dimensional vector. Illustratively, the multidimensional vector state may include student user attributes, knowledge point attributes of the knowledge graph, learning context attributes, presentation attributes, and the like.
In an optional embodiment of the present invention, if the type of the knowledge point recommendation environment includes an adaptive engine recommendation knowledge point environment, observing the knowledge point recommendation environment by the agent to obtain a multidimensional vector state may include: determining self-adaptive recommended knowledge points through a self-adaptive engine of the intelligent adaptive learning system according to a knowledge point recommendation algorithm; and determining the multidimensional vector state of the environment of the self-adaptive engine recommended knowledge point according to the self-adaptive recommended knowledge point through the agent.
The self-adaptive recommended knowledge points can be knowledge points automatically recommended by a self-adaptive engine according to a knowledge point recommendation algorithm.
Optionally, if the operation user instructs the adaptive engine to recommend the knowledge point environment as the environment type of reinforcement learning in the learning path demonstration instruction, the adaptive engine of the intelligent adaptive learning system may determine the adaptive recommended knowledge point according to the knowledge point recommendation algorithm. Further, the agent may determine a multidimensional vector state of the adaptive engine recommended knowledge point environment according to the adaptive recommended knowledge point.
In an optional embodiment of the present invention, if the knowledge point recommendation environment type includes the operation user feedback recommendation knowledge point environment, observing the knowledge point recommendation environment by the agent to obtain the multidimensional vector state may include: receiving, through the intelligent adaptive learning system, the feedback recommended knowledge point determined by the operating user; and determining, through the agent, the multidimensional vector state of the operation user feedback recommendation knowledge point environment according to the feedback recommended knowledge point. If the operating user is the first operating user, the feedback recommended knowledge point is the target adaptive recommended knowledge point selected by the first operating user from the adaptive recommended knowledge points determined by the adaptive engine; if the operating user is the second operating user, the feedback recommended knowledge point is the autonomous recommended knowledge point determined by the second operating user according to teaching experience.
Wherein the feedback recommended knowledge point can be a knowledge point fed back to the intelligent adaptive learning demonstration system by the operating user. The target adaptive recommended knowledge point may be one of the adaptive recommended knowledge points determined by the adaptive engine, as selected by the first operating user. The autonomous recommended knowledge point may be a recommended knowledge point selected by the second operating user based on his or her own experience.
Optionally, if the operation user indicates the operation user to feed back the recommended knowledge point environment as the environment type of reinforcement learning in the learning path demonstration instruction, the smart adaptive learning system may receive the feedback recommended knowledge point determined by the operation user. Further, the agent may determine a multidimensional vector state of the environment where the recommended knowledge point is fed back by the operating user according to the feedback recommended knowledge point.
Optionally, if the operating user is the first operating user, the feedback recommended knowledge point may be the target adaptive recommended knowledge point selected by the first operating user according to the adaptive recommended knowledge points determined by the adaptive engine; and if the operating user is the second operating user, the feedback recommended knowledge point may be the autonomous recommended knowledge point determined by the second operating user.
In an optional embodiment of the present invention, the observing, by the agent, the knowledge point recommendation environment to obtain the multidimensional vector state may include: acquiring a simulation recommendation exercise according to the knowledge points recommended by the knowledge point recommendation environment through the intelligent adaptive learning system; automatically simulating the answer result of the target user according to the student user attribute and the learning situation attribute of the target user through the intelligent adaptive learning system, and determining the knowledge point mastering state of the target user according to the answer result; and receiving the knowledge point mastering state of the target user through the agent, and determining the multidimensional vector state according to the knowledge point mastering state.
The simulation recommendation problem can be a problem recommended by a recommendation module of the intelligent adaptive learning system according to the current knowledge point. The knowledge point mastering state can represent whether the target user grasps the knowledge point.
Specifically, the intelligent adaptive learning system can automatically acquire the simulated recommended exercises through the recommendation module according to the knowledge point recommended by the knowledge point recommendation environment. After the simulated recommended exercises are obtained, the intelligent adaptive learning system uses the learning simulation module to automatically simulate the answer results of the target user according to the student user attributes and learning situation attributes of the target user. That is, during the whole demonstration, the target user does not need to participate in any actual answering; the whole learning process can be simulated automatically by the intelligent adaptive learning system, and the knowledge point mastery state of the target user is then determined from the simulated answer results. Correspondingly, the agent can observe the intelligent adaptive learning system and thereby obtain the multidimensional vector state corresponding to the target user.
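The simulated answering described above can be sketched with a one-parameter IRT-style model: the probability of a correct answer is estimated from the student's ability value and the exercise difficulty, and mastery is decided from the simulated results. The function names, the `scale` parameter and the mastery threshold below are assumptions made for this sketch, not details given in the patent.

```python
import math
import random

def p_correct(ability: float, difficulty: float, scale: float = 4.0) -> float:
    """Rasch-style (1PL) probability that a student answers correctly.

    `scale` sharpens the logistic curve; it is an illustrative assumption.
    """
    return 1.0 / (1.0 + math.exp(-scale * (ability - difficulty)))

def simulate_mastery(ability, difficulties, threshold=0.6, rng=None):
    """Simulate answering one exercise per difficulty; the knowledge point
    counts as mastered if the fraction answered correctly reaches `threshold`.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducible simulation
    correct = sum(rng.random() < p_correct(ability, d) for d in difficulties)
    return correct / len(difficulties) >= threshold
```

With such a sketch, a high-ability student facing easy exercises is simulated as mastering the knowledge point, while a low-ability student facing hard exercises is not, which is the state value the learning simulation module feeds back to the agent.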
And S262, determining the action of the intelligent agent through the intelligent agent according to the action selection strategy model and the multi-dimensional vector state.
And S263, executing the intelligent body action through the intelligent body so as to update the state of the knowledge point recommendation environment according to the intelligent body action execution result to obtain an updated multidimensional vector state.
The agent action execution result acts on the current knowledge point recommendation environment, so that the knowledge point recommendation environment updates its current state. The updated multidimensional vector state may be the state of the knowledge point recommendation environment after the agent executes the agent action.
And S264, receiving the reward value determined by the knowledge point recommending environment according to the updated multidimensional vector state and the action of the intelligent agent through the intelligent agent.
And S265, determining to update the action of the intelligent agent according to the reward value, the updated multidimensional vector state and the learning evaluation index through the intelligent agent.
The learning evaluation index comprises a knowledge point mastery target and a teaching rule determination rule for the reward value.
Wherein the updated agent action may be a new action determined by the agent according to the action selection policy. The knowledge point mastery target may be the goal of having the target user master as many knowledge points as possible, as quickly as possible, within the preset demonstration period. The teaching rule determination rule may be a rule that determines the reward value according to teaching laws and logic.
Fig. 7 is a schematic diagram illustrating the effect of the association relationships between knowledge points of a knowledge graph according to the second embodiment of the present invention. As shown in fig. 7, the learning of quadratic root knowledge points in junior high school mathematics is taken as a specific example; the list shown in fig. 7 gives the association relationships among some of the nanoscale knowledge points related to quadratic roots in the constructed knowledge graph.
In fig. 7, the third column of the list is the nanoscale knowledge point name, the second column is the label of the knowledge point, and the fourth column is the labels of the knowledge point's prerequisite knowledge points. Generally, a subsequent knowledge point is more difficult than its prerequisite knowledge points; that is, the further back a knowledge point sits in the graph, the more difficult it is. It can be understood that, in general, if the current knowledge point is not mastered, it is reasonable to recommend a prerequisite knowledge point for learning and unreasonable to recommend a subsequent one. Take the knowledge point labeled c090201 as an example: its subsequent knowledge point is c090301, and its prerequisite knowledge points are c090203, c090204 and c090103. After the knowledge point c090201 is mastered, recommending the subsequent knowledge point c090301 conforms to the teaching rules; but if c090201 is not mastered, recommending c090301 violates the teaching rules. Therefore, the learning evaluation index can determine the reward value according to the knowledge point mastery target and the teaching rule determination rule. When the recommended knowledge point conforms to the teaching rules and the target user masters it, a certain reward can be given; when the recommended knowledge point violates the teaching rules and/or the target user does not master it, a certain penalty can be given.
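The teaching rule check described above can be sketched over the Fig. 7 example. The prerequisite map below encodes only the relations quoted in the text (c090201's prerequisites c090203, c090204 and c090103; its subsequent point c090301, assumed here to have c090201 as its sole prerequisite); the function name is illustrative.

```python
# Prerequisite map taken from the Fig. 7 example in the text:
# key: knowledge point label, value: labels of its prerequisite knowledge points.
PREREQUISITES = {
    "c090201": ["c090203", "c090204", "c090103"],
    "c090301": ["c090201"],  # assumption: c090201 is its only prerequisite
}

def conforms_to_teaching_rules(recommended: str, mastered: set) -> bool:
    """A recommendation follows the teaching rules if every prerequisite
    of the recommended knowledge point has already been mastered."""
    return all(p in mastered for p in PREREQUISITES.get(recommended, []))
```

Recommending c090301 when c090201 is mastered then passes the check, while recommending it to a student who has not mastered c090201 fails, mirroring the reward and penalty cases above.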
And S266, judging whether the learning termination condition of the reinforcement learning model is met or not through the agent, if so, executing S267, and otherwise, returning to execute S261.
And S267, terminating the reinforcement learning process to obtain a path generation process of the target user learning path.
Specifically, the interaction process of the agent and the environment comprises three stages: environment observation by the agent, agent action, and environment feedback perceived by the agent. The environment observation is the agent's view of the environment state, represented by a high-dimensional vector that may contain a collection of information acquired by the agent. The agent action indicates which knowledge point the target user is to learn next. Environment feedback refers to the feedback of the environment to the agent in the form of a numerical reward. At each time step t, the agent receives the state information S_t ∈ S of the environment, where S is the set of possible states and S_t is the state at time t; based on this state, the agent selects an action A_t ∈ A(S_t), where A(S_t) is the set of all actions available in state S_t and A_t is the action at time t. After one time step, the agent receives a numerical reward R_{t+1} ∈ R (the reward at time t+1) as the return for the action, observes the new environment state S_{t+1}, and thus enters the next interaction loop.
Optionally, the decision module in the intelligent adaptive learning demonstration system may establish a deep reinforcement learning model for the agent, and set a state space of the agent in the environment, a behavior space that the agent can make a decision, and a behavior reward of the environment to the agent.
Within the learning range of the learning path to be planned by the agent, each node corresponds to a nanoscale knowledge point in the intelligent adaptive learning system, and the connections between knowledge points correspond to the logical relationships between knowledge points in the intelligent adaptive learning system. When establishing the model, the agent can obtain the required input from the state module of the intelligent adaptive learning system so as to observe the environment: the state can be regarded as an observed value, recorded as (learning level, mastery state), i.e. the mastery state of the current knowledge point at the target user's current learning level, such as: (excellent, mastered).
Specifically, each knowledge point may carry the target user's mastery state and the target student's learning level. The mastery state may include: not yet learned, mastered, not judged, and the like. The learning level can be divided into several categories according to user settings, such as excellent, medium and poor, or into several ability value intervals corresponding to the student's ability value at each moment obtained by the intelligent adaptive learning system through IRT, such as excellent (ability value 0.7-1), medium (ability value 0.36-0.69) and poor (ability value 0-0.35). The operating user can only set the initial learning level of the target student; once the target user learns a knowledge point, the target user's ability value in the intelligent adaptive learning system is taken as the learning level. Since a knowledge point that was not mastered when the ability value was low may be re-learned later, the same target user may have multiple learning level states at the same knowledge point.
Optionally, the target user learning path may be subjected to reinforcement learning based on a Q-learning algorithm (Q learning). Under the Q-learning framework, Q(s, a) denotes the expected return obtainable by taking action a (a ∈ A) in state s (s ∈ S) at a certain moment. The knowledge point recommendation environment feeds back a corresponding reward r according to the action of the agent. That is, the framework contains an agent, a set of states S representing its state in the environment, and a set of actions A that can be performed in each state. In the initial state s, the agent selects and executes an action a ∈ A through an action selection policy; specifically, a knowledge point is randomly selected from the learning range corresponding to the target user, or the knowledge point can be selected by the adaptive engine or by the operating user. In interacting with the environment, the agent transitions from the current state s to the next state s', obtains an immediate reward r from the environment, and modifies the Q value according to the update rule. It is understood that the update rule may be adapted to different action selection policy models and different specific learning algorithms, which is not limited by the embodiment of the present invention. The goal of agent learning is to maximize the cumulative reward obtained from the environment, i.e., to perform the action obtaining the maximum reward in each state. Accordingly, the Q value is updated as follows:
Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
where α denotes the learning rate; the learning rate α ∈ [0, 1] controls the proportion by which newly learned values replace the original values. If α = 0, the agent learns no new knowledge; α = 1 means the previously learned knowledge is not retained and is entirely replaced by new knowledge. γ denotes the discount factor; the discount factor γ ∈ [0, 1] reflects the agent's farsightedness and determines the weight given to the predicted return of future actions. When γ approaches 0, the agent attends only to the immediate return and tends to execute the action maximizing the current instant reward; when γ approaches 1, the agent gives more consideration to future returns. For intermediate γ ∈ (0, 1), nearer-term rewards carry larger weight, while rewards further in the future carry smaller, even negligible, weight.
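The tabular update rule above can be written directly in code. This is a minimal sketch: the state and action keys and the default α = 0.1, γ = 0.9 are illustrative assumptions, not values from the patent.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)  # Q table; unseen (state, action) pairs default to 0.0
```

In the learning path setting, s would be the (learning level, mastery state) observation, a the recommended next knowledge point, and r the reward fed back by the knowledge point recommendation environment.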
To address the problem of the state space being too large (i.e., the curse of dimensionality), Q(s, a) can be represented by a function rather than a Q table. That is, the Q value obtainable by choosing each action in a given state can be calculated by a deep neural network. Optionally, a Deep Q Network (DQN) may be used; after the network is trained, Q values can be computed on demand without being stored. The DQN model is as follows:
q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}
where q_π(s, a) denotes the discounted return obtained by starting in state s, taking action a, and thereafter following policy π; it measures the effect of taking a particular action from a particular state. E denotes the expectation, G_t denotes the discounted return at time t, and γ serves as the weight. A supervised learning algorithm from machine learning (such as linear regression, a decision tree, or a neural network) can be used to fit a suitable function: features extracted from the input state serve as the input, the value function computed by the Monte Carlo method (MC) or Temporal-Difference learning (TD) serves as the output, and the function parameters are then trained until convergence. The DQN algorithm continuously learns knowledge during training, but what it learns is not Q values stored in a table but the parameters of the neural network. Through this deep reinforcement learning method, it can be determined which action should be selected to maximize the sum of future rewards.
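The idea of replacing the Q table with a parametric function can be sketched with a simple linear approximator and a semi-gradient TD(0) update. This is an illustrative stand-in for the deep network (a real DQN adds experience replay and a target network); the feature map and toy data are assumptions.

```python
def phi(state):
    """Feature extraction for a state (here: mastery flags for 3 knowledge points + bias)."""
    return list(state) + [1.0]

class LinearQ:
    """Q(s, a; w) = w_a . phi(s): one weight vector per action instead of a Q table."""
    def __init__(self, n_actions, n_features, alpha=0.1, gamma=0.9):
        self.w = [[0.0] * n_features for _ in range(n_actions)]
        self.alpha, self.gamma = alpha, gamma

    def q(self, state, a):
        return sum(wi * xi for wi, xi in zip(self.w[a], phi(state)))

    def td_update(self, s, a, r, s_next):
        """Semi-gradient TD(0): w_a += alpha * (target - Q(s,a)) * phi(s)."""
        target = r + self.gamma * max(self.q(s_next, a2) for a2 in range(len(self.w)))
        delta = target - self.q(s, a)
        self.w[a] = [wi + self.alpha * delta * xi
                     for wi, xi in zip(self.w[a], phi(s))]
        return delta

model = LinearQ(n_actions=3, n_features=4)
model.td_update(s=(0, 0, 0), a=1, r=1.0, s_next=(0, 1, 0))
print(round(model.q((0, 0, 0), 1), 3))  # 0.1: only the weights for action 1 moved
```

After training, the parameters w — not a table of Q values — encode what has been learned, which is the point made in the paragraph above.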
In the environment feedback stage, a discrete reward function r can be adopted as the learning evaluation index. The reward function is one of the important elements of the information feedback from the environment to the agent: it tells the agent the goal to achieve, but not how to achieve it. The goal encoded in the reward function can be to let the target user master as many knowledge points as possible within the demonstration period: a reward is given at each time step when the target user masters a knowledge point, and a certain penalty is applied when the knowledge point is not mastered. Meanwhile, the design of the reward function must also conform to the rules and logic of the knowledge point learning sequence, taking into account the diversity of courses and the complexity of the learning process; a certain penalty can be applied to recommendations that violate teaching rules.
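A discrete reward function of this shape can be sketched as follows. The numeric reward values and the toy prerequisite map are assumptions chosen for illustration; the patent does not specify them.

```python
def discrete_reward(mastered, violates_prereq,
                    r_master=1.0, r_fail=-0.5, r_rule=-1.0):
    """Reward for one time step: reward mastery, penalize failure, and add an
    extra penalty when the recommendation violates the teaching order."""
    r = r_master if mastered else r_fail
    if violates_prereq:
        r += r_rule
    return r

# Prerequisite check against a toy knowledge-graph edge list (assumed structure).
PREREQS = {"kp3": {"kp1", "kp2"}}  # kp3 requires kp1 and kp2 first

def violates(kp, learned):
    """True if some prerequisite of kp has not yet been learned."""
    return not PREREQS.get(kp, set()).issubset(learned)

# Mastered, but recommended out of order: reward and rule penalty cancel out.
print(discrete_reward(True, violates("kp3", {"kp1"})))  # 0.0
```

Separating the mastery term from the teaching-rule penalty keeps the two design goals in the paragraph above independently tunable.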
In an alternative embodiment, in the single-person interaction mode of the intelligent adaptive learning demonstration system, the system may be externally connected to human-computer interaction devices, which mainly include, but are not limited to, a keyboard, a mouse, a joystick, and various pattern recognition devices (gesture recognition, motion recognition, and voice recognition). Fig. 8 is a schematic illustration of the demonstration flow of the single-person interaction mode according to the second embodiment of the present invention. As shown in Fig. 8, the knowledge point recommending module of the intelligent adaptive learning system may send knowledge point data to the interface of the recommending module and wait for feedback from the target user (user for short). The user can interact with the intelligent adaptive learning demonstration system through the human-computer interaction devices. When the user confirms learning of the recommended knowledge point, the system continues to the next process and simulates learning of the knowledge point. When the user chooses to skip the knowledge point, the flow returns to the knowledge point recommendation process of the intelligent adaptive learning system, which recommends the next knowledge point and waits for user feedback again.
In an alternative embodiment, in the multi-person interaction mode of the intelligent adaptive learning demonstration system, the interaction between the agent and the environment does not require the knowledge point recommendation module of the intelligent adaptive learning system to recommend knowledge points to the target user; instead, the operating user's action of selecting the next knowledge point to learn is received and becomes the next state. After one time step, the agent receives a numerical reward as a result of this action and observes a new environment state S_{t+1}, thereby entering the next interaction loop. In this mode, the intelligent adaptive learning demonstration system can demonstrate the difference between the knowledge point mastery results obtained when the agent and the operating user each recommend the next knowledge point learning strategy for the same target user.
And S270, visually demonstrating the path generation process.
Correspondingly, step S270 may specifically include the following operations:
s271, determining the demonstration state attribute of the path generation process.
Wherein, the demonstration state attribute can be a demonstration attribute specified by the operation user through the learning path demonstration instruction.
In the embodiment of the invention, the operation user can also specify the demonstration state attribute through the learning path demonstration instruction. For example, the operation user may specify a presentation time length or a learning range of the presentation, and the like, which is not limited by the embodiment of the present invention. Accordingly, the intelligent adaptive learning demonstration system can determine the demonstration state attribute of the path generation process according to the learning path demonstration instruction.
And S272, visually demonstrating the path generation process according to the demonstration state attribute.
Correspondingly, after the intelligent adaptive learning demonstration system determines the demonstration state attribute of the path generation process according to the learning path demonstration instruction, the path generation process can be intuitively demonstrated according to the demonstration mode specified by the operation user.
Illustratively, when the demonstration state attribute is 5 minutes of demonstration duration, the intelligent adaptive learning demonstration system needs to intuitively demonstrate the path generation process of the completed learning path of the target user under the condition that the mastery number of knowledge points is the maximum within 5 minutes.
It should be noted that the intelligent adaptive learning demonstration system can display the content decision of each knowledge point in real time during the reinforcement learning process, and can also display the complete path generation process in sequence after the reinforcement learning is finished.
In summary, in the embodiment of the present invention, the intelligent agent of the intelligent adaptive learning demonstration system makes a decision automatically according to the state of the environment, so that the operating user can intuitively know the decision process and effect of the intelligent adaptive learning system for dynamically recommending the knowledge point during the specified demonstration period, the demonstration function of the intelligent adaptive learning demonstration system for the learning effect is enriched, and the intuitiveness and the intelligence of the intelligent adaptive learning demonstration system for demonstrating the learning effect are improved.
It should be noted that any permutation and combination between the technical features in the above embodiments also belong to the scope of the present invention.
EXAMPLE III
Fig. 9 is a schematic view of a learning path demonstration apparatus for deep reinforcement learning according to a third embodiment of the present invention, and as shown in fig. 9, the apparatus includes: a learning path demonstration instruction receiving module 310, a reinforcement learning factor determining module 320, a reinforcement learning module 330 and a path generation process demonstration module 340, wherein:
a learning path demonstration instruction receiving module 310, configured to receive a learning path demonstration instruction of a target user;
the reinforcement learning factor determining module 320 is configured to determine a reinforcement learning factor of the learning path of the target user according to the learning path demonstration instruction; the reinforcement learning factors comprise a state space, an action space and a learning evaluation index;
the reinforcement learning module 330 is configured to perform reinforcement learning on the learning path of the target user according to the reinforcement learning factor in response to the learning path demonstration instruction, so as to obtain a path generation process of the learning path of the target user;
and the path generation process demonstration module 340 is used for visually demonstrating the path generation process.
According to the embodiment of the invention, the intelligent adaptive learning demonstration system determines the state space, the action space, the learning evaluation index and other strong learning factors of the learning path of the target user according to the received learning path demonstration instruction, so as to respond to the learning path demonstration instruction, carry out reinforcement learning on the learning path of the target user according to the determined reinforcement learning factors, obtain the path generation process of the learning path of the target user, and carry out visual demonstration on the path generation process, solve the problem of poor demonstration effect of the existing intelligent adaptive learning demonstration system on the learning effect, enrich the demonstration function of the intelligent adaptive learning demonstration system on the learning effect, and improve the intuitiveness and intelligence of the intelligent adaptive learning demonstration system in demonstrating the learning effect.
Optionally, the reinforcement learning factor determining module 320 is specifically configured to: determining an intelligent adaptive learning system as a knowledge point recommendation environment in a reinforcement learning model; determining the knowledge point recommendation environment type matched with the state space according to the type of the learning path demonstration instruction; determining the state space according to the attributes of student users, the attributes of knowledge points of the knowledge graph, the attributes of learning situations and the demonstration attributes; wherein the student user attributes comprise learning abilities of a target user; the knowledge point attributes of the knowledge graph comprise logical relations among knowledge points, knowledge point reference frequency, knowledge point importance degree and knowledge point difficulty; the learning situation attribute comprises a knowledge point mastering state; the demonstration attributes comprise knowledge point mastering quantity, learning path demonstration duration or knowledge point learning range; and determining the action space according to the recommended learning knowledge points of the target user.
Optionally, the type of the knowledge point recommending environment includes at least one of an adaptive engine recommending knowledge point environment and an operating user feedback recommending knowledge point environment; the adaptive engine knowledge point recommending environment is used for recommending knowledge points to the target user by adopting an adaptive engine; the operation user feedback recommendation knowledge point environment is used for recommending a knowledge point to the target user according to feedback information of an operation user; the operation user comprises a first operation user or a second operation user; the feedback information of the first operation user is used for confirming whether to accept the knowledge point recommended by the adaptive engine; and the feedback information of the second operation user is used for autonomously recommending the knowledge points to the target user.
Optionally, the reinforcement learning module 330 is specifically configured to: observing a knowledge point recommendation environment through an agent in the reinforcement learning model to obtain a multidimensional vector state; determining the action of the intelligent agent according to the action selection strategy model and the multi-dimensional vector state by the intelligent agent; executing the intelligent body action through the intelligent body so as to update the state of the knowledge point recommendation environment according to the intelligent body action execution result to obtain an updated multidimensional vector state; receiving, by the agent, a reward value determined by the knowledge point recommendation environment based on the updated multi-dimensional vector state and the agent action; determining, by the agent, an updated agent action based on the reward value, the updated multi-dimensional vector state, and the learning evaluation index; the learning evaluation index comprises a knowledge point mastering target and a teaching rule determining rule of an award value; and returning and executing the operation of observing the recommendation environment of the knowledge points by the agent to obtain the state of the multidimensional vector until the learning termination condition of the reinforcement learning model is determined to be met.
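The observe → select → execute → receive reward → update cycle that the module performs can be sketched as a toy loop. The environment class below is a hypothetical stand-in for the knowledge point recommendation environment (state as a tuple of mastery flags, fixed reward values); it is illustrative only.

```python
import random
from collections import defaultdict

class ToyKnowledgeEnv:
    """Hypothetical stand-in for the knowledge point recommendation environment:
    the state is a tuple of mastery flags; the episode ends when all are mastered."""
    def __init__(self, n_points=3):
        self.state = tuple([0] * n_points)

    def step(self, action):
        flags = list(self.state)
        newly_mastered = flags[action] == 0  # re-recommending a mastered point is penalized
        flags[action] = 1
        self.state = tuple(flags)
        reward = 1.0 if newly_mastered else -0.5
        return self.state, reward, all(flags)

def run_episode(env, Q, actions, alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=50):
    """One episode of the observe -> select -> execute -> reward -> update cycle."""
    s, total = env.state, 0.0
    for _ in range(max_steps):
        # Action selection strategy (epsilon-greedy over the current Q estimates).
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a2: Q[(s, a2)])
        s_next, r, done = env.step(a)                  # environment updates its state
        best = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])  # Q-value update
        total += r
        s = s_next
        if done:                                       # learning termination condition
            break
    return total

random.seed(0)
Q = defaultdict(float)
env = ToyKnowledgeEnv()
run_episode(env, Q, actions=[0, 1, 2])
```

Because re-recommending a mastered point earns a negative Q value, the greedy step naturally moves on to unmastered points, and the episode terminates once every mastery flag is set.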
Optionally, if the type of the knowledge point recommendation environment includes an adaptive engine recommended knowledge point environment, the reinforcement learning module 330 is specifically configured to: determining self-adaptive recommended knowledge points through a self-adaptive engine of the intelligent adaptive learning system according to a knowledge point recommendation algorithm; and determining the multidimensional vector state of the environment of the self-adaptive engine recommended knowledge point according to the self-adaptive recommended knowledge point through an agent.
Optionally, if the type of the knowledge point recommendation environment includes an operation user feedback recommended knowledge point environment, the reinforcement learning module 330 is specifically configured to: receiving feedback recommendation knowledge points determined by an operating user through the intelligent adaptive learning system; determining a multidimensional vector state of the environment of the feedback recommended knowledge point of the operation user according to the feedback recommended knowledge point through an agent; if the operation user is the first operation user, the feedback recommended knowledge point is a target self-adaptive recommended knowledge point selected by the first operation user according to the self-adaptive recommended knowledge point determined by the self-adaptive engine; and if the operation user is the second operation user, the feedback recommended knowledge point is the autonomous recommended knowledge point determined by the second operation user.
Optionally, the reinforcement learning module 330 is specifically configured to: acquiring a simulation recommendation exercise according to the knowledge points recommended by the knowledge point recommendation environment through the intelligent adaptive learning system; automatically simulating the answer result of the target user according to the student user attribute and the learning situation attribute of the target user through the intelligent adaptive learning system, and determining the knowledge point mastering state of the target user according to the answer result; and receiving the knowledge point mastering state of the target user through the agent, and determining the multidimensional vector state according to the knowledge point mastering state.
Optionally, the path generation process demonstration module 340 is specifically configured to: determining a presentation state attribute of the path generation process; and visually demonstrating the path generation process according to the demonstration state attribute.
The learning path demonstration device for deep reinforcement learning can execute the learning path demonstration method for deep reinforcement learning provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For the technical details not described in detail in this embodiment, reference may be made to a learning path demonstration method of deep reinforcement learning provided in any embodiment of the present invention.
Since the learning path demonstration apparatus for deep reinforcement learning described above is an apparatus capable of performing the learning path demonstration method for deep reinforcement learning in the embodiment of the present invention, based on the learning path demonstration method for deep reinforcement learning described in the embodiment of the present invention, those skilled in the art can understand the specific implementation manner and various variations of the learning path demonstration apparatus for deep reinforcement learning in the embodiment of the present invention, and therefore, how to implement the learning path demonstration method for deep reinforcement learning in the embodiment of the present invention by the learning path demonstration apparatus for deep reinforcement learning is not described in detail here. As long as those skilled in the art implement the apparatus for demonstrating the learning path of the deep reinforcement learning in the embodiment of the present invention, the apparatus is within the scope of the present application.
Example four
Fig. 10 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. FIG. 10 illustrates a block diagram of an exemplary electronic device 12 suitable for use in implementing embodiments of the present invention. The electronic device 12 shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in FIG. 10, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors 16, a memory 28, and a bus 18 that connects the various system components (including the memory 28 and the processors 16).
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 10, and commonly referred to as a "hard drive"). Although not shown in FIG. 10, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk-Read Only Memory (CD-ROM), a Digital Video disk (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with electronic device 12, and/or with any devices (e.g., a network card, a modem, etc.) that enable electronic device 12 to communicate with one or more other computing devices. Such communication may occur through an Input/Output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with the other modules of the electronic device 12 via the bus 18. It should be appreciated that although not shown in Fig. 10, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID (Redundant Arrays of Independent Disks) systems, tape drives, and data backup storage systems, to name a few.
The processor 16 executes various functional applications and data processing by running the program stored in the memory 28, so as to implement the learning path demonstration method for deep reinforcement learning provided by the embodiment of the present invention: receiving a learning path demonstration instruction of a target user; determining a reinforcement learning factor of a learning path of a target user according to the learning path demonstration instruction; the reinforcement learning factors comprise an intelligent agent, a learning environment, a state space, an action space and a learning evaluation index; responding to the learning path demonstration instruction, and performing reinforcement learning on the learning path of the target user according to the reinforcement learning factors to obtain a path generation process of the learning path of the target user; and visually demonstrating the path generation process.
EXAMPLE five
An embodiment of the present invention further provides a computer storage medium storing a computer program, where the computer program is executed by a computer processor to perform the learning path demonstration method for deep reinforcement learning according to any one of the above embodiments of the present invention: receiving a learning path demonstration instruction of a target user; determining a reinforcement learning factor of a learning path of a target user according to the learning path demonstration instruction; the reinforcement learning factors comprise an intelligent agent, a learning environment, a state space, an action space and a learning evaluation index; responding to the learning path demonstration instruction, and performing reinforcement learning on the learning path of the target user according to the reinforcement learning factors to obtain a path generation process of the learning path of the target user; and visually demonstrating the path generation process.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM) or flash Memory), an optical fiber, a portable compact disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A learning path demonstration method for deep reinforcement learning is characterized by comprising the following steps:
receiving a learning path demonstration instruction of a target user;
determining a reinforcement learning factor of a learning path of a target user according to the learning path demonstration instruction; the reinforcement learning factors comprise an intelligent agent, a learning environment, a state space, an action space and a learning evaluation index;
responding to the learning path demonstration instruction, and performing reinforcement learning on the learning path of the target user according to the reinforcement learning factors to obtain a path generation process of the learning path of the target user;
and visually demonstrating the path generation process.
2. The method of claim 1, wherein determining reinforcement learning factors for a target user learning path according to the learning path demonstration instruction comprises:
determining an intelligent adaptive learning system as a knowledge point recommendation environment;
determining the knowledge point recommendation environment type matched with the state space according to the type of the learning path demonstration instruction;
determining the state space according to the attributes of student users, the attributes of knowledge points of the knowledge graph, the attributes of learning situations and the demonstration attributes; wherein the student user attributes comprise learning abilities of a target user; the knowledge point attributes of the knowledge graph comprise logical relations among knowledge points, knowledge point reference frequency, knowledge point importance degree and knowledge point difficulty; the learning situation attribute comprises a knowledge point mastering state; the demonstration attributes comprise knowledge point mastering quantity, learning path demonstration duration or knowledge point learning range;
determining the action space according to the recommended learning knowledge points of the target user;
the knowledge point recommending environment type comprises at least one of an adaptive engine recommending knowledge point environment and an operating user feedback recommending knowledge point environment;
the adaptive engine knowledge point recommending environment is used for recommending knowledge points to the target user by adopting an adaptive engine;
the operation user feedback recommendation knowledge point environment is used for recommending a knowledge point to the target user according to feedback information of an operation user; the operation user comprises a first operation user or a second operation user; the feedback information of the first operation user is used for confirming whether to accept the knowledge point recommended by the adaptive engine; and the feedback information of the second operation user is used for autonomously recommending the knowledge points to the target user.
3. The method of claim 2, wherein performing reinforcement learning on the target user learning path according to the reinforcement learning factors comprises:
observing a knowledge point recommendation environment through an agent in the reinforcement learning model to obtain a multidimensional vector state;
determining, by the agent, an agent action according to an action selection policy model and the multi-dimensional vector state;
executing the intelligent agent action through the intelligent agent so as to update the state of the knowledge point recommendation environment according to the intelligent agent action execution result to obtain an updated multidimensional vector state;
receiving, by the agent, a reward value determined by the knowledge point recommendation environment based on the updated multi-dimensional vector state and the agent action;
determining, by the agent, an updated agent action based on the reward value, the updated multi-dimensional vector state, and the learning evaluation index; wherein the learning evaluation index comprises a knowledge point mastering target and a teaching-rule-based reward value determination rule;
and returning to the operation of observing the knowledge point recommendation environment through the agent to obtain the multidimensional vector state, until the learning termination condition of the reinforcement learning model is met.
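The observe/act/reward loop in claim 3 can be sketched with a toy environment. Everything here is illustrative: the `KPEnv` reward shaping, the epsilon-greedy stand-in for the "action selection policy model", and the learning rate are assumptions, not the patented method.

```python
import random

class KPEnv:
    """Toy knowledge point recommendation environment (illustrative only)."""
    def __init__(self, kps):
        self.unmastered = set(kps)

    def observe(self):
        # The "multidimensional vector state", reduced here to the set of
        # knowledge points the simulated user has not yet mastered.
        return frozenset(self.unmastered)

    def step(self, action):
        # Reward +1 for recommending a not-yet-mastered knowledge point.
        reward = 1.0 if action in self.unmastered else -1.0
        self.unmastered.discard(action)
        done = not self.unmastered            # learning termination condition
        return self.observe(), reward, done

def run_episode(env, actions, q, epsilon=0.1, max_steps=100):
    state = env.observe()
    path = []
    for _ in range(max_steps):
        # Determine the agent action via a (stand-in) action selection policy.
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q.get((state, a), 0.0))
        # Execute the action; the environment updates its state and returns a reward.
        next_state, reward, done = env.step(action)
        # Update the policy from the reward and the updated state.
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + 0.5 * (reward - old)
        path.append(action)
        state = next_state
        if done:
            break
    return path
```

With `epsilon=0.0` the loop deterministically masters every knowledge point in a handful of steps, and the sequence of actions in `path` is exactly the generated learning path.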
4. The method of claim 3, wherein if the type of knowledge point recommendation environment comprises an adaptive engine recommended knowledge point environment, then observing the knowledge point recommendation environment by the agent to obtain a multidimensional vector state comprises:
determining adaptive recommended knowledge points through an adaptive engine of the intelligent adaptive learning system according to a knowledge point recommendation algorithm;
and determining, by the agent, the multidimensional vector state of the adaptive engine recommended knowledge point environment according to the adaptive recommended knowledge points.
5. The method of claim 3, wherein if the knowledge point recommendation environment type comprises an operation user feedback recommended knowledge point environment, then observing the knowledge point recommendation environment by the agent to obtain a multidimensional vector state comprises:
receiving feedback recommendation knowledge points determined by an operating user through the intelligent adaptive learning system;
determining, by the agent, the multidimensional vector state of the operation user feedback recommended knowledge point environment according to the feedback recommended knowledge point;
wherein if the operation user is the first operation user, the feedback recommended knowledge point is a target adaptive recommended knowledge point selected by the first operation user from the adaptive recommended knowledge points determined by the adaptive engine;
and if the operation user is the second operation user, the feedback recommended knowledge point is an autonomous recommended knowledge point determined by the second operation user.
6. The method of claim 4 or 5, wherein the observing, by the agent, the knowledge point recommendation environment to obtain a multidimensional vector state further comprises:
acquiring a simulation recommendation exercise according to the knowledge points recommended by the knowledge point recommendation environment through the intelligent adaptive learning system;
automatically simulating the answer result of the target user according to the student user attribute and the learning situation attribute of the target user through the intelligent adaptive learning system, and determining the knowledge point mastering state of the target user according to the answer result;
and receiving the knowledge point mastering state of the target user through the agent, and determining the multidimensional vector state according to the knowledge point mastering state.
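The answer simulation in claim 6 can be sketched as follows. The probability model (ability minus difficulty around a 0.5 baseline) and the rule "mastered once answered correctly" are illustrative assumptions standing in for whatever model the intelligent adaptive learning system actually uses.

```python
import random

def simulate_answer(learning_ability, difficulty, rng=random.random):
    # Simulate the target user's answer to a recommended exercise:
    # higher ability and lower difficulty make a correct answer more likely.
    p_correct = max(0.0, min(1.0, 0.5 + learning_ability - difficulty))
    return rng() < p_correct

def update_mastery(mastery_state, kp, correct):
    # Determine the knowledge point mastering state from the answer result;
    # here a knowledge point counts as mastered once answered correctly.
    mastery_state = dict(mastery_state)
    mastery_state[kp] = mastery_state.get(kp, False) or correct
    return mastery_state
```

The agent would then receive the updated `mastery_state` and fold it into the multidimensional vector state it observes.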
7. The method of claim 1, wherein visually demonstrating the path generation process comprises:
determining a presentation state attribute of the path generation process;
and visually demonstrating the path generation process according to the demonstration state attribute.
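As a minimal text-based sketch of claim 7, each step of the path generation process can be rendered according to a few demonstration state attributes. The attributes shown (step index, recommended knowledge point, reward, mastered count) are illustrative choices, not the claimed set.

```python
def demonstrate_path(steps):
    # steps: iterable of (knowledge_point, reward, mastered_count) tuples,
    # one per step of the path generation process.
    lines = []
    for i, (kp, reward, mastered_count) in enumerate(steps, start=1):
        lines.append(
            f"step {i}: recommend {kp!r}  reward={reward:+.1f}  mastered={mastered_count}"
        )
    return "\n".join(lines)
```

A real demonstration would replace this textual rendering with a graphical view driven by the same per-step demonstration state attributes.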
8. A learning path demonstration apparatus for deep reinforcement learning, comprising:
the learning path demonstration instruction receiving module is used for receiving a learning path demonstration instruction of a target user;
the reinforcement learning factor determining module is used for determining reinforcement learning factors of the target user learning path according to the learning path demonstration instruction; the reinforcement learning factors comprise a state space, an action space and a learning evaluation index;
the reinforcement learning module is used for responding to the learning path demonstration instruction, performing reinforcement learning on the learning path of the target user according to the reinforcement learning factors, and obtaining a path generation process of the learning path of the target user;
and the path generation process demonstration module is used for visually demonstrating the path generation process.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more programs cause the one or more processors to implement the learning path demonstration method for deep reinforcement learning according to any one of claims 1-7.
10. A computer storage medium on which a computer program is stored, which when executed by a processor implements a learning path demonstration method for deep reinforcement learning according to any one of claims 1 to 7.
CN202110431018.7A 2021-04-21 2021-04-21 Learning path demonstration method, device, equipment and medium for deep reinforcement learning Pending CN113094495A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110431018.7A CN113094495A (en) 2021-04-21 2021-04-21 Learning path demonstration method, device, equipment and medium for deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110431018.7A CN113094495A (en) 2021-04-21 2021-04-21 Learning path demonstration method, device, equipment and medium for deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN113094495A true CN113094495A (en) 2021-07-09

Family

ID=76679046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110431018.7A Pending CN113094495A (en) 2021-04-21 2021-04-21 Learning path demonstration method, device, equipment and medium for deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113094495A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112435515A (en) * 2020-11-26 2021-03-02 江西台德智慧科技有限公司 Artificial intelligence education robot
CN114048377A (en) * 2021-11-08 2022-02-15 网易有道信息技术(北京)有限公司 Topic recommendation method and device, electronic equipment and storage medium
CN114048377B (en) * 2021-11-08 2024-05-03 网易有道信息技术(北京)有限公司 Question recommending method and device, electronic equipment and storage medium
CN116441554A (en) * 2023-04-19 2023-07-18 珠海凤泽信息科技有限公司 Gold nanorod AuNRs synthesis method and system based on reinforcement learning
CN116862727A (en) * 2023-07-06 2023-10-10 杭州睿数科技有限公司 Teaching management method based on dynamic planning
CN117114937A (en) * 2023-09-07 2023-11-24 深圳市真实智元科技有限公司 Method and device for generating exercise song based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN113094495A (en) Learning path demonstration method, device, equipment and medium for deep reinforcement learning
US8832117B2 (en) Apparatus, systems and methods for interactive dissemination of knowledge
Crandall et al. Computing the effects of operator attention allocation in human control of multiple robots
CN107423851A (en) Adaptive learning method based on learning style context aware
Greene et al. A two-tiered approach to analyzing self-regulated learning data to inform the design of hypermedia learning environments
CN109886848B (en) Data processing method, device, medium and electronic equipment
CN101630451A (en) Computer assisted instruction (CAI) expert system
González-Castro et al. Adaptive learning module for a conversational agent to support MOOC learners
Li et al. Using informative behavior to increase engagement while learning from human reward
Elmesalawy et al. Ai-based flexible online laboratory learning system for post-covid-19 era: Requirements and design
JP2024511355A (en) Science teaching system, method of use thereof, and computer readable storage medium
KR102156931B1 (en) Appratus of estimating program coded by using block coding, method and system thereof and computer program stored in recoring medium
CN117252047A (en) Teaching information processing method and system based on digital twinning
Ilić et al. Intelligent techniques in e-learning: a literature review
CN114021029A (en) Test question recommendation method and device
Zhang et al. Develop academic question recommender based on Bayesian network for personalizing student’s practice
Li et al. Practical Perception and Quality Evaluation for Teaching of Dynamic Visual Communication Design in the Context of Digital Media
Fedeli Intelligent tutoring systems: a short history and new challenges
Marković et al. INSOS—educational system for teaching intelligent systems
CN110287331A (en) Work compound group determines method, apparatus, equipment and storage medium
Conati et al. Student modeling in open-ended learning environments
Nikolakis et al. On an evolutionary information system for personalized support to plant operators
US11922827B2 (en) Learning management systems and methods therefor
KR20160006586A (en) 2016-01-19 System, method for providing avatar service and computer readable recording medium
Li Socially intelligent autonomous agents that learn from human reward

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination