WO2024111866A1

WO2024111866A1 - Reinforcement learning system for self-development

Info

Publication number: WO2024111866A1
Application number: PCT/KR2023/015319
Authority: WO
Inventors: 이정수
Original assignee: 주식회사 트위니어스
Priority date: 2022-11-25
Filing date: 2023-10-05
Publication date: 2024-05-30
Also published as: KR102518825B1

Abstract

A reinforcement learning system for self-development, of the present invention, comprises: at least one agent operating by means of at least one processor; and a non-transitory storage medium storing instructions for executing a reinforcement learning training algorithm for self-development, wherein the reinforcement learning training algorithm comprises steps in which: the current state (S: state) related to a group to which a user belongs is observed by the agent; the agent selects, on the basis of a predefined policy, an action (a: action) corresponding to the current state and recommends the user a mission corresponding to the selected action; the agent receives user satisfaction evaluated after the user has performed the mission recommended by the agent, and a vector of a modified user interest keyword of the user; the agent updates a state transition probability on the basis of the proportion of mentees who have updated a mentee interest keyword from among various mentees in a group to which the user belongs; a reward function is updated on the basis of the weighted average of the results of mentee satisfaction evaluated by the mentees after the performance of a mission recommended to the mentees who have updated the mentee interest keyword; a discount rate is updated so that reward becomes smaller as the cosine similarity between the vector of the user interest keyword and a mentor interest keyword vector of a role model mentor predetermined by each mentee who has updated the mentee interest keyword become father apart; and the policy is updated on the basis of the updated state transition probability, reward function, and discount rate, wherein the updated policy is set to allow the agent to perform the next action so that the weighted average of the reward function reaches a maximum value.

Description

Reinforcement learning system for self-development

The present invention relates to a reinforcement learning system for self-development, and more specifically, to a reinforcement learning system for self-development using reinforcement learning, which is an area of machine learning.

A mentor is a person who helps a mentee in various aspects. A mentor is a person who cares for, trusts, and encourages the mentee. A good mentor wants to be around their mentee, is experienced, and likes to help their mentee succeed in life.

A mentee is a person who develops and develops his or her own capabilities with the help of a mentor. In college, a mentee is a learner who lacks basic knowledge of their major and seeks help from a mentor to develop learning skills to supplement this, adapt to college life, and obtain information about career paths and employment.

Mentoring refers to activities in which a mentor influences a mentee. The type of mentoring is divided into 1:1 mentoring, peer mentoring, and group mentoring depending on how the relationship between mentor and mentee is formed. 1:1 mentoring refers to a relationship where an experienced mentor teaches one-on-one to inexperienced people who are in a stage of learning or transition. Peer mentoring (or group study) refers to a relationship in which colleagues of similar level support and guide one another. Group mentoring is a form in which several mentees work together under one or more experienced mentors for a specific purpose. The advantage is that we can exchange ideas and information and receive feedback as a group.

There is a demand for a platform where people can naturally decide their career path or direction of further education through mentoring.

The present invention is intended to solve the above problems, and the purpose of the present invention is to provide a reinforcement learning system for self-development.

The above and other objects and advantages of the present invention will become apparent from the following description of preferred embodiments.

The above purpose is to provide a reinforcement learning system for self-development.

A reinforcement learning system for self-development according to an embodiment of the present invention,

At least one agent operating by at least one processor;

Non-transitory storage medium that stores instructions for executing reinforcement learning training algorithms for self-development

Including,

The reinforcement learning training algorithm is,

Observation of the current state (S) related to the group to which the user belongs is performed by the agent;

The agent selects an action (a: action) corresponding to the current state based on a predefined policy, and recommending a mission corresponding to the selected action to the user;

receiving, by the agent, user satisfaction evaluated after the user performs the mission recommended by the agent, and a vector of modified user interest keywords of the user;

updating, by the agent, a state transition probability based on the ratio of mentees who have updated a mentee keyword of interest among several mentees in the group to which the user belongs;

Updating a reward function based on a weighted average of mentee satisfaction results evaluated by the mentees after performing a mission recommended to the mentees who updated the mentee interest keyword;

Updating the discount rate so that the greater the cosine similarity between the mentor interest keyword vector of a role model mentor predetermined by each mentee who updated the mentee interest keyword and the vector of the user interest keyword, the smaller the reward; and

Updating the policy based on the updated state transition probability, the reward function, and the discount rate.

Including,

The updated policy is set so that the agent takes the next action such that the weighted average of the reward function becomes the maximum value.

Preferably,

The reinforcement learning training algorithm for self-development is based on the Markov Decision Process (MDP) and uses Q-Learning to find optimal conditions for predefined variables of the Bellman Equation. It is characterized by being carried out.

Preferably,

The user interest keywords and the mentee interest keywords include natural language related to their career, advancement, and employment, and the mentor interest keywords include natural language related to mentoring content in their mentoring group,

The vector of the user interest keyword, the mentee interest keyword vector, and the mentor interest keyword vector are generated using a word embedding method of natural language processing using a neural network (NN).

Preferably,

The current state is information input by the user and the mentee and has a vector form generated by word embedding,

The action includes the agent recommending missions to be performed by the user and the mentees, and the missions are missions proposed by mentees in the group to which the user belongs and to the mentor and other mentors in related fields. It is the sum of the missions proposed by

Preferably,

The agent is deployed one per user, and the agents deployed for each user operate in different environments, and the similarity of the environments of the agents of users within a user cluster is determined by the environment of the agents of users of different user clusters. Higher than similarity.

Preferably,

Observation of the current state is performed; performing said action; The receiving step; updating the state transition probability; updating the compensation function; updating the discount rate; And the step of updating the policy is repeated multiple times,

Modification of keywords of interest is received from the user and the mentees who performed the mission recommended by the agent, and the current status of the user is updated to the next status based on the modified keywords of interest. .

Preferably,

The step of updating the reward function is calculated including the user satisfaction, and is calculated by giving a weight to the user satisfaction.

Preferably,

It further includes a grouping unit configured to match a new mentoring group into which the user will be grouped when the cosine similarity between the user interest keyword and the mentee interest keyword exceeds a predetermined threshold.

Preferably,

The grouping unit performs soft-clustering based on GMM (Gaussian Mixture Model) of unsupervised learning and collaborative filtering of unsupervised learning ( The mentoring group matching for the user is performed by applying at least one of Collaborative Filtering, mutual satisfaction for each mentor and each mentee, or RNN (Recurrent Neural Network) based on the user's new keyword vector of interest.

By using the reinforcement learning system for self-development according to the present invention, users can naturally discover which career path is more suitable for them by performing missions in the mentoring group.

In addition, users have the effect of receiving mentoring related to their career goals and acquiring related knowledge.

However, the effects of the present invention are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

In order to more fully understand the drawings cited in the detailed description of the present invention, a brief description of each drawing is provided.

1 is a schematic diagram of a reinforcement learning system for self-development of the present invention.

Figure 2 is a diagram showing a flow chart of the reinforcement learning training algorithm of the reinforcement learning system for self-development of the present invention.

Figure 3 is a diagram schematically showing the operation between the agent and the environment of the reinforcement learning system for self-development of the present invention.

Figure 4 is a diagram showing an operation flowchart of an agent of the reinforcement learning system for self-development of the present invention.

Hereinafter, the present invention will be described in detail with reference to embodiments of the present invention and drawings. These examples are merely presented as examples to explain the present invention in more detail, and it will be apparent to those skilled in the art that the scope of the present invention is not limited by these examples. .

Additionally, unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by a person skilled in the art to which the present invention pertains, and in case of conflict, this specification including definitions The description will take precedence.

In order to clearly explain the proposed invention in the drawings, parts unrelated to the description have been omitted, and similar reference numerals have been assigned to similar parts throughout the specification. And, when it is said that a part "includes" a certain component, this means that it does not exclude other components, but may further include other components, unless specifically stated to the contrary. Additionally, “unit” as used in the specification refers to a unit or block that performs a specific function.

Identification codes (first, second, etc.) for each step are used for convenience of explanation. The identification codes do not describe the order of each step, and each step does not clearly state a specific order in context. It may be carried out differently from the order specified above. That is, each step may be performed in the same order as specified, may be performed substantially simultaneously, or may be performed in the opposite order.

Reinforcement learning is an area of machine learning. Inspired by behavioral psychology, it is a method in which an agent defined within an environment recognizes the current state and selects an action or action sequence that maximizes reward among selectable actions. Because these problems are so comprehensive, they are also studied in fields such as game theory, control theory, operations science, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms.

Reinforcement learning is a field that allows people to learn on their own through repetition by later determining whether a certain action in a certain environment was a good action or a wrong action and providing a reward (or penalty).

The reinforcement learning system 100 includes at least one agent 110 operated by at least one processor and a non-transitory storage medium 130.

An example of the agent 110 in the present invention is artificial intelligence (AI).

Agents of the present invention are deployed one per user. Agents deployed for each user can operate in different environments. At this time, the similarity of the environments of the agents of users within a user cluster may be higher than the similarity of the environments of the agents of users of different user clusters.

The non-transitory storage medium 130 stores instructions for executing a reinforcement learning training algorithm for self-development.

Reinforcement learning has two components: environment and agent. The interaction between the environment and the agent will be described in more detail with reference to Figure 3 below.

The agent decides on an action in a specific environment, and the environment rewards that decision. This reward is often determined all at once after several actions are taken, rather than immediately upon action. This is because in many cases, it is not possible to immediately evaluate a specific action when that action is taken.

Reinforcement learning is closely related to deep learning, which was discussed earlier. When an agent determines its behavior and learns on its own with rewards provided by the environment, artificial neural networks, which are mainly covered in deep learning, are used. The artificial neural network determines behavior using the environment and the state of the agent as input, and if there is a reward, it positively learns from previous input values and behaviors.

In the present invention, detailed descriptions of specific formulas and related background knowledge related to reinforcement learning are omitted, and the related content can be easily understood by those skilled in the art.

The reinforcement learning training algorithm of the present invention includes a step (S210) in which the agent 110 observes the current state (S) related to the group to which the user belongs.

As used in the present invention, the term “current state (t state)” includes information input by the user and the mentee of the mentoring group to which the user belongs.

The current state has the form of a vector created using word embedding so that the agent can understand it.

In the reinforcement learning training algorithm of the present invention, the agent selects an action (a: action) corresponding to the current state based on a predefined policy, and recommends a mission corresponding to the selected action to the user (S212). Includes.

The agent 110 recommends a mission that is optimal for the observed current state to the user so that the user can perform it.

The optimal action ( c) is suggested to the user.

At this time, the latest version of Deep Q NN takes <the keyword + action combination vector of each user in the user cluster> as input, and <the weighted average of the rewards of the users in the user cluster (evaluation of satisfaction with the mission performed by each user) > is learned by reducing the loss function of the target value and NN output.

The dimensions of the <keyword (s) + action (c) combination vector> of the users within the user cluster are the same. In the case of s, a word-embedded keyword vector (the dimensions of each keyword vector for all users in the user cluster are the same). c is the user's individual missions derived from mentoring results. In other words, the total sum of the missions derived by each user in the user cluster is the number of possible cases of c in (s, c).

As used in the present invention, the term “action” refers to an action in which an agent recommends missions for users and mentees to perform.

As used in the present invention, the term "mission" is a concept similar to a project carried out in a mentoring group, and refers to missions proposed by mentees in the group to which the user belongs and to the mentor and other mentors in the related field. It means the total of the missions proposed by In the present invention, the mission can be described in the form of a 'phrase', 'clause', or sentence, such as 'Summer Vacation IT Company Intern'.

The reinforcement learning training algorithm of the present invention includes a step (S214) in which the vector of the user's modified user interest keywords and the user satisfaction evaluated after the user performs the mission recommended by the agent are received by the agent (S214). do.

As used in the present invention, the terms “user interest keyword” and “mentee interest keyword” include natural language related to their career path, advancement to higher education, and employment.

As used in the present invention, the term “mentor interest keyword” includes natural language related to the content of mentoring to the mentor's mentoring group.

Vectors of user interest keywords, mentee interest keyword vectors, and mentor interest keyword vectors are created using the word embedding method of natural language processing using a neural network (NN).

The keyword vector of interest is the embedded state vector.

In the case of keyword(s) existing in a user cluster, it is a keyword vector derived by word embedding method, and each element of the vector is roughly quantized and expressed discretely. Allows a finite number of state vectors to appear within a user cluster. Prevents the agent's environment from becoming too large. It operates by reducing the number of ‘possible’ state vector cases by performing quantization more roughly as the number of users in the user cluster decreases.

Therefore, users from different user clusters also have keyword vectors of the same ‘dimension’. At this time, if the new keyword vector that a specific user arrives at through a specific action has a significant difference in similarity from the average of the keyword vectors in the existing user cluster, the user is assigned a new user cluster similar to the user's keyword vector. .

Keyword vectors are created using a word embedding method using a neural network. In the field of natural language processing, embedding refers to the result of converting natural language used by humans into a vector, a numerical form that machines can understand, or the entire series of processes. The simplest form of embedding is to use the word frequency as a vector. In a term-document matrix, rows and columns correspond to documents. A word-document matrix is an example of the simplest form of embedding.

The reinforcement learning training algorithm of the present invention includes the step of updating the state transition probability based on the ratio of mentees who updated the mentee interest keyword among several mentees in the group to which the user belongs (S216) by an agent. ) includes.

In other words, the state transition probability can be updated based on the ratio of mentees who updated the mentee interest keyword after a specific action.

The reinforcement learning training algorithm of the present invention includes a step (S218) of updating the reward function based on the weighted average of the mentee satisfaction results evaluated by the mentees after performing the mission recommended to the mentees who updated the mentee interest keyword. Includes.

As the similarity between the keyword vector(s) of the user's role model (for example, a mentor who the user gave a high score as a result of mentoring, or a mentor who did not conduct mentoring but showed high interest in the user) and the user's keyword vector increases, the user's It is designed so that the rewards you receive can be even greater.

The step of updating the reward function (S218) includes user satisfaction and is calculated by weighting user satisfaction.

The reward function can also be updated based on the weighted average of the satisfaction results performed by users in the user cluster after performing the recommended mission.

The reinforcement learning training algorithm of the present invention is such that the farther the cosine similarity between the mentor interest keyword vector of the role model mentor predetermined by each mentee who has updated the mentee interest keyword and the vector of the user interest keyword, the smaller the reward. It includes a step of updating the discount rate (S220).

The discount rate can be updated so that the greater the cosine similarity of the average keyword vectors of the mentees in the user's group, the smaller the reward.

Pre-determined role model mentors may include mentors who were highly evaluated by the user after mentoring, or mentors who were designated as role models or expressed high interest even if mentoring was not conducted.

Cosine similarity refers to the similarity of two vectors that can be obtained using the cosine angle between the two vectors. If the directions of the two vectors are completely the same, it has a value of 1, if they form an angle of 90°, it has a value of 0, and if they have opposite directions of 180°, it has a value of -1. In other words, cosine similarity ultimately has a value between -1 and 1, and the closer the value is to 1, the higher the similarity can be judged. If you understand this intuitively, it means how similar the directions the two vectors are pointing are.

The reinforcement learning training algorithm of the present invention includes a step (S222) of updating the policy based on the updated state transition probability, reward function, and discount rate.

The updated policy is set so that the agent takes the next action so that the weighted average of the reward function reaches its maximum value.

The reinforcement learning training algorithm for self-development of the present invention is based on the Markov Decision Process (MDP) and uses Q-Learning to determine the optimal conditions of predefined variables of the Bellman Equation. It is characterized by being carried out to find.

The Bellman equation is used as a practical way to find the value of a given state. The Bellman equation deals with the relationship between the value at time t and the value at time t+1, and also deals with the relationship between the value function and the policy function. The Bellman equation is defined using a recursive relationship between the current time point (t) and the next time point (t+1).

Q Learning can be used to find the optimal policy for a given finite Markov decision process. Q learning learns the optimal policy by learning the Q function, which is a function that predicts the expected utility value of performing a given action in a given state. A policy is a rule that indicates what action to perform in a given state. After learning the Q function, the optimal policy can be derived by performing the action that gives the highest Q in each state. One of the advantages of Q Learning is that it allows you to compare the expected values of actions performed without a model of a given environment. In addition, Q Learning can be applied without any modification even in environments where transitions occur stochastically or rewards are given stochastically. It has been proven that Q Learning can learn the optimal policy that obtains the maximum reward from the current state for an arbitrary finite Markov decision process (MDP).

The reinforcement learning training algorithm for self-development of the present invention includes the steps of observing the current state (S210); performing an action (S212); receiving step (S214); Updating the state transition probability (S216); Updating the compensation function (S218); Updating the discount rate (S220); and the step of updating the policy (S222) are repeated multiple times.

The reinforcement learning training algorithm for self-development of the present invention receives modifications to keywords of interest from users and mentees who have performed missions recommended by an agent, and moves the user's current state to the next state based on the modified keywords of interest. Update.

In addition, the reinforcement learning training algorithm for self-development of the present invention evaluates the satisfaction evaluated by users and mentees after performing the mission recommended by the agent and the state vector changed as the user and mentees modify their keywords of interest. Based on this, rewards and state transition probabilities are updated and ultimately the user agent's 'policy' is updated. For example, if the user is “satisfied” with the mission after performing the “Summer Vacation IT Company Intern” mission according to the user agent’s first policy, in the next cycle, the user agent will suggest a mission according to the second policy to the user, but “ “If you are not satisfied,” the user agent proposes a mission to the user according to a third policy that is different from the second policy. In the extreme, if the degree of “not satisfied” is high, the third policy of the user agent may suggest a mission to the user in the direction of excluding “employment at an IT company” from the user’s career path.

The AI of the present invention is learned to find a that causes Q*(s, a) to have the maximum value by applying a method called Q-learning to the Bellman Equation. Here, Q*(s, a) is a function that quantitatively indicates the suitability of the next mission. A, which causes Q*(s, a) to have the maximum value, means “the most appropriate next mission (a, action) considering the mentee’s current stage (s, state).”

The reinforcement learning system for self-development of the present invention further includes a grouping unit (not shown) that matches a new mentoring group into which the user will be grouped when the cosine similarity between the user's interest keyword and the mentee's interest keyword exceeds a predetermined threshold. do.

The grouping unit performs soft-clustering based on GMM (Gaussian Mixture Model) of unsupervised learning and collaborative filtering of unsupervised learning for vectors of keywords of user interest and vectors of keywords of mentee interest. , Mentoring group matching for users is performed by applying at least one of the mutual satisfaction of each mentor and each mentee or RNN (Recurrent Neural Network) based on the user's new keyword vector of interest.

The grouping unit performs soft-clustering based on GMM (Gaussian Mixture Model) of unsupervised learning and collaborative filtering of unsupervised learning ( Collaborative Filtering), mutual satisfaction for each mentor and each mentee, or RNN (Recurrent Neural Network) based on the user's new keyword vector of interest is applied to perform group matching of mentors and mentees for the user.

Mentors and mentees each have one keyword vector, and the keyword vector is generated by a word embedding method. Typically, during soft-clustering, the keyword vectors of all users, including mentors and mentees, are mapped to the Semantic Space, and each user is clustered within this space.

Gaussian Mixture Model (GMM), as its name suggests, is a clustering algorithm that mixes multiple Gaussian distributions. The basic idea of GMM is to express the complex probability distribution that exists in reality by mixing K Gaussian distributions. At this time, K must be set directly by the person who wants to analyze the data.

Gaussian Mixture Model (GMM) is widely used for Unsupervised Learning (clustering) in machine learning. Detailed technical descriptions of the various methods of machine learning in the present invention are omitted since they are obvious and widely known to those skilled in the art.

Group matching can be performed by merging the vectors of the user's keywords of interest with the keyword vectors of mentors and mentees, and applying GMM (Gaussian Mixture Model)-based soft-clustering to the merged keyword vectors. . In addition, depending on the embodiment, collaborative filtering of normal unsupervised learning, mutual satisfaction for each mentor and each mentee, or group filtering through RNN (Recurrent Neural Network) based on the user's new interest keyword vector Matching can be performed, and group matching can also be performed through a combination of the methods listed above.

The reinforcement learning system for self-development of the present invention performs reinforcement learning in a structure where the agent 310 takes an “action” and receives a reward from the environment 330.

It is a type of machine learning technique in which a computer agent learns how to perform a task through repeated trial and error interactions in a dynamic environment. This learning approach allows the agent to make decisions that maximize the reward metric for a task without human intervention or explicit programming to perform the task.

In the reinforcement learning system for self-development of the present invention, an action refers to an action in which an agent recommends missions to be performed by users and mentees.

Mission is a concept similar to a project carried out in a mentoring group, and refers to the sum of missions proposed by mentees in the user's group and missions proposed by the mentor and other mentors in related fields. .

The reward provided to the agent in the environment is the satisfaction evaluated by users and mentees after completing the mission. The reward is a weighted average of the satisfaction results evaluated by users and mentees.

The reinforcement learning system for self-development of the present invention proceeds repeatedly by establishing a policy in such a way that the agent recommends a mission to the user so that the reward has the maximum value.

Through performing these repeated missions, users and mentees naturally acquire various knowledge related to their career path and can obtain useful information to determine their aptitude and career path.

As shown in Figure 4, the agent begins by recommending or proposing an optimal mission to the user considering the current state (410).

The user performs the mission recommended by the agent in the real world (420).

After completing the mission, the user evaluates satisfaction and updates his/her keyword vector of interest (430).

The state transition probability (T(s,a,s')) is updated based on the ratio of mentees who updated the mentee's keyword of interest among several mentees in the user's group, and the mentee who updated the mentee's keyword of interest After performing the mission recommended to the mentees, the reward function (R(s,a,s')) is updated based on the weighted average of the mentee satisfaction results evaluated by the mentees (440).

Additionally, the discount rate is updated so that the greater the cosine similarity between the vector of the mentor interest keyword of the role model mentor predetermined by the mentee and the vector of the user interest keyword, the smaller the reward.

Update the policy based on the updated state transition probability, reward function, and discount rate.

In this specification, only a few examples of various embodiments performed by the present inventors are described, but the technical idea of the present invention is not limited or limited thereto, and of course, it can be modified and implemented in various ways by those skilled in the art.

Claims

At least one agent operating by at least one processor;

Non-transitory storage medium that stores instructions for executing reinforcement learning training algorithms for self-development

Including,

The reinforcement learning training algorithm is,

Observation of the current state (S) related to the group to which the user belongs is performed by the agent;

The agent selects an action (a: action) corresponding to the current state based on a predefined policy, and recommending a mission corresponding to the selected action to the user;

receiving, by the agent, user satisfaction evaluated after the user performs the mission recommended by the agent, and a vector of modified user interest keywords of the user;

updating, by the agent, a state transition probability based on the ratio of mentees who have updated a mentee keyword of interest among several mentees in the group to which the user belongs;

Updating a reward function based on a weighted average of mentee satisfaction results evaluated by the mentees after performing a mission recommended to the mentees who updated the mentee interest keyword;

Updating the discount rate so that the greater the cosine similarity between the mentor interest keyword vector of a role model mentor predetermined by each mentee who updated the mentee interest keyword and the vector of the user interest keyword, the smaller the reward; and

Updating the policy based on the updated state transition probability, the reward function, and the discount rate.

Including,

The updated policy is set so that the agent takes the next action so that the weighted average of the reward function is the maximum value,

A reinforcement learning system for self-development.
In claim 1,

The reinforcement learning training algorithm for self-development is based on the Markov Decision Process (MDP) and uses Q-Learning to find optimal conditions for predefined variables of the Bellman Equation. Characterized by being performed,

A reinforcement learning system for self-development.
In claim 2,

The user interest keywords and the mentee interest keywords include natural language related to their career, advancement, and employment, and the mentor interest keywords include natural language related to mentoring content in their mentoring group,

The vector of the user interest keyword, the mentee interest keyword vector, and the mentor interest keyword vector are generated by word embedding of natural language processing using a neural network (NN),

A reinforcement learning system for self-development.
In claim 3,

The current state is information input by the user and the mentee and has a vector form generated by word embedding,

The action includes the agent recommending missions to be performed by the user and the mentees, and the missions are missions proposed by mentees in the group to which the user belongs and to the mentor and other mentors in related fields. A reinforcement learning system for self-development, which is the sum of the missions proposed by.
In claim 1,

The agent is deployed one per user,

The agents deployed for each user operate in different environments, and the similarity of the environments of the agents of the users within the user cluster is higher than the similarity of the environments of the agents of the users of the different user clusters.

A reinforcement learning system for self-development.
In claim 1,

Observation of the current state is performed; performing said action; The receiving step; updating the state transition probability; updating the compensation function; updating the discount rate; And the step of updating the policy is repeated multiple times,

Receiving corrections to keywords of interest from the user and the mentees who performed the mission recommended by the agent, and updating the current status of the user to the next status based on the modified keywords of interest,

A reinforcement learning system for self-development.
In claim 1,

The step of updating the compensation function is calculated including the user satisfaction, and is calculated by weighting the user satisfaction.

A reinforcement learning system for self-development.
In claim 1,

A grouping unit that performs matching with a new mentoring group into which the user will be grouped when the cosine similarity between the user interest keyword and the mentee interest keyword exceeds a predetermined threshold;

A reinforcement learning system for self-development, further comprising:
In claim 8,

The grouping unit performs soft-clustering based on GMM (Gaussian Mixture Model) of unsupervised learning and collaborative filtering of unsupervised learning ( Collaborative Filtering), which performs the mentoring group matching for the user by applying at least one of mutual satisfaction for each mentor and each mentee or a Recurrent Neural Network (RNN) based on the user's new interest keyword vector,

A reinforcement learning system for self-development.