WO2024111866A1 - Reinforcement learning system for self-development - Google Patents

Reinforcement learning system for self-development

Info

Publication number
WO2024111866A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
agent
mentee
reinforcement learning
interest
Prior art date
Application number
PCT/KR2023/015319
Other languages
French (fr)
Korean (ko)
Inventor
이정수
Original Assignee
주식회사 트위니어스
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 트위니어스 filed Critical 주식회사 트위니어스
Publication of WO2024111866A1 publication Critical patent/WO2024111866A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/08 Learning methods
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/10 Office automation; Time management
    • G06Q 50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q 50/10 Services
    • G06Q 50/20 Education

Definitions

  • the present invention relates to a reinforcement learning system for self-development, and more specifically, to a reinforcement learning system for self-development using reinforcement learning, which is an area of machine learning.
  • a mentor is a person who helps a mentee in various aspects.
  • a mentor is a person who cares for, trusts, and encourages the mentee.
  • a good mentor is someone the mentee wants to be around, is experienced, and enjoys helping the mentee succeed in life.
  • a mentee is a person who develops and advances his or her own capabilities with the help of a mentor.
  • In college, a mentee is a learner who lacks basic knowledge of their major and seeks a mentor's help to build the learning skills needed to fill that gap, adapt to college life, and obtain information about career paths and employment.
  • Mentoring refers to activities in which a mentor influences a mentee.
  • the type of mentoring is divided into 1:1 mentoring, peer mentoring, and group mentoring depending on how the relationship between mentor and mentee is formed.
  • 1:1 mentoring refers to a relationship where an experienced mentor teaches one-on-one to inexperienced people who are in a stage of learning or transition.
  • Peer mentoring (or group study) refers to a relationship in which colleagues of similar level support and guide one another.
  • Group mentoring is a form in which several mentees work together under one or more experienced mentors for a specific purpose. Its advantage is that ideas and information can be exchanged and feedback received as a group.
  • the present invention is intended to solve the above problems, and the purpose of the present invention is to provide a reinforcement learning system for self-development.
  • the above object is achieved by providing a reinforcement learning system for self-development.
  • a reinforcement learning system for self-development according to an embodiment of the present invention includes:
  • at least one agent operated by at least one processor; and
  • a non-transitory storage medium that stores instructions for executing a reinforcement learning training algorithm for self-development,
  • wherein the reinforcement learning training algorithm includes:
  • a step in which the agent selects an action (a: action) corresponding to the current state based on a predefined policy and recommends a mission corresponding to the selected action to the user;
  • the updated policy is set so that the agent takes the next action such that the weighted average of the reward function becomes the maximum value.
  • the reinforcement learning training algorithm for self-development is based on the Markov Decision Process (MDP) and is carried out using Q-learning to find the optimal values of predefined variables of the Bellman equation.
  • the user interest keywords and the mentee interest keywords include natural language related to their career path, further education, and employment, and the mentor interest keywords include natural language related to the content the mentor provides to his or her mentoring group,
  • the vector of the user interest keyword, the mentee interest keyword vector, and the mentor interest keyword vector are generated using a word embedding method of natural language processing using a neural network (NN).
  • the current state is information input by the user and the mentee and has a vector form generated by word embedding
  • the action includes the agent recommending missions to be performed by the user and the mentees, where the missions are the sum of the missions proposed by the mentees in the group to which the user belongs and the missions proposed by the mentor and other mentors in related fields.
  • one agent is deployed per user, and the agents deployed for each user operate in different environments, where the similarity between the environments of the agents of users within the same user cluster is higher than the similarity between the environments of the agents of users in different user clusters.
  • modifications of interest keywords are received from the user and the mentees who performed the missions recommended by the agent, and the user's current state is updated to the next state based on the modified interest keywords.
  • in the step of updating the reward function, the reward is calculated including the user satisfaction, with a weight applied to the user satisfaction.
  • the grouping unit performs the mentoring group matching for the user by applying, to the user interest keyword vectors and the mentee interest keyword vectors, at least one of unsupervised GMM (Gaussian Mixture Model)-based soft-clustering, unsupervised collaborative filtering, mutual satisfaction scores for each mentor and each mentee, or an RNN (Recurrent Neural Network) based on the user's new interest keyword vector.
  • users have the effect of receiving mentoring related to their career goals and acquiring related knowledge.
  • FIG. 1 is a schematic diagram of a reinforcement learning system for self-development of the present invention.
  • Figure 2 is a diagram showing a flow chart of the reinforcement learning training algorithm of the reinforcement learning system for self-development of the present invention.
  • Figure 3 is a diagram schematically showing the operation between the agent and the environment of the reinforcement learning system for self-development of the present invention.
  • Figure 4 is a diagram showing an operation flowchart of an agent of the reinforcement learning system for self-development of the present invention.
  • Identification codes (first, second, etc.) for each step are used for convenience of explanation.
  • the identification codes do not describe the order of the steps, and unless a specific order is clearly stated in context, the steps may be carried out in an order different from the specified order. That is, the steps may be performed in the specified order, substantially simultaneously, or in the reverse order.
  • Reinforcement learning is an area of machine learning.
  • Inspired by behavioral psychology, it is a method in which an agent defined within an environment recognizes the current state and selects an action or sequence of actions that maximizes reward among the selectable actions. Because these problems are very general, they are also studied in fields such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms.
  • Reinforcement learning is a field in which a system learns on its own through repetition by later judging whether a certain action taken in a certain environment was a good or a bad action and providing a reward (or penalty).
  • FIG. 1 is a schematic diagram of a reinforcement learning system for self-development of the present invention.
  • the reinforcement learning system 100 includes at least one agent 110 operated by at least one processor and a non-transitory storage medium 130.
  • An example of the agent 110 in the present invention is artificial intelligence (AI).
  • Agents of the present invention are deployed one per user. Agents deployed for each user can operate in different environments. At this time, the similarity of the environments of the agents of users within a user cluster may be higher than the similarity of the environments of the agents of users of different user clusters.
  • the non-transitory storage medium 130 stores instructions for executing a reinforcement learning training algorithm for self-development.
  • Reinforcement learning has two components: environment and agent. The interaction between the environment and the agent will be described in more detail with reference to Figure 3 below.
  • the agent decides on an action in a specific environment, and the environment rewards that decision. This reward is often determined all at once after several actions are taken, rather than immediately upon action. This is because in many cases, it is not possible to immediately evaluate a specific action when that action is taken.
  • Reinforcement learning is closely related to deep learning, which was discussed earlier.
  • When the agent decides on actions and learns on its own from the rewards given by the environment, artificial neural networks, which are mainly covered in deep learning, are used.
  • the artificial neural network determines behavior using the environment and the state of the agent as input, and if there is a reward, it positively learns from previous input values and behaviors.
  • Figure 2 is a diagram showing a flow chart of the reinforcement learning training algorithm of the reinforcement learning system for self-development of the present invention.
  • the reinforcement learning training algorithm of the present invention includes a step (S210) in which the agent 110 observes the current state (S) related to the group to which the user belongs.
  • current state includes information input by the user and the mentee of the mentoring group to which the user belongs.
  • the current state has the form of a vector created using word embedding so that the agent can understand it.
  • the reinforcement learning training algorithm of the present invention includes a step (S212) in which the agent selects an action (a: action) corresponding to the current state based on a predefined policy and recommends a mission corresponding to the selected action to the user.
  • the agent 110 recommends a mission that is optimal for the observed current state to the user so that the user can perform it.
  • the optimal action (c) is suggested to the user via the argmax of Q(s, c, theta), obtained by taking the inner product of the current user state (the embedded keyword + action combination vector) with the parameters (theta) of the latest version of the Deep Q NN.
  • the latest version of the Deep Q NN takes the <keyword + action combination vector of each user in the user cluster> as input and is trained by reducing the loss function between the output of the NN and a target value that includes the <weighted average of the rewards of the users in the user cluster (their satisfaction evaluations of the missions each user performed)>.
  • the dimension of the <keyword (s) + action (c) combination vector> is the same for all users within the user cluster.
  • s is a word-embedded keyword vector (the dimension of each keyword vector is the same for every user in the user cluster).
  • c denotes the individual missions each user has derived from mentoring; that is, the total set of missions derived by the users in the user cluster is the set of possible values of c in (s, c).
  • action refers to an action in which an agent recommends missions for users and mentees to perform.
  • the term "mission" is a concept similar to a project carried out in a mentoring group, and refers to the total of the missions proposed by the mentees in the group to which the user belongs and the missions proposed by the mentor and other mentors in the related field. In the present invention, a mission can be described in the form of a phrase, a clause, or a sentence, such as 'Summer Vacation IT Company Intern'.
  • the reinforcement learning training algorithm of the present invention includes a step (S214) in which the agent receives the user satisfaction evaluated after the user performs the mission recommended by the agent and a vector of the user's modified user interest keywords.
  • user interest keyword and “mentee interest keyword” include natural language related to their career path, advancement to higher education, and employment.
  • the term "mentor interest keyword" includes natural language related to the content the mentor provides to his or her mentoring group.
  • Vectors of user interest keywords, mentee interest keyword vectors, and mentor interest keyword vectors are created using the word embedding method of natural language processing using a neural network (NN).
  • the keyword vector of interest is the embedded state vector.
  • the keyword s existing within a user cluster is a keyword vector produced by the word embedding method, and each element of the vector is roughly quantized and expressed discretely. This allows only a finite number of state vectors to appear within a user cluster and prevents the agent's environment from becoming too large. The fewer users there are in a user cluster, the more roughly the quantization is performed, which reduces the number of 'possible' state vectors.
  • users in different user clusters also have keyword vectors of the same 'dimension'.
  • if the new keyword vector that a specific user arrives at through a specific action differs significantly in similarity from the average of the keyword vectors in the existing user cluster, the user is assigned to a new user cluster that is similar to the user's keyword vector.
  • Keyword vectors are created using a word embedding method using a neural network.
  • embedding refers to the result of converting natural language used by humans into a vector, a numerical form that machines can understand, or the entire series of processes.
  • the simplest form of embedding is to use the word frequency as a vector.
  • in a term-document matrix, the rows correspond to words and the columns correspond to documents.
  • a term-document matrix is an example of the simplest form of embedding.
  • the reinforcement learning training algorithm of the present invention includes a step (S216) in which the agent updates the state transition probability based on the ratio of mentees who have updated their mentee interest keywords among the several mentees in the group to which the user belongs.
  • the state transition probability can be updated based on the ratio of mentees who updated the mentee interest keyword after a specific action.
  • the reinforcement learning training algorithm of the present invention includes a step (S218) of updating the reward function based on the weighted average of the mentee satisfaction results evaluated by the mentees after performing the missions recommended to the mentees who updated their mentee interest keywords.
  • the system is designed so that the higher the similarity between the keyword vector (s) of the user's role model (for example, a mentor to whom the user gave a high score after mentoring, or a mentor with whom mentoring was not conducted but in whom the user showed high interest) and the user's keyword vector, the greater the reward the user receives.
  • the step of updating the reward function (S218) includes user satisfaction and is calculated by weighting user satisfaction.
  • the reward function may also be updated based on the weighted average of the satisfaction results evaluated by the users in the user cluster after performing the recommended missions.
  • the reinforcement learning training algorithm of the present invention includes a step (S220) of updating the discount rate so that the lower the cosine similarity between the mentor interest keyword vector of the role model mentor predetermined by each mentee who has updated their mentee interest keyword and the user interest keyword vector, the smaller the reward.
  • the discount rate may also be updated so that the lower the cosine similarity with the average of the keyword vectors of the mentees in the group to which the user belongs, the smaller the reward.
  • Pre-determined role model mentors may include mentors who were highly evaluated by the user after mentoring, or mentors who were designated as role models or expressed high interest even if mentoring was not conducted.
  • Cosine similarity refers to the similarity of two vectors obtained from the cosine of the angle between them. If the two vectors point in exactly the same direction, the value is 1; if they form an angle of 90°, the value is 0; and if they point in opposite directions (180°), the value is -1. In other words, cosine similarity has a value between -1 and 1, and the closer the value is to 1, the higher the similarity. Intuitively, it indicates how similar the directions of the two vectors are.
  • the reinforcement learning training algorithm of the present invention includes a step (S222) of updating the policy based on the updated state transition probability, reward function, and discount rate.
  • the updated policy is set so that the agent takes the next action so that the weighted average of the reward function reaches its maximum value.
  • the reinforcement learning training algorithm for self-development of the present invention is based on the Markov Decision Process (MDP) and is carried out using Q-learning to find the optimal values of predefined variables of the Bellman equation.
  • the Bellman equation is used as a practical way to find the value of a given state.
  • the Bellman equation deals with the relationship between the value at time t and the value at time t+1, and also deals with the relationship between the value function and the policy function.
  • the Bellman equation is defined using a recursive relationship between the current time point (t) and the next time point (t+1).
  • Q Learning can be used to find the optimal policy for a given finite Markov decision process.
  • Q learning learns the optimal policy by learning the Q function, which is a function that predicts the expected utility value of performing a given action in a given state.
  • a policy is a rule that indicates what action to perform in a given state.
  • the optimal policy can be derived by performing the action that gives the highest Q in each state.
  • One of the advantages of Q Learning is that it allows you to compare the expected values of actions performed without a model of a given environment.
  • Q Learning can be applied without any modification even in environments where transitions occur stochastically or rewards are given stochastically. It has been proven that Q Learning can learn the optimal policy that obtains the maximum reward from the current state for an arbitrary finite Markov decision process (MDP).
  • in the reinforcement learning training algorithm for self-development of the present invention, the steps of observing the current state (S210), performing the action (S212), receiving the user satisfaction and modified keyword vector (S214), updating the state transition probability (S216), updating the reward function (S218), updating the discount rate (S220), and updating the policy (S222) are repeated multiple times.
  • the reinforcement learning training algorithm for self-development of the present invention receives modifications to keywords of interest from users and mentees who have performed missions recommended by an agent, and moves the user's current state to the next state based on the modified keywords of interest. Update.
  • the reinforcement learning training algorithm for self-development of the present invention updates the rewards and state transition probabilities based on the satisfaction evaluated by the user and mentees after performing the mission recommended by the agent and on the state vector that changes as the user and mentees modify their interest keywords, and ultimately updates the user agent's 'policy'. For example, if the user is "satisfied" after performing the "Summer Vacation IT Company Intern" mission according to the user agent's first policy, in the next cycle the user agent suggests a mission to the user according to a second policy; if the user is "not satisfied", the user agent suggests a mission according to a third policy different from the second policy. In the extreme case, if the degree of dissatisfaction is high, the third policy may suggest missions that exclude "employment at an IT company" from the user's career path.
  • the AI of the present invention is trained to find the a that maximizes Q*(s, a) by applying a method called Q-learning to the Bellman equation.
  • Q*(s, a) is a function that quantitatively indicates the suitability of the next mission.
  • the a that maximizes Q*(s, a) means "the most appropriate next mission (a: action) considering the mentee's current stage (s: state)".
  • the reinforcement learning system for self-development of the present invention further includes a grouping unit (not shown) that matches a new mentoring group into which the user will be grouped when the cosine similarity between the user interest keyword and the mentee interest keyword exceeds a predetermined threshold.
  • for the user interest keyword vectors and the mentee interest keyword vectors, the grouping unit performs unsupervised GMM (Gaussian Mixture Model)-based soft-clustering and unsupervised collaborative filtering.
  • the grouping unit performs group matching of mentors and mentees for the user by applying at least one of unsupervised GMM-based soft-clustering, unsupervised collaborative filtering, mutual satisfaction scores for each mentor and each mentee, or an RNN (Recurrent Neural Network) based on the user's new interest keyword vector.
  • Mentors and mentees each have one keyword vector, and the keyword vector is generated by a word embedding method.
  • the keyword vectors of all users, including mentors and mentees are mapped to the Semantic Space, and each user is clustered within this space.
  • Group matching can be performed by merging the user's interest keyword vector with the keyword vectors of the mentors and mentees and applying GMM (Gaussian Mixture Model)-based soft-clustering to the merged keyword vectors.
  • Group matching can also be performed through ordinary unsupervised collaborative filtering, mutual satisfaction scores for each mentor and each mentee, or an RNN (Recurrent Neural Network) based on the user's new interest keyword vector, or through a combination of the methods listed above.
  • Figure 3 is a diagram schematically showing the operation between the agent and the environment of the reinforcement learning system for self-development of the present invention.
  • the reinforcement learning system for self-development of the present invention performs reinforcement learning in a structure where the agent 310 takes an “action” and receives a reward from the environment 330.
  • an action refers to an action in which an agent recommends missions to be performed by users and mentees.
  • A mission is a concept similar to a project carried out in a mentoring group, and refers to the sum of the missions proposed by the mentees in the user's group and the missions proposed by the mentor and other mentors in related fields.
  • the reward provided to the agent in the environment is the satisfaction evaluated by users and mentees after completing the mission.
  • the reward is a weighted average of the satisfaction results evaluated by users and mentees.
  • the reinforcement learning system for self-development of the present invention proceeds repeatedly by establishing a policy in such a way that the agent recommends a mission to the user so that the reward has the maximum value.
  • Figure 4 is a diagram showing an operation flowchart of an agent of the reinforcement learning system for self-development of the present invention.
  • the agent begins by recommending or proposing an optimal mission to the user considering the current state (410).
  • the user performs the mission recommended by the agent in the real world (420).
  • After completing the mission, the user evaluates satisfaction and updates his or her interest keyword vector (430).
  • The state transition probability (T(s,a,s')) is updated based on the ratio of mentees who updated their mentee interest keywords among the several mentees in the user's group, and the reward function (R(s,a,s')) is updated based on the weighted average of the mentee satisfaction results evaluated by the mentees who updated their mentee interest keywords (440).
  • The discount rate is updated so that the lower the cosine similarity between the mentor interest keyword vector of the role model mentor predetermined by the mentee and the user interest keyword vector, the smaller the reward.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Human Resources & Organizations (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Primary Health Care (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A reinforcement learning system for self-development of the present invention comprises: at least one agent operating by means of at least one processor; and a non-transitory storage medium storing instructions for executing a reinforcement learning training algorithm for self-development, wherein the reinforcement learning training algorithm comprises steps in which: the current state (S: state) related to a group to which a user belongs is observed by the agent; the agent selects, on the basis of a predefined policy, an action (a: action) corresponding to the current state and recommends to the user a mission corresponding to the selected action; the agent receives the user satisfaction evaluated after the user has performed the mission recommended by the agent, and a vector of the modified user interest keyword of the user; the agent updates a state transition probability on the basis of the proportion of mentees who have updated a mentee interest keyword from among the several mentees in the group to which the user belongs; a reward function is updated on the basis of the weighted average of the results of the mentee satisfaction evaluated by the mentees after performing the missions recommended to the mentees who have updated the mentee interest keyword; a discount rate is updated so that the reward becomes smaller as the cosine similarity between the vector of the user interest keyword and the mentor interest keyword vector of a role model mentor predetermined by each mentee who has updated the mentee interest keyword becomes farther apart; and the policy is updated on the basis of the updated state transition probability, reward function, and discount rate, wherein the updated policy is set to allow the agent to perform the next action so that the weighted average of the reward function reaches a maximum value.

Description

Reinforcement learning system for self-development
The present invention relates to a reinforcement learning system for self-development, and more specifically, to a reinforcement learning system for self-development that uses reinforcement learning, an area of machine learning.
A mentor is a person who helps a mentee in various aspects. A mentor is a person who cares for, trusts, and encourages the mentee. A good mentor is someone the mentee wants to be around, is experienced, and enjoys helping the mentee succeed in life.
A mentee is a person who develops and advances his or her own capabilities with the help of a mentor. In college, a mentee is a learner who lacks basic knowledge of their major and seeks a mentor's help to build the learning skills needed to fill that gap, adapt to college life, and obtain information about career paths and employment.
Mentoring refers to activities in which a mentor influences a mentee. Mentoring is divided into 1:1 mentoring, peer mentoring, and group mentoring depending on how the relationship between mentor and mentee is formed. 1:1 mentoring refers to a relationship in which an experienced mentor teaches, one-on-one, an inexperienced person who is in a stage of learning or transition. Peer mentoring (or group study) refers to a relationship in which peers at a similar level support and guide one another. Group mentoring is a form in which several mentees work together under one or more experienced mentors for a specific purpose. Its advantage is that ideas and information can be exchanged and feedback received as a group.
There is a demand for a platform through which people can naturally decide their career path or direction of further education through mentoring.
The present invention is intended to solve the above problems, and the purpose of the present invention is to provide a reinforcement learning system for self-development.
The above and other objects and advantages of the present invention will become apparent from the following description of preferred embodiments.
The above object is achieved by providing a reinforcement learning system for self-development.
A reinforcement learning system for self-development according to an embodiment of the present invention includes:
at least one agent operated by at least one processor; and
a non-transitory storage medium that stores instructions for executing a reinforcement learning training algorithm for self-development,
wherein the reinforcement learning training algorithm includes:
a step in which the agent observes the current state (S: state) related to the group to which the user belongs;
a step in which the agent selects an action (a: action) corresponding to the current state based on a predefined policy and recommends a mission corresponding to the selected action to the user;
a step in which the agent receives the user satisfaction evaluated after the user performs the mission recommended by the agent and a vector of the user's modified user interest keywords;
a step in which the agent updates the state transition probability based on the ratio of mentees who have updated their mentee interest keywords among the several mentees in the group to which the user belongs;
a step of updating the reward function based on the weighted average of the mentee satisfaction results evaluated by the mentees after performing the missions recommended to the mentees who updated their mentee interest keywords;
a step of updating the discount rate so that the lower the cosine similarity between the mentor interest keyword vector of the role model mentor predetermined by each mentee who updated their mentee interest keyword and the user interest keyword vector, the smaller the reward; and
a step of updating the policy based on the updated state transition probability, reward function, and discount rate,
wherein the updated policy is set so that the agent takes the next action such that the weighted average of the reward function becomes the maximum value.
Preferably,
the reinforcement learning training algorithm for self-development is based on the Markov Decision Process (MDP) and is carried out using Q-learning to find the optimal values of predefined variables of the Bellman equation.
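By way of illustration only, the following is a minimal tabular Q-learning sketch of the kind of Bellman-equation update this paragraph refers to. The state and action indices, the learning rate, the exploration rate, and the reward signal are hypothetical placeholders, not values specified by the present invention; the sketch only shows how Q-learning searches for the action that maximizes the expected reward.

```python
import numpy as np

# Hypothetical sizes: discretized (quantized) states and candidate missions (actions).
N_STATES, N_ACTIONS = 50, 10
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount rate, exploration rate

Q = np.zeros((N_STATES, N_ACTIONS))

def choose_action(state: int) -> int:
    """Epsilon-greedy policy over the Q-table."""
    if np.random.rand() < EPSILON:
        return np.random.randint(N_ACTIONS)
    return int(np.argmax(Q[state]))

def q_update(state: int, action: int, reward: float, next_state: int) -> None:
    """One Bellman-equation update step of Q-learning."""
    td_target = reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (td_target - Q[state, action])
```

In the setting described here, `reward` would correspond to the weighted average of satisfaction scores and `next_state` to the re-embedded interest keywords, but those mappings are assumptions made for the sake of the sketch.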
Preferably,
the user interest keywords and the mentee interest keywords include natural language related to their career path, further education, and employment, and the mentor interest keywords include natural language related to the content the mentor provides to his or her mentoring group, and
the user interest keyword vector, the mentee interest keyword vector, and the mentor interest keyword vector are generated by the word embedding method of natural language processing using a neural network (NN).
Preferably,
the current state is information input by the user and the mentees and has the form of a vector generated by word embedding, and
the action includes the agent recommending missions to be performed by the user and the mentees, where the missions are the sum of the missions proposed by the mentees in the group to which the user belongs and the missions proposed by the mentor and other mentors in related fields.
Preferably,
one agent is deployed per user, and the agents deployed for each user operate in different environments, where the similarity between the environments of the agents of users within the same user cluster is higher than the similarity between the environments of the agents of users in different user clusters.
Preferably,
the step of observing the current state, the step of performing the action, the receiving step, the step of updating the state transition probability, the step of updating the reward function, the step of updating the discount rate, and the step of updating the policy are repeated multiple times, and
modifications of interest keywords are received from the user and the mentees who performed the missions recommended by the agent, and the user's current state is updated to the next state based on the modified interest keywords.
Preferably,
in the step of updating the reward function, the reward is calculated including the user satisfaction, with a weight applied to the user satisfaction.
Preferably,
the system further includes a grouping unit that matches a new mentoring group into which the user will be grouped when the cosine similarity between the user interest keyword and the mentee interest keyword exceeds a predetermined threshold.
Preferably,
the grouping unit performs the mentoring group matching for the user by applying, to the user interest keyword vectors and the mentee interest keyword vectors, at least one of unsupervised GMM (Gaussian Mixture Model)-based soft-clustering, unsupervised collaborative filtering, mutual satisfaction scores for each mentor and each mentee, or an RNN (Recurrent Neural Network) based on the user's new interest keyword vector.
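As an illustration of the GMM-based soft-clustering option, a minimal sketch is given below. It assumes scikit-learn is available and that the interest keyword vectors have already been word-embedded; the number of mentoring groups, the embedding dimension, and the random example data are hypothetical choices, not parameters prescribed by the present invention.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical embedded interest-keyword vectors (users, mentees, mentors), one row per person.
keyword_vectors = np.random.rand(30, 16)  # 30 people, 16-dimensional embeddings

# Soft-clustering: each person receives a membership probability for every mentoring group.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(keyword_vectors)
membership = gmm.predict_proba(keyword_vectors)  # shape (30, 4), each row sums to 1

# A new user could then be matched to the group with the highest membership probability.
new_user = np.random.rand(1, 16)
best_group = int(np.argmax(gmm.predict_proba(new_user)))
```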
By using the reinforcement learning system for self-development according to the present invention, a user can naturally discover which career path is more suitable for them as they perform missions in a mentoring group.
In addition, the user receives mentoring related to their career goals and acquires related knowledge.
However, the effects of the present invention are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.
In order to more fully understand the drawings cited in the detailed description of the present invention, a brief description of each drawing is provided.
Figure 1 is a schematic diagram of the reinforcement learning system for self-development of the present invention.
Figure 2 is a flowchart of the reinforcement learning training algorithm of the reinforcement learning system for self-development of the present invention.
Figure 3 is a diagram schematically showing the operation between the agent and the environment of the reinforcement learning system for self-development of the present invention.
Figure 4 is an operation flowchart of the agent of the reinforcement learning system for self-development of the present invention.
Hereinafter, the present invention will be described in detail with reference to embodiments of the present invention and the drawings. These embodiments are presented merely as examples to explain the present invention in more detail, and it will be apparent to those of ordinary skill in the art that the scope of the present invention is not limited by these embodiments.
Additionally, unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by a person skilled in the art to which the present invention pertains, and in case of conflict, this specification, including its definitions, takes precedence.
In order to clearly explain the proposed invention, parts unrelated to the description have been omitted from the drawings, and similar reference numerals are used for similar parts throughout the specification. When a part is said to "include" a certain component, this means that it may further include other components rather than excluding them, unless specifically stated to the contrary. Additionally, a "unit" as used in the specification refers to a unit or block that performs a specific function.
Identification codes (first, second, etc.) for each step are used for convenience of explanation. The identification codes do not describe the order of the steps, and unless a specific order is clearly stated in context, the steps may be carried out in an order different from the specified order. That is, the steps may be performed in the specified order, substantially simultaneously, or in the reverse order.
Reinforcement learning is an area of machine learning. Inspired by behavioral psychology, it is a method in which an agent defined within an environment recognizes the current state and selects an action or sequence of actions that maximizes reward among the selectable actions. Because these problems are very general, they are also studied in fields such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms.
Reinforcement learning is a field in which a system learns on its own through repetition by later judging whether a certain action taken in a certain environment was a good or a bad action and providing a reward (or penalty).
Figure 1 is a schematic diagram of the reinforcement learning system for self-development of the present invention.
The reinforcement learning system 100 includes at least one agent 110 operated by at least one processor and a non-transitory storage medium 130.
An example of the agent 110 in the present invention is an artificial intelligence (AI).
One agent of the present invention is deployed per user. The agents deployed for each user can operate in different environments. Here, the similarity between the environments of the agents of users within the same user cluster may be higher than the similarity between the environments of the agents of users in different user clusters.
The non-transitory storage medium 130 stores instructions for executing the reinforcement learning training algorithm for self-development.
Reinforcement learning has two components: the environment and the agent. The interaction between the environment and the agent is described in more detail with reference to Figure 3 below.
The agent decides on an action in a specific environment, and the environment gives a reward for that decision. This reward is often determined all at once after several actions have been taken rather than immediately after each action, because in many cases a specific action cannot be evaluated at the moment it is taken.
Reinforcement learning is closely related to deep learning, discussed above. When the agent decides on actions and learns on its own from the rewards given by the environment, artificial neural networks, which are mainly covered in deep learning, are used. The artificial neural network determines the behavior using the environment and the state of the agent as input, and if there is a reward, it learns positively from the previous input values and behaviors.
In the present invention, detailed descriptions of the specific formulas and background knowledge related to reinforcement learning are omitted; the related content can be easily understood by those skilled in the art.
Figure 2 is a flowchart of the reinforcement learning training algorithm of the reinforcement learning system for self-development of the present invention.
The reinforcement learning training algorithm of the present invention includes a step (S210) in which the agent 110 observes the current state (S: state) related to the group to which the user belongs.
As used in the present invention, the term "current state" includes information input by the user and the mentees of the mentoring group to which the user belongs.
The current state has the form of a vector created by word embedding so that the agent can understand it.
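One way to picture such a state vector is sketched below. The pre-trained embedding lookup, the keyword strings, and the choice of averaging the embedded keywords are illustrative assumptions, not the construction prescribed by the present invention.

```python
import numpy as np

# Hypothetical pre-trained embedding lookup: keyword string -> 8-dimensional vector.
EMBED = {kw: np.random.rand(8) for kw in
         ["data science", "IT company", "graduate school", "internship", "startup"]}

def state_vector(user_keywords, mentee_keywords):
    """Builds the group state observed by the agent from word-embedded interest keywords."""
    vecs = [EMBED[k] for k in user_keywords + mentee_keywords if k in EMBED]
    return np.mean(vecs, axis=0)  # one fixed-dimension vector the agent can consume

s = state_vector(["data science", "internship"], ["IT company", "startup"])
```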
The reinforcement learning training algorithm of the present invention includes a step (S212) in which the agent selects an action (a: action) corresponding to the current state based on a predefined policy and recommends a mission corresponding to the selected action to the user.
The agent 110 recommends the mission that is optimal for the observed current state to the user so that the user can perform it.
The optimal action (c) is suggested to the user via the argmax of Q(s, c, theta), obtained by taking the inner product of the current user state (the embedded keyword + action combination vector) with the parameters (theta) of the latest version of the Deep Q NN.
Here, the latest version of the Deep Q NN takes the <keyword + action combination vector of each user in the user cluster> as input and is trained by reducing the loss function between the output of the NN and a target value that includes the <weighted average of the rewards of the users in the user cluster (their satisfaction evaluations of the missions each user performed)>.
The dimension of the <keyword (s) + action (c) combination vector> is the same for all users within the user cluster. Here, s is a word-embedded keyword vector (the dimension of each keyword vector is the same for every user in the user cluster), and c denotes the individual missions each user has derived from mentoring. In other words, the total set of missions derived by the users in the user cluster is the set of possible values of c in (s, c).
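A compact sketch of this action selection and update is given below. It assumes, for simplicity, a linear Q function so that Q(s, c, theta) is literally an inner product of the combined keyword + action vector with a parameter vector theta; the vector dimension, learning rate, and candidate missions are hypothetical, and a real Deep Q NN would replace the linear model with a neural network.

```python
import numpy as np

DIM = 16                       # dimension of the <keyword + action combination vector>
theta = np.zeros(DIM)          # parameters of the (here linear) Q model
LR = 0.01

def q_value(sc_vector: np.ndarray) -> float:
    """Q(s, c, theta) as an inner product of the combination vector with theta."""
    return float(np.dot(sc_vector, theta))

def suggest_action(candidate_sc_vectors):
    """argmax over the candidate missions c available in the user cluster."""
    q_values = [q_value(v) for v in candidate_sc_vectors]
    return int(np.argmax(q_values))

def update(sc_vector: np.ndarray, target: float) -> None:
    """One gradient step reducing the squared loss between the target
    (which includes the weighted-average reward of the cluster) and the model output."""
    error = target - q_value(sc_vector)
    theta += LR * error * sc_vector
```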
As used in the present invention, the term "action" refers to the operation in which the agent recommends missions for the user and mentees to perform.
As used in the present invention, the term "mission" is a concept similar to a project carried out in a mentoring group, and refers to the total of the missions proposed by the mentees in the group to which the user belongs and the missions proposed by the mentor and other mentors in the related field. In the present invention, a mission can be described in the form of a phrase, a clause, or a sentence, such as 'Summer Vacation IT Company Intern'.
The reinforcement learning training algorithm of the present invention includes a step (S214) in which the agent receives the user satisfaction evaluated after the user performs the mission recommended by the agent and a vector of the user's modified user interest keywords.
As used in the present invention, the terms "user interest keyword" and "mentee interest keyword" include natural language related to the person's career path, further education, and employment.
As used in the present invention, the term "mentor interest keyword" includes natural language related to the content the mentor provides to his or her mentoring group.
The user interest keyword vector, the mentee interest keyword vector, and the mentor interest keyword vector are generated by the word embedding method of natural language processing using a neural network (NN).
The interest keyword vector is the embedded state vector.
The keyword s existing within a user cluster is a keyword vector produced by the word embedding method, and each element of the vector is roughly quantized and expressed discretely. This allows only a finite number of state vectors to appear within a user cluster and prevents the agent's environment from becoming too large. The fewer users there are in a user cluster, the more roughly the quantization is performed, which reduces the number of 'possible' state vectors.
Therefore, users in different user clusters also have keyword vectors of the same 'dimension'. If the new keyword vector that a specific user arrives at through a specific action differs significantly in similarity from the average of the keyword vectors in the existing user cluster, the user is assigned to a new user cluster that is similar to the user's keyword vector.
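A sketch of this rough quantization and cluster-reassignment idea follows. The bin widths, the similarity threshold, and the cluster means are purely illustrative assumptions.

```python
import numpy as np

def quantize(keyword_vector: np.ndarray, n_users_in_cluster: int) -> np.ndarray:
    """Quantize more roughly when the cluster has fewer users, so fewer distinct states exist."""
    step = 0.5 if n_users_in_cluster < 10 else 0.1   # hypothetical bin widths
    return np.round(keyword_vector / step) * step

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def maybe_reassign(new_vector, cluster_mean, other_cluster_means, threshold=0.5):
    """If the new vector drifts too far from its cluster's mean, move the user
    to the cluster whose mean is most similar to the new vector."""
    if cosine(new_vector, cluster_mean) < threshold:
        sims = [cosine(new_vector, m) for m in other_cluster_means]
        return int(np.argmax(sims))   # index of the new, more similar cluster
    return None                       # stay in the current cluster
```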
Keyword vectors are created by a word embedding method using a neural network. In the field of natural language processing, embedding refers to the result of converting the natural language used by humans into a vector, a numerical form that machines can understand, or to the entire series of processes for doing so. The simplest form of embedding is to use word frequencies directly as a vector. In a term-document matrix, the rows correspond to words and the columns correspond to documents; the term-document matrix is an example of the simplest form of embedding.
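The term-document matrix just mentioned can be built in a few lines, as in the sketch below; the toy documents are hypothetical and serve only to show frequency-based embedding in its simplest form.

```python
from collections import Counter

docs = ["summer vacation IT company intern",
        "graduate school data science",
        "IT company data engineer intern"]          # hypothetical documents

vocab = sorted({w for d in docs for w in d.lower().split()})

# Rows correspond to words, columns correspond to documents; entries are word frequencies.
term_document = [[Counter(d.lower().split())[w] for d in docs] for w in vocab]

for word, row in zip(vocab, term_document):
    print(f"{word:10s} {row}")
```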
The reinforcement learning training algorithm of the present invention includes a step (S216) in which the agent updates the state transition probability based on the ratio of mentees, among the several mentees in the group to which the user belongs, who have updated their mentee interest keywords.
In other words, the state transition probability can be updated based on the ratio of mentees who updated their mentee interest keywords after a specific action.
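A minimal sketch of this empirical update follows. Representing the transition model as a nested dictionary and estimating it from observed frequencies are implementation assumptions; the present invention only specifies that the ratio of mentees who updated their interest keywords drives the update.

```python
from collections import defaultdict, Counter

T = defaultdict(dict)      # T[(s, a)][s_next] -> estimated transition probability

def update_transition_probability(T, s, a, observed_next_states):
    """observed_next_states: the discrete state vectors of the group's mentees
    after action a; mentees who did not update their keywords remain in state s."""
    counts = Counter(observed_next_states)
    total = sum(counts.values())
    T[(s, a)] = {s_next: n / total for s_next, n in counts.items()}
    return T

# e.g. after one action, 3 of 5 mentees moved to state (1, 2) and 2 stayed at (0, 2)
update_transition_probability(T, (0, 2), "summer IT internship",
                              [(1, 2), (1, 2), (1, 2), (0, 2), (0, 2)])
```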
The reinforcement learning training algorithm of the present invention includes a step (S218) of updating the reward function based on the weighted average of the mentee satisfaction results evaluated by the mentees who updated their mentee interest keywords, after those mentees perform the recommended missions.
The system is designed so that the reward the user receives becomes larger as the similarity between the user's keyword vector and the keyword vector(s) of the user's role model (for example, a mentor to whom the user gave a high score after mentoring, or a mentor with whom no mentoring took place but in whom the user showed high interest) increases.
The step of updating the reward function (S218) includes the user satisfaction in the calculation, with a weight assigned to the user satisfaction.
The reward function may also be updated based on the weighted average of the satisfaction results reported by the users in the user cluster after performing the recommended missions.
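A sketch of the reward computation described above follows. The specific weights (the user's own satisfaction counted twice, every mentee counted once) are assumptions chosen only to illustrate a weighted average; the present invention does not fix particular weight values.

```python
import numpy as np

def updated_reward(user_satisfaction, mentee_satisfactions,
                   user_weight=2.0, mentee_weight=1.0):
    """Weighted average of satisfaction scores with extra weight on the user's own score."""
    scores = [user_satisfaction] + list(mentee_satisfactions)
    weights = [user_weight] + [mentee_weight] * len(mentee_satisfactions)
    return float(np.average(scores, weights=weights))

# user rated the recommended mission 4/5; three mentees rated it 5, 3 and 4
reward = updated_reward(4, [5, 3, 4])   # -> 4.0
```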
The reinforcement learning training algorithm of the present invention includes a step (S220) of updating the discount rate so that the reward becomes smaller as the cosine similarity between the mentor interest keyword vector of the role model mentor predetermined by each mentee who updated the mentee interest keywords and the vector of the user interest keywords becomes lower.
The discount rate may also be updated so that the reward becomes smaller as the cosine similarity to the average of the keyword vectors of the mentees in the group to which the user belongs becomes lower.
The predetermined role model mentor may include a mentor whom the user rated highly after mentoring, or a mentor whom the user designated as a role model or in whom the user expressed high interest even without mentoring.
Cosine similarity is the similarity of two vectors obtained from the cosine of the angle between them. It is 1 when the two vectors point in exactly the same direction, 0 when they form a 90° angle, and -1 when they point in opposite directions at 180°. In other words, cosine similarity takes values between -1 and 1, and the closer the value is to 1, the higher the similarity. Intuitively, it expresses how similar the directions of the two vectors are.
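The sketch below implements the cosine similarity just defined and one possible way of letting lower similarity shrink the discount rate, and hence the discounted future reward. The linear mapping and the bounds gamma_min and gamma_max are assumptions; the present invention only requires that the reward become smaller as the similarity decreases.

```python
import numpy as np

def cosine_similarity(a, b):
    """1 for identical directions, 0 for orthogonal vectors, -1 for opposite directions."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def updated_discount_rate(role_model_vec, user_vec, gamma_min=0.1, gamma_max=0.99):
    """Map similarity in [-1, 1] linearly onto [gamma_min, gamma_max]:
    the lower the similarity, the smaller the discount rate and the future reward."""
    frac = (cosine_similarity(role_model_vec, user_vec) + 1.0) / 2.0
    return gamma_min + frac * (gamma_max - gamma_min)
```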
The reinforcement learning training algorithm of the present invention includes a step (S222) of updating the policy based on the updated state transition probability, reward function, and discount rate.
The updated policy is set so that the agent takes the next action that maximizes the weighted average of the reward function.
The reinforcement learning training algorithm for self-development of the present invention is based on a Markov decision process (MDP) and is performed so as to find the optimal condition of the predefined variables of the Bellman equation using Q-learning.
The Bellman equation is used as a practical means of computing the value of a given state. It describes the relationship between the value at time t and the value at time t+1, as well as the relationship between the value function and the policy function, and it is defined using the recursive relationship between the current time step (t) and the next time step (t+1).
Q-learning can be used to find the optimal policy for a given finite Markov decision process. Q-learning learns the optimal policy by learning the Q-function, which predicts the expected utility of performing a given action in a given state. A policy is a rule that indicates which action to perform in a given state. Once the Q-function has been learned, the optimal policy can be derived by performing, in each state, the action with the highest Q-value. One advantage of Q-learning is that it can compare the expected values of actions without a model of the environment. Moreover, Q-learning can be applied without modification even in environments where transitions occur stochastically or rewards are given stochastically. It has been proven that, for any finite Markov decision process (MDP), Q-learning can learn an optimal policy that obtains the maximum reward from the current state.
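A standard tabular Q-learning update, consistent with the Bellman-equation formulation described above, is sketched below. The learning rate alpha is a generic assumption, and in the present invention the discount rate gamma would be the value produced in step S220.

```python
from collections import defaultdict

Q = defaultdict(float)          # Q[(state, action)] -> estimated action value

def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions) if actions else 0.0
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
    return Q
```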
In the reinforcement learning training algorithm for self-development of the present invention, the step in which the current state is observed (S210), the step of performing an action (S212), the receiving step (S214), the step of updating the state transition probability (S216), the step of updating the reward function (S218), the step of updating the discount rate (S220), and the step of updating the policy (S222) are repeated multiple times.
The reinforcement learning training algorithm for self-development of the present invention receives modifications of interest keywords from the user and the mentees who have performed the mission recommended by the agent, and updates the user's current state to the next state based on the modified interest keywords.
In addition, in the reinforcement learning training algorithm for self-development of the present invention, the reward and the state transition probability are updated based on the satisfaction evaluated by the user and the mentees after performing the mission recommended by the agent and on the state vectors changed as the user and the mentees modify their interest keywords, and ultimately the user agent's "policy" is updated. For example, if the user performs the mission "summer vacation internship at an IT company" according to the user agent's first policy and is "satisfied" with the mission, then in the next cycle the user agent will propose a mission according to a second policy; if the user is "not satisfied", the user agent will propose a mission according to a third policy that differs from the second policy. In the extreme case, if the degree of dissatisfaction is large, the user agent's third policy may propose missions in a direction that excludes "employment at an IT company" from the user's career path.
The AI of the present invention applies a method called Q-learning to the Bellman equation and is trained to find the action a that maximizes Q*(s, a). Here, Q*(s, a) is a function that quantitatively expresses the suitability of the next mission. The action a that maximizes Q*(s, a) means "the next mission (a, action) that is most suitable given the mentee's current stage (s, state)."
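Once the Q-function has been learned, "the most suitable next mission given the mentee's current stage" is simply the argmax over the candidate missions, as in the sketch below; the mission names and hand-filled Q-values are hypothetical.

```python
def best_next_mission(Q, s, candidate_missions):
    """Return the action a that maximizes Q*(s, a) for the current state s."""
    return max(candidate_missions, key=lambda a: Q.get((s, a), 0.0))

# toy example with a hand-filled Q table and hypothetical mission names
Q = {((1, 0), "summer IT internship"): 0.8, ((1, 0), "coding bootcamp"): 0.3}
print(best_next_mission(Q, (1, 0), ["summer IT internship", "coding bootcamp"]))
```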
The reinforcement learning system for self-development of the present invention further includes a grouping unit (not shown) that performs matching of a new mentoring group into which the user is to be grouped when the cosine similarity between the user interest keywords and the mentee interest keywords exceeds a predetermined threshold.
The grouping unit performs mentoring group matching for the user by applying, to the vector of the user interest keywords and the vectors of the mentee interest keywords, at least one of: Gaussian mixture model (GMM)-based soft clustering of unsupervised learning; collaborative filtering of unsupervised learning; and a recurrent neural network (RNN) based on the mutual satisfaction of each mentor and each mentee or on the user's new interest keyword vector.
The grouping unit performs mentor-mentee group matching for the user by applying, to the vector of the user's interest keywords and the keyword vectors of the mentors and mentees, at least one of: Gaussian mixture model (GMM)-based soft clustering of unsupervised learning; collaborative filtering of unsupervised learning; and a recurrent neural network (RNN) based on the mutual satisfaction of each mentor and each mentee or on the user's new interest keyword vector.
Each mentor and each mentee has one keyword vector, and the keyword vector is generated by the word embedding method. Typically, during soft clustering, the keyword vectors of all users, including mentors and mentees, are mapped into a semantic space, and the users are clustered within this space.
As its name suggests, the Gaussian mixture model (GMM) is a clustering algorithm that mixes several Gaussian distributions. The basic idea of GMM is to express a complex real-world probability distribution as a mixture of K Gaussian distributions, where K must be set by the person analyzing the data.
The Gaussian mixture model (GMM) is widely used for unsupervised learning (clustering) in machine learning. Detailed technical descriptions of the various machine learning methods used in the present invention are omitted, since they are obvious and widely known to those skilled in the art.
Group matching can be performed by merging the vector of the user's interest keywords with the keyword vectors of the mentors and mentees and applying GMM-based soft clustering to the merged keyword vectors. In addition, depending on the embodiment, group matching can be performed through ordinary unsupervised collaborative filtering, or through a recurrent neural network (RNN) based on the mutual satisfaction of each mentor and each mentee or on the user's new interest keyword vector, and it can also be performed through a combination of the methods listed above.
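A minimal soft-clustering sketch using scikit-learn's GaussianMixture is shown below. The random vectors stand in for real embedded keyword vectors of the user, mentors, and mentees, and the number of components K is an arbitrary choice that, as noted above, must be set by the analyst.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
keyword_vectors = rng.normal(size=(30, 16))   # 30 people (user + mentors + mentees), 16-dim embeddings

K = 4                                         # number of Gaussian components / mentoring groups
gmm = GaussianMixture(n_components=K, covariance_type="full", random_state=0)
gmm.fit(keyword_vectors)

membership = gmm.predict_proba(keyword_vectors)   # soft clustering: (30, K) membership probabilities
user_group = int(membership[0].argmax())          # most likely mentoring group for the user (row 0)
```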
Figure 3 is a diagram schematically showing the interaction between the agent and the environment in the reinforcement learning system for self-development of the present invention.
The reinforcement learning system for self-development of the present invention performs reinforcement learning in a structure in which the agent 310 takes an "action" and receives a reward from the environment 330.
Reinforcement learning is a type of machine learning technique in which a computer agent learns how to perform a task through repeated trial-and-error interactions with a dynamic environment. This learning approach allows the agent to make decisions that maximize a reward metric for the task without human intervention and without being explicitly programmed to perform the task.
In the reinforcement learning system for self-development of the present invention, an action refers to the agent's operation of recommending missions to be performed by the user and the mentees.
A mission is a concept similar to a project carried out in a mentoring group, and refers to the sum of the missions proposed by the mentees in the group to which the user belongs and the missions proposed by the group's mentor and by other mentors in related fields.
The reward that the environment provides to the agent is the satisfaction evaluated by the user and the mentees after performing the mission. The reward is a weighted average of the satisfaction results evaluated by the user and the mentees.
The reinforcement learning system for self-development of the present invention proceeds iteratively by establishing a policy in which the agent recommends missions to the user so that the reward is maximized.
Through this repeated performance of missions, the user and the mentees naturally acquire knowledge related to their career paths and can obtain information useful for determining their aptitudes and careers.
Figure 4 is a flowchart showing the operation of the agent of the reinforcement learning system for self-development of the present invention.
As shown in Figure 4, the process begins with the agent recommending or proposing an optimal mission to the user in consideration of the current state (410).
The user performs the mission recommended by the agent in the real world (420).
After performing the mission, the user evaluates his or her satisfaction and updates his or her interest keyword vector (430).
The state transition probability T(s, a, s') is updated based on the ratio of mentees who have updated their mentee interest keywords among the several mentees in the group to which the user belongs, and the reward function R(s, a, s') is updated based on the weighted average of the mentee satisfaction results evaluated by those mentees after performing the recommended missions (440).
Additionally, the discount rate is updated so that the reward becomes smaller as the cosine similarity between the vector of the mentor interest keywords of the role model mentor predetermined by the mentee and the vector of the user interest keywords becomes lower.
The policy is then updated based on the updated state transition probability, reward function, and discount rate.
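The self-contained sketch below runs a few cycles of the Figure 4 flow end to end. The mission names, random satisfaction scores, and hand-picked next states are stand-ins for real user feedback and re-embedded keyword vectors, and the weighting and learning constants are assumptions rather than values from the present invention.

```python
import random
from collections import defaultdict

missions = ["summer IT internship", "coding bootcamp", "research mentoring"]
Q = defaultdict(float)
state = (1, 0, 2)                                   # 410: observed current (quantized) state

for step in range(3):
    mission = max(missions, key=lambda a: Q[(state, a)])      # recommend the best-known mission
    user_sat = random.uniform(0, 5)                 # 420-430: stand-ins for real satisfaction scores
    mentee_sats = [random.uniform(0, 5) for _ in range(4)]
    reward = (2 * user_sat + sum(mentee_sats)) / (2 + len(mentee_sats))   # weighted average
    next_state = (1, step % 3, 2)                   # stand-in for the re-embedded keyword vector
    alpha, gamma = 0.1, 0.9                         # 440: gamma would come from step S220
    best_next = max(Q[(next_state, a)] for a in missions)
    Q[(state, mission)] += alpha * (reward + gamma * best_next - Q[(state, mission)])
    state = next_state                              # repeat with the updated (greedy) policy
```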
Although only a few of the various embodiments carried out by the inventors are described in this specification, the technical idea of the present invention is not limited thereto, and it may of course be modified and implemented in various ways by those skilled in the art.

Claims (9)

  1. A reinforcement learning system for self-development, comprising:
    at least one agent operated by at least one processor; and
    a non-transitory storage medium storing instructions for executing a reinforcement learning training algorithm for self-development,
    wherein the reinforcement learning training algorithm comprises:
    a step in which an observation of a current state (s: state) related to a group to which a user belongs is performed by the agent;
    a step in which the agent selects an action (a: action) corresponding to the current state based on a predefined policy and recommends a mission corresponding to the selected action to the user;
    a step in which user satisfaction evaluated after the user performs the mission recommended by the agent, and a vector of the user's modified user interest keywords, are received by the agent;
    a step of updating, by the agent, a state transition probability based on the ratio of mentees who have updated their mentee interest keywords among a plurality of mentees in the group to which the user belongs;
    a step of updating a reward function based on a weighted average of mentee satisfaction results evaluated by the mentees who updated the mentee interest keywords after performing the missions recommended to them;
    a step of updating a discount rate so that the reward becomes smaller as the cosine similarity between the mentor interest keyword vector of a role model mentor predetermined by each mentee who updated the mentee interest keywords and the vector of the user interest keywords becomes lower; and
    a step of updating the policy based on the updated state transition probability, the updated reward function, and the updated discount rate,
    wherein the updated policy is set so that the agent takes a next action such that the weighted average of the reward function is maximized.
  2. The reinforcement learning system for self-development according to claim 1, wherein the reinforcement learning training algorithm for self-development is based on a Markov decision process (MDP) and is performed so as to find an optimal condition of predefined variables of the Bellman equation using Q-learning.
  3. The reinforcement learning system for self-development according to claim 2, wherein the user interest keywords and the mentee interest keywords include natural language related to the user's and the mentee's own career path, further education, and employment, and the mentor interest keywords include natural language related to the content that the mentor provides to his or her mentoring group, and
    wherein the vector of the user interest keywords, the vector of the mentee interest keywords, and the vector of the mentor interest keywords are generated by a word embedding method of natural language processing using a neural network (NN).
  4. The reinforcement learning system for self-development according to claim 3, wherein the current state is information input by the user and the mentees and has the form of a vector generated by the word embedding method, and
    wherein the action includes an operation in which the agent recommends missions to be performed by the user and the mentees, the missions being the sum of the missions proposed by the mentees in the group to which the user belongs and the missions proposed by the mentor and by other mentors in related fields.
  5. The reinforcement learning system for self-development according to claim 1, wherein one agent is deployed per user, and
    wherein the agents deployed for the respective users each operate in different environments, the similarity between the environments of the agents of users within a same user cluster being higher than the similarity between the environments of the agents of users in different user clusters.
  6. The reinforcement learning system for self-development according to claim 1, wherein the step in which the observation of the current state is performed, the step of performing the action, the receiving step, the step of updating the state transition probability, the step of updating the reward function, the step of updating the discount rate, and the step of updating the policy are repeated a plurality of times, and
    wherein modifications of interest keywords are received from the user and the mentees who performed the missions recommended by the agent, and the current state of the user is updated to a next state based on the modified interest keywords.
  7. The reinforcement learning system for self-development according to claim 1, wherein the step of updating the reward function is calculated so as to include the user satisfaction, with a weight assigned to the user satisfaction.
  8. The reinforcement learning system for self-development according to claim 1, further comprising a grouping unit that performs matching of a new mentoring group into which the user is to be grouped when the cosine similarity between the user interest keywords and the mentee interest keywords exceeds a predetermined threshold.
  9. The reinforcement learning system for self-development according to claim 8, wherein the grouping unit performs the mentoring group matching for the user by applying, to the vector of the user interest keywords and the vectors of the mentee interest keywords, at least one of: Gaussian mixture model (GMM)-based soft clustering of unsupervised learning; collaborative filtering of unsupervised learning; and a recurrent neural network (RNN) based on the mutual satisfaction of each mentor and each mentee or on the user's new interest keyword vector.
PCT/KR2023/015319 2022-11-25 2023-10-05 Reinforcement learning system for self-development WO2024111866A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020220159928A KR102518825B1 (en) 2022-11-25 2022-11-25 Reinforcement learning system for self-development
KR10-2022-0159928 2022-11-25

Publications (1)

Publication Number Publication Date
WO2024111866A1 true WO2024111866A1 (en) 2024-05-30

Family

ID=85918412

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/015319 WO2024111866A1 (en) 2022-11-25 2023-10-05 Reinforcement learning system for self-development

Country Status (2)

Country Link
KR (1) KR102518825B1 (en)
WO (1) WO2024111866A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102518825B1 (en) * 2022-11-25 2023-04-06 이정수 Reinforcement learning system for self-development
CN116957172B (en) * 2023-09-21 2024-01-16 山东大学 Dynamic job shop scheduling optimization method and system based on deep reinforcement learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102177568B1 (en) * 2018-04-09 2020-11-11 주식회사 뷰노 Method for semi supervised reinforcement learning using data with label and data without label together and apparatus using the same
KR20210091442A (en) * 2020-01-14 2021-07-22 주식회사 해피캔버스 System and server and operating method to curate study contents by VR contents
KR20210157337A (en) * 2020-06-18 2021-12-28 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Recommendation system optimization method, device, equipment and computer storage medium
KR102408115B1 (en) * 2020-07-30 2022-06-10 동명대학교산학협력단 System for platform of job mentoring employing incumbent of real-name basis
KR102440817B1 (en) * 2020-02-19 2022-09-06 사회복지법인 삼성생명공익재단 Reinforcement learning method, device, and program for identifying causal effect in logged data
KR102518825B1 (en) * 2022-11-25 2023-04-06 이정수 Reinforcement learning system for self-development

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130082901A (en) 2011-12-22 2013-07-22 주식회사 케이티 Apparatus and method for providing online mentor-mentee service

Also Published As

Publication number Publication date
KR102518825B1 (en) 2023-04-06
