WO2022120970A1 - Method and system for order dispatch based on interactive reinforcement learning - Google Patents

Method and system for order dispatch based on interactive reinforcement learning

Info

Publication number
WO2022120970A1
Authority
WO
WIPO (PCT)
Prior art keywords
human
order
order dispatching
training
data
Prior art date
Application number
PCT/CN2020/139231
Other languages
French (fr)
Chinese (zh)
Inventor
金铭
王洋
须成忠
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Publication of WO2022120970A1 publication Critical patent/WO2022120970A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0633Lists, e.g. purchase orders, compilation or processing
    • G06Q30/0635Processing of requisition or of purchase orders
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06Q50/40

Definitions

  • the invention relates to the field of Internet information technology, in particular to an order dispatching method and system based on interactive reinforcement learning.
  • the existing order dispatching technology is mainly an autonomous learning approach based on reinforcement learning: a Markov decision process is constructed by defining the agent, the environment state, and the agent actions, and a state transition function and a reward function are built from the environment state and agent actions, on the basis of which an optimal policy is trained to maximize the overall benefit of order dispatching.
  • the main technical problem solved by the present invention is to provide an order dispatching method based on interactive reinforcement learning, which introduces human-computer interaction into the autonomous learning process and integrates the human-computer interaction modes of human demonstration, interference, and evaluation. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the self-learning results shifts the learning process toward a better order dispatching strategy and accelerates learning, so as to obtain the optimal order dispatching strategy. An order dispatching system based on interactive reinforcement learning is also provided.
  • a technical solution adopted by the present invention is to provide an order dispatching method based on interactive reinforcement learning, which includes the following steps:
  • Step S1: model the order dispatching task for imitation training;
  • Step S2: provide demonstration instances of order dispatching that imitate human behavior as sequences of states and actions, and imitate the order dispatching strategy behavior demonstrated by humans through autonomous learning;
  • Step S3: imitate the behavior of human intervention through autonomous learning when entering a catastrophic state or an error state with which humans are not satisfied;
  • Step S4: imitate the behavior of human evaluation feedback through autonomous learning;
  • Step S5: enter the pure reinforcement learning stage for training, so as to obtain the optimal order dispatching strategy.
  • in step S2, when human demonstration data is generated, demonstration records of order dispatching are collected from the human demonstrations, and Gaussian regression is used to train on the human demonstration data, thereby training an order dispatching strategy.
  • in step S2, if new human demonstration data is generated, the training that imitates the human demonstration data is repeated.
  • in step S3, the order dispatching task is executed using the order dispatching strategy trained in step S2; when human interference data is generated, the action data of the human interference with order dispatching is collected, and the executed order dispatching action is changed according to that data, thereby training a new order dispatching strategy.
  • in step S3, if new human interference data is generated, the training that imitates the human interference data is repeated.
  • in step S4, the order dispatching task is executed using the order dispatching strategy trained in step S3; when human evaluation data is generated, the human evaluation data is collected, and the order dispatching actions are evaluated through a reward signal, thereby training a new order dispatching strategy.
  • in step S4, if new human evaluation data is generated, the training that imitates human evaluation is repeated.
  • in step S3, when human interference data is generated, the action data of the human interference with order dispatching is collected, the executed order dispatching action is changed according to that data, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to update the order dispatching strategy and train a new order dispatching strategy.
  • in step S4, the order dispatching task is executed using the order dispatching strategy, human evaluation data is collected, the quality of the current order dispatching is evaluated through a reward signal, and the evaluator is then trained using the reward signal and the temporal-difference error, thereby training a new order dispatching strategy.
  • An order dispatching system based on interactive reinforcement learning, including:
  • a modeling module, used to train and model the order dispatching task;
  • an imitating-human-demonstration training module, used for training, through autonomous learning, to imitate the order dispatching strategy behavior demonstrated by humans;
  • an imitating-human-interference training module, used for training, through autonomous learning, to imitate the behavior of human intervention;
  • an imitating-human-evaluation training module, used for training, through autonomous learning, to imitate the behavior of human evaluation feedback;
  • a reinforcement training module, used for pure reinforcement learning training, so as to obtain the optimal order dispatching strategy.
  • the present invention introduces human-computer interaction into the autonomous learning process and integrates the human-computer interaction modes of human demonstration, interference, and evaluation. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the autonomous learning results shifts the learning process toward a better order dispatching strategy, accelerates learning, and yields the optimal order dispatching strategy.
  • Fig. 1 is a block diagram of the steps of the order dispatching method based on interactive reinforcement learning of the present invention.
  • Fig. 2 is a flow diagram of the order dispatching method based on interactive reinforcement learning of the present invention.
  • Fig. 3 is a flow diagram of the learning and execution of imitating-human-demonstration training of the present invention.
  • Fig. 4 is a flow diagram of the learning and execution of imitating-human-interference training of the present invention.
  • Fig. 5 is a flow diagram of the learning and execution of imitating-human-evaluation training of the present invention.
  • the present invention provides an order dispatch method based on interactive reinforcement learning, comprising the following steps:
  • Step S1: model the order dispatching task for imitation training;
  • Step S2: provide demonstration instances of order dispatching that imitate human behavior as sequences of states and actions, and imitate the order dispatching strategy behavior demonstrated by humans through autonomous learning;
  • Step S3: imitate the behavior of human intervention through autonomous learning when entering a catastrophic state or an error state with which humans are not satisfied;
  • Step S4: imitate the behavior of human evaluation feedback through autonomous learning;
  • Step S5: enter the pure reinforcement learning stage for training, so as to obtain the optimal order dispatching strategy.
  • human-computer interaction is introduced into the autonomous learning process and the human-computer interaction modes of human demonstration, interference, and evaluation are integrated: learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the autonomous learning results shifts the learning process toward a better order dispatching strategy and accelerates learning, so as to obtain the optimal order dispatching strategy.
  • for the order dispatching task, the optimal order dispatching strategy is first trained using the reinforcement learning method, and the interactive reinforcement learning training process is executed iteratively for the human-computer interaction modes of human demonstration, interference, and evaluation that appear during training: learn from human demonstrations, learn from human interference, learn from human evaluations, and finally enter the pure reinforcement learning stage to continue training.
  • Agent: an idle vehicle is regarded as an agent, and vehicles at the same space-time node are homogeneous, that is, vehicles located in the same area within the same time interval are regarded as the same agent (with the same policy).
  • agent i denotes the i-th agent.
  • Action: the joint action indicates the allocation strategy of all available vehicles at the same time; the action space of a single agent specifies the position the agent can reach next, represented as a set of 7 discrete actions: the first six discrete actions assign the agent to one of its six adjacent grids, and the last discrete action keeps it in the current grid.
  • Reward function: all agents in the same grid share the same reward function, and agent i tries to maximize its discounted reward. The individual reward associated with agent i's action is defined as the average reward of all agents that arrive in the same grid as agent i at the same time; individual rewards at the same time and place are identical.
  • State transition probability: it gives the probability of transitioning to the next state when a joint action is taken in the current state; although the action itself is deterministic, new vehicles and orders appear in different grids at each step, and existing vehicles go offline through a random process.
  • the reinforcement learning algorithm can be trained using the actor-critic method (centralized training, decentralized execution): all agents share a central critic that evaluates the order dispatching actions and updates the order dispatching policy, while during execution the agents independently follow their learned policies, without needing the centralized critic.
  • in step S2, when human demonstration data is generated, demonstration records of order dispatching are collected from the human demonstrations and trained with Gaussian regression, so as to train an order dispatching strategy; if new human demonstration data is generated, the training that imitates the human demonstration data is repeated.
  • learning from human demonstrations trains the autonomous learning process to imitate human behavior in the order dispatching task: a human acts as a demonstrator, providing demonstration instances of order dispatching as sequences of states and actions; using these demonstrations, the autonomous learning process imitates the order dispatching policy (a mapping from states to actions) demonstrated by humans. Learning from demonstrations provides a more direct path to the expected order dispatching behaviors and converges quickly to more stable order dispatching behavior.
  • in step S3, the order dispatching task is executed using the order dispatching strategy trained in step S2; when human interference data is generated, the action data of the human interference with order dispatching is collected and the executed order dispatching action is changed accordingly, training a new order dispatching strategy; if new human interference data is generated, the training that imitates the human interference data is repeated. Further, when human interference data is generated, the interference action data is collected, the executed action is changed accordingly, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to update the order dispatching strategy and train a new order dispatching strategy.
  • the learning process executes the order dispatching task according to the order dispatching strategy πrl that the agent has learned autonomously by interacting with the environment or has learned from human demonstrations; when human interference data is generated, the action data of the human interference with order dispatching is collected and the executed order dispatching action is changed accordingly; an intervention reward is computed from the degree of human interference, and this reward signal and its associated temporal-difference error are used to train a value function (the evaluator, or critic) that evaluates the actions taken by the actor; the actor-critic policy gradient method is used to update the order dispatching strategy πrl. If new human interference data is generated, the above steps are repeated; otherwise, the trained order dispatching strategy πrl is output.
  • in step S4, the order dispatching task is executed using the order dispatching strategy trained in step S3; when human evaluation data is generated, the human evaluation data is collected and the order dispatching actions are evaluated through a reward signal, so as to train a new order dispatching strategy; if new human evaluation data is generated, the training that imitates human evaluation is repeated. Further, the order dispatching task is executed using the order dispatching strategy, human evaluation data is collected, the quality of the current order dispatching is evaluated through a reward signal, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to train a new order dispatching strategy.
  • when learning from human evaluations, humans act as supervisors, providing real-time evaluations (or critiques) that interactively shape the behavior of the autonomous learning process; this leverages human domain knowledge and intent to shape the agent's actions through sparse interaction in the form of evaluative feedback. The human evaluator only needs to understand the task goal, not how the task is performed, which minimizes the human evaluator's burden.
  • the learning process executes the order dispatching task according to the order dispatching strategy πrl that the agent has learned autonomously through interaction with the environment or has learned from human interference; when human evaluation data is generated, the human plays the role of a supervisor, the system's actions are evaluated through a reward signal, and the human evaluation data is collected; similar to the learning-from-human-intervention stage, the quality of the current order dispatching is evaluated through a reward signal, and this reward signal and its associated temporal-difference error are used to update the evaluator's value function and the policy. If new human evaluation data is generated, the above steps are repeated; otherwise, the trained order dispatching strategy πrl is output.
  • in order to integrate the three different human-computer interaction modes, the actor-critic training method (centralized training, decentralized execution) is used for training.
  • initially only the actor is trained; the critic is then added. The actor is trained during the learning-from-human-demonstration stage, and both the actor and the critic are trained during the learning-from-human-interference stage.
  • the critic trained during the learning-from-human-evaluation stage then assumes the role of supervisor, and finally the actor and the critic are combined in a standard actor-critic reinforcement learning architecture driven by a learned reward model.
  • the order dispatching method based on interactive reinforcement learning may be called the interactive learning method Cycle-of-HRL (HRL: interactive reinforcement learning), which achieves efficient order dispatching; this method is based on interactive reinforcement learning.
  • interactive reinforcement learning here fuses the multiple human-computer interaction modes of learning from human demonstrations, learning from human interference, and learning from human evaluations, and formulates an integrated scheme for these interaction modes.
  • Performance metrics: once the order dispatching strategy reaches a certain level, predefined performance metrics can indicate when to switch the interaction mode; alternatively, when no improvement in system performance is observed, the people interacting with the system can switch between interaction modes manually.
  • Data modality limitations: depending on the task, the number of data modalities for which humans can provide demonstrations, interventions, or evaluations may be limited, in which case the interactive learning method Cycle-of-HRL switches between modes according to data availability.
  • Advantage function: after the reward function is trained, the advantage function (the difference between the state-action value function Q(s, a) and the state value function V(s), which compares the expected return of a given state-action pair with the expected return in that state) can be computed and used to compare expected returns between human and system actions.
  • Cycle-of-HRL can switch the interaction mode when the system's advantage function exceeds the human's advantage function.
  • the present invention also provides an order dispatching system based on interactive reinforcement learning, including:
  • a modeling module, used to train and model the order dispatching task;
  • an imitating-human-demonstration training module, used for training, through autonomous learning, to imitate the order dispatching strategy behavior demonstrated by humans;
  • an imitating-human-interference training module, used for training, through autonomous learning, to imitate the behavior of human intervention;
  • an imitating-human-evaluation training module, used for training, through autonomous learning, to imitate the behavior of human evaluation feedback;
  • a reinforcement training module, used for pure reinforcement learning training, so as to obtain the optimal order dispatching strategy.
  • human-computer interaction is introduced and the multiple human-computer interaction modes of human demonstration, interference, and evaluation are integrated, so as to reduce the search space, accelerate the learning process, and improve accuracy.
  • Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the self-learning results shifts the learning process toward a better order dispatching strategy and accelerates learning.
  • the present invention formulates an integration scheme for the three interaction modes and designs criteria for switching among them.

Abstract

Provided are a method and system for order dispatch based on interactive reinforcement learning. The method introduces human-computer interaction into the autonomous learning process and integrates the human-computer interaction modes of human demonstration, interference, and evaluation: learning from human demonstration, the real demonstration data makes it possible to better simulate actual order dispatch scenarios; learning from human interference, the agent's behavior is controlled when an incorrect action occurs during autonomous learning, avoiding incorrect results; learning from human evaluation, manual evaluation of the autonomous learning results shifts the learning process toward a better dispatch strategy and accelerates learning, thereby obtaining an optimal order dispatch strategy.

Description

An order dispatching method and system based on interactive reinforcement learning
Technical Field
The invention relates to the field of Internet information technology, and in particular to an order dispatching method and system based on interactive reinforcement learning.
Background Art
Online ride-hailing applications and platforms have become a novel and popular way to provide on-demand transportation services through mobile apps. At present, ride-hailing mobile applications such as Didi, Uber, and Lyft are popular all over the world; such systems serve a large number of passengers every day and generate a large number of ride-hailing orders. For example, Didi, China's largest online ride-hailing service provider, needs to process about 11 million orders every day. The order dispatching problem for online ride-hailing services is essentially a reasonable matching between potential passengers and drivers: in this scenario, after an online user arrives, the user needs to be assigned an optimal service provider. In many cases the service is reusable: the service provider disappears for a period of time after being matched with a user, and the user rejoins the system after using the service. Here, the offline service providers are the different drivers; when a potential passenger sends a request, the system matches the passenger with a nearby driver, and in most cases the driver rejoins the system after completing the service and can be matched again.
Existing order dispatching technology is mainly an autonomous learning approach based on reinforcement learning: a Markov decision process is constructed by defining the agent, the environment state, and the agent actions, and a state transition function and a reward function are constructed from the environment state and agent actions, on the basis of which an optimal policy is trained to maximize the overall benefit of order dispatching.
Existing techniques generally carry out a completely autonomous training and learning process through traditional reinforcement learning interacting with the environment. However, this fully autonomous learning approach lacks human participation: the learning process takes a great deal of time, the agent's behavior cannot be controlled during learning, which may lead to erroneous results, and the learned results struggle to simulate complex real-world scenarios.
Summary of the Invention
The main technical problem solved by the present invention is to provide an order dispatching method based on interactive reinforcement learning, which introduces human-computer interaction into the autonomous learning process and integrates the human-computer interaction modes of human demonstration, interference, and evaluation. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the self-learning results shifts the learning process toward a better order dispatching strategy and accelerates learning, so as to obtain the optimal order dispatching strategy. An order dispatching system based on interactive reinforcement learning is also provided.
To solve the above technical problems, one technical solution adopted by the present invention is to provide an order dispatching method based on interactive reinforcement learning, which includes the following steps:
Step S1: model the order dispatching task for imitation training;
Step S2: provide demonstration instances of order dispatching that imitate human behavior as sequences of states and actions, and imitate the order dispatching strategy behavior demonstrated by humans through autonomous learning;
Step S3: imitate the behavior of human intervention through autonomous learning when entering a catastrophic state or an error state with which humans are not satisfied;
Step S4: imitate the behavior of human evaluation feedback through autonomous learning;
Step S5: enter the pure reinforcement learning stage for training, so as to obtain the optimal order dispatching strategy.
As an improvement of the present invention, in step S2, when human demonstration data is generated, demonstration records of order dispatching are collected from the human demonstrations, and Gaussian regression is used to train on the human demonstration data, thereby training an order dispatching strategy.
As a further improvement of the present invention, in step S2, if new human demonstration data is generated, the training that imitates the human demonstration data is repeated.
As a still further improvement of the present invention, in step S3, the order dispatching task is executed using the order dispatching strategy trained in step S2; when human interference data is generated, the action data of the human interference with order dispatching is collected, and the executed order dispatching action is changed according to that data, thereby training a new order dispatching strategy.
As a still further improvement of the present invention, in step S3, if new human interference data is generated, the training that imitates the human interference data is repeated.
As a still further improvement of the present invention, in step S4, the order dispatching task is executed using the order dispatching strategy trained in step S3; when human evaluation data is generated, the human evaluation data is collected, and the order dispatching actions are evaluated through a reward signal, thereby training a new order dispatching strategy.
As a still further improvement of the present invention, in step S4, if new human evaluation data is generated, the training that imitates human evaluation is repeated.
As a still further improvement of the present invention, in step S3, when human interference data is generated, the action data of the human interference with order dispatching is collected, the executed order dispatching action is changed according to that data, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to update the order dispatching strategy and train a new order dispatching strategy.
As a still further improvement of the present invention, in step S4, the order dispatching task is executed using the order dispatching strategy, human evaluation data is collected, the quality of the current order dispatching is evaluated through a reward signal, and the evaluator is then trained using the reward signal and the temporal-difference error, thereby training a new order dispatching strategy.
An order dispatching system based on interactive reinforcement learning includes:
a modeling module, used to train and model the order dispatching task;
an imitating-human-demonstration training module, used for training, through autonomous learning, to imitate the order dispatching strategy behavior demonstrated by humans;
an imitating-human-interference training module, used for training, through autonomous learning, to imitate the behavior of human intervention;
an imitating-human-evaluation training module, used for training, through autonomous learning, to imitate the behavior of human evaluation feedback;
a reinforcement training module, used for pure reinforcement learning training, so as to obtain the optimal order dispatching strategy.
The beneficial effects of the present invention are as follows: compared with the prior art, the present invention introduces human-computer interaction into the autonomous learning process and integrates the human-computer interaction modes of human demonstration, interference, and evaluation. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the autonomous learning results shifts the learning process toward a better order dispatching strategy and accelerates learning, thereby obtaining the optimal order dispatching strategy.
Fig. 1 is a block diagram of the steps of the order dispatching method based on interactive reinforcement learning of the present invention;
Fig. 2 is a flow diagram of the order dispatching method based on interactive reinforcement learning of the present invention;
Fig. 3 is a flow diagram of the learning and execution of imitating-human-demonstration training of the present invention;
Fig. 4 is a flow diagram of the learning and execution of imitating-human-interference training of the present invention;
Fig. 5 is a flow diagram of the learning and execution of imitating-human-evaluation training of the present invention.
Detailed Description of the Embodiments
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention, not to limit it.
As shown in Fig. 1, the present invention provides an order dispatching method based on interactive reinforcement learning, comprising the following steps:
Step S1: model the order dispatching task for imitation training;
Step S2: provide demonstration instances of order dispatching that imitate human behavior as sequences of states and actions, and imitate the order dispatching strategy behavior demonstrated by humans through autonomous learning;
Step S3: imitate the behavior of human intervention through autonomous learning when entering a catastrophic state or an error state with which humans are not satisfied;
Step S4: imitate the behavior of human evaluation feedback through autonomous learning;
Step S5: enter the pure reinforcement learning stage for training, so as to obtain the optimal order dispatching strategy.
In the present invention, human-computer interaction is introduced into the autonomous learning process, and the human-computer interaction modes of human demonstration, interference, and evaluation are integrated. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the autonomous learning results shifts the learning process toward a better order dispatching strategy and accelerates learning, so as to obtain the optimal order dispatching strategy.
As shown in Fig. 2, for the order dispatching task, the optimal order dispatching strategy is first trained using the reinforcement learning method, and the interactive reinforcement learning training process is executed iteratively for the human-computer interaction modes of human demonstration, interference, and evaluation that appear during training: learn from human demonstrations, learn from human interference, learn from human evaluations, and finally enter the pure reinforcement learning stage to continue training.
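The overall flow of Fig. 2 can be summarized as a control loop that cycles through the three interaction modes before a final pure reinforcement learning phase. The Python sketch below illustrates only that control flow; the function names and the way human data availability is queried are assumptions introduced here, not names used in the patent.

```python
from typing import Any, Callable

def cycle_of_hrl(policy: Any,
                 has_new_data: Callable[[str], bool],
                 train_phase: Callable[[str, Any], Any],
                 train_pure_rl: Callable[[Any], Any]) -> Any:
    """Cycle through the interactive phases (steps S2-S4), then pure RL (step S5).

    has_new_data(mode)        -- True while new human data of that mode exists
    train_phase(mode, policy) -- runs one interactive training phase and returns
                                 the updated order dispatching policy
    """
    modes = ("demonstration", "interference", "evaluation")
    while any(has_new_data(m) for m in modes):
        for mode in modes:
            if has_new_data(mode):
                policy = train_phase(mode, policy)
    return train_pure_rl(policy)  # step S5: continue with pure reinforcement learning
```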
In step S1, the order dispatching task is modeled with a Markov decision process and constructed as a Markov model G of N agents, G = (N, S, A, P, R, γ), where N, S, A, P, R, and γ are, respectively, the number of agents, the state set, the joint action space, the transition probability function, the reward function, and the discount factor, defined as follows:
1. Agent: an idle vehicle is regarded as an agent, and vehicles at the same space-time node are homogeneous, that is, vehicles located in the same area within the same time interval are regarded as the same agent (with the same policy).
2. State: at any moment, considering the spatial distribution of idle vehicles and orders (i.e., the number of available vehicles and orders in each grid) and the current time, all agents share the same global state. The state of agent i (the i-th agent) is defined by the identifier of the grid it is in together with the shared global state.
3. Action: the joint action indicates the allocation strategy of all available vehicles at the same time; the action space of a single agent specifies the position the agent can reach next, represented as a set of 7 discrete actions: the first six discrete actions assign the agent to one of its six adjacent grids, and the last discrete action keeps it in the current grid.
4. Reward function: all agents in the same grid share the same reward function, and agent i tries to maximize its discounted reward. The individual reward associated with agent i's action is defined as the average reward of all agents that arrive in the same grid as agent i at the same time; individual rewards at the same time and place are identical.
5. State transition probability: it gives the probability of transitioning to the next state when a joint action is taken in the current state; although the action itself is deterministic, new vehicles and orders appear in different grids at each step, and existing vehicles go offline through a random process.
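The Markov model above can be made concrete with a small simulation sketch. The Python code below is only an illustration under assumed details: the grid adjacency is supplied by the caller, and the offline probability, the reward based on served orders, and the order-arrival distribution are illustrative values that do not come from the patent.

```python
import random
from dataclasses import dataclass

STAY = 6  # actions 0-5: move to one of the six adjacent grids; 6: stay put

@dataclass
class DispatchState:
    time_step: int
    idle_vehicles: dict  # grid_id -> number of idle vehicles
    open_orders: dict    # grid_id -> number of unserved orders

@dataclass
class GridWorld:
    neighbors: dict             # grid_id -> list of its six adjacent grid_ids
    gamma: float = 0.95         # discount factor of the Markov model
    offline_prob: float = 0.05  # chance a vehicle goes offline after moving

    def step(self, state: DispatchState, joint_action: dict):
        """joint_action: grid_id -> list of actions, one per idle vehicle there."""
        arrivals = {g: 0 for g in self.neighbors}
        for g, actions in joint_action.items():
            for a in actions:
                dest = g if a == STAY else self.neighbors[g][a]
                if random.random() > self.offline_prob:  # vehicle stays online
                    arrivals[dest] += 1
        rewards = {}
        for g in self.neighbors:
            served = min(arrivals[g], state.open_orders.get(g, 0))
            # agents arriving in the same grid at the same time share the
            # average reward of that grid
            rewards[g] = served / arrivals[g] if arrivals[g] else 0.0
        next_state = DispatchState(state.time_step + 1, arrivals, self._new_orders())
        return next_state, rewards

    def _new_orders(self) -> dict:
        # new orders appear in random grids: the stochastic part of the transition
        return {g: random.randint(0, 3) for g in self.neighbors}
```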
In the present invention, the reinforcement learning algorithm can be trained with the actor-critic method (centralized training, decentralized execution): all agents share a central critic that evaluates the order dispatching actions and updates the order dispatching policy, while during execution the agents independently follow their learned policies, without needing the centralized critic.
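A minimal sketch of this centralized-training, decentralized-execution layout is shown below. The use of PyTorch, the network shapes, and the hidden sizes are assumptions made for illustration; the patent does not specify a particular network architecture.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 32, 7  # 6 neighbor grids + stay; the state size is assumed

class Actor(nn.Module):
    """Policy network executed independently by every (homogeneous) agent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)  # action probabilities

class CentralCritic(nn.Module):
    """Shared value estimator, used only during centralized training."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, global_state):
        return self.net(global_state)

actor, critic = Actor(), CentralCritic()

def act(state: torch.Tensor) -> int:
    """Decentralized execution: sample an action without touching the critic."""
    with torch.no_grad():
        return torch.multinomial(actor(state), num_samples=1).item()
```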
In the present invention, in step S2, when human demonstration data is generated, demonstration records of order dispatching are collected from the human demonstrations, and Gaussian regression is used to train on the human demonstration data, so as to train an order dispatching strategy; if new human demonstration data is generated, the training that imitates the human demonstration data is repeated.
Specifically, learning from human demonstrations trains the autonomous learning process to imitate human behavior in the order dispatching task: a human acts as a demonstrator, providing demonstration instances of order dispatching as sequences of states and actions; using these demonstrations, the autonomous learning process imitates the order dispatching policy (a mapping from states to actions) demonstrated by humans. Learning from demonstrations provides a more direct path to the expected order dispatching behaviors and converges quickly to more stable order dispatching behavior.
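One way to realize the Gaussian-regression fit of the demonstration data is sketched below with scikit-learn. The kernel choice, the one-hot target encoding, and the softmax-based confidence CP(s) are illustrative assumptions, not choices stated in the patent.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

N_ACTIONS = 7

def fit_demo_policy(states: np.ndarray, actions: np.ndarray) -> GaussianProcessRegressor:
    """states: (n, d) demonstrated observations; actions: (n,) integers in [0, 7)."""
    one_hot = np.eye(N_ACTIONS)[actions]          # one regression target per action
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
    gp.fit(states, one_hot)
    return gp

def demo_action_and_confidence(gp: GaussianProcessRegressor, state: np.ndarray):
    """Return pi_m's action and a scalar confidence CP(s) for a single state."""
    scores = gp.predict(state.reshape(1, -1))[0]  # predicted score per action
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # softmax as a rough confidence
    return int(np.argmax(probs)), float(probs.max())
```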
As shown in Fig. 3, first, the order dispatching strategy πrl is trained using the reinforcement learning method. When human demonstration data is generated, demonstration records of order dispatching are collected from the human demonstrations; these demonstration data (the observations received and the actions taken) constitute the initial human dataset. Gaussian regression is used to train on the demonstration data, yielding an order dispatching strategy πm trained from the human demonstration data. An ε-greedy strategy is then used (ε-greedy trades off exploration and exploitation based on a probability: on each attempt, explore with probability ε and exploit the learned strategy with probability 1 - ε): a random number rand is generated; if rand <= ε, random exploration is performed, that is, orders are randomly assigned to idle drivers; if rand > ε, CP(s) and CQ(s) are normalized and scaled with the hyperbolic tangent function, and the scaled confidences are compared. If the confidence of πm is greater than the confidence of πrl, the order dispatching action is executed according to strategy πm; otherwise it is executed according to strategy πrl. Here CP(s) is the confidence of the learned strategy, that is, the confidence of the order dispatching strategy πm trained from the human demonstration data, and CQ(s) is the confidence of following the current self-learned strategy, that is, the confidence of the order dispatching strategy πrl that the agent has learned autonomously by interacting with the environment. After the order dispatching action is executed, the order dispatching strategy πrl is updated according to the environment state and the reward fed back. If new human demonstration data is generated, the above steps are repeated; otherwise, the trained order dispatching strategy πrl is output.
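The action-selection rule of Fig. 3 can be summarized in a few lines. The normalization constants below are assumed for illustration; the patent only states that CP(s) and CQ(s) are normalized and then scaled with the hyperbolic tangent before comparison.

```python
import random
import numpy as np

def select_policy(cp: float, cq: float, epsilon: float = 0.1,
                  cp_scale: float = 1.0, cq_scale: float = 1.0) -> str:
    """Decide which policy acts in the current state: 'random', 'pi_m', or 'pi_rl'."""
    if random.random() <= epsilon:
        return "random"                 # explore: assign the order to a random idle driver
    cp_scaled = np.tanh(cp / cp_scale)  # normalize, then squash with tanh
    cq_scaled = np.tanh(cq / cq_scale)
    return "pi_m" if cp_scaled > cq_scaled else "pi_rl"

# Example: a confident demonstration policy outweighs an uncertain RL policy.
print(select_policy(cp=0.8, cq=0.3, epsilon=0.0))  # -> pi_m
```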
In the present invention, in step S3, the order dispatching task is executed using the order dispatching strategy trained in step S2; when human interference data is generated, the action data of the human interference with order dispatching is collected, and the executed order dispatching action is changed according to that data, thereby training a new order dispatching strategy; if new human interference data is generated, the training that imitates the human interference data is repeated. Further, when human interference data is generated, the interference action data is collected, the executed order dispatching action is changed accordingly, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to update the order dispatching strategy and train a new order dispatching strategy.
Specifically, when the autonomous learning process is about to enter a catastrophic state or an error state with which humans are not satisfied, the human plays the role of a supervisor and intervenes, which can prevent or mitigate catastrophic behavior and avoid erroneous behavior that humans are not satisfied with. As shown in Fig. 4, first, the learning process executes the order dispatching task according to the order dispatching strategy πrl that the agent has learned autonomously by interacting with the environment or has learned from human demonstrations; when human interference data is generated, the action data of the human interference with order dispatching is collected, and the executed order dispatching action is changed according to that data; an intervention reward is computed from the degree of human interference, and this reward signal and its associated temporal-difference error are used to train a value function (the evaluator, or critic) that evaluates the actions taken by the actor; the actor-critic policy gradient method is then used to update the order dispatching strategy πrl. If new human interference data is generated, the above steps are repeated; otherwise, the trained order dispatching strategy πrl is output.
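A minimal sketch of one such update is given below, reusing the illustrative actor and critic networks from the earlier sketch. How the intervention reward is derived from the degree of interference is not specified in detail by the patent; the simple fixed penalty used here is an assumption.

```python
import torch

def interference_update(actor, critic, opt_actor, opt_critic,
                        state, agent_action, human_action, next_state,
                        env_reward, gamma=0.95, penalty=1.0):
    """One actor-critic update after a (possible) human override of the action."""
    overridden = float(agent_action != human_action)
    reward = env_reward - penalty * overridden        # assumed intervention reward
    td_error = (reward + gamma * critic(next_state).detach()
                - critic(state))                      # temporal-difference error
    opt_critic.zero_grad()
    (td_error ** 2).mean().backward()                 # fit the critic (value function)
    opt_critic.step()
    opt_actor.zero_grad()
    log_prob = torch.log(actor(state)[human_action])  # the action actually executed
    (-td_error.detach() * log_prob).backward()        # actor-critic policy gradient
    opt_actor.step()
```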
In the present invention, in step S4, the order dispatching task is executed using the order dispatching strategy trained in step S3; when human evaluation data is generated, the human evaluation data is collected, and the order dispatching actions are evaluated through a reward signal, so as to train a new order dispatching strategy; if new human evaluation data is generated, the training that imitates human evaluation is repeated. Further, the order dispatching task is executed using the order dispatching strategy, human evaluation data is collected, the quality of the current order dispatching is evaluated through a reward signal, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to train a new order dispatching strategy.
Specifically, when learning from human evaluations, humans act as supervisors, providing real-time evaluations (or critiques) that interactively shape the behavior of the autonomous learning process; this leverages human domain knowledge and intent to shape the agent's actions through sparse interaction in the form of evaluative feedback. When learning from human evaluations, the human only needs to understand the task goal, not how the task is specifically performed, which minimizes the burden on the human evaluator. As shown in Fig. 5, first, the learning process executes the order dispatching task according to the order dispatching strategy πrl that the agent has learned autonomously through interaction with the environment or has learned from human interference; when human evaluation data is generated, the human plays the role of a supervisor, the system's actions are evaluated through a reward signal, and the human evaluation data is collected; similar to the learning-from-human-intervention stage, the quality of the current order dispatching is evaluated through a reward signal, and this reward signal and its associated temporal-difference error are used to update the evaluator's value function and the policy. If new human evaluation data is generated, the above steps are repeated; otherwise, the trained order dispatching strategy πrl is output.
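The evaluation stage can be sketched as the loop below: the policy keeps dispatching, and whenever the supervisor provides a rating it is used as the reward for a temporal-difference update of the critic and an actor-critic policy update. The environment interface, the rating scale, and the optimizers are assumptions introduced for illustration.

```python
import torch

def run_evaluation_phase(env, actor, critic, opt_actor, opt_critic,
                         get_human_rating, n_steps=1000, gamma=0.95):
    """Dispatch with the current policy; learn from sparse human ratings."""
    state = env.reset()
    for _ in range(n_steps):
        action = torch.multinomial(actor(state), num_samples=1).item()
        next_state, _ = env.step(action)
        rating = get_human_rating(state, action)       # None when no feedback is given
        if rating is not None:                         # sparse evaluative feedback
            td_error = (rating + gamma * critic(next_state).detach()
                        - critic(state))
            opt_critic.zero_grad()
            (td_error ** 2).mean().backward()          # update the value function
            opt_critic.step()
            opt_actor.zero_grad()
            log_prob = torch.log(actor(state)[action])
            (-td_error.detach() * log_prob).backward() # shape the policy
            opt_actor.step()
        state = next_state
```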
In the present invention, in order to integrate the three different human-computer interaction modes, the actor-critic training method (centralized training, decentralized execution) is used. Initially only the actor is trained; the critic is then added. The actor is trained during the learning-from-human-demonstration stage, and both the actor and the critic are trained during the learning-from-human-interference stage. The critic trained during the learning-from-human-evaluation stage then assumes the role of supervisor, and finally the actor and the critic are combined in a standard actor-critic reinforcement learning architecture driven by a learned reward model.
In the present invention, the order dispatching method based on interactive reinforcement learning may be called the interactive learning method Cycle-of-HRL (HRL: interactive reinforcement learning), which achieves efficient order dispatching. The method is based on interactive reinforcement learning, fuses the multiple human-computer interaction modes of learning from human demonstrations, learning from human interference, and learning from human evaluations, and formulates an integrated scheme for these interaction modes.
In the present invention, it is also possible to switch among the three modes of imitating human demonstration, interference, and evaluation, using performance metrics, data modality limitations, and an advantage function to define the switching criteria for the three human-computer interaction modes. Performance metrics: once the order dispatching strategy reaches a certain level, predefined performance metrics can indicate when to switch the interaction mode; alternatively, when no improvement in system performance is observed, the people interacting with the system can switch between interaction modes manually. Data modality limitations: depending on the task, the number of data modalities for which humans can provide demonstrations, interventions, or evaluations may be limited, in which case the interactive learning method Cycle-of-HRL switches between modes according to data availability. Advantage function: after the reward function is trained, the advantage function (the difference between the state-action value function Q(s, a) and the state value function V(s), which compares the expected return of a given state-action pair with the expected return in that state) can be computed and used to compare expected returns between human and system actions. With this information, Cycle-of-HRL can switch the interaction mode when the system's advantage exceeds the human's advantage.
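The advantage-based switching criterion can be written as a small check. The Q and V callables are assumed to come from the trained reward/value models; their exact form is not specified in the patent.

```python
def should_switch_mode(q_fn, v_fn, state, system_action, human_action) -> bool:
    """Switch interaction mode once the system's advantage exceeds the human's."""
    advantage_system = q_fn(state, system_action) - v_fn(state)  # A(s, a_system)
    advantage_human = q_fn(state, human_action) - v_fn(state)    # A(s, a_human)
    return advantage_system > advantage_human
```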
The present invention also provides an order dispatching system based on interactive reinforcement learning, including:
a modeling module, used to train and model the order dispatching task;
an imitating-human-demonstration training module, used for training, through autonomous learning, to imitate the order dispatching strategy behavior demonstrated by humans;
an imitating-human-interference training module, used for training, through autonomous learning, to imitate the behavior of human intervention;
an imitating-human-evaluation training module, used for training, through autonomous learning, to imitate the behavior of human evaluation feedback;
a reinforcement training module, used for pure reinforcement learning training, so as to obtain the optimal order dispatching strategy.
In the present invention, human-computer interaction is introduced and the multiple human-computer interaction modes of human demonstration, interference, and evaluation are integrated, so as to reduce the search space, accelerate the learning process, and improve accuracy. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the self-learning results shifts the learning process toward a better order dispatching strategy and accelerates learning. The present invention also formulates an integration scheme for the three interaction modes and designs criteria for switching among them.
The above description is only an embodiment of the present invention and is not intended to limit the patent scope of the present invention. Any equivalent structural or equivalent process transformation made using the contents of the description and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

  1. 一种基于交互式强化学习的订单分派方法,其特征在于,包括如下步骤:An order dispatching method based on interactive reinforcement learning, characterized in that it comprises the following steps:
    步骤S1、对订单分派任务建模进行模仿训练;Step S1, imitating training for order dispatch task modeling;
    步骤S2、在状态和动作的序列方面提供模仿人类行为的订单分派的演示实例,通过自主学习模仿人类演示的订单分派策略行为;Step S2, providing a demonstration instance of order dispatch imitating human behavior in terms of the sequence of states and actions, and imitating the order dispatch strategy behavior demonstrated by human through autonomous learning;
    步骤S3、在进入灾难性状态或人类不满意的错误状态中,通过自主学习模仿人类干预的行为;Step S3, imitating the behavior of human intervention through autonomous learning when entering a catastrophic state or an error state that human beings are not satisfied with;
    步骤S4、通过自主学习模仿人类的评价反馈的行为;Step S4, imitating the behavior of human evaluation feedback through autonomous learning;
    步骤S5、进入纯强化学习阶段进行训练,从而得到最优的订单分派策略。Step S5, enter the pure reinforcement learning stage for training, so as to obtain the optimal order dispatching strategy.
  2. 根据权利要求1所述的一种基于交互式强化学习的订单分派方法,其特征在于,在步骤S2内,当产生了人类演示数据时,从人类的演示中收集订单分派的演示记录,使用高斯回归训练该人类演示数据,从而训练出订单分派策略。The method for order dispatching based on interactive reinforcement learning according to claim 1, wherein in step S2, when human demonstration data is generated, a demonstration record of order dispatching is collected from the human demonstration, and a Gaussian demonstration record is used. Regress to train the human demo data to train an order dispatch policy.
  3. 根据权利要求2所述的一种基于交互式强化学习的订单分派方法,其特征在于,在步骤S2内,如果产生了新的人类演示数据,则重复进行模仿人类演示数据的训练。The method for dispatching orders based on interactive reinforcement learning according to claim 2, wherein in step S2, if new human demonstration data is generated, the training of imitating human demonstration data is repeated.
  4. 根据权利要求3所述的一种基于交互式强化学习的订单分派方法,其特征在于,在步骤S3内,经过从步骤S2训练出的订单分派策略执行订单分派任务,当产生人类干扰数据时,收集人类干扰订单分派的动作数据,按该动作数据改变执行订单分派的动作,从而训练出新的订单分派策略。An order dispatching method based on interactive reinforcement learning according to claim 3, characterized in that, in step S3, the order dispatching task is executed through the order dispatching strategy trained from step S2, and when human interference data is generated, Collect the action data of human interference in order dispatching, and change the action of executing order dispatching according to the action data, so as to train a new order dispatching strategy.
  5. 根据权利要求4所述的一种基于交互式强化学习的订单分派方法,其特征在于,在步骤S3内,如果产生了新的人类干扰数据,则重复进行模仿人类干扰数据的训练。An order dispatch method based on interactive reinforcement learning according to claim 4, characterized in that, in step S3, if new human interference data is generated, the training of imitating human interference data is repeated.
  6. The order dispatching method based on interactive reinforcement learning according to claim 5, characterized in that, in step S4, the order dispatching task is executed using the order dispatching strategy trained in step S3; when human evaluation data is generated, the human evaluation data is collected and the actions of the order dispatching task are evaluated through a reward signal, thereby training a new order dispatching strategy.
  7. The order dispatching method based on interactive reinforcement learning according to claim 6, characterized in that, in step S4, if new human evaluation data is generated, the training that imitates the human evaluation is repeated.
  8. The order dispatching method based on interactive reinforcement learning according to claim 5, characterized in that, in step S3, when human intervention data is generated, the action data of the human intervention in order dispatching is collected, the action executed for order dispatching is changed according to that action data, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to update the order dispatching strategy and thereby train a new order dispatching strategy.
  9. The order dispatching method based on interactive reinforcement learning according to claim 7, characterized in that, in step S4, the order dispatching task is executed using the order dispatching strategy, the human evaluation data is collected, the quality of the current order dispatching behavior is evaluated through a reward signal, and the evaluator is then trained using the reward signal and the temporal-difference error, thereby training a new order dispatching strategy.
  10. An order dispatching system based on interactive reinforcement learning, characterized in that it comprises:
    a modeling module, configured to model the order dispatching task for training;
    a human demonstration imitation training module, configured to imitate, through autonomous learning, the order dispatching strategy behavior demonstrated by humans;
    a human intervention imitation training module, configured to imitate, through autonomous learning, the behavior of human intervention;
    a human evaluation imitation training module, configured to imitate, through autonomous learning, the behavior of human evaluative feedback;
    a reinforcement training module, configured for pure reinforcement learning training, so as to obtain the optimal order dispatching strategy.
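As a further illustration, the two concrete learning mechanisms named in the claims above, Gaussian regression over human demonstration records (claim 2) and training of the evaluator from a reward signal and a temporal-difference error (claims 8 and 9), could be sketched as follows. This is a minimal sketch assuming a scikit-learn Gaussian process regressor and a linear state-value evaluator; the feature encodings, hyperparameters, and function names are assumptions and are not taken from the patent.

```python
# Hypothetical sketch of the learning mechanisms referenced in claims 2, 8 and 9.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def fit_demonstration_strategy(demo_states, demo_actions):
    """Claim 2 (sketch): fit a Gaussian process regressor to human demonstration
    records, mapping dispatch states to the demonstrated dispatch actions."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
    gp.fit(demo_states, demo_actions)   # demo_states: (N, d), demo_actions: (N, k)
    return gp                           # gp.predict(s) proposes an action for state s


def td_update(critic_weights, state_feat, next_state_feat, reward,
              gamma=0.99, lr=0.01):
    """Claims 8/9 (sketch): one temporal-difference update of a linear
    state-value evaluator; `reward` combines the environment reward with the
    human evaluation signal."""
    v = state_feat @ critic_weights
    v_next = next_state_feat @ critic_weights
    td_error = reward + gamma * v_next - v
    critic_weights = critic_weights + lr * td_error * state_feat
    return critic_weights, td_error     # td_error then weights the strategy update


if __name__ == "__main__":
    # Illustrative run with random placeholder data (not real dispatch data).
    rng = np.random.default_rng(0)
    states, actions = rng.random((20, 8)), rng.random((20, 3))
    strategy = fit_demonstration_strategy(states, actions)
    w = np.zeros(8)
    w, delta = td_update(w, rng.random(8), rng.random(8), reward=1.0)
```

In such a scheme the human intervention action of claim 8 would replace the agent's own action before the transition is recorded, the human evaluation of claim 9 would be folded into the reward term, and the resulting temporal-difference error would in turn weight the update of the dispatching strategy, which is why a new strategy is trained whenever new human data arrives.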
PCT/CN2020/139231 2020-12-10 2020-12-25 Method and system for order dispatch based on interactive reinforcement learning WO2022120970A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011432387.X 2020-12-10
CN202011432387.XA CN112396501B (en) 2020-12-10 2020-12-10 Order dispatching method and system based on interactive reinforcement learning

Publications (1)

Publication Number Publication Date
WO2022120970A1

Family ID: 74624981

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139231 WO2022120970A1 (en) 2020-12-10 2020-12-25 Method and system for order dispatch based on interactive reinforcement learning

Country Status (2)

Country Link
CN (1) CN112396501B (en)
WO (1) WO2022120970A1 (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11514543B2 (en) * 2018-06-05 2022-11-29 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for ride order dispatching

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190318267A1 (en) * 2018-04-12 2019-10-17 Baidu Usa Llc System and method for training a machine learning model deployed on a simulation platform
CN109858574A (en) * 2018-12-14 2019-06-07 启元世界(北京)信息技术服务有限公司 The autonomous learning method and system of intelligent body towards man-machine coordination work
CN109733415A (en) * 2019-01-08 2019-05-10 同济大学 A kind of automatic Pilot following-speed model that personalizes based on deeply study
CN110070188A (en) * 2019-04-30 2019-07-30 山东大学 A kind of increment type cognitive development system and method merging interactive intensified learning
CN110213796A (en) * 2019-05-28 2019-09-06 大连理工大学 A kind of intelligent resource allocation methods in car networking
CN111080408A (en) * 2019-12-06 2020-04-28 广东工业大学 Order information processing method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN112396501A (en) 2021-02-23
CN112396501B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
Yao et al. Path planning method with improved artificial potential field—a reinforcement learning perspective
Martínez et al. Relational reinforcement learning with guided demonstrations
Sutton et al. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
CN112990485A (en) Knowledge strategy selection method and device based on reinforcement learning
Vilarinho et al. Design of a multiagent system for real-time traffic control
Zhu et al. Learning by reusing previous advice in teacher-student paradigm
Kim et al. Optimizing large-scale fleet management on a road network using multi-agent deep reinforcement learning with graph neural network
Shamshirband A distributed approach for coordination between traffic lights based on game theory.
Gu et al. A human-centered safe robot reinforcement learning framework with interactive behaviors
Chen et al. When shall i be empathetic? the utility of empathetic parameter estimation in multi-agent interactions
WO2022120970A1 (en) Method and system for order dispatch based on interactive reinforcement learning
Nguyen et al. Apprenticeship bootstrapping
Guan et al. Ab-mapper: Attention and bicnet based multi-agent path planning for dynamic environment
Wu et al. Path planning for autonomous mobile robot using transfer learning-based Q-learning
Reichstaller et al. Transferring context-dependent test inputs
Zhan et al. Generative adversarial inverse reinforcement learning with deep deterministic policy gradient
Villagra et al. Behavior planning
Guan et al. AB-Mapper: Attention and BicNet Based Multi-agent Path Finding for Dynamic Crowded Environment
Broz Planning for human-robot interaction: representing time and human intention
Yu et al. Self-Supervised Imitation for Offline Reinforcement Learning With Hindsight Relabeling
Maheswaran et al. Human-agent collaborative optimization of real-time distributed dynamic multi-agent coordination
Liu et al. Blending Imitation and Reinforcement Learning for Robust Policy Improvement
Tang et al. Towards schema-based, constructivist robot learning: Validating an evolutionary search algorithm for schema chunking
Tang et al. Improved Bayesian inverse reinforcement learning based on demonstration and feedback
Lewis Adaptive representation in a behavior-based robot: An extension of the copycat architecture

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20964899; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20964899; Country of ref document: EP; Kind code of ref document: A1)