WO2022120970A1 - Method and system for order dispatch based on interactive reinforcement learning - Google Patents

Method and system for order dispatch based on interactive reinforcement learning

Info

Publication number
WO2022120970A1
Authority
WO
WIPO (PCT)
Prior art keywords
human
order
order dispatching
training
data
Prior art date
Application number
PCT/CN2020/139231
Other languages
French (fr)
Chinese (zh)
Inventor
金铭
王洋
须成忠
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Publication of WO2022120970A1 publication Critical patent/WO2022120970A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0633Lists, e.g. purchase orders, compilation or processing
    • G06Q30/0635Processing of requisition or of purchase orders
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06Q50/40

Definitions

  • the invention relates to the field of Internet information technology, in particular to an order dispatching method and system based on interactive reinforcement learning.
  • the existing order dispatching technology is mainly an autonomous learning approach based on reinforcement learning: a Markov decision process is constructed by defining the agent, the environment state, and the agent actions, and a state transition function and a reward function are built from the environment state and agent actions, on the basis of which an optimal policy is trained to maximize the overall benefit of order dispatching.
  • the main technical problem solved by the present invention is to provide an order dispatching method based on interactive reinforcement learning, which introduces human-computer interaction into the autonomous learning process and integrates the human-computer interaction modes of human demonstration, interference, and evaluation. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the self-learning results shifts the learning process toward a better order dispatching strategy and accelerates learning, so as to obtain the optimal order dispatching strategy. An order dispatching system based on interactive reinforcement learning is also provided.
  • a technical solution adopted by the present invention is to provide an order dispatching method based on interactive reinforcement learning, which includes the following steps:
  • Step S1: model the order dispatching task for imitation training;
  • Step S2: provide demonstration instances of order dispatching that imitate human behavior as sequences of states and actions, and imitate the order dispatching strategy behavior demonstrated by humans through autonomous learning;
  • Step S3: imitate the behavior of human intervention through autonomous learning when entering a catastrophic state or an error state with which humans are not satisfied;
  • Step S4: imitate the behavior of human evaluation feedback through autonomous learning;
  • Step S5: enter the pure reinforcement learning stage for training, so as to obtain the optimal order dispatching strategy.
  • in step S2, when human demonstration data is generated, demonstration records of order dispatching are collected from the human demonstrations, and Gaussian regression is used to train on the human demonstration data, thereby training an order dispatching strategy.
  • in step S2, if new human demonstration data is generated, the training that imitates the human demonstration data is repeated.
  • in step S3, the order dispatching task is executed using the order dispatching strategy trained in step S2; when human interference data is generated, the action data of the human interference with order dispatching is collected, and the executed order dispatching action is changed according to that data, thereby training a new order dispatching strategy.
  • in step S3, if new human interference data is generated, the training that imitates the human interference data is repeated.
  • in step S4, the order dispatching task is executed using the order dispatching strategy trained in step S3; when human evaluation data is generated, the human evaluation data is collected, and the order dispatching actions are evaluated through a reward signal, thereby training a new order dispatching strategy.
  • in step S4, if new human evaluation data is generated, the training that imitates human evaluation is repeated.
  • in step S3, when human interference data is generated, the action data of the human interference with order dispatching is collected, the executed order dispatching action is changed according to that data, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to update the order dispatching strategy and train a new order dispatching strategy.
  • in step S4, the order dispatching task is executed using the order dispatching strategy, human evaluation data is collected, the quality of the current order dispatching is evaluated through a reward signal, and the evaluator is then trained using the reward signal and the temporal-difference error, thereby training a new order dispatching strategy.
  • An order dispatching system based on interactive reinforcement learning, including:
  • a modeling module, used to train and model the order dispatching task;
  • an imitating-human-demonstration training module, used for training, through autonomous learning, to imitate the order dispatching strategy behavior demonstrated by humans;
  • an imitating-human-interference training module, used for training, through autonomous learning, to imitate the behavior of human intervention;
  • an imitating-human-evaluation training module, used for training, through autonomous learning, to imitate the behavior of human evaluation feedback;
  • a reinforcement training module, used for pure reinforcement learning training, so as to obtain the optimal order dispatching strategy.
  • the present invention introduces human-computer interaction into the autonomous learning process and integrates the human-computer interaction modes of human demonstration, interference, and evaluation. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the autonomous learning results shifts the learning process toward a better order dispatching strategy, accelerates learning, and yields the optimal order dispatching strategy.
  • Fig. 1 is a block diagram of the steps of the order dispatching method based on interactive reinforcement learning of the present invention.
  • Fig. 2 is a flow diagram of the order dispatching method based on interactive reinforcement learning of the present invention.
  • Fig. 3 is a flow diagram of the learning and execution of imitating-human-demonstration training of the present invention.
  • Fig. 4 is a flow diagram of the learning and execution of imitating-human-interference training of the present invention.
  • Fig. 5 is a flow diagram of the learning and execution of imitating-human-evaluation training of the present invention.
  • the present invention provides an order dispatch method based on interactive reinforcement learning, comprising the following steps:
  • Step S1: model the order dispatching task for imitation training;
  • Step S2: provide demonstration instances of order dispatching that imitate human behavior as sequences of states and actions, and imitate the order dispatching strategy behavior demonstrated by humans through autonomous learning;
  • Step S3: imitate the behavior of human intervention through autonomous learning when entering a catastrophic state or an error state with which humans are not satisfied;
  • Step S4: imitate the behavior of human evaluation feedback through autonomous learning;
  • Step S5: enter the pure reinforcement learning stage for training, so as to obtain the optimal order dispatching strategy.
  • human-computer interaction is introduced into the autonomous learning process and the human-computer interaction modes of human demonstration, interference, and evaluation are integrated: learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the autonomous learning results shifts the learning process toward a better order dispatching strategy and accelerates learning, so as to obtain the optimal order dispatching strategy.
  • for the order dispatching task, the optimal order dispatching strategy is first trained using the reinforcement learning method, and the interactive reinforcement learning training process is executed iteratively for the human-computer interaction modes of human demonstration, interference, and evaluation that appear during training: learn from human demonstrations, learn from human interference, learn from human evaluations, and finally enter the pure reinforcement learning stage to continue training.
  • Agent: an idle vehicle is regarded as an agent, and vehicles at the same space-time node are homogeneous, that is, vehicles located in the same area within the same time interval are regarded as the same agent (with the same policy).
  • agent i denotes the i-th agent.
  • Action: the joint action indicates the allocation strategy of all available vehicles at the same time; the action space of a single agent specifies the position the agent can reach next, represented as a set of 7 discrete actions: the first six discrete actions assign the agent to one of its six adjacent grids, and the last discrete action keeps it in the current grid.
  • Reward function: all agents in the same grid share the same reward function, and agent i tries to maximize its discounted reward. The individual reward associated with agent i's action is defined as the average reward of all agents that arrive in the same grid as agent i at the same time; individual rewards at the same time and place are identical.
  • State transition probability: it gives the probability of transitioning to the next state when a joint action is taken in the current state; although the action itself is deterministic, new vehicles and orders appear in different grids at each step, and existing vehicles go offline through a random process.
  • the reinforcement learning algorithm can be trained using the actor-critic method (centralized training, decentralized execution): all agents share a central critic that evaluates the order dispatching actions and updates the order dispatching policy, while during execution the agents independently follow their learned policies, without needing the centralized critic.
  • in step S2, when human demonstration data is generated, demonstration records of order dispatching are collected from the human demonstrations and trained with Gaussian regression, so as to train an order dispatching strategy; if new human demonstration data is generated, the training that imitates the human demonstration data is repeated.
  • learning from human demonstrations trains the autonomous learning process to imitate human behavior in the order dispatching task: a human acts as a demonstrator, providing demonstration instances of order dispatching as sequences of states and actions; using these demonstrations, the autonomous learning process imitates the order dispatching policy (a mapping from states to actions) demonstrated by humans. Learning from demonstrations provides a more direct path to the expected order dispatching behaviors and converges quickly to more stable order dispatching behavior.
  • in step S3, the order dispatching task is executed using the order dispatching strategy trained in step S2; when human interference data is generated, the action data of the human interference with order dispatching is collected and the executed order dispatching action is changed accordingly, training a new order dispatching strategy; if new human interference data is generated, the training that imitates the human interference data is repeated. Further, when human interference data is generated, the interference action data is collected, the executed action is changed accordingly, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to update the order dispatching strategy and train a new order dispatching strategy.
  • the learning process executes the order dispatching task according to the order dispatching strategy πrl that the agent has learned autonomously by interacting with the environment or has learned from human demonstrations; when human interference data is generated, the action data of the human interference with order dispatching is collected and the executed order dispatching action is changed accordingly; an intervention reward is computed from the degree of human interference, and this reward signal and its associated temporal-difference error are used to train a value function (the evaluator, or critic) that evaluates the actions taken by the actor; the actor-critic policy gradient method is used to update the order dispatching strategy πrl. If new human interference data is generated, the above steps are repeated; otherwise, the trained order dispatching strategy πrl is output.
  • in step S4, the order dispatching task is executed using the order dispatching strategy trained in step S3; when human evaluation data is generated, the human evaluation data is collected and the order dispatching actions are evaluated through a reward signal, so as to train a new order dispatching strategy; if new human evaluation data is generated, the training that imitates human evaluation is repeated. Further, the order dispatching task is executed using the order dispatching strategy, human evaluation data is collected, the quality of the current order dispatching is evaluated through a reward signal, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to train a new order dispatching strategy.
  • when learning from human evaluations, humans act as supervisors, providing real-time evaluations (or critiques) that interactively shape the behavior of the autonomous learning process; this leverages human domain knowledge and intent to shape the agent's actions through sparse interaction in the form of evaluative feedback. The human evaluator only needs to understand the task goal, not how the task is performed, which minimizes the human evaluator's burden.
  • the learning process executes the order dispatching task according to the order dispatching strategy πrl that the agent has learned autonomously through interaction with the environment or has learned from human interference; when human evaluation data is generated, the human plays the role of a supervisor, the system's actions are evaluated through a reward signal, and the human evaluation data is collected; similar to the learning-from-human-intervention stage, the quality of the current order dispatching is evaluated through a reward signal, and this reward signal and its associated temporal-difference error are used to update the evaluator's value function and the policy. If new human evaluation data is generated, the above steps are repeated; otherwise, the trained order dispatching strategy πrl is output.
  • in order to integrate the three different human-computer interaction modes, the actor-critic training method (centralized training, decentralized execution) is used for training.
  • initially only the actor is trained; the critic is then added. The actor is trained during the learning-from-human-demonstration stage, and both the actor and the critic are trained during the learning-from-human-interference stage.
  • the critic trained during the learning-from-human-evaluation stage then assumes the role of supervisor, and finally the actor and the critic are combined in a standard actor-critic reinforcement learning architecture driven by a learned reward model.
  • the order dispatching method based on interactive reinforcement learning may be called the interactive learning method Cycle-of-HRL (HRL: interactive reinforcement learning), which achieves efficient order dispatching; this method is based on interactive reinforcement learning.
  • interactive reinforcement learning here fuses the multiple human-computer interaction modes of learning from human demonstrations, learning from human interference, and learning from human evaluations, and formulates an integrated scheme for these interaction modes.
  • Performance metrics: once the order dispatching strategy reaches a certain level, predefined performance metrics can indicate when to switch the interaction mode; alternatively, when no improvement in system performance is observed, the people interacting with the system can switch between interaction modes manually.
  • Data modality limitations: depending on the task, the number of data modalities for which humans can provide demonstrations, interventions, or evaluations may be limited, in which case the interactive learning method Cycle-of-HRL switches between modes according to data availability.
  • Advantage function: after the reward function is trained, the advantage function (the difference between the state-action value function Q(s, a) and the state value function V(s), which compares the expected return of a given state-action pair with the expected return in that state) can be computed and used to compare expected returns between human and system actions.
  • Cycle-of-HRL can switch the interaction mode when the system's advantage function exceeds the human's advantage function.
  • the present invention also provides an order dispatching system based on interactive reinforcement learning, including:
  • a modeling module, used to train and model the order dispatching task;
  • an imitating-human-demonstration training module, used for training, through autonomous learning, to imitate the order dispatching strategy behavior demonstrated by humans;
  • an imitating-human-interference training module, used for training, through autonomous learning, to imitate the behavior of human intervention;
  • an imitating-human-evaluation training module, used for training, through autonomous learning, to imitate the behavior of human evaluation feedback;
  • a reinforcement training module, used for pure reinforcement learning training, so as to obtain the optimal order dispatching strategy.
  • human-computer interaction is introduced and the multiple human-computer interaction modes of human demonstration, interference, and evaluation are integrated, so as to reduce the search space, accelerate the learning process, and improve accuracy.
  • Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the self-learning results shifts the learning process toward a better order dispatching strategy and accelerates learning.
  • the present invention formulates an integration scheme for the three interaction modes and designs criteria for switching among them.

Abstract

Provided are a method and system for order dispatch based on interactive reinforcement learning. The method introduces human-computer interaction into the autonomous learning process and integrates the human-computer interaction modes of human demonstration, interference, and evaluation: learning from human demonstration, the real demonstration data makes it possible to better simulate actual order dispatch scenarios; learning from human interference, the agent's behavior is controlled when an incorrect action occurs during autonomous learning, avoiding incorrect results; learning from human evaluation, manual evaluation of the autonomous learning results shifts the learning process toward a better dispatch strategy and accelerates learning, thereby obtaining an optimal order dispatch strategy.

Description

An order dispatching method and system based on interactive reinforcement learning
Technical Field
The invention relates to the field of Internet information technology, and in particular to an order dispatching method and system based on interactive reinforcement learning.
Background Art
Online ride-hailing applications and platforms have become a novel and popular way to provide on-demand transportation services through mobile apps. At present, ride-hailing mobile applications such as Didi, Uber, and Lyft are popular all over the world; such systems serve a large number of passengers every day and generate a large number of ride-hailing orders. For example, Didi, China's largest online ride-hailing service provider, needs to process about 11 million orders every day. The order dispatching problem for online ride-hailing services is essentially a reasonable matching between potential passengers and drivers: in this scenario, after an online user arrives, the user needs to be assigned an optimal service provider. In many cases the service is reusable: the service provider disappears for a period of time after being matched with a user, and the user rejoins the system after using the service. Here, the offline service providers are the different drivers; when a potential passenger sends a request, the system matches the passenger with a nearby driver, and in most cases the driver rejoins the system after completing the service and can be matched again.
Existing order dispatching technology is mainly an autonomous learning approach based on reinforcement learning: a Markov decision process is constructed by defining the agent, the environment state, and the agent actions, and a state transition function and a reward function are constructed from the environment state and agent actions, on the basis of which an optimal policy is trained to maximize the overall benefit of order dispatching.
Existing techniques generally carry out a completely autonomous training and learning process through traditional reinforcement learning interacting with the environment. However, this fully autonomous learning approach lacks human participation: the learning process takes a great deal of time, the agent's behavior cannot be controlled during learning, which may lead to erroneous results, and the learned results struggle to simulate complex real-world scenarios.
Summary of the Invention
The main technical problem solved by the present invention is to provide an order dispatching method based on interactive reinforcement learning, which introduces human-computer interaction into the autonomous learning process and integrates the human-computer interaction modes of human demonstration, interference, and evaluation. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the self-learning results shifts the learning process toward a better order dispatching strategy and accelerates learning, so as to obtain the optimal order dispatching strategy. An order dispatching system based on interactive reinforcement learning is also provided.
To solve the above technical problems, one technical solution adopted by the present invention is to provide an order dispatching method based on interactive reinforcement learning, which includes the following steps:
Step S1: model the order dispatching task for imitation training;
Step S2: provide demonstration instances of order dispatching that imitate human behavior as sequences of states and actions, and imitate the order dispatching strategy behavior demonstrated by humans through autonomous learning;
Step S3: imitate the behavior of human intervention through autonomous learning when entering a catastrophic state or an error state with which humans are not satisfied;
Step S4: imitate the behavior of human evaluation feedback through autonomous learning;
Step S5: enter the pure reinforcement learning stage for training, so as to obtain the optimal order dispatching strategy.
As an improvement of the present invention, in step S2, when human demonstration data is generated, demonstration records of order dispatching are collected from the human demonstrations, and Gaussian regression is used to train on the human demonstration data, thereby training an order dispatching strategy.
As a further improvement of the present invention, in step S2, if new human demonstration data is generated, the training that imitates the human demonstration data is repeated.
As a still further improvement of the present invention, in step S3, the order dispatching task is executed using the order dispatching strategy trained in step S2; when human interference data is generated, the action data of the human interference with order dispatching is collected, and the executed order dispatching action is changed according to that data, thereby training a new order dispatching strategy.
As a still further improvement of the present invention, in step S3, if new human interference data is generated, the training that imitates the human interference data is repeated.
As a still further improvement of the present invention, in step S4, the order dispatching task is executed using the order dispatching strategy trained in step S3; when human evaluation data is generated, the human evaluation data is collected, and the order dispatching actions are evaluated through a reward signal, thereby training a new order dispatching strategy.
As a still further improvement of the present invention, in step S4, if new human evaluation data is generated, the training that imitates human evaluation is repeated.
As a still further improvement of the present invention, in step S3, when human interference data is generated, the action data of the human interference with order dispatching is collected, the executed order dispatching action is changed according to that data, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to update the order dispatching strategy and train a new order dispatching strategy.
As a still further improvement of the present invention, in step S4, the order dispatching task is executed using the order dispatching strategy, human evaluation data is collected, the quality of the current order dispatching is evaluated through a reward signal, and the evaluator is then trained using the reward signal and the temporal-difference error, thereby training a new order dispatching strategy.
An order dispatching system based on interactive reinforcement learning includes:
a modeling module, used to train and model the order dispatching task;
an imitating-human-demonstration training module, used for training, through autonomous learning, to imitate the order dispatching strategy behavior demonstrated by humans;
an imitating-human-interference training module, used for training, through autonomous learning, to imitate the behavior of human intervention;
an imitating-human-evaluation training module, used for training, through autonomous learning, to imitate the behavior of human evaluation feedback;
a reinforcement training module, used for pure reinforcement learning training, so as to obtain the optimal order dispatching strategy.
The beneficial effects of the present invention are as follows: compared with the prior art, the present invention introduces human-computer interaction into the autonomous learning process and integrates the human-computer interaction modes of human demonstration, interference, and evaluation. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the autonomous learning results shifts the learning process toward a better order dispatching strategy and accelerates learning, thereby obtaining the optimal order dispatching strategy.
Fig. 1 is a block diagram of the steps of the order dispatching method based on interactive reinforcement learning of the present invention;
Fig. 2 is a flow diagram of the order dispatching method based on interactive reinforcement learning of the present invention;
Fig. 3 is a flow diagram of the learning and execution of imitating-human-demonstration training of the present invention;
Fig. 4 is a flow diagram of the learning and execution of imitating-human-interference training of the present invention;
Fig. 5 is a flow diagram of the learning and execution of imitating-human-evaluation training of the present invention.
Detailed Description of the Embodiments
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention, not to limit it.
As shown in Fig. 1, the present invention provides an order dispatching method based on interactive reinforcement learning, comprising the following steps:
Step S1: model the order dispatching task for imitation training;
Step S2: provide demonstration instances of order dispatching that imitate human behavior as sequences of states and actions, and imitate the order dispatching strategy behavior demonstrated by humans through autonomous learning;
Step S3: imitate the behavior of human intervention through autonomous learning when entering a catastrophic state or an error state with which humans are not satisfied;
Step S4: imitate the behavior of human evaluation feedback through autonomous learning;
Step S5: enter the pure reinforcement learning stage for training, so as to obtain the optimal order dispatching strategy.
In the present invention, human-computer interaction is introduced into the autonomous learning process, and the human-computer interaction modes of human demonstration, interference, and evaluation are integrated. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the autonomous learning results shifts the learning process toward a better order dispatching strategy and accelerates learning, so as to obtain the optimal order dispatching strategy.
As shown in Fig. 2, for the order dispatching task, the optimal order dispatching strategy is first trained using the reinforcement learning method, and the interactive reinforcement learning training process is executed iteratively for the human-computer interaction modes of human demonstration, interference, and evaluation that appear during training: learn from human demonstrations, learn from human interference, learn from human evaluations, and finally enter the pure reinforcement learning stage to continue training.
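The overall flow of Fig. 2 can be summarized as a control loop that cycles through the three interaction modes before a final pure reinforcement learning phase. The Python sketch below illustrates only that control flow; the function names and the way human data availability is queried are assumptions introduced here, not names used in the patent.

```python
from typing import Any, Callable

def cycle_of_hrl(policy: Any,
                 has_new_data: Callable[[str], bool],
                 train_phase: Callable[[str, Any], Any],
                 train_pure_rl: Callable[[Any], Any]) -> Any:
    """Cycle through the interactive phases (steps S2-S4), then pure RL (step S5).

    has_new_data(mode)        -- True while new human data of that mode exists
    train_phase(mode, policy) -- runs one interactive training phase and returns
                                 the updated order dispatching policy
    """
    modes = ("demonstration", "interference", "evaluation")
    while any(has_new_data(m) for m in modes):
        for mode in modes:
            if has_new_data(mode):
                policy = train_phase(mode, policy)
    return train_pure_rl(policy)  # step S5: continue with pure reinforcement learning
```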
In step S1, the order dispatching task is modeled with a Markov decision process and constructed as a Markov model G of N agents, G = (N, S, A, P, R, γ), where N, S, A, P, R, and γ are, respectively, the number of agents, the state set, the joint action space, the transition probability function, the reward function, and the discount factor, defined as follows:
1. Agent: an idle vehicle is regarded as an agent, and vehicles at the same space-time node are homogeneous, that is, vehicles located in the same area within the same time interval are regarded as the same agent (with the same policy).
2. State: at any moment, considering the spatial distribution of idle vehicles and orders (i.e., the number of available vehicles and orders in each grid) and the current time, all agents share the same global state. The state of agent i (the i-th agent) is defined by the identifier of the grid it is in together with the shared global state.
3. Action: the joint action indicates the allocation strategy of all available vehicles at the same time; the action space of a single agent specifies the position the agent can reach next, represented as a set of 7 discrete actions: the first six discrete actions assign the agent to one of its six adjacent grids, and the last discrete action keeps it in the current grid.
4. Reward function: all agents in the same grid share the same reward function, and agent i tries to maximize its discounted reward. The individual reward associated with agent i's action is defined as the average reward of all agents that arrive in the same grid as agent i at the same time; individual rewards at the same time and place are identical.
5. State transition probability: it gives the probability of transitioning to the next state when a joint action is taken in the current state; although the action itself is deterministic, new vehicles and orders appear in different grids at each step, and existing vehicles go offline through a random process.
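The Markov model above can be made concrete with a small simulation sketch. The Python code below is only an illustration under assumed details: the grid adjacency is supplied by the caller, and the offline probability, the reward based on served orders, and the order-arrival distribution are illustrative values that do not come from the patent.

```python
import random
from dataclasses import dataclass

STAY = 6  # actions 0-5: move to one of the six adjacent grids; 6: stay put

@dataclass
class DispatchState:
    time_step: int
    idle_vehicles: dict  # grid_id -> number of idle vehicles
    open_orders: dict    # grid_id -> number of unserved orders

@dataclass
class GridWorld:
    neighbors: dict             # grid_id -> list of its six adjacent grid_ids
    gamma: float = 0.95         # discount factor of the Markov model
    offline_prob: float = 0.05  # chance a vehicle goes offline after moving

    def step(self, state: DispatchState, joint_action: dict):
        """joint_action: grid_id -> list of actions, one per idle vehicle there."""
        arrivals = {g: 0 for g in self.neighbors}
        for g, actions in joint_action.items():
            for a in actions:
                dest = g if a == STAY else self.neighbors[g][a]
                if random.random() > self.offline_prob:  # vehicle stays online
                    arrivals[dest] += 1
        rewards = {}
        for g in self.neighbors:
            served = min(arrivals[g], state.open_orders.get(g, 0))
            # agents arriving in the same grid at the same time share the
            # average reward of that grid
            rewards[g] = served / arrivals[g] if arrivals[g] else 0.0
        next_state = DispatchState(state.time_step + 1, arrivals, self._new_orders())
        return next_state, rewards

    def _new_orders(self) -> dict:
        # new orders appear in random grids: the stochastic part of the transition
        return {g: random.randint(0, 3) for g in self.neighbors}
```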
In the present invention, the reinforcement learning algorithm can be trained with the actor-critic method (centralized training, decentralized execution): all agents share a central critic that evaluates the order dispatching actions and updates the order dispatching policy, while during execution the agents independently follow their learned policies, without needing the centralized critic.
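A minimal sketch of this centralized-training, decentralized-execution layout is shown below. The use of PyTorch, the network shapes, and the hidden sizes are assumptions made for illustration; the patent does not specify a particular network architecture.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 32, 7  # 6 neighbor grids + stay; the state size is assumed

class Actor(nn.Module):
    """Policy network executed independently by every (homogeneous) agent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)  # action probabilities

class CentralCritic(nn.Module):
    """Shared value estimator, used only during centralized training."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, global_state):
        return self.net(global_state)

actor, critic = Actor(), CentralCritic()

def act(state: torch.Tensor) -> int:
    """Decentralized execution: sample an action without touching the critic."""
    with torch.no_grad():
        return torch.multinomial(actor(state), num_samples=1).item()
```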
In the present invention, in step S2, when human demonstration data is generated, demonstration records of order dispatching are collected from the human demonstrations, and Gaussian regression is used to train on the human demonstration data, so as to train an order dispatching strategy; if new human demonstration data is generated, the training that imitates the human demonstration data is repeated.
Specifically, learning from human demonstrations trains the autonomous learning process to imitate human behavior in the order dispatching task: a human acts as a demonstrator, providing demonstration instances of order dispatching as sequences of states and actions; using these demonstrations, the autonomous learning process imitates the order dispatching policy (a mapping from states to actions) demonstrated by humans. Learning from demonstrations provides a more direct path to the expected order dispatching behaviors and converges quickly to more stable order dispatching behavior.
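One way to realize the Gaussian-regression fit of the demonstration data is sketched below with scikit-learn. The kernel choice, the one-hot target encoding, and the softmax-based confidence CP(s) are illustrative assumptions, not choices stated in the patent.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

N_ACTIONS = 7

def fit_demo_policy(states: np.ndarray, actions: np.ndarray) -> GaussianProcessRegressor:
    """states: (n, d) demonstrated observations; actions: (n,) integers in [0, 7)."""
    one_hot = np.eye(N_ACTIONS)[actions]          # one regression target per action
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
    gp.fit(states, one_hot)
    return gp

def demo_action_and_confidence(gp: GaussianProcessRegressor, state: np.ndarray):
    """Return pi_m's action and a scalar confidence CP(s) for a single state."""
    scores = gp.predict(state.reshape(1, -1))[0]  # predicted score per action
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # softmax as a rough confidence
    return int(np.argmax(probs)), float(probs.max())
```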
As shown in Fig. 3, first, the order dispatching strategy πrl is trained using the reinforcement learning method. When human demonstration data is generated, demonstration records of order dispatching are collected from the human demonstrations; these demonstration data (the observations received and the actions taken) constitute the initial human dataset. Gaussian regression is used to train on the demonstration data, yielding an order dispatching strategy πm trained from the human demonstration data. An ε-greedy strategy is then used (ε-greedy trades off exploration and exploitation based on a probability: on each attempt, explore with probability ε and exploit the learned strategy with probability 1 - ε): a random number rand is generated; if rand <= ε, random exploration is performed, that is, orders are randomly assigned to idle drivers; if rand > ε, CP(s) and CQ(s) are normalized and scaled with the hyperbolic tangent function, and the scaled confidences are compared. If the confidence of πm is greater than the confidence of πrl, the order dispatching action is executed according to strategy πm; otherwise it is executed according to strategy πrl. Here CP(s) is the confidence of the learned strategy, that is, the confidence of the order dispatching strategy πm trained from the human demonstration data, and CQ(s) is the confidence of following the current self-learned strategy, that is, the confidence of the order dispatching strategy πrl that the agent has learned autonomously by interacting with the environment. After the order dispatching action is executed, the order dispatching strategy πrl is updated according to the environment state and the reward fed back. If new human demonstration data is generated, the above steps are repeated; otherwise, the trained order dispatching strategy πrl is output.
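The action-selection rule of Fig. 3 can be summarized in a few lines. The normalization constants below are assumed for illustration; the patent only states that CP(s) and CQ(s) are normalized and then scaled with the hyperbolic tangent before comparison.

```python
import random
import numpy as np

def select_policy(cp: float, cq: float, epsilon: float = 0.1,
                  cp_scale: float = 1.0, cq_scale: float = 1.0) -> str:
    """Decide which policy acts in the current state: 'random', 'pi_m', or 'pi_rl'."""
    if random.random() <= epsilon:
        return "random"                 # explore: assign the order to a random idle driver
    cp_scaled = np.tanh(cp / cp_scale)  # normalize, then squash with tanh
    cq_scaled = np.tanh(cq / cq_scale)
    return "pi_m" if cp_scaled > cq_scaled else "pi_rl"

# Example: a confident demonstration policy outweighs an uncertain RL policy.
print(select_policy(cp=0.8, cq=0.3, epsilon=0.0))  # -> pi_m
```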
In the present invention, in step S3, the order dispatching task is executed using the order dispatching strategy trained in step S2; when human interference data is generated, the action data of the human interference with order dispatching is collected, and the executed order dispatching action is changed according to that data, thereby training a new order dispatching strategy; if new human interference data is generated, the training that imitates the human interference data is repeated. Further, when human interference data is generated, the interference action data is collected, the executed order dispatching action is changed accordingly, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to update the order dispatching strategy and train a new order dispatching strategy.
Specifically, when the autonomous learning process is about to enter a catastrophic state or an error state with which humans are not satisfied, the human plays the role of a supervisor and intervenes, which can prevent or mitigate catastrophic behavior and avoid erroneous behavior that humans are not satisfied with. As shown in Fig. 4, first, the learning process executes the order dispatching task according to the order dispatching strategy πrl that the agent has learned autonomously by interacting with the environment or has learned from human demonstrations; when human interference data is generated, the action data of the human interference with order dispatching is collected, and the executed order dispatching action is changed according to that data; an intervention reward is computed from the degree of human interference, and this reward signal and its associated temporal-difference error are used to train a value function (the evaluator, or critic) that evaluates the actions taken by the actor; the actor-critic policy gradient method is then used to update the order dispatching strategy πrl. If new human interference data is generated, the above steps are repeated; otherwise, the trained order dispatching strategy πrl is output.
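A minimal sketch of one such update is given below, reusing the illustrative actor and critic networks from the earlier sketch. How the intervention reward is derived from the degree of interference is not specified in detail by the patent; the simple fixed penalty used here is an assumption.

```python
import torch

def interference_update(actor, critic, opt_actor, opt_critic,
                        state, agent_action, human_action, next_state,
                        env_reward, gamma=0.95, penalty=1.0):
    """One actor-critic update after a (possible) human override of the action."""
    overridden = float(agent_action != human_action)
    reward = env_reward - penalty * overridden        # assumed intervention reward
    td_error = (reward + gamma * critic(next_state).detach()
                - critic(state))                      # temporal-difference error
    opt_critic.zero_grad()
    (td_error ** 2).mean().backward()                 # fit the critic (value function)
    opt_critic.step()
    opt_actor.zero_grad()
    log_prob = torch.log(actor(state)[human_action])  # the action actually executed
    (-td_error.detach() * log_prob).backward()        # actor-critic policy gradient
    opt_actor.step()
```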
In the present invention, in step S4, the order dispatching task is executed using the order dispatching strategy trained in step S3; when human evaluation data is generated, the human evaluation data is collected, and the order dispatching actions are evaluated through a reward signal, so as to train a new order dispatching strategy; if new human evaluation data is generated, the training that imitates human evaluation is repeated. Further, the order dispatching task is executed using the order dispatching strategy, human evaluation data is collected, the quality of the current order dispatching is evaluated through a reward signal, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to train a new order dispatching strategy.
Specifically, when learning from human evaluations, humans act as supervisors, providing real-time evaluations (or critiques) that interactively shape the behavior of the autonomous learning process; this leverages human domain knowledge and intent to shape the agent's actions through sparse interaction in the form of evaluative feedback. When learning from human evaluations, the human only needs to understand the task goal, not how the task is specifically performed, which minimizes the burden on the human evaluator. As shown in Fig. 5, first, the learning process executes the order dispatching task according to the order dispatching strategy πrl that the agent has learned autonomously through interaction with the environment or has learned from human interference; when human evaluation data is generated, the human plays the role of a supervisor, the system's actions are evaluated through a reward signal, and the human evaluation data is collected; similar to the learning-from-human-intervention stage, the quality of the current order dispatching is evaluated through a reward signal, and this reward signal and its associated temporal-difference error are used to update the evaluator's value function and the policy. If new human evaluation data is generated, the above steps are repeated; otherwise, the trained order dispatching strategy πrl is output.
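The evaluation stage can be sketched as the loop below: the policy keeps dispatching, and whenever the supervisor provides a rating it is used as the reward for a temporal-difference update of the critic and an actor-critic policy update. The environment interface, the rating scale, and the optimizers are assumptions introduced for illustration.

```python
import torch

def run_evaluation_phase(env, actor, critic, opt_actor, opt_critic,
                         get_human_rating, n_steps=1000, gamma=0.95):
    """Dispatch with the current policy; learn from sparse human ratings."""
    state = env.reset()
    for _ in range(n_steps):
        action = torch.multinomial(actor(state), num_samples=1).item()
        next_state, _ = env.step(action)
        rating = get_human_rating(state, action)       # None when no feedback is given
        if rating is not None:                         # sparse evaluative feedback
            td_error = (rating + gamma * critic(next_state).detach()
                        - critic(state))
            opt_critic.zero_grad()
            (td_error ** 2).mean().backward()          # update the value function
            opt_critic.step()
            opt_actor.zero_grad()
            log_prob = torch.log(actor(state)[action])
            (-td_error.detach() * log_prob).backward() # shape the policy
            opt_actor.step()
        state = next_state
```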
In the present invention, in order to integrate the three different human-computer interaction modes, the actor-critic training method (centralized training, decentralized execution) is used. Initially only the actor is trained; the critic is then added. The actor is trained during the learning-from-human-demonstration stage, and both the actor and the critic are trained during the learning-from-human-interference stage. The critic trained during the learning-from-human-evaluation stage then assumes the role of supervisor, and finally the actor and the critic are combined in a standard actor-critic reinforcement learning architecture driven by a learned reward model.
In the present invention, the order dispatching method based on interactive reinforcement learning may be called the interactive learning method Cycle-of-HRL (HRL: interactive reinforcement learning), which achieves efficient order dispatching. The method is based on interactive reinforcement learning, fuses the multiple human-computer interaction modes of learning from human demonstrations, learning from human interference, and learning from human evaluations, and formulates an integrated scheme for these interaction modes.
In the present invention, it is also possible to switch among the three modes of imitating human demonstration, interference, and evaluation, using performance metrics, data modality limitations, and an advantage function to define the switching criteria for the three human-computer interaction modes. Performance metrics: once the order dispatching strategy reaches a certain level, predefined performance metrics can indicate when to switch the interaction mode; alternatively, when no improvement in system performance is observed, the people interacting with the system can switch between interaction modes manually. Data modality limitations: depending on the task, the number of data modalities for which humans can provide demonstrations, interventions, or evaluations may be limited, in which case the interactive learning method Cycle-of-HRL switches between modes according to data availability. Advantage function: after the reward function is trained, the advantage function (the difference between the state-action value function Q(s, a) and the state value function V(s), which compares the expected return of a given state-action pair with the expected return in that state) can be computed and used to compare expected returns between human and system actions. With this information, Cycle-of-HRL can switch the interaction mode when the system's advantage exceeds the human's advantage.
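The advantage-based switching criterion can be written as a small check. The Q and V callables are assumed to come from the trained reward/value models; their exact form is not specified in the patent.

```python
def should_switch_mode(q_fn, v_fn, state, system_action, human_action) -> bool:
    """Switch interaction mode once the system's advantage exceeds the human's."""
    advantage_system = q_fn(state, system_action) - v_fn(state)  # A(s, a_system)
    advantage_human = q_fn(state, human_action) - v_fn(state)    # A(s, a_human)
    return advantage_system > advantage_human
```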
The present invention also provides an order dispatching system based on interactive reinforcement learning, including:
a modeling module, used to train and model the order dispatching task;
an imitating-human-demonstration training module, used for training, through autonomous learning, to imitate the order dispatching strategy behavior demonstrated by humans;
an imitating-human-interference training module, used for training, through autonomous learning, to imitate the behavior of human intervention;
an imitating-human-evaluation training module, used for training, through autonomous learning, to imitate the behavior of human evaluation feedback;
a reinforcement training module, used for pure reinforcement learning training, so as to obtain the optimal order dispatching strategy.
In the present invention, human-computer interaction is introduced and the multiple human-computer interaction modes of human demonstration, interference, and evaluation are integrated, so as to reduce the search space, accelerate the learning process, and improve accuracy. Learning from human demonstrations, the real data demonstrated by humans better simulates real order dispatching scenarios; learning from human interference, the agent's behavior is controlled when wrong actions occur during autonomous learning, avoiding erroneous results; learning from human evaluations, manual evaluation of the self-learning results shifts the learning process toward a better order dispatching strategy and accelerates learning. The present invention also formulates an integration scheme for the three interaction modes and designs criteria for switching among them.
The above description is only an embodiment of the present invention and is not intended to limit the patent scope of the present invention. Any equivalent structural or equivalent process transformation made using the contents of the description and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

  1. 一种基于交互式强化学习的订单分派方法,其特征在于,包括如下步骤:An order dispatching method based on interactive reinforcement learning, characterized in that it comprises the following steps:
    步骤S1、对订单分派任务建模进行模仿训练;Step S1, imitating training for order dispatch task modeling;
    步骤S2、在状态和动作的序列方面提供模仿人类行为的订单分派的演示实例,通过自主学习模仿人类演示的订单分派策略行为;Step S2, providing a demonstration instance of order dispatch imitating human behavior in terms of the sequence of states and actions, and imitating the order dispatch strategy behavior demonstrated by human through autonomous learning;
    步骤S3、在进入灾难性状态或人类不满意的错误状态中,通过自主学习模仿人类干预的行为;Step S3, imitating the behavior of human intervention through autonomous learning when entering a catastrophic state or an error state that human beings are not satisfied with;
    步骤S4、通过自主学习模仿人类的评价反馈的行为;Step S4, imitating the behavior of human evaluation feedback through autonomous learning;
    步骤S5、进入纯强化学习阶段进行训练,从而得到最优的订单分派策略。Step S5, enter the pure reinforcement learning stage for training, so as to obtain the optimal order dispatching strategy.
  2. 根据权利要求1所述的一种基于交互式强化学习的订单分派方法,其特征在于,在步骤S2内,当产生了人类演示数据时,从人类的演示中收集订单分派的演示记录,使用高斯回归训练该人类演示数据,从而训练出订单分派策略。The method for order dispatching based on interactive reinforcement learning according to claim 1, wherein in step S2, when human demonstration data is generated, a demonstration record of order dispatching is collected from the human demonstration, and a Gaussian demonstration record is used. Regress to train the human demo data to train an order dispatch policy.
  3. 根据权利要求2所述的一种基于交互式强化学习的订单分派方法,其特征在于,在步骤S2内,如果产生了新的人类演示数据,则重复进行模仿人类演示数据的训练。The method for dispatching orders based on interactive reinforcement learning according to claim 2, wherein in step S2, if new human demonstration data is generated, the training of imitating human demonstration data is repeated.
  4. 根据权利要求3所述的一种基于交互式强化学习的订单分派方法,其特征在于,在步骤S3内,经过从步骤S2训练出的订单分派策略执行订单分派任务,当产生人类干扰数据时,收集人类干扰订单分派的动作数据,按该动作数据改变执行订单分派的动作,从而训练出新的订单分派策略。An order dispatching method based on interactive reinforcement learning according to claim 3, characterized in that, in step S3, the order dispatching task is executed through the order dispatching strategy trained from step S2, and when human interference data is generated, Collect the action data of human interference in order dispatching, and change the action of executing order dispatching according to the action data, so as to train a new order dispatching strategy.
  5. 根据权利要求4所述的一种基于交互式强化学习的订单分派方法,其特征在于,在步骤S3内,如果产生了新的人类干扰数据,则重复进行模仿人类干扰数据的训练。An order dispatch method based on interactive reinforcement learning according to claim 4, characterized in that, in step S3, if new human interference data is generated, the training of imitating human interference data is repeated.
  6. The order dispatching method based on interactive reinforcement learning according to claim 5, characterized in that, in step S4, the order dispatching task is executed using the order dispatching strategy trained in step S3; when human evaluation data is generated, the human evaluation data is collected and the actions of the order dispatching task are evaluated through a reward signal, thereby training a new order dispatching strategy.
  7. The order dispatching method based on interactive reinforcement learning according to claim 6, characterized in that, in step S4, if new human evaluation data is generated, the training that imitates the human evaluation is repeated.
  8. The order dispatching method based on interactive reinforcement learning according to claim 5, characterized in that, in step S3, when human intervention data is generated, the action data of the human intervention in order dispatching is collected, the action executed for order dispatching is changed according to that action data, and the evaluator is then trained using the reward signal and the temporal-difference error, so as to update the order dispatching strategy and thereby train a new order dispatching strategy.
  9. The order dispatching method based on interactive reinforcement learning according to claim 7, characterized in that, in step S4, the order dispatching task is executed using the order dispatching strategy, the human evaluation data is collected, the quality of the current order dispatching behavior is evaluated through a reward signal, and the evaluator is then trained using the reward signal and the temporal-difference error, thereby training a new order dispatching strategy.
  10. An order dispatching system based on interactive reinforcement learning, characterized in that it comprises:
    a modeling module, configured to model the order dispatching task for training;
    a human demonstration imitation training module, configured to imitate, through autonomous learning, the order dispatching strategy behavior demonstrated by humans;
    a human intervention imitation training module, configured to imitate, through autonomous learning, the behavior of human intervention;
    a human evaluation imitation training module, configured to imitate, through autonomous learning, the behavior of human evaluative feedback;
    a reinforcement training module, configured for pure reinforcement learning training, so as to obtain the optimal order dispatching strategy.
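As a further illustration, the two concrete learning mechanisms named in the claims above, Gaussian regression over human demonstration records (claim 2) and training of the evaluator from a reward signal and a temporal-difference error (claims 8 and 9), could be sketched as follows. This is a minimal sketch assuming a scikit-learn Gaussian process regressor and a linear state-value evaluator; the feature encodings, hyperparameters, and function names are assumptions and are not taken from the patent.

```python
# Hypothetical sketch of the learning mechanisms referenced in claims 2, 8 and 9.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def fit_demonstration_strategy(demo_states, demo_actions):
    """Claim 2 (sketch): fit a Gaussian process regressor to human demonstration
    records, mapping dispatch states to the demonstrated dispatch actions."""
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
    gp.fit(demo_states, demo_actions)   # demo_states: (N, d), demo_actions: (N, k)
    return gp                           # gp.predict(s) proposes an action for state s


def td_update(critic_weights, state_feat, next_state_feat, reward,
              gamma=0.99, lr=0.01):
    """Claims 8/9 (sketch): one temporal-difference update of a linear
    state-value evaluator; `reward` combines the environment reward with the
    human evaluation signal."""
    v = state_feat @ critic_weights
    v_next = next_state_feat @ critic_weights
    td_error = reward + gamma * v_next - v
    critic_weights = critic_weights + lr * td_error * state_feat
    return critic_weights, td_error     # td_error then weights the strategy update


if __name__ == "__main__":
    # Illustrative run with random placeholder data (not real dispatch data).
    rng = np.random.default_rng(0)
    states, actions = rng.random((20, 8)), rng.random((20, 3))
    strategy = fit_demonstration_strategy(states, actions)
    w = np.zeros(8)
    w, delta = td_update(w, rng.random(8), rng.random(8), reward=1.0)
```

In such a scheme the human intervention action of claim 8 would replace the agent's own action before the transition is recorded, the human evaluation of claim 9 would be folded into the reward term, and the resulting temporal-difference error would in turn weight the update of the dispatching strategy, which is why a new strategy is trained whenever new human data arrives.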
PCT/CN2020/139231 2020-12-10 2020-12-25 Method and system for order dispatch based on interactive reinforcement learning WO2022120970A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011432387.X 2020-12-10
CN202011432387.XA CN112396501B (en) 2020-12-10 2020-12-10 Order dispatching method and system based on interactive reinforcement learning

Publications (1)

Publication Number Publication Date
WO2022120970A1

Family ID: 74624981

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/139231 WO2022120970A1 (en) 2020-12-10 2020-12-25 Method and system for order dispatch based on interactive reinforcement learning

Country Status (2)

Country Link
CN (1) CN112396501B (en)
WO (1) WO2022120970A1 (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11514543B2 (en) * 2018-06-05 2022-11-29 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for ride order dispatching

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190318267A1 (en) * 2018-04-12 2019-10-17 Baidu Usa Llc System and method for training a machine learning model deployed on a simulation platform
CN109858574A (en) * 2018-12-14 2019-06-07 启元世界(北京)信息技术服务有限公司 The autonomous learning method and system of intelligent body towards man-machine coordination work
CN109733415A (en) * 2019-01-08 2019-05-10 同济大学 A kind of automatic Pilot following-speed model that personalizes based on deeply study
CN110070188A (en) * 2019-04-30 2019-07-30 山东大学 A kind of increment type cognitive development system and method merging interactive intensified learning
CN110213796A (en) * 2019-05-28 2019-09-06 大连理工大学 A kind of intelligent resource allocation methods in car networking
CN111080408A (en) * 2019-12-06 2020-04-28 广东工业大学 Order information processing method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN112396501A (en) 2021-02-23
CN112396501B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
Yao et al. Path planning method with improved artificial potential field—a reinforcement learning perspective
Martínez et al. Relational reinforcement learning with guided demonstrations
Sutton et al. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning
CN112990485A (en) Knowledge strategy selection method and device based on reinforcement learning
Vilarinho et al. Design of a multiagent system for real-time traffic control
Zhu et al. Learning by reusing previous advice in teacher-student paradigm
Kim et al. Optimizing large-scale fleet management on a road network using multi-agent deep reinforcement learning with graph neural network
Shamshirband A distributed approach for coordination between traffic lights based on game theory.
Gu et al. A human-centered safe robot reinforcement learning framework with interactive behaviors
Chen et al. When shall i be empathetic? the utility of empathetic parameter estimation in multi-agent interactions
WO2022120970A1 (en) Method and system for order dispatch based on interactive reinforcement learning
Nguyen et al. Apprenticeship bootstrapping
Guan et al. Ab-mapper: Attention and bicnet based multi-agent path planning for dynamic environment
Wu et al. Path planning for autonomous mobile robot using transfer learning-based Q-learning
Reichstaller et al. Transferring context-dependent test inputs
Zhan et al. Generative adversarial inverse reinforcement learning with deep deterministic policy gradient
Villagra et al. Behavior planning
Guan et al. AB-Mapper: Attention and BicNet Based Multi-agent Path Finding for Dynamic Crowded Environment
Broz Planning for human-robot interaction: representing time and human intention
Yu et al. Self-Supervised Imitation for Offline Reinforcement Learning With Hindsight Relabeling
Maheswaran et al. Human-agent collaborative optimization of real-time distributed dynamic multi-agent coordination
Liu et al. Blending Imitation and Reinforcement Learning for Robust Policy Improvement
Tang et al. Towards schema-based, constructivist robot learning: Validating an evolutionary search algorithm for schema chunking
Tang et al. Improved Bayesian inverse reinforcement learning based on demonstration and feedback
Lewis Adaptive representation in a behavior-based robot: An extension of the copycat architecture

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20964899; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20964899; Country of ref document: EP; Kind code of ref document: A1)