WO2020244081A1 - Constrained spatiotemporal contextual bandits for real-time ride-hailing recommendation - Google Patents

Constrained spatiotemporal contextual bandits for real-time ride-hailing recommendation

Info

Publication number
WO2020244081A1
Authority
WO
WIPO (PCT)
Prior art keywords
hailing
time
data
budget
discount
Prior art date
Application number
PCT/CN2019/104790
Other languages
French (fr)
Inventor
Qingyang Li
Mengyue YANG
Zhiwei QIN
Jieping Ye
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd.
Publication of WO2020244081A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/02 Reservations, e.g. for tickets, services or events
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C 21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C 21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C 21/34 Route searching; Route guidance
    • G01C 21/3407 Route searching; Route guidance specially adapted for specific applications
    • G01C 21/3438 Rendez-vous, i.e. searching a destination where several users can meet, and the routes to this destination for these users; Ride sharing, i.e. searching a route such that at least two users can share a vehicle for at least part of the route
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0207 Discounts or incentives, e.g. coupons or rebates
    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G 1/00 Traffic control systems for road vehicles
    • G08G 1/20 Monitoring the location of vehicles belonging to a group, e.g. fleet of vehicles, countable or determined number of vehicles
    • G08G 1/202 Dispatching vehicles on the basis of a location, e.g. taxi dispatching
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/30 Services specially adapted for particular environments, situations or purposes
    • H04W 4/40 Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/02 Services making use of location information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W 4/02 Services making use of location information
    • H04W 4/024 Guidance services

Definitions

  • the present disclosure generally relates to ride-hailing recommendation, and more specifically, to methods and systems for ride-hailing recommendation based on constrained spatiotemporal contextual bandits.
  • a vehicle dispatch platform can automatically receive ride-hailing requests from user devices (passenger side), provide price quotes, and upon user acceptances, allocate the ride-hailing requests to devices of vehicle drivers (driver side) for providing respective transportation services.
  • For the platform, it has been challenging to optimize the distribution of limited promotional resources in order to maximize the number of accepted ride-hailing orders.
  • the distribution often involves real-time activity recommendations to the passenger side. For example, when a passenger logs in to the vehicle dispatch platform from the passenger side to check a ride-hailing price (also referred to as passenger bubbling), the platform may send a discount coupon to the passenger (e.g., directly applied to the price) to encourage ordering.
  • the difficulties of the real-time activity recommendation come from many aspects.
  • For example, the time and location of each instance of passenger bubbling (e.g., when a user fills in the destination inquiry and chooses a service mode associated with a corresponding pricing tier) vary from user to user.
  • the numbers of drivers in different geographical locations differ. If a passenger bubbles at a location with sparse drivers, the order may not be completed even if a discount coupon is issued and applied.
  • budget planning in the space-time dimension can be difficult for the platform.
  • the number of times that the platform issues discount coupons or the coupon budget each day is limited, and the revenue generated by passenger bubbling in different time and space dimensions may be different.
  • for example, issuing coupons at night may generate a higher revenue than during the day.
  • As another example, the current daily coupon budget may have been exhausted by the afternoon.
  • Various embodiments of the present disclosure include systems, methods, and non-transitory computer readable media for ride-hailing recommendation.
  • a recommendation method may include obtaining historical ride hailing data, and training a model with the historical ride hailing data to obtain a trained model.
  • the method may further include receiving a request for a ride-hailing service from a user device associated with a user accessing an online platform, and obtaining a time, a location, and a promotion budget associated with the request.
  • the method may further include inputting the obtained time, location, and promotion budget to the trained model to determine a price discount for the request, and transmitting the determined price discount to the user device to notify the user.
  • a promotion budget may include a remaining budget amount with respect to the time within a period and with respect to the location.
  • a model may include a reinforcement learning algorithm based on an action of discount allocation and a policy of maximizing a total number of orders completed through the online platform within a period.
  • the discount allocation may be subject to a fixed budget ceiling for the period.
  • the trained model may be a LinUCB-Adaptive-Linear-Programming (LinUCB-ALP) algorithm.
  • training a model with historical ride hailing data to obtain the trained model may include pre-training the model with a portion of the historical ride hailing data to obtain a pre-trained model.
  • the pre-trained model may be trained with another portion of the historical ride hailing data to obtain the trained model.
  • the pre-trained model may be a Lin-upper-confidence-bound (LinUCB) algorithm.
  • a model may include a multi-armed bandits adaptive linear programming algorithm with an infinite contextual space. Training the model with historical ride hailing data to obtain a trained model may include learning the infinite contextual space through a Gaussian Mixture Model.
  • historical ride hailing data may include one or more dimensions.
  • the dimensions may include a background user information dimension, a real-time user information dimension, a weather dimension, a spatial dimension, a temporal dimension, and a discount distribution dimension.
  • obtaining the time, the location, and the promotion budget associated with the request may include obtaining the time, the location, the promotion budget, background user information, real-time user information, and weather data associated with the request.
  • Inputting the obtained time, location, and promotion budget to the trained model to determine the price discount may include inputting the obtained time, location, promotion budget, background user information, real-time user information, and weather data to the trained model to determine the price discount.
  • the background user information dimension may include one or more attributes including gender, registration date, registration location, application login history, and ride-hailing order history.
  • the real-time user information dimension may include one or more attributes including number of recently completed orders, distance travelled for a recent order, time of a recent order, and price paid for a recent order.
  • the weather dimension may include one or more attributes including humidity, precipitation, wind, UV metric, air pollution metric, and weather condition.
  • historical ride hailing data may correspond to a geographical area mapped into a plurality of grids.
  • the spatial dimension may include one or more attributes.
  • the attributes may include a grid index of the one grid, a number of vehicles in the one grid, a number of ride-hailing orders accepted in the one grid, a number of ride-hailing orders completed in the one grid, and waiting time for ride-hailing orders requested in the one grid.
  • the temporal dimension may include one or more attributes including month, day-of-the-week, time-in-the-day, and peak-or-off-peak-hour.
  • the discount distribution dimension may include one or more attributes including whether discount was offered, discount type offered, offered price discount and whether discount was used.
  • the price discount may comprise: no discount or a nonzero discount.
  • the trained model may define a plurality of arms, and training the model may include evaluating a spatiotemporal distribution based on the historical ride hailing data. Training the model may further include defining a linear payoff function, and determining a reward expectation of each arm of the plurality of arms. Training the model may further include selecting, among the plurality of arms, an arm with the highest reward expectation. The trained model may be trained based on the spatiotemporal distribution, the linear payoff function, and the selected arm.
  • a recommendation system comprises one or more processors and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform the method of any of the preceding embodiments.
  • a non-transitory computer-readable storage medium is configured with instructions executable by one or more processors to cause the one or more processors to perform the method of any of the preceding embodiments.
  • a recommendation apparatus comprises a plurality of modules for performing the method of any of the preceding embodiments.
  • a recommendation system may include one or more processors and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform operations.
  • the operations may include obtaining historical ride hailing data, and training a model with the historical ride hailing data to obtain a trained model.
  • the operations may further include receiving a request for a ride-hailing service from a user device associated with a user accessing an online platform, and obtaining a time, a location, and a promotion budget associated with the request.
  • the operations may further include inputting the obtained time, location, and promotion budget to the trained model to determine a price discount for the request, and transmitting the determined price discount to the user device to notify the user.
  • a non-transitory computer-readable storage medium may be configured with instructions executable by one or more processors to cause the one or more processors to perform operations.
  • the operations may include obtaining historical ride hailing data, and training a model with the historical ride hailing data to obtain a trained model.
  • the operations may further include receiving a request for a ride-hailing service from a user device associated with a user accessing an online platform, and obtaining a time, a location, and a promotion budget associated with the request.
  • the operations may further include inputting the obtained time, location, and promotion budget to the trained model to determine a price discount for the request, and transmitting the determined price discount to the user device to notify the user.
  • FIG. 1 illustrates an exemplary system for ride-hailing recommendation, in accordance with various embodiments.
  • FIG. 2 illustrates an exemplary algorithm for allocating a budget with an empirical spatial-temporal distribution, in accordance with various embodiments.
  • FIG. 3 illustrates a flowchart of an exemplary method for ride-hailing recommendation, in accordance with various embodiments.
  • FIG. 4 illustrates a block diagram of an exemplary computing device in which various of the embodiments described herein may be implemented.
  • FIG. 5 illustrates a block diagram of an exemplary computer system in which any of the embodiments described herein may be implemented.
  • the disclosed systems and computer-implemented methods may determine real-time ride-hailing recommendations subject to spatiotemporal constraints. For example, discount coupon issuance decisions with respect to location and time under a budget constraint may be automatically made for an online ride-hailing platform to maximize the long-term benefits of the platform.
  • Two problems may need to be solved in order to maximize performance.
  • First, coupons may be issued such that there are available drivers in the geographical area to pick up riders when the coupons are used.
  • a contextual multi-armed bandit algorithm may be used to solve the problem of space-time sequence decision-making and to maximize long-term benefits.
  • Second, limited budgets may be properly allocated in the space-time dimension in order not to run out too early.
  • Constrained Spatiotemporal Contextual Bandits may be used to solve the problem of space-time sequence decision with budget constraint.
  • the disclosed systems and methods may have the technical effect of automatically determining the optimal recommendation decisions for each user in consideration of the location of the user, time of the day, and budget of the platform.
  • the contextual multi-armed bandit algorithm may be used in various recommendation scenarios.
  • the activity recommendation application in a large-scale ride-hailing platform, using a contextual bandits task with budget and spatial-temporal constraints (referred to as budget constraints bandits), is described.
  • Existing budget-constrained bandit algorithms attempt to solve the problem by simply using the budget constraint as a condition to stop training, which can lead to a quick exhaustion of the budget.
  • a situation in which some industrial settings prefer to allocate the budget uniformly over time is contemplated, because a uniform allocation captures the changes of an online environment.
  • a target may be configured to “not spend all in an early time,” for which linear programming may be used to balance instantaneous and long-term rewards.
  • Empirical Adaptive-Linear-Programming (EALP) offers a general recipe for changing the budget during an online learning process, but it requires an empirical distribution over a finite context set. In a real setting, the context set is effectively infinite, and obtaining the context distribution through empirical estimation is hardly possible because the features of every passenger differ in time and geographic location.
  • the empirical estimation in EALP may be replaced with the estimation of spatial-temporal context distribution.
  • the contextual bandit setting with an infinite context set may be combined to overcome the problem of reasonable budget allocation under spatial-temporal constraints. Because online learning in a real application environment can be very costly and unsafe, a balanced environment simulator may be trained on history logging data to make offline learning feasible.
  • the multi-armed bandit (MAB) is a sequential decision problem, in which an agent receives a random reward by playing one of K arms at each round and wants to maximize its cumulative reward.
  • the agent learns the inherent trade-off between exploration, which identifies and understands the reward from each action, and exploitation, which gathers as much reward as possible from the best-known action.
  • the observed d-dimension features may be combined with the bandit learning (referred to as contextual multi-armed bandit) to get reasonable policies.
  • Contextual bandits add contextual information to the MAB problem.
  • the corresponding algorithm may be referred to as a contextual MAB algorithm. Because of the extra information features, referring to context is necessary in many applications. To a large extent, the effect of the bandit algorithm will be improved, since it is more common to have relevant contextual information than not.
  • the agent observes a d-dimensional feature vector before making a decision. During learning, the agent learns the relationship between contexts and rewards (e.g., payoffs).
  • the decision-making process may be extended to consider the cost in real time, which is the budget-constrained MAB setting. Since the decision-making process is constrained by a budget, it may also be referred to as budget MAB. In budget MABs, playing an arm may generate consumption, and the target in this setting is to maximize the cumulative reward under a budget constraint on the total consumption. A minimal sketch of this interaction loop follows.
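  • As an illustration only, the following is a minimal sketch of a budgeted contextual MAB interaction loop. The policy and environment interfaces (select_arm, update, observe_context, step) are assumptions for exposition, not the claimed implementation:

```python
def run_budget_mab(policy, env, T, B):
    """Sketch of a budgeted contextual MAB loop: play arms over T rounds
    and stop consuming once the budget B is exhausted."""
    total_reward, budget = 0.0, B
    for t in range(T):
        x = env.observe_context()   # d-dimensional context feature vector
        if budget <= 0:             # budget exhausted: take the null action
            continue
        a = policy.select_arm(x)    # arm with the highest expected payoff
        r, c = env.step(a)          # observed reward and consumed cost
        policy.update(x, a, r)      # learn the context-reward relationship
        total_reward += r
        budget -= c
    return total_reward
```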
  • FIG. 1 illustrates an exemplary environment 100 for ride-hailing recommendation, in accordance with various embodiments.
  • the example environment 100 may include a computing system 102, a network 120, user devices 140, vehicles 150, a storage device 160, and satellites 170.
  • the computing system 102 may include one or more processors and memory (e.g., permanent memory, temporary memory).
  • the processor(s) may be configured to perform various operations by interpreting machine-readable instructions stored in the memory.
  • the computing system 102 may have access to other computing resources through network 120.
  • Network 120 may include wireless access points 130-1 and 130-2. Wireless access points 130-1 and 130-2 may allow user devices 140, vehicles 150, and satellites 170 to communicate with network 120.
  • network 120, user devices 140, and vehicles 150 may communicate with satellites 170.
  • Satellites 170 may include satellites 170-1, 170-2, and 170-3.
  • the system 102 may be configured to obtain data (e.g., location, time, and fees for multiple vehicle transportation trips) from the data store 160 (e.g., a database or dataset of historical transportation trips), the user devices 140, and vehicles 150.
  • system 102 may obtain GPS (Global Positioning System) coordinates of vehicles 150.
  • the positioning technology used in the present disclosure may be based on GPS, a global navigation satellite system (GLONASS), a compass navigation system (COMPASS), a Galileo positioning system, a quasi-zenith satellite system (QZSS), a wireless fidelity (WiFi) positioning technology, or the like, or any combination thereof.
  • the user devices 140 may include mobile device 140-1 (e.g., a smart phone, smart watch), tablet 140-2, laptop 140-3, and other computing devices 140-4 (e.g., desktop computer, server).
  • User devices 140 may be used by riders on a ride sharing platform.
  • the vehicles 150 may include vehicles 150-1, 150-2, and 150-3.
  • the vehicles 150 may include cars, bikes, scooters, trucks, boats, trains, or autonomous vehicles.
  • the vehicles 150 may include mobile devices of drivers of the vehicles. For example, communications between the computing system 102 and the vehicles 150 may take place between computing system 102 and mobile devices of the drivers. In another example, locations of vehicles 150 may correspond to the locations of the mobile devices of the drivers.
  • environment 100 may implement an online information or service platform.
  • the service platform may be referred to as a vehicle (service hailing, ride sharing, or ride order dispatching) platform.
  • the platform may accept requests for transportation, identify vehicles to fulfill the requests, arrange for pick-ups, and process transactions.
  • a user may use user device 140-1 (e.g., a mobile phone installed with a software application associated with the platform) to request transportation from the platform.
  • the system 102 may receive the request and reply with price quote data and price discount data for one or more trips.
  • the system 102 may relay trip information to various drivers of vehicles 150, for example, by posting the request to mobile phones carried by the drivers.
  • a vehicle driver may accept the posted transportation request and obtain pick-up location information. Fees such as transportation fees can be transacted among the system 102, the user devices 140, and the vehicles 150.
  • the location of the origin and destination, the price discount information, the fee, and the time can be obtained by the system 102.
  • the computing system 102 may include a historical data component 111, a model training component 112, a request receiving component 113, a budget component 114, a discount determination component 115, and a discount transmitting component 116.
  • the computing system 102 may include other components.
  • one or more of the system 102, the user devices 140, and the vehicles 150 may be integrated in a single device or system.
  • the system 102, the user devices 140, and the vehicles 150 may operate as separate devices. While the computing system 102 is shown in FIG. 1 as a single entity, this is merely for ease of reference and is not meant to be limiting.
  • One or more components or one or more functionalities of the computing system 102 described herein may be implemented in a single computing device or multiple computing devices.
  • one or more components or one or more functionalities of the computing system 102 described herein may be implemented in one or more networks (e.g., enterprise networks) , one or more endpoints, one or more servers, or one or more clouds.
  • a server may include hardware or software which manages access to a centralized resource or service in a network.
  • a cloud may include a cluster of servers and other devices which are distributed across a network.
  • the system 102 above may be installed with appropriate software (e.g., platform program, etc.) and/or hardware (e.g., wires, wireless connections, etc.) to access other devices of the environment 100.
  • the various components may correspond to various modules, and the computing system 102 may correspond to a recommendation apparatus.
  • Each module may correspond to instructions stored in a non-transitory computer-readable storage medium, and the instructions are executable by one or more processors to cause the one or more processors to perform the steps described with respect to the various components.
  • the historical data component 111 may be configured to obtain historical ride hailing data.
  • historical ride hailing data may be obtained from storage device 160.
  • historical ride hailing data may include user features. The user features may be assumed to occur independently when spatial-temporal distributions are evaluated. The features may include spatial-temporal dimensions, and those dimensions may indicate the user's location in space.
  • the historical ride hailing data may include one or more of the following dimensions: a background user information dimension, a real-time user information dimension, a weather dimension, a spatial dimension, a temporal dimension, and a discount distribution dimension.
  • the background user information dimension may include one or more attributes including gender, registration date, registration location, application login history, and ride-hailing order history.
  • the real-time user information dimension may include one or more attributes including number of recently completed orders, distance travelled for a recent order, time of a recent order, and price paid for a recent order.
  • the number of recently completed orders may include all of the orders completed within a predetermined period of time (e.g., the past hour, the past day, the past week).
  • the weather dimension may include one or more attributes including humidity, precipitation, wind, UV metric, air pollution metric, and weather condition.
  • the temporal dimension may include one or more attributes including month, day-of-the-week, time-in-the-day, and peak-or-off-peak-hour.
  • the discount distribution dimension may include one or more attributes including whether discount was offered, discount type offered, offered price discount, and whether discount was used.
  • historical ride hailing data may correspond to a geographical area mapped into a plurality of grids.
  • the spatial dimension may include one or more attributes.
  • the attributes may include a grid index of the one grid, a number of vehicles in the one grid, a number of ride-hailing orders accepted in the one grid, a number of ride-hailing orders completed in the one grid, and waiting time for ride-hailing orders requested in the one grid.
  • the attributes of the spatial dimension may correspond to a period of time.
  • the attributes may be determined at the point in time when the order was placed.
  • the attributes may be determined based on a period of time (e.g., prior minute, prior hour, prior day) associated with the order.
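  • As an illustration only, the dimensions above might be flattened into a single context vector as in the following sketch; every attribute key is a hypothetical name, not the platform's actual schema:

```python
import numpy as np

def build_context(user_bg, user_rt, weather, grid, temporal):
    """Illustrative flattening of the feature dimensions into one
    d-dimensional context vector (all keys are hypothetical)."""
    return np.array([
        user_bg["gender"], user_bg["days_since_registration"],
        user_rt["recent_completed_orders"], user_rt["recent_order_price"],
        weather["humidity"], weather["precipitation"], weather["uv_index"],
        grid["grid_index"], grid["num_vehicles"], grid["orders_completed"],
        temporal["month"], temporal["day_of_week"], temporal["hour"],
        temporal["is_peak_hour"],
    ], dtype=float)
```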
  • the model training component 112 may be configured to train a model with the historical ride hailing data to obtain a trained model.
  • the trained model may define a plurality of arms.
  • a model may include a multi-armed bandits adaptive linear programming algorithm with an infinite contextual space.
  • the multi-armed bandits adaptive linear programming algorithm may include a K-armed budget constraints contextual bandit problem.
  • training a model with historical ride hailing data to obtain the trained model may include pre-training the model with a portion of the historical ride hailing data to obtain a pre-trained model. The pre-trained model may be trained with another portion of the historical ride hailing data to obtain the trained model.
  • the pre-trained model may include a Lin-upper-confidence-bound (LinUCB) algorithm.
  • the trained model may include a LinUCB-Adaptive-Linear-Programming (LinUCB-ALP) algorithm.
  • LinUCB is an algorithm in which the confidence interval may be computed efficiently in closed form when the payoff model is linear.
  • a spatial-temporal limited-budget-allocation MAB with adaptive linear programming (ALP) may be formulated in LinUCB, similarly to ALP with the upper confidence bound (ALP-UCB).
  • Unlike in ALP-UCB, the contexts used in the MAB may be in an infinite space. For example, in many possible industry recommendation scenarios, the context may denote the combined features of passengers. The finite-context assumption of ALP-UCB does not hold in these scenarios. Thus, the context distribution may be treated as a uniform distribution.
  • training the model may include evaluating a spatiotemporal distribution based on the historical ride hailing data.
  • training the model with historical ride hailing data to obtain a trained model may include learning the infinite contextual space through a Gaussian Mixture Model (GMM) .
  • an EM-based clustering GMM may be used to learn the user bubble distribution in the spatial-temporal dimensions, where $\{g_1, \dots, g_J\}$ is set to denote $J$ different Gaussian distributions.
  • $G(x)$ may be used to find which Gaussian distribution (spatial-temporal distribution) the context $x$ belongs to, as sketched below.
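  • A minimal sketch of this step, assuming scikit-learn's GaussianMixture as the EM-based fitter and 100 components over (time, latitude, longitude); the column layout and component count are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Columns: bubble time-of-day (normalized), latitude, longitude.
X_st = np.random.rand(10000, 3)   # stand-in for historical bubble records

gmm = GaussianMixture(n_components=100, covariance_type="full",
                      random_state=0)
gmm.fit(X_st)                     # EM fit of the spatial-temporal Gaussians

pi = gmm.weights_                 # empirical probability of each category

def G(x):
    """Return which spatial-temporal Gaussian the context x belongs to."""
    return int(gmm.predict(x.reshape(1, -1))[0])
```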
  • training the model may include defining a linear payoff function, and determining a reward expectation of each arm of the plurality of arms. Training the model may further include selecting, among the plurality of arms, an arm with the highest reward expectation. The trained model may be trained based on the spatiotemporal distribution, the linear payoff function, and the selected arm.
  • In LinUCB, $D_a$ is a design matrix of dimension $m \times d$ at trial $t$, whose rows correspond to the $m$ training inputs observed for arm $a$, and $\mathbf{c}_a$ is the corresponding vector of $m$ observed rewards.
  • $I_d$ is a $d \times d$ identity matrix, and the ridge-regression estimate of the arm coefficients is $\hat{\theta}_a = (D_a^{\top} D_a + I_d)^{-1} D_a^{\top} \mathbf{c}_a$. The algorithm updates these parameters to improve the policy with each current observation $(x_t, a_t, r_t)$, as in the sketch below.
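  • A compact sketch of the standard disjoint LinUCB implied by these definitions, keeping $A_a = D_a^{\top} D_a + I_d$ and $b_a = D_a^{\top} \mathbf{c}_a$ per arm (the exploration weight alpha is a tunable assumption):

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm."""
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # D_a^T D_a + I_d
        self.b = [np.zeros(d) for _ in range(n_arms)]  # D_a^T c_a

    def select_arm(self, x):
        """Return the arm with the highest upper confidence bound."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                 # closed-form ridge estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, x, a, r):
        """Incorporate the current observation (x, a, r)."""
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x
```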
  • the budget-constrained contextual bandit can be formalized as follows. Assuming the time horizon $T$ and budget $B$ are known, the total $T$-trial payoff in the learning process is defined as $U(T, B) = \mathbb{E}\left[\sum_{t=1}^{T} r_{t, a_t}\right]$, and the total optimal payoff from the algorithm is defined as $U^{*}(T, B) = \max_{\pi} U(T, B)$. The target is to maximize the total payoff during the $T$ rounds under the constraints of the budget and the time horizon, which can be formalized as: $\max \, \mathbb{E}\left[\sum_{t=1}^{T} r_{t, a_t}\right]$ subject to $\sum_{t=1}^{T} c_{t, a_t} \le B$.
  • to solve this constrained problem, linear programming (LP) is proposed.
  • the LP function provides the policy of whether to choose or skip an action provided by the MAB policy.
  • $b$ denotes the remaining budget, and $\tau$ denotes the remaining time horizon in round $t$.
  • the fixed average budget constraint $B/T$ can then be replaced by the dynamic ratio $b/\tau$; ALP is an adaptive linear program with this dynamic average budget constraint.
  • $c_{t, a_t}$ denotes the cost for taking the action $a_t$ in round $t$.
  • the MAB algorithm makes a decision according to the expected reward of every arm, and thus selects $a_t = \arg\max_a \hat{u}_t(a)$. To simplify the matter, it can be assumed that each executed action consumes one unit of the budget, i.e., $c_{t, a_t} = 1$. The original intention of using linear programming is to decide whether the system, in the current round, should retain the choice under the budget constraint.
  • $p_j \in [0, 1]$ is the probability that the ALP keeps the current action provided by the MAB algorithm when the context belongs to category $j$, and the probability vector is denoted as $\mathbf{p} = (p_1, \dots, p_N)$. For a given budget $B$ and a time horizon $T$ ($T$ may represent the remaining time), the ALP problem is considered as: $\max_{\mathbf{p}} \sum_{j=1}^{N} p_j \pi_j u_j$ (1), subject to $\sum_{j=1}^{N} p_j \pi_j \le B/T$ and $p_j \in [0, 1]$ (2), where $\pi_j$ is the probability of observing spatiotemporal context $j$ and $u_j$ is its expected reward.
  • $p_j(\cdot)$ denotes the solution of equations (1) and (2).
  • $v(\cdot)$ denotes the maximum expected reward in a single round with the adaptive average budget. Because the LP in (1)-(2) has a threshold structure, it admits the greedy solution sketched below.
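  • A sketch of the greedy threshold solution of the LP in (1)-(2), with u, pi, b, and tau following the symbols above:

```python
import numpy as np

def solve_alp(u, pi, b, tau):
    """Greedily solve max sum_j p_j*pi_j*u_j subject to
    sum_j p_j*pi_j <= b/tau and 0 <= p_j <= 1: contexts with higher
    expected reward u_j are taken first, fractionally at the boundary."""
    p = np.zeros_like(u, dtype=float)
    ratio = b / tau                      # dynamic average budget constraint
    for j in np.argsort(-u):             # highest expected reward first
        if ratio <= 0:
            break
        p[j] = min(1.0, ratio / pi[j])   # fractional at the boundary context
        ratio -= p[j] * pi[j]
    return p
```

  • In round t, the action proposed by the MAB for a context in category j would then be kept with probability p_j and skipped otherwise.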
  • linear contextual bandits with uniform allocation are described.
  • the disclosed bandit algorithm with budget constraints may consider situations in which operators are unwilling to spend their entire budget too early because the online environment is dynamic.
  • Traditional budget-constrained bandit algorithms always choose the action with a greedy policy, which may cause the budget to be spent too early (e.g., when implementing the trained policy in an online environment, the budget will quickly run out).
  • the request receiving component 113 may be configured to receive, from a user device associated with a user accessing an online platform, a request for a ride-hailing service.
  • the online platform may be associated with the ride-hailing service.
  • the user may access the online platform through a mobile device.
  • the user may enter a destination, and request a ride to that destination.
  • the ride-hailing service may obtain the location of the mobile device, and determine a route from the location of the mobile device to the destination.
  • the ride-hailing service may dispatch a driver to the location of the requesting user.
  • the budget component 114 may be configured to obtain a time, a location, and a promotion budget associated with the request.
  • the time may include a day of the week, a month, a time of day, and peak hours.
  • the location may include a geographical location. For example, the location may include which grid cell the user is positioned in. Different promotional budgets may be provided to different grids. The location may affect the availability of drivers, the completion rate of orders, and supply and demand.
  • a promotion budget may include a remaining budget amount with respect to the time within a period and with respect to the location.
  • obtaining the time, the location, and the promotion budget associated with the request may include obtaining the time, the location, the promotion budget, background user information, real-time user information, and weather data associated with the request.
  • the discount determination component 115 may be configured to input the obtained time, location, and promotion budget to the trained model to determine a price discount for the request.
  • a model may include a reinforcement learning algorithm based on an action of discount allocation and a policy of maximizing a total number of orders completed through the online platform within a period.
  • the discount allocation may be subject to a fixed budget ceiling for the period.
  • inputting the obtained time, location, and promotion budget to the trained model to determine the price discount may include inputting the obtained time, location, promotion budget, background user information, real-time user information, and weather data to the trained model to determine the price discount.
  • the amount of the price discount may be decided by the model based on the likelihood that the user will accept the discount.
  • the price discount may include no discount or a nonzero discount.
  • For example, the price discount may be five dollars ($5) off the ride or ten percent (10%) off the ride.
  • Alternatively, the price discount may be determined to be zero (i.e., no discount).
  • the discount transmitting component 116 may be configured to transmit the determined price discount to the user device to notify the user. For example, a coupon may be sent to the user devices 140 in response to requesting the ride-hailing service. A full cost price may be displayed to the user in addition to the price discount.
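  • Putting components 113-116 together, the serving path might look like the following sketch; the request, model, and budget-tracker interfaces are assumptions for illustration, not the platform's actual API:

```python
def handle_ride_request(request, model, budgets):
    """Sketch of the serving flow: request -> context -> discount -> notify."""
    t = request.timestamp                 # time associated with the request
    loc = request.location                # (latitude, longitude)
    b = budgets.remaining(t, loc)         # promotion budget for this time/grid
    x = build_context_features(t, loc, b, request.user)  # hypothetical helper
    discount = model.predict_discount(x)  # no discount or a nonzero discount
    request.user_device.notify(discount)  # transmit the determined discount
    return discount
```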
  • FIG. 2 illustrates an exemplary algorithm for allocating a budget with an empirical spatial-temporal distribution.
  • the context of a passenger in this setting is divided into two parts.
  • the first part of the context includes the user bubble time and the geographic latitude and longitude features. These three features indicate the spatial-temporal categories, and N denotes the number of spatial-temporal categories.
  • the spatial-temporal categories may be clustered by an Expectation-Maximization (EM) -based clustering method or a GMM. GMM may be used in the following embodiments.
  • the spatial-temporal context can be changed into a finite context space.
  • Every distribution in the learned mixture is a normal distribution, and the mixture as a whole is a Gaussian mixture distribution learned by the GMM.
  • Algorithm 1 shows that when a bubble context x arrives in round t, it is first decided which distribution the bubble context x belongs to, by using the GMM prediction G(x). Then, the remaining budget of this spatial-temporal category is checked. If the remaining budget is greater than zero, the recommended action for this bubble user is predicted by the linear contextual bandit. The action is executed in the real environment, and whether this bubble user chooses the recommended service is received as the reward. Then, the parameters in LinUCB are updated. A sketch of one such round follows.
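  • A sketch of one such round, reusing G(x) and the LinUCB class from the sketches above; the per-category budget array and environment interface are assumptions:

```python
def algorithm1_round(x, bandit, budgets, env):
    """One round of the budgeted spatial-temporal bandit (cf. FIG. 2)."""
    j = G(x)                       # spatial-temporal category of context x
    if budgets[j] <= 0:            # this category's budget is exhausted
        return 0.0                 # skip: no recommendation this round
    a = bandit.select_arm(x)       # recommended action for the bubble user
    reward, cost = env.step(a)     # did the user choose the service?
    bandit.update(x, a, reward)    # update the LinUCB parameters
    budgets[j] -= cost             # charge this category's budget
    return reward
```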
  • budget allocation via a spatial-temporal empirical distribution may lead to a waste of budget.
  • the learning process is not continuous. In other words, for passengers later in the day, the remaining budget of the platform may be insufficient to issue coupons.
  • ALP may be used to soften the learning process.
  • the spatiotemporal features may be treated as a finite context by using GMM estimation.
  • the other features represent user information and some environment information, so the context becomes an infinite set. Unlike the spatiotemporal features, this information cannot be clustered, because clustering will lose information. And if a huge number of categories is used for clustering to reduce the information loss, the distribution of the user context classes is difficult to evaluate, because the empirical evaluation of the distribution needs a large amount of history data. Since human behaviors slowly change, not all the data can be used to estimate the human distribution. Only using the spatiotemporal features to execute the ALP-UCB algorithm is a feasible option, but it may lose some personalized information.
  • the disclosed algorithm combines the advantage of ALP-UCB and LinUCB.
  • the first part of the features of the history data may be used to evaluate the spatiotemporal distribution, which indicates the bubbling probability in different times and spatial locations. Instead of allocating the budget with a fixed distribution, the allocation strategy will change with the remaining budget and time horizon.
  • a linear payoff function as in LinUCB is assumed: $\mathbb{E}[r_{t, a} \mid x_{t, a}] = x_{t, a}^{\top} \theta_a^{*}$. By calculating the reward expectation of each arm, the reward estimation score is as follows: $s_{t, a} = x_{t, a}^{\top} \hat{\theta}_a + \alpha \sqrt{x_{t, a}^{\top} (D_a^{\top} D_a + I_d)^{-1} x_{t, a}}$.
  • the empirical reward $u_j$ of spatiotemporal context $j$ can be set as the average of the estimated rewards of the contexts assigned to cluster $j$.
  • the evaluation of the disclosed methods differs from supervised learning, because online learning in a real application platform is costly and can make the online system unstable.
  • an offline environment may be used to help train the algorithm.
  • the simulator is first trained to simulate the online environment.
  • the experimental setup is as follows.
  • a history dataset including user bubble data within a geographical location for a certain period may be collected.
  • the user data may be collected from online ride-hailing histories from a ride-hailing platform.
  • Each piece of user data may comprise a bubble time, a bubble spatial-temporal location, and bubble user information.
  • each piece of data has bandit feedback, which includes the action and the send feature (i.e., the send feature is used as the feedback in the disclosed system).
  • Each dataset may be chronologically ordered by bubble time.
  • the first 3/4 of the datasets may be used for training the offline simulator, and the remaining 1/4 is used to perform three experiments: (1) pre-training the bandit algorithm (some baselines may not be pre-trained), (2) learning the bandit algorithm, and (3) simulating the online test setting.
  • the problem of activity recommendation under a budget limitation may be implemented in a real-world application of ride-hailing platform, and the data may be collected from a real online environment.
  • Agent: the ride-hailing platform is set as the agent for the recommendation problem. The agent is to be trained to know how to make decisions for different users according to their personalized features.
  • State: the state of the agent is the set of bubble user features on the ride-hailing platform, and it stands for the personalized information of the bubble user and the information of the environment (e.g., weather conditions).
  • Action: an action is the platform recommending an activity to a passenger under a constraint limitation, subject to the condition that the cost of the actions in each round cannot exceed the budget limit.
  • Reward: the optimal target of the agent is to maximize the cumulative reward from the start to round t.
  • the agent has a reward expectation function and makes a decision according to the expected reward. Then, the environment will return to the agent a real reward for executing the best action.
  • Simulator: to replace the online environment in the early stage, an offline simulator may be used.
  • the history data to train the simulators needs to be large and balanced for each action.
  • XGBoost may be used as the classification machine.
  • History rewards may be input as binary labels to each classification machine.
  • Almost 3/4 of the history data may be used to train the simulators, and the data is re-sampled according to the different labels (rewards) to achieve balanced learning.
  • the Matthews correlation coefficient (MCC) is checked for each simulator model against the other actions' datasets (e.g., after learning $s_1$, the history data excluding action 1 is used as the test set $X_{A \setminus \{1\}}$ with true label set $R_{A \setminus \{1\}}$; the prediction rewards $\hat{R}_{A \setminus \{1\}}$ are then obtained from $s_1$, and $R_{A \setminus \{1\}}$ and $\hat{R}_{A \setminus \{1\}}$ are used to calculate the MCC).
  • the MCC may be used to find whether there is covariate shift between the training set and the test set, as in the sketch below.
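  • A sketch of this simulator check, training one XGBoost classifier per action and scoring it on the other actions' history with the MCC; the array variables are placeholders, and the re-sampling for label balance is omitted for brevity:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import matthews_corrcoef

def train_simulators(X, actions, rewards, n_actions):
    """One binary reward classifier s_a per action a."""
    sims = []
    for a in range(n_actions):
        mask = actions == a
        clf = XGBClassifier(n_estimators=100, max_depth=4)
        clf.fit(X[mask], rewards[mask])   # history reward as binary label
        sims.append(clf)
    return sims

def mcc_on_other_actions(sims, X, actions, rewards, a):
    """MCC of simulator s_a on the history of all actions except a,
    used to detect covariate shift between training and test sets."""
    mask = actions != a                   # test set X_{A \ {a}}
    pred = sims[a].predict(X[mask])
    return matthews_corrcoef(rewards[mask], pred)
```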
  • UCB-Spent greedy: for this algorithm, a fixed budget B is set in LinUCB. The agent cannot perceive the budget information until the budget has run out. This means that as long as B > 0, the agent is not subject to any restrictions.
  • UCB-ALP: for this algorithm, three dimensions (time, latitude, longitude) are selected, and GMM is used to cluster the three dimensions into 100 spatiotemporal Gaussian distributions. The context of this experiment setting is thus a finite spatiotemporal context set with 100 members. The personalized user features are dropped in this experiment setting because, as discussed above, clustering users is not realistic.
  • UCB-Normal distribution: for this algorithm, the budget is allocated to the different spatial-temporal categories as a fixed budget allocation strategy.
  • the spatial-temporal categories are learned by GMM, and empirical estimation is used to get the distribution over the different spatial-temporal categories.
  • Algorithm 1 in FIG. 2 shows the allocation of budget B to the different spatial-temporal distributions. The spatial-temporal categories may be derived from the historical ride hailing data.
  • UCB-Even distribution: for this algorithm, the time is divided into 7 peak periods, and the geographic locations are mapped into a grid world.
  • One grid cell is a hexagonal area with a radius of five kilometers.
  • History data is used to get the 7 × 4147 distributions by empirical estimation, and the budget is allocated with this spatiotemporal distribution.
  • Pretrained (warm) LinUCB-ALP: to make the learning process more stable and robust, the parameters used to initialize LinUCB-ALP are learned by LinUCB, which can warm up the LinUCB-ALP learning process.
  • the datasets for pre-training and for training LinUCB-ALP may be different.
  • for the online learning algorithm (e.g., the bandit algorithm), the strategy for online learning in an online environment is to fix the parameters for several days and then collect the data of these days to update the algorithm.
  • the online platform setting may be simulated by fixing the model parameters and predicting.
  • the dataset may be divided into two parts: (1) the first part is used to train the six algorithms on the simulators; (2) the second part is used to evaluate the six algorithms according to the fixed parameters learned in (1).
  • Two metrics, (1) average reward (AR) and (2) budget usage ratio (BUR), may be used; both can be computed as in the sketch below.
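  • A trivial sketch of the two metrics (function names assumed):

```python
def average_reward(rewards):
    """AR: mean reward obtained per round."""
    return sum(rewards) / len(rewards)

def budget_usage_ratio(spent, total_budget):
    """BUR: fraction of the allocated budget actually consumed."""
    return spent / total_budget
```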
  • Table II and Table III summarize the results of all the compared methods with respect to the two evaluation metrics. From the results, the greedy LinUCB algorithm with a budget limit performs well in exploration on both the learning dataset and the deployment dataset, and the budget does not run out quickly.
  • the fixed-budget UCB-Even and UCB-Normal have close cumulative rewards; the UCB algorithm with a Normal distribution is slightly better than the UCB algorithm with an even distribution.
  • the UCB algorithm with a Normal distribution has a more uniform budget allocation, although the allocation is still not fully uniform.
  • For UCB-ALP, the performance is lower than that of all the other algorithms. The last two algorithms both performed well in cumulative reward and budget usage ratio, and they both have a more uniform allocation. Thus, pre-training can improve the performance.
  • FIG. 3 illustrates a flowchart of an exemplary method 300, according to various embodiments of the present disclosure.
  • the method 300 may be implemented in various environments including, for example, the environment 100 of FIG. 1.
  • the method 300 may be performed by computing system 102 of FIG. 1 and computer system 500 of FIG. 5.
  • the operations of the method 300 presented below are intended to be illustrative. Depending on the implementation, the method 300 may include additional, fewer, or alternative steps performed in various orders or in parallel.
  • the method 300 may be implemented in various computing systems or devices including one or more processors.
  • historical ride hailing data (e.g., a background user information dimension, a real-time user information dimension, a weather dimension, a spatial dimension, a temporal dimension, and a discount distribution dimension) may be obtained.
  • a model may be trained with the historical ride hailing data to obtain a trained model.
  • the model may include a multi-armed bandits adaptive linear programming algorithm with an infinite contextual space.
  • a request for a ride-hailing service may be received from a user device associated with a user accessing an online platform.
  • a time, a location, and a promotion budget associated with the request may be obtained.
  • the obtained time, location, and promotion budget may be input to the trained model to determine a price discount for the request.
  • the determined price discount may be transmitted to the user device to notify the user.
  • FIG. 4 is a block diagram that illustrates a computing device 400 upon which various of the embodiments described herein may be implemented.
  • the computing device 400 may correspond to user devices 140 and the mobile devices of the drivers of the vehicles 150 of FIG. 1 described above.
  • the computing device 400 includes a communication platform 410 or other communication mechanism for communicating information, a display 420, and a graphics processing unit (GPU) 430 and a central processing unit (CPU) 440 for processing information.
  • Display 420 may provide a user interface functionality, such as a graphical user interface (“GUI”).
  • CPU 440 may be, for example, one or more general purpose microprocessors.
  • the computer device 400 also includes an input/output (IO) 450 and a memory 460.
  • Memory 460 may be a random access memory (RAM), cache and/or other dynamic storage devices, for storing information, including operating system (OS) 470 and applications 480.
  • a storage 490, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided for storing information and instructions. Information may be read into memory 460 from another storage medium, such as the storage 490. Execution of the sequences of instructions contained in memory 460 causes CPU 440 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
  • FIG. 5 is a block diagram that illustrates a computer system 500 upon which any of the embodiments described herein may be implemented.
  • the system 500 may correspond to the system 102 of FIG. 1 described above.
  • the computer system 500 includes a bus 502 or other communication mechanism for communicating information, and one or more hardware processors 504 coupled with bus 502 for processing information.
  • Hardware processor(s) 504 may be, for example, one or more general purpose microprocessors.
  • the processor(s) 504 may correspond to the processor 104 described above.
  • the computer system 500 also includes a main memory 506, such as a random access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 502 for storing information and instructions to be executed by processor(s) 504.
  • Main memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor(s) 504.
  • Such instructions, when stored in storage media accessible to processor(s) 504, render computer system 500 into a special-purpose machine that is customized to perform the operations specified in the instructions.
  • the computer system 500 further includes a read only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor(s) 504.
  • a storage device 510, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 502 for storing information and instructions.
  • the computer system 500 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which, in combination with the computer system, causes or programs computer system 500 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 500 in response to processor(s) 504 executing one or more sequences of one or more instructions contained in main memory 506. Such instructions may be read into main memory 506 from another storage medium, such as storage device 510. Execution of the sequences of instructions contained in main memory 506 causes processor(s) 504 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
  • the main memory 506, the ROM 508, and/or the storage 510 may include non-transitory storage media.
  • The term “non-transitory media,” and similar terms, as used herein, refers to media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media.
  • Non-volatile media includes, for example, optical or magnetic disks, such as storage device 510.
  • Volatile media includes dynamic memory, such as main memory 506.
  • non-transitory media include, for example, a floppy disk, a flexible disk, a hard disk, a solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.
  • the computer system 500 also includes a network/communication interface 518 coupled to bus 502.
  • Communication interface 518 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks.
  • the communication interface 518 may be implemented as one or more network ports.
  • communication interface 518 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line.
  • communication interface 518 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or a WAN component to communicate with a WAN).
  • Wireless links may also be implemented.
  • communication interface 518 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
  • the computer system 500 can send messages and receive data, including program code, through the network(s), network link and communication interface 518.
  • a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 518.
  • the received code may be executed by processor(s) 504 as it is received, and/or stored in storage device 510, or other non-volatile storage for later execution.
  • while the system and method in the present disclosure are described primarily in regard to ride-hailing recommendation, it should also be understood that the present disclosure is not intended to be limiting.
  • the system or method of the present disclosure may be applied to any other kind of services.
  • the system or method of the present disclosure may be applied to transportation systems of different environments including land, ocean, aerospace, or the like, or any combination thereof.
  • the vehicle of the transportation systems may include a taxi, a private car, a carpool, a bus, a train, a bullet train, a high speed rail, a subway, a vessel, an aircraft, a spaceship, a hot-air balloon, a driverless vehicle, or the like, or any combination thereof.
  • the transportation system may also include any transportation system for management and/or distribution, for example, a system for sending and/or receiving express deliveries.
  • the application of the system or method of the present disclosure may be implemented on a user device and include a webpage, a plug-in of a browser, a client terminal, a custom system, an internal analysis system, an artificial intelligence robot, or the like, or any combination thereof.
  • the various operations of exemplary methods described herein may be performed, at least partially, by an algorithm.
  • the algorithm may be comprised in program codes or instructions stored in a memory (e.g., a non-transitory computer-readable storage medium described above) .
  • Such algorithm may comprise a machine learning algorithm.
  • a machine learning algorithm may not explicitly program computers to perform a function, but can learn from training data to build a prediction model that performs the function.
  • processors may be temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented engines that operate to perform one or more operations or functions described herein.
  • the methods described herein may be at least partially processor-implemented, with a particular processor or processors being an example of hardware.
  • the operations of a method may be performed by one or more processors or processor-implemented engines.
  • the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS).
  • at least some of the operations may be performed by a group of computers (as examples of machines including processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an Application Program Interface (API)).
  • processors or processor-implemented engines may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm) . In other exemplary embodiments, the processors or processor-implemented engines may be distributed across a number of geographic locations.
  • the term “or” may be construed in either an inclusive or exclusive sense. Moreover, plural instances may be provided for resources, operations, or structures described herein as a single instance. Additionally, boundaries between various resources, operations, engines, and data stores are somewhat arbitrary, and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present disclosure. In general, structures and functionality presented as separate resources in the exemplary configurations may be implemented as a combined structure or resource. Similarly, structures and functionality presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within a scope of embodiments of the present disclosure as represented by the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
  • Conditional language such as, among others, “can, ” “could, ” “might, ” or “may, ” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.
  • the terms “passenger,” “requester,” “service requester,” “customer,” and “user” in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may request or order a service.
  • the terms “driver,” “provider,” and “service provider” in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may provide a service or facilitate the providing of the service.
  • the terms “service request,” “request for a service,” “requests,” and “order” in the present disclosure are used interchangeably to refer to a request that may be initiated by a passenger, a service requester, a customer, a driver, a provider, a service provider, or the like, or any combination thereof.
  • the service request may be accepted by any one of a passenger, a service requester, a customer, a driver, a provider, or a service provider.
  • the terms “user device,” “service provider terminal,” “provider terminal,” and “driver terminal” in the present disclosure are used interchangeably to refer to a computing device (e.g., a mobile terminal) that is used by a service provider to provide a service or facilitate the providing of the service.
  • the terms “service requester terminal,” “requester terminal,” and “passenger terminal” in the present disclosure are used interchangeably to refer to a mobile terminal that is used by a service requester to request or order a service.
  • the term “distance” between two locations may refer to a linear distance between the two locations and/or a route distance along a route between the two locations.

Abstract

Ride-hailing recommendations may be provided in real-time using contextual bandits with budget and spatiotemporal constraints. Historical ride hailing data may be obtained. A model may be trained with the historical ride hailing data to obtain a trained model. A request for a ride-hailing service may be received from a user device associated with a user accessing an online platform. A time, a location, and a promotion budget associated with the request may be obtained. The obtained time, location, and promotion budget may be input to the trained model to determine a price discount for the request. The determined price discount may be transmitted to the user device to notify the user.

Description

CONSTRAINED SPATIOTEMPORAL CONTEXTUAL BANDITS FOR REAL-TIME RIDE-HAILING RECOMMENDATION
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application is based on and claims priority to International Patent Application No. PCT/CN2019/090128, filed on June 5, 2019, and titled “CONSTRAINED SPATIOTEMPORAL CONTEXTUAL BANDITS FOR REAL-TIME RIDE-HAILING RECOMMENDATION,” the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure generally relates to ride-hailing recommendation, and more specifically, to methods and systems for ride-hailing recommendation based on constrained spatiotemporal contextual bandits.
BACKGROUND
A vehicle dispatch platform can automatically receive ride-hailing requests from user devices (passenger side), provide price quotes, and, upon user acceptances, allocate the ride-hailing requests to devices of vehicle drivers (driver side) for providing respective transportation services. For the platform, it has been challenging to optimize the distribution of limited promotional resources in order to maximize the number of accepted ride-hailing orders. The distribution often involves real-time activity recommendations to the passenger side. For example, when a passenger logs in to the vehicle dispatch platform from the passenger side to check a ride-hailing price (also referred to as passenger bubbling), the platform may send a discount coupon to the passenger (e.g., directly applied to the price) to encourage ordering.
The difficulties of real-time activity recommendation come from many aspects. In particular, the time and location of each instance of passenger bubbling (e.g., when a user fills in the destination inquiry and chooses a service mode associated with a corresponding pricing tier) are unique. For example, the numbers of drivers in different geographical locations are different. If a passenger bubbles at a location with sparse drivers, even if a discount coupon is issued and applied, the order may not be completed. Thus, it is desirable to provide a solution that makes real-time activity recommendations according to individual circumstances to improve the efficiency and effect of the recommendations and maximize the total order number.
For another example, budget planning in the space-time dimension can be difficult for the platform. The number of times that the platform issues discount coupons, or the coupon budget each day, is limited, and the revenue generated by passenger bubbling in different time and space dimensions may be different. For example, for passengers bubbling at night, issuing coupons may generate a higher revenue than during the daytime. However, the current daily coupon budget may have been exhausted by the afternoon. Thus, it is important to determine the optimal budget allocation in different time and space dimensions to maximize the total order number.
SUMMARY
Various embodiments of the present disclosure include systems, methods, and non-transitory computer readable media for ride-hailing recommendation.
According to one aspect, a recommendation method may include obtaining historical ride hailing data, and training a model with the historical ride hailing data to obtain a trained model. The method may further include receiving a request for a ride-hailing service from a user device associated with a user accessing an online platform, and obtaining a time, a location, and a promotion budget associated with the request. The method may further include inputting the obtained time, location, and promotion budget to the trained model to determine a price discount for the request, and transmitting the determined price discount to the user device to notify the user.
In some embodiments, a promotion budget may include a remaining budget amount with respect to the time within a period and with respect to the location.
In some embodiments, a model may include a reinforcement learning algorithm based on an action of discount allocation and a policy of maximizing a total number of orders completed through the online platform within a period. The discount allocation may be subject to a fixed budget ceiling for the period.
In some embodiments, the trained model may be a LinUCB-Adaptive-Linear-Programming (LinUCB-ALP) algorithm.
In some embodiments, training a model with historical ride hailing data to obtain the trained model may include pre-training the model with a portion of the historical ride hailing data to obtain a pre-trained model. The pre-trained model may be trained with another portion of the historical ride hailing data to obtain the trained model.
In some embodiments, the pre-trained model may be a Lin-upper-confidence-bound (LinUCB) algorithm.
In some embodiments, a model may include a multi-armed bandits adaptive linear programming algorithm with an infinite contextual space. Training the model with historical ride hailing data to obtain a trained model may include learning the infinite contextual space through a Gaussian Mixture Model.
In some embodiments, historical ride hailing data may include one or more dimensions. The dimensions may include a background user information dimension, a real-time user information dimension, a weather dimension, a  spatial dimension, a temporal dimension, and a discount distribution dimension.
In some embodiments, obtaining the time, the location, and the promotion budget associated with the request may include obtaining the time, the location, the promotion budget, background user information, real-time user information, and a weather associated with the request. Inputting the obtained time, location, and promotion budget to the trained model to determine the price discount may include inputting the obtained time, location, promotion budget, background user information, real-time user information, and weather to the trained model to determine the price discount.
In some embodiments, the background user information dimension may include one or more attributes including gender, registration date, registration location, application login history, and ride-hailing order history.
In some embodiments, the real-time user information dimension may include one or more attributes including number of recently completed orders, distance travelled for a recent order, time of a recent order, and price paid for a recent order.
In some embodiments, the weather dimension may include one or more attributes including humidity, precipitation, wind, UV metric, air pollution metric, and weather condition.
In some embodiments, historical ride hailing data may correspond to a geographical area mapped into a plurality of grids. For each piece of the historical ride hailing data associated with an order in one of the plurality of grids, the spatial dimension may include one or more of attributes. The attributes may include a grid index of the one grid, a number of vehicles in the one grid, a number of ride-hailing orders accepted in the one grid, a number of ride-hailing orders completed in the one grid, and waiting time for ride-hailing orders requested in the one grid.
In some embodiments, for each piece of the historical ride hailing data requesting a historical order, the temporal dimension may include one or more attributes including month, day-of-the-week, time-in-the-day, and peak-or-off-peak-hour.
In some embodiments, for each piece of the historical ride hailing data completing a historical order, the discount distribution dimension may include one or more attributes including whether discount was offered, discount type offered, offered price discount and whether discount was used.
In some embodiments, the price discount may comprise: no discount or a nonzero discount.
In some embodiments, the trained model may define a plurality of arms, and training the model may include evaluating a spatiotemporal distribution based on the historical ride hailing data. Training the model may further include defining a linear payoff function, and determining a reward expectation of each arm of the plurality of arms. Training the model may further include selecting, among the plurality of arms, an arm with the highest reward expectation. The trained model may be trained based on the spatiotemporal distribution, the linear payoff function, and the selected arm.
According to some embodiments, a recommendation system comprises one or more processors and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform the method of any of the preceding embodiments.
According to some embodiments, a non-transitory computer-readable storage medium is configured with instructions executable by one or more processors to cause the one or more processors to perform the method of any of the preceding embodiments.
According to some embodiments, a recommendation apparatus comprises a plurality of modules for performing the method of any of the preceding embodiments.
According to another aspect, a recommendation system may include one or more processors and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform operations. The operations may include obtaining historical ride hailing data, and training a model with the historical ride hailing data to obtain a trained model. The operations may further include receiving a request for a ride-hailing service from a user device associated with a user accessing an online platform, and obtaining a time, a location, and a promotion budget associated with the request. The operations may further include inputting the obtained time, location, and promotion budget to the trained model to determine a price discount for the request, and transmitting the determined price discount to the user device to notify the user.
According to another aspect, a non-transitory computer-readable storage medium may be configured with instructions executable by one or more processors to cause the one or more processors to perform operations. The operations may include obtaining historical ride hailing data, and training a model with the historical ride hailing data to obtain a trained model. The operations may further include receiving a request for a ride-hailing service from a user device associated with a user accessing an online platform, and obtaining a time, a location, and a promotion budget associated with the request. The operations may further include inputting the obtained time, location, and promotion budget to the trained model to determine a price discount for the request, and transmitting the determined price discount to the user device to notify the user.
These and other features of the systems, methods, and non-transitory  computer readable media disclosed herein, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for purposes of illustration and description only and are not intended as a definition of the limits of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
Certain features of various embodiments of the present technology are set forth with particularity in the appended claims. A better understanding of the features and advantages of the technology will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:
FIG. 1 illustrates an exemplary system for ride-hailing recommendation, in accordance with various embodiments.
FIG. 2 illustrates an exemplary algorithm for allocating a budget with an empirical spatial-temporal distribution, in accordance with various embodiments.
FIG. 3 illustrates a flowchart of an exemplary method for ride-hailing recommendation, in accordance with various embodiments.
FIG. 4 illustrates a block diagram of an exemplary computing device in which various of the embodiments described herein may be implemented.
FIG. 5 illustrates a block diagram of an exemplary computer system in which any of the embodiments described herein may be implemented.
DETAILED DESCRIPTION
The disclosed systems and computer-implemented methods may determine real-time ride-hailing recommendations subject to spatiotemporal constraints. For example, discount coupon issuance decisions with respect to location and time under a budget constraint may be automatically made for an online ride-hailing platform to maximize the long-term benefits of the platform. Two problems may need to be solved in order to maximize performance. First, coupons may be issued such that there are available drivers in the geographical area to pick up riders when the coupons are used. A contextual multi-armed bandit algorithm may be used to solve the problem of space-time sequence decision-making and to maximize long-term benefits. Second, limited budgets may be properly allocated in the space-time dimension in order to not run out too early. Constrained Spatiotemporal Contextual Bandits may be used to solve the problem of space-time sequence decision with budget constraint. The disclosed systems and methods may have the technical effect of automatically determining the optimal recommendation decisions for each user in consideration of the location of the user, time of the day, and budget of the platform.
In some embodiments, personalized recommendations to individual users are provided by making use of both environment information and user information. The contextual multi-armed bandit algorithm may be used in various recommendation scenarios. In various embodiments of this disclosure, the activity recommendation application in a large-scale ride-hailing platform is described as a contextual bandits task with budget and spatial-temporal constraints (referred to as budget-constrained bandits). In real time, the constraints significantly complicate the exploration and exploitation trade-off; the resulting problem is NP-hard (Nondeterministic Polynomial time). Existing bandit algorithms with budget constraints attempt to solve the problem by simply stopping training when the budget constraint is reached, which can lead to a quick exhaustion of the budget. In this disclosure, a situation is contemplated in which some industrial settings prefer to allocate the budget uniformly, because a uniform allocation can capture the changes of an online environment. For example, a target may be configured to “not spend all in an early time,” for which linear programming may be used to balance instantaneous and long-term rewards.
Further, Empirical Adaptive-Linear-Programming (EALP) offers a general recipe for changing the budget during an online learning process. It requires an empirical distribution over a finite context set. In a real setting, the context set is effectively infinite, because the features of every passenger are different in time and geographic location, so obtaining the context distribution through empirical estimation is hardly possible. In some embodiments of this disclosure, the empirical estimation in EALP may be replaced with an estimation of the spatial-temporal context distribution. Also, the contextual bandit settings of an infinite context set may be combined to overcome the budget allocation problem under spatial-temporal constraints. Because online learning in a real application environment can be very costly and unsafe, a balanced environment simulator may be trained on history logging data to make offline learning feasible.
The multi-armed bandit (MAB) is a sequential decision problem, in which an agent receives a random reward by playing one of K arms at each round and wants to maximize its cumulated reward. The agent learns the inherent trade-off between exploration (identifying and understanding the reward from each action) and exploitation (gathering as much reward as possible from the actions known to pay well). The observed d-dimensional features may be combined with the bandit learning (referred to as contextual multi-armed bandit) to obtain reasonable policies.
Contextual bandits add contextual information to the MAB problem. The corresponding algorithm may be referred to as a contextual MAB algorithm. Because relevant contextual information is available in many applications, making use of the context can promote the effect of the bandit algorithm to a large extent. In the generalized contextual multi-armed bandit problem, the agent observes a d-dimensional feature vector before making a decision. During the learning time, the agent learns the relationship between contexts and rewards (e.g., payoffs). In some embodiments, based on the assumption of a linear payoff function, the decision-making process may be extended to consider the cost in real time, which is the budget-constrained MAB setting. Since the decision-making process is constrained by a budget, it may also be referred to as budget MAB. In budget MABs, playing an arm may generate consumption, and the target in this setting is to maximize the cumulative reward under a budget constraint on the total consumption.
FIG. 1 illustrates an exemplary environment 100 for ride-hailing recommendation, in accordance with various embodiments. The example environment 100 may include a computing system 102, a network 120, user devices 140, vehicles 150, a storage device 160, and satellites 170. The computing system 102 may include one or more processors and memory (e.g., permanent memory, temporary memory) . The processor (s) may be configured to perform various operations by interpreting machine-readable instructions stored in the memory. The computing system 102 may have access to other computing resources through network 120.
Network 120 may include wireless access points 130-1 and 130-2. Wireless access points 130-1 and 130-2 may allow user devices 140, vehicles 150, and satellites 170 to communicate with network 120. In some embodiments, network 120, user devices 140, and vehicles 150 may communicate with satellites 170. Satellites 170 may include satellites 170-1,  170-2, and 170-3. In some embodiments, the system 102 may be configured to obtain data (e.g., location, time, and fees for multiple vehicle transportation trips) from the data store 160 (e.g., a database or dataset of historical transportation trips) , the user devices 140, and vehicles 150. For example, system 102 may obtain GPS (Global Positioning System) coordinates of vehicles 150.
The positioning technology used in the present disclosure may be based on GPS, a global navigation satellite system (GLONASS) , a compass navigation system (COMPASS) , a Galileo positioning system, a quasi-zenith satellite system (QZSS) , a wireless fidelity (WiFi) positioning technology, or the like, or any combination thereof. One or more of the above positioning systems may be used interchangeably in the present disclosure.
The user devices 140 may include mobile device 140-1 (e.g., a smart phone, smart watch) , tablet 140-2, laptop 140-3, and other computing devices 140-4 (e.g., desktop computer, server) . User devices 140 may be used by riders on a ride sharing platform. The vehicles 150 may include vehicles 150-1, 150-2, and 150-3. The vehicles 150 may include cars, bikes, scooters, trucks, boats, trains, or autonomous vehicles. In some embodiments, the vehicles 150 may include mobile devices of drivers of the vehicles. For example, communications between the computing system 102 and the vehicles 150 may take place between computing system 102 and mobile devices of the drivers. In another example, locations of vehicles 150 may correspond to the locations of the mobile devices of the drivers.
In some embodiments, environment 100 may implement an online information or service platform. The service platform may be referred to as a vehicle (service hailing, ride sharing, or ride order dispatching) platform. The platform may accept requests for transportation, identify vehicles to fulfill the requests, arrange for pick-ups, and process transactions. For example, a user  may use user device 140-1 (e.g., a mobile phone installed with a software application associated with the platform) to request transportation from the platform. The system 102 may receive the request and reply with price quote data and price discount data for one or more trips. When the user selects a trip, the system 102 may relay trip information to various drivers of vehicles 150, for example, by posting the request to mobile phones carried by the drivers. A vehicle driver may accept the posted transportation request and obtain pick-up location information. Fees such as transportation fees can be transacted among the system 102, the user devices 140, and the vehicles 150. In some embodiments, for each trip, the location of the origin and destination, the price discount information, the fee, and the time can be obtained by the system 102.
The computing system 102 may include a historical data component 111, a model training component 112, request receiving component 113, budget component 114, discount determination component 115, and discount transmitting component 116. The computing system 102 may include other components. In some embodiments, one or more of the system 102, the user devices 140, and the vehicles 150 may be integrated in a single device or system. Alternatively, the system 102, the user devices 140, and the vehicles 150 may operate as separate devices. While the computing system 102 is shown in FIG. 1 as a single entity, this is merely for ease of reference and is not meant to be limiting. One or more components or one or more functionalities of the computing system 102 described herein may be implemented in a single computing device or multiple computing devices. In some embodiments, one or more components or one or more functionalities of the computing system 102 described herein may be implemented in one or more networks (e.g., enterprise networks) , one or more endpoints, one or more servers, or one or more clouds. A server may include hardware or software which manages access to a  centralized resource or service in a network. A cloud may include a cluster of servers and other devices which are distributed across a network. The system 102 above may be installed with appropriate software (e.g., platform program, etc. ) and/or hardware (e.g., wires, wireless connections, etc. ) to access other devices of the environment 100.
In some embodiments, the various components may correspond to various modules, and the computing system 102 may correspond to a recommendation apparatus. Each module may correspond to instructions stored in a non-transitory computer-readable storage medium, and the instructions are executable by one or more processors to cause the one or more processors to perform the steps described with respect to the various components.
The historical data component 111 may be configured to obtain historical ride hailing data. For example, historical ride hailing data may be obtained from storage device 160. In some embodiments, historical ride hailing data may include user features. The user features may occur independently when spatial-temporal distributions are evaluated. Features may include spatial-temporal dimensions, and the dimensions may indicate the user’s space location.
In some embodiments, the historical ride hailing data may include one or more of the following dimensions: a background user information dimension, a real-time user information dimension, a weather dimension, a spatial dimension, a temporal dimension, and a discount distribution dimension. In some embodiments, the background user information dimension may include one or more attributes including gender, registration date, registration location, application login history, and ride-hailing order history. In some embodiments, the real-time user information dimension may include one or more attributes including number of recently completed orders, distance travelled for a recent order, time of a recent order, and price paid for a recent order. In some embodiments, the number of recently completed orders may include all of the orders completed within a predetermined period of time (e.g., the past hour, the past day, or the past week).
In some embodiments, the weather dimension may include one or more attributes including humidity, precipitation, wind, UV metric, air pollution metric, and weather condition. In some embodiments, for each piece of the historical ride hailing data requesting a historical order, the temporal dimension may include one or more attributes including month, day-of-the-week, time-in-the-day, and peak-or-off-peak-hour. In some embodiments, for each piece of the historical ride hailing data completing a historical order, the discount distribution dimension may include one or more attributes including whether discount was offered, discount type offered, offered price discount, and whether discount was used.
In some embodiments, historical ride hailing data may correspond to a geographical area mapped into a plurality of grids. For each piece of the historical ride hailing data associated with an order in one of the plurality of grids, the spatial dimension may include one or more of attributes. The attributes may include a grid index of the one grid, a number of vehicles in the one grid, a number of ride-hailing orders accepted in the one grid, a number of ride-hailing orders completed in the one grid, and waiting time for ride-hailing orders requested in the one grid. In some embodiments, the attributes of the spatial dimension may correspond to a period of time. For example, the attributes may be determined at the point in time when the order was placed. In another example, the attributes may be determined based on a period of time (e.g., prior minute, prior hour, prior day) associated with the order.
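By way of illustration, the dimensions above can be assembled into a single context vector before being fed to the model. The following Python sketch is only one possible encoding; every field name and value in it is hypothetical and not taken from the disclosure:

```python
import numpy as np

def build_context(background, realtime, weather, spatial, temporal):
    """Concatenate the feature dimensions described above into one context
    vector. All field names are hypothetical placeholders."""
    return np.array([
        background["gender"], background["days_since_registration"],
        background["num_past_orders"],
        realtime["recent_completed_orders"], realtime["recent_distance_km"],
        realtime["recent_price"],
        weather["humidity"], weather["precipitation_mm"], weather["wind_kph"],
        spatial["grid_index"], spatial["vehicles_in_grid"],
        spatial["orders_completed_in_grid"],
        temporal["month"], temporal["day_of_week"], temporal["hour"],
        temporal["is_peak_hour"],
    ], dtype=float)

x_t = build_context(
    background={"gender": 1, "days_since_registration": 412, "num_past_orders": 87},
    realtime={"recent_completed_orders": 3, "recent_distance_km": 5.2, "recent_price": 18.0},
    weather={"humidity": 0.64, "precipitation_mm": 0.0, "wind_kph": 12.0},
    spatial={"grid_index": 1042, "vehicles_in_grid": 23, "orders_completed_in_grid": 15},
    temporal={"month": 6, "day_of_week": 3, "hour": 18, "is_peak_hour": 1},
)
```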
The model training component 112 may be configured to train a model with the historical ride hailing data to obtain a trained model. In some  embodiments, the trained model may define a plurality of arms. In some embodiments, a model may include a multi-armed bandits adaptive linear programming algorithm with an infinite contextual space. In some embodiments, the multi-armed bandits adaptive linear programming algorithm may include a K-armed budget constraints contextual bandit problem. In some embodiments, training a model with historical ride hailing data to obtain the trained model may include pre-training the model with a portion of the historical ride hailing data to obtain a pre-trained model. The pre-trained model may be trained with another portion of the historical ride hailing data to obtain the trained model.
In some embodiments, the pre-trained model may include a Lin-upper-confidence-bound (LinUCB) algorithm, and the trained model may include a LinUCB-Adaptive-Linear-Programming (LinUCB-ALP) algorithm. LinUCB is an algorithm in which the confidence interval may be computed efficiently in closed form when the payoff model is linear. In some embodiments, a spatial-temporal limited budget allocation MAB with the adaptive linear programming (ALP) may be formulated in LinUCB similarly to ALP with the upper-confidence-bound (ALP-UCB) . However, unlike ALP-UCB, the contexts used in the MAB may be in an infinite space. For example, in many possible industry recommendation scenarios, the context may denote the combined features of passengers. The finite context assumption of ALP-UCB does not work in these scenarios. Thus, the context distribution may be seen as a uniform distribution.
In some embodiments, training the model may include evaluating a spatiotemporal distribution based on the historical ride hailing data. In some embodiments, training the model with historical ride hailing data to obtain a trained model may include learning the infinite contextual space through a Gaussian Mixture Model (GMM). For example, an EM-based clustering GMM may be used to learn the user bubble distribution in the spatial-temporal dimensions: the mixture $G(x) = \sum_{j=1}^{J} \phi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)$, where $\{\mathcal{N}(\mu_j, \Sigma_j)\}_{j=1}^{J}$ is set to denote J different Gaussian distributions. After learning the GMM, $G(x)$ may be used to find which Gaussian distribution (spatial-temporal distribution) the context x belongs to.
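The clustering step can be sketched with an off-the-shelf GMM implementation. The snippet below uses scikit-learn's GaussianMixture on synthetic (time, latitude, longitude) bubble data; the coordinate ranges, sample size, and component count are invented for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for historical bubble events: (hour-of-day, latitude, longitude).
rng = np.random.default_rng(0)
bubbles = np.column_stack([
    rng.uniform(0, 24, 5000),         # bubble time
    rng.uniform(39.8, 40.1, 5000),    # latitude
    rng.uniform(116.2, 116.6, 5000),  # longitude
])

J = 10  # number of Gaussian components (the experiments described later use 100)
gmm = GaussianMixture(n_components=J, covariance_type="full", random_state=0).fit(bubbles)

# G(x): the spatial-temporal distribution a new bubble context belongs to.
j = gmm.predict(np.array([[18.5, 39.95, 116.4]]))[0]

# Empirical occurrence probability g_n of each class, later used for b_n = B * g_n.
g = gmm.weights_
```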
In some embodiments with respect to the linear contextual multi-armed bandit setting, for a K-armed stochastic bandit system, in each round t the agent observes an action set $A_t \subseteq \{1, 2, \ldots, K\}$, and the user feature context $x_t$ arrives independently with identical distribution $P\{x_t = j\} = \pi_j$. In some embodiments, training the model may include defining a linear payoff function, and determining a reward expectation of each arm of the plurality of arms. Training the model may further include selecting, among the plurality of arms, an arm with the highest reward expectation. The trained model may be trained based on the spatiotemporal distribution, the linear payoff function, and the selected arm. For example, based on observed payoffs in previous trials, the agent calculates the expectation of reward $\mathbb{E}[r_t^{a} \mid x_{t,a}] = x_{t,a}^{\top} \theta_a^{*}$, and receives the payoff $r_t^{a_t}$ and cost $c_t^{a_t}$ after executing the action in the environment. If $a_t = 0$ (the dummy action), then $r_t^{a_t} = c_t^{a_t} = 0$. The agent chooses an arm $a_t \in A_t$ by selecting the arm (e.g., whether and which coupon to issue) with the maximum expectation at trial t. With a probability of at least $1 - \delta$ (where δ is a confidence parameter), the regret (e.g., the difference between the algorithm output and the theoretical best decision) in trial t has an upper bound which is constrained by the constant $\alpha = 1 + \sqrt{\ln(2/\delta)/2}$. The algorithm selects the best arm by the following formulation:

$$a_t = \arg\max_{a \in A_t} \left( x_{t,a}^{\top} \hat{\theta}_a + \alpha \sqrt{x_{t,a}^{\top} A_a^{-1} x_{t,a}} \right), \qquad \hat{\theta}_a = A_a^{-1} D_a^{\top} c_a, \qquad A_a = D_a^{\top} D_a + I_d,$$

where $D_a$ is a design matrix of dimension m by d at trial t, whose rows correspond to the m training inputs previously observed for arm a, $c_a$ is the corresponding vector of observed rewards, and $I_d$ is a d-by-d identity matrix. The algorithm updates the parameters to improve the policy with the current observation: $A_{a_t} \leftarrow A_{a_t} + x_{t,a_t} x_{t,a_t}^{\top}$, and the new row $x_{t,a_t}$ with response $r_t$ is appended to $D_{a_t}$ and $c_{a_t}$.
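The disclosure provides no reference implementation; as a rough sketch, the disjoint-model LinUCB update implied by the formulas above can be written as follows (the arm count, feature dimension, and alpha are arbitrary assumptions, with arm 0 treated as the no-coupon action):

```python
import numpy as np

class LinUCB:
    """Disjoint-model LinUCB: per arm, A_a = D_a' D_a + I_d and b_a = D_a' c_a,
    scored as x' theta_a + alpha * sqrt(x' A_a^{-1} x)."""

    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]
        self.b = [np.zeros(d) for _ in range(n_arms)]

    def select(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                      # ridge-regression estimate
            scores.append(x @ theta + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)                # A_a <- A_a + x x'
        self.b[arm] += reward * x                    # b_a <- b_a + r x

# Usage: arms could be discount levels (0 = no coupon), x the passenger context.
bandit = LinUCB(n_arms=4, d=16, alpha=1.0)
x = np.random.default_rng(1).normal(size=16)
a = bandit.select(x)
bandit.update(a, x, reward=1.0)  # 1 if the user completed the order, else 0
```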
In some embodiments with respect to MAB subject to a budget limit, the budget-constrained contextual bandits can be formalized as follows. Assuming the time-horizon T and budget B are known, the total T-trial payoff in the learning process is defined as $U(T, B) = \sum_{t=1}^{T} r_t^{a_t}$, and the total optimal payoff is defined as $U^{*}(T, B)$, the payoff achieved by the theoretically best policy. The target is to maximize the total payoff during T rounds under the constraints of the budget and time-horizon, which can be formalized as:

$$\max \; \mathbb{E}\left[ \sum_{t=1}^{T} r_t^{a_t} \right] \quad \text{subject to} \quad \sum_{t=1}^{T} c_t^{a_t} \le B.$$

The regret of the algorithm is:

$$R(T, B) = U^{*}(T, B) - U(T, B).$$

Because the budget and time can grow to infinity with the fixed proportion ρ = B/T, linear programming can be used to solve this problem.
In some embodiments with respect to ALP with expected reward, linear programming (LP) is proposed to reformulate the problem by converting the hard budget constraint to an average budget constraint. When fixing the average budget constraint as ρ = B/T, the LP provides the policy of whether to choose or skip an action proposed by the MAB policy. Further, considering that the remaining budget changes during the learning process, let b denote the remaining budget and τ denote the remaining time-horizon in round t. The average budget constraint can then be replaced with the dynamic ratio $\rho_t = b / \tau$. ALP is an adaptive linear programming with this dynamic average budget constraint. In some embodiments, $\mathcal{J} = \{1, 2, \ldots, J\}$ denotes the spatial-temporal class set learned by GMM.
In some embodiments with respect to regret analysis, $c_t^{a_t}$ denotes the cost for taking the action $a_t$ in round t. The cost may be regarded as a unit cost, such that if an action is not the dummy action ($a_t \ne 0$), then $c_t^{a_t} = 1$. The quality of $a_t$ can be captured as $u_t^{a_t} = \mathbb{E}[r_t^{a_t} \mid x_t]$, which is the expected reward provided by the bandit algorithm before making a decision in round t. $a_t^{*}$ is the best action in a decision round, as decided by the MAB algorithm, and $u_t^{*} = \mathbb{E}[r_t^{a_t^{*}} \mid x_t]$ is the expected reward of the best arm $a_t^{*}$. The MAB algorithm makes a decision according to the expected reward of every arm, and thus $u_t^{*} \ge u_t^{a}$ for every arm $a \in A_t$. To simplify the matter, it can be assumed that the spatial-temporal classes are indexed so that their expected rewards satisfy $u_1^{*} \ge u_2^{*} \ge \cdots \ge u_J^{*}$.
The original intention of using linear programming is to decide whether the system in a current round should retain the choice under the budget constraint.
In some embodiments, $p_j \in [0, 1]$ is the probability that the ALP selects the current action provided by the MAB algorithm when the spatial-temporal context is class j, and the probability vector is denoted as $\mathbf{p} = (p_1, p_2, \ldots, p_J)$. For a given budget B and a time-horizon T (T may represent the remaining time), the ALP problem is considered as:

$$\max_{\mathbf{p}} \; \sum_{j=1}^{J} p_j \pi_j u_j^{*} \quad (1)$$

$$\text{subject to} \quad \sum_{j=1}^{J} p_j \pi_j \le \rho, \qquad p_j \in [0, 1], \; \forall j. \quad (2)$$

$p_j(\rho)$ denotes the solution of equations (1) and (2), and $v(\rho)$ denotes the maximum expected reward in a single round with the advanced average budget.
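Because there is a single linear budget constraint, this LP admits a closed-form threshold solution: with the classes ordered by expected reward, set p_j = 1 until the average budget ρ is filled, take a fractional probability at the boundary class, and 0 afterwards. A small sketch (the class probabilities and rewards below are invented):

```python
import numpy as np

def alp_probabilities(pi, u, rho):
    """Closed-form ALP solution: greedily admit the highest-reward classes
    until the average budget rho is exhausted."""
    p = np.zeros(len(pi))
    remaining = rho
    for j in np.argsort(-u):        # classes in decreasing expected reward
        if remaining <= 0:
            break
        p[j] = min(1.0, remaining / pi[j])
        remaining -= p[j] * pi[j]
    return p

pi = np.array([0.3, 0.4, 0.3])      # occurrence probability of each class
u = np.array([0.8, 0.5, 0.2])       # expected reward u_j* of each class
p = alp_probabilities(pi, u, rho=0.5)   # rho = b / tau, recomputed every round
# -> p = [1.0, 0.5, 0.0]: always keep class 0, keep class 1 half the time.
```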
In some embodiments, linear contextual bandits with uniform allocation are described. The disclosed bandit algorithm with budget constraints may consider situations in which one is unwilling to spend the whole budget too early because the online environment is dynamic. Traditional budget-constrained bandit algorithms always choose the action with a greedy policy, which may cause the budget to be spent too early (e.g., when implementing the trained policy in an online environment, the budget will quickly run out).

In a dynamic system, people’s behavior is also affected by the strategy produced by the algorithm. That means that, after using all of the budget, the algorithm has no chance for exploration and cannot adapt to the changing environment. Thus, spending all of the budget too early would be a disaster in the online system. In light of that, an algorithm is disclosed to allocate the budget reasonably in the spatial-temporal dimension.
The request receiving component 113 may be configured to receive, from a user device associated with a user accessing an online platform, a request for a ride-hailing service. In some embodiments, the online platform may be associated with the ride-hailing service. For example, the user may access the online platform through a mobile device. The user may enter a destination, and request a ride to that destination. The ride-hailing service may obtain the location of the mobile device, and determine a route from the location of the mobile device to the destination. In some embodiments, the ride-hailing service may dispatch a driver to the location of the requesting user.
The budget component 114 may be configured to obtain a time, a location, and a promotion budget associated with the request. In some embodiments, the time may include a day of the week, a month, a time of day, and peak hours. In some embodiments, the location may include a geographical location. For example, the location may include which area of a grid the user is positioned in. Different promotional budgets may be provided to different grids. The location may affect the availability of drivers, the completion rate of orders, and supply and demand. In some embodiments, a promotion budget may include a remaining budget amount with respect to the time within a period and with respect to the location. In some embodiments, obtaining the time, the location, and the promotion budget associated with the request may include obtaining the time, the location, the promotion budget, background user information, real-time user information, and a weather associated with the request.
The discount determination component 115 may be configured to input the obtained time, location, and promotion budget to the trained model to determine a price discount for the request. In some embodiments, a model may include a reinforcement learning algorithm based on an action of discount allocation and a policy of maximizing a total number of orders completed through the online platform within a period. The discount allocation may be subject to a fixed budget ceiling for the period.
In some embodiments, inputting the obtained time, location, and promotion budget to the trained model to determine the price discount may include inputting the obtained time, location, promotion budget, background user information, real-time user information, and weather to the trained model to determine the price discount. For example, the amount of the price discount may be decided by the model based on the likelihood that the user will accept the discount. In some embodiments, the price discount may include no discount or a nonzero discount. For example, the price discount may include five dollars ($5) off the ride or ten percent (10%) off the ride. In another example, when there is no discount, the price discount may be determined to be zero (i.e., 0) .
The discount transmitting component 116 may be configured to transmit the determined price discount to the user device to notify the user. For example, a coupon may be sent to the user devices 140 in response to requesting the ride-hailing service. A full cost price may be displayed to the  user in addition to the price discount.
FIG. 2 illustrates an exemplary algorithm for allocating a budget with an empirical spatial-temporal distribution. In some embodiments with respect to allocating the budget with an empirical spatial-temporal distribution, the context of a passenger is divided into two parts in this setting. The first part of the context includes the user bubble time and the geographic latitude and longitude features. These three features indicate the spatial-temporal categories, and N denotes the number of spatial-temporal categories. The spatial-temporal categories may be clustered by an Expectation-Maximization-based clustering or a GMM; GMM is used in the following embodiments. To that end, the spatial-temporal context can be changed into a finite context space. It may be assumed that people’s bubbling behavior is independent and obeys the spatial-temporal distribution $\mathcal{G} = \{g_1, g_2, \ldots, g_N\}$, where every component distribution underlying $\mathcal{G}$ is a normal distribution, $\mathcal{G}$ is learned by GMM, and the overall distribution may be a Gaussian mixture distribution. The history bubble data can be used to train the GMM and to get the occurrence probability $g_n$ of every spatial-temporal distribution by empirical estimation. For a given budget B, the budget is allocated to the different spatial-temporal distributions as $b_n = B g_n$.
As shown in FIG. 2, Algorithm 1 shows that when a bubble context x arrives in round t, it is first decided which distribution in $\mathcal{G}$ the bubble context x belongs to, by using the GMM prediction $n = G(x)$. Then, the remaining budget $b_n$ of this spatial-temporal class is checked. If the remaining budget is bigger than zero, then the recommended action for this bubble user is predicted by the linear contextual bandit. The action is executed in the real environment, whether this bubble user chooses the recommended service is received as the reward $r_t$, and the parameters in LinUCB are then updated.
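Reusing the GMM and LinUCB sketches above, the loop of Algorithm 1 might look as follows. The `env(x, a)` callback standing in for the real environment, the unit cost, and the assumption that the first three context features are the spatial-temporal ones are all illustrative:

```python
def run_fixed_allocation(gmm, bandit, env, B, contexts, unit_cost=1.0):
    """Sketch of Algorithm 1: split budget B across spatial-temporal classes
    as b_n = B * g_n, then serve arriving bubble contexts from the class
    budgets. `gmm` and `bandit` follow the earlier sketches."""
    b = B * gmm.weights_                    # per-class budgets b_n = B * g_n
    for x in contexts:
        n = gmm.predict(x[None, :3])[0]     # class of this bubble (time, lat, lng)
        if b[n] <= 0:
            continue                        # class budget exhausted: dummy action
        a = bandit.select(x)                # recommended action from LinUCB
        r = env(x, a)                       # 1 if the user takes the recommendation
        if a != 0:
            b[n] -= unit_cost               # non-dummy actions consume budget
        bandit.update(a, x, r)              # update the LinUCB parameters
```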
In some embodiments with respect to linear contextual bandits with ALP, because human behavior has a degree of randomness, budget allocation via a fixed spatial-temporal empirical distribution may lead to a waste of budget. Moreover, for an infinite time-horizon, the learning process is not continuous. In other words, for passengers later in the day, the remaining budget of the platform may be insufficient to issue coupons.

In some embodiments, ALP may be used to soften the learning process. Here, the spatiotemporal features may be treated as a finite context by using the GMM estimation. The other features represent the user information and some environment information, so the full context becomes an infinite set. Different from the spatiotemporal features, this information cannot be clustered, because clustering would lose information. And if a huge number of categories were used for clustering to reduce the information loss, the distribution of the user context classes would be difficult to evaluate, because the empirical evaluation of the distribution needs a large amount of history data. Since human behavior changes slowly, not all of the data can be used to estimate the human distribution. Only using the spatiotemporal features to execute the ALP-UCB algorithm is a feasible option, but may lose some personalized information.
In some embodiments, the disclosed algorithm combines the advantages of ALP-UCB and LinUCB. The first part of the features of the history data may be used to evaluate the spatiotemporal distribution $\{\pi_j\}_{j=1}^{J}$, which indicates the bubbling probability at different times and spatial locations. Instead of allocating the budget in a fixed distribution, the allocation strategy will change with the remaining budget and time-horizon. A linear payoff function like that of LinUCB is assumed: $\mathbb{E}[r_t^{a} \mid x_{t,a}] = x_{t,a}^{\top} \theta_a^{*}$. By calculating the reward expectation of each arm, the reward estimation score is as follows:

$$p_{t,a} = x_{t,a}^{\top} \hat{\theta}_a + \alpha \sqrt{x_{t,a}^{\top} A_a^{-1} x_{t,a}}.$$

Then, the best $a_t$ with the highest reward estimation is chosen:

$$a_t = \arg\max_{a \in A_t} p_{t,a}.$$

$c_j(t)$ is the number of times that the spatiotemporal context j has occurred, and $s_{j,k}(t) = s_{j,k}(t-1) + p_{t,k}$ is the total reward estimation score. When getting the reward estimation of x, the empirical reward of spatiotemporal context j can be set as

$$\hat{u}_j(t) = \frac{s_{j,a_t}(t)}{c_j(t)}.$$
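Combining the pieces, one decision round of the LinUCB-ALP scheme could be sketched as below, reusing `alp_probabilities` and the `LinUCB` class from the earlier snippets. This is a simplified reading of the combined algorithm, not the exact procedure of the disclosure:

```python
import numpy as np

def linucb_alp_step(bandit, x, j, pi, u_hat, b, tau, rng):
    """LinUCB proposes the best arm; ALP, with the dynamic ratio
    rho_t = b / tau and the empirical per-class rewards u_hat, keeps it
    with probability p_j or falls back to the dummy action 0."""
    a = bandit.select(x)                     # arm with the highest UCB score
    rho_t = b / max(tau, 1)                  # remaining budget per remaining round
    p = alp_probabilities(pi, u_hat, rho_t)  # closed-form ALP solution from above
    if a != 0 and rng.random() > p[j]:
        a = 0                                # skip: class not worth budget right now
    return a
```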
In some embodiments, the evaluation of the disclosed methods is different from supervised learning, because online learning in a real application platform is costly and would make the online system unstable. Thus, an offline environment may be used to help train the algorithm. In one embodiment, with historical bandit data, a simulator is first trained to simulate the online environment. With respect to evaluating the LinUCB-ALP scheme and the baseline comparison, the experimental setup is as follows.

Data collection. In one example, a history dataset including user bubble data within a geographical location for a certain period may be collected. The user data may be collected from online ride-hailing histories from a ride-hailing platform. Each piece of user data may comprise a bubble time, a bubble spatial-temporal location, and bubble user information. Also, each piece of data has a bandit feedback, which includes the action and the send feature (i.e., the send feature is used as the feedback in the disclosed system). Each dataset may be chronologically ordered by bubble time. The first 3/4 of the datasets may be used for training the offline simulator, and the remaining 1/4 is used to perform three experiments: (1) pre-training the bandit algorithm (some baselines may not be pre-trained), (2) learning the bandit algorithm, and (3) simulating the online test setting.
Environment Setting. The problem of activity recommendation under a budget limitation may be implemented in a real-world application of a ride-hailing platform, and the data may be collected from a real online environment. For a ride-recommendation activity, considering the user’s personality, a contextual MAB may be used to model the recommendation process, defined as M = (S, A, r); it is a special case of a Markov decision process (MDP), the elements of which are defined as follows.
Agent: the ride-hailing platform is set as the agent for the recommendation problem. The agent is to be trained to know how to make decisions for different users according to their personalized features.
State: the state of the agent is the set of bubble user features on the ride-hailing platform, and stands for the personalized information of the bubble user and the information of the environment (e.g., weather conditions).
Action: an action is the platform recommending an activity to a passenger, subject to the condition that the cost of the actions in each turn cannot exceed the budget limit.
Reward: the optimal target of the agent is to maximize the cumulative reward from the start to round t. The agent has a reward expectation function and makes a decision according to the expected reward. Then, the environment returns to the agent a real reward for executing the chosen action.
Simulator: to replace the online environment in the early time, an offline simulator may be used. The simulator may be trained by supervised learning with a large amount of history data. For context x and policy a, the simulator gives the reward $r = s_a(x)$, and r can be treated as the environment feedback.
In some embodiments, with the history logging data, supervised learning is used to train the simulator $S(x, a) = \{s_1(x), s_2(x), \ldots, s_k(x)\}$, $A = \{1, 2, \ldots, k\}$. That is, k models are learned, with each kind of action maintaining a model (e.g., the k-th action maintains the simulator model $s_k(x)$). To avoid bias of the simulator, the history data used to train the simulators needs to be large and balanced for each action. For each model in the simulator, xgboost may be used as the classification machine. The history reward may be input as a binary label to each classification machine. Almost 3/4 of the history data may be used to train the simulators, and the data is re-sampled according to the different labels (rewards) to achieve balanced learning.
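A sketch of the simulator training with xgboost, one binary classifier per action; the re-sampling for label balance is omitted here, and the hyperparameters are arbitrary:

```python
import numpy as np
from xgboost import XGBClassifier

def train_simulator(X, actions, rewards, k):
    """Fit one model s_a(x) per action on the history logs; the binary
    history reward is the label."""
    simulators = []
    for a in range(k):
        mask = actions == a
        clf = XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.1)
        clf.fit(X[mask], rewards[mask])
        simulators.append(clf)
    return simulators

def simulate_reward(simulators, x, a, rng):
    """Offline environment feedback: sample r from the predicted probability
    that action a is accepted for context x."""
    prob = simulators[a].predict_proba(x[None, :])[0, 1]
    return int(rng.random() < prob)
```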
In some embodiments, after training the simulator S, the Matthews correlation coefficient (MCC) is checked for each simulator model on the other actions’ datasets (e.g., after learning $s_1$, the history data for all actions except action 1 is used as the test set $X_{A \setminus \{1\}}$, with the corresponding true labels $R_{A \setminus \{1\}}$; the prediction reward $\hat{R}_{A \setminus \{1\}} = s_1(X_{A \setminus \{1\}})$ is then obtained, and $R_{A \setminus \{1\}}$ and $\hat{R}_{A \setminus \{1\}}$ are used to calculate the MCC). The MCC may be used to find whether there is covariate shift between the training set and the test set. In the disclosed setting, there may be four types of recommendations, and the MCC matrix is shown in Table I.
TABLE I
MCC FOR EACH SIMULATOR MODEL
As shown in Table I, there is a high correlation between the training data and the testing data for each simulator model, so the simulator can be used for evaluating the disclosed bandit algorithms.
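The covariate-shift check can be reproduced with scikit-learn's matthews_corrcoef, scoring each simulator model on the logs of all the other actions as in the parenthetical example above (a sketch under the interfaces of the previous snippet):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def covariate_shift_mcc(simulators, X, actions, rewards, k):
    """For each action a, test s_a on the data logged under the other
    actions; a high MCC suggests no harmful covariate shift."""
    mcc = np.zeros(k)
    for a in range(k):
        mask = actions != a                       # test set X_{A \ {a}}
        pred = simulators[a].predict(X[mask])
        mcc[a] = matthews_corrcoef(rewards[mask], pred)
    return mcc
```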
In some embodiments, in addition to the history policy, five extensive experiments may be conducted to evaluate the effectiveness of the disclosed method in ride-hailing recommendation.
UCB-Spent greedy: for this algorithm, a fixed budget B is set in LinUCB. The agent cannot perceive the budget information until the budget has run out; that is, as long as B > 0, the agent is not subject to any restrictions.
UCB-ALP: for this algorithm, three dimensions (time, latitude, longitude) are selected, and GMM is used to cluster the three dimensions into 100 spatiotemporal Gaussian distributions. The context of this experiment setting is thus a finite spatiotemporal context set with 100 members. The user personalized features are dropped in this experiment setting, because clustering users is not realistic, as discussed above.
UCB-Normal distribution: for this algorithm, the budget is allocated into the different spatial-temporal classes as a fixed budget allocation strategy. The spatial-temporal classes are learned by GMM, and empirical estimation is used to get the distribution of the different spatial-temporal classes. Algorithm 1 in FIG. 2 shows the allocation of the budget B to the different spatial-temporal distributions. The spatial-temporal classes may be derived from historical ride hailing data.
UCB-Even distribution: for this algorithm, the time is cut into 7 peak periods, and the geographic locations are mapped into a grid world. One grid is a hexagonal area with a radius of five kilometers. Similar to the UCB-Normal distribution, history data is used to get the 7×4147 distributions by empirical estimation, and the budget is allocated with this spatiotemporal distribution.
Pretrain (warm) LinUCB-ALP: to make the learning process more stable and robust, the parameters used to initialize LinUCB-ALP are learned by LinUCB, which warms up the LinUCB-ALP learning process. The datasets for pre-training and for training LinUCB-ALP may be different.
In some embodiments with respect to performance metrics, in a real platform online environment, the online learning algorithm (e.g., the bandit algorithm) cannot learn continuously, because continuous updates would make the online environment unstable. Thus, the strategy for online learning in an online environment is to fix the parameters for several days and then collect the data of these days to update the algorithm. In addition to learning on the simulator, the online platform setting may be simulated by fixing the model parameters and predicting. To that end, the dataset may be divided into two parts: (1) the first part is used to learn the six algorithms on the simulators; (2) the second part is used to evaluate the six algorithms according to the fixed parameters learned in (1). Two metrics, (1) average reward (AR) and (2) budget using ratio (BUR), may be used:
$$\mathrm{AR} = \frac{\sum_{t=1}^{T} r_t}{T}, \qquad \mathrm{BUR} = \frac{B - b_{\mathrm{remaining}}}{B},$$

where T is the number of evaluation rounds, $r_t$ is the reward received in round t, B is the total budget, and $b_{\mathrm{remaining}}$ is the budget left at the end of the evaluation.
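Under that reading of the two metrics, their computation is a one-liner each (this interpretation of AR and BUR is inferred from the metric names, not spelled out in the source):

```python
def average_reward(rewards):
    """AR: mean reward over the T evaluation rounds."""
    return sum(rewards) / len(rewards)

def budget_using_ratio(B, b_remaining):
    """BUR: fraction of the total budget B actually spent."""
    return (B - b_remaining) / B
```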
In some embodiments, Table II and Table III summarize the results of all the compared methods with respect to the two evaluation metrics. From the results, the greedy LinUCB algorithm with a budget limit performs well in exploration on both the learning dataset and the deployment dataset, and its budget does not run out quickly. The fixed-budget UCB-Even and UCB-Normal distributions have close cumulative rewards; the UCB algorithm with the normal (GMM-learned) distribution is slightly better than the UCB algorithm with the even distribution and has a more uniform budget allocation, although the allocation is still nonuniform. For UCB-ALP, the performance is lower than that of all the other algorithms. The last two algorithms (LinUCB-ALP and pretrained LinUCB-ALP) both perform well in cumulative reward and budget using ratio, and both have a more uniform allocation. Thus, pre-training can improve the performance.
Figure PCTCN2019104790-appb-000051
FIG. 3 illustrates a flowchart of an exemplary method 300, according to various embodiments of the present disclosure. The method 300 may be implemented in various environments including, for example, the environment 100 of FIG. 1. The method 300 may be performed by computing system 102 of FIG. 1 and computer system 500 of FIG. 5. The operations of the method 300 presented below are intended to be illustrative. Depending on the implementation, the method 300 may include additional, fewer, or alternative steps performed in various orders or in parallel. The method 300 may be implemented in various computing systems or devices including one or more processors.
At block 302, historical ride hailing data (e.g., a background user information dimension, a real-time user information dimension, a weather dimension, a spatial dimension, a temporal dimension, and a discount distribution dimension) may be obtained. At block 304, a model may be trained with the historical ride hailing data to obtain a trained model. For example, the model may include a multi-armed bandits adaptive linear programming algorithm with an infinite contextual space. At block 306, a request for a ride-hailing service may be received from a user device associated with a user accessing an online platform. At block 308, a time, a location, and a promotion budget associated with the request may be obtained. At block 310, the obtained time, location, and promotion budget may be input to the trained model to determine a price discount for the request. At block 312, the determined price discount may be transmitted to the user device to notify the user.
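The flow of blocks 302-312 can be summarized as a short skeleton; the sketch below is only illustrative, and the names method_300, model_trainer, and notify are assumptions rather than the disclosed implementation.

from dataclasses import dataclass

@dataclass
class RideRequest:
    time: str         # e.g., "2019-09-06T08:30"
    location: tuple   # (latitude, longitude)
    budget: float     # remaining promotion budget for this time and location

def method_300(historical_data, request, model_trainer, notify):
    """Illustrative skeleton of blocks 302-312 of method 300."""
    # Blocks 302-304: obtain historical ride hailing data and train the model.
    model = model_trainer(historical_data)
    # Blocks 306-310: receive the request, gather its context, determine a discount.
    discount = model.predict(request.time, request.location, request.budget)
    # Block 312: transmit the determined price discount to the user device.
    notify(discount)
    return discount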
FIG. 4 is a block diagram that illustrates a computing device 400 upon which various of the embodiments described herein may be implemented. The computing device 400 may correspond to the user devices 140 and the mobile devices of the drivers of the vehicles 150 of FIG. 1 described above. The computing device 400 includes a communication platform 410 or other communication mechanism for communicating information, a display 420, and a graphics processing unit (GPU) 430 and central processing unit (CPU) 440 for processing information. Display 420 may provide user interface functionality, such as a graphical user interface (“GUI”). CPU 440 may be, for example, one or more general purpose microprocessors.
The computing device 400 also includes an input/output (IO) 450 and a memory 460. Memory 460 may be a random access memory (RAM), cache, and/or other dynamic storage device for storing information, including an operating system (OS) 470 and applications 480. A storage 490, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided for storing information and instructions. Information may be read into memory 460 from another storage medium, such as storage 490. Execution of the sequences of instructions contained in memory 460 causes CPU 440 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
FIG. 5 is a block diagram that illustrates a computer system 500 upon which any of the embodiments described herein may be implemented. The system 500 may correspond to the system 102 of FIG. 1 described above. The computer system 500 includes a bus 502 or other communication mechanism for communicating information, and one or more hardware processors 504 coupled with bus 502 for processing information. Hardware processor(s) 504 may be, for example, one or more general purpose microprocessors. The processor(s) 504 may correspond to the processor 104 described above.

The computer system 500 also includes a main memory 506, such as a random access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 502 for storing information and instructions to be executed by processor(s) 504. Main memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor(s) 504. Such instructions, when stored in storage media accessible to processor(s) 504, render computer system 500 into a special-purpose machine that is customized to perform the operations specified in the instructions. The computer system 500 further includes a read only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor(s) 504. A storage device 510, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 502 for storing information and instructions.

The computer system 500 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 500 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 500 in response to processor(s) 504 executing one or more sequences of one or more instructions contained in main memory 506. Such instructions may be read into main memory 506 from another storage medium, such as storage device 510. Execution of the sequences of instructions contained in main memory 506 causes processor(s) 504 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The main memory 506, the ROM 508, and/or the storage 510 may include non-transitory storage media. The term “non-transitory media,” and similar terms, as used herein refers to media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 510. Volatile media includes dynamic memory, such as main memory 506. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.
The computer system 500 also includes a network/communication interface 518 coupled to bus 502. Communication interface 518 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks. The communication interface 518 may be implemented as one or more network ports. For example, communication interface 518 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 518 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or a WAN component to communicate with a WAN). Wireless links may also be implemented. In any such implementation, communication interface 518 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
The computer system 500 can send messages and receive data, including program code, through the network(s), network link and communication interface 518. In the Internet example, a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 518.
The received code may be executed by processor(s) 504 as it is received, and/or stored in storage device 510, or other non-volatile storage for later execution.
Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code modules executed by one or more computer systems or computer processors comprising computer hardware. The processes and algorithms may be implemented partially or wholly in application-specific circuitry.
The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of this disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The exemplary blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed exemplary embodiments. The exemplary systems  and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed exemplary embodiments.
Moreover, while the system and method in the present disclosure are described primarily in regard to ride-hailing recommendation, it should also be understood that the present disclosure is not intended to be limiting. The system or method of the present disclosure may be applied to any other kind of services. For example, the system or method of the present disclosure may be applied to transportation systems of different environments including land, ocean, aerospace, or the like, or any combination thereof. The vehicle of the transportation systems may include a taxi, a private car, a carpool, a bus, a train, a bullet train, a high speed rail, a subway, a vessel, an aircraft, a spaceship, a hot-air balloon, a driverless vehicle, or the like, or any combination thereof. The transportation system may also include any transportation system for management and/or distribution, for example, a system for sending and/or receiving an express delivery. The application of the system or method of the present disclosure may be implemented on a user device and include a webpage, a plug-in of a browser, a client terminal, a custom system, an internal analysis system, an artificial intelligence robot, or the like, or any combination thereof.
The various operations of exemplary methods described herein may be performed, at least partially, by an algorithm. The algorithm may be comprised in program codes or instructions stored in a memory (e.g., a non-transitory computer-readable storage medium described above). Such an algorithm may comprise a machine learning algorithm. In some embodiments, a machine learning algorithm may not explicitly program computers to perform a function, but can learn from training data to build a prediction model that performs the function.
The various operations of exemplary methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented engines that operate to perform one or more operations or functions described herein.
Similarly, the methods described herein may be at least partially processor-implemented, with a particular processor or processors being an example of hardware. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented engines. Moreover, the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS) . For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors) , with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an Application Program Interface (API) ) .
The performance of certain of the operations may be distributed among the processors, not only residing within a single machine, but deployed across a number of machines. In some exemplary embodiments, the processors or processor-implemented engines may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm) . In other exemplary embodiments, the processors or processor-implemented engines may be distributed across a number of geographic locations.
Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as  separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.
Although an overview of the subject matter has been described with reference to specific exemplary embodiments, various modifications and changes may be made to these embodiments without departing from the broader scope of embodiments of the present disclosure. Such embodiments of the subject matter may be referred to herein, individually or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single disclosure or concept if more than one is, in fact, disclosed.
The embodiments illustrated herein are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed. Other embodiments may be used and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. The Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.
Any process descriptions, elements, or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code which include  one or more executable instructions for implementing specific logical functions or steps in the process. Alternate implementations are included within the scope of the embodiments described herein in which elements or functions may be deleted, executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.
As used herein, the term “or” may be construed in either an inclusive or exclusive sense. Moreover, plural instances may be provided for resources, operations, or structures described herein as a single instance. Additionally, boundaries between various resources, operations, engines, and data stores are somewhat arbitrary, and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present disclosure. In general, structures and functionality presented as separate resources in the exemplary configurations may be implemented as a combined structure or resource. Similarly, structures and functionality presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within a scope of embodiments of the present disclosure as represented by the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms “a,” “an,” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprise,” “comprises,” and/or “comprising,” “include,” “includes,” and/or “including,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "passenger, " "requester, " "service requester, " "customer" and "user" in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may request or order a service. Also, the term "driver, " "provider, " and "service provider" in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may provide a service or facilitate the providing of the service.
The term "service request, " "request for a service, " "requests, " and "order" in the present disclosure are used interchangeably to refer to a request that may be initiated by a passenger, a service requester, a customer, a driver, a provider, a service provider, or the like, or any combination thereof. The service request may be accepted by any one of a passenger, a service requester, a customer, a driver, a provider, or a service provider.
The terms “user device,” “service provider terminal,” “provider terminal,” and “driver terminal” in the present disclosure are used interchangeably to refer to a computing device (e.g., mobile terminal) that is used by a service provider to provide a service or facilitate the providing of the service. The terms “service requester terminal,” “requester terminal,” and “passenger terminal” in the present disclosure are used interchangeably to refer to a mobile terminal that is used by a service requester to request or order a service. The term “distance” between two locations may refer to a linear distance between the two locations and/or a route distance along a route between the two locations.

Claims (20)

  1. A recommendation method, comprising:
    obtaining historical ride hailing data;
    training a model with the historical ride hailing data to obtain a trained model;
    receiving, from a user device associated with a user accessing an online platform, a request for a ride-hailing service;
    obtaining a time, a location, and a promotion budget associated with the request;
    inputting the obtained time, location, and promotion budget to the trained model to determine a price discount for the request; and
    transmitting the determined price discount to the user device to notify the user.
  2. The method of claim 1, wherein:
    the promotion budget comprises a remaining budget amount with respect to the time within a period and with respect to the location.
  3. The method of any of claims 1-2, wherein:
    the model comprises a reinforcement learning algorithm based on an action of discount allocation and a policy of maximizing a total number of orders completed through the online platform within a period; and
    the discount allocation is subject to a fixed budget ceiling for the period.
  4. The method of any of claims 1-3, wherein the trained model comprises a LinUCB-Adaptive-Linear-Programming (LinUCB-ALP) algorithm.
  5. The method of any of claims 1-4, wherein training the model with the  historical ride hailing data to obtain the trained model comprises:
    pre-training the model with a portion of the historical ride hailing data to obtain a pre-trained model; and
    training the pre-trained model with another portion of the historical ride hailing data to obtain the trained model.
  6. The method of claim 5, wherein the pre-trained model comprises a Lin-upper-confidence-bound (LinUCB) algorithm.
  7. The method of any of claims 1-6, wherein:
    the model comprises a multi-armed bandits adaptive linear programming algorithm with an infinite contextual space; and
    training the model with the historical ride hailing data to obtain the trained model comprises learning the infinite contextual space through a Gaussian Mixture Model.
  8. The method of any of claims 1-7, wherein the historical ride hailing data comprises one or more of the following dimensions:
    a background user information dimension, a real-time user information dimension, a weather dimension, a spatial dimension, a temporal dimension, and a discount distribution dimension.
  9. The method of claim 8, wherein:
    obtaining the time, the location, and the promotion budget associated with the request comprises: obtaining the time, the location, the promotion budget, background user information, real-time user information, and a weather associated with the request; and
    inputting the obtained time, location, and promotion budget to the trained model to determine the price discount comprises: inputting the obtained time, location, promotion budget, background user information, real-time user information, and weather to the trained model to determine the price discount.
  10. The method of claim 8, wherein the background user information dimension comprises one or more of the following attributes:
    gender, registration date, registration location, application login history, and ride-hailing order history.
  11. The method of claim 8, wherein the real-time user information dimension comprises one or more of the following attributes:
    number of recently completed orders, distance travelled for a recent order, time of a recent order, and price paid for a recent order.
  12. The method of claim 8, wherein the weather dimension comprises one or more of the following attributes:
    humidity, precipitation, wind, UV metric, air pollution metric, and weather condition.
  13. The method of claim 8, wherein:
    the historical ride hailing data corresponds to a geographical area mapped into a plurality of grids; and
    for each piece of the historical ride hailing data associated with an order in one of the plurality of grids, the spatial dimension comprises one or more of the following attributes:
    grid index of the one grid, number of vehicles in the one grid, number of ride-hailing orders accepted in the one grid, number of ride-hailing orders completed in the one grid, and waiting time for ride-hailing orders requested in the one grid.
  14. The method of claim 8, wherein for each piece of the historical ride hailing data requesting a historical order, the temporal dimension comprises one or more of the following attributes:
    month, day-of-the-week, time-in-the-day, and peak-or-off-peak-hour.
  15. The method of claim 8, wherein for each piece of the historical ride hailing data completing a historical order, the discount distribution dimension comprises one or more of the following attributes:
    whether discount was offered, discount type offered, offered price discount, and whether discount was used.
  16. The method of any of claims 1-15, wherein the price discount comprises: no discount or a nonzero discount.
  17. The method of any of claims 1-16, wherein the trained model defines a plurality of arms, and wherein training the model comprises:
    evaluating, based on the historical ride hailing data, a spatiotemporal distribution;
    defining a linear payoff function;
    determining a reward expectation of each arm of the plurality of arms;
    selecting, among the plurality of arms, an arm with the highest reward expectation;
    training the model, based on the spatiotemporal distribution, the linear payoff function, and the selected arm, to obtain the trained model.
  18. A recommendation system, comprising:
    one or more processors; and
    one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform the method of any of claims 1 to 17.
  19. A recommendation apparatus comprising a plurality of modules for performing the method of any of claims 1 to 17.
  20. A non-transitory computer-readable storage medium configured with instructions executable by one or more processors to cause the one or more processors to perform the method of any of claims 1 to 17.
PCT/CN2019/104790 2019-06-05 2019-09-06 Constrained spatiotemporal contextual bandits for real-time ride-hailing recommendation WO2020244081A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CNPCT/CN2019/090128 2019-06-05
CN2019090128 2019-06-05

Publications (1)

Publication Number Publication Date
WO2020244081A1 true WO2020244081A1 (en) 2020-12-10

Family

ID=73652360

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/104790 WO2020244081A1 (en) 2019-06-05 2019-09-06 Constrained spatiotemporal contextual bandits for real-time ride-hailing recommendation

Country Status (1)

Country Link
WO (1) WO2020244081A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023034625A1 (en) * 2021-09-03 2023-03-09 Protech Electronics Llc System and method for identifying advanced driver assist systems for vehicles

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013090192A1 (en) * 2011-12-12 2013-06-20 Oracle International Corporation Advice of promotion for usage based subscribers
CN104537502A (en) * 2015-01-15 2015-04-22 北京嘀嘀无限科技发展有限公司 Method and device for processing orders
CN105160711A (en) * 2015-08-20 2015-12-16 北京嘀嘀无限科技发展有限公司 Dynamic price adjustment method and device

Similar Documents

Publication Publication Date Title
US20210232984A1 (en) Order allocation system and method
US10989548B2 (en) Systems and methods for determining estimated time of arrival
CN109863526B (en) System and method for providing information for on-demand services
US20200050938A1 (en) Systems and methods for improvement of index prediction and model building
US11011057B2 (en) Systems and methods for generating personalized destination recommendations
US11138888B2 (en) System and method for ride order dispatching
US20200005420A1 (en) Systems and methods for transportation capacity dispatch
WO2019232693A1 (en) System and method for ride order dispatching
US11507894B2 (en) System and method for ride order dispatching
JP7047096B2 (en) Systems and methods for determining estimated arrival times for online-to-offline services
CN111476588A (en) Order demand prediction method and device, electronic equipment and readable storage medium
CN110998568A (en) Navigation determination system and method for embarkable vehicle seeking passengers
TW201901185A (en) System and method for determining estimated arrival time
US11068815B2 (en) Systems and methods for vehicle scheduling
WO2022127517A1 (en) Hierarchical adaptive contextual bandits for resource-constrained recommendation
WO2020244081A1 (en) Constrained spatiotemporal contextual bandits for real-time ride-hailing recommendation
US20220327650A1 (en) Transportation bubbling at a ride-hailing platform and machine learning
WO2020248220A1 (en) Reinforcement learning method for incentive policy based on historic data trajectory construction
US20220196413A1 (en) Systems and methods for simulating transportation order bubbling behavior
CN111260104B (en) Order information dynamic adjustment method and device
WO2020243963A1 (en) Systems and methods for determining recommended information of service request
WO2021051221A1 (en) Systems and methods for evaluating driving path
CN116402323A (en) Taxi scheduling method
CN111260103A (en) Order information dynamic adjustment method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19931838

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19931838

Country of ref document: EP

Kind code of ref document: A1