WO2020244081A1

WO2020244081A1 - Constrained spatiotemporal contextual bandits for real-time ride-hailing recommendation

Info

Publication number: WO2020244081A1
Application number: PCT/CN2019/104790
Authority: WO
Inventors: Qingyang Li; Mengyue YANG; Zhiwei QIN; Jieping Ye
Original assignee: Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date: 2019-06-05
Filing date: 2019-09-06
Publication date: 2020-12-10

Abstract

Ride hailing recommendations may be provided in real-time using contextual bandits with budget and spatiotemporal contrarians. Historical ride hailing data may be obtained. A model may be trained with the historical ride hailing data to obtain a trained model. A request for a ride-hailing service may be received from a user device associated with a user accessing an online platform. A time, a location, and a promotion budget associated with the request may be obtained.The obtained time, location, and promotion budget may be input to the trained model to determine a price discount for the request. The determined price discount may be transmitted to the user device to notify the user.

Description

CONSTRAINED SPATIOTEMPORAL CONTEXTUAL BANDITS FOR REAL-TIME RIDE-HAILING RECOMMENDATION

CROSS-REFERENCE TO TELATED APPLICATIONS

The present application is based on and claims priority to the International Patent Application No. PCT/CN2019/090128, filed on June 5, 2019, and titled “CONSTRAINED SPATIOTEMPORAL CONTEXTUAL BANDITS FOR REAL-TIME RIDE-HAILING RECOMMENDATION, the entire contents of which are incorporated herein by reference in the entirety.

TECHNICAL FIELD

The present disclosure generally relates to ride-hailing recommendation, and more specifically, to methods and systems for ride-hailing recommendation based on constrained spatiotemporal contextual bandits.

BACKGROUND

A vehicle dispatch platform can automatically receive ride-hailing requests from user devices (passenger side) , provide price quotes, and upon user acceptances, allocate the ride-hailing requests to devices of vehicle drivers (driver side) for providing respective transportation services. For the platform, it has been challenging to optimize the distribution of limited promotional resources in order to maximize the number of accepted ride hailing orders. The distribution often involves real-time activity recommendations to the passenger side. For example, when a passenger logs in the vehicle dispatch platform from the passenger side to check out a ride hailing price (also referred to as passenger bubbling) , the platform may send a discount coupon to the passenger (e.g., directly applied to the price) to encourage ordering.

The difficulties of the real-time activity recommendation come from many aspects. In particular, the time and location of each instance of passenger bubbling (e.g., when a user fills in the destination inquiry and chooses a service mode associated with a corresponding pricing tier) is unique. For example, the number of drivers in different geographical locations are different. If a passenger bubbles at a location with sparse drivers, even if a discount coupon is issued and applied, the order may not be completed. Thus, it is desirable to provide a solution that makes real-time activity recommendations according to individual circumstances to improve the efficiency and effect of the recommendations and maximize the total order number.

For another example, budget planning in the space-time dimension can be difficult for the platform. The number of times that the platform issues discount coupons or the coupon budget each day is limited, and the revenue generated by passenger bubbling in different time and space dimensions may be different. For example, for passengers bubbling at night, issuing coupons may generate a higher revenue than during day time. However, the current daily coupon budget may have been exhausted by afternoon. Thus, it is important to determine optimal budget allocation in different time and space dimensions to maximize the total order number.

SUMMARY

Various embodiments of the present disclosure include systems, methods, and non-transitory computer readable media for ride-hailing recommendation.

According to one aspect, a recommendation method may include obtaining historical ride hailing data, and training a model with the historical ride hailing data to obtain a trained model. The method may further include receiving a request for a ride-hailing service from a user device associated with a user accessing an online platform, and obtaining a time, a location, and a promotion budget associated with the request. The method may further included inputting the obtained time, location, and promotion budget to the trained model to determine a price discount for the request, and transmitting the determined price discount to the user device to notify the user.

In some embodiments, a promotion budget may include a remaining budget amount with respect to the time within a period and with respect to the location.

In some embodiments, a model may include a reinforcement learning algorithm based on an action of discount allocation and a policy of maximizing a total number of orders completed through the online platform within a period. The discount allocation may be subject to a fixed budget ceiling for the period.

In some embodiments, the trained model may be a LinUCB-Adaptive-Linear-Programming (LinUCB-ALP) algorithm.

In some embodiments, training a model with historical ride hailing data to obtain the trained model may include pre-training the model with a portion of the historical ride hailing data to obtain a pre-trained model. The pre-trained model may be trained with another portion of the historical ride hailing data to obtain the trained model.

In some embodiments, the pre-trained model may be a Lin-upper-confidence-bound (LinUCB) algorithm.

In some embodiments, a model may include a multi-armed bandits adaptive linear programming algorithm with an infinite contextual space. Training the model with historical ride hailing data to obtain a trained model may include learning the infinite contextual space through a Gaussian Mixture Model.

In some embodiments, historical ride hailing data may include one or more dimensions. The dimensions may include a background user information dimension, a real-time user information dimension, a weather dimension, a spatial dimension, a temporal dimension, and a discount distribution dimension.

In some embodiments, obtaining the time, the location, and the promotion budget associated with the request may include obtaining the time, the location, the promotion budget, background user information, real-time user information, and a weather associated with the request. Inputting the obtained time, location, and promotion budget to the trained model to determine the price discount may include inputting the obtained time, location, promotion budget, background user information, real-time user information, and weather to the trained model to determine the price discount.

In some embodiments, the background user information dimension may include one or more attributes including gender, registration date, registration location, application login history, and ride-hailing order history.

In some embodiments, the real-time user information dimension may include one or more attributes including number of recently completed orders, distance travelled for a recent order, time of a recent order, and price paid for a recent order.

In some embodiments, the weather dimension may include one or more attributes including humidity, precipitation, wind, UV metric, air pollution metric, and weather condition.

In some embodiments, historical ride hailing data may correspond to a geographical area mapped into a plurality of grids. For each piece of the historical ride hailing data associated with an order in one of the plurality of grids, the spatial dimension may include one or more of attributes. The attributes may include a grid index of the one grid, a number of vehicles in the one grid, a number of ride-hailing orders accepted in the one grid, a number of ride-hailing orders completed in the one grid, and waiting time for ride-hailing orders requested in the one grid.

In some embodiments, for each piece of the historical ride hailing data requesting a historical order, the temporal dimension may include one or more attributes including month, day-of-the-week, time-in-the-day, and peak-or-off-peak-hour.

In some embodiments, for each piece of the historical ride hailing data completing a historical order, the discount distribution dimension may include one or more attributes including whether discount was offered, discount type offered, offered price discount and whether discount was used.

In some embodiments, the price discount may comprise: no discount or a nonzero discount.

In some embodiments, the trained model may define a plurality of arms, and training the model may include evaluating a spatiotemporal distribution based on the historical ride hailing data. Training the model may further include defining a linear payoff function, and determining a reward expectation of each arm of the plurality of arms. Training the model may further include selecting, among the plurality of arms, an arm with the highest reward expectation. The trained model may be trained based on the spatiotemporal distribution, the linear payoff function, and the selected arm.

According to some embodiments, a recommendation system comprises one or more processors and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform the method of any of the preceding embodiments.

According to some embodiments, a non-transitory computer-readable storage medium is configured with instructions executable by one or more processors to cause the one or more processors to perform the method of any of the preceding embodiments.

According to some embodiments, a recommendation apparatus comprises a plurality of modules for performing the method of any of the preceding embodiments.

According to another aspect, a recommendation system may include one or more processors and one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform operations. The operations may include obtaining historical ride hailing data, and training a model with the historical ride hailing data to obtain a trained model. The operations may further include receiving a request for a ride-hailing service from a user device associated with a user accessing an online platform, and obtaining a time, a location, and a promotion budget associated with the request. The operations may further included inputting the obtained time, location, and promotion budget to the trained model to determine a price discount for the request, and transmitting the determined price discount to the user device to notify the user.

According to another aspect, a non-transitory computer-readable storage medium may be configured with instructions executable by one or more processors to cause the one or more processors to perform operations. The operations may include obtaining historical ride hailing data, and training a model with the historical ride hailing data to obtain a trained model. The operations may further include receiving a request for a ride-hailing service from a user device associated with a user accessing an online platform, and obtaining a time, a location, and a promotion budget associated with the request. The operations may further included inputting the obtained time, location, and promotion budget to the trained model to determine a price discount for the request, and transmitting the determined price discount to the user device to notify the user.

These and other features of the systems, methods, and non-transitory computer readable media disclosed herein, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for purposes of illustration and description only and are not intended as a definition of the limits of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Certain features of various embodiments of the present technology are set forth with particularity in the appended claims. A better understanding of the features and advantages of the technology will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:

FIG. 1 illustrates an exemplary system for ride-hailing recommendation, in accordance with various embodiments.

FIG. 2 illustrates an exemplary algorithm for allocating a budget with an empirical spatial-temporal distribution, in accordance with various embodiments.

FIG. 3 illustrates a flowchart of an exemplary method for ride-hailing recommendation, in accordance with various embodiments.

FIG. 4 illustrates a block diagram of an exemplary computing device in which various of the embodiments described herein may be implemented.

FIG. 5 illustrates a block diagram of an exemplary computer system in which any of the embodiments described herein may be implemented.

DETAILED DESCRIPTION

The disclosed systems and computer-implemented methods may determine real-time ride-hailing recommendations subject to spatiotemporal constraints. For example, discount coupon issuance decisions with respect to location and time under a budget constraint may be automatically made for an online ride-hailing platform to maximize the long-term benefits of the platform. Two problems may need to be solved in order to maximize performance. First, coupons may be issued such that there are available drivers in the geographical area to pick up riders when the coupons are used. A contextual multi-armed bandit algorithm may be used to solve the problem of space-time sequence decision-making and to maximize long-term benefits. Second, limited budgets may be properly allocated in the space-time dimension in order to not run out too early. Constrained Spatiotemporal Contextual Bandits may be used to solve the problem of space-time sequence decision with budget constraint. The disclosed systems and methods may have the technical effect of automatically determining the optimal recommendation decisions for each user in consideration of the location of the user, time of the day, and budget of the platform.

In some embodiments, personalized recommendations to individual users by making use of both environment information and user information are provided. The contextual multi-armed bandit algorithm may be used in various recommendation scenarios. In various embodiments of this disclosure, the activity recommendation application in a large-scale ride-hailing platform using contextual bandits task with budget and spatial-temporal constrains (referred to as budget constraints bandits) is described. In real time, the constraints significantly complicate the exploration and exploitation trade-off, it is a NP-hard (Nondeterministic Polynomial time) problem. Existing bandit algorithms of budget constraints attempt to solve the problem by just giving a budget constraints to stop training, which can lead to a quick exhaustion of the budget. In this disclosure, a situation that some industrial settings prefer to allocate budget uniformly is contemplated, because it will catch the changes of an online environment. For example, a target may be configured to “not spend all in an early time, ” for which linear programming may be used to balance instantaneous and long-term rewards.

Further, Empirical Adaptive-Linear-Programming (EALP) offers a general recipe for changing budget during an online learning process. It requires an empirical distribution of all finite context sets. In a real setting, obtaining context distribution through empirical estimation is almost an infinite context set and hardly possible because features of every passenger are different in time and geographic location. In some embodiments of this disclosure, the empirical estimation in EALP may be replaced with the estimation of spatial-temporal context distribution. Also, the contextual bandit settings of infinite context set may be combined to overcome the reasonable budget allocation problem under spatial-temporal constrains. Because online learning in a real application environment can be very costly and unsafe, an balanced environment simulator may be trained by history logging data to make offline learning feasible.

The multi-armed bandit (MAB) is a sequential decision problem, in which an agent receives a random reward by playing one of K arms at each round and wants to maximize its cumulated reward. The agent learns the inherent trade-off between exploration, identifies and understands the reward from an action, and gathers as much reward as possible from an action. The observed d-dimension features may be combined with the bandit learning (referred to as contextual multi-armed bandit) to get reasonable policies.

Contextual bandits add contextual information to the MAB problem. The corresponding algorithm may be referred to as contextual MAB algorithm. Due to the extra information features, reference to context is necessary in many applications. To a large extent, the effect of bandit algorithm will be promoted since it is more common to have relevant contextual information than not. In the generalized contextual multi-armed bandit problem, the agent observes a d-dimensional feature vector before making decision. During the learning time, the agent learns the relationship between contexts and rewards (e.g., payoffs) . In some embodiments, based on the assumption of the linear payoff function, the decision making process may be extended to considering the cost in real time, which is the budget constraint MAB setting. Since the decision making process is constrained by a budget, it may also be referred to as budget MAB. In budget MABs, playing an arm may generate consumption, and maximizing the cumulative reward under a budget constraint for the total assumption is the target in this setting.

FIG. 1 illustrates an exemplary environment 100 for ride-hailing recommendation, in accordance with various embodiments. The example environment 100 may include a computing system 102, a network 120, user devices 140, vehicles 150, a storage device 160, and satellites 170. The computing system 102 may include one or more processors and memory (e.g., permanent memory, temporary memory) . The processor (s) may be configured to perform various operations by interpreting machine-readable instructions stored in the memory. The computing system 102 may have access to other computing resources through network 120.

Network 120 may include wireless access points 130-1 and 130-2. Wireless access points 130-1 and 130-2 may allow user devices 140, vehicles 150, and satellites 170 to communicate with network 120. In some embodiments, network 120, user devices 140, and vehicles 150 may communicate with satellites 170. Satellites 170 may include satellites 170-1, 170-2, and 170-3. In some embodiments, the system 102 may be configured to obtain data (e.g., location, time, and fees for multiple vehicle transportation trips) from the data store 160 (e.g., a database or dataset of historical transportation trips) , the user devices 140, and vehicles 150. For example, system 102 may obtain GPS (Global Positioning System) coordinates of vehicles 150.

The positioning technology used in the present disclosure may be based on GPS, a global navigation satellite system (GLONASS) , a compass navigation system (COMPASS) , a Galileo positioning system, a quasi-zenith satellite system (QZSS) , a wireless fidelity (WiFi) positioning technology, or the like, or any combination thereof. One or more of the above positioning systems may be used interchangeably in the present disclosure.

The user devices 140 may include mobile device 140-1 (e.g., a smart phone, smart watch) , tablet 140-2, laptop 140-3, and other computing devices 140-4 (e.g., desktop computer, server) . User devices 140 may be used by riders on a ride sharing platform. The vehicles 150 may include vehicles 150-1, 150-2, and 150-3. The vehicles 150 may include cars, bikes, scooters, trucks, boats, trains, or autonomous vehicles. In some embodiments, the vehicles 150 may include mobile devices of drivers of the vehicles. For example, communications between the computing system 102 and the vehicles 150 may take place between computing system 102 and mobile devices of the drivers. In another example, locations of vehicles 150 may correspond to the locations of the mobile devices of the drivers.

In some embodiments, environment 100 may implement an online information or service platform. The service platform may be referred to as a vehicle (service hailing, ride sharing, or ride order dispatching) platform. The platform may accept requests for transportation, identify vehicles to fulfill the requests, arrange for pick-ups, and process transactions. For example, a user may use user device 140-1 (e.g., a mobile phone installed with a software application associated with the platform) to request transportation from the platform. The system 102 may receive the request and reply with price quote data and price discount data for one or more trips. When the user selects a trip, the system 102 may relay trip information to various drivers of vehicles 150, for example, by posting the request to mobile phones carried by the drivers. A vehicle driver may accept the posted transportation request and obtain pick-up location information. Fees such as transportation fees can be transacted among the system 102, the user devices 140, and the vehicles 150. In some embodiments, for each trip, the location of the origin and destination, the price discount information, the fee, and the time can be obtained by the system 102.

The computing system 102 may include a historical data component 111, a model training component 112, request receiving component 113, budget component 114, discount determination component 115, and discount transmitting component 116. The computing system 102 may include other components. In some embodiments, one or more of the system 102, the user devices 140, and the vehicles 150 may be integrated in a single device or system. Alternatively, the system 102, the user devices 140, and the vehicles 150 may operate as separate devices. While the computing system 102 is shown in FIG. 1 as a single entity, this is merely for ease of reference and is not meant to be limiting. One or more components or one or more functionalities of the computing system 102 described herein may be implemented in a single computing device or multiple computing devices. In some embodiments, one or more components or one or more functionalities of the computing system 102 described herein may be implemented in one or more networks (e.g., enterprise networks) , one or more endpoints, one or more servers, or one or more clouds. A server may include hardware or software which manages access to a centralized resource or service in a network. A cloud may include a cluster of servers and other devices which are distributed across a network. The system 102 above may be installed with appropriate software (e.g., platform program, etc. ) and/or hardware (e.g., wires, wireless connections, etc. ) to access other devices of the environment 100.

In some embodiments, the various components may correspond to various modules, and the computing system 102 may correspond to a recommendation apparatus. Each module may correspond to instructions stored in a non-transitory computer-readable storage medium, and the instructions are executable by one or more processors to cause the one or more processors to perform the steps described with respect to the various components.

The historical data component 111 may be configured to obtain historical ride hailing data. For example, historical ride hailing data may be obtained from storage device 160. In some embodiments, historical ride hailing data may include user features. The user features may occur independently when spatial-temporal distributions are evaluated. Features may include spatial-temporal dimensions, and the dimensions may indicate the user’s space location.

In some embodiments, the historical ride hailing data may include one or more of the following dimensions: a background user information dimension, a real-time user information dimension, a weather dimension, a spatial dimension, a temporal dimension, and a discount distribution dimension. In some embodiments, the background user information dimension may include one or more attributes including gender, registration date, registration location, application login history, and ride-hailing order history. In some embodiments, the real-time user information dimension may include one or more attributes including number of recently completed orders, distance travelled for a recent order, time of a recent order, and price paid for a recent order. In some embodiments, the number of recently completed orders may include all of the orders completed within a predetermined period of time (i.e., the past hour, the past day, the past week) .

In some embodiments, the weather dimension may include one or more attributes including humidity, precipitation, wind, UV metric, air pollution metric, and weather condition. In some embodiments, for each piece of the historical ride hailing data requesting a historical order, the temporal dimension may include one or more attributes including month, day-of-the-week, time-in-the-day, and peak-or-off-peak-hour. In some embodiments, for each piece of the historical ride hailing data completing a historical order, the discount distribution dimension may include one or more attributes including whether discount was offered, discount type offered, offered price discount, and whether discount was used.

In some embodiments, historical ride hailing data may correspond to a geographical area mapped into a plurality of grids. For each piece of the historical ride hailing data associated with an order in one of the plurality of grids, the spatial dimension may include one or more of attributes. The attributes may include a grid index of the one grid, a number of vehicles in the one grid, a number of ride-hailing orders accepted in the one grid, a number of ride-hailing orders completed in the one grid, and waiting time for ride-hailing orders requested in the one grid. In some embodiments, the attributes of the spatial dimension may correspond to a period of time. For example, the attributes may be determined at the point in time when the order was placed. In another example, the attributes may be determined based on a period of time (e.g., prior minute, prior hour, prior day) associated with the order.

The model training component 112 may be configured to train a model with the historical ride hailing data to obtain a trained model. In some embodiments, the trained model may define a plurality of arms. In some embodiments, a model may include a multi-armed bandits adaptive linear programming algorithm with an infinite contextual space. In some embodiments, the multi-armed bandits adaptive linear programming algorithm may include a K-armed budget constraints contextual bandit problem. In some embodiments, training a model with historical ride hailing data to obtain the trained model may include pre-training the model with a portion of the historical ride hailing data to obtain a pre-trained model. The pre-trained model may be trained with another portion of the historical ride hailing data to obtain the trained model.

In some embodiments, the pre-trained model may include a Lin-upper-confidence-bound (LinUCB) algorithm, and the trained model may include a LinUCB-Adaptive-Linear-Programming (LinUCB-ALP) algorithm. LinUCB is an algorithm in which the confidence interval may be computed efficiently in closed form when the payoff model is linear. In some embodiments, a spatial-temporal limited budget allocation MAB with the adaptive linear programming (ALP) may be formulated in LinUCB similarly to ALP with the upper-confidence-bound (ALP-UCB) . However, unlike ALP-UCB, the contexts used in the MAB may be in an infinite space. For example, in many possible industry recommendation scenarios, the context may denote the combined features of passengers. The finite context assumption of ALP-UCB does not work in these scenarios. Thus, the context distribution may be seen as a uniform distribution.

In some embodiments, training the model may include evaluating a spatiotemporal distribution based on the historical ride hailing data. In some embodiments, training the model with historical ride hailing data to obtain a trained model may include learning the infinite contextual space through a Gaussian Mixture Model (GMM) . For example, EM-based cluster-GMM may be used to learn the user bubble distribution in spatial-temporal dimensions.

is set to denote J different gaussian distribution. After learning GMM, G (x) , may be used to find which gaussian distribution (spatial-temporal distribution) the context x belongs to.

In some embodiments with respect to linear contextual multi-armed bandit setting, for a k-armed stochastic bandit system, in each round t, the agent observes an action set A _t∈ {1, 2, ..., K} , and user feature context x _t arrives independently with identical distribution P {x _t=j} =π _j. In some embodiments, training the model may include defining a linear payoff function, and determining a reward expectation of each arm of the plurality of arms. Training the model may further include selecting, among the plurality of arms, an arm with the highest reward expectation. The trained model may be trained based on the spatiotemporal distribution, the linear payoff function, and the selected arm. For example, based on observed payoffs in previous trails, the agent calculates the expectation of reward

and receives the payoff

and cost

after executing the action in the environment. If a _t =0, then

The agent chooses an arm a _t∈A _t by selecting the arm (e.g., whether and which coupon to issue) with max expectation at trail t. With a probability of at least 1-δ (δ is epsilon-greedy) , the regret (e.g., the difference between the algorithm output and the theoretical best decision) in trail t has an upper bound which is constrained by a constant

Algorithm select the best arm by following formulation.

Where

D _a is a design matrix of dimension m by d at trial t, whose rows correspond to m training inputs. I _d is a d by d identity matrix. The algorithm updates the parameters to improve the policy with a current observation

In some embodiments with respect to MAB subject to budget limit, the budget constraints contextual bandits can be formalized as follows. Assuming the time-horizon T and budget B are known, the total t-trail payoff is defined as

in the learning process, and the total optimal payoff from the algorithm is defined as

the target is to maximize the total payoff during T rounds under the constraints of the budget and time-horizon, which can be formalized as:

The regret of the algorithm is:

R (T, B) =U* (T, B) -U (T, B)

Because the budget and time can grow to infinite with the proportion of ρ=B/T, linear programming can be used to solve this problem.

In some embodiments with respect to ALP with expected reward, to formulate the problem that converts hard budget constraint to an average budget constraint, linear programming (LP) is proposed. When fixing the average budget constraint as ρ = B/T, the LP function provides the policy of whether to choose or skip an action provided by the MAB policy. Further, considering the remaining budget is changing during the learning process, b denotes the remaining budget, and τ denotes the remaining time-horizon in round t. The average budget constraint can be replaced as

ALP is an advanced linear programming with this dynamic average budget constraint. In some embodiments,

denotes the spatial-temporal class set learned by GMM.

In some embodiments with respect to regret analysis,

denotes the cost for taking the action a _t in round t. The cost may be regarded as a unit-cost, such that if an action is not dummy (a _t=0) , then the

The quality of a _t can be captured as

which is the expected reward provided by the bandit algorithm before making a decision in round t.

is the best action in a decision round which has been decided by MAB algorithm, and

is the expected reward of the best arm

The MAB algorithm makes a decision according to the expected reward of every arm, and thus

To simplify the matter, it can be assumed that

The original intention of using linear programming is to decide whether the system in a current round should retain the choice under the budget constraint.

In some embodiments, p _j∈ [0, 1]is the probability that the ALP selects the current action provided by the MAB algorithm, and the probability vector is denoted as

For a given budget B and a time-horizon T (T may represent the remaining time) , the ALP problem is considered as:

p _j (ρ) denotes the solution of equations (1) and (2) , and v (ρ) denotes the maximum expected reward in a single round with advanced average budget.

In some embodiments, linear contextual bandits with uniform allocation are described. The disclosed bandits algorithm with budget constraints may consider situations that people are unwilling to spend all their budgets too early because the online environment is dynamic. Traditional budget constrain bandit algorithms always choose the action in greedy policy, which may cause the budget to be allocated too early (e.g., when implementing the trained policy in an online environment, the budget will quickly run out) .

In a dynamic system, people’s behavior is also affected by the strategy produced by the algorithm. That means, after using all of the budget, the algorithm has no chance for exploration and cannot adapt to the changing environment. Thus, spending all of the budget too early will be a disaster in the online system. In light of that, an algorithm is disclosed to find how to allocate the budget reasonably in the spatial-temporal dimension.

The request receiving component 113 may be configured to receive, from a user device associated with a user accessing an online platform, a request for a ride-hailing service. In some embodiments, the online platform may be associated with the ride-hailing service. For example, the user may access the online platform through a mobile device. The user may enter a destination, and request a ride to that destination. The ride-hailing service may obtain the location of the mobile device, and determine a route from the location of the mobile device to the destination. In some embodiments, the ride-hailing service may dispatch a driver to the location of the requesting user.

The budget component 114 may be configured to obtain a time, a location, and a promotion budget associated with the request. In some embodiments, the time may include a day of the week, a month, a time of day, and peak hours. In some embodiments, the location may include geographical location. For example, the location may include which area of a grid the user is positioned. Different promotional budgets may be provided to different grids. The location may affect the availability of drivers, the completion rate of orders, and supply and demand. In some embodiments, a promotion budget may include a remaining budget amount with respect to the time within a period and with respect to the location. In some embodiments, obtaining the time, the location, and the promotion budget associated with the request may include obtaining the time, the location, the promotion budget, background user information, real-time user information, and a weather associated with the request.

The discount determination component 115 may be configured to input the obtained time, location, and promotion budget to the trained model to determine a price discount for the request. In some embodiments, a model may include a reinforcement learning algorithm based on an action of discount allocation and a policy of maximizing a total number of orders completed through the online platform within a period. The discount allocation may be subject to a fixed budget ceiling for the period.

In some embodiments, inputting the obtained time, location, and promotion budget to the trained model to determine the price discount may include inputting the obtained time, location, promotion budget, background user information, real-time user information, and weather to the trained model to determine the price discount. For example, the amount of the price discount may be decided by the model based on the likelihood that the user will accept the discount. In some embodiments, the price discount may include no discount or a nonzero discount. For example, the price discount may include five dollars ($5) off the ride or ten percent (10%) off the ride. In another example, when there is no discount, the price discount may be determined to be zero (i.e., 0) .

The discount transmitting component 116 may be configured to transmit the determined price discount to the user device to notify the user. For example, a coupon may be sent to the user devices 140 in response to requesting the ride-hailing service. A full cost price may be displayed to the user in addition to the price discount.

FIG. 2 illustrates an exemplary algorithm for allocating a budget with an empirical spatial-temporal distribution. In some embodiments with respect to allocating budget with empirical spatial-temporal distribution, in the setting, the context of passenger is divided into two parts. The first part of the context includes the user bubble time and geographic latitude and longitude features. These three features indicate the spatial-temporal categories, and N denotes the number of spatial-temporal categories. The spatial-temporal categories may be clustered by an Expectation-Maximization-based cluster or a GMM. GMM may be used in the following embodiments. To that end, the spatial-temporal context can be changed into a finite context space. It may be assumed that people’s bubbling behavior is independent and obeys the spatial-temporal distribution

where every distribution in

is a normal distribution and

is learned by GMM.

may be a gaussian mixture distribution. The history bubble data can be used to train the GMM and get the probability of every spatial-temporal occurrence distribution

by empirical estimation. For a given budget B, the budget is allocated to different spatial-temporal distributions b _n=Bg _n.

As shown in FIG. 2, Algorithm 1 shows when bubble context x is coming in round t, it is first decided which distribution in

the bubble context x belongs to by using GMM prediction

Then, the remaining budget of this spatial-temporal

is checked. If the remaining budget is bigger than zero, then the recommended action to this bubble user is predicted by linear contextual bandit. The action is executed in the real environment, and whether this bubble user chooses the recommended service is received as reward

Then, the parameters in LinUCB are updated.

In some embodiments with respect to linear contextual bandits with ALP, because human behavior has a degree of randomness, budget allocation via a spatial-temporal empirical distribution may lead to a waste of budget. And for an infinite time-horizon, the learning process is not continuous. In other words, for passengers later in the day, the remaining budget of the platform may be insufficient to issue coupons.

In some embodiments, ALP may be used to soften the learning process. Here, the spatiotemporal features may be treated as a finite context by using GMM estimation. Other features represent the user information and some environment information, it becomes an infinite context set. Different from the spatiotemporal features, the information cannot be clustered, because clustering will lose information. And if a huge number of categories are set to cluster to reduce the information loss, the distribution of the user context classes is difficult to evaluate because the empirical evaluation of the distribution needs a large amount of history data. Since the human behaviors slowly change, not all the data can be used to estimate the human distribution. Only using the spatiotemporal features to execute the ALP-UCB algorithm is a feasible option, but may lose some personalized information.

In some embodiments, the disclosed algorithm combines the advantage of ALP-UCB and LinUCB. The first part of the features of history data may be used to evaluate the spatiotemporal distribution

which indicates the bubbling probability in different time and spatial location. Instead of allocating the budget in a fix distribution, the allocation strategy will change with the remaining budget and time-horizon. A linear payoff function like LinUCB is assumed as

By calculating the reward expectation of each arm, the reward estimation score is as follows:

Then, the best a _t with the highest reward estimation is chosen

c _j (t) is the number of times that the spatiotemporal context j occurred, and s _j, k (t) =s _j, k (t-1) +p _t, k is the number of total reward estimation score. When getting the reward estimation of x, the empirical reward of spatiotemporal context j can be set as

In some embodiments, the evaluation of the disclosed methods is different with supervised learning because online learning in a real application platform is costly and will make online system unstable. Thus, an offline environment may be used to help train the algorithm. In one embodiment, with historical bandit data, the simulator is first trained to simulate the online environment. With respect to evaluating the LinUCB-ALP scheme and the baseline comparison, the experimental setup is as follows.

Data collection. In one example, a history dataset including user bubble data within a geographical location for a certain period may be collected. The user data may be collected from online ride-hailing histories from a ride-hailing platform. Each piece of user data may comprise a bubble time, a bubble spatial-temporal location, and bubble user information. Also, each piece of data has a bandit feedback, which includes the action and the send feature (i.e., send feature is used as the feedback in the disclosed system) . Each dataset may be chronologically ordered by bubble time. The first 3/4 of the datasets may be used for training the offline simulator, and the rest 1/4 are provided to perform three experiments: (1) pre-training the bandit algorithm (some baseline may not to be pre-trained) , (2) learning the bandit algorithm, and (3) simulating the online test setting.

Environment Setting. The problem of activity recommendation under a budget limitation may be implemented in a real-world application of ride-hailing platform, and the data may be collected from a real online environment. For a ride-recommendation activity, considering the user’s personality, contextual MAB may be used to model the recommendation process defined as M= (S, A, r) , and it is a special case of Markov decision process (MDP) , the elements of which are defined as follows. Agent: the ride-hailing platform is set as the agent for recommendation problem. The agent is to be trained to know how to make decisions for different users according to their personalized features. State: the state of the agent is the set of bubble user features on ride-hailing platform, and stands for the personalized information of the bubble user and the information of the environment (e.g., weather conditions) . Action: an action is that the platform recommends activity to each passenger under a constraint limitation, subject to the condition that the cost of actions in each turn cannot exceed the budget limit. Reward: the optimal target of the agent is to maximize the cumulative reward from the start to round t. The agent has a reward expectation function and makes a decision according to the expected reward. Then, the environment will return to the agent a real reward for executing the best action. Simulator: to replace the online environment in the early time, an offline simulator may be used. The simulator may be trained by supervised learning with a large amount of history data. For context x and policy a, the simulator gives the reward r=s _a (x, a) , and r can be treated as the environment feedback.

In some embodiments, with the history logging data, supervised learning is used to train the simulator S (x, a) = {s ₁ (x) , s ₂ (x) ....... s _k (x) } , A= {1, 2, ..... k} . k models are learned that each kind of action maintains a model (e.g., the kth action maintains the simulator model s _k (x) ) . To avoid the bias of simulator, the history data to train the simulators needs to be large and balanced for each action. For each model in the simulator, xgboost may be used as the classification machine. History reward may be inputted as binary label to each classification machine. Almost 3/4 history data may be used to train simulators, and the data is re-sampled according to different labels (rewards) to achieve balance learning.

In some embodiments, after training the simulator S, the Matthews correlation coefficient (MCC) is checked from each simulator model with other actions’ dataset (e.g., after learning s ₁, the dataset in which history except action 1 is used as the test set X _A- {1} , and the true label set in history is set as R _A- {1} as of s ₁, then the prediction reward

is obtained, and R _A- {1} and

to are used to calculate the MCC) . The MCC may be used to find whether there is covariate shift between training set and test set

In the disclosed setting, there may be four types of recommendations, and the MCC matrix is shown in Table I.

TABLE I

MCC FOR EACH SIMUL ATOR MODEL

As shown in Table I, there is a high correlation between the training data and testing data for simulator. So the simulator can be used for evaluating the disclosed bandit algorithms.

In some embodiments, except the history policy, five extensive experiments may be conducted to evaluate the effectiveness of the disclosed method in ride-hailing recommendation.

UCB-Spent greedy: for this algorithm, a fix budget B is set in LinUCB. The agent cannot perceive the budget information until the budget has run out. It means that if the B＞0, the agent is not subject to any restrictions.

UCB-ALP: for this algorithm, three dimensions (time, latitude, longitude) are selected, and GMM is used to cluster the three dimensions as 100 spatiotemporal gaussian distributions. And the context of this experiment setting is set as finite spatiotemporal context set with 100 members. The user personalized features are dropped in this experiment setting because cluster for users is not realistic as it has been considered before.

UCB-Normal distribution: for this algorithm, the budget is allocated into different spatial-temporals as a fixed budget allocation strategy. The spatial-temporals is learned by GMM, and empirical estimation is used to get the distribution of different spatial-temporals. Algorithm 1 in FIG. 2 shows the allocation of budget B to different spatial-temporal distributions. Spatial-temporals may include historical ride hailing data.

UCB-Even distribution: for this algorithm, the time is cut by 7 peaks, and the geographic locations are mapped into a grid world. One grid is an area of the hexagon, and the radius is five kilometers. Similar to the UCB-Normal distribution, history data is used to get the 7×4147 distributions by using empirical estimation. And the budget is allocated with spatiotemporal distribution.

Pretrain (warm) LinUCB-ALP: to make the learning process more stable and robust, the parameters to initialize the LinUCB-ALP learned by LinUCB which can warm up the LinUCB-ALP learning process. The datasets for pre-training and for training LinUCB-ALP may be different.

In some embodiments with respect to performance metrics, in a real platform online environment, the online learning algorithm (e.g., bandit algorithm) cannot learn continuously, because it will make the online environment unstable. Thus, the strategy for online learning in an online environment is to fix the parameters for several days and then collect the data of these days to update the algorithm. In addition to learning on simulator, the online platform setting may be simulated by fixing the model parameters and predicting. To that end, the dataset may be divided into two parts: (1) the first part is used to learn the six algorithms on the simulators. (2) the second part is used to predict the six algorithms according to the fixed parameters learned by (1) . Two matrices (1) average reward (AR) and (2) budget using ratio (BUR) may be used.

In some embodiments, Table II and Table III summarize the results of all the compared methods with respect to two evaluation metrics. From results, the

greedy LinUCB algorithm with budget limit on both learning dataset and deployment dataset performs well in exploration. The budget does not run out quickly. The fix budget UCB-Even UCB-Normal have the close cumulative rewards, UCB algorithm with Normal distribution is slightly better than the UCB algorithm with even distribution. The UCB algorithm with Normal distribution has a more uniform budget allocation. But the allocation is also nonuniform. For UCB-ALP, the performance is lower than all the algorithms. The last two algorithms both performed well in cumulative reward and budget using ratio. And they both have a more uniform allocation. Thus, pre-training can improve the performance.

FIG. 3 illustrates a flowchart of an exemplary method 300, according to various embodiments of the present disclosure. The method 300 may be implemented in various environments including, for example, the environment 100 of FIG. 1. The method 300 may be performed by computing system 102 of FIG. 1 and computer system 500 of FIG. 5. The operations of the method 300 presented below are intended to be illustrative. Depending on the implementation, the method 300 may include additional, fewer, or alternative steps performed in various orders or in parallel. The method 300 may be implemented in various computing systems or devices including one or more processors.

At block 302, historical ride hailing data (e.g., a background user information dimension, a real-time user information dimension, a weather dimension, a spatial dimension, a temporal dimension, and a discount distribution dimension) may be obtained. At block 304, a model may be trained with the historical ride hailing data to obtain a trained model. For example, the model may include a multi-armed bandits adaptive linear programming algorithm with an infinite contextual space. At block 306, a request for a ride-hailing service may be received from a user device associated with a user accessing an online platform. At block 308, a time, a location, and a promotion budget associated with the request may be obtained. At block 310, the obtained time, location, and promotion budget may be input to the trained model to determine a price discount for the request. At block 312, the determined price discount may be transmitted to the user device to notify the user.

FIG. 4 is a block diagram that illustrates a computing device 400 upon which various of the embodiments described herein may be implemented. The computing device 400 may correspond to user devices 140 and the mobile devices of the drivers of the vehicles 150 of FIG. 1 described above. The computer device 400 includes communication platform 410 or other communication mechanism for communicating information, display 420, and graphics processing unit (GPU) 430 and central processing unit (CPU) 440 for processing information. Display 420 may provide a user interface functionality, such as a graphical user interface ( “GUI” ) . CPU 440 may be, for example, one or more general purpose microprocessors.

The computer device 400 also includes an input/output (IO) 450 and a memory 460. Memory 460 may be a random access memory (RAM) , cache and/or other dynamic storage devices, for storing information, including operating system (OS) 470 and applications 480. A storage 490, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive) , etc., is provided for storing information and instructions. Information may be read into memory 460 from another storage medium, such as storage device 490. Execution of the sequences of instructions contained in memory 460 causes CPU 440 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

FIG. 5 is a block diagram that illustrates a computer system 500 upon which any of the embodiments described herein may be implemented. The system 500 may correspond to the system 102 of FIG. 1 described above. The computer system 500 includes a bus 502 or other communication mechanism for communicating information, one or more hardware processors 504 coupled with bus 502 for processing information. Hardware processor (s) 504 may be, for example, one or more general purpose microprocessors. The processor (s) 504 may correspond to the processor 104 described above.

The computer system 500 also includes a main memory 506, such as a random access memory (RAM) , cache and/or other dynamic storage devices, coupled to bus 502 for storing information and instructions to be executed by processor (s) 504. Main memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor (s) 504. Such instructions, when stored in storage media accessible to processor (s) 504, render computer system 500 into a special-purpose machine that is customized to perform the operations specified in the instructions. The computer system 500 further includes a read only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor (s) 504. A storage device 510, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive) , etc., is provided and coupled to bus 502 for storing information and instructions.

The computer system 500 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 500 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 500 in response to processor (s) 504 executing one or more sequences of one or more instructions contained in main memory 506. Such instructions may be read into main memory 506 from another storage medium, such as storage device 510. Execution of the sequences of instructions contained in main memory 506 causes processor (s) 504 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

The main memory 506, the ROM 508, and/or the storage 510 may include non-transitory storage media. The term “non-transitory media, ” and similar terms, as used herein refers to media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 510. Volatile media includes dynamic memory, such as main memory 506. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.

The computer system 500 also includes a network/communication interface 518 coupled to bus 502. Communication interface 518 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks. The communication interface 518 may be implemented as one or more network ports. For example, communication interface 518 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 518 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or WAN component to communicated with a WAN) . Wireless links may also be implemented. In any such implementation, communication interface 518 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

The computer system 500 can send messages and receive data, including program code, through the network (s) , network link and communication interface 518. In the Internet example, a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 518.

The received code may be executed by processor (s) 504 as it is received, and/or stored in storage device 510, or other non-volatile storage for later execution.

Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code modules executed by one or more computer systems or computer processors comprising computer hardware. The processes and algorithms may be implemented partially or wholly in application-specific circuitry.

The various features and processes described above may be used independently of one another, or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of this disclosure. In addition, certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate. For example, described blocks or states may be performed in an order other than that specifically disclosed, or multiple blocks or states may be combined in a single block or state. The exemplary blocks or states may be performed in serial, in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed exemplary embodiments. The exemplary systems and components described herein may be configured differently than described. For example, elements may be added to, removed from, or rearranged compared to the disclosed exemplary embodiments.

Moreover, while the system and method in the present disclosure is described primarily in regard to ride-hailing recommendation, it should also be understood that the present disclosure is not intended to be limiting. The system or method of the present disclosure may be applied to any other kind of services. For example, the system or method of the present disclosure may be applied to transportation systems of different environments including land, ocean, aerospace, or the like, or any combination thereof. The vehicle of the transportation systems may include a taxi, a private car, a carpool, a bus, a train, a bullet train, a high speed rail, a subway, a vessel, an aircraft, a spaceship, a hot-air balloon, a driverless vehicle, or the like, or any combination thereof. The transportation system may also include any transportation system for management and/or distribution, for example, a system for sending and/or receiving an express. The application of the system or method of the present disclosure may be implemented on a user device and include a webpage, a plug-in of a browser, a client terminal, a custom system, an internal analysis system, an artificial intelligence robot, or the like, or any combination thereof.

The various operations of exemplary methods described herein may be performed, at least partially, by an algorithm. The algorithm may be comprised in program codes or instructions stored in a memory (e.g., a non-transitory computer-readable storage medium described above) . Such algorithm may comprise a machine learning algorithm. In some embodiments, a machine learning algorithm may not explicitly program computers to perform a function, but can learn from training data to make a predictions model that performs the function.

The various operations of exemplary methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented engines that operate to perform one or more operations or functions described herein.

Similarly, the methods described herein may be at least partially processor-implemented, with a particular processor or processors being an example of hardware. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented engines. Moreover, the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS) . For example, at least some of the operations may be performed by a group of computers (as examples of machines including processors) , with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an Application Program Interface (API) ) .

The performance of certain of the operations may be distributed among the processors, not only residing within a single machine, but deployed across a number of machines. In some exemplary embodiments, the processors or processor-implemented engines may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm) . In other exemplary embodiments, the processors or processor-implemented engines may be distributed across a number of geographic locations.

Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.

Although an overview of the subject matter has been described with reference to specific exemplary embodiments, various modifications and changes may be made to these embodiments without departing from the broader scope of embodiments of the present disclosure. Such embodiments of the subject matter may be referred to herein, individually or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single disclosure or concept if more than one is, in fact, disclosed.

The embodiments illustrated herein are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed. Other embodiments may be used and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. The Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.

Any process descriptions, elements, or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process. Alternate implementations are included within the scope of the embodiments described herein in which elements or functions may be deleted, executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.

As used herein, the term “or” may be construed in either an inclusive or exclusive sense. Moreover, plural instances may be provided for resources, operations, or structures described herein as a single instance. Additionally, boundaries between various resources, operations, engines, and data stores are somewhat arbitrary, and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present disclosure. In general, structures and functionality presented as separate resources in the exemplary configurations may be implemented as a combined structure or resource. Similarly, structures and functionality presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within a scope of embodiments of the present disclosure as represented by the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Conditional language, such as, among others, “can, ” “could, ” “might, ” or “may, ” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.

The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms “a, ” “an, ” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprise, ” “comprises, ” and/or “comprising, ” “include, ” “includes, ” and/or “including, ” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The term "passenger, " "requester, " "service requester, " "customer" and "user" in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may request or order a service. Also, the term "driver, " "provider, " and "service provider" in the present disclosure are used interchangeably to refer to an individual, an entity, or a tool that may provide a service or facilitate the providing of the service.

The term "service request, " "request for a service, " "requests, " and "order" in the present disclosure are used interchangeably to refer to a request that may be initiated by a passenger, a service requester, a customer, a driver, a provider, a service provider, or the like, or any combination thereof. The service request may be accepted by any one of a passenger, a service requester, a customer, a driver, a provider, or a service provider.

The terms “user device, ” "service provider terminal, ” “provider terminal, ” and " driver terminal" in the present disclosure are used interchangeably to refer to a computing device (e.g., mobile terminal) that is used by a service provider to provide a service or facilitate the providing of the service. The term "service requester terminal, " “requester terminal, ” and "passenger terminal" in the present disclosure are used interchangeably to refer to a mobile terminal that is used by a service requester to request or order a service. The term “distance” between two locations may refer to a linear distance between the two locations and/or or a route distance along a route between the two location.

Claims

A recommendation method, comprising:

obtaining historical ride hailing data;

training a model with the historical ride hailing data to obtain a trained model;

receiving, from a user device associated with a user accessing an online platform, a request for a ride-hailing service;

obtaining a time, a location, and a promotion budget associated with the request;

inputting the obtained time, location, and promotion budget to the trained model to determine a price discount for the request; and

transmitting the determined price discount to the user device to notify the user.
The method of claim 1, wherein:

the promotion budget comprises a remaining budget amount with respect to the time within a period and with respect to the location.
The method of any of claims 1-2, wherein:

the model comprises a reinforcement learning algorithm based on an action of discount allocation and a policy of maximizing a total number of orders completed through the online platform within a period; and

the discount allocation is subject to a fixed budget ceiling for the period.
The method of any of claims 1-3, wherein the trained model comprises a LinUCB-Adaptive-Linear-Programming (LinUCB-ALP) algorithm.
The method of any of claims 1-4, wherein training the model with the historical ride hailing data to obtain the trained model comprises:

pre-training the model with a portion of the historical ride hailing data to obtain a pre-trained model; and

training the pre-trained model with another portion of the historical ride hailing data to obtain the trained model.
The method of claim 5, wherein the pre-trained model comprises a Lin-upper-confidence-bound (LinUCB) algorithm.
The method of any of claims 1-6, wherein:

the model comprises a multi-armed bandits adaptive linear programming algorithm with an infinite contextual space; and

training the model with the historical ride hailing data to obtain the trained model comprises learning the infinite contextual space through a Gaussian Mixture Model.
The method of any of claims 1-7, wherein the historical ride hailing data comprises one or more of the following dimensions:

a background user information dimension, a real-time user information dimension, a weather dimension, a spatial dimension, a temporal dimension, and a discount distribution dimension.
The method of claim 8, wherein:

obtaining the time, the location, and the promotion budget associated with the request comprises: obtaining the time, the location, the promotion budget, background user information, real-time user information, and a weather associated with the request; and

inputting the obtained time, location, and promotion budget to the trained model to determine the price discount comprises: inputting the obtained time, location, promotion budget, background user information, real-time user information, and weather to the trained model to determine the price discount.
The method of claim 8, wherein the background user information dimension comprises one or more of the following attributes:

gender, registration date, registration location, application login history, and ride-hailing order history.
The method of claim 8, wherein the real-time user information dimension comprises one or more of the following attributes:

number of recently completed orders, distance travelled for a recent order, time of a recent order, and price paid for a recent order.
The method of claim 8, wherein the weather dimension comprises one or more of the following attributes:

humidity, precipitation, wind, UV metric, air pollution metric, and weather condition.
The method of claim 8, wherein:

the historical ride hailing data corresponds to a geographical area mapped into a plurality of grids; and

for each piece of the historical ride hailing data associated with an order in one of the plurality of grids, the spatial dimension comprises one or more of the following attributes:

grid index of the one grid, number of vehicles in the one grid, number of ride- hailing orders accepted in the one grid, number of ride-hailing orders completed in the one grid, and waiting time for ride-hailing orders requested in the one grid.
The method of claim 8, wherein for each piece of the historical ride hailing data requesting a historical order, the temporal dimension comprises one or more of the following attributes:

month, day-of-the-week, time-in-the-day, and peak-or-off-peak-hour.
The method of claim 8, wherein for each piece of the historical ride hailing data completing a historical order, the discount distribution dimension comprises one or more of the following attributes:

whether discount was offered, discount type offered, offered price discount, and whether discount was used.
The method of any of claims 1-15, wherein the price discount comprises: no discount or a nonzero discount.
The method of any of claims 1-16, wherein the trained model defines a plurality of arms, and wherein training the model comprises:

evaluating, based on the historical ride hailing data, a spatiotemporal distribution;

defining a linear payoff function;

determining a reward expectation of each arm of the plurality of arms;

selecting, among the plurality of arms, an arm with the highest reward expectation;

training, based on the spatiotemporal distribution, the linear payoff function, and the selected arm, the trained model.
A recommendation system, comprising:

one or more processors; and

one or more computer-readable memories coupled to the one or more processors and having instructions stored thereon that are executable by the one or more processors to perform the method of any of claims 1 to 17.
A recommendation apparatus comprising a plurality of modules for performing the method of any of claims 1 to 17.
A non-transitory computer-readable storage medium configured with instructions executable by one or more processors to cause the one or more processors to perform the method of any of claims 1 to 17.