CN114048978A

CN114048978A - Supply and demand scheduling strategy fusion application based on machine learning model

Info

Publication number: CN114048978A
Application number: CN202111266699.2A
Authority: CN
Inventors: 薛鹏; 于红建; 余进
Original assignee: Beijing Shansong Technology Co ltd
Current assignee: Beijing Shansong Technology Co ltd
Priority date: 2021-10-27
Filing date: 2021-10-27
Publication date: 2022-02-15

Abstract

The invention relates to a supply and demand scheduling strategy fusion application based on a machine learning model. The method incorporates a logistic regression model to predict the probability that a given transport capacity (rider) accepts a given order and, within the business flow, computes the best pairing of orders and transport capacity through three steps: capacity recall, filtering, and ranking, thereby optimizing the platform's order dispatch efficiency. In the acceptance-rate prediction process, the data first undergo basic preprocessing such as normalization and null-value handling according to a standard flow; a trained model then predicts in real time the acceptance rate between the current order and each candidate transport capacity. This greatly improves the platform's dispatch efficiency and order acceptance rate and ensures that order-to-capacity matching is optimized in both space and time.

Description

Supply and demand scheduling strategy fusion application based on machine learning model

Technical Field

The invention relates to a supply and demand scheduling strategy fusion application based on a machine learning model, and belongs to the technical field of order dispatch optimization and intelligent scheduling.

Background

Instant delivery is a rapid delivery service with a delivery time of less than 1 hour and an average delivery time of about 30 minutes. This speed merges traditional online e-commerce transactions and offline logistics delivery (two businesses that were traditionally clearly separated) into a unified whole, forming a three-way relationship among the user, the rider, and the platform. As the distributed system architecture for instant logistics has evolved layer by layer, it has met the following technical obstacles and challenges: the large scale of orders and riders, and the very large-scale computation required for supply and demand matching. On holidays or in severe weather, orders surge and peak traffic reaches dozens of times the usual level. Logistics fulfillment is the central scheduling hub that connects online and offline; it is embodied in the order dispatch system, which computes one optimal solution or a batch of them from a series of factors and dispatches orders directly. A further challenge for the delivery system is balancing identification accuracy against cost. The accuracy requirement is high because identification results directly affect the pricing, scheduling, and liability-determination systems, and inaccurate underlying data cause serious problems.

The key to efficient matching is on-demand allocation: identifying the user's exact needs and matching them to the best fit among many resources. To match efficiently, the platform accumulates from daily orders a large amount of information about drivers and users, including their routes, behavioral habits, and special needs, together with knowledge of traffic conditions across the city. This makes it possible to predict demand ahead of time and to keep supply matched to the expected demand, so that idle resources are put to use in the best way.

What the scheduling platform really needs to solve is how to improve matching efficiency. In the early stage a platform may rely on subsidies and on-the-ground promotion to capture the market; in the later stage, improving matching efficiency matters most, because only by matching suitable travel resources can customer demand be satisfied to the greatest extent. Similarly, in the intelligent scheduling of Ant Financial's customer service, user expectations can be met to the greatest extent only by matching user requirements accurately and ensuring that the corresponding resources are available.

Geographic information is updated in real time (a request is initiated within 5 seconds) to describe the state of all resources, so that an order is pushed according to resource conditions as soon as the user submits it. Based on statistics over historical data combined with real-time order data, the distribution of order-dense areas across the city is provided, giving riders a useful reference for their order-listening settings, raising the probability of receiving orders, and reducing riders' idle time. Based on supply and demand prediction results, all available transport capacity in the city is mobilized in an orderly way on a large scale to achieve optimal resource allocation. The order-acceptance probability model is learned from historical rider and user data to improve the match between riders and users, and the scale of the transport capacity is used to optimize overall transport efficiency and the passenger travel experience globally and in real time. Fault tolerance is extremely low: the system cannot go down and cannot lose data, and the availability requirement is extremely high. The data must be real-time and accurate, and the system is very sensitive to delay and anomalies.

Disclosure of Invention

The technical problem to be solved by the invention is how to implement a supply and demand scheduling strategy fusion application based on a machine learning model.

In order to achieve the above object, the present invention provides a method for fusing and applying a supply and demand scheduling policy based on a machine learning model, comprising the following steps:

A: determining the statistical caliber (the definitions and periods used for counting) of the business features, and collecting the related data;

B: determining a feature selection scheme and screening the features that perform well;

C: establishing a feature engineering flow that converts the data into a form the algorithm can consume;

D: comparing the effects of the algorithms on offline data;

E: evaluating the online gray-release effect of the algorithms, and selecting the best-performing algorithm for formal launch.

Preferably, the steps for generating the supply and demand scheduling strategy fusion application scheme based on the machine learning model specifically include:

1. The features are divided into five major types: user, transport capacity, order, city, and weather. Each type can be further subdivided according to its properties and attributes; the final number of related features exceeds 100. The existing features are sorted out in light of the business, and the statistical period and statistical caliber of each feature are determined.

2. Related data are prepared based on the features determined in the previous step and their statistical caliber; the sample data are randomly sampled and 100000 records are selected.

3. The usefulness of the features is evaluated with several methods (such as the Pearson coefficient, the chi-square test, and decision tree algorithms); features highly correlated with the target result are retained and redundant features with high mutual similarity are removed.

4. To reduce the effect of missing data on model accuracy, missing values are handled by methods such as mode filling, mean filling, median filling, KNN clustering filling, fixed-value filling, context filling, and direct elimination. The filling method is chosen according to the business scenario and the requirements of the algorithm.

5. Outliers are removed using experience, the box-plot outlier method, and the 3σ rule.

6. Discrete data are processed with label encoding, one-hot encoding, or embedding methods.

7. Continuous data are binned.

8. The data are standardized/normalized.

9. The samples are randomly divided into a training set, a validation set, and a test set in a ratio of 6:2:2.

10. The sample data are labeled using business data; the label is whether the order was accepted by the transport capacity after it was dispatched.

11. A list of candidate algorithms is determined, mainly including ridge regression, Lasso, LR, FM, SVM, Bayesian classifier, AdaBoost, LightGBM, and so on.

12. Metrics for judging the algorithms are determined, the algorithms are ranked by these metrics, and the top five are selected for online testing.

13. The online effect of the algorithms is observed for 2 weeks; the observation metric is the overall online order acceptance rate, and the model is selected according to the 2-week average acceptance rate.
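Steps 12 and 13 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the choice of AUC as the offline ranking metric, and all numeric values are hypothetical, and the 2-week average is taken here as the mean of daily acceptance rates, which is one possible reading of the text.

```python
from statistics import mean

def top_k_offline(offline_scores, k=5):
    """Rank candidate algorithms by an offline metric and keep the best k."""
    return sorted(offline_scores, key=offline_scores.get, reverse=True)[:k]

def select_by_online_rate(daily_rates, window_days=14):
    """Return (best model, per-model averages) over the observation window."""
    averages = {}
    for name, rates in daily_rates.items():
        if len(rates) < window_days:
            raise ValueError(f"{name}: fewer than {window_days} days observed")
        averages[name] = mean(rates[-window_days:])
    return max(averages, key=averages.get), averages

# Hypothetical offline AUC values for the candidate list.
offline_auc = {"LR": 0.81, "FM": 0.80, "lightgbm": 0.83, "svm": 0.74,
               "Adaboost": 0.78, "Lasso": 0.70}
shortlist = top_k_offline(offline_auc)

# Hypothetical daily online acceptance rates during gray release.
online = {"LR": [0.80] * 7 + [0.84] * 7, "lightgbm": [0.79] * 14}
best_model, averages = select_by_online_rate(online)
```

Averaging daily rates weights every day equally; if daily order volumes differ widely, dividing total accepted orders by total dispatched orders over the 2 weeks may be the more faithful measure.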

Drawings

FIG. 1: schematic flow diagram of the principle of the supply and demand scheduling strategy fusion application of the invention.

FIG. 2: schematic diagram of order dispatch in an application example of the supply and demand scheduling strategy fusion application of the invention.

Detailed Description

The following detailed description of the present invention will be made with reference to the accompanying drawings and examples, which are provided for illustration of the present invention and are not intended to limit the scope of the present invention.

As shown in FIG. 1, the method comprises: A: determining the statistical caliber of the business features and collecting the related data; B: determining a feature selection scheme and screening the features that perform well; C: establishing a feature engineering flow that converts the data into a form the algorithm can consume; D: comparing the effects of the algorithms on offline data; E: evaluating the online gray-release effect of the algorithms and selecting the best-performing algorithm for formal launch. The supply and demand scheduling strategy fusion application is explained in detail below using an LR (logistic regression) model as an example.

First, the steps for determining the business features and their statistical caliber are introduced. All features that may affect the order acceptance rate are listed based on business experience and brainstorming, and the statistical period and statistical formula of every feature are determined. The label is whether each historical order was accepted by the relevant rider. All feature data and label data are computed in Hive, and 100000 records are randomly sampled as input data.
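The labeling and sampling described above can be sketched in a few lines. The text computes features and labels in Hive; the in-memory Python version below only illustrates the logic, and the record fields (`order_id`, `accepted`) and function names are hypothetical.

```python
import random

def label_records(dispatch_records):
    """Attach label 1 if the dispatched order was accepted, else 0."""
    return [{**rec, "label": 1 if rec["accepted"] else 0}
            for rec in dispatch_records]

def random_sample(rows, n, seed=42):
    """Sample n rows without replacement (all rows if fewer than n exist)."""
    if n >= len(rows):
        return list(rows)
    return random.Random(seed).sample(rows, n)

# Hypothetical dispatch history: one record per (order, rider) dispatch.
history = [{"order_id": i, "accepted": i % 3 != 0} for i in range(1000)]
sample = random_sample(label_records(history), 100)
```

A fixed seed keeps the sample reproducible, which makes later offline comparisons between algorithms easier to repeat.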

Second, the feature selection scheme and the criterion for screening well-performing features are introduced. Currently popular feature selection methods include the Pearson coefficient, the chi-square test, and decision tree algorithms; they differ in applicable scenarios, implementation cost, and final effect. After weighing these, the Pearson coefficient was chosen as the screening method. To eliminate multicollinearity among features, highly similar features are deleted. The Pearson coefficient threshold for high similarity is 0.5: features whose coefficient exceeds 0.5 are considered highly correlated, and from each group of highly correlated features the one with the highest correlation coefficient with the target label is kept as a training feature.
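A minimal sketch of this Pearson-based screening follows. It uses a greedy pass as one way to realize "keep the feature most correlated with the label from each highly correlated group": features are visited in descending order of absolute correlation with the label, and a feature is kept only if its absolute correlation with every feature already kept is at most 0.5. The greedy grouping, the use of absolute values, and all feature names are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(vx * vy)

def select_features(columns, label, threshold=0.5):
    """Greedy filter: strongest label correlation first, drop near-duplicates."""
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], label)),
                    reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept

# Hypothetical features: two near-duplicate distance columns and a rain flag.
columns = {
    "distance_km": [1, 2, 3, 4, 5, 6],
    "distance_m": [1000, 2000, 3000, 4000, 5000, 6000],
    "rain": [0, 1, 0, 1, 0, 1],
}
label = [1, 1, 1, 0, 0, 0]
kept = select_features(columns, label)
```

Here only one of the two distance columns survives, because they are perfectly correlated with each other, while the rain flag is kept because its correlation with distance is below the threshold.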

Third, the processing flow of feature engineering is introduced. The most basic part of feature engineering is handling outliers and null values. Outliers are generally handled aggressively: data that meet the outlier criterion are deleted directly. Methods for judging whether data are outliers include the box-plot outlier method and the 3σ outlier method. The 3σ method is preferred for data that fit a normal distribution; otherwise the box-plot method is used. For null values, both their degree of influence on the training result and the proportion of missing data must be assessed. When the influence is high and the proportion of missing data is high, the feature must be deleted; otherwise the data must be filled in by a suitable method. The available filling methods include mode filling, mean filling, median filling, KNN cluster filling, interpolation, fixed-value filling, and context filling.
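The outlier and null-value handling can be sketched as below. The method named for normally distributed data is garbled in the text; the sketch assumes it is the 3σ rule. The 1.5×IQR whisker, the function names, and the sample values are likewise illustrative assumptions, and only the simpler filling strategies (mean, median, mode, fixed value) are shown.

```python
from statistics import mean, median, mode, pstdev, quantiles

def three_sigma_bounds(values, k=3.0):
    """Bounds mean ± k·std, intended for roughly normal data."""
    mu, sigma = mean(values), pstdev(values)
    return mu - k * sigma, mu + k * sigma

def boxplot_bounds(values, whisker=1.5):
    """Bounds Q1 - whisker·IQR and Q3 + whisker·IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr

def drop_outliers(values, bounds):
    """Delete every value outside the given (low, high) bounds."""
    low, high = bounds
    return [v for v in values if low <= v <= high]

def fill_missing(values, strategy="median", fixed=0.0):
    """Replace None entries using the chosen filling strategy."""
    present = [v for v in values if v is not None]
    fillers = {"mean": mean, "median": median, "mode": mode}
    if strategy == "fixed":
        fill = fixed
    elif strategy in fillers:
        fill = fillers[strategy](present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical delivery durations in minutes with one obvious outlier.
durations = [10, 12, 11, 13, 12, 11, 300]
cleaned = drop_outliers(durations, boxplot_bounds(durations))
filled = fill_missing([1, None, 3], strategy="mean")
```

On very small samples a single extreme value inflates the standard deviation enough that the 3σ bounds may not exclude it, which is one practical reason to fall back on the box-plot method when normality is doubtful.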

To ensure that the model can consume the input data and that model computation is accurate and efficient, discrete data must be processed; reference methods are label encoding, one-hot encoding, and embedding. Continuous data must also be binned; reference methods include equal-width binning, equal-frequency binning, and WOE encoding.
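The encoding and binning methods named above can be illustrated as follows. This is a sketch with hypothetical function names and sample values; embedding and WOE encoding are omitted because they depend on a trained model or on label statistics.

```python
def label_encode(values):
    """Map each distinct category to an integer (sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot(values):
    """Return one 0/1 vector per value plus the category order used."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # guard against a constant column
    return [min(int((v - low) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds about the same number of rows."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical weather categories and delivery distances in km.
codes, mapping = label_encode(["rain", "sunny", "rain"])
vectors, categories = one_hot(["rain", "sunny", "rain"])
width_bins = equal_width_bins([0, 5, 10, 100], n_bins=4)
freq_bins = equal_frequency_bins([0, 5, 10, 100], n_bins=2)
```

The example shows why the two binning schemes differ on skewed data: equal-width binning puts the three small distances in one bin and leaves two bins empty, while equal-frequency binning splits the rows evenly. Note that this rank-based version may place tied values in different bins.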

Because the LR model is sensitive to differences in feature scale, each feature must be normalized to eliminate the effect of differing scales on the model's efficiency and accuracy. Normalization methods include standardization and min-max normalization.
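The two scaling methods can be written out directly. The sketch below assumes "standardization" means the usual z-score (subtract the mean, divide by the standard deviation); the function names and sample values are illustrative.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score scaling: zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column
    return [(v - low) / (high - low) for v in values]

# Hypothetical delivery distances in km.
distances = [2, 4, 6]
z_scores = standardize(distances)
scaled = min_max(distances)
```

In practice the mean, standard deviation, minimum, and maximum are normally computed on the training set only and then reused for the validation set, the test set, and online requests, so that no information leaks from held-out data.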

The data are divided into a training set, a validation set, and a test set in a ratio of 6:2:2.
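A random three-way split can be sketched as below. The ratio 6:2:2 is an assumption about the intended proportions for the three sets; the function name and the seed are hypothetical.

```python
import random

def split_three_way(rows, ratio=(6, 2, 2), seed=0):
    """Shuffle rows and cut them into train/validation/test by the ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratio)
    n_train = len(shuffled) * ratio[0] // total
    n_val = len(shuffled) * ratio[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

train, val, test = split_three_way(range(10))
```

Integer arithmetic is used for the cut points so that the set sizes do not depend on floating-point rounding.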

The training data are input into the model for training; the evaluation metrics include accuracy, AUC, recall, and precision.
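To make the training and evaluation step concrete, the sketch below fits a logistic regression by plain batch gradient descent and computes the metrics on toy data. It is not the applicant's training code: the learning rate, epoch count, 0.5 decision threshold, single distance feature, and toy labels are all assumptions, and precision is assumed to be the fourth metric intended by the text.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_lr(X, y, lr=0.5, epochs=300):
    """Batch gradient descent on the mean log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_precision_recall(scores, labels, threshold=0.5):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    tn = sum(p == 0 and l == 0 for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (tp + tn) / len(labels), precision, recall

# Toy data: min-max scaled rider-to-order distance; nearer riders accepted.
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_lr(X, y)
scores = [predict_proba(w, b, x) for x in X]
train_auc = auc(scores, y)
```

AUC depends only on how the scores are ordered, whereas accuracy, precision, and recall depend on the chosen threshold, so the two kinds of metric can rank candidate algorithms differently.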

In conclusion, the supply and demand scheduling strategy fusion application based on a machine learning model aims to improve the match between orders and riders, raise the platform's own order dispatch efficiency, and ensure a high-quality experience for both users and riders.
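The three business-flow steps named in the abstract (capacity recall, filtering, and ranking by predicted acceptance probability) can be sketched as a small pipeline. Everything specific here is hypothetical: the `Rider` fields, the 3 km recall radius, the load limit, and the stand-in scoring function that takes the place of the trained LR model.

```python
import math
from dataclasses import dataclass

@dataclass
class Rider:
    rider_id: str
    x_km: float
    y_km: float
    online: bool
    active_orders: int

def recall(riders, order_xy, radius_km=3.0):
    """Recall: keep riders within a radius of the order's pickup point."""
    return [r for r in riders
            if math.dist((r.x_km, r.y_km), order_xy) <= radius_km]

def filter_riders(candidates, max_active_orders=3):
    """Filter: drop riders who are offline or already at full load."""
    return [r for r in candidates
            if r.online and r.active_orders < max_active_orders]

def rank(candidates, accept_prob):
    """Rank: highest predicted acceptance probability first."""
    return sorted(candidates, key=accept_prob, reverse=True)

def dispatch(riders, order_xy, accept_prob):
    """Return the best rider for the order, or None if nobody qualifies."""
    ranked = rank(filter_riders(recall(riders, order_xy)), accept_prob)
    return ranked[0] if ranked else None

riders = [
    Rider("a", 0.5, 0.0, True, 0),
    Rider("b", 1.0, 1.0, True, 3),     # recalled, then filtered (full load)
    Rider("c", 10.0, 10.0, True, 0),   # too far away to be recalled
    Rider("d", 2.0, 0.0, True, 1),
]
order_xy = (0.0, 0.0)

def toy_accept_prob(rider):
    """Stand-in for the trained model: nearer riders score higher."""
    return 1.0 / (1.0 + math.dist((rider.x_km, rider.y_km), order_xy))

chosen = dispatch(riders, order_xy, toy_accept_prob)
```

Recalling and filtering before scoring keeps the number of model evaluations per order small, which matters when peak traffic is many times the usual level.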

Claims

1. A supply and demand scheduling strategy fusion application based on a machine learning model is characterized by comprising the following steps:

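A: determining the statistical caliber of the business features, and collecting the related data;

B: determining a feature selection scheme and screening the features that perform well;

C: establishing a feature engineering flow that converts the data into a form the algorithm can consume;

D: comparing the effects of the algorithms on offline data;

E: evaluating the online gray-release effect of the algorithms, and selecting the best-performing algorithm for formal launch.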

2. The machine learning model-based supply and demand scheduling policy fusion application of claim 1, wherein: the step A specifically comprises the following steps:

A1: dividing the features, based on past business experience, into five major types: user, transport capacity, order, city, and weather; further subdividing each type according to its properties and attributes, the final number of related features exceeding 100; and sorting out the existing features in light of the business and determining the statistical period and statistical caliber of each feature.

A2: preparing related data based on the features determined in A1 and their statistical caliber, and labeling the sample data using business data, the label being whether the order was accepted by the transport capacity after it was dispatched; and at the same time randomly sampling the sample data and selecting 10000 records.

3. The machine learning model-based supply and demand scheduling policy fusion application of claim 1, wherein: the step B specifically comprises the following steps: to avoid the curse of dimensionality, the computational complexity of machine learning must be reduced while preserving the training result, which makes feature screening particularly important. During feature screening, several methods (such as the Pearson coefficient, the chi-square test, and decision tree algorithms) can be used to evaluate the usefulness of the features, so that features highly correlated with the target result are retained and redundant features with high mutual similarity are eliminated.

4. The machine learning model-based supply and demand scheduling policy fusion application of claim 1, wherein: the step C specifically comprises the following steps:

C1: to reduce the effect of missing data on model accuracy, missing values are handled by methods such as mode filling, mean filling, median filling, KNN clustering filling, fixed-value filling, context filling, and direct elimination. The filling method is chosen according to the business scenario and the requirements of the algorithm.

C2: outliers are removed using experience, the box-plot outlier method, and the 3σ rule.

C3: discrete data are processed with label encoding, one-hot encoding, or embedding methods.

C4: continuous data are binned.

C5: the data are standardized/normalized.

C6: the samples are randomly divided into a training set, a validation set, and a test set in a ratio of 6:2:2.

5. The machine learning model-based supply and demand scheduling policy fusion application of claim 1, wherein: the step D specifically comprises the following steps:

D1: determining a list of candidate algorithms, mainly including ridge regression, Lasso, LR, FM, SVM, Bayesian classifier, AdaBoost, LightGBM, and so on.

D2: determining metrics for judging the algorithms, ranking the algorithms by these metrics, and selecting the top five for online testing.

6. The machine learning model-based supply and demand scheduling policy fusion application of claim 1, wherein: the step E specifically comprises the following steps: the observation period of the online algorithms is set to 2 weeks, the observation metric is the overall online order acceptance rate, and the model is selected according to the 2-week average online order acceptance rate.