Shared bicycle parking point site selection method based on Hadoop
Technical Field
The invention relates to the technical field of computer information processing, and in particular to a Hadoop-based site selection method for shared bicycle parking points.
Background
In recent years, with social development and rising living standards, people's attitudes toward travel have changed, and low-carbon travel has become a dominant theme. The shared bicycle, a product that fuses modern technology with the public bicycle, emerged after the public bicycle and rapidly took a central position in the market. It overcomes the inherent drawbacks of public bicycles, such as borrowing and returning at fixed stations and inconvenient deposit refunds; it fits people's actual travel routes better and makes travel genuinely convenient. Because shared bicycles can be parked anywhere, a large number of users choose them for travel.
However, the rapid growth of shared bicycles has also created many problems. Because shared bicycles lack a reasonable layout and scientific planning, disorderly parking, random placement, serious vehicle damage, and local road congestion caused by bicycles that are not cleared in time seriously affect people's lives. Planning parking points reasonably has therefore become particularly important: if parking points are chosen poorly, the same problems that affected traditional public bicycles will recur and many users will give up riding. Against the background of smart-city construction and the big data era, how to lay out and plan shared bicycle parking points reasonably is of great significance.
Patent CN201710764773.0, "a method and an apparatus for determining a shared parking spot", provides a method and an apparatus for determining shared bicycle parking points. The method comprises: acquiring walking track data within a preset area; clustering the walking tracks with a preset clustering algorithm, based on the position coordinates of the track points they contain; and determining shared bicycle parking points according to how the real street paths corresponding to the walking tracks in each cluster are distributed. By using walking track data, the method identifies which real street paths carry users with riding demand and places parking points on those paths. It therefore offers guidance for determining parking points, serves more users with riding demand, and balances the allocation of shared bicycle resources. Patent 201710517669.1, "a method and apparatus for determining a shared parking spot", provides a method comprising: determining, from a preset area, the sub-areas that have the relevant function, according to information on the functional areas that have parking demand in the current time period; classifying the determined sub-areas with a preset classification algorithm; and taking the central position of each class as a shared bicycle parking point.
In determining parking points, this second method associates the time factor with the sub-areas that have parking demand, classifies the sub-areas with parking demand in the current time period, and uses the central position of each class as a parking point. This can guide shared bicycle managers in scheduling bicycles, meet user demand in the preset area during the current time period, reduce as far as possible the cases in which supply falls short of demand in a given area, and improve the user experience. However, these methods and systems only predict shared bicycle parking points. They do not make full use of the historical travel data of shared bicycles, cannot judge whether a parking point is reasonable, and do not link the proposed parking points with the areas that government departments have designated as plannable, so the resulting parking points have poor accuracy and operability.
In summary, the key to managing shared bicycles lies in matching the number of bicycles to user demand, so that shared bicycles deliver their full value and the problems they cause are resolved. At present, the low accuracy of parking point site selection leads to an unreasonable supply and demand relationship, unreasonable resource distribution, disorderly management, and similar problems.
Disclosure of Invention
The invention aims to provide a Hadoop-based site selection method for shared bicycle parking points, addressing the problem that shared bicycle management is disorderly because parking points are sited unreasonably. The invention is realized by the following technical scheme:
The Hadoop-based shared bicycle parking point site selection method mainly comprises three parts: shared bicycle demand point prediction based on a distributed clustering algorithm, a shared bicycle parking point site selection model based on multi-objective optimization, and a model solving algorithm based on the NSGA-II algorithm.
(1) Shared bicycle demand point prediction based on a distributed clustering algorithm: the accuracy of demand point prediction is critical to the site selection of shared bicycle parking points. Traditional demand prediction relies mainly on experience and small-scale data statistics and is not accurate enough, so the planning of rental points is not reasonable enough. Because GPS-equipped shared bicycles generate a large amount of real user travel data, the invention proposes a more reasonable demand point prediction model to predict shared bicycle demand points.
(2) Shared bicycle parking point site selection model based on multi-objective optimization: after the demand points have been obtained, site selection and allocation are carried out between the demand points and the plannable points, and a parking point site selection model is established with two objectives: the shortest total travel distance of users and the smallest total cost of the shared bicycle parking points.
(3) Model solving algorithm based on the NSGA-II algorithm: the model is a classical bi-objective optimization problem. Its two objective functions cannot be optimized simultaneously, so the model has multiple feasible solutions. The method solves the model on the basis of NSGA-II, a mature multi-objective evolutionary algorithm, adapts NSGA-II to the model, and makes a distributed improvement to address problems such as slow running time.
The general structure of the method is shown in fig. 1, and the specific implementation steps are as follows:
Step one, shared bicycle demand point prediction based on a distributed clustering algorithm
Since their launch, shared bicycles have generated massive amounts of user data, and the demand points of shared bicycles are predicted from these data. The travel data include the time, bicycle number, bicycle type, GPS location information, and so on. An excerpt of the data is shown in fig. 2. The bicycle data at a given moment are clustered to form several demand areas within a certain range; the cluster center of each demand area is taken as a demand point, and the number of shared bicycles in the area is taken as the demand of that point. The demand point prediction model framework provided by the invention is shown in fig. 3, and the model proceeds as follows:
1) According to the actual situation of the demand points, set the two Canopy thresholds: T1 is the maximum distance between demand points, and T2 is the maximum range of each demand point.
2) Execute the Canopy algorithm to obtain the number and the positions of the demand points.
3) Screen the generated demand points and delete isolated points with little demand to obtain a new data set.
4) Take the number of remaining demand points as the value of K and their positions as the initial cluster centers, then iterate with the K-means algorithm to obtain the final clustering result.
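The four steps above can be sketched on a single machine as follows. This is a minimal illustration, not the distributed implementation: it assumes the conventional Canopy setting in which T1 (loose threshold) is larger than T2 (tight threshold), uses plain Euclidean distance on coordinate pairs, and introduces the hypothetical names `canopy`, `seed_centers`, and `min_members` (the size below which a canopy is treated as an isolated point).

```python
import math

def canopy(points, t1, t2):
    """Return canopies as (center, members) pairs; assumes t1 > t2."""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # next unassigned point seeds a canopy
        members = [center]
        still_remaining = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)          # loosely belongs to this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly bound, may seed or join another canopy
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

def seed_centers(points, t1, t2, min_members):
    """Drop sparse canopies (isolated points); return K and the initial centers."""
    kept = [(c, m) for c, m in canopy(points, t1, t2) if len(m) >= min_members]
    centers = [c for c, _ in kept]
    return len(centers), centers
```

The returned K and centers would then be handed to K-means as its initial configuration, which is the role step 4) assigns to the Canopy output.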
Step two, shared bicycle parking point site selection model based on multi-objective optimization
Put simply, the site selection and planning problem for shared bicycle parking points is to find an optimized allocation between the demand at each demand point and the candidate planned parking points, that is, to determine how many bicycles each shared bicycle demand point allocates to each planned parking point.
The model takes the minimum total construction cost of the shared bicycle parking points and the minimum total travel distance of a user as optimization targets. The specific mathematical model is expressed as:
where:
$I$: the set of demand points {1, 2, 3, ..., i};
$J$: the set of planned parking points {1, 2, 3, ..., j};
$n_i$: the bicycle demand of demand point i;
$d_{ij}$: the distance from demand point i to candidate planned parking point j;
$x_{ij}$: the number of bicycles of demand point i allocated to candidate planned parking point j;
$c_j$: the total number of bicycles at parking point j after allocation;
$m$: the capital construction cost of each candidate planned parking point;
$c$: the basic number of bicycles planned for each candidate planned parking point; when this number is exceeded, an additional construction and management cost $Y$ is incurred;
$y_j$: whether candidate planned parking point j is established;
$a_j$: the number of bicycles at candidate planned parking point j in excess of the basic number;
Objective function (1) minimizes the total distance from the bicycles at the demand points to the candidate planned parking points; objective function (2) minimizes the total cost of the parking points. Equation (3) states that all shared bicycles at the demand points are assigned to parking points. Equation (4) calculates the number of bicycles at each parking point after allocation. Equation (5) states that if the number of bicycles at a parking point after allocation is 0, that planned parking point is not established. Equation (6) gives the number of bicycles by which a planned parking point exceeds the basic number.
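One formulation consistent with the symbol definitions and with the descriptions of equations (1) to (6) is given below. It is a reconstruction under assumptions, not necessarily the exact original form: in particular, it assumes that the additional cost $Y$ is charged per bicycle above the basic number $c$, and that $x_{ij}$ is a non-negative integer.

```latex
\begin{align}
\min Z_1 &= \sum_{i \in I} \sum_{j \in J} d_{ij}\, x_{ij} \tag{1} \\
\min Z_2 &= \sum_{j \in J} \left( m\, y_j + Y a_j \right) \tag{2} \\
\text{s.t.} \quad \sum_{j \in J} x_{ij} &= n_i, \quad \forall i \in I \tag{3} \\
c_j &= \sum_{i \in I} x_{ij}, \quad \forall j \in J \tag{4} \\
y_j &= \begin{cases} 0, & c_j = 0 \\ 1, & c_j > 0 \end{cases} \quad \forall j \in J \tag{5} \\
a_j &= \max(0,\; c_j - c), \quad \forall j \in J \tag{6} \\
x_{ij} &\in \mathbb{Z}_{\ge 0}, \quad \forall i \in I,\ j \in J
\end{align}
```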
Step three, model solving algorithm based on NSGA-II algorithm
The algorithmic solution steps for this model are as follows:
Step 1: Read the original data: the demand point set, the candidate facility point set, the bicycle demand of each demand point, the distance from each demand point to each candidate planned parking point, the capital construction cost of each candidate planned parking point, and so on.
Step 2: Encode the population individuals with matrix encoding, initialize them within the allowed range of the variables, and generate a population of N individuals.
Step 3: Calculate the two objective function values of each individual in the population, and perform fast non-dominated sorting on the individuals according to their fitness values.
Step 4: Calculate the crowding distance of each individual in the population using the crowding distance calculation method.
Step 5: Using the improved adaptive crossover and mutation operators, compute the crossover probability and mutation probability of each individual, then apply selection, crossover, and mutation to the population to generate a new offspring population.
Step 6: Following the elitist strategy, merge the parent and offspring populations into a combined population of size 2N.
Step 7: Perform fast non-dominated sorting and crowding distance calculation on the merged population, and keep the best N individuals to form the new parent population.
Step 8: Repeat step 5.
Step 9: Decide whether to perform recombination crossover according to the adaptive adjustment between feasible and infeasible solutions.
Step 10: Repeat step 6 to obtain a new generation of offspring population.
Step 11: Check whether the number of generations exceeds the maximum number of iterations or a termination condition is met. If so, end the program; otherwise, go to step 7 and continue.
The algorithm execution flow is shown in fig. 4. The optimal solution set of the model can be obtained through the algorithm, so that the selection planning can be carried out on the parking points.
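Steps 3, 4, and 7 rest on two standard NSGA-II building blocks, fast non-dominated sorting and crowding distance. The sketch below shows the textbook versions for minimization on a list of objective tuples; the function names are illustrative, and the improved operators and constraint handling of this method are not included.

```python
def dominates(a, b):
    """a dominates b: no worse in every objective, strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def fast_non_dominated_sort(objs):
    """Return a list of fronts, each a list of indices into objs."""
    n = len(objs)
    dominated = [[] for _ in range(n)]   # indices that p dominates
    count = [0] * n                      # how many individuals dominate p
    fronts = [[]]
    for p in range(n):
        for q in range(n):
            if dominates(objs[p], objs[q]):
                dominated[p].append(q)
            elif dominates(objs[q], objs[p]):
                count[p] += 1
        if count[p] == 0:
            fronts[0].append(p)
    while fronts[-1]:
        nxt = []
        for p in fronts[-1]:
            for q in dominated[p]:
                count[q] -= 1
                if count[q] == 0:
                    nxt.append(q)
        fronts.append(nxt)
    return fronts[:-1]

def crowding_distance(objs, front):
    """Crowding distance of each index in one front; boundary points get infinity."""
    dist = {i: 0.0 for i in front}
    for k in range(len(objs[0])):
        ordered = sorted(front, key=lambda i: objs[i][k])
        lo, hi = objs[ordered[0]][k], objs[ordered[-1]][k]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][k] - objs[prev][k]) / (hi - lo)
    return dist

def select_next_generation(objs, n):
    """Keep n indices: whole fronts first, then the least crowded of the last front."""
    chosen = []
    for front in fast_non_dominated_sort(objs):
        if len(chosen) + len(front) <= n:
            chosen.extend(front)
        else:
            d = crowding_distance(objs, front)
            chosen.extend(sorted(front, key=lambda i: -d[i])[: n - len(chosen)])
            break
    return chosen
```

Applied to the merged population of size 2N from step 6, `select_next_generation(objs, N)` would return the indices of the individuals that form the next parent population.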
Hadoop is also used to implement the algorithm: after population initialization, the evolution of each generation is carried out with MapReduce. The Map stage computes individual fitness, using the subgroup number of the node as the key and the individual together with its fitness as the value; because these computations are time-consuming, they are executed in parallel. The Reduce stage merges the values that share the same key and then performs selection, crossover, mutation, and similar operations on the subgroup at each node, which keeps the evolution of each subgroup relatively independent. Because the populations on different nodes do not affect one another, the evolution operations run in parallel across multiple Reduce nodes. The parallelization flow is shown in fig. 5.
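The key and value layout described above can be imitated in a single Python process to show how the data flow. The sketch does not use the Hadoop API; `map_fitness`, `reduce_evolve`, and `run_generation` are hypothetical names, and `fitness_fn` and `evolve_fn` stand in for the fitness calculation and for the per-subgroup selection, crossover, and mutation.

```python
from collections import defaultdict

def map_fitness(subgroup_id, individual, fitness_fn):
    """Map: emit (subgroup number, (individual, fitness))."""
    return subgroup_id, (individual, fitness_fn(individual))

def reduce_evolve(subgroup_id, scored, evolve_fn):
    """Reduce: evolve one subgroup independently of the others."""
    return subgroup_id, evolve_fn(scored)

def run_generation(subgroups, fitness_fn, evolve_fn):
    """One generation: map over all individuals, group by key, reduce per subgroup."""
    shuffled = defaultdict(list)
    for gid, members in subgroups.items():
        for ind in members:
            key, value = map_fitness(gid, ind, fitness_fn)
            shuffled[key].append(value)        # stands in for the shuffle phase
    return dict(reduce_evolve(gid, scored, evolve_fn)
                for gid, scored in shuffled.items())
```

Because each reduce call only sees the values of its own key, the subgroups could be evolved on separate nodes without coordination, which is the independence property the text relies on.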
The beneficial effects of the invention are as follows. The Hadoop-based shared bicycle parking point site selection method is expected to improve the rationality and accuracy of parking point site selection, making shared bicycle management more orderly. For the large amount of shared bicycle travel data that has been generated, a demand prediction model based on the travel data is established, and shared bicycle demand points are predicted with the Hadoop framework and a clustering algorithm. Because demand points cannot always be set up as parking points, a multi-objective parking point site selection model is established that aims at the shortest total travel distance and the smallest total construction cost; the model yields the positions of the shared bicycle parking points and the number of bicycles each can accommodate. Finally, the model is solved with the improved NSGA-II, and the computation is implemented on the Hadoop framework. The method solves the shared bicycle parking point site selection problem to a certain extent and makes site selection more scientific and rational.
Drawings
FIG. 1 is a general structure diagram of the Hadoop-based shared bicycle parking point site selection method according to the present invention;
FIG. 2 is a partial cycling data diagram of a shared bicycle in accordance with the present invention;
FIG. 3 is a demand point prediction model framework of the present invention;
FIG. 4 is a flow chart of the improved NSGA-II algorithm solution model of the present invention;
FIG. 5 is a flow chart of an algorithmic solution model implemented on a Hadoop framework according to the present invention;
FIG. 6 is a parallelization implementation flow of Canopy-Kmeans adopted by the model in step 1 of the present invention;
FIG. 7 is a flow of the K-means parallelization algorithm based on MapReduce in the invention.
Detailed Description
The present invention will be further described with reference to the following examples. The following examples are set forth merely to aid in the understanding of the invention. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
In one embodiment, the Hadoop-based shared bicycle parking point site selection method comprises the following steps:
Step one, shared bicycle demand point prediction based on a distributed clustering algorithm
The Canopy-Kmeans algorithm is implemented with a Hadoop-based parallelization method to solve the model of fig. 3. Under the MapReduce framework, solving the model can be split into several subtasks; the specific flow is shown in fig. 6, where each dashed box contains an independent MapReduce task. Using Canopy together with K-means removes the uncertainty of choosing K manually, mitigates the local optima and instability caused by randomly chosen initial cluster centers, reduces the influence of isolated points on the clustering result, and greatly improves the clustering performance of K-means.
First, the collected GPS data of shared bicycles for a given time period are organized into files and stored in HDFS. In the first stage, Hadoop runs the Canopy algorithm in parallel and outputs the result as a file. The cluster center points in that file are visualized on a map and isolated points are filtered out, and the processed file is stored back in HDFS for the next stage, in which Hadoop runs K-means clustering. The output of that stage, namely the positions, demand quantities, and bicycle information of the demand points, serves as the input data for the third stage.
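The K-means stage maps naturally onto the map and reduce roles: the map step assigns each GPS point to its nearest center, and the reduce step averages the points of each center. The single-process sketch below illustrates that split under simplifying assumptions (Euclidean distance on coordinate pairs, initial centers supplied by the Canopy stage); the function names are hypothetical and no Hadoop API is involved.

```python
import math
from collections import defaultdict

def kmeans_map(point, centers):
    """Map: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))
    return idx, point

def kmeans_reduce(idx, pts):
    """Reduce: new center is the mean of the points; demand is the point count."""
    n = len(pts)
    center = tuple(sum(coord) / n for coord in zip(*pts))
    return idx, center, n

def kmeans(points, centers, max_iter=20, tol=1e-6):
    """Iterate map and reduce until the centers move less than tol."""
    centers = list(centers)
    demand = [0] * len(centers)
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in points:                       # map phase, grouped by key
            k, v = kmeans_map(p, centers)
            groups[k].append(v)
        new_centers = list(centers)
        demand = [0] * len(centers)
        for k, pts in groups.items():          # reduce phase
            _, c, n = kmeans_reduce(k, pts)
            new_centers[k], demand[k] = c, n
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, demand
```

The returned centers and counts correspond to the demand point positions and demand quantities that the text passes on to the later stages.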
Step two, shared bicycle parking point site selection model based on multi-objective optimization
When the shared bicycle parking point model is established, certain assumptions are made to improve its feasibility. Analysis shows that the shared bicycle parking point site selection problem has the following characteristics:
(1) The land selected for a shared bicycle parking point is part of an urban construction and development project, and such projects are large in scale and long in duration. In addition, designing the electronic fence for a parking point is expensive, and land planning and daily operation require substantial funds. The cost constraint is therefore an important part of the site selection model.
(2) The area in which parking points are to be built is divided into several electronic fence areas according to land use and geographic conditions. Because travelers' willingness to walk is limited by distance, the number of bicycles parked in each electronic fence area has a certain upper limit, and construction and management costs increase when the number of bicycles at a parking point exceeds the basic parking quantity.
(3) Besides cost, travel convenience is also a key factor in judging the quality of a parking point site. It is assumed herein that a traveler using a shared bicycle always chooses the nearest parking point. When the distance from demand points to parking points is as short as possible, the needs of different travelers are met and the parking point locations are widely accepted, which alleviates the problem of randomly parked and placed shared bicycles. To simplify the model, the locations of all shared bicycles within a demand area are taken to be the location of the demand point, and the center of each electronic fence is taken as the location of the parking point. To bring all bicycles in each demand area as close as possible to a parking point, the model adopts the total distance from the bicycles in all demand areas to the planned parking points as an optimization objective and minimizes it.
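The two objectives that follow from these characteristics can be evaluated for a given allocation with a few lines of code. The sketch below is illustrative only: it assumes the allocation is a matrix `x[i][j]` of bicycles sent from demand point i to parking point j, and that the extra cost (called `y_cost` here) is charged per bicycle above the basic number `c`. The function names are hypothetical.

```python
def evaluate(x, d, m, c, y_cost):
    """Return (total distance, total cost) for allocation matrix x.

    x[i][j]: bicycles from demand point i assigned to parking point j
    d[i][j]: distance from demand point i to parking point j
    m: construction cost per established parking point
    c: basic number of bicycles per parking point
    y_cost: assumed extra cost per bicycle above c
    """
    n_parking = len(x[0])
    total_distance = sum(d[i][j] * x[i][j]
                         for i in range(len(x)) for j in range(n_parking))
    load = [sum(row[j] for row in x) for j in range(n_parking)]   # bicycles per parking point
    built = [1 if cj > 0 else 0 for cj in load]                   # parking point established?
    excess = [max(0, cj - c) for cj in load]                      # bicycles above the basic number
    total_cost = sum(m * yj + y_cost * aj for yj, aj in zip(built, excess))
    return total_distance, total_cost

def is_feasible(x, demand):
    """Every demand point must allocate exactly its demand, with no negative entries."""
    return all(sum(row) == n and min(row) >= 0 for row, n in zip(x, demand))
```

For example, with two demand points and two parking points, `evaluate([[3, 0], [1, 2]], [[1, 4], [2, 1]], m=100, c=3, y_cost=10)` gives a total distance of 7 and a total cost of 210.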
Step three, model solving algorithm based on improved NSGA-II
1 coding mode
The standard NSGA-II algorithm uses real-number or binary encoding, but one-dimensional real-number or binary encoding cannot adequately reflect the various combinations that an individual in this model can take. For the shared bicycle parking point model, the method therefore encodes population individuals as real matrices. The specific form is given by formula (3-1):
In the formula, $P_k$ is the k-th individual in the population; $x_{i,j}$, the element in row i and column j of the encoding matrix, is the number of bicycles that the i-th demand point assigns to the j-th parking point; $V_i$ denotes how the i-th demand point distributes its bicycles over the parking points; and $R_j$ denotes how the j-th parking point receives bicycles from the demand points.
With matrix encoding, the allocation scheme represented by each individual is expressed directly, the diversity of the population is preserved during crossover and mutation, and local convergence and premature convergence in the early stage are avoided.
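A matrix-encoded individual can be initialized so that each row already satisfies the allocation constraint. The following sketch makes that concrete; it assumes integer entries (bicycle counts), although the text speaks of real matrix encoding, and the names `random_individual` and `init_population` are hypothetical.

```python
import random

def random_individual(demand, n_parking, rng=random):
    """Build a matrix with one row per demand point whose row i sums to demand[i]."""
    matrix = []
    for n_i in demand:
        row = [0] * n_parking
        for _ in range(n_i):
            row[rng.randrange(n_parking)] += 1   # place one bicycle at a random parking point
        matrix.append(row)
    return matrix

def init_population(demand, n_parking, size, seed=None):
    """Generate `size` individuals, optionally reproducibly via `seed`."""
    rng = random.Random(seed)
    return [random_individual(demand, n_parking, rng) for _ in range(size)]
```

Each row of an individual then corresponds to $V_i$ and each column to $R_j$ in the notation above.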
2 crossover operator and mutation operator
Because population individuals are encoded as real matrices, the crossover and mutation operators of the NSGA-II algorithm are redesigned. The standard NSGA-II algorithm uses fixed crossover and mutation operators, so the crossover probability $P_c$ and the mutation probability $P_m$ are fixed values that cannot meet the changing parameter requirements of the population as it evolves. A new crossover operator and a new mutation operator are proposed to address this.
1) Crossover operator
Traditional crossover operators generally use single-point or two-point crossover, which leaves gene exchange among individuals insufficient. Here, crossover is instead performed on a chosen column of the matrix. The two individuals to be crossed are as follows:
The individuals generated by the crossover operation are $C_1$ and $C_2$, expressed as follows:
where i is a randomly generated crossover point with i between 1 and N,
$P_1.rank$ is the non-dominated sorting level of individual $P_1$, and $P_2.rank$ is the non-dominated sorting level of individual $P_2$. The crossover operator is tied to the Pareto non-dominated sorting level of each individual, so that individuals with a small non-dominated sorting level contribute a larger share to the offspring. The value of $\lambda$ is therefore larger early in the run; as the algorithm proceeds and individuals converge toward the same Pareto front, $\lambda$ gradually tends to 0.5. This crossover strategy passes on the better genes of the parents and improves the diversity of the population.
2) Mutation operator
A traditional mutation operator selects a single node of an individual and mutates it. Because real matrix encoding is used, single-node mutation is not applicable. Given the model and the encoding scheme, mutation is instead applied to an entire selected column, as described below. The individual $P$ to be mutated is
$P = [R_1^P, R_2^P, \ldots, R_N^P]$ (3-7)
Mutation of $P$ generates the individual $Q$:
$Q = [R_1^P, R_2^P, \ldots, R_i, \ldots, R_N^P]$ (3-8)
where $R_i$ is a randomly generated column of data that replaces the original i-th column.
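The column-replacement mutation of (3-7) and (3-8) can be sketched as follows. How the random column is generated is not specified in the text; the sketch assumes each new entry is drawn uniformly between 0 and the demand of its row, and it performs no repair, so the mutated matrix may violate the allocation constraint. The name `mutate_column` is hypothetical.

```python
import random

def mutate_column(individual, demand, rng=random):
    """Return (child, i): a copy of the matrix with column i regenerated at random."""
    n_cols = len(individual[0])
    i = rng.randrange(n_cols)                 # column selected for mutation
    child = [row[:] for row in individual]    # leave the parent untouched
    for r, n_r in enumerate(demand):
        child[r][i] = rng.randint(0, n_r)     # assumed bound: at most the row's demand
    return child, i
```

A child that no longer satisfies the row sums would be treated as an infeasible solution by whatever constraint handling the solver applies.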
The crossover and mutation operators work as described above. Among the parameters of a genetic algorithm, the choice of the crossover probability $P_c$ and the mutation probability $P_m$ is the key to performance. The larger $P_c$ is, the faster new individuals are produced; if $P_m$ is too large, the likelihood that good genetic patterns are destroyed increases; if $P_c$ is too small, the search becomes slow. For different optimization problems, $P_c$ and $P_m$ must be determined by repeated experiments, and it is difficult to find values that suit every problem. Because the standard NSGA-II algorithm uses fixed crossover and mutation probabilities, the adaptive genetic algorithm proposed by M. Srinivas et al. [44] is introduced here.
Under this strategy, when an individual's fitness is below the population average, the individual is judged to perform poorly and is given a larger crossover rate and mutation rate, which promotes the generation of individuals with new patterns. When an individual's fitness is greater than or equal to the average, the individual is judged to carry good pattern genes and is given a smaller crossover rate and mutation rate, so that the better pattern genes in the population are not destroyed. The corresponding model is given below: formula (3-9) is the crossover rate adjustment function, and formula (3-10) is the mutation rate adjustment function.
where $P_c$ is the crossover rate of the individuals to be crossed, $P_m$ is the mutation rate of the individual to be mutated, $f_{max}$ is the maximum fitness in the population, $f_{avg}$ is the average fitness of the population, $f'$ is the larger fitness of the two individuals to be crossed, $f$ is the fitness of the individual to be mutated, $k_1$ and $k_2$ are the parameters of the crossover rate adjustment function, and $k_3$ and $k_4$ are the parameters of the mutation rate adjustment function. In general, $k_1 = k_2$ and $k_3 = k_4$.
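The adaptive rates can be sketched in the form commonly attributed to the adaptive genetic algorithm of Srinivas and Patnaik: above-average individuals get a rate that shrinks linearly to zero as their fitness approaches the population maximum, and below-average individuals get a constant rate. The exact formulas (3-9) and (3-10) may differ; the parameter roles follow the text ($k_1$, $k_2$ for crossover, $k_3$, $k_4$ for mutation), and the default values 1.0 and 0.5 are illustrative assumptions.

```python
def adaptive_crossover_rate(f_prime, f_max, f_avg, k1=1.0, k2=1.0):
    """Crossover rate for a pair whose larger fitness is f_prime."""
    if f_prime < f_avg or f_max == f_avg:
        return k2                                   # below average: constant, larger rate
    return k1 * (f_max - f_prime) / (f_max - f_avg)  # above average: rate shrinks toward 0

def adaptive_mutation_rate(f, f_max, f_avg, k3=0.5, k4=0.5):
    """Mutation rate for an individual with fitness f."""
    if f < f_avg or f_max == f_avg:
        return k4
    return k3 * (f_max - f) / (f_max - f_avg)
```

With $f_{max} = 10$ and $f_{avg} = 5$, an individual with fitness 7.5 receives a crossover rate of 0.5 and a mutation rate of 0.25, while the best individual receives 0 for both and is left intact. The guard for `f_max == f_avg` avoids a division by zero when the whole population has the same fitness.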
3 constrained optimization process improvements
In practical applications, the true optimal solutions of many constrained multi-objective optimization problems often lie near the constraint boundaries, and infeasible solutions located at those boundaries often have better objective function values than some feasible solutions inside the feasible region. Such advantageous infeasible solutions can be used to speed up the search toward the feasible region. Because the model has constraints, the population produces infeasible solutions during evolution. To take the influence of infeasible solutions on the population fully into account, feasible and infeasible solutions are considered together, and every several generations a set of better feasible solutions and a set of better infeasible solutions are selected for genetic operations.
In those generations, the infeasible and feasible solutions are recombined and crossed, and the generations in which this happens are determined by an adaptive strategy. Since evolution moves toward the feasible region and the optimal solution, the number of feasible solutions keeps increasing; if too many genetic operations between feasible and infeasible solutions are performed late in the evolution, the search performance inside the feasible region may suffer. The number of direct crossovers between infeasible and feasible solutions is therefore reduced gradually as evolution proceeds. To this end, while genetic operations are performed on feasible and infeasible solutions separately, the generations at which the two are crossed are adjusted adaptively; that is, recombination crossover between feasible and infeasible solutions is performed when the population's generation number is k:
In formula (3-11), T is the total number of generations of the population. The formula shows that as the generation number increases, operations between feasible and infeasible solutions gradually become less frequent.
The improved NSGA-II algorithm is implemented according to the above requirements, with an initial population size N = 100, an initial crossover probability $P_c$ = 0.8, a mutation probability $P_m$ = 0.1, and a maximum number of iterations max = 100. Hadoop is used for parallel execution as shown in fig. 5, and the algorithm finally outputs a Pareto optimal solution set of plannable parking points for the decision maker to choose from.