CN116227714B - Travel mode selection prediction and analysis method and system

Info

Publication number: CN116227714B
Application number: CN202310240074.1A
Authority: CN
Inventors: 唐立; 唐传丽
Original assignee: Xihua University
Current assignee: Xihua University
Priority date: 2023-03-14
Filing date: 2023-03-14
Publication date: 2023-10-27
Anticipated expiration: 2043-03-14
Also published as: CN116227714A

Abstract

The invention discloses a travel mode selection prediction and analysis method and system. The system comprises a data preprocessing module, a data reprocessing module, and a knowledge-data driven automatic fusion module. The data preprocessing module performs conventional data processing and data set division; the data reprocessing module screens the main input variables and the auxiliary input variables; the knowledge-data driven automatic fusion module divides the observable utility of the utility function in random utility theory into two parts, a knowledge driving part and a data driving part, wherein the knowledge driving part preserves the interpretability of the discrete selection model and the data driving part improves prediction performance. The model is trained and used for prediction by calling a Python-based deep learning framework, and the training process and prediction results are analyzed visually by calling the TensorBoard platform. The advantages of the invention are that the risk of model misestimation is reduced, the network degradation problem caused by deepening the neural network is alleviated, and the working efficiency of the user is improved.

Description

Travel mode selection prediction and analysis method and system

Technical Field

The invention relates to the technical field of artificial-intelligence-based traffic behavior analysis, in particular to a travel mode selection prediction and analysis method and system based on a knowledge-data driven automatic fusion model.

Background

Travel mode selection is a classical problem in traffic planning and traffic demand management. Studying the decision mechanism behind individual travel choices helps planning and management departments understand travelers' needs and thus provide better travel services. The traditional model commonly used for travel mode selection prediction in industry and academia is the econometric discrete selection model based on random utility theory, which has good interpretability but is limited in handling complex variables and in achieving high prediction accuracy. Furthermore, the observable part of the utility function of a discrete selection model is usually assumed to be known a priori and mostly relies on the modeler's prior knowledge; erroneous prior assumptions can lead to erroneous model estimates. In recent years, with the continuous development of artificial intelligence technology, machine-learning-based travel mode selection modeling has attracted wide attention from researchers around the world. Scholars in the field of economics hold that, although neural-network-based travel mode selection modeling improves prediction capability, the lack of interpretability of such black-box models greatly reduces their credibility; travel mode selection modeling based on general interpretable machine learning models, although interpretable, tends to overfit and may be poorly robust in the presence of extreme values or outliers.

How to improve the prediction capability of a travel mode selection model while keeping the model interpretable has been one of the research hotspots in the field of travel mode selection over the past two years. Recently, scholars have proposed converting the traditional discrete selection model, namely the multinomial logit (MNL) model, into an artificial neural network (the knowledge driving part) and combining it with a deep learning method (the data driving part). This approach has great potential for travel mode selection prediction, but current research is still at a preliminary exploratory stage and has two obvious limitations. First, the knowledge driving part in prior studies still relies on the modeler's prior knowledge; a data driving part is merely added to the model to improve overall prediction performance, so a considerable risk of erroneous prior assumptions remains. Second, the solving and analysis methods in existing research mostly depend on code that researchers write themselves for their own models. Such code usually has many functions and a large volume, so beginners find it difficult to grasp the researchers' ideas quickly; differences between solving tools further hinder reproduction, which is not conducive to sharing research results or to communication among researchers.

Tools that can be used to solve the traditional discrete selection model include NLogit, Stata, and Python. NLogit is software dedicated to discrete selection models; its main advantages are that it supports a wide range of discrete selection model types, including logit, probit, nested logit, and mixed logit, can perform the related parameter estimation and model selection, and offers high precision and stability. Stata is specialized statistical analysis software that incorporates many statistical models and methods, including discrete selection models. However, both NLogit and Stata are paid software; NLogit's support for data analysis and modeling tasks beyond discrete selection is relatively weak, and Stata may have limitations in some complex model estimation and data processing tasks. Python provides a large number of data analysis and modeling libraries, such as NumPy, pandas, Matplotlib, and scikit-learn, which can implement a complete data analysis workflow from data cleaning to model evaluation. Python has a simple and flexible syntax, has large community and resource support, and is free and open source. With the development of deep learning, many frameworks have become available that help deep learning researchers improve their work efficiency. Popular deep learning frameworks include TensorFlow, PyTorch, PaddlePaddle, and Caffe. Among these, both TensorFlow and PyTorch are used from Python and provide Python APIs. Its ease of learning and use, rich ecosystem, and community support have made Python one of the mainstream programming languages in the deep learning field.
TensorBoard is a visualization tool that can be installed and called from both the TensorFlow and PyTorch frameworks. It intuitively displays various information about the training process, including the model architecture, the loss and accuracy during training, and the distributions of weights and gradients.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a travel mode selection prediction and analysis method and system. The model is built and solved using the highly general Python-based deep learning frameworks TensorFlow and PyTorch, and the model results are analyzed visually using TensorBoard. The invention reduces the risk of model misestimation caused by erroneous, manually set prior assumptions in travel mode selection prediction; at the same time, it uses open-source Python-based solving and analysis tools to lower the threshold for learning about and communicating on related problems; by constructing a knowledge-data driven automatic fusion module, it explores a method for applying deep learning to tabular data analysis, and by using an adjustable residual network it alleviates, to a certain extent, the network degradation problem caused by deepening the neural network.

In order to achieve the above object, the present invention adopts the following technical scheme:

a travel mode selection prediction and analysis method is realized on the basis of a travel mode selection prediction and analysis system;

the selection prediction and analysis system comprises: the system comprises a data preprocessing module, a data reprocessing module, a knowledge-data driven automatic fusion module and a TensorBoard server;

the data preprocessing module is used for carrying out data cleaning, data encoding, feature extraction, data standardization and sample sampling on travel data;

the data reprocessing module is formed by fusing two parallel machine learning models of XGBoost and random forest and is used for conducting data reprocessing on travel data to obtain a main input variable and an auxiliary input variable which are used for being respectively input into a knowledge driving sub-module and a data driving sub-module of the knowledge-data driving automatic fusion module;

the knowledge-data driven automatic fusion module comprises: the system comprises a knowledge driving sub-module, a data driving sub-module, a fusion sub-module and an output layer;

the knowledge driving sub-module mainly consists of a single-layer convolutional neural network converted from the multinomial logit (MNL) model based on random utility theory, and is used for establishing the knowledge driving part of the utility of the observable part of the utility function, so that the interpretability of the prediction model is preserved;

the main body of the data driving sub-module is an adjustable residual network, which is used for establishing the data driving part of the utility of the observable part of the utility function, improving the prediction performance of the prediction model while preventing the network degradation problem caused by network deepening.

The fusion sub-module is used for summing the knowledge driving part and the data driving part of the observable part of the utility function.

The output layer is used for converting the fusion result of the fusion sub-module into probability and outputting the probability.

The travel mode selection prediction and analysis method comprises the following steps:

step 1, acquiring trip characteristic data and socioeconomic attribute data of residents, defining an alternative set, preliminarily preprocessing the data by utilizing the data preprocessing module to obtain travel data that can be used for solving a discrete selection model, and dividing the travel data into a training set and a test set according to a certain proportion;

step 2, carrying out data reprocessing on the travel data subjected to data preprocessing by utilizing a data reprocessing module to respectively obtain a main input variable and an auxiliary input variable;

step 3, the travel data obtained through the data reprocessing module is input into a knowledge-data driving automatic fusion module to be calculated and a result is output;

step 4, setting the loss function to the categorical cross-entropy (CE), and carrying out model training and prediction by calling a built-in optimizer of TensorFlow or PyTorch;

step 5, starting a TensorBoard server, and performing visual analysis on the running process and the model results of the knowledge-data driven automatic fusion module.

Further, the preprocessing of the travel data in the step 1 includes:

Data cleaning: unreasonable data, such as records with missing values or abnormal values, are deleted.

Data encoding: text-type data are converted into numeric data.

Feature extraction: feature variables that can be input to the machine learning models and the discrete selection model are extracted from the raw data.

Data standardization: data with different dimensions are standardized.

Sampling: if the data set is unbalanced, an over-sampling or under-sampling method is employed to balance the data set.

Further, in step 2, the model result of XGBoost in the data reprocessing module is given a weight δ and the model result of the random forest is given a weight (1-δ); the output includes the prediction result, the feature importance, and the visualization result of the fused model. The models in the data reprocessing module use default parameters, and each model can be tuned individually as appropriate according to the visualization result of the data reprocessing module. The weight δ is a manually adjustable hyperparameter; δ can be adjusted according to the output of the knowledge-data driven automatic fusion module, and setting δ improves the flexibility of the data reprocessing module in processing data. Through the data reprocessing module, the main input variables and the auxiliary input variables are selected automatically by the program according to the actual data, which reduces the risk of prior errors caused by manual mis-specification.

Further, in step 3, the knowledge-data driven automatic fusion module introduces a multi-input wide-deep neural network architecture: the main input variable data obtained in step 2 are input to the knowledge driving sub-module, and the auxiliary input variable data are input to the data driving sub-module. The knowledge driving part and the data driving part of the observable part of the utility function are then added in the fusion sub-module, and, under the assumption that the unobservable part of the utility function follows a type I extreme value (Gumbel) distribution, the utility is converted into the predicted selection probability by a Softmax activation function and output.

Further, the Softmax activation function is defined as follows:

$$(\sigma(V_n))_i = \frac{e^{V_{in}}}{\sum_{j \in C_n} e^{V_{jn}}}$$

wherein $(\sigma(V_n))_i$ represents the probability that individual n selects mode i, $V_{in}$ represents the utility of mode i for individual n, $C_n$ is the set of all alternatives, and $i, j \in C_n$.

Further, in step 4, the categorical cross-entropy is defined as follows:

$$H_n(\sigma, y_n) = -\sum_{i \in C_n} y_{in} \log \sigma_i(V_n)$$

wherein $H_n(\sigma, y_n)$ is the categorical cross-entropy, which measures the similarity between the predicted probability and the true probability, $y_{in}$ is the true probability that individual n selects mode i, $C_n$ is the set of all alternatives, and $\sigma_i(V_n)$ represents the probability that individual n selects mode i.
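The Softmax output and the categorical cross-entropy loss can be illustrated with a minimal sketch in plain Python. The function names and the utility values are assumptions chosen for illustration; a practical implementation would use the equivalent built-in functions of TensorFlow or PyTorch.

```python
import math

def softmax(utilities):
    """Selection probabilities from the utilities V_in of one individual.

    The maximum is subtracted before exponentiating for numerical stability;
    this does not change the result.
    """
    m = max(utilities)
    exps = [math.exp(v - m) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(y_true, probs, eps=1e-12):
    """H_n = -sum_i y_in * log(sigma_i(V_n)) for one individual."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, probs))

# Hypothetical utilities of (car, rail transit, bus) for one traveler
utilities = [1.0, 0.5, -0.2]
probs = softmax(utilities)

# The traveler actually chose the car, so y_n is one-hot on the first mode
loss = categorical_cross_entropy([1, 0, 0], probs)
```

With a one-hot label the loss reduces to the negative log of the probability assigned to the chosen mode, so it shrinks as the model becomes more confident in the observed choice.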

Compared with the prior art, the invention has the advantages that:

the risk of model misestimation caused by erroneous, manually set prior assumptions in travel mode selection prediction is reduced; at the same time, open-source Python-based solving and analysis tools lower the threshold for learning about and communicating on related problems, which helps beginners learn and reproduce the research results of others and improves the working efficiency of users; by constructing a knowledge-data driven automatic fusion module, a method for applying deep learning to tabular data analysis is explored, and the adjustable residual network alleviates, to a certain extent, the network degradation problem caused by deepening the neural network.

Drawings

FIG. 1 is a schematic diagram of a multiple input wide-deep neural network architecture according to an embodiment of the present invention;

FIG. 2 is an overall framework of a travel mode selection prediction and analysis system according to an embodiment of the present invention;

FIG. 3 is a graph showing the importance of features output by a data reprocessing module according to an embodiment of the present invention;

FIG. 4 is a block diagram of a knowledge-data driven auto-fusion module in accordance with an embodiment of the invention;

FIG. 5 is a TensorBoard visualization interface in accordance with an embodiment of the present invention.

Detailed Description

The invention will be described in further detail below with reference to the accompanying drawings and by way of examples in order to make the objects, technical solutions and advantages of the invention more apparent.

A travel mode selection prediction and analysis method comprises the following steps:

step 1, acquiring trip characteristic data and socioeconomic attribute data of residents, defining the alternative set, preliminarily preprocessing the data by utilizing the data preprocessing module to obtain travel data that can be used for solving a discrete selection model, and dividing the data into a training set and a test set according to a certain proportion (generally 7:3 or 8:2, set flexibly according to the actual data).

Obtaining data that can be used for solving the discrete selection model mainly involves the following preprocessing of the travel data:

Data cleaning: unreasonable data, such as records with missing values or outliers, are deleted.

Data encoding: text-type data are converted into numeric data.

Data standardization: data with different dimensions (such as travel time and travel distance) are standardized.

Sampling: if the data set is unbalanced, over-sampling or under-sampling methods may be used to balance it, so that the model does not focus excessively on certain categories.
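The cleaning, standardization and data set division steps can be sketched as follows. This is a minimal illustration in plain Python rather than the implementation of the data preprocessing module; the field names, the toy records, the z-score formula and the fixed random seed are all assumptions.

```python
import random
from statistics import mean, pstdev

def drop_incomplete(records, required):
    """Data cleaning: drop records with a missing (None) required field."""
    return [r for r in records if all(r.get(k) is not None for k in required)]

def standardize(records, columns):
    """Z-score standardization so that columns with different units are comparable."""
    out = [dict(r) for r in records]
    for col in columns:
        values = [r[col] for r in records]
        mu, sd = mean(values), pstdev(values)
        for r in out:
            r[col] = (r[col] - mu) / sd if sd > 0 else 0.0
    return out

def train_test_split(records, train_ratio=0.8, seed=0):
    """Shuffle the records and divide them, e.g. in a ratio of 8:2."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

# Toy records with hypothetical values
records = [
    {"travel_time": 30.0, "distance": 5.0, "mode": "car"},
    {"travel_time": 45.0, "distance": 8.0, "mode": "bus"},
    {"travel_time": None, "distance": 3.0, "mode": "car"},
    {"travel_time": 20.0, "distance": 2.0, "mode": "rail"},
    {"travel_time": 60.0, "distance": 12.0, "mode": "bus"},
    {"travel_time": 25.0, "distance": 4.0, "mode": "car"},
]
clean = drop_incomplete(records, ["travel_time", "distance", "mode"])
scaled = standardize(clean, ["travel_time", "distance"])
train, test = train_test_split(scaled, train_ratio=0.8)
```

For brevity the sketch standardizes before splitting; computing the mean and standard deviation on the training set only is a common way to keep test information out of training.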

step 2, carrying out data reprocessing on the preprocessed travel data by utilizing the data reprocessing module to obtain the main input variables and the auxiliary input variables respectively.

The data reprocessing module is formed by fusing two parallel machine learning models, XGBoost and random forest. The model result of XGBoost is given a weight δ and the model result of the random forest is given a weight (1-δ); the output includes the prediction result, the feature importance, and the visualization result of the fused model. The models in the data reprocessing module use default parameters, and each model can be tuned individually as appropriate according to the visualization result of the data reprocessing module. The weight δ is a manually adjustable hyperparameter; it can be adjusted according to the output of the subsequent knowledge-data driven automatic fusion module, and setting δ improves, to a certain extent, the flexibility of the data reprocessing module in processing data.

The main input variable refers to a characteristic variable with the characteristic importance ranking top output by the data reprocessing module; the auxiliary input variables refer to other characteristic variables output from the data reprocessing module.

The main input variable and the auxiliary input variable can be automatically selected by the program according to the actual condition of the data through the processing of the data reprocessing module, so that the risk of priori error caused by human setting errors is reduced.
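One way to read the δ-weighted fusion is as a weighted average of the feature importance scores of the two models, followed by a ranking. The sketch below works under that assumption; the importance values are invented, and the cut-off `n_main` is a hypothetical parameter, since the text does not state how many top-ranked features become main input variables.

```python
def fuse_importance(xgb_importance, rf_importance, delta):
    """Weighted fusion: delta * XGBoost score + (1 - delta) * random forest score."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    features = sorted(set(xgb_importance) | set(rf_importance))
    return {
        f: delta * xgb_importance.get(f, 0.0)
        + (1.0 - delta) * rf_importance.get(f, 0.0)
        for f in features
    }

def split_inputs(fused, n_main):
    """Top-ranked features become main inputs; the rest become auxiliary inputs."""
    ranked = sorted(fused, key=lambda f: (-fused[f], f))
    return ranked[:n_main], ranked[n_main:]

# Hypothetical importance scores, as the two fitted models might report them
xgb_importance = {"TT_CAR": 0.30, "TC_CAR": 0.25, "DIST": 0.20, "AGE": 0.15, "SEX": 0.10}
rf_importance = {"TT_CAR": 0.28, "TC_CAR": 0.22, "DIST": 0.26, "AGE": 0.14, "SEX": 0.10}

fused = fuse_importance(xgb_importance, rf_importance, delta=0.5)
main_vars, aux_vars = split_inputs(fused, n_main=3)
```

In practice the two score dictionaries would come from fitted XGBoost and random forest models; they are hard-coded here so that the example stays self-contained.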

step 3, inputting the travel data obtained through the data reprocessing module into the knowledge-data driven automatic fusion module, which introduces a multi-input wide-deep neural network architecture (shown in FIG. 1). The main input variable data are input to the knowledge driving sub-module, and the auxiliary input variable data are input to the data driving sub-module. The knowledge driving part and the data driving part of the observable part of the utility function are then added in the fusion sub-module, and, under the assumption that the unobservable part of the utility function follows a type I extreme value (Gumbel) distribution, the utility is converted into the predicted selection probability by a Softmax activation function and output.

The knowledge driving sub-module mainly consists of a single-layer convolutional neural network converted from the multinomial logit (MNL) model based on random utility theory; the main body of the data driving sub-module is an adjustable residual network, which can be used to prevent the network degradation problem caused by network deepening. "Adjustable" means that the size of the residual blocks, the number of residual blocks, and the way they are connected can all be adjusted freely.
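The forward pass through the four components can be sketched in plain Python. This is an illustrative simplification, not the patented network: the knowledge driving part is written as a linear utility with one shared coefficient vector `beta` applied to each alternative's attributes (the computation a single convolution kernel would perform), every main input is assumed to be alternative-specific, and the layer sizes, random initialization and all names are assumptions.

```python
import math
import random

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def dense(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, xi) for xi in x]

def residual_block(x, weights, bias):
    """y = x + ReLU(Wx + b); the skip connection keeps an identity path."""
    return [xi + hi for xi, hi in zip(x, relu(dense(x, weights, bias)))]

def knowledge_utilities(main_inputs, beta):
    """Knowledge driving part: the same beta weights every alternative's attributes."""
    return [sum(b * x for b, x in zip(beta, attrs)) for attrs in main_inputs]

def forward(main_inputs, aux_inputs, params):
    v_knowledge = knowledge_utilities(main_inputs, params["beta"])
    h = relu(dense(aux_inputs, params["w_in"], params["b_in"]))
    for w, b in params["blocks"]:  # the number of blocks is adjustable
        h = residual_block(h, w, b)
    v_data = dense(h, params["w_out"], params["b_out"])  # data driving part
    fused = [vk + vd for vk, vd in zip(v_knowledge, v_data)]  # fusion sub-module
    return softmax(fused)  # output layer

def init_params(n_main_features, n_aux, hidden, n_alt, n_blocks, seed=0):
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "beta": [rng.uniform(-0.1, 0.1) for _ in range(n_main_features)],
        "w_in": mat(hidden, n_aux), "b_in": [0.0] * hidden,
        "blocks": [(mat(hidden, hidden), [0.0] * hidden) for _ in range(n_blocks)],
        "w_out": mat(n_alt, hidden), "b_out": [0.0] * n_alt,
    }

# One traveler: (time, cost) per alternative as main inputs and
# (sex, age, income) as auxiliary inputs; all values are hypothetical.
main_inputs = [[0.5, 1.2], [0.8, 0.4], [1.1, 0.2]]
aux_inputs = [1.0, -0.3, 0.7]
params = init_params(n_main_features=2, n_aux=3, hidden=4, n_alt=3, n_blocks=2)
probs = forward(main_inputs, aux_inputs, params)
```

If the output weights of the data driving part are all zero, the model reduces to a multinomial logit on the main inputs, which is why the coefficients `beta` remain interpretable.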

Random utility theory assumes that an individual is a rational decision maker whose goal in making a selection decision is to maximize his or her own utility. Let $U_{in}$ denote the utility that individual n attributes to travel mode i. $U_{in}$ consists of two parts: the observable part of the utility, $V_{in}$, and the unobservable part of the utility, $\varepsilon_{in}$ (the error term), which characterizes the uncertainty that arises because the modeler cannot fully consider all influencing factors in a given selection scenario.

$$U_{in} = V_{in} + \varepsilon_{in} \tag{1}$$

Scholars have shown that the observable part can be decomposed into the form of equation (2):

$$V_{in} = f_i(\chi_n; \beta) + \text{(data driving part)} \tag{2}$$

wherein $f_i(\chi_n; \beta)$ is the knowledge driving part, which is assumed to be interpretable: the function f is specified so that its unknown model parameters β can be interpreted in terms of the explanatory variables (input features) $\chi_n$. The second term is the data driving part, for which no a priori relationship is assumed.

Substituting the expression in equation (2) for the observable utility in equation (1) gives the following utility expression:

$$U_{in} = f_i(\chi_n; \beta) + \text{(data driving part)} + \varepsilon_{in} \tag{3}$$

Since the multinomial logit (MNL) model assumes that the error terms are independent and identically distributed (i.i.d.) and follow a type I extreme value (Gumbel) distribution, the probability that individual n selects travel mode i is given by:

$$P_n(i) = \frac{e^{V_{in}}}{\sum_{j \in C_n} e^{V_{jn}}} \tag{4}$$
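The link between the type I extreme value assumption and the Softmax form of the selection probability can be checked numerically. The sketch below is illustrative and not part of the method itself: it adds independent Gumbel noise to fixed utilities, counts how often each alternative has the highest total utility, and compares the simulated shares with the closed-form probabilities. The utility values, the number of draws and all names are assumptions.

```python
import math
import random

def softmax(utilities):
    m = max(utilities)
    exps = [math.exp(v - m) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_draw(rng):
    """Standard Gumbel (type I extreme value) draw via the inverse CDF."""
    u = rng.random()
    while u <= 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def simulated_shares(utilities, n_draws=50_000, seed=0):
    """Share of draws in which each alternative maximizes V + epsilon."""
    rng = random.Random(seed)
    counts = [0] * len(utilities)
    for _ in range(n_draws):
        total_utility = [v + gumbel_draw(rng) for v in utilities]
        counts[total_utility.index(max(total_utility))] += 1
    return [c / n_draws for c in counts]

# Hypothetical observable utilities of (car, rail transit, bus)
utilities = [1.0, 0.5, -0.2]
shares = simulated_shares(utilities)
probs = softmax(utilities)
```

With enough draws the simulated shares approach the Softmax probabilities, which is the property that lets the output layer of the network be read as a logit selection probability.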

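The preference parameter β is typically estimated by maximizing the log-likelihood function, given by:

$$LL(\beta) = \sum_{n} \sum_{i \in C_n} y_{in} \log P_n(i) \tag{5}$$

wherein $P_n(i)$ is the selection probability of equation (4) and $y_{in}$ equals 1 if individual n selected mode i and 0 otherwise.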

and 4, setting a loss function as a classification Cross Entropy (CE), and carrying out model training and prediction by calling a Tensorflow or Pytorch self-contained optimizer.

Wherein the Softmax activation function is defined as in equation (6). For the probability that individual n selects each travel pattern, equation (4) may be converted to equation (6).

The definition of the cross-class entropy is as in equation (7):

when all individuals n are summed, the minimization equation (7) equals the maximization equation (5).
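The equivalence can be seen in one line. As notational assumptions for this illustration, $y_{in}$ is taken to be a one-hot indicator of the observed choice, $P_n(i)$ denotes the selection probability $\sigma_i(V_n)$, and $LL$ denotes the log-likelihood $\sum_n \sum_{i \in C_n} y_{in} \log P_n(i)$:

```latex
\sum_{n} H_n(\sigma, y_n)
  = -\sum_{n} \sum_{i \in C_n} y_{in} \log \sigma_i(V_n)
  = -\sum_{n} \sum_{i \in C_n} y_{in} \log P_n(i)
  = -LL
```

The summed loss is therefore the negative log-likelihood, so parameter values that minimize one maximize the other.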

The following experiment is carried out according to the travel mode selection prediction and analysis method; the experimental steps are as follows:

step 1, assuming that a stated preference survey yields a total of 20000 records of resident trip data; the original record samples are described in Table 1.

Table 1 Description of original record samples

The alternative set is defined as (car, rail transit, bus). The data are preliminarily preprocessed by the data preprocessing module, and the training set and the test set are divided in a ratio of 8:2.

The data preprocessing mainly comprises the following steps:

Data cleaning: unreasonable data, such as records with missing values or outliers, are deleted. Records whose age, gender, or income is missing or extremely abnormal are deleted, as are records whose selected travel mode is "other".

Data encoding: the trip purpose, travel mode, and gender are converted into numeric data and encoded. For example, the trip purpose "work/school" is coded 100000 and the trip purpose "going home (handling business)" is coded 000001; the travel mode "car" is coded 100 and the travel mode "bus" is coded 001; male is coded 1 and female is coded 0.

Feature extraction from the original data: the extracted features are the travel times TT_CAR, TT_RAIL, and TT_BUS of each travel mode; the travel costs TC_CAR, TC_RAIL, and TC_BUS of each travel mode; the travel distance DIST; gender SEX; age AGE; and income INC. This gives a total of 7 travel characteristics and 3 individual socioeconomic characteristics. The travel mode MODE is the label.
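The codes quoted in the data encoding step are one-hot codes, which can be generated with a small helper. The sketch below is illustrative: the helper name is hypothetical, and the position of rail transit (code 010) is inferred from the order of the alternative set rather than stated in the text.

```python
def one_hot(index, size):
    """Return a one-hot code string with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return "".join("1" if k == index else "0" for k in range(size))

# Order assumed from the alternative set (car, rail transit, bus)
TRAVEL_MODES = ["car", "rail transit", "bus"]
mode_codes = {m: one_hot(k, len(TRAVEL_MODES)) for k, m in enumerate(TRAVEL_MODES)}

# Six trip purposes: the first is coded 100000 and the last 000001
first_purpose_code = one_hot(0, 6)
last_purpose_code = one_hot(5, 6)
```

Encoding categories this way avoids implying an order between them, which a single integer code would do.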

step 2, carrying out data reprocessing on the preprocessed travel data by utilizing the data reprocessing module to obtain the main input variables and the auxiliary input variables respectively. The strategy adopted when setting δ is as follows: first set δ to a number very close to 0 (for example, δ=0.00001), then set δ to a number very close to 1 (for example, δ=0.99999); judge how well the module processes the data from the visual output of the data reprocessing module and choose the value of the hyperparameter δ accordingly. Here it is assumed that the visual output indicates that the module performs better when δ=0.5.

The feature importance output obtained is assumed to be shown in fig. 3. Thus, TT_CAR, TT_RAIL, TT_BUS, TC_CAR, TC_RAIL, TC_BUS, DIST are selected as the main input variables; SEX, AGE and INC are auxiliary input variables.

step 3, inputting the travel data obtained through the data reprocessing module into the knowledge-data driven automatic fusion module; a framework diagram of the module is shown in FIG. 4.

step 4, setting the loss function to the categorical cross-entropy (CE), carrying out model training and prediction by calling Adam, a built-in optimizer of TensorFlow, setting the evaluation metric to Accuracy, and adopting an early stopping strategy during training to prevent overfitting.
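The early stopping strategy can be illustrated independently of any framework. The sketch below is a minimal stand-in, not the callback shipped with a deep learning library: training stops once the validation loss has failed to improve for `patience` consecutive epochs. The parameter names and the loss values are assumptions.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop.

    If the patience is never exhausted, the last epoch is returned.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

# Hypothetical validation losses: improvement stalls after epoch 3
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.60, 0.61, 0.65]
stop = early_stopping_epoch(val_losses, patience=3)
```

Here the best loss occurs at epoch 3 and training stops at epoch 6, after three epochs without improvement; restoring the weights from the best epoch is a common companion to this rule.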

step 5, starting a TensorBoard server, and performing visual analysis on the running process and the model results of the knowledge-data driven automatic fusion module. An example TensorBoard interface is shown in FIG. 5.

The whole process can carry out training solution and prediction on travel mode selection behaviors by calling a Python-based deep learning framework, and can carry out visual analysis on the training process and prediction results of the model by starting a TensorBoard server. The threshold of related problem learning and communication is reduced by using the open-source universal solving and analyzing tool, so that beginners can learn and reproduce the research results of other people, and the working efficiency of the method user is improved.

In yet another embodiment of the present invention, a selection prediction and analysis system is provided, which can be used to implement a selection prediction and analysis method as described above, and specifically includes: the system comprises a data preprocessing module, a data reprocessing module, a knowledge-data driven automatic fusion module and a TensorBoard server;

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Those of ordinary skill in the art will appreciate that the embodiments described herein are intended to aid the reader in understanding the practice of the invention, and that the scope of the invention is not limited to such specific statements and embodiments. The method is suitable not only for the field of traffic behavior analysis but also for other application fields involving selection analysis on tabular data. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims

1. A travel mode selection prediction and analysis method is realized on the basis of a travel mode selection prediction and analysis system;

the data driving sub-module main body is an adjustable residual error network and is used for establishing the utility of a data driving part of an observable part of a utility function, improving the prediction performance of a prediction model and simultaneously preventing the problem of network degradation caused by network deepening;

the fusion sub-module is used for adding the knowledge driving part and the data driving part of the observable part of the utility function;

the output layer is used for converting the fusion result of the fusion sub-module into probability and outputting the probability;

2. The travel mode selection prediction and analysis method according to claim 1, wherein: the preprocessing of the travel data in the step 1 comprises the following steps:

data cleaning: deleting unreasonable data such as missing values and abnormal values;

data encoding: converting text-type data into numeric data;

feature extraction: extracting feature variables which can be input into a machine learning model and a discrete selection model from the original data;

data standardization: standardizing data with different dimensions;

3. The travel mode selection prediction and analysis method according to claim 1, wherein: in step 2, the model result of XGBoost in the data reprocessing module is given a weight δ and the model result of the random forest is given a weight (1-δ), and the output includes the prediction result, the feature importance, and the visualization result of the fused model; the models in the data reprocessing module use default parameters, and each model can be tuned individually as appropriate according to the visualization result of the data reprocessing module; the weight δ is a manually adjustable hyperparameter, δ can be adjusted according to the output of the knowledge-data driven automatic fusion module, and setting δ improves the flexibility of the data reprocessing module in processing data; through the data reprocessing module, the main input variables and the auxiliary input variables are selected automatically by the program according to the actual data, which reduces the risk of prior errors caused by manual mis-specification.

4. The travel mode selection prediction and analysis method according to claim 1, wherein: in step 3, the knowledge-data driven automatic fusion module introduces a multi-input wide-deep neural network architecture, the main input variable data obtained in step 2 are input to the knowledge driving sub-module, and the auxiliary input variable data are input to the data driving sub-module; the knowledge driving part and the data driving part of the observable part of the utility function are then added in the fusion sub-module, and, under the assumption that the unobservable part of the utility function follows a type I extreme value (Gumbel) distribution, the utility is converted into the predicted selection probability by a Softmax activation function and output.

5. The travel mode selection prediction and analysis method according to claim 4, wherein: the definition of the Softmax activation function is as follows:

$$(\sigma(V_n))_i = \frac{e^{V_{in}}}{\sum_{j \in C_n} e^{V_{jn}}}$$

6. The travel mode selection prediction and analysis method according to claim 1, wherein: in step 4, the definition of the categorical cross-entropy is as follows:

$$H_n(\sigma, y_n) = -\sum_{i \in C_n} y_{in} \log \sigma_i(V_n)$$

wherein $H_n(\sigma, y_n)$ is the categorical cross-entropy, which measures the similarity between the predicted probability and the true probability, $y_{in}$ is the true probability that individual n selects mode i, $C_n$ is the set of all alternatives, and $\sigma_i(V_n)$ represents the probability that individual n selects mode i.