CN106779214B - A Multi-factor Fusion Civil Aviation Passenger Travel Prediction Method Based on Theme Model - Google Patents

Publication number
CN106779214B
Authority
CN
China
Prior art keywords
passenger
airline
passengers
travel
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611159984.3A
Other languages
Chinese (zh)
Other versions
CN106779214A (en
Inventor
刘杰
王嫄
冯丽娜
陈会朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University
Priority to CN201611159984.3A
Publication of CN106779214A
Application granted
Publication of CN106779214B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40 Business processes related to the transportation industry

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A multi-factor fusion civil aviation passenger travel prediction method based on a topic model. First, an association graph among passengers is built and topic modeling is performed on passenger preferences, producing the Passenger Graph based Travel Topic Model (PGTTM), which enriches topic information and effectively addresses the sparsity of civil aviation data. Second, a multi-factor fusion prediction framework is built with a Bayesian probability model, fusing route popularity, the passenger route preference obtained from PGTTM, passenger loyalty, and airline market share to predict passengers' future travel accurately. The invention can effectively predict the airlines and routes passengers will fly in the future, provide decision support for aviation and related industries, and support personalized service for passengers.

Description

A Multi-Factor Fusion Civil Aviation Passenger Travel Prediction Method Based on a Topic Model

Technical Field

The invention belongs to the field of computer application technology and relates to data mining and civil aviation data analysis, in particular to a multi-factor fusion civil aviation passenger travel prediction method based on a topic model.

Background

Rising living standards and the growth of the Internet have led civil aviation passenger booking systems to accumulate large volumes of booking data. The data are massive, sparse, and long-tailed, which makes civil aviation data analysis challenging. Analyzing passenger travel characteristics and predicting future travel behavior from these data is one of the most important tasks in civil aviation data analysis. Research on civil aviation passenger analysis, both in China and abroad, is at an early stage, and passenger travel prediction in particular has received little study.

Among analyses of civil aviation data, Maalouf et al. applied cluster analysis and association rules to real airline frequent-flyer data and proposed recommendations and improvement strategies for customer relationship management [1]. Wang Chaoen et al. combined questionnaire surveys with statistical methods to analyze the consumption motives, airline preferences, and purchasing behavior of civil aviation passengers in Changchun [2]. Feng et al. built a heterogeneous information network over civil aviation data and used random walks to discover the value of infrequent passengers [3]. Etzioni et al. studied the relationship between time and fare and used a multi-strategy data mining algorithm to advise passengers on the best time to buy tickets [4].

Among topic models, LDA (Latent Dirichlet Allocation) performs well at modeling topics in text and is readily extensible [5]. For example, Rosen-Zvi et al. proposed the ATM (Author-Topic Model) based on LDA, which models topics over authors, documents, and words jointly [6]. Blei et al. proposed a supervised LDA model for text classification that adds the document labels of the training corpus to LDA as observed values [7]. Extended topic models and LDA have also been applied to recommendation: Liu et al. explicitly incorporated the implicit features of travel package data into a topic model and proposed a personalized travel recommendation method [8]. Tan et al. represented passenger information as feature-value pairs, learned passengers' latent interest distributions with a topic model, and combined the result with collaborative filtering to recommend travel packages [9].

Social relationships among passengers also help modeling. Wang Kunkun et al. built a co-travel network and proposed a seat-preference modeling method for civil aviation passengers that combines individual preference with relationship preference [10]. Zhou Yuanwei et al. proposed a semi-supervised relationship classification algorithm based on an information graph, which identifies passenger relationships more accurately and supports targeted, high-quality service [11].

It is therefore worth applying topic models to the analysis and prediction of civil aviation passenger travel, both to discover latent topic distributions and to cope with the data volume. It is equally worth integrating the relationships among passengers into the topic modeling, which enriches topic information and eases sparsity, and so improves the model. In addition, a probabilistic framework that fuses several factors influencing travel can be expected to improve prediction further.

References:

[1] Maalouf L, Mansour N. Mining airline data for CRM strategies. In Proceedings of the 7th WSEAS International Conference on Simulation, Modeling and Optimization, Beijing, China, pages 345-350, 2007.

[2] Wang Chaoen. Analysis of the characteristics and behavior of Changchun civil aviation passengers [D]. Jilin University, 2010.

[3] Feng X, Xu B Y, Lu M, et al. Infrequent Passenger Value Discovery by Random Walk on Passenger-route Heterogeneous Network. Journal of Computational and Theoretical Nanoscience, 2(1):10-17, 2015.

[4] Etzioni O, Tuchinda R, et al. To buy or not to buy: mining airfare data to minimize ticket purchase price [C]//ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, USA, August 2003: 119-128.

[5] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation [J]. Journal of Machine Learning Research, 2003, 3:993-1022.

[6] Rosen-Zvi M, Griffiths T, Steyvers M, et al. The author-topic model for authors and documents [C]//Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2004: 487-494.

[7] Blei D M, McAuliffe J D. Supervised Topic Models [J]. Advances in Neural Information Processing Systems, 2010, 3:327-332.

[8] Liu Q, Ge Y, Li Z, et al. Personalized Travel Package Recommendation [C]//IEEE International Conference on Data Mining. IEEE Computer Society, 2011: 407-416.

[9] Tan C, Liu Q, Chen E, et al. Object-Oriented Travel Package Recommendation [J]. ACM Transactions on Intelligent Systems & Technology, 2014, 5(3):1-26.

[10] Wang Kunkun. Modeling and application of civil aviation passenger seat preference [D]. Beijing Jiaotong University, 2015.

[11] Zhou Yuanwei. Design and implementation of a civil aviation social network relationship classification algorithm [D]. Beijing Jiaotong University, 2013.

Summary of the Invention

The purpose of the invention is to address the massive volume, sparsity, and long-tailed nature of civil aviation passenger booking data, together with the diversity of factors that affect travel. To predict accurately the airlines and routes that passengers will fly in the future, it provides a multi-factor fusion civil aviation passenger travel prediction method based on a topic model.

The invention uses a topic model to model passengers together with the airlines and routes they choose. By introducing a constructed passenger association graph, it proposes the Passenger Graph based Travel Topic Model (PGTTM), which yields passenger preferences for routes and airlines, enriches topic information, and addresses the sparsity of civil aviation data.

A Bayesian probability model is then introduced to fuse four factors: route popularity, the passenger route preference obtained from PGTTM, passenger loyalty, and airline market share. The resulting multi-factor fusion prediction framework predicts and recommends the airlines and routes passengers will fly more accurately. This is the main content of the multi-factor fusion civil aviation passenger travel prediction method based on a topic model.

Technical Solution of the Invention

A multi-factor fusion civil aviation passenger travel prediction method based on a topic model, the method comprising:

Step 1): Build the passenger association graph travel topic model. This comprises building the passenger association graph, topic-modeling passenger travel preferences, and finally obtaining the passenger association graph travel topic model:

Step 1.1): Build the passenger association graph;

Building the passenger association graph means computing the degree of association between passengers, which is determined jointly by route co-occurrence and attribute co-occurrence. Route co-occurrence is determined by the number of routes two passengers have in common. Attribute co-occurrence indicates whether the passengers' age, gender, average discount, and average mileage are the same. The age, average discount, and average mileage categories are obtained with a variance-based segmentation method;

Step 1.2): Topic-model passenger travel preferences;

A topic model is applied to passengers and the routes and airlines they fly, to discover and estimate the latent topic distributions of passengers, routes, and airlines. Combining a passenger's latent topic distribution with the latent topic distributions of airlines and routes gives the passenger's travel preference for each airline and route;

Step 1.3): Build the passenger association graph travel topic model;

The passenger association graph from step 1.1) is added to the topic modeling of step 1.2) to build the Passenger Graph based Travel Topic Model (PGTTM). When PGTTM assigns a topic to each of a passenger's routes and airlines, the topic may come not only from the passenger but also from the passenger's associated passengers. This enriches topic information, improves prediction, and eases the sparsity of civil aviation passenger travel data;

Step 2): Build the calculation models for route popularity, passenger loyalty, and airline market share. This prior knowledge supports accurate prediction later:

Step 2.1): Compute route popularity;

For route popularity, first count the number of times the route was flown by all passengers and the sum, over every route, of the number of times it was flown by all passengers. Route popularity is computed from these two counts;

Step 2.2): Compute passenger loyalty to an airline;

For passenger loyalty, first count the number of times the passenger flew with the airline and the sum, over every airline, of the number of times the passenger flew with it. After smoothing, the passenger's loyalty to the airline is computed from these two counts;

Step 2.3): Compute the airline's market share on a route;

For airline market share, first count the number of times the airline and the route, taken together as a pair, were flown by all passengers, and the number of times the route was flown by all passengers regardless of airline. The airline's market share on the route is computed from these two counts;

Step 3): Fuse route popularity, passenger route preference, passenger loyalty, and airline market share with a Bayesian probability model to build a multi-factor fusion prediction framework, and predict the routes and airlines passengers will choose in the future:

Step 3.1): Multi-factor fusion based on the Bayesian probability model;

A Bayesian probability model is built from the passenger route preference obtained by PGTTM in step 1), the route popularity from step 2.1), the passenger loyalty from step 2.2), and the airline market share from step 2.3). Fusing these four factors models passenger travel behavior better;

Step 3.2): Multi-factor prediction based on the Bayesian probability model;

For each passenger and each airline-route pair, the Bayesian probability model function is used to compute the probability that the passenger flies that pair. For each passenger, the airline-route pairs with the highest probabilities are selected for prediction and recommendation.
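The three priors of step 2) and the final selection of step 3.2) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the record layout `(passenger, airline, route)`, the function names, and the additive (Laplace) smoothing used for loyalty are all assumptions, since the text says only that loyalty is smoothed. The fusion function itself is not given in this section, so `top_k` simply ranks whatever scores it receives.

```python
from collections import Counter


def route_popularity(records):
    """Step 2.1: share of all trips that were flown on each route."""
    counts = Counter(route for _, _, route in records)
    total = sum(counts.values())
    return {route: n / total for route, n in counts.items()}


def passenger_loyalty(records, airlines, smoothing=1.0):
    """Step 2.2: smoothed share of a passenger's trips flown with each airline.

    Additive smoothing is an assumption; the text does not name the scheme.
    """
    pair_counts = Counter((p, a) for p, a, _ in records)
    trip_counts = Counter(p for p, _, _ in records)
    extra = smoothing * len(airlines)
    return {
        (p, a): (pair_counts[(p, a)] + smoothing) / (trip_counts[p] + extra)
        for p in trip_counts
        for a in airlines
    }


def market_share(records):
    """Step 2.3: share of each route's trips that were flown with each airline."""
    pair_counts = Counter((a, r) for _, a, r in records)
    route_counts = Counter(r for _, _, r in records)
    return {(a, r): n / route_counts[r] for (a, r), n in pair_counts.items()}


def top_k(scores, k):
    """Step 3.2: the k airline-route pairs with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical booking records: (passenger, airline, route).
records = [
    ("p1", "CA", "PEK-SHA"),
    ("p1", "CA", "PEK-CAN"),
    ("p1", "MU", "PEK-SHA"),
    ("p2", "MU", "PEK-SHA"),
]
popularity = route_popularity(records)
loyalty = passenger_loyalty(records, airlines=["CA", "MU"])
share = market_share(records)
```

Each prior is a plain ratio of counts, so all three can be computed in a single pass over the training records before any topic modeling is run.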

Advantages and positive effects of the invention:

· Proposes the passenger association graph travel topic model PGTTM

The invention topic-models the travel behavior of civil aviation passengers, discovers the latent topic distributions of passengers and of the airlines and routes they fly, and accurately predicts behavior such as the routes passengers will choose in the future. On this basis it builds and introduces a passenger association graph to obtain PGTTM, which uses similar passengers to enrich topic information, improves prediction accuracy, and addresses the sparsity of civil aviation passenger travel data.

· Proposes a multi-factor fusion prediction framework using a Bayesian probability model function

The invention obtains a multi-factor fusion prediction framework from a Bayesian probability model function. The framework fuses the passenger route preference obtained from PGTTM with prior knowledge of route popularity, passenger loyalty, and airline market share. Compared with baseline methods, it predicts the airlines and routes passengers will choose in the future more accurately.

Description of Drawings

FIG. 1 is a diagram of the overall model system of the invention.

FIG. 2 is the algorithm flow chart of the invention.

Detailed Description

Example 1:

The multi-factor fusion civil aviation passenger travel prediction method based on a topic model is described in detail below with reference to the drawings and a specific implementation.

The invention mainly applies data mining theory and methods to analyze passenger travel behavior in civil aviation data. To ensure that the system runs normally, the implementation requires a computer platform with at least 8 GB of memory, a CPU with at least 4 cores and a clock frequency of at least 2.6 GHz, and a 64-bit operating system of Windows 7 or later, with the required software environment installed: an Oracle database, Java 1.7 or later, and Matlab 2011b or later.

The passenger travel behavior prediction method based on topic-model multi-factor fusion is as follows, and is described with reference to FIG. 1 and FIG. 2.

Step 1): Build the passenger association graph travel topic model PGTTM

Step 1.1): Stage S1.1, data preprocessing and building the passenger association graph;

Step 1.11): Data description and preprocessing

Each record in the passenger booking data contains the passenger's personal information and travel information. Personal information includes the encrypted ID number that uniquely identifies the passenger, the passenger's age, gender, and so on. Travel information includes the airline flown, the departure airport, the arrival airport, the discount, and so on.

After preprocessing operations such as removing low-frequency passengers, duplicate records, and abnormal records, a portion of the historical data is taken as the training set and the remaining data as the test set.
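A minimal sketch of this preprocessing, under stated assumptions: records are tuples `(passenger_id, date, airline, route)` with ISO-format date strings, a passenger counts as low-frequency below a hypothetical `min_trips` threshold, and the training set is everything before a hypothetical cutoff date. The text fixes none of these choices, and abnormal-record removal is omitted because its criteria are not described.

```python
from collections import Counter


def preprocess(records, min_trips=2):
    """Drop exact duplicate records, then drop low-frequency passengers."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:  # keep only the first copy of each record
            seen.add(rec)
            unique.append(rec)
    trips = Counter(rec[0] for rec in unique)
    return [rec for rec in unique if trips[rec[0]] >= min_trips]


def split_by_date(records, cutoff):
    """Earlier records form the training set, later ones the test set."""
    train = [rec for rec in records if rec[1] < cutoff]
    test = [rec for rec in records if rec[1] >= cutoff]
    return train, test


# Hypothetical records: (passenger_id, date, airline, route).
raw = [
    ("p1", "2015-01-03", "CA", "PEK-SHA"),
    ("p1", "2015-01-03", "CA", "PEK-SHA"),  # exact duplicate
    ("p1", "2015-06-01", "MU", "PEK-CAN"),
    ("p2", "2015-02-01", "CA", "PEK-SHA"),  # p2 has a single trip
]
clean = preprocess(raw)
train, test = split_by_date(clean, cutoff="2015-03-01")
```

Duplicates are removed before counting trips so that a repeated record does not keep a one-trip passenger above the threshold.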

Step 1.12): Variance-based segmentation method;

Take age as an example. All ages in the training-set travel records are extracted into a sorted list. Every age from the minimum to the maximum is tried in turn as a split point. For each split point, the weighted average of the variances of the two resulting segments is computed, where the weight of a segment is the number of ages it contains divided by the number of ages before the split. The split age for which this weighted average differs most from the variance before the split is the best split point.
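The split search can be sketched as follows. This is an illustrative version under assumptions the text leaves open: population variance is used, values strictly below the split point go to the left segment, and only values that occur in the data are tried as split points. The name `best_split` is hypothetical.

```python
from statistics import pvariance


def best_split(values):
    """Return (split_point, variance_reduction) for the best two-way split.

    Values below the split point form the left segment and the rest form
    the right segment; that convention is an assumption of this sketch.
    """
    data = sorted(values)
    n = len(data)
    total_var = pvariance(data)
    best_point, best_gain = None, float("-inf")
    # Skip the smallest distinct value so the left segment is never empty.
    for point in sorted(set(data))[1:]:
        left = [v for v in data if v < point]
        right = [v for v in data if v >= point]
        # Weighted average of the two segment variances; weights are sizes.
        weighted = (len(left) * pvariance(left) + len(right) * pvariance(right)) / n
        gain = total_var - weighted
        if gain > best_gain:
            best_point, best_gain = point, gain
    return best_point, best_gain


# Two clearly separated age groups: the best split falls at 60.
ages = [20, 21, 22, 60, 61, 62]
split_age, reduction = best_split(ages)
```

Applying the same function recursively to each segment would give more than two categories, though the text does not say how many segments are used.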

Step 1.13): Build the passenger association graph;

The degree of association between passengers is determined jointly by route co-occurrence and attribute co-occurrence. Counting over the training set obtained in step 1.11) gives a sparse matrix of the number of routes shared by each pair of passengers, and normalizing each column gives the route co-occurrence matrix. Attribute co-occurrence indicates whether two passengers have the same age, gender, average discount, and average mileage after the segmentation of step 1.12). Finally, the passengers with the highest route co-occurrence with a given passenger are taken as that passenger's associated passengers, and the degree of association between the passenger and each associated passenger is the weighted average of their route co-occurrence and attribute co-occurrence. This completes the association graph among passengers.

The route a passenger flies is determined by the departure and arrival airports. The mileage is the distance between the two cities represented by the departure and arrival airports. The price is determined by the mileage and the discount, and the average discount is determined by the passenger's total mileage and total price.
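The graph construction can be sketched as below. Several details are assumptions rather than statements from the text: the co-occurrence count is taken as the number of distinct routes two passengers share, attribute co-occurrence is 1 only when all four binned attributes match, and the number of neighbours `k` and the mixing `weight` are hypothetical parameters. Because the count matrix is symmetric, normalising passenger p's column is the same as dividing p's counts by their sum.

```python
def build_association_graph(routes, attrs, k=2, weight=0.5):
    """Map each passenger to {associated passenger: degree of association}.

    routes: passenger -> set of routes flown.
    attrs:  passenger -> tuple of binned (age, gender, discount, mileage).
    """
    passengers = list(routes)
    graph = {}
    for p in passengers:
        # Number of shared routes with every other passenger (sparse).
        counts = {}
        for q in passengers:
            shared = len(routes[p] & routes[q]) if q != p else 0
            if shared:
                counts[q] = shared
        total = sum(counts.values())
        # Normalised route co-occurrence for passenger p.
        degree = {q: n / total for q, n in counts.items()}
        # Keep the k passengers with the highest route co-occurrence.
        neighbours = sorted(degree, key=degree.get, reverse=True)[:k]
        graph[p] = {
            q: weight * degree[q] + (1 - weight) * float(attrs[p] == attrs[q])
            for q in neighbours
        }
    return graph


# Hypothetical passengers; attributes are already binned.
routes = {"p1": {"A", "B"}, "p2": {"A", "B"}, "p3": {"A"}, "p4": {"C"}}
attrs = {
    "p1": ("young", "F", "low", "short"),
    "p2": ("young", "F", "low", "short"),
    "p3": ("old", "M", "high", "long"),
    "p4": ("old", "M", "high", "long"),
}
graph = build_association_graph(routes, attrs)
```

A passenger who shares no route with anyone, such as `p4` here, ends up with no associated passengers and would fall back to its own topic distribution.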

Step 1.2): Model passenger travel preferences with PGTTM

Step 1.21): Stage S1.21, obtaining the input data;

Suppose the booking records in the training set involve U distinct passengers (distinguished by encrypted ID number), C airlines, and R routes. The ID number, airline, and route fields are extracted from the booking records and replaced by indices, so that the three fields take values 1 to U, 1 to C, and 1 to R respectively. This yields three vectors u, c, and r, each of length N (the number of booking records in the training set), which form the input data. The i-th components of the three vectors state that, in the i-th booking record, passenger u_i flew route r_i with airline c_i (1 ≤ u_i ≤ U, 1 ≤ c_i ≤ C, 1 ≤ r_i ≤ R, i = 1, 2, ..., N).

T is the preset number of topics. z is the topic vector, of length N, and x is the vector of passengers used to generate the topics, also of length N. They relate to u, c, and r component-wise: the topic z_i of the airline c_i and route r_i flown by passenger u_i is assigned by x_i, where x_i may be u_i or one of u_i's associated passengers (1 ≤ z_i ≤ T, 1 ≤ x_i ≤ U, i = 1, 2, ..., N).
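As an illustration of the indexing step, the sketch below turns raw booking records into the vectors u, c, and r with 1-based indices, as in the text. The record layout and the helper name `build_index_vectors` are assumptions; sorting the distinct values is only one way to fix a reproducible numbering.

```python
def build_index_vectors(records):
    """Map (passenger, airline, route) records to 1-based vectors u, c, r."""

    def index_map(values):
        # Sorted for a reproducible numbering; any fixed order would do.
        return {v: i for i, v in enumerate(sorted(set(values)), start=1)}

    pid = index_map(p for p, _, _ in records)
    aid = index_map(a for _, a, _ in records)
    rid = index_map(x for _, _, x in records)
    u = [pid[p] for p, _, _ in records]
    c = [aid[a] for _, a, _ in records]
    r = [rid[x] for _, _, x in records]
    return u, c, r, (len(pid), len(aid), len(rid))


# Hypothetical records: (encrypted passenger id, airline, route).
records = [
    ("id9", "CA", "PEK-SHA"),
    ("id3", "MU", "PEK-SHA"),
    ("id9", "CA", "PEK-CAN"),
]
u, c, r, (U, C, R) = build_index_vectors(records)
```

All three vectors have length N, and position i in each of them refers to the same booking record.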

The process by which a passenger generates each trip in PGTTM is as follows:

(1) Each passenger u has a topic distribution, and each topic t has an airline distribution and a route distribution. The topic distribution of passenger u is θ_u ~ Dirichlet(α), the airline distribution of topic t is φ_t ~ Dirichlet(μ), and the route distribution of topic t is [Figure GDA0002549423010000061] ~ Dirichlet(β) (u = 1, 2, ..., U; t = 1, 2, ..., T; θ_u is a T-dimensional vector, φ_t is a C-dimensional vector, and [Figure GDA0002549423010000062] is an R-dimensional vector; α, μ, β are the parameters of the Dirichlet distributions).

(2) Passenger u_i first samples a passenger s, then s samples a travel topic, and finally the airline and route are chosen according to that topic. That is, topic z_i ~ Multinomial(θ_s), airline c_i ~ [Figure GDA0002549423010000063], and route r_i ~ [Figure GDA0002549423010000064]. In PGTTM, s may be u_i itself or one of u_i's associated passengers (1 ≤ u_i ≤ U, 1 ≤ z_i ≤ T, 1 ≤ c_i ≤ C, 1 ≤ r_i ≤ R, i = 1, 2, ..., N).
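The generative story can be simulated with the standard library alone. This sketch takes the source passenger s as given and draws a topic, an airline, and a route. The name `psi` for the topic-route distribution is hypothetical, as are the helper names and the symmetric Dirichlet parameters; indices are 0-based here, whereas the text counts from 1.

```python
import random


def dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalising Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def draw(weights, rng):
    """Draw one 0-based index from a multinomial with the given weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def generate_trip(s, theta, phi, psi, rng):
    """Given source passenger s, sample a topic, then an airline and a route."""
    z = draw(theta[s], rng)  # topic from the passenger-topic distribution of s
    c = draw(phi[z], rng)    # airline from the topic-airline distribution
    r = draw(psi[z], rng)    # route from the topic-route distribution (psi is a hypothetical name)
    return z, c, r


rng = random.Random(0)
U, T, C, R = 3, 2, 4, 5
theta = [dirichlet(0.5, T, rng) for _ in range(U)]  # one row per passenger
phi = [dirichlet(0.5, C, rng) for _ in range(T)]    # one row per topic
psi = [dirichlet(0.5, R, rng) for _ in range(T)]    # one row per topic
z, c, r = generate_trip(0, theta, phi, psi, rng)
```

Inference runs this story in reverse: the passenger, airline, and route of each record are observed, and the topic and source passenger are the hidden variables to be sampled.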

The passenger-topic distribution θ (U×T), the topic-airline distribution φ (T×C), and the topic-route distribution [Figure GDA0002549423010000065] (T×R) are the parameters PGTTM must infer. That is, the topic distributions are inferred in reverse from the observed passengers u and their travel behavior c and r.

Step 1.22): Stage S1.22, initialization;

The vector x of passengers used to assign topics is initially set equal to the vector u of traveling passengers. The topic vector z is then initialized at random over the T topics (that is, 1 ≤ z_i ≤ T, i = 1, 2, ..., N).

Let C^UT be a U×T matrix giving the number of times each passenger is assigned each topic, counted from the vectors x and z. Let C^TC be a T×C matrix giving the number of times each topic is assigned to each airline, counted from the vectors z and c. Let C^TR be a T×R matrix giving the number of times each topic is assigned to each route, counted from the vectors z and r. These three matrices are the topic count matrices of passengers, airlines, and routes respectively.

Set the maximum number of iterations NN. Construct a vector order of length N whose values are the integers 1 to N in randomly shuffled order.
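The initialization can be sketched as follows, assuming 1-based index vectors u, c, r as defined earlier. The function name and the fixed seed are assumptions; the visiting order is stored 0-based for convenient list indexing, where the text uses 1 to N.

```python
import random


def initialise(u, c, r, U, C, R, T, seed=0):
    """Set x = u, draw random topics, and build the three count matrices."""
    rng = random.Random(seed)
    N = len(u)
    x = list(u)                                # topic source starts as the traveller
    z = [rng.randint(1, T) for _ in range(N)]  # random topic in 1..T per record
    C_UT = [[0] * T for _ in range(U)]         # passenger x topic counts
    C_TC = [[0] * C for _ in range(T)]         # topic x airline counts
    C_TR = [[0] * R for _ in range(T)]         # topic x route counts
    for i in range(N):
        C_UT[x[i] - 1][z[i] - 1] += 1
        C_TC[z[i] - 1][c[i] - 1] += 1
        C_TR[z[i] - 1][r[i] - 1] += 1
    order = list(range(N))                     # 0-based visiting order
    rng.shuffle(order)
    return x, z, C_UT, C_TC, C_TR, order


# Hypothetical input: three booking records, two passengers/airlines/routes.
u, c, r = [2, 1, 2], [1, 2, 1], [2, 2, 1]
x, z, C_UT, C_TC, C_TR, order = initialise(u, c, r, U=2, C=2, R=2, T=3)
```

Every record contributes exactly one count to each matrix, so each matrix sums to N after initialization.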

Step 1.23): Stage S1.23, updating the topic count matrices with the topic assignment of the current passenger, airline, and route excluded;

Excluding the component of the topic vector z whose index is order_i, update the three topic count matrices: the entries [Figure GDA0002549423010000071] [Figure GDA0002549423010000072] are each decreased by 1.

Step 1.24) Stage S1.24: sample, for the current airline and route, the passenger used to generate a new topic;

A Bernoulli distribution with parameter τ decides whether the topic resampled for the current airline [Figure GDA0002549423010000073] and route [Figure GDA0002549423010000074] is generated by the current passenger [Figure GDA0002549423010000075] or by one of the passengers associated with [Figure GDA0002549423010000076] in the association graph. Which associated passenger of [Figure GDA0002549423010000077] generates it is decided by a multinomial distribution whose parameters are the association degrees between that passenger and its associated passengers. Let the sampled passenger be s; [Figure GDA0002549423010000078] is the sampling probability, which depends on the parameters of both distributions.
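The two-stage draw described in step 1.24) might look like the sketch below. The text does not say which Bernoulli outcome τ is attached to, so the code assumes that τ is the probability of keeping the current passenger; the function name, argument names, and the handling of a passenger with no associated passengers are likewise illustrative assumptions.

```python
import numpy as np

def sample_generating_passenger(current, neighbors, weights, tau, rng):
    """Return (s, probability of s) for one booking record.

    ASSUMPTION: with probability tau the current passenger is kept;
    otherwise an associated passenger is drawn with probability
    proportional to its association degree (multinomial draw).
    """
    neighbors = np.asarray(neighbors)
    if len(neighbors) == 0:          # isolated passenger: nothing else to draw
        return current, 1.0
    if rng.random() < tau:           # Bernoulli(tau) outcome: keep current passenger
        return current, tau
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                  # normalise association degrees
    k = rng.choice(len(neighbors), p=w)
    return int(neighbors[k]), (1.0 - tau) * w[k]
```

The returned probability is the product of the Bernoulli and multinomial factors, which matches the statement that the sampling probability depends on the parameters of both distributions.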

Step 1.25) Stage S1.25: reassign a new topic to the current airline and route using the Gibbs sampling formula;

Using the Gibbs sampling formula, compute the probability that the new topic reassigned by passenger s to the current airline [Figure GDA0002549423010000079] and current route [Figure GDA00025494230100000710] is t (t = 1, 2, ..., T). The formula is as follows:

Figure GDA00025494230100000711

The formula gives the probability of sampling passenger s for the current passenger and sampling the new topic t for the current airline and route. Here, a vector subscripted with -order_i excludes the component with index order_i; [Figure GDA00025494230100000712] is the number of times passenger s is assigned topic t; [Figure GDA00025494230100000713] is the number of times topic t is assigned to airline [Figure GDA00025494230100000714]; [Figure GDA00025494230100000715] is the number of times topic t is assigned to route [Figure GDA00025494230100000716]; and [Figure GDA00025494230100000717] is the probability, obtained in step 1.24), of sampling passenger s given [Figure GDA00025494230100000718].

Finally, a multinomial distribution parameterized by these T probability values is formed, and a new topic, denoted topic, is sampled from it.
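The patent's Gibbs formula survives only as an image, so the following Python sketch does not reproduce it. It assumes the usual collapsed Gibbs weights of an author-topic style model with symmetric Dirichlet smoothing; the smoothing values `a`, `b`, `g` and the function name are hypothetical. Factors that do not depend on t (the probability of sampling s and the passenger-side denominator) are dropped because they cancel when the T weights are normalized.

```python
import numpy as np

def sample_topic(s, c_i, r_i, C_UT, C_TC, C_TR, rng, a=0.1, b=0.01, g=0.01):
    """Draw a topic for the record (c_i, r_i) generated by passenger s.

    ASSUMPTION: standard smoothed count ratios; a, b, g are hypothetical
    Dirichlet hyperparameters. The count matrices must already exclude
    the current record (step 1.23).
    """
    C = C_TC.shape[1]
    R = C_TR.shape[1]
    w = ((C_UT[s] + a)
         * (C_TC[:, c_i] + b) / (C_TC.sum(axis=1) + C * b)
         * (C_TR[:, r_i] + g) / (C_TR.sum(axis=1) + R * g))
    p = w / w.sum()                  # the T probability values
    topic = int(rng.choice(len(p), p=p))
    return topic, p
```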

Step 1.26) Stage S1.26: update the vector of topic-generating passengers and the topic vector;

Following step 1.24), set [Figure GDA0002549423010000081] in x to s; following step 1.25), set [Figure GDA0002549423010000082] in z to topic.

Step 1.27) Stage S1.27: update the three topic count matrices;

After the vector of topic-generating passengers and the topic vector have been updated in step 1.26), the entries
Figure GDA0002549423010000083
are each incremented by 1.
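The decrement of step 1.23) and the increment of step 1.27) touch the same three matrix entries, so one helper can serve both. The sketch below assumes NumPy arrays and 0-based indices; the helper name `update_counts` is illustrative and not from the original.

```python
import numpy as np

def update_counts(C_UT, C_TC, C_TR, x_i, z_i, c_i, r_i, delta):
    """Add delta to the three count entries of one record, in place.

    delta = -1 removes the record's current assignment (step 1.23);
    delta = +1 records the new assignment (step 1.27).
    """
    C_UT[x_i, z_i] += delta   # passenger x_i assigned topic z_i
    C_TC[z_i, c_i] += delta   # topic z_i assigned to airline c_i
    C_TR[z_i, r_i] += delta   # topic z_i assigned to route r_i
```

Between the two calls, x_i and z_i change to the newly sampled passenger and topic, so the decrement and the increment generally hit different entries.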

Step 1.28) Stage S1.28: after the iterations finish, compute the passenger-topic, topic-airline, and topic-route distributions;

With the iteration count running from 1 to NN as the outer loop and i running from 1 to N as the inner loop, repeatedly resample the passenger that generates each topic and the topic assigned to each airline and route; that is, repeat steps 1.23) to 1.27). After the iterations finish, the passenger-topic distribution θ, the topic-airline distribution φ, and the topic-route distribution [Figure GDA0002549423010000084] are obtained from the following formulas:

Figure GDA0002549423010000085

Figure GDA0002549423010000086

Figure GDA0002549423010000087

where u = 1, 2, ..., U, c = 1, 2, ..., C, r = 1, 2, ..., R, and t = 1, 2, ..., T.
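The three estimation formulas are only available as images. A common way to turn topic count matrices into distributions is row-wise normalization with additive smoothing, and the sketch below assumes that form; the smoothing constants `a`, `b`, `g` and all names are hypothetical and may differ from the patent's formulas.

```python
import numpy as np

def estimate_distributions(C_UT, C_TC, C_TR, a=0.1, b=0.01, g=0.01):
    """Return (theta, phi, psi) as row-stochastic matrices.

    ASSUMPTION: each row of a count matrix is smoothed by a constant
    prior and normalised to sum to 1.
    """
    def normalise(counts, prior):
        counts = np.asarray(counts, dtype=float)
        denom = counts.sum(axis=1, keepdims=True) + counts.shape[1] * prior
        return (counts + prior) / denom

    theta = normalise(C_UT, a)   # passenger-topic, U x T
    phi = normalise(C_TC, b)     # topic-airline,  T x C
    psi = normalise(C_TR, g)     # topic-route,    T x R
    return theta, phi, psi
```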

Step 1.29) Stage S1.29: compute each passenger's preference for each route;

PGTTM models passengers' preferences for airlines and routes. For example, P(u|r) denotes passenger u's preference for route r, which is also how strongly the route attracts the passenger. It is calculated as follows:

Figure GDA0002549423010000088

where u = 1, 2, ..., U and r = 1, 2, ..., R.

Step 2): calculate route popularity, passenger loyalty, and airline market share:

Step 2.1) Stage S2.1: calculate route popularity;

Route popularity, denoted P(r), is the probability that a passenger chooses route r when traveling. The formula is as follows:

Figure GDA0002549423010000089

where count(r) is the number of times route r appears in the 2010 passenger booking records, r = 1, 2, ..., R.
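Assuming the formula in the figure is the plain relative frequency that the claim describes (the count for one route divided by the sum of the counts over all routes), route popularity could be computed as in this sketch. The function name and the route code "PEK-SHA" in the usage note are illustrative only.

```python
from collections import Counter

def route_popularity(routes):
    """Map each route to count(r) divided by the total number of records.

    routes: iterable with one route identifier per booking record.
    """
    counts = Counter(routes)
    total = sum(counts.values())
    return {r: n / total for r, n in counts.items()}
```

For the records `["CTU-CAN", "CTU-CAN", "PEK-SHA", "CTU-CAN"]`, the route CTU-CAN accounts for three of the four records.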

Step 2.2) Stage S2.2: calculate passenger loyalty;

Passenger loyalty, denoted P(c|u), is the probability that passenger u chooses airline c when traveling. The formula is as follows:

Figure GDA0002549423010000091

where count(u,c) is the number of times passenger u chose airline c in the 2010 passenger booking records, c = 1, 2, ..., C, u = 1, 2, ..., U.
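The claim states that passenger loyalty is computed from count(u,c) and the passenger's total count "after smoothing", but the smoothing itself is only shown in the figure. The sketch below therefore assumes additive (Laplace-style) smoothing with a hypothetical constant `lam`; the function and argument names are illustrative as well.

```python
from collections import Counter

def passenger_loyalty(records, airlines, lam=1.0):
    """Return {(u, c): P(c|u)} for every seen passenger and every airline.

    records: list of (passenger, airline) pairs, one per booking record.
    ASSUMPTION: additive smoothing with constant lam, so airlines a
    passenger never used still receive a small probability.
    """
    pair = Counter(records)
    per_u = Counter(u for u, _ in records)
    C = len(airlines)
    return {(u, c): (pair[(u, c)] + lam) / (per_u[u] + lam * C)
            for u in per_u for c in airlines}
```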

Step 2.3) Stage S2.3: calculate airline market share;

Airline market share, denoted P(c|r), is the probability that route r is a route operated under airline c. The formula is as follows:

Figure GDA0002549423010000092

where count(c,r) is the number of 2010 passenger booking records in which airline c and route r appear together, c = 1, 2, ..., C, r = 1, 2, ..., R.
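Following the claim's description (the count of the airline-route pair divided by the count of the route regardless of airline), market share can be sketched as below. This assumes no smoothing is applied, which the figure may or may not confirm; the function name and the airline code "311" used in the usage note are hypothetical.

```python
from collections import Counter

def market_share(records):
    """Return {(c, r): count(c, r) / count(r)} for the observed pairs.

    records: list of (airline, route) pairs, one per booking record.
    """
    pair = Counter(records)
    per_route = Counter(r for _, r in records)
    return {(c, r): n / per_route[r] for (c, r), n in pair.items()}
```

With records `[("290", "CTU-CAN"), ("290", "CTU-CAN"), ("311", "CTU-CAN")]`, airline 290 holds two thirds of the CTU-CAN records.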

Step 3) Stage S3: introduce a Bayesian probability model, build a multi-factor fusion prediction framework, calculate the probability that a passenger takes each airline and route, and make predictions and recommendations:

Step 3.1): use the Bayesian probability model to build the multi-factor fusion prediction framework;

The passenger route preference obtained by PGTTM in step 1) and the route popularity, passenger loyalty, and airline market share from step 2) are fused with a Bayesian probability model to form the multi-factor fusion prediction framework. The Bayesian probability model used in the present invention is derived as follows:

First, for a fixed passenger u, P(u) is a constant, so

Figure GDA0002549423010000093

Next, since

P(r,c,u)=P(r)*P(u|r)*P(c|u,r)≈P(r)*P(u|r)*[αP(c|u)+(1-α)P(c|r)],

the required Bayesian probability function is:

logP(r,c|u)∝log{P(r)*P(u|r)*[αP(c|u)+(1-α)P(c|r)]}

Here, P(r,c|u) is the probability that passenger u chooses route r under airline c, and α is a settable parameter. The logarithm is taken on both sides so that the computed values do not become too small.

This last formula is the required Bayesian probability model and also the multi-factor fusion prediction framework. It fuses route popularity P(r), passenger route preference P(u|r), passenger loyalty P(c|u), and airline market share P(c|r). (c = 1, 2, ..., C, r = 1, 2, ..., R, u = 1, 2, ..., U).
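The fused score can be evaluated directly from the four factors. The Python sketch below computes the right-hand side of the log formula as a sum of logarithms; the function name, the argument names, and the default α = 0.5 are illustrative assumptions, since the text only says that α is settable.

```python
import math

def fusion_log_score(p_r, p_u_given_r, p_c_given_u, p_c_given_r, alpha=0.5):
    """log{ P(r) * P(u|r) * [alpha*P(c|u) + (1-alpha)*P(c|r)] }.

    All probabilities must be strictly positive; alpha in [0, 1] weights
    passenger loyalty against airline market share (default is assumed).
    """
    mix = alpha * p_c_given_u + (1.0 - alpha) * p_c_given_r
    # Summing logs avoids underflow from multiplying small probabilities.
    return math.log(p_r) + math.log(p_u_given_r) + math.log(mix)
```

Because the score is used only to rank airline-route pairs for one passenger, the constant dropped by the proportionality does not affect the ordering.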

Step 3.2): predict the airlines and routes that passengers will choose in the future;

Using the multi-factor prediction framework of step 3.1), and supposing the training set contains W airline-route word pairs in total, compute for each passenger u the probability of taking each airline-route pair. Sort the computed values in descending order and take the K pairs with the largest values (Top-K) as the predictions to recommend. Comparing the predictions with the test set gives the prediction accuracy.

For example, for passenger 17464755 (an encrypted ID-card number), substitute the pair (c, r) formed by airline 290 (a code standing in for the real name) and route CTU-CAN (three-letter airport codes: Chengdu Shuangliu Airport to Guangzhou Baiyun Airport) from the booking data into the multi-factor fusion prediction function of step 3.1). If the resulting value is the largest among all W word pairs, that pair becomes the prediction; if the passenger actually flew that route with that airline in the test set, the Top-1 prediction accuracy for this passenger is 1. (c = 1, 2, ..., C, r = 1, 2, ..., R, u = 1, 2, ..., U).
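The ranking and evaluation of step 3.2) can be sketched as follows. The text does not define accuracy beyond the Top-1 example, so the code assumes a hit-rate definition: a passenger counts as correctly predicted if any of the K recommended pairs appears in that passenger's test-set records. Function names and the airline code "311" in the usage note are hypothetical.

```python
def top_k_pairs(scores, k):
    """Return the k (airline, route) pairs with the highest scores.

    scores: dict mapping (c, r) -> fused score for one passenger.
    """
    return sorted(scores, key=scores.get, reverse=True)[:k]

def top_k_accuracy(predictions, actual):
    """Fraction of passengers with at least one predicted pair in the test set.

    predictions: dict passenger -> list of predicted (c, r) pairs.
    actual: dict passenger -> set of (c, r) pairs seen in the test set.
    ASSUMPTION: hit-rate definition of accuracy.
    """
    if not predictions:
        return 0.0
    hits = sum(
        1 for u, pairs in predictions.items()
        if any(p in actual.get(u, set()) for p in pairs)
    )
    return hits / len(predictions)
```

For passenger 17464755 with scores `{("290", "CTU-CAN"): -1.2, ("311", "CTU-CAN"): -2.0}`, the Top-1 prediction is ("290", "CTU-CAN"), and the accuracy is 1 if that pair occurs in the passenger's test records.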

It should be emphasized that the embodiments described in the present invention are illustrative rather than restrictive. The present invention is therefore not limited to the embodiments described in the detailed description; any other implementation derived by a person skilled in the art from the technical solution of the present invention likewise falls within its scope of protection.

Claims (1)

1. A multi-factor fusion civil aviation passenger travel prediction method based on a topic model, which applies data mining theory and methods to analyze passenger travel behavior in civil aviation data, wherein the operating environment requires a computer platform with at least 8 GB of memory, a CPU with at least 4 cores and a clock frequency of at least 2.6 GHz, a 64-bit operating system of Windows 7 or later, and the required software environment of an Oracle database, Java 1.7 or later, and Matlab 2011b or later; characterized in that the method comprises:

Step 1): constructing a passenger association graph travel topic model, which includes building the passenger association graph and topic-modeling the probability distribution of passengers' travel choices, finally yielding the passenger association graph travel topic model:

Step 1.1): constructing the passenger association graph;
Constructing the passenger association graph means calculating the association degree between passengers, which is determined jointly by the passengers' route co-occurrence degree and attribute co-occurrence degree; the route co-occurrence degree is determined by the number of routes the passengers have in common; the attribute co-occurrence degree indicates whether the passengers' age, gender, average discount, and average mileage are the same; the age, average discount, and average mileage information is obtained by a variance-based segmentation method;

Step 1.2): topic-modeling the probability distribution of passengers' travel choices;
Based on a topic model, passengers and the routes and airlines they fly are topic-modeled to discover and obtain the latent topic distributions of passengers, routes, and airlines; the latent topic distribution of passengers is finally combined with the latent topic distributions of airlines and routes to obtain the probability distribution of passengers' travel choices over airlines and routes;

Step 1.3): constructing the passenger association graph travel topic model;
The passenger association graph of step 1.1) is added to the topic modeling process of step 1.2) to construct the Passenger Graph based Travel Topic Model (PGTTM); when PGTTM assigns topics to each passenger's routes and airlines, a topic may come not only from the passenger but also from other passengers associated with that passenger; this enriches the topic information, improves prediction performance, and alleviates the sparsity of civil aviation passenger travel data;

Step 2): constructing calculation models for route popularity, passenger loyalty, and airline market share, whose prior knowledge helps make the subsequent prediction accurate:

Step 2.1): calculating route popularity;
For route popularity, first count the number of times the route is flown by all passengers, and the sum over every route of the number of times it is flown by all passengers; route popularity is calculated on this basis;

Step 2.2): calculating passenger loyalty to airlines;
For passenger loyalty, first count the number of times the passenger flies with the airline, and the sum of the number of times the passenger flies with each airline; on this basis, after smoothing, the passenger's loyalty to the airline is calculated;

Step 2.3): calculating the airline's market share on a route;
For airline market share, first count the number of times the airline and the route, taken together as a word pair, are flown by all passengers, and the number of times the route is flown by all passengers regardless of airline; the airline's market share on the route is calculated on this basis;

Step 3): building a multi-factor fusion prediction framework through a Bayesian probability model; the Bayesian probability model fuses route popularity, the probability distribution of passengers' route choices, passenger loyalty, and airline market share to predict the routes and airlines that passengers will choose in the future:

Step 3.1): multi-factor fusion based on the Bayesian probability model;
Based on the probability distribution of passengers' route choices obtained by PGTTM in step 1.3), the route popularity of step 2.1), the passenger loyalty of step 2.2), and the airline market share of step 2.3), a Bayesian probability model is constructed that fuses these four factors to better model passengers' travel behavior;

Step 3.2): multi-factor prediction based on the Bayesian probability model;
For each passenger and each airline-route word pair, the Bayesian probability model function is used to calculate the probability that the passenger takes that pair; for each passenger, the several airline-route word pairs with the highest probability are selected for prediction and recommendation.
CN201611159984.3A 2016-12-15 2016-12-15 A Multi-factor Fusion Civil Aviation Passenger Travel Prediction Method Based on Theme Model Active CN106779214B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611159984.3A CN106779214B (en) 2016-12-15 2016-12-15 A Multi-factor Fusion Civil Aviation Passenger Travel Prediction Method Based on Theme Model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611159984.3A CN106779214B (en) 2016-12-15 2016-12-15 A Multi-factor Fusion Civil Aviation Passenger Travel Prediction Method Based on Theme Model

Publications (2)

Publication Number Publication Date
CN106779214A CN106779214A (en) 2017-05-31
CN106779214B true CN106779214B (en) 2020-08-28

Family

ID=58889245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611159984.3A Active CN106779214B (en) 2016-12-15 2016-12-15 A Multi-factor Fusion Civil Aviation Passenger Travel Prediction Method Based on Theme Model

Country Status (1)

Country Link
CN (1) CN106779214B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108876049A (en) * 2018-06-27 2018-11-23 南京航空航天大学 A kind of airport market share variation prediction method in new demand servicing nurturing period
CN110751523A (en) * 2019-10-21 2020-02-04 中国民航信息网络股份有限公司 Method and device for discovering potential high-value passengers
CN110852650B (en) * 2019-11-19 2021-11-02 交通运输部公路科学研究所 Modeling Method of Integrated Passenger Hub Group Network Based on Dynamic Graph Hybrid Automata
CN112948161B (en) * 2021-03-09 2022-06-03 四川大学 A method and system for error correction and correction of aviation message based on deep learning
CN118350858B (en) * 2024-03-20 2025-02-21 中航信数智科技(北京)有限公司 Route passenger volume prediction method, device, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105488597A (en) * 2015-12-28 2016-04-13 中国民航信息网络股份有限公司 Passenger destination prediction method and system
CN105512773A (en) * 2015-12-25 2016-04-20 中国民航信息网络股份有限公司 Passenger travel destination prediction method and device
CN106055807A (en) * 2016-06-06 2016-10-26 四川大学 Civil aviation passenger movement model based on potential trip purposes

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5946394B2 (en) * 2012-11-09 2016-07-06 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Statistical inference method, computer program, and computer of path start and end points using multiple types of data sources.

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512773A (en) * 2015-12-25 2016-04-20 中国民航信息网络股份有限公司 Passenger travel destination prediction method and device
CN105488597A (en) * 2015-12-28 2016-04-13 中国民航信息网络股份有限公司 Passenger destination prediction method and system
CN106055807A (en) * 2016-06-06 2016-10-26 四川大学 Civil aviation passenger movement model based on potential trip purposes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Bayesian network prediction of trip chaining; Zhao Yingchang (赵应场) et al.; 《道路交通与安全》 (Road Traffic and Safety); 2015-12-31; Vol. 15, No. 1; full text *

Also Published As

Publication number Publication date
CN106779214A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106779214B (en) A Multi-factor Fusion Civil Aviation Passenger Travel Prediction Method Based on Theme Model
Çavdar et al. Airline customer lifetime value estimation using data analytics supported by social network information
Gao et al. Location-centered house price prediction: A multi-task learning approach
Zhao et al. Modelling consumer satisfaction based on online reviews using the improved Kano model from the perspective of risk attitude and aspiration
WO2024031933A1 (en) Social relation analysis method and system based on multi-modal data, and storage medium
Liu et al. Personalized air travel prediction: A multi-factor perspective
Dokuz et al. Discovering socially important locations of social media users
Kang et al. LA-CTR: A limited attention collaborative topic regression for social media
Noorian A BERT-based sequential POI recommender system in social media
CN111429161A (en) Feature extraction method, feature extraction device, storage medium, and electronic apparatus
Chen et al. Big data analytics on aviation social media: The case of china southern airlines on sina weibo
Dai A hybrid machine learning-based model for predicting flight delay through aviation big data
CN110096651A (en) Visual analysis method based on online social media individual center network
CN112784177B (en) A Spatial Distance Adaptive Next Interest Point Recommendation Method
Weng et al. OptDist: Learning Optimal Distribution for Customer Lifetime Value Prediction
Long et al. Construction framework of smart tourism big data mining model driven by blockchain technology
Krishnan et al. Predicting Passenger Preferences: An AI-Driven Framework for Personalized Airport Lobby Experiences
KR102639069B1 (en) Artificial intelligence-based advertising method recommendation system
Dai et al. Attention Mechanism with Spatial‐Temporal Joint Deep Learning Model for the Forecasting of Short‐Term Passenger Flow Distribution at the Railway Station
Sun et al. Measuring latent combinational novelty of Technology
Li et al. A greyness reduction framework for prediction of grey heterogeneous data
Kalra et al. RETRACTED ARTICLE: Enduring data analytics for reliable data management in handling smart city services
Parbat et al. Understanding the customer perception using machine learning while booking flight tickets
Faroqi et al. Modelling socioeconomic attributes of public transit passengers
Ranasinghe et al. Ensemble Learning Approach for Predicting Job Satisfaction on Freelancing Jobs in Sri Lanka

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant