CN107527223A - Method and device for ticketing information analysis - Google Patents

Info

Publication number
CN107527223A
Authority
CN
China
Prior art keywords
passenger
booking
distribution
station
record
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611198401.8A
Other languages
Chinese (zh)
Inventor
赵忠信
曹文洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd
Priority to CN201611198401.8A
Publication of CN107527223A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Abstract

The embodiments of the invention disclose a method and device for ticketing information analysis. The method includes: extracting the booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking count distribution, train type, booking mode distribution, origin station distribution, destination station distribution, and companion relations; and characterizing the passenger's booking behavior pattern by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space. If the passenger's category is unknown, the problem of determining that category is converted into the process of fitting the probability density distribution of the passenger hidden state vectors. On this basis an iterative state estimation algorithm is developed that can extract valid behavior features from massive data in parallel, identify abnormal patterns in passenger booking behavior, and meet real-time requirements in computation time. The model output can be presented from multiple aspects and angles, which makes it convenient for data analysts to use.

Description

Method and device for ticketing information analysis
Technical field
The embodiments of the present invention relate to the technical field of safety detection, and in particular to a method and device for ticketing information analysis.
Background art
Railways are important national infrastructure, the backbone of the transportation system, and the main artery of the national economy. They play a vital role in the country's politics, economy, culture, and national defense. According to 2015 statistics, China's railway operating mileage reached 112,000 kilometers, with a network density of 116.48 kilometers per 10,000 square kilometers, planned investment of more than 3.3 trillion RMB, and a railway passenger volume of more than 2.357 billion person-trips.
Safety is the lifeline of railway transportation and bears directly on an enterprise's production efficiency, its economic and social benefits, and personal safety. At present, railway safety monitoring in China mainly uses sensors, data acquisition and transmission devices, and data analysis systems to monitor, analyze, and give early warning on the parameters of hardware facilities such as tracks and trains in real time. However, people are the subject of passenger transport, and some unusual behavior in ticket purchasing and during travel may also adversely affect railway transportation, safe production, and the maintenance of normal order. China still has no mature theoretical model or technical product for detecting this special group of passengers or for narrowing the search range of potentially dangerous persons.
Extracting valuable patterns from massive passenger booking data with machine learning algorithms, however, faces several problems:
(1) Labeled data is lacking, so supervised learning models cannot be applied:
Passenger booking data contains no explicit labels for a model to learn from. Manual labeling is not only time-consuming and costly but also markedly subjective. First, there is no guarantee that every annotator has the domain expertise to judge abnormal patterns in booking data accurately. Second, annotators may apply different criteria, so labels for the same data may conflict. Third, the passenger booking data that can be collected is incomplete information, and it is hard to fix a clear standard for judging whether data is abnormal from incomplete information.
(2) The data is incomplete and lacks cross-validation from multiple sources:
The incompleteness shows in two ways. First, the acquired passenger booking data contains no exact booking time, and the metadata on the passenger's booking mode is incomplete. Second, information obtained from booking data alone is too limited. The outliers identified cannot be used directly as the basis for judging that a passenger belongs to a risk group. (Outliers are identified as follows: after the probability density function of the data is fitted, the passenger booking data set is traversed, each passenger vector is labeled by maximum likelihood estimation, and its degree of membership in each category cluster is evaluated; a passenger whose membership in every category is below a threshold is labeled an outlier.) Describing a passenger's booking behavior pattern accurately also requires support and verification from other sources of information.
(3) The number of passengers is huge, but individual ride records are sparse and the data offers little room for compression:
The data volume is huge: about 6,000,000 person-trips are made daily, and as many as tens of millions in peak periods, and the population involved also numbers in the millions. At the individual level, however, a large share of passengers may take the train fewer than 10 times a year, so personal ride data is markedly sparse. The main goal of this application is to identify outliers and detect abnormal patterns in passenger booking behavior, so individual details cannot be discarded and the data offers little room for compression. When association analysis algorithms are used to analyze which passengers travel together, both the time complexity and the space complexity of the algorithm are therefore very high.
Summary of the invention
The purpose of the embodiments of the present invention is to propose a method and device for ticketing information analysis, aimed at the problem of how to extract valuable patterns from massive passenger booking data using machine learning algorithms.
To this end, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, a method for ticketing information analysis, the method including:
extracting the booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking count distribution, train type, booking mode distribution, origin station distribution, destination station distribution, and companion relations;
characterizing the passenger's booking behavior pattern by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space; if the passenger's category is unknown, the problem of determining that category is converted into the process of fitting the probability density distribution of the passenger hidden state vectors.
Preferably, the attribute information of the passenger includes passenger age distribution information obtained by compiling statistics on the data file by age;
The trip purpose distribution is obtained as follows: the home province code parsed from the ID card information is compared with the administrative division codes of the origin station and the destination station to judge whether they are equal, and passengers are divided by trip purpose into a preset number of categories that neither overlap nor leave gaps. Here, odh means the origin and destination stations both match the home province, i.e. a short trip within the home province; odo means the origin and destination stations match each other but lie outside the home province, i.e. a short trip within another province; o means leaving the home province for another province; d means returning home from another province; other covers all remaining cases;
The booking counts include the number of ticket changes, the number of refunds, and the number of valid bookings. The number of ticket changes is the frequency of records with status 3 in the booking records; the number of refunds is the frequency of records with status 2; the number of valid bookings is the frequency of records with status 5;
The train type distribution is obtained as follows: the sequence of train types is taken from the passenger's multiple ride records; the economy, speed, and comfort scores of each train type are computed, summed, and divided by the number of ride records; the score for each indicator is thus the mean of that indicator over all ride records within the preset time;
The booking mode distribution is obtained as follows: the sequence of booking modes is taken from the passenger's multiple ride records; the economy and speed scores of each train type are computed, summed, and divided by the number of ride records; the score for each indicator is thus the mean of that indicator over all ride records within the preset time;
The origin station distribution includes the number of origin stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the origin station distribution. The number of origin stations is obtained by aggregating all booking records of a passenger, keyed by the passenger's ID card number, and counting the distinct origin stations that appear in them. The maximum likelihood probability refers to the station that appears most often in each passenger's booking records, and is computed as its frequency divided by the total number of bookings. The importance coefficient of each station is computed as the station's total number of departing passengers that day divided by the total number of departing passengers at all stations. The entropy of the origin station distribution is obtained by aggregating all booking records of a single passenger, collecting the set of origin stations in them, counting the frequency of each distinct item, and computing the entropy of the resulting discrete distribution;
The destination station distribution includes the number of destination stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the destination station distribution. The number of destination stations is obtained by aggregating all booking records of a passenger, keyed by the passenger's ID card number, and counting the distinct destination stations that appear in them. The maximum likelihood probability refers to the station that appears most often in each passenger's booking records, and is computed as its frequency divided by the total number of bookings. The importance coefficient of each station is computed as the station's total expected arrivals that day divided by the total expected arrivals at all stations. The entropy of the destination station distribution is obtained by aggregating all booking records of a single passenger, collecting the set of destination stations in them, counting the frequency of each distinct item, and computing the entropy of the resulting discrete distribution.
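The per-passenger station features described above can be illustrated with a minimal Python sketch. The function names and the input layout (one station name per booking record, and a mapping from station to daily passenger count) are assumptions made for illustration; the patent does not specify a data format or the logarithm base, and natural logarithms are used here.

```python
import math
from collections import Counter


def station_features(stations):
    """Summarise one passenger's origin (or destination) stations.

    `stations` holds one station name per booking record of that passenger.
    Returns (number of distinct stations, relative frequency of the most
    frequent station, entropy of the empirical distribution in nats).
    """
    if not stations:
        raise ValueError("passenger has no booking records")
    counts = Counter(stations)
    total = len(stations)
    distinct = len(counts)
    top_probability = max(counts.values()) / total
    entropy = sum((c / total) * math.log(total / c) for c in counts.values())
    return distinct, top_probability, entropy


def station_importance(passengers_by_station):
    """Share of the day's total passengers handled by each station."""
    total = sum(passengers_by_station.values())
    return {name: n / total for name, n in passengers_by_station.items()}
```

A passenger who always uses the same station gets an entropy of 0, while a passenger whose bookings are spread evenly over many stations gets a high entropy, which is what makes the value useful as a summary of the whole distribution.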
Preferably, the companion relation is determined as follows: in the passenger booking stream data acquired over a period of time, if passenger A and passenger B board at the same origin station on the same day, travel to the same destination station, take the same train in the same carriage, and bought their tickets using the same booking mode at the same window of the same station, and the support and confidence requirements are met, then passenger A and passenger B have a companion relation.
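The companion test can be sketched in Python as follows. The patent does not define support or confidence at this point, so the definitions here are assumptions: support is the number of trips two passengers share, and confidence is that number divided by each passenger's own trip count, required in both directions. The `trip_key` tuple (date, origin, destination, train, carriage, booking mode, window) and the default thresholds are likewise hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def find_companions(records, min_support=2, min_confidence=0.5):
    """Return {(a, b): (shared_trips, confidence)} for companion pairs.

    `records` is an iterable of (passenger_id, trip_key) pairs, where
    trip_key bundles every field that must match for two bookings to
    count as the same trip.
    """
    passengers_per_trip = defaultdict(set)
    for passenger_id, trip_key in records:
        passengers_per_trip[trip_key].add(passenger_id)

    trips_per_passenger = Counter()
    pair_counts = Counter()
    for passengers in passengers_per_trip.values():
        trips_per_passenger.update(passengers)
        for pair in combinations(sorted(passengers), 2):
            pair_counts[pair] += 1

    companions = {}
    for (a, b), shared in pair_counts.items():
        if shared < min_support:
            continue
        # Assumed rule: the weaker of the two directional confidences counts.
        confidence = min(shared / trips_per_passenger[a],
                         shared / trips_per_passenger[b])
        if confidence >= min_confidence:
            companions[(a, b)] = (shared, confidence)
    return companions
```

Enumerating pairs inside each trip is quadratic in the number of passengers sharing a trip key; because the key includes the carriage and the sales window, those groups stay small in this sketch.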
Preferably, extracting the booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking count distribution, train type, booking mode distribution, origin station distribution, destination station distribution, and companion relations includes:
characterizing the hidden state of a passenger with a vector, converting discrete, qualitative data labels into continuous, quantitative data;
performing statistical analysis on the booking data within a preset time interval, and describing the passenger's most likely booking mode with the maximum likelihood probability, so that qualitative, label-type data is converted into continuous, quantitative data;
aggregating the booking modes in all of the passenger's ride records within the preset time interval, and computing the entropy of the booking mode distribution and the number of distinct booking modes.
Preferably, characterizing the passenger's booking behavior pattern by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the passenger's category is unknown, converting the problem of determining that category into the process of fitting the probability density distribution of the passenger hidden state vectors, includes:
treating the passenger's category as a hidden variable, where, given the category, the conditional distribution of the passenger hidden state vector follows a multidimensional Gaussian distribution;
fitting the probability density curve of the passengers with a linear weighted sum of multiple Gaussian models, to model the real-world situation in which multiple categories may exist;
assuming, on the basis of the central limit theorem, that the conditional distribution of the passenger vector given the passenger category is normal. A weighted sum of multiple Gaussian models can approximate any probability distribution to an arbitrary degree. Each Gaussian model represents one class; for a passenger vector whose category is to be determined, the probability that the passenger belongs to each class is computed, and the class with the highest probability is chosen as the result.
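The classification step can be illustrated with a short Python sketch that, given an already fitted mixture, computes the probability of each class for one passenger vector and picks the largest. Diagonal covariance matrices and the `(weight, mean, variance)` tuple layout are simplifying assumptions for illustration; the patent only states that each class is a multidimensional Gaussian, and the fitting itself (typically done with expectation-maximization) is not shown.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density at x of a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def classify(x, components):
    """Return (index of most probable class, posterior of every class).

    `components` is a list of (weight, mean, var) tuples, one per
    Gaussian; the weights are the mixture coefficients and sum to 1.
    """
    log_joint = [
        math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components
    ]
    peak = max(log_joint)  # subtract the maximum for numerical stability
    unnormalised = [math.exp(value - peak) for value in log_joint]
    total = sum(unnormalised)
    posteriors = [u / total for u in unnormalised]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors
```

Working in log space avoids underflow when the hidden state vector has many dimensions, since the product of many small per-dimension densities would otherwise round to zero.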
In a second aspect, a device for ticketing information analysis, the device including:
an extraction module, configured to extract the booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking count distribution, train type, booking mode distribution, origin station distribution, destination station distribution, and companion relations;
a fitting module, configured to characterize the passenger's booking behavior pattern by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the passenger's category is unknown, to convert the problem of determining that category into the process of fitting the probability density distribution of the passenger hidden state vectors.
Preferably, the attribute information of the passenger includes passenger age distribution information obtained by compiling statistics on the data file by age;
The trip purpose distribution is obtained as follows: the home province code parsed from the ID card information is compared with the administrative division codes of the origin station and the destination station to judge whether they are equal, and passengers are divided by trip purpose into a preset number of categories that neither overlap nor leave gaps. Here, odh means the origin and destination stations both match the home province, i.e. a short trip within the home province; odo means the origin and destination stations match each other but lie outside the home province, i.e. a short trip within another province; o means leaving the home province for another province; d means returning home from another province; other covers all remaining cases;
The booking counts include the number of ticket changes, the number of refunds, and the number of valid bookings. The number of ticket changes is the frequency of records with status 3 in the booking records; the number of refunds is the frequency of records with status 2; the number of valid bookings is the frequency of records with status 5;
The train type distribution is obtained as follows: the sequence of train types is taken from the passenger's multiple ride records; the economy, speed, and comfort scores of each train type are computed, summed, and divided by the number of ride records; the score for each indicator is thus the mean of that indicator over all ride records within the preset time;
The booking mode distribution is obtained as follows: the sequence of booking modes is taken from the passenger's multiple ride records; the economy and speed scores of each train type are computed, summed, and divided by the number of ride records; the score for each indicator is thus the mean of that indicator over all ride records within the preset time;
The origin station distribution includes the number of origin stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the origin station distribution. The number of origin stations is obtained by aggregating all booking records of a passenger, keyed by the passenger's ID card number, and counting the distinct origin stations that appear in them. The maximum likelihood probability refers to the station that appears most often in each passenger's booking records, and is computed as its frequency divided by the total number of bookings. The importance coefficient of each station is computed as the station's total number of departing passengers that day divided by the total number of departing passengers at all stations. The entropy of the origin station distribution is obtained by aggregating all booking records of a single passenger, collecting the set of origin stations in them, counting the frequency of each distinct item, and computing the entropy of the resulting discrete distribution;
The destination station distribution includes the number of destination stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the destination station distribution. The number of destination stations is obtained by aggregating all booking records of a passenger, keyed by the passenger's ID card number, and counting the distinct destination stations that appear in them. The maximum likelihood probability refers to the station that appears most often in each passenger's booking records, and is computed as its frequency divided by the total number of bookings. The importance coefficient of each station is computed as the station's total expected arrivals that day divided by the total expected arrivals at all stations. The entropy of the destination station distribution is obtained by aggregating all booking records of a single passenger, collecting the set of destination stations in them, counting the frequency of each distinct item, and computing the entropy of the resulting discrete distribution.
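Among the features listed above, the booking counts reduce to a tally of status codes. A minimal Python sketch follows, using the status values given in the text (2 for refunds, 3 for ticket changes, 5 for valid bookings); the constant names, the function name, and the input as a plain list of status codes for one passenger are assumptions for illustration.

```python
from collections import Counter

# Status codes as stated in the text; the constant names are illustrative.
STATUS_REFUNDED = 2
STATUS_CHANGED = 3
STATUS_VALID = 5


def booking_counts(statuses):
    """Count ticket changes, refunds, and valid bookings for one passenger.

    `statuses` holds the status code of each of the passenger's booking
    records; codes other than 2, 3, and 5 are ignored.
    """
    counts = Counter(statuses)
    return {
        "changed": counts[STATUS_CHANGED],
        "refunded": counts[STATUS_REFUNDED],
        "valid": counts[STATUS_VALID],
    }
```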
Preferably, the companion relation is determined as follows: in the passenger booking stream data acquired over a period of time, if passenger A and passenger B board at the same origin station on the same day, travel to the same destination station, take the same train in the same carriage, and bought their tickets using the same booking mode at the same window of the same station, and the support and confidence requirements are met, then passenger A and passenger B have a companion relation.
Preferably, the extraction module is specifically configured to:
characterize the hidden state of a passenger with a vector, converting discrete, qualitative data labels into continuous, quantitative data;
perform statistical analysis on the booking data within a preset time interval, and describe the passenger's most likely booking mode with the maximum likelihood probability, so that qualitative, label-type data is converted into continuous, quantitative data;
aggregate the booking modes in all of the passenger's ride records within the preset time interval, and compute the entropy of the booking mode distribution and the number of distinct booking modes.
Preferably, the fitting module is specifically configured to:
treat the passenger's category as a hidden variable, where, given the category, the conditional distribution of the passenger hidden state vector follows a multidimensional Gaussian distribution;
fit the probability density curve of the passengers with a linear weighted sum of multiple Gaussian models, to model the real-world situation in which multiple categories may exist;
assume, on the basis of the central limit theorem, that the conditional distribution of the passenger vector given the passenger category is normal. A weighted sum of multiple Gaussian models can approximate any probability distribution to an arbitrary degree. Each Gaussian model represents one class; for a passenger vector whose category is to be determined, the probability that the passenger belongs to each class is computed, and the class with the highest probability is chosen as the result.
In the method and device for ticketing information analysis provided by the embodiments of the present invention, the booking behavior pattern features of a passenger are extracted from the passenger's attribute information, trip purpose distribution, booking count distribution, train type, booking mode distribution, origin station distribution, destination station distribution, and companion relations; the passenger's booking behavior pattern is characterized by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space; and if the passenger's category is unknown, the problem of determining that category is converted into the process of fitting the probability density distribution of the passenger hidden state vectors. The latent state of a passenger can thus be tracked dynamically and the passenger's booking behavior pattern estimated accurately, with a certain tolerance for erroneous data. An iterative state estimation algorithm is developed that can extract valid behavior features from massive data in parallel, identify abnormal patterns in passenger booking behavior, and meet real-time requirements in computation time. The model output should satisfy a stability requirement: the judgment of a given person's booking pattern should remain consistent within a given period. The model output can also be presented from multiple aspects and angles, which makes it convenient for data analysts to use.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a method for ticketing information analysis provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of a passenger booking behavior model and feature extraction provided by an embodiment of the present invention;
Fig. 3 is an age distribution histogram provided by an embodiment of the present invention;
Fig. 4 is a pie chart of passenger age categories provided by an embodiment of the present invention;
Fig. 5 is a pie chart of the trip purpose distribution provided by an embodiment of the present invention;
Fig. 6 is a pie chart of the train type preference distribution provided by an embodiment of the present invention;
Fig. 7 is a pie chart of the booking mode distribution provided by an embodiment of the present invention;
Fig. 8 is a histogram of the 100 stations nationwide with the most boarding passengers, provided by an embodiment of the present invention;
Fig. 9 is a histogram of the 100 stations nationwide with the most alighting passengers, provided by an embodiment of the present invention;
Fig. 10 is a schematic diagram of the enumeration of candidate frequent 3-itemsets provided by an embodiment of the present invention;
Fig. 11 is a schematic diagram of candidate frequent itemset counting provided by an embodiment of the present invention;
Fig. 12 is a schematic diagram of association rule generation provided by an embodiment of the present invention;
Fig. 13 is a functional block diagram of a device for ticketing information analysis provided by an embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described here only explain the embodiments of the present invention and do not limit them. It should also be noted that, for ease of description, the drawings show only the parts related to the embodiments of the present invention rather than the entire structure.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a method for ticketing information analysis provided by an embodiment of the present invention.
As shown in Fig. 1, the method for ticketing information analysis includes:
Step 101: extracting the booking behavior pattern features of a passenger from the passenger's attribute information, trip purpose distribution, booking count distribution, train type, booking mode distribution, origin station distribution, destination station distribution, and companion relations;
Specifically, as shown in Fig. 2, Fig. 2 is a schematic diagram of the passenger hidden state provided by an embodiment of the present invention. The passenger's booking behavior pattern is characterized from several aspects, such as the passenger's attribute information, trip purpose, booking counts, booking mode distribution, destination station distribution, and companion relations.
The hidden state of a passenger is characterized with a vector that takes both global and local information into account and reflects each passenger's behavioral habits and preferences as a whole. For discrete data such as trip purpose and booking mode, the discrete, qualitative labels are converted into continuous, quantitative data. On the one hand this provides more information for describing user behavior accurately; on the other hand it makes it convenient to take derivatives when optimizing the model. For example, statistical analysis is performed on the booking data within a period of time, and the passenger's most likely booking mode is described with the maximum likelihood probability, so that qualitative, label-type data is converted into a continuous, quantitative value that also reflects some of the extremes in the passenger's booking behavior. By aggregating the booking modes in all of the passenger's ride records within a period of time, the entropy of the booking mode distribution and the number of distinct booking modes are computed to reflect the booking mode distribution as a whole.
The attribute information of the passenger includes passenger age distribution information obtained by compiling statistics on the data file by age;
Specifically, statistics are compiled on the data file by age to obtain the passenger age distribution histogram shown in Fig. 3. Passengers are also grouped simply by age into several classes such as post-90s, post-80s, and post-70s; the pie chart of passenger age categories is shown in Fig. 4. Age: a real-valued feature. The date of birth is parsed from the ID card information and the age corresponding to the ID card number is calculated by subtracting the day count of the date in the ID card from the day count of the system time.
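The age calculation can be sketched in Python as follows. An 18-digit resident ID number carries the date of birth as YYYYMMDD in digits 7 to 14. Dividing the day difference by 365.25 to obtain a real-valued age, the function names, and the decade label format are assumptions for illustration, since the text only says that the two day counts are subtracted.

```python
from datetime import date


def birth_date_from_id(id_number):
    """Extract the date of birth from an 18-character ID number."""
    if len(id_number) != 18:
        raise ValueError("expected an 18-character ID number")
    return date(
        int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14])
    )


def age_in_years(id_number, today):
    """Real-valued age: days elapsed since birth divided by 365.25."""
    return (today - birth_date_from_id(id_number)).days / 365.25


def generation_label(id_number):
    """Decade of birth, e.g. 'post-90s' for a birth year of 1990 to 1999."""
    decade = birth_date_from_id(id_number).year % 100 // 10 * 10
    return f"post-{decade:02d}s"
```

Passing the reference date explicitly, instead of reading the system clock inside the function, keeps the calculation reproducible when historical booking data is analyzed.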
The trip purpose distribution includes: according to the native-place province code parsed from the ID card information, combined with the administrative division codes of the starting station and the terminal station, judging whether the native-place code is equal to the administrative division codes of the starting station and the terminal station, and dividing passengers into a predetermined number of categories by trip purpose, the categories being mutually exclusive and jointly exhaustive; wherein odh indicates that the starting station and the terminal station are both consistent with the native place, i.e. a short trip within the home province; odo indicates that the starting station and the terminal station are consistent with each other but lie in a province other than the home province, i.e. a short trip within another province; o indicates leaving the home province to travel to another province; d indicates returning home from another province; other indicates all remaining cases;
Specifically, according to the native-place province code parsed from the ID card information, combined with the administrative division codes of the starting station and the terminal station, it is judged whether the native-place code is equal to the administrative division codes of the starting station and the terminal station, and passengers are divided into five major classes by trip purpose, with no overlap and no omission between classes. The trip purpose distribution pie chart is shown in Fig. 5.
Wherein odh: the starting station and the terminal station are both consistent with the native place, i.e. a short trip within the home province; odo: the starting station and the terminal station are consistent with each other but lie in another province, i.e. a short trip within another province; o: leaving the home province to travel to another province; d: returning home from another province; other: all remaining cases. It can be seen from Fig. 5 that short trips within the home province and trips home account for the largest share.
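The five-way classification can be sketched as a small function. This is an illustrative reading of the definitions above, assuming that the native place, starting station and terminal station have each already been reduced to a province-level code; the function name and the sample codes are hypothetical.

```python
def classify_trip(home: str, start: str, end: str) -> str:
    """Assign one trip to exactly one of odh, odo, o, d, other."""
    if start == end:
        # Trip stays inside one province: home province or another one.
        return "odh" if start == home else "odo"
    if start == home:
        return "o"      # leaves the home province
    if end == home:
        return "d"      # returns to the home province
    return "other"      # travels between two non-home provinces

# Hypothetical province codes: "11", "31", "44".
print(classify_trip("11", "11", "11"))  # odh
print(classify_trip("11", "31", "31"))  # odo
print(classify_trip("11", "11", "31"))  # o
print(classify_trip("11", "31", "11"))  # d
print(classify_trip("11", "31", "44"))  # other
```

Because the branches are tested in a fixed order and end in a catch-all, every trip falls into exactly one class, which matches the requirement that the classes neither overlap nor leave gaps.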
All trip records of a passenger are counted, each trip record is assigned to one of the five types above, and all the assigned categories are aggregated into a discrete trip purpose data set; the trip purpose is then described by the probabilities of the five categories. To keep the data dimensions independent (the five probabilities sum to 1, so one of them is redundant), the probability of the category other is removed and an entropy dimension of the trip purpose distribution is added.
o probability: among the passenger's trip records, the frequency of category o divided by the total number of trips.
odo probability: among the passenger's trip records, the frequency of category odo divided by the total number of trips.
d probability: among the passenger's trip records, the frequency of category d divided by the total number of trips.
odh probability: among the passenger's trip records, the frequency of category odh divided by the total number of trips.
Entropy of the trip purpose distribution: over all trip records of the passenger, the trip purpose category of each record is noted in turn, giving an array that forms a discrete distribution; the feature is the entropy of that discrete distribution.
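The five trip purpose dimensions can be computed as sketched below. The function name is hypothetical, and the base-2 logarithm is an assumption, since the text does not state which logarithm base the entropy uses.

```python
import math
from collections import Counter

def trip_purpose_features(categories):
    """Return [p_o, p_odo, p_d, p_odh, entropy] for one passenger's trips."""
    n = len(categories)
    if n == 0:
        raise ValueError("passenger has no trip records")
    counts = Counter(categories)
    # Probability of "other" is dropped; it is implied by the remaining four.
    probs = [counts.get(c, 0) / n for c in ("o", "odo", "d", "odh")]
    # Entropy over all observed categories (assumed base 2).
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return probs + [entropy]

print(trip_purpose_features(["odh", "odh", "o", "d"]))
```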
The booking counts include the number of ticket changes, the number of refunds and the number of valid bookings; the number of ticket changes is the frequency of booking records whose state is 3; the number of refunds is the frequency of booking records whose state is 2; the number of valid bookings is the frequency of booking records whose state is 5;
Specifically, number of ticket changes: the frequency of booking records whose state is 3. Number of refunds: the frequency of booking records whose state is 2. Number of valid bookings: the frequency of booking records whose state is 5. All other features of the passenger booking behavior pattern are calculated from the valid booking records.
The train type distribution includes: obtaining the sequence of different train types from the passenger's multiple ride records, calculating the economy, quickness and comfort scores of each train type, summing them and dividing by the number of ride records, so that the score of each index is the average of the corresponding index over all ride records within the preset time;
Specifically, the choice of train type reflects, from one angle, the passenger's income and trip purpose, and provides additional information for an accurate passenger portrait.
The train type codes all carry a specific meaning. The valid booking records mainly contain K (fast), G (high-speed rail), D (EMU train), Z (direct express), C (intercity), T (express), Y (tourist), four-digit train numbers beginning with 6/7/8/9 (ordinary passenger trains), and L (temporary passenger train). The train type preference distribution pie chart is shown in Fig. 6.
Train types are described in three respects: economy, quickness and comfort. Economy is measured mainly by the ticket price per unit of mileage. Quickness is measured by the running speed of the train, whether it stops at intermediate stations, the length of the stops, and so on. Comfort considers the passenger's subjective riding experience and is measured by the average train space per person and the train's dwell time per station.
Economy: measured by the fare per kilometer of the train type; the lower the fare per kilometer, the higher the economy value. Suppose the fare per kilometer of a train type is x yuan and the average fare per kilometer over all train types is mean; then the economy index y can be calculated by the following formula:
$$y = \mathrm{sigmoid}\big(a\,(mean - x)\big) = \frac{1}{1 + e^{-a\,(mean - x)}}$$
The economy index y is a decreasing function of x in the form of a sigmoid function, whose range lies between 0 and 1. The slope of the function is controlled by the parameter a: the larger a is, the steeper the function; the smaller a is, the flatter the function. When x equals mean, y equals 0.5. When x varies near the average value mean, y changes approximately linearly; when x is far from the average, y changes only slowly. A value of y greater than 0.5 indicates that the fare x of the train type is below the average fare mean; a value of y less than 0.5 indicates that the fare x is above the average fare mean.
Quickness: calculated from the average speed of the train type; the higher the average speed, the higher the quickness index. The quickness index is likewise calculated with a sigmoid function. Suppose the average speed of a train type is v km/h and the mean of the average speeds of all train types is mean; then:
$$y = \mathrm{sigmoid}\big(a\,(v - mean)\big) = \frac{1}{1 + e^{-a\,(v - mean)}}$$
The quickness index y is an increasing function of v whose range lies between 0 and 1. The parameter a controls the slope of the curve: the larger a is, the steeper the function. When v equals mean, y equals 0.5; when y is greater than 0.5, the train speed v is above the average speed mean.
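The two sigmoid-based indices can be sketched as below. The function names, the default value of the slope parameter a, and the sample fares and speeds are assumptions for illustration; only the shape of the functions (decreasing in the fare, increasing in the speed, 0.5 at the mean) comes from the description.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function with range (0, 1); intended for moderate |z|."""
    return 1.0 / (1.0 + math.exp(-z))

def economy_index(fare_per_km: float, mean_fare: float, a: float = 1.0) -> float:
    """Decreasing in the fare: cheaper than average gives a value above 0.5."""
    return sigmoid(a * (mean_fare - fare_per_km))

def quickness_index(speed: float, mean_speed: float, a: float = 1.0) -> float:
    """Increasing in the speed: faster than average gives a value above 0.5."""
    return sigmoid(a * (speed - mean_speed))

# Hypothetical values: fares in yuan/km, speeds in km/h.
print(economy_index(0.30, 0.30))              # 0.5 at the mean
print(economy_index(0.15, 0.30, a=10.0))      # above 0.5
print(quickness_index(300.0, 150.0, a=0.01))  # above 0.5
```

Because fares and speeds live on very different scales, a would in practice need to be chosen per index so that typical deviations from the mean fall in the near-linear part of the curve.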
Comfort: measured by the train space per person and the train's dwell time per station. The larger the space per person and the shorter the dwell time per station, the higher the comfort. The comfort index is somewhat more complicated to calculate: comfort y is an increasing function of the space per person s and a decreasing function of the dwell time per station t. First s and t are made dimensionless, then the signed weighted sum weightedSum of s and t is computed, and finally the sigmoid function is applied. Suppose that, over all train types, the maximum of the space per person s is s_max and the minimum is s_min, and the maximum of the dwell time per station t is t_max and the minimum is t_min; then:
$$s_{std} = \frac{s - s_{min}}{s_{max} - s_{min}}, \qquad t_{std} = \frac{t - t_{min}}{t_{max} - t_{min}}$$
$$weightedSum = w_s \cdot s_{std} - w_t \cdot t_{std}$$
where w_s and w_t are the weights of the two normalized quantities. The weightedSum values of all train types are calculated and their average aws is taken; then:
$$y = \mathrm{sigmoid}(weightedSum - aws)$$
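A minimal sketch of the comfort index follows. It assumes min-max scaling as the way s and t are made dimensionless and equal weights of 0.5; the function names, the weights and the per-type figures are hypothetical and serve only to show the order of the steps (normalize, weighted sum, center on the average, sigmoid).

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def min_max(value: float, lo: float, hi: float) -> float:
    """Scale value to [0, 1]; returns 0.0 when all values are identical."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0

def comfort_indices(train_types, w_s=0.5, w_t=0.5):
    """train_types maps a type code to (space per person, dwell time per station)."""
    s_vals = [s for s, _ in train_types.values()]
    t_vals = [t for _, t in train_types.values()]
    weighted = {
        name: w_s * min_max(s, min(s_vals), max(s_vals))
              - w_t * min_max(t, min(t_vals), max(t_vals))
        for name, (s, t) in train_types.items()
    }
    aws = sum(weighted.values()) / len(weighted)  # average weighted sum
    return {name: sigmoid(ws - aws) for name, ws in weighted.items()}

# Hypothetical figures: (square meters per person, minutes per stop).
print(comfort_indices({"G": (1.0, 2.0), "K": (0.5, 8.0)}))
```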
Within a period of time, a passenger has multiple ride records, which give a sequence of different train types. The economy, quickness and comfort scores of each train type are calculated, summed, and divided by the number of ride records; that is, the score of each index is the average of the corresponding index over all ride records in the period.
The booking mode distribution includes: obtaining the sequence of different booking modes from the passenger's multiple ride records, calculating the economy and quickness scores of each booking mode, summing them and dividing by the number of ride records, so that the score of each index is the average of the corresponding index over all ride records within the preset time;
Specifically, booking mode preference reflects, to some extent, the passenger's economic level and behavioral habits. For example, university students tend to buy tickets online, while business travelers may prefer third-party ticket agencies. All valid booking records are counted by booking mode.
The booking mode distribution pie chart is shown in Fig. 7.
Number of booking modes: the number of different booking modes used across all booking records. Note the difference between this value and the number of valid bookings: it counts the distinct booking modes used, and an identical booking mode is not counted twice.
Maximum likelihood probability: across all booking records, the probability of the most frequently used booking mode, calculated as its frequency divided by the total number of bookings.
Entropy of the booking mode distribution: all booking records of a single passenger are aggregated to obtain the set of booking modes used, the distinct items in the set (the booking modes used) are counted, and the entropy of the resulting discrete distribution is calculated.
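The three distribution features (number of distinct values, maximum likelihood probability, entropy) share one computation, sketched below for booking modes. The function name and the mode labels are hypothetical, and the base-2 logarithm is an assumption because the text does not fix the base.

```python
import math
from collections import Counter

def distribution_features(values):
    """Return (distinct count, probability of the most frequent value, entropy)."""
    n = len(values)
    if n == 0:
        raise ValueError("no booking records")
    counts = Counter(values)
    max_prob = counts.most_common(1)[0][1] / n
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return len(counts), max_prob, entropy

# Hypothetical booking modes taken from one passenger's valid bookings.
print(distribution_features(["online", "online", "window", "agency"]))
```

The same function applies unchanged to a passenger's list of starting stations or terminal stations, since those features are defined in the same way.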
Average metric coefficients of the booking mode: each booking mode is measured in terms of economy and quickness. Economy mainly reflects the financial cost of using the booking mode; for example, buying through a ticket agency incurs certain travel costs and a corresponding service charge, and the higher the resulting cost, the lower the economy. Quickness mainly considers the time cost of the booking mode, i.e. how long it takes: the longer it takes, the lower the quickness; the shorter it takes, the higher the quickness. Each attribute value is calculated in a manner similar to that used for train type preference.
Economy
Suppose the financial cost of a booking mode is x yuan and the average financial cost over all booking modes is mean yuan; then the economy index y is a decreasing function of the financial cost x:
$$y = \mathrm{sigmoid}(mean - x)$$
Quickness
Suppose the time cost of a booking mode is x minutes and the average time cost over all booking modes is mean minutes; then the quickness index y is a decreasing function of the time cost x:
$$y = \mathrm{sigmoid}(mean - x)$$
Within a period of time, a passenger has multiple ride records, which give a sequence of different booking modes. The economy and quickness scores of each booking mode are calculated, summed, and divided by the number of ride records; that is, the score of each index is the average of the corresponding index over all ride records in the period.
The starting station distribution includes the number of starting stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the starting station distribution. The number of starting stations is obtained by aggregating all booking records of a passenger with the passenger ID card number as the key and counting the distinct starting stations that appear in those records. The maximum likelihood probability is the probability of the station that appears most often in the passenger's booking records, calculated as its frequency divided by the total number of bookings. The station importance coefficient of each station is calculated as the total number of passengers departing from that station on the day divided by the total number of passengers departing from all stations. The entropy of the starting station distribution is obtained by aggregating all booking records of a single passenger to get the set of starting stations in those records, counting the frequency of each distinct item in the set, and calculating the entropy of the resulting discrete distribution;
Specifically, in order to measure the volume of passenger traffic and of technical operations, as well as the political and economic standing of the area a station serves, China has formulated dedicated grading standards for passenger stations and comprehensive stations, as follows:
Special-class station: a passenger station with a daily average of more than 60,000 boarding, alighting and transfer passengers, or handling more than 20,000 pieces of luggage and parcels.
First-class station: a passenger station with a daily average of more than 15,000 boarding, alighting and transfer passengers, or handling more than 1,500 pieces of luggage and parcels.
Second-class station: a passenger station with a daily average of more than 5,000 boarding, alighting and transfer passengers, or handling more than 500 pieces of luggage and parcels.
Third-class station: a passenger station with a daily average of more than 2,000 boarding, alighting and transfer passengers, or handling more than 100 pieces of luggage and parcels.
Other.
For the passenger booking data of the selected week from 20100710 to 20100716, statistics are computed by the starting station of each booking record. Taking the top 100 stations by daily number of boarding or transfer passengers, the passengers originating at these stations account for 61% of all passengers, and the station importance coefficient of the 100th station is 0.001918. The histogram of the top 100 stations nationwide by number of boarding passengers is shown in Fig. 8.
In order to describe the passenger booking behavior pattern more accurately and more completely, information on the stations a passenger frequents is added to the extracted features, and the following index values are used to reflect the distribution of starting stations in the passenger booking behavior pattern.
Number of starting stations: with the passenger ID card number as the key, all booking records of the passenger are aggregated, and the number of distinct starting stations appearing in those records is counted.
Maximum likelihood probability: the probability of the station that appears most often in the booking records of each passenger, calculated as its frequency divided by the total number of bookings.
Station importance coefficient: the importance coefficient of each station is calculated as the total number of passengers departing from that station on the day divided by the total number of passengers departing from all stations. A passenger has multiple ride records within a period of time, and the station importance coefficient is measured by the average, the maximum and the minimum of the importance coefficients of the starting stations over all ride records.
Entropy of the starting station distribution: all booking records of a single passenger are aggregated to obtain the set of starting stations in those records, the frequency of each distinct item in the set is counted, and the entropy of the resulting discrete distribution is calculated.
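The station importance coefficient and its per-passenger summary can be sketched as below. The function names, station names and passenger counts are hypothetical; the sketch only assumes that a per-station count of departing passengers for the day is available.

```python
def station_importance(departures_by_station):
    """Share of the day's departing passengers handled by each station."""
    total = sum(departures_by_station.values())
    return {station: n / total for station, n in departures_by_station.items()}

def importance_features(start_stations, coefficients):
    """Average, maximum and minimum coefficient over one passenger's rides."""
    values = [coefficients[s] for s in start_stations]
    return sum(values) / len(values), max(values), min(values)

# Hypothetical daily departure counts for three stations.
coeff = station_importance({"StationA": 600, "StationB": 300, "StationC": 100})
print(coeff)
print(importance_features(["StationA", "StationA", "StationC"], coeff))
```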
The terminal station distribution includes the number of terminal stations, the maximum likelihood probability, the station importance coefficient, and the entropy of the terminal station distribution. The number of terminal stations is obtained by aggregating all booking records of a passenger with the passenger ID card number as the key and counting the distinct terminal stations that appear in those records. The maximum likelihood probability is the probability of the station that appears most often in the passenger's booking records, calculated as its frequency divided by the total number of bookings. The station importance coefficient of each station is calculated as the total number of passengers expected to arrive at that station on the day divided by the total number of passengers expected to arrive at all stations. The entropy of the terminal station distribution is obtained by aggregating all booking records of a single passenger to get the set of terminal stations in those records, counting the frequency of each distinct item in the set, and calculating the entropy of the resulting discrete distribution.
Specifically, for the passenger booking data of the selected week from 20100710 to 20100716, statistics are computed by the terminal station of each booking record. Taking the top 100 stations by daily number of alighting or transfer passengers, the passengers at these stations account for 59% of all passengers, and the station importance coefficient of the 100th station is 0.001949. The histogram of the top 100 stations nationwide by number of alighting passengers is shown in Fig. 9.
In order to describe the passenger booking behavior pattern more accurately and more completely, information on the stations a passenger frequents is added to the extracted features, and the following index values are used to reflect the distribution of terminal stations in the passenger booking behavior pattern.
Number of terminal stations: with the passenger ID card number as the key, all booking records of the passenger are aggregated, and the number of distinct terminal stations appearing in those records is counted.
Maximum likelihood probability: the probability of the station that appears most often in the booking records of each passenger, calculated as its frequency divided by the total number of bookings.
Station importance coefficient: the importance coefficient of each station is calculated as the total number of passengers expected to arrive at that station on the day divided by the total number of passengers expected to arrive at all stations. A passenger has multiple ride records within a period of time, and the station importance coefficient is measured by the average, the maximum and the minimum of the importance coefficients of the terminal stations over all ride records.
Entropy of the terminal station distribution: all booking records of a single passenger are aggregated to obtain the set of terminal stations in those records, the frequency of each distinct item in the set is counted, and the entropy of the resulting discrete distribution is calculated.
Preferably, the companion relation (the relation of traveling together) is defined as follows: in the passenger booking flow data acquired within a time interval, if passenger A and passenger B board at the same starting station on the same day, travel to the same destination station, take the same train, ride in the same carriage, and buy their tickets with the same booking mode at the same window of the same station, and the requirements on support and confidence are met, then passenger A and passenger B have a companion relation.
Specifically, passengers often book and travel together with a certain group. Analyzing the companion relations of a passenger can reveal people who are closely related to the passenger or who have similar behavioral habits, and can uncover valuable patterns. The companion relation is described mainly by the number of companions.
Number of companions: in the flow data set acquired within a time interval, the number of passengers who have a companion relation with the passenger.
Companion relation analysis:
With incomplete information, there is no deterministic, universal rule for judging whether a companion relation exists between different passengers. The companion relation here is therefore not the "traveling together" of everyday speech; instead, association relationships are used to describe the co-travel relation among multiple passengers. When performing association analysis on ride transaction data, two key problems have to be handled: 1) performance: the ride transaction data set is large, so the time and space complexity of the computation is high; 2) some of the discovered patterns may be spurious, so the generated association rules must be evaluated and spurious patterns rejected.
Definition of the companion relation:
The companion relation here is based on the support-confidence framework and differs from the "traveling together" of daily life. For example, passenger A and passenger B travel together by train, but only a few times, so the support threshold is not met; in the absence of other useful information, it cannot be concluded that a companion relation exists between passenger A and passenger B.
Before the definition of the companion relation is given, the basic concepts of association analysis are defined first.
Transaction:
A transaction is a set of items with no repeated item. For example, if someone buys 2 loaves of bread, 1 bottle of beer and 3 packs of diapers in one online shopping session, the set {bread, beer, diapers} is called a transaction.
The elements of a transaction are called items. The width of a transaction is the number of elements in it.
Association rule:
An association rule is a rule of the form $X \rightarrow Y$ with $X \cap Y = \emptyset$, where X is the antecedent of the rule and Y is the consequent; X and Y have a certain causal relationship.
Support:
Suppose there is a set of transactions $T = \{t_1, t_2, \ldots, t_n\}$, where n is the number of transactions, and X is a set of items occurring in the transactions. The support count of X is defined as:
$$\sigma(X) = \left|\{\, t_i \mid X \subseteq t_i,\ t_i \in T \,\}\right|$$
where $|\cdot|$ denotes the number of elements in a set. The support measures how frequently the itemset X occurs in the transaction set T.
The support of the association rule $X \rightarrow Y$ is defined as:
$$s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{n}$$
Confidence:
The confidence of the association rule $X \rightarrow Y$ is defined as:
$$c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}$$
Confidence measures the reliability of the association rule $X \rightarrow Y$: the higher the confidence, the higher the probability that Y occurs in transactions containing X.
Definition of the companion relation:
Passengers who travel together usually satisfy certain conditions: for example, they board at the same starting station on the same day, travel to the same destination station, take the same train, ride in the same carriage with adjacent or nearby seats, and buy their tickets at the same window of the same station.
The companion relation is defined here as follows:
In the passenger booking flow data acquired within a time interval, if passenger A and passenger B board at the same starting station on the same day, travel to the same destination station, take the same train, ride in the same carriage, and buy their tickets with the same booking mode at the same window of the same station, and the requirements on support and confidence are met, then passenger A and passenger B have a companion relation.
It should be noted that: 1) the definition is based on the support-confidence framework and looks for frequent patterns; two passengers who travel together only once but satisfy all the other conditions of the companion relation may be filtered out because the minimum support threshold is not met.
2) The time interval may span several days, so that a passenger who travels on a given day, and only once that day, is not filtered out by the support rule.
3) The distance between passengers' seat numbers is not measured, because no clear rule determines within what distance two people have a companion relation and beyond what distance they do not. In real life, people traveling together are not necessarily seated next to each other either.
4) This definition fixes the exact scope of a transaction: several conditions must be met at the same time for a companion relation to exist. The scope of a transaction can be adjusted according to the actual data, but the average transaction width determines the depth of the search space of the association analysis, and the computational complexity of the algorithm grows exponentially as the average transaction width increases.
Analysis of ride transaction data:
According to the definition of the companion relation above, the ride booking data are mapped to key-value pairs whose key is the string obtained by concatenating date + train number + starting station + terminal station + carriage number + booking mode + booking station + booking window, and whose value is the passenger ID card number. The pairs are then reduced by key so that each value becomes a set of ID card numbers. The values of all key-value pairs are extracted and saved to an HDFS file; each value is one transaction.
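The map-and-reduce step can be illustrated on a single machine with a dictionary of sets. This is only a sketch of the grouping logic, not the distributed job or the HDFS output; the field names in `KEY_FIELDS`, the `id_number` field and the sample records are all hypothetical.

```python
from collections import defaultdict

# Hypothetical field names for the eight components of the key.
KEY_FIELDS = ("date", "train_no", "start_station", "end_station",
              "carriage_no", "booking_mode", "booking_station", "booking_window")

def build_transactions(records):
    """Group passenger IDs by the concatenated key; each group is one transaction."""
    groups = defaultdict(set)
    for record in records:
        key = "+".join(str(record[field]) for field in KEY_FIELDS)
        groups[key].add(record["id_number"])  # a set drops repeated IDs
    return [sorted(ids) for ids in groups.values()]

def make_record(id_number, carriage_no):
    """Helper that builds a sample record; every value here is made up."""
    return {"date": "20100710", "train_no": "G101", "start_station": "S1",
            "end_station": "S2", "carriage_no": carriage_no,
            "booking_mode": "window", "booking_station": "S1",
            "booking_window": "07", "id_number": id_number}

records = [make_record("ID_A", "05"), make_record("ID_B", "05"),
           make_record("ID_A", "05"), make_record("ID_C", "06")]
print(build_transactions(records))
```

Storing each group as a set means that one ID card number holding several tickets for the same train contributes a single item to the transaction, which keeps items within a transaction unique.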
A simple statistical analysis of the passenger booking data shows the following:
About 6,000,000 passengers travel per day on average, so the item space is very large.
Each train carries 855 people on average; under the definition of the companion relation above, the average transaction width is 10.
The ride booking data are very sparse: 62% of passengers bought only one ticket in this time interval, so the ride transaction data leave little room for compression. The histogram of passengers' average daily booking counts is shown in Fig. 11.
About 1.15% of people buy several tickets for the same train with the same ID card, which causes considerable difficulty for the companion relation analysis that follows.
Companion relation analysis method:
Because the ride transaction data leave little room for compression, the Apriori algorithm is used to analyze the association relationships between passengers. Apriori was the first association rule mining algorithm. It pioneered support-based pruning, which systematically controls the exponential growth of candidate itemsets. The main idea of the Apriori algorithm is: if an itemset is frequent, then all of its subsets are also frequent; if an itemset is infrequent, then all of its supersets must be infrequent.
The way Apriori generates frequent itemsets has two important characteristics: 1) it is a level-wise algorithm that starts from frequent 1-itemsets and iterates step by step until the longest frequent itemsets have been generated; 2) it finds frequent itemsets with a generate-and-test strategy. New candidate itemsets are generated from the frequent itemsets found in the previous iteration; the support of each candidate is then counted and compared with the minimum support threshold, and the candidates that meet the threshold are frequent itemsets.
The general idea of the algorithm is as follows:
1) First scan the whole data set once to determine the support of each item. After this step, all frequent 1-itemsets are known.
2) Use the frequent (k-1)-itemsets found in the previous iteration to generate candidate k-itemsets.
3) Scan the transaction data set, determine all candidate k-itemsets contained in each transaction, and compute the support count of each candidate.
4) Filter the candidate k-itemsets by the minimum support threshold to obtain all frequent k-itemsets.
5) Repeat steps 2) to 4) until no new frequent itemsets are generated or another stopping condition is met.
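The level-wise loop in steps 1) to 5) can be sketched as follows. This is a minimal, unoptimized illustration, assuming transactions are sets of passenger ID strings and the threshold is an absolute support count; the function name and sample IDs are hypothetical, and candidates are formed here by a simple pairwise union followed by subset pruning.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]
    # Step 1: one scan to count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support_count}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Step 2: candidate k-itemsets from pairs of frequent (k-1)-itemsets.
        candidates = {a | b for a, b in combinations(list(frequent), 2)
                      if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Step 3: count the support of each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Step 4: keep candidates that meet the threshold.
        frequent = {s: c for s, c in counts.items() if c >= min_support_count}
        all_frequent.update(frequent)
        k += 1  # Step 5: repeat until no frequent itemsets remain
    return all_frequent

transactions = [{"p1", "p2", "p3"}, {"p1", "p2"}, {"p1", "p2", "p3"},
                {"p2", "p3"}, {"p4"}]
for itemset, count in sorted(apriori(transactions, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```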
Candidate itemset generation:
In theory there are many ways to generate candidate itemsets. To reduce the time and space complexity of the computation, a generation method should meet the following requirements:
1) It should avoid generating too many unnecessary candidate itemsets.
2) It should ensure that the set of candidates is complete, so that no frequent itemset is omitted during generation, and it should not generate the same candidate more than once. During data preprocessing, all items of each transaction are sorted in ascending lexicographic order precisely to avoid generating repeated candidates, which would waste computing resources.
Common generation algorithms are:
1) The F_{k-1} × F_1 method: each frequent (k-1)-itemset is extended with frequent 1-itemsets. Suppose A is a frequent (k-1)-itemset and B is a frequent 1-itemset, with the items of both stored in lexicographic order; if the element of B is greater than every element of A, the union of A and B is taken as the extension. Although this is a clear improvement over brute-force enumeration, it still generates a large number of unnecessary candidates.
2) The F_{k-1} × F_{k-1} method: a pair of frequent (k-1)-itemsets is merged to generate a k-itemset, on the condition that the first (k-2) items of the two (k-1)-itemsets are identical. Suppose A and B are frequent (k-1)-itemsets whose items are stored in lexicographic order; if the first k-2 items of A and B are all identical (k >= 2), the union of A and B is taken as the extension, and the result is sorted in lexicographic order.
The second method is used here.
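The F_{k-1} × F_{k-1} merge can be sketched as below. The function name is hypothetical, and itemsets are assumed to be tuples whose items are already in ascending lexicographic order, as the preprocessing step requires.

```python
def generate_candidates(frequent_prev):
    """Merge pairs of sorted (k-1)-tuples that share their first k-2 items."""
    prev = sorted(set(frequent_prev))
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:
                # Same prefix: append b's last item; the result stays sorted.
                candidates.append(a + (b[-1],))
            else:
                # prev is sorted, so no later tuple shares a's prefix either.
                break
    return candidates

print(generate_candidates([(1, 2), (1, 3), (2, 3), (2, 5), (3, 5)]))
print(generate_candidates([(1,), (2,), (3,)]))
```

Because each candidate can only arise from the one pair that shares its first k-2 items, the merge never emits the same k-itemset twice. A candidate such as (2, 3, 5) may still contain an infrequent subset, so a separate pruning or counting step is needed afterwards.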
Candidate itemset support counting:
One method of support counting is brute force: all candidate itemsets are traversed, and each candidate is compared with every transaction to judge whether the candidate is contained in the transaction. This method has to traverse the transaction set many times; when both the transaction set and the number of candidate itemsets are large, the time and space costs are high.
Another method is to traverse the transaction set, enumerate the k-itemsets contained in each transaction, and update the support of the corresponding candidate itemsets. A prefix tree can be used to enumerate all possible k-itemsets of each transaction. To compare candidate itemsets with transactions, a hash tree is used: the candidate itemsets are partitioned into different buckets stored in the hash tree; during support counting, the branches of the hash tree are selected in turn according to the item sequence of the transaction, and the enumerated itemsets are finally matched against the candidates in the same bucket.
Suppose there is a transaction {1, 2, 3, 4, 5} whose items are arranged in ascending lexicographic order. All possible 3-itemsets are enumerated; the enumeration process is shown in Fig. 10:
1) First determine the depth of the prefix tree, depth = k, and initialize the iteration parameter i = 1.
2) While the condition i ≤ depth holds, continue the iteration.
3) Determine the start item and the end item of the nodes at level i: the start item is the item following the parent node's item, and the end item is the (3-i+1)-th item from the end of the transaction (in general, the (k-i+1)-th item from the end).
4) Take all items of the transaction between the start item and the end item, and generate one node for each.
5) i++, and go to the condition check in step 2).
6) Traverse all leaf nodes and, for each leaf node, output the path from the root node to that leaf node.
All 3-itemsets enumerated from the transaction {1, 2, 3, 4, 5} are shown in Fig. 11.
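The prefix-tree enumeration can be sketched recursively: each call corresponds to one tree level, and each completed path from the root to a leaf is one k-itemset. The function name is hypothetical, and the recursion stands in for the iterative description in steps 1) to 6).

```python
def enumerate_k_itemsets(transaction, k):
    """List all k-itemsets of a transaction in lexicographic order."""
    items = sorted(transaction)
    n = len(items)
    results = []

    def extend(prefix, start, level):
        if level > k:                      # reached a leaf: emit the path
            results.append(tuple(prefix))
            return
        end = n - (k - level + 1)          # (k - level + 1)-th item from the end
        for pos in range(start, end + 1):  # items between start item and end item
            extend(prefix + [items[pos]], pos + 1, level + 1)

    extend([], 0, 1)
    return results

for itemset in enumerate_k_itemsets([1, 2, 3, 4, 5], 3):
    print(itemset)
```

For the transaction {1, 2, 3, 4, 5} and k = 3, level 1 may only start with items 1, 2 or 3, since a later start would leave too few items to complete a 3-itemset; this bound is what the "end item" rule in step 3) expresses.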
A hash tree is built, and all candidate 3-itemsets generated by the method described in section 4.2.1 are placed in different leaf nodes. For each transaction, the corresponding branches are selected level by level according to the hierarchical structure of the hash tree, and the enumerated itemsets are then matched against the candidate itemsets in the same bucket; the process is shown in Fig. 12.
Candidate itemsets whose support does not meet the minimum support threshold are pruned directly, and in subsequent candidate iterations the infrequent itemsets are no longer extended.
Association rule generation:
The Apriori algorithm generates association rules level by level, where each level corresponds to the number of items in the rule consequent. Initially, all high-confidence rules whose consequent contains only one item are extracted; these rules are then used to generate new candidate rules. The process of generating association rules is similar to that of generating frequent itemsets. The difference is that, during rule generation, computing the confidence of a candidate rule does not require scanning the data set again: the confidence of each rule is determined from the support counts computed while the frequent itemsets were generated.
The rule generation algorithm is similar to the F_{k-1} × F_{k-1} method: first, all rules whose consequent contains only one item and which meet the confidence requirement are computed; then two rules whose consequents have (k-1) items are merged to obtain a rule whose consequent has k items; the iteration then proceeds level by level. The calculation proceeds as follows:
Association rule generation for the frequent itemset {1,2,3,4,5}.
The algorithm flow is as follows:
1) Traverse the set of frequent itemsets, rules, and extend each frequent itemset as follows:
2) Set depth = |rules|, where |·| denotes the number of elements in a set, and set d = 1
3) Check whether the condition d < depth holds
4) Apply the F_{k-1}×F_{k-1} method: merge the consequents of two rules with d-1 items each into the consequent of a rule with d items; the antecedent of the rule is the set of all items of the frequent itemset minus the newly generated consequent
5) Check whether the newly generated rule meets the minimum confidence requirement; if not, prune the node and do not extend the rules of that node afterwards
6) d = d+1, then jump to step 3)
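The level-wise flow above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `generate_rules`, the `support` dictionary layout (frozenset to support count) and the toy counts are all hypothetical.

```python
from itertools import combinations

def generate_rules(itemset, support, min_conf):
    """Level-wise rule generation from one frequent itemset.

    support maps frozenset -> support count and must contain the itemset
    and every antecedent that is evaluated (assumed precomputed)."""
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]   # level 1: single-item consequents
    d = 1
    while consequents and d < len(itemset):
        kept = []
        for cons in consequents:
            ante = itemset - cons                     # antecedent = itemset minus consequent
            conf = support[itemset] / support[ante]   # no new scan of the data set
            if conf >= min_conf:
                rules.append((ante, cons, conf))
                kept.append(cons)
            # otherwise the node is pruned and never extended
        # merge two surviving d-item consequents into one (d+1)-item consequent
        consequents = list({a | b for a, b in combinations(kept, 2) if len(a | b) == d + 1})
        d += 1
    return rules

# Toy support counts (hypothetical data)
support = {
    frozenset([1]): 4, frozenset([2]): 4, frozenset([3]): 4,
    frozenset([1, 2]): 3, frozenset([1, 3]): 3, frozenset([2, 3]): 3,
    frozenset([1, 2, 3]): 2,
}
rules = generate_rules({1, 2, 3}, support, min_conf=0.6)
for ante, cons, conf in rules:
    print(sorted(ante), "->", sorted(cons), round(conf, 3))
```

Because confidence can only decrease when items move from the antecedent to the consequent, pruning a low-confidence consequent does not discard any rule that would have met the threshold.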
Association rule evaluation:
The Apriori algorithm is built on the support-confidence framework of association analysis. The confidence measure has a limitation, however: it ignores the support of the itemset that appears in the rule consequent, so a high-confidence rule can sometimes be misleading. To address this problem, the reliability of a rule is measured by its lift. Lift is defined as follows:
$$\mathrm{lift}(X \rightarrow Y) = \frac{c(X \rightarrow Y)}{s(Y)} = \frac{s(X \cup Y)}{s(X)\, s(Y)}$$
where s(·) denotes support and c(·) denotes confidence. When lift(X→Y) > 1, X and Y are considered positively correlated. When lift(X→Y) < 1, X and Y are considered negatively correlated. When lift(X→Y) = 1, X and Y are considered independent of each other.
In riding transactions, rules with lift(X→Y) > 1 are selected, and X and Y are considered to have a companion relationship.
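As a small worked example of lift, computed as the support of X and Y together divided by the product of their individual supports, the sketch below classifies a rule by its lift value. The function names and the tolerance `eps` are assumptions added for illustration.

```python
def lift(support_xy, support_x, support_y):
    """Lift of the rule X -> Y from fractional supports."""
    return support_xy / (support_x * support_y)

def correlation(lift_value, eps=1e-9):
    """Interpret a lift value; eps absorbs floating-point rounding (assumed tolerance)."""
    if lift_value > 1 + eps:
        return "positive"
    if lift_value < 1 - eps:
        return "negative"
    return "independent"

# X appears in 40% of transactions, Y in 50%, and both together in 30%
value = lift(0.3, 0.4, 0.5)
print(round(value, 3), correlation(value))
```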
Preferably, extracting the booking behavior pattern features of the passenger from the attribute information of the passenger, the trip purpose distribution, the booking counts, the train type distribution, the booking mode distribution, the starting station distribution, the terminus distribution and the companion relationship includes:
Characterizing the hidden state of a passenger with a vector, converting discrete, qualitative data labels into a continuous, quantitative data form;
Performing statistical analysis on the booking data within a preset time interval and describing the most probable booking mode of the passenger by its maximum likelihood probability, converting qualitative, labeled data into a continuous, quantitative data form;
Aggregating the booking modes in all riding records of the passenger within the preset time interval, and calculating the entropy of the booking mode distribution and the number of different booking modes.
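The conversion from labeled booking modes to quantitative features can be sketched as follows. The mode labels, the function name `booking_mode_features` and the choice of base-2 logarithm for the entropy are assumptions; the source does not specify the logarithm base.

```python
from collections import Counter
from math import log2

def booking_mode_features(modes):
    """Turn a passenger's list of booking-mode labels into numeric features."""
    counts = Counter(modes)
    total = len(modes)
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * log2(p) for p in probs)          # entropy of the discrete distribution
    top_mode, top_count = counts.most_common(1)[0]      # most frequent booking mode
    return {
        "n_modes": len(counts),                          # number of different booking modes
        "entropy": entropy,
        "top_mode": top_mode,
        "top_prob": top_count / total,                   # maximum likelihood probability
    }

features = booking_mode_features(["online", "online", "window", "online"])
print(features)
```

A passenger who always books the same way gets entropy 0, while a passenger who spreads bookings evenly over many modes gets a high entropy.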
Step 102: the booking behavior pattern of a passenger is characterized by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space; if the type of the passenger is unknown, the problem of determining the passenger class is converted into the process of fitting, by learning, the probability density distribution of the passenger hidden state vector.
Specifically, the booking behavior pattern of a passenger is characterized here by a state vector, so each passenger can be regarded as a point in a high-dimensional space. The type of the passenger is unknown, and determining the passenger type is treated as a generative learning process; the passenger class determination problem is thereby converted into the process of fitting the probability density distribution of the passenger hidden state vector.
Suppose the class of a passenger is z, where z is a hidden variable. Given the passenger class, the conditional distribution of the passenger hidden state vector y follows a multidimensional Gaussian distribution. A linear weighted sum of multiple Gaussian models is used to fit the probability density curve of the passengers, which models the fact that multiple classes may exist in reality. Because the data set is large, it follows from the central limit theorem that it is reasonable to assume that the conditional distribution of the passenger vector y given the passenger class z is normal. In theory, a weighted sum of multiple Gaussian models can approximate any probability distribution to arbitrary precision. Each Gaussian model represents one class: for a passenger vector whose class is to be determined, the probability that the passenger belongs to each class is computed, and the class with the highest probability is taken as the determination result.
Preferably, characterizing the booking behavior pattern of a passenger by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the type of the passenger is unknown, converting the problem of determining the passenger class into the process of fitting the probability density distribution of the passenger hidden state vector, includes:
Letting the class of the passenger be z, where z is a hidden variable; given the passenger class, the conditional distribution of the passenger hidden state vector y follows a multidimensional Gaussian distribution;
Using a linear weighted sum of multiple Gaussian models to fit the probability density curve of the passengers, modeling the fact that multiple classes may exist in reality;
Assuming, according to the central limit theorem, that the conditional distribution of the passenger vector y given the passenger class z is normal; a weighted sum of multiple Gaussian models approximates any probability distribution to arbitrary precision, and each Gaussian model represents one class; for a passenger vector whose class is to be determined, the probability that the passenger belongs to each class is computed, and the class with the highest probability is selected as the determination result.
Specifically, the GMM model:
Assuming that the passenger hidden state vector follows a Gaussian mixture model, the probability of observing the j-th passenger can be described as:
$$P(y_j \mid \theta) = \sum_{k=1}^{K} \alpha_k\, \phi(y_j; \theta_k), \qquad \alpha_k \ge 0, \quad \sum_{k=1}^{K} \alpha_k = 1$$
$$\phi(y_j; \theta_k) = (2\pi)^{-n/2}\, |C_k|^{-1/2} \exp\left(-\tfrac{1}{2}\,(y_j - \mu_k)^{T} C_k^{-1} (y_j - \mu_k)\right)$$
$$\theta_k = (\mu_k, C_k)$$
Wherein: K is the number of Gaussian sub-models; α_k is the probability that the passenger class is k; φ(y_j; θ_k) is the conditional probability density function of y_j given that the passenger class is k, namely the density of a multidimensional Gaussian distribution with mean vector μ_k and covariance matrix C_k; C_k is a positive definite matrix; and n is the dimension of the vector y_j.
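The class determination described above (compute the probability of each class, then pick the largest) can be sketched as follows. To stay within the Python standard library, the sketch assumes that each covariance matrix C_k is diagonal, which is a simplification not stated in the source; the function names and the parameter values are hypothetical.

```python
import math

def gaussian_pdf_diag(y, mean, var):
    """Density of a Gaussian with diagonal covariance (assumption: C_k is diagonal)."""
    log_p = 0.0
    for yi, mi, vi in zip(y, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (yi - mi) ** 2 / vi)
    return math.exp(log_p)

def classify(y, alphas, means, variances):
    """Return (most probable class index, posterior probabilities) for vector y."""
    weighted = [a * gaussian_pdf_diag(y, m, v)
                for a, m, v in zip(alphas, means, variances)]
    total = sum(weighted)                    # mixture density P(y | theta)
    posteriors = [w / total for w in weighted]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors

# Two hypothetical passenger classes in a 2-dimensional feature space
alphas = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
label, posteriors = classify([0.2, -0.1], alphas, means, variances)
print(label, [round(p, 4) for p in posteriors])
```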
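The passenger hidden state vector can be thought of as being generated in two steps: 1) select the k-th Gaussian sub-model φ(y_j; θ_k) with probability α_k; 2) generate the observation y_j from the probability density function φ(y_j; θ_k) of the selected sub-model. The observation y_j is then known, but which sub-model it came from is unknown; this is represented by the hidden variable γ_jk, defined as follows:
γ_jk = 1, if the j-th observation comes from the k-th sub-model;
γ_jk = 0, otherwise.
With the observed data y_j and the unobserved data γ_jk, the complete data are (y_j, γ_j1, γ_j2, ..., γ_jK), j = 1, 2, ..., N, where K is the number of sub-models. The likelihood function of the complete data can then be written as:
$$P(y, \gamma \mid \theta) = \prod_{j=1}^{N} \prod_{k=1}^{K} \left[\alpha_k\, \phi(y_j; \theta_k)\right]^{\gamma_{jk}} = \prod_{k=1}^{K} \alpha_k^{\,n_k} \prod_{j=1}^{N} \left[\phi(y_j; \theta_k)\right]^{\gamma_{jk}}$$
Wherein: $n_k = \sum_{j=1}^{N} \gamma_{jk}$ is the number of observations that come from the k-th sub-model, N is the total number of observations, and $\sum_{k=1}^{K} n_k = N$.
The log-likelihood function of the complete data is therefore:
$$\log P(y, \gamma \mid \theta) = \sum_{k=1}^{K} \left\{ n_k \log \alpha_k + \sum_{j=1}^{N} \gamma_{jk} \left[ -\tfrac{n}{2} \log(2\pi) - \tfrac{1}{2} \log |C_k| - \tfrac{1}{2} (y_j - \mu_k)^{T} C_k^{-1} (y_j - \mu_k) \right] \right\}$$
where n is the dimension of the vector y_j.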
The model is solved with the EM (Expectation Maximization) iterative algorithm. The EM algorithm guarantees that the process of maximizing the log-likelihood function converges, but it does not guarantee that the optimal solution is found. The EM algorithm can be run several times, and the run with the largest log-likelihood value taken, so that the model achieves a reasonably good fit.
Iterative solution with the EM algorithm:
The EM algorithm is an iterative algorithm and requires initial values to be specified manually. Training a good GMM model is also a difficult task, because the choice of initial values can strongly affect the final output of the model and can even determine whether the model converges; at certain initial points the EM algorithm becomes unstable or fails. The choice of initial values also affects the convergence speed of the algorithm. Parameter initialization is therefore a crucial step in EM parameter estimation.
Model parameter initialization:
Commonly used model parameter initialization strategies are:
Grid search
A coordinate axis is set up for each free parameter and the value range of each parameter is determined; values are then taken for each parameter at equal intervals, and the Cartesian product of the value sets of all parameters is formed. The result is a grid in a high-dimensional space in which each point represents one assignment of all parameters. Each grid point is substituted into the EM algorithm as an initial value, and the best solution over all cases is taken.
The main problem of the grid search strategy is that the computational cost grows exponentially with the number of free parameters.
Random search
This strategy is similar to grid search, except that several values are selected at random within the value range of each parameter. A marginal distribution can be defined for each free parameter; for discrete parameters, a Bernoulli distribution, a multinomial distribution and so on can be used. The random search strategy is more convenient to use than the grid search strategy and converges faster.
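The two initialization strategies can be sketched as candidate generators. The sketch only produces the candidate initial points; each point would then be passed to the EM algorithm as a starting value. The function names, the parameter ranges and the use of a uniform marginal distribution in the random strategy are assumptions made for illustration.

```python
import random
from itertools import product

def grid_candidates(ranges, steps):
    """Equally spaced values per parameter, combined by Cartesian product (steps >= 2)."""
    axes = []
    for low, high in ranges:
        axes.append([low + i * (high - low) / (steps - 1) for i in range(steps)])
    return list(product(*axes))

def random_candidates(ranges, n, seed=0):
    """n random points; each parameter is drawn uniformly from its range (assumed marginal)."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(low, high) for low, high in ranges) for _ in range(n)]

ranges = [(0.0, 10.0), (0.0, 10.0)]       # hypothetical ranges for two initial means
grid = grid_candidates(ranges, steps=3)   # steps ** len(ranges) points
rand = random_candidates(ranges, n=5)
print(len(grid), len(rand))
```

The grid size is `steps ** len(ranges)`, which shows the exponential growth mentioned above, whereas the number of random candidates is chosen independently of the number of parameters.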
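Expectation Step
The expectation of the complete-data log-likelihood function log P(y, γ | θ) with respect to the conditional probability distribution P(γ | Y, θ^(i)) of the unobserved data γ, given the observed data Y and the current parameter θ^(i), is called the Q function:
$$Q(\theta, \theta^{(i)}) = E\left[\log P(y, \gamma \mid \theta) \mid y, \theta^{(i)}\right]$$
Note: i is the index of the iteration in the process of estimating the model parameters with the EM algorithm, j is the index of a data point in the data set, and k is the index of a sub-model; K denotes the number of sub-models.
Here E[γ_jk | y, θ] needs to be computed, denoted $\hat{\gamma}_{jk}$. It is the probability, under the current model parameters, that the j-th observation comes from the k-th sub-model, and is called the responsiveness of sub-model k to the observation y_j:
$$\hat{\gamma}_{jk} = \frac{\alpha_k\, \phi(y_j; \theta_k)}{\sum_{k=1}^{K} \alpha_k\, \phi(y_j; \theta_k)}$$
Substituting $\hat{\gamma}_{jk} = E[\gamma_{jk}]$ and $n_k = \sum_{j=1}^{N} E[\gamma_{jk}]$ into the complete-data log-likelihood gives:
$$Q(\theta, \theta^{(i)}) = \sum_{k=1}^{K} \left\{ n_k \log \alpha_k + \sum_{j=1}^{N} \hat{\gamma}_{jk} \left[ -\tfrac{n}{2} \log(2\pi) - \tfrac{1}{2} \log |C_k| - \tfrac{1}{2} (y_j - \mu_k)^{T} C_k^{-1} (y_j - \mu_k) \right] \right\}$$
Maximization Step
The M step of the iteration finds the maximum of the function Q(θ, θ^(i)) with respect to θ, i.e.:
$$\theta^{(i+1)} = \arg\max_{\theta} Q(\theta, \theta^{(i)})$$
Taking the partial derivatives of the Q function and setting them to 0 (for α_k, under the constraint that the α_k sum to 1) gives:
$$\hat{\mu}_k = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk}\, y_j}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}, \qquad \hat{C}_k = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk}\, (y_j - \hat{\mu}_k)(y_j - \hat{\mu}_k)^{T}}{\sum_{j=1}^{N} \hat{\gamma}_{jk}}, \qquad \hat{\alpha}_k = \frac{n_k}{N} = \frac{\sum_{j=1}^{N} \hat{\gamma}_{jk}}{N}$$
The E step and the M step are repeated until the algorithm converges. The convergence criterion can be that a given number of steps has been reached, or that the change of the Q function between two successive iterations is smaller than a threshold.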
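The E and M steps can be sketched as a short loop. To remain runnable with the standard library only, the sketch is restricted to one-dimensional observations, so each C_k reduces to a scalar variance. It also stops on the change of the observed-data log-likelihood instead of the change of the Q function, and takes the initial means as an explicit argument. These are simplifications and assumptions, as are the function names and the sample data.

```python
import math

def normal_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, init_mus, n_iter=100, tol=1e-8):
    """EM for a one-dimensional Gaussian mixture (simplified sketch)."""
    k, n = len(init_mus), len(data)
    mus = list(init_mus)
    mean = sum(data) / n
    variances = [sum((y - mean) ** 2 for y in data) / n] * k  # start from the overall variance
    alphas = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(n_iter):
        # E step: responsiveness of each sub-model to each observation
        resp, ll = [], 0.0
        for y in data:
            w = [a * normal_pdf(y, m, v) for a, m, v in zip(alphas, mus, variances)]
            s = sum(w)
            ll += math.log(s)
            resp.append([wi / s for wi in w])
        # M step: closed-form updates of mean, variance and weight
        for c in range(k):
            nk = sum(r[c] for r in resp)
            mus[c] = sum(r[c] * y for r, y in zip(resp, data)) / nk
            var = sum(r[c] * (y - mus[c]) ** 2 for r, y in zip(resp, data)) / nk
            variances[c] = max(var, 1e-6)    # floor avoids a collapsing component
            alphas[c] = nk / n
        if abs(ll - prev_ll) < tol:          # stop when the log-likelihood stops changing
            break
        prev_ll = ll
    return alphas, mus, variances

data = [-0.5, -0.2, 0.0, 0.1, 0.6, 9.4, 9.9, 10.0, 10.2, 10.5]
alphas, mus, variances = em_gmm_1d(data, init_mus=[1.0, 8.0])
print([round(a, 3) for a in alphas], [round(m, 3) for m in mus])
```

Running the function from several initial points and keeping the result with the largest log-likelihood corresponds to the multiple-run strategy described earlier.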
The method of ticketing information analysis provided by the embodiment of the present invention extracts the booking behavior pattern features of a passenger from the attribute information of the passenger, the trip purpose distribution, the booking counts, the train type distribution, the booking mode distribution, the starting station distribution, the terminus distribution and the companion relationship; the booking behavior pattern of the passenger is characterized by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and if the type of the passenger is unknown, the problem of determining the passenger class is converted into the process of fitting the probability density distribution of the passenger hidden state vector. The latent state of a passenger can thus be tracked dynamically and the booking behavior pattern estimated accurately, with a certain tolerance of erroneous data. An iterative state estimation algorithm is developed that can extract valid behavior features from massive data in parallel and identify abnormal patterns of passenger booking behavior, with computational efficiency that meets real-time requirements. The output of the model should satisfy the stability requirement: the determination of a given person's booking pattern should remain consistent within a given period of time. The output of the model can be presented from multiple aspects and angles, which is convenient for the relevant data analysts.
Referring to Figure 13, Figure 13 is a schematic diagram of the functional modules of a device for ticketing information analysis provided by an embodiment of the present invention.
As shown in Figure 13, the device for ticketing information analysis includes:
An extraction module 1301, configured to extract the booking behavior pattern features of a passenger from the attribute information of the passenger, the trip purpose distribution, the booking counts, the train type distribution, the booking mode distribution, the starting station distribution, the terminus distribution and the companion relationship;
A fitting module 1302, configured to characterize the booking behavior pattern of a passenger by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the type of the passenger is unknown, to convert the problem of determining the passenger class into the process of fitting the probability density distribution of the passenger hidden state vector.
Preferably, the attribute information of the passenger includes passenger age distribution information obtained by counting the data file by age;
The trip purpose distribution is obtained by parsing the province-level native place code from the ID card information and comparing it with the administrative division codes of the starting station and the terminus, to judge whether the native place code equals the administrative division code of the starting station and of the terminus; passengers are divided into a preset number of classes by trip purpose, the classes being mutually exclusive and collectively exhaustive; wherein odh means that the starting station and the terminus both match the native place, i.e. a short trip within the home province; odo means that the starting station and the terminus match each other but lie in a province other than the home province, i.e. a short trip within another province; o means leaving the home province to travel to another province; d means returning home from another province; other means all other cases;
The booking counts include the ticket-change count, the refund count and the valid booking count; the ticket-change count is the number of booking records whose state is 3; the refund count is the number of booking records whose state is 2; the valid booking count is the number of booking records whose state is 5;
The train type distribution is obtained as follows: since a passenger has multiple riding records, a sequence of different train types is obtained; the economy, speed and comfort scores of each train type are calculated, summed and divided by the number of riding records; the score of each index is the average of the corresponding index over all riding records within the preset time;
The booking mode distribution is obtained as follows: since a passenger has multiple riding records, a sequence of different booking modes is obtained; the economy and speed scores of each train type are calculated, summed and divided by the number of riding records; the score of each index is the average of the corresponding index over all riding records within the preset time;
The starting station distribution includes the number of starting stations, the maximum likelihood probability, the station importance coefficient and the entropy of the starting station distribution; the number of starting stations is obtained by aggregating all booking records of a passenger, keyed by the passenger's ID card number, and counting the different starting stations that appear in those records; the maximum likelihood probability is that of the station that occurs most often in the booking records of each passenger, calculated as its frequency divided by the total number of bookings; the station importance coefficient of each station is calculated as the total number of passengers departing from the station on the given day divided by the total number of passengers departing from all stations; the entropy of the starting station distribution is obtained by aggregating all booking records of a single passenger, collecting the set of starting stations in those records, counting the frequency of each different item in the set, and calculating the entropy of the discrete distribution;
The terminus distribution includes the number of termini, the maximum likelihood probability, the station importance coefficient and the entropy of the terminus distribution; the number of termini is obtained by aggregating all booking records of a passenger, keyed by the passenger's ID card number, and counting the different termini that appear in those records; the maximum likelihood probability is that of the station that occurs most often in the booking records of each passenger, calculated as its frequency divided by the total number of bookings; the station importance coefficient of each station is calculated as the total number of passengers expected to arrive at the station on the given day divided by the total number of passengers expected to arrive at all stations; the entropy of the terminus distribution is obtained by aggregating all booking records of a single passenger, collecting the set of termini in those records, counting the frequency of each different item in the set, and calculating the entropy of the discrete distribution.
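Three of the features above can be sketched in a few lines: the trip purpose class, the booking counts by record state, and the station importance coefficient. The function names, the record layout and the sample values are hypothetical; the sketch also assumes that the province codes have already been extracted (for example as the leading digits of the ID card number and of the station's administrative division code).

```python
from collections import Counter

STATUS_REFUND, STATUS_CHANGE, STATUS_VALID = 2, 3, 5   # record states named in the text

def trip_purpose(native, origin, dest):
    """Classify a trip from province codes of native place, starting station and terminus."""
    if origin == native and dest == native:
        return "odh"      # short trip within the home province
    if origin == dest:
        return "odo"      # short trip within another province
    if origin == native:
        return "o"        # leaving the home province
    if dest == native:
        return "d"        # returning home from another province
    return "other"

def booking_counts(states):
    """Count ticket changes, refunds and valid bookings from record states."""
    c = Counter(states)
    return {"changes": c[STATUS_CHANGE], "refunds": c[STATUS_REFUND], "valid": c[STATUS_VALID]}

def station_importance(passengers_by_station):
    """Share of the day's passengers handled by each station."""
    total = sum(passengers_by_station.values())
    return {s: n / total for s, n in passengers_by_station.items()}

print(trip_purpose("11", "11", "31"))
print(booking_counts([5, 5, 3, 2, 5]))
print(station_importance({"S1": 300, "S2": 100, "S3": 600}))
```

The order of the checks in `trip_purpose` makes the five classes mutually exclusive, and the final `other` branch makes them exhaustive.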
Preferably, the companion relationship is determined as follows: in the passenger booking stream data collected within a time interval, if passenger A and passenger B board on the same day at the same starting station, travel to the same destination station, take the same train in the same carriage, and booked using the same booking mode at the same window of the same station, and the support and confidence requirements are met, then passenger A and passenger B have a companion relationship.
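Finding candidate companion pairs can be sketched as a grouping step: records that agree on all of the listed conditions fall into the same group, and every pair of passengers in a group co-travels once. The field names, the record layout and the sample records are hypothetical; the resulting pair counts would still have to be checked against the support and confidence thresholds.

```python
from collections import defaultdict
from itertools import combinations

# Fields that must all match for two bookings to count as travelling together (assumed names)
KEY_FIELDS = ("date", "origin", "dest", "train", "carriage", "mode", "station", "window")

def co_travel_counts(records):
    """Count, for each passenger pair, how many trips they share on every key field."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[f] for f in KEY_FIELDS)].add(r["passenger"])
    pair_counts = defaultdict(int)
    for passengers in groups.values():
        for a, b in combinations(sorted(passengers), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)

base = dict(origin="S1", dest="S2", train="G1", carriage=3,
            mode="window", station="S1", window=7)
records = [
    dict(base, passenger="A", date="2016-01-01"),
    dict(base, passenger="B", date="2016-01-01"),
    dict(base, passenger="A", date="2016-01-08"),
    dict(base, passenger="B", date="2016-01-08"),
    dict(base, passenger="C", date="2016-01-08", carriage=5),  # different carriage
]
pairs = co_travel_counts(records)
print(pairs)
```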
Preferably, the extraction module 1301 is specifically configured to:
Characterize the hidden state of a passenger with a vector, converting discrete, qualitative data labels into a continuous, quantitative data form;
Perform statistical analysis on the booking data within a preset time interval and describe the most probable booking mode of the passenger by its maximum likelihood probability, converting qualitative, labeled data into a continuous, quantitative data form;
Aggregate the booking modes in all riding records of the passenger within the preset time interval, and calculate the entropy of the booking mode distribution and the number of different booking modes.
Preferably, the fitting module 1302 is specifically configured to:
Let the class of the passenger be z, where z is a hidden variable; given the passenger class, the conditional distribution of the passenger hidden state vector y follows a multidimensional Gaussian distribution;
Use a linear weighted sum of multiple Gaussian models to fit the probability density curve of the passengers, modeling the fact that multiple classes may exist in reality;
Assume, according to the central limit theorem, that the conditional distribution of the passenger vector y given the passenger class z is normal; a weighted sum of multiple Gaussian models approximates any probability distribution to arbitrary precision, and each Gaussian model represents one class; for a passenger vector whose class is to be determined, the probability that the passenger belongs to each class is computed, and the class with the highest probability is selected as the determination result.
The device for ticketing information analysis provided by the embodiment of the present invention extracts the booking behavior pattern features of a passenger from the attribute information of the passenger, the trip purpose distribution, the booking counts, the train type distribution, the booking mode distribution, the starting station distribution, the terminus distribution and the companion relationship; the booking behavior pattern of the passenger is characterized by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and if the type of the passenger is unknown, the problem of determining the passenger class is converted into the process of fitting the probability density distribution of the passenger hidden state vector. The latent state of a passenger can thus be tracked dynamically and the booking behavior pattern estimated accurately, with a certain tolerance of erroneous data. An iterative state estimation algorithm is developed that can extract valid behavior features from massive data in parallel and identify abnormal patterns of passenger booking behavior, with computational efficiency that meets real-time requirements. The output of the model should satisfy the stability requirement: the determination of a given person's booking pattern should remain consistent within a given period of time. The output of the model can be presented from multiple aspects and angles, which is convenient for the relevant data analysts.
The technical principle of the embodiments of the present invention has been described above with reference to specific embodiments. These descriptions are intended only to explain the principle of the embodiments of the present invention and shall not be construed in any way as limiting the scope of protection of the embodiments. Based on the explanation herein, those skilled in the art can conceive of other specific implementations of the embodiments of the present invention without creative effort, and such implementations fall within the scope of protection of the embodiments of the present invention.

Claims (10)

  1. A method of ticketing information analysis, characterized in that the method comprises:
    Extracting the booking behavior pattern features of a passenger from the attribute information of the passenger, the trip purpose distribution, the booking counts, the train type distribution, the booking mode distribution, the starting station distribution, the terminus distribution and the companion relationship;
    Characterizing the booking behavior pattern of the passenger by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the type of the passenger is unknown, converting the problem of determining the passenger class into the process of fitting the probability density distribution of the passenger hidden state vector.
  2. The method according to claim 1, characterized in that the attribute information of the passenger includes passenger age distribution information obtained by counting the data file by age;
    The trip purpose distribution is obtained by parsing the province-level native place code from the ID card information and comparing it with the administrative division codes of the starting station and the terminus, to judge whether the native place code equals the administrative division code of the starting station and of the terminus; passengers are divided into a preset number of classes by trip purpose, the classes being mutually exclusive and collectively exhaustive; wherein odh means that the starting station and the terminus both match the native place, i.e. a short trip within the home province; odo means that the starting station and the terminus match each other but lie in a province other than the home province, i.e. a short trip within another province; o means leaving the home province to travel to another province; d means returning home from another province; other means all other cases;
    The booking counts include the ticket-change count, the refund count and the valid booking count; the ticket-change count is the number of booking records whose state is 3; the refund count is the number of booking records whose state is 2; the valid booking count is the number of booking records whose state is 5;
    The train type distribution is obtained as follows: since a passenger has multiple riding records, a sequence of different train types is obtained; the economy, speed and comfort scores of each train type are calculated, summed and divided by the number of riding records; the score of each index is the average of the corresponding index over all riding records within the preset time;
    The booking mode distribution is obtained as follows: since a passenger has multiple riding records, a sequence of different booking modes is obtained; the economy and speed scores of each train type are calculated, summed and divided by the number of riding records; the score of each index is the average of the corresponding index over all riding records within the preset time;
    The starting station distribution includes the number of starting stations, the maximum likelihood probability, the station importance coefficient and the entropy of the starting station distribution; the number of starting stations is obtained by aggregating all booking records of a passenger, keyed by the passenger's ID card number, and counting the different starting stations that appear in those records; the maximum likelihood probability is that of the station that occurs most often in the booking records of each passenger, calculated as its frequency divided by the total number of bookings; the station importance coefficient of each station is calculated as the total number of passengers departing from the station on the given day divided by the total number of passengers departing from all stations; the entropy of the starting station distribution is obtained by aggregating all booking records of a single passenger, collecting the set of starting stations in those records, counting the frequency of each different item in the set, and calculating the entropy of the discrete distribution;
    The terminus distribution includes the number of termini, the maximum likelihood probability, the station importance coefficient and the entropy of the terminus distribution; the number of termini is obtained by aggregating all booking records of a passenger, keyed by the passenger's ID card number, and counting the different termini that appear in those records; the maximum likelihood probability is that of the station that occurs most often in the booking records of each passenger, calculated as its frequency divided by the total number of bookings; the station importance coefficient of each station is calculated as the total number of passengers expected to arrive at the station on the given day divided by the total number of passengers expected to arrive at all stations; the entropy of the terminus distribution is obtained by aggregating all booking records of a single passenger, collecting the set of termini in those records, counting the frequency of each different item in the set, and calculating the entropy of the discrete distribution.
  3. The method according to claim 1, characterized in that the companion relationship is determined as follows: in the passenger booking stream data collected within a time interval, if passenger A and passenger B board on the same day at the same starting station, travel to the same destination station, take the same train in the same carriage, and booked using the same booking mode at the same window of the same station, and the support and confidence requirements are met, then passenger A and passenger B have a companion relationship.
  4. The method according to any one of claims 1 to 3, characterized in that extracting the booking behavior pattern features of the passenger from the attribute information of the passenger, the trip purpose distribution, the booking counts, the train type distribution, the booking mode distribution, the starting station distribution, the terminus distribution and the companion relationship comprises:
    Characterizing the hidden state of a passenger with a vector, converting discrete, qualitative data labels into a continuous, quantitative data form;
    Performing statistical analysis on the booking data within a preset time interval and describing the most probable booking mode of the passenger by its maximum likelihood probability, converting qualitative, labeled data into a continuous, quantitative data form;
    Aggregating the booking modes in all riding records of the passenger within the preset time interval, and calculating the entropy of the booking mode distribution and the number of different booking modes.
  5. The method according to claim 4, characterized in that characterizing the booking behavior pattern of the passenger by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the type of the passenger is unknown, converting the problem of determining the passenger class into the process of fitting the probability density distribution of the passenger hidden state vector, comprises:
    Letting the class of the passenger be z, where z is a hidden variable; given the passenger class, the conditional distribution of the passenger hidden state vector y follows a multidimensional Gaussian distribution;
    Using a linear weighted sum of multiple Gaussian models to fit the probability density curve of the passengers, modeling the fact that multiple classes may exist in reality;
    Assuming, according to the central limit theorem, that the conditional distribution of the passenger vector y given the passenger class z is normal; a weighted sum of multiple Gaussian models approximates any probability distribution to arbitrary precision, and each Gaussian model represents one class; for a passenger vector whose class is to be determined, the probability that the passenger belongs to each class is computed, and the class with the highest probability is selected as the determination result.
  6. A device for ticketing information analysis, characterized in that the device comprises:
    An extraction module, configured to extract the booking behavior pattern features of a passenger from the attribute information of the passenger, the trip purpose distribution, the booking counts, the train type distribution, the booking mode distribution, the starting station distribution, the terminus distribution and the companion relationship;
    A fitting module, configured to characterize the booking behavior pattern of the passenger by a passenger hidden state vector, so that each passenger is a point in a high-dimensional space, and, if the type of the passenger is unknown, to convert the problem of determining the passenger class into the process of fitting the probability density distribution of the passenger hidden state vector.
  7. The device according to claim 6, characterized in that the attribute information of the passenger includes passenger age distribution information obtained by counting the data file by age;
    The trip purpose distribution is obtained by taking the native-province code parsed from the ID card information, comparing it with the administrative division codes of the starting station and the terminus to judge whether they are equal, and dividing passengers by trip purpose into a predetermined number of categories that neither overlap nor leave any case uncovered; wherein odh indicates that the starting station and the terminus are both consistent with the native place, i.e. a short-distance trip within the home province; odo indicates that the starting station is consistent with the terminus but both lie outside the home province, i.e. a short-distance trip within another province; o indicates leaving the home province for another province; d indicates returning home from another province; other indicates all remaining cases;
    The number of bookings includes the number of ticket changes, the number of refunds and the number of valid bookings; the number of ticket changes is the frequency of records whose state is 3 in the booking records; the number of refunds is the frequency of records whose state is 2 in the booking records; the number of valid bookings is the frequency of records whose state is 5 in the booking records;
    The train type distribution is obtained by deriving the sequence of train types from the passenger's multiple ride records, calculating the economy, speed and comfort scores of each train type, summing them and dividing by the number of ride records; each index score is the average of the corresponding index over all ride records within a preset time;
    The booking mode distribution is obtained by deriving the sequence of booking modes from the passenger's multiple ride records, calculating the economy and speed scores of each train type, summing them and dividing by the number of ride records; each index score is the average of the corresponding index over all ride records within a preset time;
    The starting station distribution includes the number of starting stations, the maximum likelihood probability, the station importance coefficient and the entropy of the starting station distribution; the number of starting stations is obtained by aggregating all booking records of a passenger, keyed by the passenger's ID card number, and counting the distinct starting stations that appear in those records; the maximum likelihood probability refers to the station that appears most often in each passenger's booking records, and is calculated as that station's frequency divided by the total number of bookings; the station importance coefficient of each station is calculated as the total number of passengers dispatched from the station on the day divided by the total number of passengers dispatched from all stations; the entropy of the starting station distribution is obtained by aggregating all booking records of a single passenger, taking the set of starting stations in those records, counting the frequency of each distinct item in the set, and calculating the entropy of the resulting discrete distribution;
    The terminus distribution includes the number of termini, the maximum likelihood probability, the station importance coefficient and the entropy of the terminus distribution; the number of termini is obtained by aggregating all booking records of a passenger, keyed by the passenger's ID card number, and counting the distinct termini that appear in those records; the maximum likelihood probability refers to the station that appears most often in each passenger's booking records, and is calculated as that station's frequency divided by the total number of bookings; the station importance coefficient of each station is calculated as the total number of passengers expected to arrive at the station on the day divided by the total number of passengers expected to arrive at all stations; the entropy of the terminus distribution is obtained by aggregating all booking records of a single passenger, taking the set of termini in those records, counting the frequency of each distinct item in the set, and calculating the entropy of the resulting discrete distribution.
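The trip purpose categories, the booking counts by record state, and the station importance coefficient described in claim 7 can be sketched roughly as follows. This is only an illustration: the function names, the use of province codes as plain strings, and the dictionary layout are assumptions and do not come from the patent.

```python
from collections import Counter

# Record state codes as stated in the claim: 3 = changed, 2 = refunded, 5 = valid.
STATE_CHANGED, STATE_REFUNDED, STATE_VALID = 3, 2, 5


def trip_purpose(native_code, start_code, end_code):
    """Classify one trip as odh / odo / o / d / other by comparing province codes."""
    if start_code == end_code:
        # Same province at both ends: the home province (odh) or another one (odo).
        return "odh" if start_code == native_code else "odo"
    if start_code == native_code:
        return "o"  # leaving the home province
    if end_code == native_code:
        return "d"  # returning to the home province
    return "other"


def booking_counts(states):
    """Count changed, refunded and valid bookings from a list of record states."""
    freq = Counter(states)
    return {
        "changed": freq[STATE_CHANGED],
        "refunded": freq[STATE_REFUNDED],
        "valid": freq[STATE_VALID],
    }


def station_importance(daily_totals):
    """Share of each station in the day's total passenger volume (assumes a non-zero total)."""
    total = sum(daily_totals.values())
    return {station: n / total for station, n in daily_totals.items()}
```

Because the five trip purpose branches are checked in a fixed order and end in a catch-all, every trip falls into exactly one category, which matches the requirement that the categories neither overlap nor leave cases uncovered.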
  8. The device according to claim 6, characterized in that the companion relationship is determined as follows: in the passenger booking stream data acquired within a time interval, if passenger A and passenger B board at the same starting station on the same day, travel to the same destination station, take the same train in the same carriage, and booked using the same booking mode at the same window of the same station, and the requirements of support and confidence are met, then passenger A and passenger B have a companion relationship.
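One possible way to test the support and confidence requirement of claim 8 is sketched below. The patent does not define how support and confidence are computed or which thresholds apply, so everything specific here is an assumption: support is taken as the number of shared trips, confidence as the shared trips divided by the larger of the two passengers' trip counts, and `trip_key` is a hypothetical tuple bundling date, starting station, destination, train, carriage, booking mode and window.

```python
from collections import Counter, defaultdict
from itertools import combinations


def companion_pairs(records, min_support=2, min_confidence=0.5):
    """Return {(a, b): (support, confidence)} for pairs that repeatedly travel together.

    records is an iterable of (passenger_id, trip_key) tuples.
    """
    passengers_per_trip = defaultdict(set)
    trips_per_passenger = Counter()
    for passenger, trip_key in records:
        # Count each passenger at most once per trip.
        if passenger not in passengers_per_trip[trip_key]:
            passengers_per_trip[trip_key].add(passenger)
            trips_per_passenger[passenger] += 1

    # Number of trips shared by each unordered pair of passengers.
    together = Counter()
    for group in passengers_per_trip.values():
        for a, b in combinations(sorted(group), 2):
            together[(a, b)] += 1

    result = {}
    for (a, b), support in together.items():
        # Assumed definition: the weaker of the two directional confidences.
        confidence = support / max(trips_per_passenger[a], trips_per_passenger[b])
        if support >= min_support and confidence >= min_confidence:
            result[(a, b)] = (support, confidence)
    return result
```

Enumerating all pairs within a trip group is quadratic in the group size, which is acceptable here because the grouping key includes the carriage and booking window and therefore keeps groups small.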
  9. The device according to any one of claims 6 to 8, characterized in that the extraction module is specifically configured to:
    Characterize the hidden state of a passenger with a vector, converting discrete, qualitative data labels into a continuous, quantitative data form;
    Perform statistical analysis on the booking data within a preset time interval, describe the passenger's most probable booking mode with the maximum likelihood probability, and convert qualitative, label-type data into a continuous, quantitative data form;
    Aggregate the booking modes in all of the passenger's ride records within the preset time interval, and calculate the entropy of the booking mode distribution and the number of distinct booking modes.
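The conversion from categorical labels to quantitative features described in claim 9 can be sketched as below. The function name and return layout are illustrative assumptions, and the natural logarithm is used for the entropy because the patent does not state a base.

```python
import math
from collections import Counter


def summarize_labels(labels):
    """Turn a non-empty list of categorical labels into numeric features.

    Returns (most_frequent_label, its_relative_frequency, entropy, distinct_count).
    """
    freq = Counter(labels)
    total = len(labels)
    top_label, top_count = freq.most_common(1)[0]
    probs = [count / total for count in freq.values()]
    # Shannon entropy of the empirical distribution, in nats.
    entropy = -sum(p * math.log(p) for p in probs)
    return top_label, top_count / total, entropy, len(freq)
```

The same summary applies unchanged to any label sequence aggregated per passenger, for example the booking modes here or the starting stations and termini of claim 7.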
  10. The device according to claim 9, characterized in that the fitting module is specifically configured to:
    Let the category of a passenger be z, where z is a hidden variable; given a known passenger category, the conditional distribution of the passenger hidden-state vector y follows a multidimensional Gaussian distribution;
    Use a linearly weighted sum of multiple Gaussian models to fit the probability density distribution curve of the passengers, modelling the fact that multiple categories may exist in reality;
    Assume, according to the central limit theorem, that the conditional distribution of the passenger vector y given the passenger category z is normal; a weighted sum of multiple Gaussian models can approximate any probability distribution to an arbitrary degree, and each Gaussian model represents one class; for a passenger vector whose category needs to be determined, the probability that the passenger belongs to each class is calculated, and the class with the highest probability is selected as the determination result.
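The final decision step of claim 10, computing the probability that a passenger vector belongs to each Gaussian component and picking the largest, can be sketched as follows. The sketch assumes the mixture parameters are already known (in practice they are commonly estimated with the EM algorithm) and, as a simplification not stated in the patent, assumes diagonal covariance matrices; all function and parameter names are hypothetical.

```python
import math


def log_gaussian_diag(y, mean, var):
    """Log density at y of a Gaussian with diagonal covariance (per-dimension variances)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (yi - m) ** 2 / v)
        for yi, m, v in zip(y, mean, var)
    )


def class_posteriors(y, weights, means, variances):
    """Posterior probability P(z = k | y) for each mixture component k."""
    log_joint = [
        math.log(w) + log_gaussian_diag(y, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_joint)  # subtract the maximum before exponentiating, for stability
    unnormalized = [math.exp(lj - peak) for lj in log_joint]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def classify(y, weights, means, variances):
    """Index of the component with the highest posterior probability."""
    posteriors = class_posteriors(y, weights, means, variances)
    return max(range(len(posteriors)), key=posteriors.__getitem__)
```

Working in log space avoids underflow when the hidden-state vector has many dimensions, since the product of many small per-dimension densities would otherwise round to zero.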
CN201611198401.8A 2016-12-22 2016-12-22 A kind of method and device of Ticketing information analysis Pending CN107527223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611198401.8A CN107527223A (en) 2016-12-22 2016-12-22 A kind of method and device of Ticketing information analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611198401.8A CN107527223A (en) 2016-12-22 2016-12-22 A kind of method and device of Ticketing information analysis

Publications (1)

Publication Number Publication Date
CN107527223A true CN107527223A (en) 2017-12-29

Family

ID=60748558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611198401.8A Pending CN107527223A (en) 2016-12-22 2016-12-22 A kind of method and device of Ticketing information analysis

Country Status (1)

Country Link
CN (1) CN107527223A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103235933A (en) * 2013-04-15 2013-08-07 东南大学 Vehicle abnormal behavior detection method based on Hidden Markov Model
CN104702378A (en) * 2013-12-06 2015-06-10 华为技术有限公司 Method and device for estimating parameters of mixture Gaussian distribution
CN104749624A (en) * 2015-03-03 2015-07-01 中国石油大学(北京) Method for synchronously realizing seismic lithofacies identification and quantitative assessment of uncertainty of seismic lithofacies identification
CN105516127A (en) * 2015-12-07 2016-04-20 中国科学院信息工程研究所 Internal threat detection-oriented user cross-domain behavior pattern mining method
CN105701180A (en) * 2016-01-06 2016-06-22 北京航空航天大学 Commuting passenger feature extraction and determination method based on public transportation IC card data
CN105719023A (en) * 2016-01-24 2016-06-29 东北电力大学 Real-time wind power prediction and error analysis method based on mixture Gaussian distribution
CN105808639A (en) * 2016-02-24 2016-07-27 平安科技(深圳)有限公司 Network access behavior recognizing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈宇: "基于高斯混合模型的林业信息文本分类算法", 《中南林业科技大学学报》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108573430A (en) * 2018-03-14 2018-09-25 北京经纬信息技术公司 A kind of data processing method, device and computer readable storage medium
CN108596664A (en) * 2018-04-24 2018-09-28 盘缠科技股份有限公司 A kind of unilateral tranaction costs of electronic ticket determine method, system and device
CN108596664B (en) * 2018-04-24 2021-01-05 盘缠科技股份有限公司 Method, system and device for determining unilateral transaction fee of electronic ticket
CN109376315A (en) * 2018-09-25 2019-02-22 海南民航凯亚有限公司 A kind of civil aviation passenger label analysis method and processing terminal based on machine learning
CN109783531A (en) * 2018-12-07 2019-05-21 北京明略软件系统有限公司 A kind of relationship discovery method and apparatus, computer readable storage medium
CN110334963A (en) * 2019-07-11 2019-10-15 四川亨通网智科技有限公司 Admission ticket order background management system
CN111598162A (en) * 2020-05-14 2020-08-28 万达信息股份有限公司 Cattle risk monitoring method, terminal equipment and storage medium
CN112836996A (en) * 2021-03-10 2021-05-25 西南交通大学 Method for identifying potential ticket buying demand of passenger
CN112949926A (en) * 2021-03-10 2021-06-11 西南交通大学 Income maximization ticket amount distribution method based on passenger demand re-identification
CN112836996B (en) * 2021-03-10 2022-03-04 西南交通大学 Method for identifying potential ticket buying demand of passenger
CN112949926B (en) * 2021-03-10 2022-03-04 西南交通大学 Income maximization ticket amount distribution method based on passenger demand re-identification

Similar Documents

Publication Publication Date Title
CN107527223A (en) A kind of method and device of Ticketing information analysis
Grawe et al. Automated patent classification using word embedding
CN101216998B (en) An urban traffic flow information amalgamation method of evidence theory based on fuzzy rough sets
CN112464094B (en) Information recommendation method and device, electronic equipment and storage medium
Li Credit risk prediction based on machine learning methods
Kaeeni et al. Derailment accident risk assessment based on ensemble classification method
CN111242484A (en) Vehicle risk comprehensive evaluation method based on transition probability
CN113379313B (en) Intelligent preventive test operation management and control system
CN112508600A (en) Vehicle value evaluation method based on Internet public data
CN115147155A (en) Railway freight customer loss prediction method based on ensemble learning
CN115409577A (en) Intelligent container repurchase prediction method and system based on user behavior and environmental information
CN113256409A (en) Bank retail customer attrition prediction method based on machine learning
CN115099450A (en) Family carbon emission monitoring and accounting platform based on fusion model
CN116541782A (en) Power marketing data anomaly identification method
CN107992613A (en) A kind of Text Mining Technology protection of consumers' rights index analysis method based on machine learning
CN106779214A (en) A kind of multifactor fusion civil aviation passenger travel forecasting approaches based on topic model
Hu Overdue invoice forecasting and data mining
CN115545342A (en) Risk prediction method and system for enterprise electric charge recovery
CN110347828A (en) A kind of Metro Passenger demand dynamic acquisition method and its obtain system
Xu et al. MM-UrbanFAC: Urban functional area classification model based on multimodal machine learning
Rao et al. Flight Ticket Prediction using Random Forest Regressor Compared with Decision Tree Regressor
Wang et al. Stacking Based LightGBM-CatBoost-RandomForest Algorithm and Its Application in Big Data Modeling
Gao et al. Statistics and Analysis of Targeted Poverty Alleviation Information Integrated with Big Data Mining Algorithm
CN111078882A (en) Text emotion measuring method and device
CN114897517B (en) Text and travel consumption data management method based on block chain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171229