CN107527313A

CN107527313A - User activity pattern division and attribute estimation method

Info

Publication number: CN107527313A
Application number: CN201610442680.1A
Authority: CN
Inventors: 杨超; 朱荣荣; 许项东
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2016-06-20
Filing date: 2016-06-20
Publication date: 2017-12-29

Abstract

The present invention relates to a user activity pattern division and attribute estimation method, comprising: building a topic model, an activity pattern division model, and a Bayesian-network-based user attribute estimation model from survey data containing personal attributes and trip information; obtaining the big data to be completed; processing the big data with the topic model and the activity pattern division model to obtain the user's activity pattern; and inputting the activity pattern into the Bayesian-network-based user attribute estimation model to obtain the socioeconomic attribute information of the user corresponding to each activity pattern. In the present invention there are as many as six activity patterns, and the user's socioeconomic attribute information can be completed more fully. This makes it possible to analyze and count, scientifically and conveniently, the number of individuals on the road in each period of a day, while the socioeconomic attributes of an individual help determine whether each user takes the bus or drives, thereby providing an important reference for urban transportation planning, travel demand forecasting, and the like.

Description

User activity pattern division and attribute estimation method

Technical field

The present invention relates to smart device data analysis, and more particularly to a user activity pattern division and attribute estimation method.

Background

Urban residents' travel activity information is an important basis for urban planning, traffic management, and user activity research. It is typically obtained by traditional means such as travel surveys, which consume large amounts of manpower, material resources, and time. With the emergence of concepts such as intelligent transportation systems and smart cities, traffic big data (such as bus IC card data and mobile phone signaling data) has begun to be used for extracting user trip information and analyzing individual activity patterns, owing to its wide user coverage, its independence from dedicated data collection equipment, its low collection cost, and its large data volume. However, although traffic big data provides timestamped position information from which a user's movement trajectory over a day can be derived, the user's specific socioeconomic attribute information cannot be obtained, because of inherent limitations of the data or privacy protection.

Research on methods for extracting activity patterns from traffic big data is currently limited, and the individual activity characteristics in the extracted patterns are rather narrow. Previous approaches compute an individual's stay points from the individual's historical position data and form a stay point sequence that includes stay times. Stay point sequence segments are then clustered with a topic model to obtain individual activity patterns. When the topic model is applied, the historical places where the user stayed (for example, an apartment, a Starbucks, or a gym) serve as "words", and the time the user spends at each place on each day serves as the input data. The activity pattern is thus defined as the distribution of total stay duration across different places in one day. This simplification ignores much useful information in the original user position data, so the resulting activity patterns are narrow in content: they contain only stay duration information and nothing more specific.

Summary of the invention

In view of this, to address the problems that user activity patterns obtained from big data analysis are narrow in content and that few user socioeconomic attributes are available, it is necessary to provide a user activity pattern division and attribute estimation method that can analyze multiple activity patterns of users from big data and obtain the users' socioeconomic attribute information.

A user activity pattern division and attribute estimation method, comprising:

Building a topic model from survey data containing personal attributes and trip information, outputting a first topic probability distribution using the topic model, and building an activity pattern division model according to the first topic probability distribution; the activity pattern division model includes multiple activity patterns, and a Bayesian-network-based user attribute estimation model is built according to the user socioeconomic attribute information corresponding to each activity pattern;

Obtaining the big data to be completed;

Cleaning the big data to obtain the user's movement trajectory, and dividing the trajectory into stay segments and trip segments by trip identification to obtain the user's trip chain; determining workplace and residence by combining the stay time period, stay duration, and number of stays, to obtain the user's workplace and residence positions;

Obtaining the user's position sequence for a preset time from the trip chain and the user's workplace and residence positions, and inputting the position sequence into the topic model to obtain a second topic probability distribution of the user's activity;

Inputting the second topic probability distribution into the activity pattern division model to obtain the user's activity pattern;

Inputting the activity pattern into the Bayesian-network-based user attribute estimation model to obtain the socioeconomic attribute information of the user corresponding to each activity pattern.

In one embodiment, obtaining the user's position sequence for the preset time from the trip chain and the user's workplace and residence positions comprises the following steps: discretizing the 24 hours of a day in half-hour units into 48 time periods; dividing the day into multiple large time zones according to the activity characteristics of different periods; and determining each user's activity purpose label for each time period from the movement trajectory.

In one embodiment, the activity purpose labels include home, workplace, school, shopping, leisure and entertainment, business, and picking up or dropping off people.

In one embodiment, the position sequence includes multiple sub-position sequences. Each sub-position sequence consists of two consecutive activity purpose labels starting at a given time period, together with the large time zone to which that period belongs. The word frequencies of all sub-position sequences of all users in one day are counted to form the position sequence matrix.

In one embodiment, building the topic model comprises the following steps: extracting activity chain data from traditional survey data containing personal attributes and trip information, and determining each user's position sequence for one day according to the user's trip purposes, as the input data of the topic model; determining the optimal number of topics according to the position sequences; and building the topic model according to the optimal number of topics.

In one embodiment, determining the optimal number of topics comprises the following steps:

Computing the perplexity used to select the optimal number of topics, with the formula:

$$\mathrm{perplexity}(W \mid M) = \exp\left(-\frac{\sum_{m} \log p(w_m \mid M)}{\sum_{m} N_m}\right)$$

where M is the model, w_m is the set of words in the m-th held-out position sequence, and N_m is the total number of words occurring in that position sequence, which equals 47, the number of consecutive time period pairs in a day;

Randomly drawing a training set and a test set from the survey data, fitting the model on the training set, and computing the perplexity on the test set;

Testing topic numbers K_topic from 2 to 50, training 10 times for each K_topic, and averaging the perplexity;

Selecting as the optimal number of topics the value at which the perplexity reaches the minimum point where it turns from falling to rising, or the critical point at which its decline levels off, and building the topic model with it.

In one embodiment, building the activity pattern division model according to the first topic probability distribution comprises the following steps:

Determining the optimal number of clusters K_cluster from the first topic probability distribution output by the topic model, that is, from a K-dimensional representation with the topics as coordinates, with the formula:

$$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where, for each point i in a cluster, S(i) is the silhouette coefficient of point i, a(i) is the average distance from point i to all other points in its own cluster, and b(i) is the minimum, over all clusters that do not contain i, of the average distance from point i to the points of that cluster; averaging the silhouette coefficients of all points gives the overall silhouette coefficient of the clustering result;

Testing cluster numbers K_cluster from 2 to 10, repeating 10 times for each K_cluster, and averaging the silhouette coefficient;

Selecting the K_cluster with the minimum silhouette coefficient as the final number of clusters to obtain the activity pattern categories; and building the activity pattern division model according to the activity pattern categories.

In one embodiment, the activity pattern categories include: late-return work, early-departure work, normal work, and normal school.

In one embodiment, building the Bayesian-network-based user attribute estimation model comprises the following steps:

Obtaining the Bayesian network structure according to a Bayesian scoring function, in which D denotes the training dataset and S denotes the network structure;

Inputting the activity pattern into the Bayesian network structure as a known variable, and outputting the conditional probability distribution of each attribute variable under that activity pattern;

For each attribute variable, taking the attribute value with the highest probability as the socioeconomic attribute information corresponding to the activity pattern.

The user activity pattern division and attribute estimation method provided by the present invention builds a topic model, an activity pattern division model, and a Bayesian-network-based user attribute estimation model from traditional survey data containing personal attributes and trip information. The big data to be completed is then obtained, cleaned, and analyzed to derive each user's position sequence for one day; each position sequence is input into the topic model to obtain a second topic probability distribution; the second topic probability distribution is input into the activity pattern division model to obtain the corresponding activity pattern; and finally the activity pattern is input into the Bayesian-network-based user attribute estimation model to obtain the user socioeconomic attributes corresponding to that activity pattern. In the present invention there are as many as six activity patterns, which overcomes the narrow activity pattern content of the prior art, and the user's socioeconomic attribute information can be completed more fully. This makes it possible to analyze and count, scientifically and conveniently, the number of individuals on the road in each period of a day, while the socioeconomic attributes of an individual help determine whether each user takes the bus or drives, thereby providing an important reference for urban transportation planning, travel demand forecasting, and the like.

Brief description of the drawings

Fig. 1 is a schematic diagram of the user activity pattern division and attribute estimation method provided by the present invention;

Fig. 2 is a flow chart of the user activity pattern division and attribute estimation method provided by the present invention;

Fig. 3 shows the composition of a user's one-day position sequence derived from traditional survey trip data;

Fig. 4 is the bar chart obtained by converting Fig. 2;

Fig. 5 shows the input data of the topic model after the data of Fig. 3 are arranged;

Fig. 6 shows the relationship between perplexity and number of topics in an embodiment of the present invention.

Detailed description

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.

Fig. 1 is a schematic diagram of the user activity pattern division and attribute estimation method provided by the present invention. It first gives an overall view of the operating flow of the method.

On the one hand, a topic model, an activity pattern division model, and a Bayesian-network-based user attribute estimation model are built from survey data containing personal attributes and trip information. On the other hand, the big data to be completed is received and cleaned to obtain the movement trajectory and each user's position sequence for one day. The position sequence is then input into the topic model built above to obtain a second topic probability distribution, the second topic probability distribution is input into the activity pattern division model to obtain the user's specific activity pattern, and finally the specific activity pattern is input into the Bayesian-network-based user attribute estimation model to obtain the user's socioeconomic attribute information.

Referring also to Fig. 2, in one embodiment the user activity pattern division and attribute estimation method provided by the present invention comprises the following steps:

S110, building a topic model from survey data containing personal attributes and trip information, outputting a first topic probability distribution using the topic model, and building an activity pattern division model according to the first topic probability distribution; the activity pattern division model includes multiple activity patterns, and a Bayesian-network-based user attribute estimation model is built according to the user socioeconomic attribute information corresponding to each activity pattern. Specifically, in this embodiment the models are built from the 2009 Shanghai user travel survey data. The topic model is built first and used to output the first topic probability distribution; the activity pattern division model is built from the first topic probability distribution; and finally the Bayesian-network-based user attribute estimation model is built from the user attribute information corresponding to each activity pattern in the activity pattern division model. The building process of each of the three models is described in detail below.

S120, obtaining the big data to be completed. The big data is typically data provided by mobile phones, iPads, wearable devices, and the like. It specifically includes positioning data that lacks user socioeconomic attribute information, such as mobile phone signaling, call detail records, and bus and metro IC card data, or data lacking personal attributes and trip information provided by iPads, wearable devices, and the like. More specifically, mobile phone call detail records mainly cover events such as making calls, receiving calls, hard handover, sending text messages, and receiving text messages, while signaling data, in addition to the events in call detail records, also includes events such as power-on, power-off, cell handover, and location update. Taking Shenzhen mobile phone call detail records as an example, the data fields generally include user identification code, cell ID, sector ID, access time, and so on, as detailed in Table 1.

Table 1

S130, cleaning the big data to obtain the user's movement trajectory, and dividing the trajectory into stay segments and trip segments by trip identification to obtain the user's trip chain; determining workplace and residence by combining the stay time period, stay duration, and number of stays, to obtain the user's workplace and residence positions. Cleaning the big data includes handling missing fields, removing records with abnormal IMSI numbers, removing records that cannot be matched with base station data, removing duplicate data, ping-pong handling, and signal drift handling.

Handling missing fields means deleting records in the mobile phone data in which key fields are missing, for example records whose base station number is 0 or whose time field is missing;

Removing records with abnormal IMSI numbers. Anomalies in the storage process may produce some abnormal IMSI numbers.

Removing records that cannot be matched with base station data. The study in this embodiment is limited to Shanghai. Because of signal issues, some records may be located at base stations in adjacent provinces; where records fall on base stations of a neighboring province, the related data are deleted.

Removing duplicate data. In practice, besides genuinely duplicated data, records may also be duplicated because of precision issues (for example, when the time field is truncated to the second, records that were not originally at the same time come out identical).
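The cleaning rules above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the record layout (`user`, `cell`, `time` keys) and the function name `clean_records` are assumptions, and the abnormal-IMSI check is left out because the text does not define what counts as abnormal.

```python
def clean_records(records, known_cells):
    """Apply three of the cleaning rules to a list of record dicts.

    Hypothetical layout: each record has 'user', 'cell' and 'time' keys,
    where 'time' is a datetime (or None when the field is missing).
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Missing-field rule: drop records with no user, no time,
        # or a base station (cell) number that is 0 or absent.
        if not rec.get("user") or not rec.get("cell") or rec.get("time") is None:
            continue
        # Unmatched-base-station rule: drop cells outside the study area table.
        if rec["cell"] not in known_cells:
            continue
        # Duplicate rule: compare timestamps truncated to whole seconds.
        key = (rec["user"], rec["cell"], rec["time"].replace(microsecond=0))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```

Keeping the first record of each duplicate group is a design choice made here for simplicity; the text does not say which copy should survive.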

In one embodiment, ping-pong handling comprises the following steps: merging each user's mobile phone data by region in space and time; if the user's signal fluctuates within a range smaller than spatial threshold L1 for longer than time threshold T1, the user is considered to be at the same position during that time. More specifically, spatial threshold L1 is a range 400-500 meters in diameter, and time threshold T1 is 25-30 minutes.

In one embodiment, signal drift handling comprises the following steps: merging each user's mobile phone data by region in space and time; if the user leaves the range of spatial threshold L2 within time threshold T2 and afterwards returns to the range of spatial threshold L2, the user is considered to have remained at the same position. That is, when a user's mobile phone records leave the small spatial range L2 for a short time and then quickly return, the user is still considered to be at the same position. More specifically, this refers to cases where the position switching speed when the signal leaves and returns to the region exceeds 100 km/h (the urban expressway speed limit) and the time away from the region does not exceed T_clean. More specifically, spatial threshold L2 is a range 400-500 meters in diameter, and time threshold T2 is 25-30 minutes.
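The drift rule can be illustrated with a simplified sketch that handles only single-point excursions. Several things here are assumptions rather than details from the text: positions are planar coordinates in meters, L2 is treated as a plain distance, the default `t_clean_s` of 300 seconds is a placeholder (the text gives no value for T_clean), and `remove_drift` is a hypothetical name.

```python
import math

def remove_drift(points, l2_m=500.0, t_clean_s=300.0, v_max_kmh=100.0):
    """Drop single-point excursions that look like signal drift.

    points: time-ordered list of (t_seconds, x_m, y_m) tuples.
    A point is dropped when it leaves the L2 range of the previous point
    faster than v_max_kmh and the next point is back inside that range
    within t_clean_s seconds.
    """
    def dist(p, q):
        return math.hypot(p[1] - q[1], p[2] - q[2])

    kept = list(points)
    i = 1
    while i < len(kept) - 1:
        prev, cur, nxt = kept[i - 1], kept[i], kept[i + 1]
        dt = cur[0] - prev[0]
        speed_kmh = dist(prev, cur) / dt * 3.6 if dt > 0 else float("inf")
        left_area = dist(prev, cur) > l2_m
        returned = dist(prev, nxt) <= l2_m
        away_s = nxt[0] - prev[0]
        if left_area and returned and speed_kmh > v_max_kmh and away_s <= t_clean_s:
            del kept[i]  # treat the excursion as drift; user stayed in place
        else:
            i += 1
    return kept
```

A production version would also need to handle excursions spanning several consecutive records and geographic rather than planar coordinates.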

Specifically, in this embodiment, the 2009 Shanghai traditional user travel survey data are converted into activity chain data, and the data of users who leave home in the morning and return home in the evening are selected. The durations of activities of all types are then extracted and the activity duration distribution is built; the 5% quantile of the activity duration distribution is taken as the time threshold T_stay for trip identification, which in this embodiment is 25 minutes. Specifically, the speed threshold V_stay is 1 m/s, L_stay is 200-500 meters, and the time threshold T_stay is 5-25 minutes; the specific values of L_stay and T_stay need to be chosen by considering the whole activity chain together with actual conditions.

Workplace and residence are determined by combining the stay time period, stay duration, and number of stays, to obtain the user's workplace and residence positions. Specifically, in one embodiment, the steps are as follows:

Extracting the start time, duration, and number of stays of all activities of each user; selecting each user's workday data and counting the total number of days N;

For each stop location, counting the total number of days N_home on which the stay time between 20:00 at night and 07:00 the next day exceeds T_home;

If N_home exceeds 60% of the total number of days N, the location is the residence. Otherwise, counting the total number of days N_work on which the stay time during working hours 09:00-17:00 exceeds T_work;

If N_work exceeds 60% of the total number of days N, the location is the workplace; otherwise, the location is another activity destination.
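The decision rule above can be sketched as follows. The function name and input layout are assumptions for illustration; the default thresholds of 540 and 165 minutes are the T_home and T_work values reported for this embodiment.

```python
def classify_location(night_stay_min, work_stay_min,
                      t_home=540, t_work=165, share=0.6):
    """Label one stop location as 'home', 'work' or 'other'.

    night_stay_min[d]: minutes spent at the location on workday d
                       between 20:00 and 07:00 the next day.
    work_stay_min[d]:  minutes spent there on workday d in 09:00-17:00.
    """
    n_days = len(night_stay_min)
    if n_days == 0 or n_days != len(work_stay_min):
        raise ValueError("need the same non-zero number of workdays in both lists")
    # N_home: days on which the night-time stay exceeds T_home.
    n_home = sum(1 for m in night_stay_min if m > t_home)
    if n_home > share * n_days:
        return "home"
    # N_work: days on which the working-hours stay exceeds T_work.
    n_work = sum(1 for m in work_stay_min if m > t_work)
    if n_work > share * n_days:
        return "work"
    return "other"
```

The home test runs first, so a location that satisfies both conditions is labeled as the residence, matching the order of the steps in the text.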

More specifically, the time threshold T_home for residence identification in this embodiment is derived using the 2009 Shanghai traditional user travel survey data as the sample. The trip data in the survey are converted into activity chain data, and the data of users who leave home in the morning and return home in the evening are selected. Activities between 20:00 at night and 07:00 the next day whose stop location is home are extracted, their durations and the duration distribution are built, and the 5% quantile of the activity duration distribution is taken as the time threshold T_home for residence identification. Specifically, with the 2009 Shanghai traditional user travel survey data as the sample, the 5% quantile of the activity duration distribution is 540 minutes, that is, 9 hours spent at home.

The time threshold T_work for workplace identification is likewise derived using the 2009 Shanghai traditional user travel survey data as the sample. The trip data in the survey are converted into activity chain data, and the data of users who leave home in the morning and return home in the evening are selected. Activities during working hours 09:00-17:00 whose stop location is the workplace are extracted, their durations and the duration distribution are built, and the 5% quantile of the activity duration distribution is taken as the time threshold T_work for workplace identification. Specifically, with the 2009 Shanghai traditional user travel survey data as the sample, the 5% quantile of the activity duration distribution is 165 minutes, that is, more than 2 hours and close to 3 hours spent at one place. Normal working time is 7-8 hours; 165 minutes is chosen as the threshold because some people do not work in one place for long stretches, for example company managers or teachers, who may spend only 2-3 hours working in one place. In addition to the 165-minute time threshold, the number of stays is also considered, in order to rule out cases where a stop location is merely an occasional shopping or dining-out destination.
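The thresholds T_stay, T_home, and T_work are each described as the 5% quantile of an activity duration distribution. A minimal sketch of that calculation follows; the nearest-rank definition of the quantile is an assumption, since the text does not state which quantile estimator is used, and `quantile_threshold` is a hypothetical name.

```python
import math

def quantile_threshold(durations_min, q=0.05):
    """Return the nearest-rank q-quantile of activity durations (minutes)."""
    if not durations_min:
        raise ValueError("no durations given")
    ordered = sorted(durations_min)
    # Nearest-rank method: smallest value with at least q of the data at or below it.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]
```

Applied to the home activities extracted from the survey, a function like this would play the role of the calculation that yields the 540-minute figure; other estimators (for example, linear interpolation) can give slightly different thresholds.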

In addition, it should be understood that the characteristics of school activities are similar to those of work activities, so the identification of school activities is incorporated into the residence and workplace identification described above.

S140, obtaining the user's position sequence for a preset time from the trip chain and the user's workplace and residence positions, and inputting the position sequence into the topic model to obtain a second topic probability distribution of the user's activity. In one embodiment, obtaining the user's position sequence for one day from the workplace and residence positions comprises the following steps:

Referring to Fig. 3, the 24 hours of a day are discretized in half-hour units into 48 time periods. According to the activity characteristics of different periods, the day is divided into multiple large time zones. In this embodiment there are five: 00:00-07:00 is the first large time zone, 07:00-09:00 the second, 09:00-16:00 the third, 16:00-18:00 the fourth, and 18:00-24:00 the fifth. Each user's activity purpose label for each time period is determined from the movement trajectory. Specifically, in this embodiment the activity purpose labels are home, workplace, school, and other. In other embodiments, "other" can be broken down further as needed, for example into shopping, leisure and entertainment, business, and picking up or dropping off people.

Specifically, referring to Fig. 3, in one embodiment the position sequence includes multiple sub-position sequences. Each sub-position sequence consists of two consecutive activity purpose labels starting at a given time period, together with the large time zone to which that period belongs. All sub-position sequences generated by a user in one day are traversed by sliding forward half an hour at a time. The word frequencies of all sub-position sequences of all users in one day (as in Fig. 4) and the position sequence matrix (as in Fig. 5) are then computed, and the resulting position sequence matrix is input into the topic model as data. In Fig. 4, H denotes the activity purpose label home, W denotes work, S denotes school, and O denotes other; 1-5 denote the large time zone in the sub-position sequence, and the number on each bar is the word frequency of that kind of sub-position sequence.
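The construction of sub-position sequences can be sketched as below, using the five large time zones of this embodiment and the H/W/S/O labels of Fig. 4. Two points are assumptions: the word format (for example `"HW2"`) and the choice to take the large time zone from the first of the two periods in each pair. The function names are hypothetical.

```python
from collections import Counter

# (first slot, end slot exclusive, large time zone) in half-hour slots:
# 00:00-07:00, 07:00-09:00, 09:00-16:00, 16:00-18:00, 18:00-24:00.
ZONE_BOUNDS = [(0, 14, 1), (14, 18, 2), (18, 32, 3), (32, 36, 4), (36, 48, 5)]

def big_zone(slot):
    """Map a half-hour slot index (0-47) to its large time zone (1-5)."""
    for start, end, zone in ZONE_BOUNDS:
        if start <= slot < end:
            return zone
    raise ValueError("slot must be in 0..47")

def sub_sequences(labels):
    """Turn 48 half-hour labels into 47 words of two labels plus a zone."""
    if len(labels) != 48:
        raise ValueError("expected 48 half-hour labels")
    return [f"{labels[i]}{labels[i + 1]}{big_zone(i)}" for i in range(47)]

def word_counts(labels):
    """Word frequencies for one user-day: one row of the position sequence matrix."""
    return Counter(sub_sequences(labels))
```

For a user who is at home until 07:30, at work until 17:30, and at home afterwards, the 47 words contain a single `"HW2"` (the morning commute) and a single `"WH4"` (the return home); stacking such counts over all users gives a matrix of the kind shown in Fig. 5.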

S150, inputting the second topic probability distribution into the activity pattern division model to obtain the user's activity pattern.

S160, inputting the activity pattern into the Bayesian-network-based user attribute estimation model to obtain the conditional probability distribution corresponding to each activity pattern, and completing the user's attribute information according to the conditional probability distribution.

The building processes of the topic model, the activity pattern division model, and the Bayesian-network-based user attribute estimation model are described in detail below in turn.

This embodiment builds the above three models using the 2009 Shanghai traditional user travel survey data as the sample. First, activity chain data are extracted from the traditional survey data containing personal attributes and trip information, and each user's position sequence for one day is determined according to the user's trip purposes, as the input data of the topic model. The optimal number of topics is then determined from the position sequences. Finally, the topic model is built according to the optimal number of topics.

The optimal number of topics is determined through the following steps:

First, the perplexity used to select the optimal number of topics is computed, with the formula:

$$\mathrm{perplexity}(W \mid M) = \exp\left(-\frac{\sum_{m} \log p(w_m \mid M)}{\sum_{m} N_m}\right)$$

where M is the model, w_m is the set of words in the m-th held-out position sequence, and N_m is the number of words in that position sequence.

A training set and a test set are drawn at random from the survey data; the model is fitted on the training set and the perplexity is computed on the test set. In this embodiment, 90% of the traditional survey data are randomly taken as the training set and 10% as the test set.

Topic numbers K_topic from 2 to 50 are tested; for each K_topic the model is trained 10 times and the perplexity is averaged.

The number of topics at which the perplexity reaches the minimum point where it turns from falling to rising, or the critical point at which its decline levels off, is selected as the final number of topics for building the topic model. Specifically, three sub-position sequence construction methods are used (two consecutive position labels plus the large time zone they belong to, three consecutive position labels plus the large time zone they belong to, or counting both kinds of sub-position sequence together), and the change in perplexity with the number of topics is computed for each; the results are shown in Fig. 6. The results show that the two-label construction has the lowest perplexity and the combined two-plus-three-label construction has the highest. As the number of topics increases, the perplexity of the three-label construction levels off at K=30, that of the two-label construction at K=50, and that of the two-plus-three-label construction at K=40. The two-label sub-position sequence construction is therefore finally selected, that is, the topic model is built with a topic number of 2.
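As a sketch of the perplexity calculation, assuming the standard held-out perplexity for topic models (the exponential of the negative total log-likelihood divided by the total word count), the two helper functions below compute one perplexity value and the per-K averages over repeated runs. The names and input layout are hypothetical, and the per-sequence log-likelihoods are assumed to come from whatever topic model implementation is used.

```python
import math

def perplexity(log_likelihoods, word_totals):
    """Held-out perplexity from per-sequence log-likelihoods.

    log_likelihoods[m]: log p(w_m | M) for held-out position sequence m.
    word_totals[m]:     N_m, the number of words in that sequence.
    """
    total_words = sum(word_totals)
    if total_words <= 0:
        raise ValueError("need at least one word")
    return math.exp(-sum(log_likelihoods) / total_words)

def mean_perplexity_by_k(runs):
    """Average repeated runs: {K_topic: [perplexity per run]} -> {K_topic: mean}."""
    return {k: sum(vals) / len(vals) for k, vals in runs.items()}
```

As a sanity check, a model that assigns every word a uniform probability of 1/8 has perplexity 8 regardless of how many sequences are scored. Locating the turning point or the leveling-off point in the averaged curve is left to inspection, as in Fig. 6.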

The activity pattern division model is built through the following steps:

The optimal number of clusters K_cluster is determined from the first topic probability distribution output by the topic model, that is, from a K-dimensional representation with the topics as coordinates, with the formula:

$$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where S(i) is the silhouette coefficient of point i, a(i) is the average distance from point i to all other points in its own cluster, and b(i) is the minimum, over all clusters that do not contain i, of the average distance from point i to the points of that cluster.

The K_cluster with the minimum silhouette coefficient is selected as the final number of clusters to obtain the activity pattern categories, and the activity pattern division model is built according to these categories. By calculation, the silhouette coefficient is minimum when the number of clusters K_cluster is 6. From observation and statistics of the clusters, six activity patterns are finally obtained: late-return work, early-departure work, normal work, normal school, long-duration other trips, and short-duration other trips.
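The overall silhouette coefficient of one clustering result can be computed as in the sketch below, which follows the usual definition S(i) = (b(i) - a(i)) / max{a(i), b(i)}. Euclidean distance, the score of 0 for singleton clusters, and the function name are assumptions; the clustering algorithm that produces the labels is not specified in the text and is outside this sketch.

```python
import math

def silhouette_mean(points, labels):
    """Mean silhouette coefficient of a clustering.

    points: list of coordinate tuples (e.g. topic probability vectors).
    labels: cluster label of each point, same order as points.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, lab in enumerate(labels):
        if len(clusters[lab]) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(i, clusters[lab])                     # own cluster
        b = min(mean_dist(i, members)                       # nearest other cluster
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Running this for each K_cluster from 2 to 10 over repeated clusterings and averaging gives the curve from which the final number of clusters is chosen. The function only reports the coefficient; the selection criterion is applied separately.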

Specifically, pattern 1 (late-return work) users leave home on average at about 9:40 in the morning and return home at about 10:15 at night, the latest among the six classes. These users spend most of the day at the workplace, and their time at home is the shortest among the six classes. They may work overtime in the evening. Pattern 2 (early-departure work) users leave home on average at about 6:15 in the morning, the earliest among the six classes, while their average time of returning home is similar to that of normal work users, at about 16:30. This may be because their workplace is far from their residence, so they need to set out early. Pattern 3 (normal work) users leave home on average at about 7:45 in the morning and return home at about 18:20 in the evening; they stay at the workplace for 9.5 hours a day, and after subtracting the midday lunch break their working time is at a normal level. Pattern 4 (normal school) users leave home on average at about 7:15 in the morning and return home at about 17:00, consistent with normal school hours; their school should be near their residence. Patterns 1-4 are commuters, with an average of about 2 trips per day.

Patterns 5-6 consist mainly of other types of trips, and their average daily number of trips is higher than that of commuters, at about 2.6-2.8. Pattern 5 (long-duration other trips) users make 2.83 trips per day on average, and their total stay duration at other types of places reaches about 8.5 hours. They leave home on average at about 8:40 in the morning and return home at about 18:40 in the evening, leaving early and returning late. Pattern 6 (short-duration other trips) users make 2.67 trips per day on average; their total stay duration at other types of places is low, about 1.9 hours, while their stay duration at home reaches nearly 22 hours. They leave home on average at about 9:00 in the morning and return home at about 12:50 at noon, with trips concentrated in the morning.

Building the Bayesian-network-based user attribute estimation model comprises the following steps:

The Bayesian network structure is obtained according to a Bayesian scoring function, in which D denotes the training dataset and S denotes the network structure.
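As an illustration only, and not necessarily the exact scoring function used in this embodiment, a common Bayesian score ranks a candidate structure S by its log posterior given the data D. The form below follows directly from Bayes' theorem; the choice of structure prior P(S) and of the marginal likelihood P(D | S) is left open here and would be an assumption in any concrete implementation.

```latex
\log P(S \mid D) = \log P(S) + \log P(D \mid S) - \log P(D)
```

Because P(D) does not depend on S, candidate structures can be compared using log P(S) + log P(D | S) alone, and the structure with the highest score is kept.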

In the present embodiment, passed through based on Shanghai user's trip survey data in 2009 at topic model and movable partitioning model Reason obtain it is late return working, the working of early hair, normal working, other are gone on a journey when normally going to school, be long, and other six kinds of work of going on a journey in short-term Dynamic model formula；Bayesian network structures will be inputted by six kinds of activity patterns above, and export the activity pattern lower age, sex, the text Conditional probability distribution corresponding to the attribute variables such as change degree, occupation, household register, concrete outcome are as follows：

1. Age:

2. Gender:

3. Education level:

4. Occupation:

Here, occupation codes 1 to 11 have the following meanings: 1. head of a unit; 2. professional and technical personnel; 3. clerical and related personnel; 4. private business owners and self-employed operators; 5. farming, forestry, animal husbandry, fishery and water conservancy producers; 6. production and transport equipment operators; 7. commercial and service personnel; 8. retired personnel; 9. retired personnel who have been re-employed; 10. students; 11. other.

5. Household registration:

For each attribute, the socioeconomic attribute value with the largest conditional probability is taken as the socioeconomic attribute information of the activity pattern. The socioeconomic attributes under each activity pattern are thereby determined, so that in the user attribute inference model based on the Bayesian network every activity pattern corresponds one-to-one to a set of user socioeconomic attributes. After the big data to be completed have been processed by the topic model and the activity pattern division model, a specific activity pattern is obtained; inputting this activity pattern into the personal attribute inference model based on the Bayesian network immediately yields the socioeconomic attributes corresponding to that activity pattern.
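The selection rule described above can be sketched as follows. This is only a minimal illustration: it assumes the conditional probability distributions are already available as nested dictionaries, and the function name `most_likely_attributes`, the attribute categories, and all probability values are hypothetical placeholders rather than results from the survey.

```python
from typing import Dict

# Hypothetical conditional probability tables P(attribute value | activity pattern).
# All categories and numbers are illustrative placeholders, not survey results.
CPT: Dict[str, Dict[str, Dict[str, float]]] = {
    "normal school": {
        "occupation": {"student": 0.90, "other": 0.10},
        "age": {"under 20": 0.85, "20 to 59": 0.10, "60 and over": 0.05},
    },
    "normal work": {
        "occupation": {"student": 0.05, "other": 0.95},
        "age": {"under 20": 0.05, "20 to 59": 0.90, "60 and over": 0.05},
    },
}


def most_likely_attributes(
    pattern: str, cpt: Dict[str, Dict[str, Dict[str, float]]]
) -> Dict[str, str]:
    """For each attribute, return the value with the largest conditional probability."""
    return {attr: max(dist, key=dist.get) for attr, dist in cpt[pattern].items()}


if __name__ == "__main__":
    print(most_likely_attributes("normal school", CPT))
```

Because each attribute is resolved independently, every activity pattern maps to exactly one attribute profile, which is the one-to-one correspondence the text describes.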

After the three models above have been built, the big data to be completed, which currently lack socioeconomic attributes, are processed as follows. Data cleaning yields each user's movement trajectory and position sequence for one day. The position sequence is then input into the topic model already built to obtain the second topic probability distribution, which is in turn input into the activity pattern division model already built to obtain the user's specific activity pattern. Finally, the activity pattern is input into the user attribute inference model based on the Bayesian network to obtain the user socioeconomic attribute information corresponding to that activity pattern. This makes it possible to analyze and count, scientifically and conveniently, the number of individuals on the road in each period of the day and their socioeconomic attributes, and further to judge whether each user takes the bus or drives, thereby providing an important reference for urban transport planning, travel demand forecasting, and the like.

The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The embodiments described above represent only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, and all of these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims

1. A user activity pattern division and attribute inference method, characterized by comprising:

building a topic model based on survey data containing personal attributes and trip information, outputting a first topic probability distribution using the topic model, and building an activity pattern division model according to the first topic probability distribution; the activity pattern division model includes multiple activity patterns, and a personal attribute inference model based on a Bayesian network is built according to the user socioeconomic attribute information corresponding to each activity pattern;

acquiring the big data to be completed;

performing data cleaning on the big data to obtain the user's movement trajectory, dividing the trajectory into stay segments and trip segments through trip identification, and obtaining the user's trip chain; determining home and work locations by combining information on stay time period, stay duration and number of stays, to obtain the user's home and work locations;

obtaining the user's position sequence for a preset time according to the trip chain and the user's home and work locations, and inputting the position sequence into the topic model to obtain a second topic probability distribution of the user's activities;

inputting the second topic probability distribution into the activity pattern division model to obtain the user's activity pattern;

inputting the activity pattern into the personal attribute inference model based on the Bayesian network to obtain the socioeconomic attribute information of the user corresponding to each activity pattern.

2. The user activity pattern division and attribute inference method according to claim 1, characterized in that obtaining the user's position sequence for a preset time according to the trip chain and the user's home and work locations comprises the following steps: discretizing the 24 hours of a day in half-hour units into 48 periods; dividing the day into multiple large time zones according to the activity characteristics of different periods; and determining the activity purpose label of each user in each period according to the activity trajectory.
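The discretization in this claim can be sketched as follows. This is a minimal illustration under stated assumptions: the zone boundaries in `ZONES`, the default label for periods not covered by a stay, and the function names are all hypothetical, since the text here does not specify them.

```python
from typing import List, Tuple

PERIODS_PER_DAY = 48  # 24 hours split into half-hour periods

# Hypothetical large time zones as (first period index, zone name).
# The actual boundaries are not given in this section.
ZONES: List[Tuple[int, str]] = [
    (0, "night"), (12, "morning"), (24, "afternoon"), (36, "evening"),
]


def period_index(hour: int, minute: int) -> int:
    """Map a clock time to its half-hour period index (0 to 47)."""
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("time out of range")
    return hour * 2 + minute // 30


def zone_of(period: int) -> str:
    """Return the large time zone that contains the given period."""
    name = ZONES[0][1]
    for start, zone in ZONES:
        if period >= start:
            name = zone
    return name


def label_periods(
    stays: List[Tuple[int, int, str]], default: str = "home"
) -> List[str]:
    """Assign one activity purpose label to each of the 48 periods.

    stays holds (start_period, end_period_exclusive, label) tuples.
    Periods not covered by any stay receive the default label (an assumption).
    """
    labels = [default] * PERIODS_PER_DAY
    for start, end, label in stays:
        if not (0 <= start <= end <= PERIODS_PER_DAY):
            raise ValueError("stay outside the day")
        for p in range(start, end):
            labels[p] = label
    return labels


if __name__ == "__main__":
    # A user who is at the workplace from 9:40 until 22:15.
    stays = [(period_index(9, 40), period_index(22, 15), "workplace")]
    print(label_periods(stays))
```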

3. The user activity pattern division and attribute inference method according to claim 2, characterized in that the activity purpose labels include home, workplace, school, shopping, entertainment, business, and picking up or dropping off people.

4. The user activity pattern division and attribute inference method according to claim 2, characterized in that the position sequence includes multiple sub-position sequences; each sub-position sequence consists of the activity purpose labels of two consecutive periods and the large time zone to which the period belongs; and the word frequencies of all sub-position sequences of all users in one day are counted to form a position sequence matrix.
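A minimal sketch of how sub-position sequences and the position sequence matrix might be assembled is given below. It assumes that the time zone of a sub-position sequence is that of the first of its two periods, which the text does not state explicitly; the function names and the `zone_of` callable are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# A "word": (label at period t, label at period t+1, large time zone).
Word = Tuple[str, str, str]


def sub_position_sequences(
    labels: List[str], zone_of: Callable[[int], str]
) -> List[Word]:
    """Build one word per pair of consecutive periods (47 words for 48 periods).

    Assumption: the zone is taken from the first period of each pair.
    """
    return [(labels[t], labels[t + 1], zone_of(t)) for t in range(len(labels) - 1)]


def position_sequence_matrix(
    users: Dict[str, List[str]], zone_of: Callable[[int], str]
) -> Tuple[List[Word], List[List[int]]]:
    """Count word frequencies per user and return (vocabulary, user-by-word matrix)."""
    counts = {
        user: Counter(sub_position_sequences(labels, zone_of))
        for user, labels in users.items()
    }
    vocab = sorted({word for c in counts.values() for word in c})
    matrix = [[counts[user][word] for word in vocab] for user in users]
    return vocab, matrix


if __name__ == "__main__":
    half_day = lambda t: "am" if t < 24 else "pm"  # hypothetical two-zone split
    users = {"u1": ["home"] * 16 + ["workplace"] * 20 + ["home"] * 12}
    vocab, matrix = position_sequence_matrix(users, half_day)
    for word, freq in zip(vocab, matrix[0]):
        print(word, freq)
```

Each row of the matrix plays the role of one document's word counts, which is the input form a topic model expects.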

5. The user activity pattern division and attribute inference method according to claim 1, characterized in that building the topic model comprises the following steps: extracting activity chain data from traditional survey data containing personal attributes and trip information, and determining each user's position sequence for one day according to the user's trip purposes as the input data of the topic model; determining the optimal number of topics according to the position sequence; and building the topic model according to the optimal number of topics.

6. The user activity pattern division and attribute inference method according to claim 5, characterized in that determining the optimal number of topics comprises the following steps:

$$\mathrm{Perplexity} = \exp\left[-\frac{\sum_{m=1}^{M} \log p(w_m \mid M)}{\sum_{m=1}^{M} N_m}\right]$$

where N_m is the number of periods, equal to 47; M is the model; w_m is the total number of words that do not appear in the position sequence; and N_m is the total number of words that appear in the position sequence;

drawing a training set and a test set at random from the survey data, solving the model using the training set data, and calculating the perplexity using the test set data;

testing the number of topics K_topic from 2 to 50, training 10 times for each number of topics K_topic, and taking the average of the perplexity;

selecting, as the optimal number of topics for building the topic model, the number of topics corresponding to the minimum point at which the perplexity turns from falling to rising, or to the critical point at which its decline levels off.
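The perplexity computation and topic-number selection in this claim can be sketched as follows. This is a minimal illustration that assumes the per-sequence log-likelihoods log p(w_m | M) have already been produced by a fitted topic model; the function names are hypothetical, and only the "minimum point" criterion is implemented, not the alternative of detecting where the decline levels off.

```python
import math
from typing import Dict, Sequence


def perplexity(log_likelihoods: Sequence[float], word_counts: Sequence[int]) -> float:
    """Compute exp(-sum_m log p(w_m | M) / sum_m N_m) over held-out sequences."""
    if len(log_likelihoods) != len(word_counts):
        raise ValueError("one log-likelihood and one word count per sequence")
    total_words = sum(word_counts)
    if total_words <= 0:
        raise ValueError("total word count must be positive")
    return math.exp(-sum(log_likelihoods) / total_words)


def mean_perplexity(runs: Sequence[float]) -> float:
    """Average the perplexity over repeated training runs for one topic number."""
    return sum(runs) / len(runs)


def best_topic_number(avg_perplexity: Dict[int, float]) -> int:
    """Return the topic number with the lowest average perplexity."""
    return min(avg_perplexity, key=avg_perplexity.get)


if __name__ == "__main__":
    # A model that assigns probability 1/4 to each of 47 words has perplexity 4.
    print(perplexity([47 * math.log(0.25)], [47]))
    # Hypothetical averaged perplexities for K_topic = 2, 3, 4.
    print(best_topic_number({2: 30.0, 3: 25.0, 4: 27.0}))
```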

7. The user activity pattern division and attribute inference method according to claim 1, characterized in that building the activity pattern division model according to the first topic probability distribution comprises the following steps:

determining the optimal number of clusters K_cluster according to the first topic probability distribution output by the topic model, that is, a K-dimensional representation with the topics as coordinates, using the following formula:

For each point i in a cluster, S(i) is the silhouette coefficient of point i, a(i) is the average distance from point i to the other points in its own cluster, and b(i) is the minimum, over all clusters other than its own, of the average distance from point i to the points of that cluster. The silhouette coefficients of all points are averaged to obtain the overall silhouette coefficient of the clustering result;

selecting the K_cluster with the minimum silhouette coefficient as the final number of clusters to obtain the classes of activity patterns; and building the activity pattern division model according to the classes of activity patterns.
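The formula itself is not reproduced in this text. The sketch below assumes the standard silhouette definition, S(i) = (b(i) - a(i)) / max{a(i), b(i)}, which matches the definitions of a(i) and b(i) given in this claim, and it assumes Euclidean distance between topic-distribution vectors. The function names and sample points are hypothetical, and the sketch only computes the coefficient for one clustering; it does not choose K_cluster.

```python
import math
from typing import Dict, List, Sequence


def silhouette_scores(
    points: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[float]:
    """Return S(i) for every point, assuming S(i) = (b - a) / max(a, b)."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        # a(i): mean distance to the other points of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest mean distance to the points of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def mean_silhouette(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Average S(i) over all points to get the overall silhouette coefficient."""
    scores = silhouette_scores(points, labels)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(mean_silhouette(pts, [0, 0, 1, 1]))
    print(mean_silhouette(pts, [0, 1, 0, 1]))
```

In practice this would be evaluated once per candidate K_cluster on the clustering of the topic probability vectors, and the resulting coefficients compared.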

8. The user activity pattern division and attribute inference method according to claim 7, characterized in that the classes of activity patterns include: late-return work, early-departure work, normal work, and normal school.

9. The user activity pattern division and attribute inference method according to claim 1, characterized in that building the personal attribute inference model based on a Bayesian network comprises the following steps:

D denotes the training data set;

S denotes the network structure;

inputting the activity pattern into the Bayesian network structure as the known variable, and outputting the conditional probability distribution corresponding to each attribute variable under that activity pattern;

for each attribute variable, taking the socioeconomic attribute corresponding to the maximum conditional probability as the socioeconomic attribute information corresponding to the activity pattern.