CN107301433A

CN107301433A - Method and system for identifying online ride-hailing vehicles based on a clustering discriminant model

Info

Publication number: CN107301433A
Application number: CN201710573249.5A
Authority: CN
Inventors: 冷婷; 谈炜; 石路路; 王计斌
Original assignee: Nanjing Hua Su Science And Technology Ltd
Current assignee: Nanjing Hua Su Science And Technology Ltd
Priority date: 2017-07-14
Filing date: 2017-07-14
Publication date: 2017-10-27

Abstract

The invention discloses a method and system for identifying online ride-hailing vehicles based on a clustering discriminant model. The method comprises the following steps. Step (1): obtain raw data, randomly select a number of known taxi driver users as sample set M, and randomly select a number of driver users of unknown category as sample set N. Step (2): perform feature extraction. Step (3): analyze the features. Step (4): build the model. Step (5): feed the collected signaling data of unknown drivers into the model built in step (4) and classify them. Using mobile phone signaling data, the method extracts the movement features of drivers and can determine, quickly and conveniently, whether unlabeled data belong to the known class even when labels are available for only one class. The identification results can support traffic administration departments in cracking down on unlicensed ride-hailing vehicles, help them locate suspect vehicles quickly, reduce the labor cost of law enforcement, and improve working efficiency.

Description

Method and system for identifying online ride-hailing vehicles based on a clustering discriminant model

Technical field

The invention belongs to the technical field of online ride-hailing vehicle management, and in particular relates to a method and system for identifying online ride-hailing vehicles based on a clustering discriminant model.

Background

Driven by the "Internet+" trend and by market demand, online ride-hailing has quickly become a market favorite as an emerging way of travelling by car, and an important component of smart travel.

An online ride-hailing vehicle is a taxi booked online: a travel mode that connects passengers, drivers, and vehicles, in which a passenger uses a smartphone application to book a driver for pick-up and drop-off. Ride-hailing meets the public's diverse travel needs and improves the utilization of motor vehicles, but as the ride-hailing fleet keeps growing, the regulatory problems it brings cannot be ignored.

Ride-hailing vehicles both differ from and resemble traditional taxis. In color and model, taxis generally carry a uniform color and markings, whereas ride-hailing vehicles vary widely. In mode of operation, taxis may cruise for passengers, wait at stands, and accept bookings, whereas ride-hailing vehicles may not cruise for passengers and may only accept bookings through an online platform. In supervision, taxis are generally managed by taxi companies, whereas ride-hailing vehicles lack an adequate oversight mechanism.

At first, ride-hailing supplemented the taxi service. As the number of dedicated ride-hailing drivers grew, ride-hailing put pressure on the traditional taxi industry and met some resistance from taxi drivers. In addition, because ride-hailing platforms do not vet drivers and vehicles strictly, market disorder has spread, and disputes, accidents, and other social problems keep emerging. The ride-hailing market urgently needs standardized administration.

To address this disorder, the "Interim Measures for the Administration of Online Taxi Booking Business Operations and Services" took effect on November 1, 2016. They clearly stipulate that, while in service, drivers must not cruise the streets for passengers, and must not solicit passengers at airports, railway stations, and other places that have unified dispatch stations or queuing areas for cruising taxis.

Against the background of these new ride-hailing regulations, the transportation department, as the body responsible for public travel services, must strengthen its management of ride-hailing vehicles. At present, ride-hailing vehicles are managed through manual patrols, which consume a great deal of manpower. The transportation department therefore urgently needs an automated screening method that helps it identify suspect vehicles and enforce the rules quickly and efficiently.

Summary of the invention

The problem to be solved by the present invention is to provide a method for identifying online ride-hailing vehicles based on a clustering discriminant model, which extracts the movement features of drivers from mobile phone signaling data.

To solve the above technical problem, the technical solution adopted by the present invention is a method for identifying online ride-hailing vehicles based on a clustering discriminant model, comprising the following steps:

Step (1): obtain raw data, randomly select a number of known taxi driver users as sample set M, and randomly select a number of driver users of unknown category as sample set N;

Step (2): obtain the signaling data, over a period of time, of the driver users in sample set M and sample set N from step (1), and perform feature extraction;

Step (3): analyze the features extracted in step (2) to confirm that ride-hailing drivers and taxi drivers differ to some extent;

Step (4): build the model: randomly divide sample set M into a clustering training set P and a validation set Q, and use sample set N as test set N;

Perform cluster analysis on training set P, calculate the optimal number of clusters K, reject the abnormal sample points in training set P, obtain the cluster centers, calculate the sum of the distances from each valid sample point in training set P to the cluster centers, and derive the classification threshold from how the distance increment changes;

Step (5): feed the collected signaling data of unknown drivers into the model built in step (4) and classify them.

In the present invention, the movement features of drivers are extracted from mobile phone signaling data, so that even when labels are available for only one class it can be determined, quickly and conveniently, whether unlabeled data belong to the known class. The feature analysis in step (3) shows whether the features extracted in step (2) are appropriate: if no difference is observed, the feature extraction is flawed. Step (4) builds a clustering model with taxi drivers as samples, so that step (5) can determine quickly and efficiently whether the signaling data of an unknown driver user belong to the known taxi class.

Preferably, in step (4), the model obtained in step (4) is validated using validation set Q and tested using test set N.

Using validation set Q and test set N improves the accuracy of the clustering model.

Preferably, in step (2), the extracted features cover cell handover and dwell time. The cell handover features are the daily mean of the cell handover count, the daily standard deviation of the cell handover count, the mean busy-hour cell handover count, the standard deviation of the busy-hour cell handover count, the mean idle-hour cell handover count, and the standard deviation of the idle-hour cell handover count. The dwell time features are the busy-hour dwell median, busy-hour dwell mean, busy-hour dwell standard deviation, idle-hour dwell median, idle-hour dwell mean, and idle-hour dwell standard deviation.

Preferably, in step (4), the optimal number of clusters K for training set P is calculated using the silhouette coefficient, an evaluation index of how compact and how well separated the clusters are. The formula is as follows:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

a(i) is the average dissimilarity between vector i and the other points in the same cluster, that is, a measure of within-cluster cohesion;

b(i) is the minimum of the average dissimilarities between vector i and each of the other clusters, that is, a measure of between-cluster separation;

s(i) ranges from -1 to 1; the larger the value, the better the within-cluster cohesion and between-cluster separation.

Preferably, in step (4), the sum of the distances from each valid sample point in training set P to the cluster centers is calculated and sorted, and an increment plot is drawn, in which the X-axis is the sample sequence number in training set P and the Y-axis is the sum of the distances from the sample point to the center points. The inflection point of training set P is read from the plot, and the Y-axis value at the inflection point is the classification threshold y;

Threshold=y_(x=101)=2.239995.

Another problem solved by the present invention is to provide a system for identifying online ride-hailing vehicles based on a clustering discriminant model. The system comprises a data collection module, a data clustering analysis module, and a data processing module;

The data collection module receives the signaling data of ride-hailing drivers and taxi drivers;

The data clustering analysis module randomly selects a number of the taxi driver signaling records collected by the data collection module as sample set M, randomly selects a number of the driver users of unknown category collected by the data collection module as sample set N, extracts features, and builds the clustering discriminant model based on sample set M;

The data processing module imports the acquired driver user signaling data and determines the category of each driver user with the clustering discriminant model.

The identification system of the invention is based on mobile phone signaling data provided by the mobile operator and uses a clustering-based discriminant model to distinguish taxi drivers from ride-hailing drivers. The identification results can support traffic administration departments in cracking down on unlicensed ride-hailing vehicles, help them locate suspect vehicles quickly, reduce the labor cost of law enforcement, and improve working efficiency.

Brief description of the drawings

Embodiments of the present invention are further described below with reference to the accompanying drawings:

Fig. 1 is a scatter plot of the sample distribution drawn from the daily standard deviation of the cell handover count and the idle-hour dwell standard deviation;

Fig. 2 is a scatter plot of the sample distribution drawn from the daily mean and the daily standard deviation of the cell handover count;

Fig. 3 is the sample distribution after t-SNE feature dimensionality reduction;

Fig. 4 is the modeling analysis flow chart;

Fig. 5 is a schematic diagram of obtaining the optimal number of clusters;

Fig. 6 is a schematic diagram of the cluster analysis result;

Fig. 7 is a schematic diagram of the cluster analysis result after outliers are rejected;

Fig. 8 is a line chart of the cluster center distribution;

Fig. 9 shows box plots of the cluster sample distributions on the salient features;

Fig. 10 is the increment plot of the sorted sums of distances from each valid sample point x in training set P to the center points;

Fig. 11 is a simplified flow chart of the ride-hailing vehicle identification method of the invention based on a clustering discriminant model;

Fig. 12 is a structure diagram of the ride-hailing vehicle identification system of the invention based on a clustering discriminant model.

Detailed description of the embodiments

As shown in Fig. 11, the ride-hailing vehicle identification method based on a clustering discriminant model in this embodiment comprises steps (1) to (5) set out above.

In step (4), the model obtained in step (4) is validated using validation set Q and tested using test set N.

In step (2), the extracted features cover cell handover and dwell time. The cell handover features are the daily mean of the cell handover count, the daily standard deviation of the cell handover count, the mean busy-hour cell handover count, the standard deviation of the busy-hour cell handover count, the mean idle-hour cell handover count, and the standard deviation of the idle-hour cell handover count. The dwell time features are the busy-hour dwell median, busy-hour dwell mean, busy-hour dwell standard deviation, idle-hour dwell median, idle-hour dwell mean, and idle-hour dwell standard deviation.

In addition, in step (4), the optimal number of clusters K for training set P is calculated using the silhouette coefficient, an evaluation index of how compact and how well separated the clusters are. The formula is as follows:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

In step (4), the sum of the distances from each valid sample point in training set P to the cluster centers is calculated and sorted, and an increment plot is drawn, in which the X-axis is the sample sequence number in training set P and the Y-axis is the sum of the distances from the sample point to the center points. The inflection point of training set P is read from the plot, and the Y-axis value at the inflection point is the classification threshold y;

Threshold=y_(x=101)=2.239995.

The specific operation of the ride-hailing vehicle identification method of this embodiment is as follows:

Data acquisition:

As shown in Table 1, driver users are obtained from the following three raw data sets:

Table 1

Dataset name	Description
A	List of taxi driver users provided by the transportation department
B	List of taxi corporate group users
C	Driver users who appeared at base stations near the South Station and used the DiDi driver app

The taxi driver user data set is:

D=A ∩ B ∩ C

From data set D, 150 known taxi driver users are randomly selected as sample set M.

E=C-D

From data set E, 150 driver users of unknown category are randomly selected as sample set N.
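The set construction above can be sketched in a few lines of Python. This is only an illustration: the function name `build_sample_sets`, the toy user IDs, and the fixed random seed are assumptions and do not come from the patent.

```python
import random

def build_sample_sets(a, b, c, size, seed=0):
    """Return (M, N): sampled known taxi drivers and unknown-category drivers."""
    d = a & b & c          # D = A ∩ B ∩ C: users confirmed as taxi drivers
    e = c - d              # E = C - D: driver users of unknown category
    rng = random.Random(seed)
    # Sort before sampling so the result is reproducible for a given seed.
    m = rng.sample(sorted(d), min(size, len(d)))
    n = rng.sample(sorted(e), min(size, len(e)))
    return m, n

# Toy user IDs (hypothetical); the embodiment uses size=150.
A = {"u1", "u2", "u3", "u4"}
B = {"u2", "u3", "u4", "u5"}
C = {"u3", "u4", "u6", "u7"}
M, N = build_sample_sets(A, B, C, size=2)
```

With these toy sets, D is {u3, u4} and E is {u6, u7}, so M and N each contain exactly those two users.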

Feature extraction:

The signaling data of the above 300 users over the two weeks from March 6 to March 19, 2017 are extracted as the raw data for feature extraction.

Busy hours are defined as 9:00-17:00 from Monday to Friday; idle hours are defined as 17:00-24:00 and 0:00-9:00 from Monday to Friday.
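As a minimal sketch of the busy/idle definition above, a signaling timestamp could be labeled as follows. The function name `period_of` is hypothetical. Two choices here are assumptions, because the text does not specify them: weekend timestamps return `None`, and 17:00 exactly is counted as idle.

```python
from datetime import datetime

def period_of(ts):
    """Classify a timestamp as 'busy', 'idle', or None (weekend)."""
    if ts.weekday() >= 5:      # Saturday/Sunday: outside the stated definition
        return None
    # Monday-Friday: 9:00-17:00 is busy, the rest of the day is idle.
    return "busy" if 9 <= ts.hour < 17 else "idle"

labels = [
    period_of(datetime(2017, 3, 6, 10, 0)),   # Monday morning
    period_of(datetime(2017, 3, 6, 18, 30)),  # Monday evening
    period_of(datetime(2017, 3, 11, 10, 0)),  # Saturday
]
```

Handover events labeled this way can then be counted per user and per period to obtain the busy-hour and idle-hour statistics.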

The extracted features mainly cover cell handover and dwell time, as shown in Table 2:

Table 2

After the above features are extracted, scatter plots are drawn from any two feature dimensions, as shown in Figs. 1 and 2. In Fig. 1, the abscissa is the standardized daily standard deviation of the cell handover count and the ordinate is the standardized idle-hour dwell standard deviation. In Fig. 2, the abscissa is the standardized daily mean of the cell handover count and the ordinate is the standardized daily standard deviation of the cell handover count. Red points represent sample set M, that is, taxi drivers; blue points represent sample set N, that is, driver users of unknown category. Figs. 1 and 2 show intuitively that the distributions of sample set M and sample set N differ to some extent, which indirectly indicates that the features capture, to a degree, the behavioral differences between the two classes of drivers.
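The daily statistics and the standardization mentioned above can be sketched with the Python standard library. The function names are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev

def daily_handover_features(daily_counts):
    """Mean and standard deviation of one user's per-day handover counts."""
    return mean(daily_counts), pstdev(daily_counts)

def z_score(column):
    """Standardize one feature column to zero mean and unit variance."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:             # constant column: nothing to scale
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

m, s = daily_handover_features([10, 20, 30])   # toy per-day counts
z = z_score([1, 2, 3])                         # toy feature column
```

Standardizing each feature column before plotting or clustering keeps features with large numeric ranges, such as dwell time, from dominating the distance calculations.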

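Feature analysis:

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a manifold data dimensionality reduction method proposed by Laurens van der Maaten and Geoffrey Hinton. It builds on SNE and uses the heavier-tailed t-distribution in the low-dimensional space to avoid the crowding problem and the difficulty of optimization.

The algorithm first converts Euclidean distances into conditional probabilities that express the similarity between points. Given N high-dimensional data points $x_1, \ldots, x_N$, the probability $p_{j|i}$ is computed as:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$

For the low-dimensional points $y_i$, the similarity between two points under the t-distribution is:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

The gradient used for optimization is:

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}$$

where $p_{ij} = (p_{j|i} + p_{i|j}) / (2N)$ is the symmetrized joint probability and C is the Kullback-Leibler divergence between the two distributions.

t-SNE is used to reduce the dimensionality of the features and visualize them, as shown in Fig. 3. The visualization in Fig. 3 shows that, based on the selected features, the distributions of the two classes of drivers differ to some extent.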
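To make the first step of t-SNE concrete, the sketch below computes the conditional probabilities p_{j|i} for a single point. It is a simplification under a stated assumption: the bandwidth `sigma` is fixed, whereas full t-SNE searches a separate bandwidth for each point to match a target perplexity. The function name is hypothetical.

```python
import math

def conditional_probabilities(points, i, sigma=1.0):
    """p_{j|i} for all j, using a Gaussian kernel centred on points[i]."""
    weights = []
    for j, x in enumerate(points):
        if j == i:
            weights.append(0.0)          # p_{i|i} is defined as 0
        else:
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], x))
            weights.append(math.exp(-d2 / (2 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]

# Toy 2-D points: the second is close to the first, the third is far away.
p = conditional_probabilities([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)], i=0)
```

The nearer neighbor receives almost all of the probability mass, which is how t-SNE encodes local similarity before searching for a low-dimensional layout.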

Building the model:

A clustering-based discriminant model is used to determine whether an unknown driver user is a taxi driver or a ride-hailing driver. The specific analysis flow is shown in Fig. 4.

1. Selecting the number of clusters

Sample set M is randomly divided in a ratio of 8:2 into clustering training set P and validation set Q, and sample set N is used as test set N.
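A minimal sketch of the 8:2 random split follows. The function name, the placeholder sample IDs, and the fixed seed are illustrative assumptions.

```python
import random

def split_train_validation(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples and cut them into (training, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

M = [f"taxi_{k}" for k in range(150)]   # placeholder IDs for the 150 samples
P, Q = split_train_validation(M)
```

For 150 samples this yields 120 training samples in P and 30 validation samples in Q, which matches the 30 validation samples reported in the results.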

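For training set P, the optimal number of clusters K is calculated using the silhouette coefficient, an evaluation index of how compact and how well separated the clusters are:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where a(i) is the average dissimilarity between vector i and the other points in the same cluster, and b(i) is the minimum of the average dissimilarities between vector i and each of the other clusters.

As shown in Fig. 5, s(i) is largest when the number of clusters is 3. The optimal number of clusters is therefore taken as K=3.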
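The silhouette coefficient can be computed directly from its definition. The sketch below is a plain-Python illustration and makes some assumptions: Euclidean distance is the dissimilarity measure, singleton clusters score 0, and at least two clusters are present. The function name is hypothetical.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient s(i) over all points (needs >= 2 clusters)."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)           # common convention for singletons
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(pts, [0, 0, 1, 1])   # compact, well-separated clusters
bad = silhouette_score(pts, [0, 1, 0, 1])    # clusters that mix the two groups
```

To select K, one would cluster the training set for each candidate K, score each labeling this way, and keep the K with the highest mean score.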

2. Cluster analysis

Training set P is clustered using the K-Means algorithm.

K-Means is a partitioning clustering algorithm in which cluster similarity is computed against a center obtained from the mean of the objects in each cluster. It works as follows. First, k objects are chosen arbitrarily from the n data objects as the initial cluster centers. Each remaining object is then assigned to the cluster it is most similar to, according to its similarity (distance) to the cluster centers. Next, the center of each resulting cluster is recomputed as the mean of all objects in that cluster. This process repeats until the criterion function converges. The mean squared error is generally used as the criterion function.
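The working process described above corresponds to Lloyd's algorithm, sketched below in plain Python. This is an illustrative implementation, not the one used in the patent. It stops when the assignments no longer change, which is a common stand-in for convergence of the mean-squared-error criterion. The function name and the seed are assumptions.

```python
import random

def k_means(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # arbitrary initial centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:                          # keep the old center if empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
        if new_labels == labels:                 # assignments stable: stop
            break
        labels = new_labels
    return centers, labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(pts, k=2)
```

On this toy data the two groups of three points end up in separate clusters. Which group receives label 0 depends on the random initialization.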

Training set P is clustered into 3 classes; the clustering result is shown in Fig. 6.

On the basis of this clustering result, outliers are handled, leaving 108 valid sample points. Their distribution is shown in Table 3.

Table 3

Category	cluster1	cluster2	cluster3	Total
Sample number	46	45	17	108

As shown in Fig. 7, the feature values of the center point in each dimension can accordingly be obtained for each cluster.

3. User behavior feature analysis

With the features on the abscissa and the feature values on the ordinate, a line chart is drawn to examine the distribution of the three cluster centers, as shown in Fig. 8. Fig. 8 shows that the three clusters differ considerably on 6 indicators: mean_worktime (mean busy-hour cell handover count); sd_worktime (standard deviation of the busy-hour cell handover count); mean_nonworktime (mean idle-hour cell handover count); sd_nonworktime (standard deviation of the idle-hour cell handover count); switch_cell_number_daily_mean (daily mean of the cell handover count); switch_cell_number_daily_sd (daily standard deviation of the cell handover count).

Box plots of the samples of the three categories on the above 6 features are drawn (see Fig. 9). In Fig. 9 the abscissa is the category. The lower whisker of each box plot represents the minimum and the upper whisker the maximum; the bottom of the box represents the first quartile, the top of the box the third quartile, and the line inside the box the median. The width of the box indicates the number of samples in the category. Overall, the box plots show the distribution of the samples in each category.

On the 6 features above, cluster1 and cluster2 follow a similar overall trend, with the feature values of cluster2 all lower than those of cluster1, whereas cluster3 and cluster1 follow opposite trends overall. Specifically:

(1) For the drivers in cluster1, the following conclusions can be drawn:

The mean_worktime (mean busy-hour cell handover count) indicator is the highest, showing that these taxi drivers are most active from 9:00 to 17:00 on Monday to Friday, that is, in the daytime;

The mean_nonworktime (mean idle-hour cell handover count) indicator is relatively low, showing that these taxi drivers are less active from 17:00 to 24:00 and from 0:00 to 9:00 on Monday to Friday, that is, at night;

The switch_cell_number_daily_mean (daily mean of the cell handover count) indicator is the highest, showing that these taxi drivers are more active overall.

These drivers therefore show the behavior typical of cruising taxis.

(2) For the drivers in cluster2, the following conclusions can be drawn:

The mean_worktime (mean busy-hour cell handover count) indicator is relatively low, showing that these taxi drivers are not very active from 9:00 to 17:00 on Monday to Friday, that is, in the daytime;

The mean_nonworktime (mean idle-hour cell handover count) indicator is relatively low, showing that these taxi drivers are also not very active from 17:00 to 24:00 and from 0:00 to 9:00 on Monday to Friday, that is, at night;

The switch_cell_number_daily_mean (daily mean of the cell handover count) indicator is likewise relatively low, showing that these taxi drivers are not very active overall.

These taxi drivers switch cells relatively rarely; in other words, they tend to stay in certain areas and wait for passengers. In terms of behavior, they therefore resemble ride-hailing drivers, who wait in place for orders.

(3) For the drivers in cluster3, the following conclusions can be drawn:

The mean_nonworktime (mean idle-hour cell handover count) indicator is relatively high, showing that these taxi drivers are more active from 17:00 to 24:00 and from 0:00 to 9:00 on Monday to Friday, that is, at night;

The switch_cell_number_daily_mean (daily mean of the cell handover count) indicator is relatively high, showing that these taxi drivers tend to be active overall.

These taxi drivers rest by day and work at night. In terms of behavior, they therefore also resemble typical ride-hailing drivers, who follow the same pattern.

(4) Overall:

Users in cluster1 show typical taxi driver behavior;

Users in cluster2 and cluster3, although they are taxi drivers, behave much like ride-hailing drivers.

4. Setting the threshold

The sum of the distances from each valid sample point x in training set P to the center points is calculated and sorted, and an increment plot is drawn, as shown in Fig. 10. In Fig. 10, the x-axis is the training sample sequence number and the y-axis is the sum of the distances from the sample point to the center points.

The figure shows that:

When x<101, the distance grows slowly;

When x>101, the distance grows quickly;

It follows that:

x=101 is the inflection point of the sample set. Its corresponding distance, that is, its y value, is therefore set as the classification threshold:

Threshold=y_(x=101)=2.239995.
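The text reads the inflection point off the plot in Fig. 10. As a hedged sketch of how this could be automated, the function below picks the point where the slope of the sorted curve increases the most, using the largest second difference. This heuristic is an assumption and not the patent's stated procedure. The function name and the toy data are hypothetical.

```python
def knee_threshold(distance_sums):
    """Return (index, value) of the point with the largest slope increase.

    The index is 1-based to match the sample sequence numbers on the x-axis.
    At least three values are required.
    """
    ys = sorted(distance_sums)
    best_i, best_gain = 1, float("-inf")
    for i in range(1, len(ys) - 1):
        # Second difference: slope after point i minus slope before it.
        gain = (ys[i + 1] - ys[i]) - (ys[i] - ys[i - 1])
        if gain > best_gain:
            best_i, best_gain = i, gain
    return best_i + 1, ys[best_i]

# Toy curve: flat growth for four samples, then a sharp rise.
x_knee, threshold = knee_threshold([1.2, 1.0, 3.0, 1.1, 2.0, 1.3])
```

On real data the second difference is noisy, so the result would normally be checked against the plot, as the embodiment does.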

5. Outputting the result

To determine the category of an unknown sample, this patent combines clustering with a threshold to separate taxi drivers from unlicensed ride-hailing drivers.

When the sum of the distances from a sample point in the test set to the three cluster centers is greater than the threshold, the sample is judged to be an unlicensed ride-hailing vehicle; otherwise it is judged to be a taxi.
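The decision rule above can be sketched as follows. The function name, the two-dimensional toy centers, and the label strings are illustrative assumptions; only the threshold value 2.239995 comes from the text. `math.dist` requires Python 3.8 or later.

```python
import math

def classify(sample, centers, threshold):
    """Label a feature vector by its summed distance to all cluster centers."""
    total = sum(math.dist(sample, c) for c in centers)
    return "ride-hailing" if total > threshold else "taxi"

# Hypothetical 2-D cluster centers; the real model has three centers in the
# full standardized feature space.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
near = classify((0.3, 0.3), centers, threshold=2.239995)
far = classify((3.0, 3.0), centers, threshold=2.239995)
```

A sample close to the taxi clusters stays under the threshold and is labeled a taxi, while a sample far from all three centers exceeds it.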

The results of classifying validation set Q and test set N are shown in Table 4:

Table 4

(1) It can be seen that:

Of the 30 samples in validation set Q, the model judges 23 driver users to be taxis, a recall of 76.7%.

Of the 150 samples in test set N, the clustering-based discriminant model finds that 97 driver users are taxi drivers; that is, 64.7% of the drivers are identified as taxi drivers.

(2) Further:

The 97 users in test set N who are judged to be taxis are further classified according to their distances to the three center points. The summarized results are shown in Table 5:

Table 5

Category	cluster1	cluster2	cluster3	Total
Sample number	11	86	0	97
Share of test set N	7.3%	57.3%	0	64.7%

The above classification results show that only 7.3% of the drivers in test set N are typical taxi drivers; the remaining 57.3% who are judged to be taxi drivers behave much like unlicensed ride-hailing drivers.

As shown in Fig. 12, the ride-hailing vehicle identification system of the invention based on a clustering discriminant model comprises a data collection module, a data clustering analysis module, and a data processing module;

The traffic administration department imports newly acquired driver user signaling data directly into the system to obtain the identification results. These results can support the department in cracking down on unlicensed ride-hailing vehicles, help it locate suspect vehicles quickly, reduce the labor cost of law enforcement, and improve working efficiency.

The specific embodiments described above explain the purpose, technical solution, and beneficial effects of the present invention in further detail. It should be understood that the foregoing are only specific embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for identifying online ride-hailing vehicles based on a clustering discriminant model, characterized by comprising the following steps:

Step (1): obtain raw data, randomly select a number of known taxi driver users as sample set M, and randomly select a number of driver users of unknown category as sample set N;

Step (2): obtain the signaling data, over a period of time, of the driver users in sample set M and sample set N from step (1), and perform feature extraction;

Step (3): analyze the features extracted in step (2) to confirm that ride-hailing drivers and taxi drivers differ to some extent;

Step (4): build the model: randomly divide sample set M into a clustering training set P and a validation set Q, and use sample set N as test set N;

Perform cluster analysis on training set P, calculate the optimal number of clusters K, reject the abnormal sample points in training set P, obtain the cluster centers, calculate the sum of the distances from each valid sample point in training set P to the cluster centers, and derive the classification threshold from how the distance increment changes;

Step (5): feed the collected signaling data of unknown drivers into the model built in step (4) and classify them.

2. The method for identifying online ride-hailing vehicles based on a clustering discriminant model according to claim 1, characterized in that, in step (4), the model obtained in step (4) is validated using validation set Q and tested using test set N.

3. The method for identifying online ride-hailing vehicles based on a clustering discriminant model according to claim 1, characterized in that, in step (2), the extracted features cover cell handover and dwell time, wherein the cell handover features are the daily mean of the cell handover count, the daily standard deviation of the cell handover count, the mean busy-hour cell handover count, the standard deviation of the busy-hour cell handover count, the mean idle-hour cell handover count, and the standard deviation of the idle-hour cell handover count; and the dwell time features are the busy-hour dwell median, busy-hour dwell mean, busy-hour dwell standard deviation, idle-hour dwell median, idle-hour dwell mean, and idle-hour dwell standard deviation.

4. The method for identifying online ride-hailing vehicles based on a clustering discriminant model according to claim 1, characterized in that, in step (4), the optimal number of clusters K for training set P is calculated using the silhouette coefficient, an evaluation index of how compact and how well separated the clusters are, with the following formula:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

5. The method for identifying online ride-hailing vehicles based on a clustering discriminant model according to claim 1, characterized in that, in step (4), the sum of the distances from each valid sample point in training set P to the cluster centers is calculated and sorted, and an increment plot is drawn, in which the X-axis is the sample sequence number in training set P and the Y-axis is the sum of the distances from the sample point to the center points; the inflection point of training set P is read from the plot, and the Y-axis value at the inflection point is the classification threshold y;

Threshold=y_(x=101)=2.239995.

6. A system for identifying online ride-hailing vehicles based on a clustering discriminant model, characterized in that the system comprises a data collection module, a data clustering analysis module, and a data processing module;

The data clustering analysis module randomly selects a number of the taxi driver signaling records collected by the data collection module as sample set M, randomly selects a number of the driver users of unknown category collected by the data collection module as sample set N, extracts features, and builds the clustering discriminant model based on sample set M;

The data processing module imports the acquired driver user signaling data and determines the category of each driver user with the clustering discriminant model.