CN107633067B - Group identification method based on personnel behavior rule and data mining method - Google Patents

Group identification method based on personnel behavior rule and data mining method

Info

Publication number
CN107633067B
Authority
CN
China
Prior art keywords
similarity
semantic information
area
staying
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710862301.9A
Other languages
Chinese (zh)
Other versions
CN107633067A (en
Inventor
丁治明
司云飞
才智
曹阳
迟远英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201710862301.9A
Publication of CN107633067A
Application granted
Publication of CN107633067B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a group identification method based on personnel behavior rules and a data mining method. It belongs to the field of data mining and, in particular, relates to a method for identifying key groups in large-scale activities based on personnel behavior rules. The method uses the trajectory data of persons to extract staying areas and the frequency with which each person visits each staying area. Based on the extracted staying-area information, it further extracts the semantic information of each area to express user behavior more accurately. It then performs group clustering with a data mining method according to the persons' behavior rules and feature similarity, and finally identifies key special groups within the target group.

Description

Group identification method based on personnel behavior rule and data mining method
Technical Field
The invention belongs to the field of data mining, and relates to a method for identifying key groups in large-scale activities based on personnel behavior rules.
Background
With the growth of market economic activity and the improvement of people's material and cultural living standards, large-scale events are held more and more frequently, which poses serious challenges for running them safely and preventing emergencies. The most important problem in security work for large-scale events is how to identify special groups within the target population so that preventive measures can be taken in advance. Meanwhile, the rapid development of wireless communication technology has produced a large amount of moving-object data. These data depict the spatio-temporal dynamics of individuals and groups and contain the behavior information of the moving objects; analyzing the movement data of a target population helps reveal its behavior rules, group trends, and so on.
In recent years, technologies such as satellite communication, GPS devices, RFID, wireless sensors, Internet of Things communication, and video tracking have developed continuously and come into wide use, so that moving objects of all sizes can be accurately positioned and effectively tracked worldwide. With these technologies, signal receivers collect large amounts of moving-object data from positioning terminals. The data contain rich information such as position and time, and their volume and complexity grow over time. Moving-object data have also become a new avenue for data analysis: especially before a major event, studying the motion trajectories of the relevant groups helps to identify groups, understand group movement, and analyze group behavior rules, so that targeted preventive work can be done for large-scale events.
The technique mines the data with a clustering method from data mining. Similar groups often have similar characteristics; based on the extracted personnel feature data and a purpose-designed formula for the similarity between persons, a suitable clustering algorithm is selected to identify key special groups within the target group.
Disclosure of Invention
The invention provides a group identification method based on personnel behavior rules and a data mining method. The trajectory data of persons are used to extract staying areas and the frequency with which each person visits each staying area. Based on the extracted staying-area information, the semantic information of each area is further extracted to express user behavior more accurately. Group clustering is then performed with a data mining method, combining the persons' behavior rules and feature similarity, and key special groups are finally identified within the target group.
A group identification method based on personnel behavior rules and a data mining method comprises the following steps:
Step one: Extract the staying areas and the frequency with which persons visit each staying area from the trajectory data of the persons.
Step 1.1: Extract the stay points of a single trajectory. A stay point represents a geographic position where a person stays for a period of time. Each stay point extracted from a person's trajectory is associated with a real geographic position, and these positions reflect the person's activities to some extent. A single trajectory is defined as $T = (p_1, p_2, \ldots, p_n)$, where $p_i = (lat_i, lon_i, t_i)$, $0 \le i \le n$, $(lat_i, lon_i)$ are the latitude and longitude at location point $i$, and $t_i$ is the time at location point $i$.
Given a segment of the trajectory sequence $t = (p_i, \ldots, p_{i+m})$, if $distance(p_i, p_x) \le \theta_d$ and $|t_i - t_x| \ge \theta_t$ for $i \le x \le i+m$, where $p_x$ is the $x$-th track point in the sequence, $m$ is an integer from 0 to $n-i$, and $\theta_d$ and $\theta_t$ are the geographic distance threshold and the time threshold respectively, then $p(lat, lon)$ is a stay point, where
Figure BDA0001415349190000021
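The extraction in Step 1.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the time condition is checked between the first and last point of the segment, the stay point is taken as the mean latitude and longitude of the segment (the formula image is not reproduced in this text), and distances are great-circle distances. The names `haversine_m` and `extract_stay_points` are hypothetical.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def extract_stay_points(track, theta_d, theta_t):
    """track: list of (lat, lon, t) tuples sorted by t, with t in seconds.

    Returns a list of (lat, lon) stay points.
    """
    stays = []
    i, n = 0, len(track)
    while i < n:
        # Extend the segment while points stay within theta_d of the anchor p_i.
        j = i + 1
        while j < n and haversine_m(track[i][0], track[i][1],
                                    track[j][0], track[j][1]) <= theta_d:
            j += 1
        # Points i .. j-1 form the candidate segment.
        if track[j - 1][2] - track[i][2] >= theta_t:
            seg = track[i:j]
            # Assumption: the stay point is the mean position of the segment.
            stays.append((sum(p[0] for p in seg) / len(seg),
                          sum(p[1] for p in seg) / len(seg)))
            i = j  # continue after the segment
        else:
            i += 1
    return stays
```

With `theta_d` in metres and `theta_t` in seconds, a call such as `extract_stay_points(track, 200, 1200)` would report places where the person remained within 200 m for at least 20 minutes.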
Step 1.2: the personnel have more stops in areas frequently visited, and conversely, have less stops in areas less visited. The DBSCAN algorithm is applied to the cluster with high time complexity and more input parameters, so that a simple clustering algorithm (SC) is designed, the speed is high, only one input parameter, namely a distance threshold value tau is needed, each stop point is assigned to a cluster with the distance less than tau through traversing each stop point, and if no cluster exists and the distance of the point is less than tau, the point is used as a new cluster.
Each cluster is a staying area, denoted
Figure BDA0001415349190000022
Here the point set consists of all stay points in the staying area, lat and lon are the coordinates of the center of that point set, and r is the radius of the staying area.
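The SC algorithm of Step 1.2 can be sketched as follows. This is an illustrative reading, with assumptions labeled: the distance from a point to a cluster is taken as the distance to the cluster's current center, the center is the mean of the member points, and coordinates are planar (for example, projected to metres) so that `math.dist` applies. The function name `simple_clustering` and the dictionary keys are hypothetical.

```python
import math


def simple_clustering(points, tau, dist=math.dist):
    """Single-pass clustering with one parameter, the distance threshold tau.

    points: list of (x, y) stay points. Returns one dict per staying area
    with its member points, center and radius.
    """
    clusters = []  # member points of each cluster
    centers = []   # current center of each cluster
    for p in points:
        for idx, c in enumerate(centers):
            if dist(p, c) < tau:
                clusters[idx].append(p)
                members = clusters[idx]
                # Assumption: the center is the mean of the member points.
                centers[idx] = (sum(q[0] for q in members) / len(members),
                                sum(q[1] for q in members) / len(members))
                break
        else:
            # No cluster lies within tau of the point: start a new cluster.
            clusters.append([p])
            centers.append(p)
    return [
        {"points": m, "center": c, "radius": max(dist(q, c) for q in m)}
        for m, c in zip(clusters, centers)
    ]
```

The number of stay points in each returned area, `len(area["points"])`, can then serve as the frequency with which the person visits that area.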
Step two: Based on the extracted staying-area information, further extract the semantic information of each area.
Step 2.1: Geographic position alone is sometimes not enough to judge the relationship between persons accurately; the semantic information of the staying areas is also needed. A POI (Point of Information) describes the spatial and attribute information of a geographic entity, such as its name, address, category, and coordinates, which greatly strengthens the description of the actual geographic position and reflects user activity to some extent. In many cases the semantic information of a staying area is not of a single kind, so the category information in the area cannot simply be reduced to one category. Instead, multiple categories and their proportions are recorded as $sem = (\langle catg_1, freq_1 \rangle, \langle catg_2, freq_2 \rangle, \ldots, \langle catg_n, freq_n \rangle)$, $n \ge 1$. Here $sem$ denotes the semantic information in the staying area, and $\langle catg_1, freq_1 \rangle$ gives the category of the first item of semantic information and the frequency with which the person visits the geographic positions corresponding to it.
The semantic information in the staying areas is modeled with an LDA topic model: the POI information in a staying area is treated as a document, the semantic information of the area as the topics, and each POI as a word. To extract the semantic information of each staying area, the model is first trained with the POI information of all persons' staying areas as input data, and the trained model is then used to infer the semantic information of each staying area.
After the semantic information is extracted, the staying area is redefined as
Figure BDA0001415349190000031
Figure BDA0001415349190000032
represents the semantic information within the circle of radius r around the staying area.
Step 2.2: Remove meaningless semantic information.
Suppose the semantic information set of person A is (<residential area, 150>, <café, 5>, <gym, 45>) and that of person B is (<residential area, 200>, <research institution, 59>, <concert hall, 3>), where the two items in each pair are the semantic location and the frequency of visits to it (for simplicity, each area is represented by a single type of semantic information). In this example, "residential area" carries a large weight in both sets, yet it has no practical meaning for comparing the similarity of the two sets and is even an interference term; once it is removed, the true similarity of A and B is very low.
In general, "residential area" is semantic information shared by everyone: every person's trajectory semantics contains it, and its distinguishing features are a high visiting frequency and a fixed staying period. The method for removing meaningless semantic information is as follows:
1) loop over each item of semantic information and judge from the area's semantic information whether the area may be a residential area; if so, go to 2), otherwise go to 4);
2) judge whether the average stay-time distribution of all stay points in the staying area conforms to that of a residential area; if so, go to 3), otherwise go to 4);
3) delete the semantic information from the semantic information set;
4) exit the loop.
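The four steps above can be sketched as a filter over a person's semantic information set. This is only an illustration: the source does not specify how either judgment is made, so both are passed in as caller-supplied predicates, and step 4) is read here as moving on to the next item. The names `remove_residential`, `looks_residential` and `has_residential_stay_pattern` are hypothetical.

```python
def remove_residential(sem_items, looks_residential, has_residential_stay_pattern):
    """sem_items: list of (category, frequency) pairs for one person.

    looks_residential(category) -> bool            # step 1 (assumed predicate)
    has_residential_stay_pattern(category) -> bool # step 2 (assumed predicate)
    Returns the items that remain after the interference terms are deleted.
    """
    kept = []
    for category, freq in sem_items:
        # Steps 1 and 2: both judgments must hold for the item to be removed.
        if looks_residential(category) and has_residential_stay_pattern(category):
            continue  # step 3: delete this item
        kept.append((category, freq))
    return kept
```

Applied to person A's set from the example, with predicates that flag only "residential area", the result keeps the café and gym entries.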
Step three: Perform group clustering with a data mining method according to the persons' behavior rules and feature similarity, and finally identify key special groups within the target group.
Step 3.1: Similarity measure
Similarity is calculated from two aspects: geographic location similarity and semantic location similarity.
First, geographic location similarity. The Tanimoto coefficient, an extension of cosine similarity, is used to compare the similarity of two persons; unlike cosine similarity, it takes the influence of frequency and vector length into account. Given person A and person B with geographic location frequency vectors la and lb respectively, it is expressed as:
Figure BDA0001415349190000041
When judging whether two geographic positions are the same, positioning errors make it necessary to decide their relationship from the degree of overlap of the stay points in the two areas. The degree of overlap, or similarity, of two staying areas is defined as follows: take the area that contains fewer stay points, and divide the number of its stay points that fall within the intersection of the two areas by the total number of its stay points. This similarity is then added to the Tanimoto coefficient as a weight, forming a new weighted geographic location similarity measure. The formula is as follows:
Figure BDA0001415349190000042
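As an illustration of the two ingredients described above, the sketch below computes the standard Tanimoto coefficient $a \cdot b / (\|a\|^2 + \|b\|^2 - a \cdot b)$ and the overlap degree of two staying areas. Because the formula images are not reproduced in this text, the way the weights enter (multiplying each term of the dot product) is an assumption, as are the planar coordinates, the circle-based intersection test, and the names `tanimoto` and `overlap_degree`.

```python
import math


def tanimoto(a, b, w=None):
    """Tanimoto coefficient of two frequency vectors of equal length.

    w: optional per-component weights (assumption: each weight scales the
    corresponding term of the dot product). Without w this is the standard
    coefficient.
    """
    if w is None:
        w = [1.0] * len(a)
    dot = sum(wi * x * y for wi, x, y in zip(w, a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def overlap_degree(area_a, area_b):
    """Share of the smaller area's stay points that lie in the intersection.

    Each area is a dict with "points", "center" and "radius" (planar units).
    A point of the smaller area is in the intersection if it also lies
    inside the other area's circle.
    """
    small, other = sorted((area_a, area_b), key=lambda ar: len(ar["points"]))
    inside = sum(1 for p in small["points"]
                 if math.dist(p, other["center"]) <= other["radius"])
    return inside / len(small["points"])
```

Identical vectors give a coefficient of 1 and vectors with no shared locations give 0; unlike cosine similarity, vectors pointing in the same direction but with different visit counts score below 1.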
Second, semantic location similarity. Let the semantic information in a staying area be $sem = (\langle c_1, f_1 \rangle, \langle c_2, f_2 \rangle, \ldots, \langle c_n, f_n \rangle)$, $n \ge 1$, where $f_i$ is the probability of $c_i$. When comparing whether the semantic information in two staying areas is the same, as when judging whether two geographic positions are the same, the degree of similarity between the two must also be considered. Because $sem$ contains the probability distribution $(f_1, f_2, \ldots, f_n)$ of the semantic information, the KL distance is used to measure the distance between two such distributions.
In probability theory and information theory, the KL distance (Kullback-Leibler divergence) measures the difference between two probability distributions over the same event space. Let the probability distributions of the semantic information sets of person A and person B in a staying area be $fa(x)$ and $fb(x)$ respectively; the KL distance between $fa(x)$ and $fb(x)$ is expressed as:
Figure BDA0001415349190000044
The KL distance is not symmetric, i.e. $D_{KL}(fa \| fb) \ne D_{KL}(fb \| fa)$, so it is not a true metric or distance. The JS distance is a symmetrized improvement of the KL distance, and its value lies in the closed interval $[0, 1]$. The formula is as follows:
Figure BDA0001415349190000045
If
Figure BDA0001415349190000046
where δ is the distance threshold for the semantic information of the two areas and semA and semB are the semantic information sets of all staying areas of person A and person B respectively, then the semantic information of the two areas is similar.
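The KL and JS distances can be sketched with their standard definitions. Base-2 logarithms are used so that the JS value falls in $[0, 1]$, as stated above. The helper `semantics_similar`, which treats two areas as similar when the JS distance does not exceed δ, reflects an assumed reading of the threshold condition, since that formula image is not reproduced in this text; all function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits. Terms with p_i == 0 contribute 0.

    Requires q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and within [0, 1] for base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture of p and q
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def semantics_similar(p, q, delta):
    """Assumed threshold rule: similar when the JS distance is at most delta."""
    return js_divergence(p, q) <= delta
```

The mixture `m` is positive wherever either input is positive, so the JS value is always defined even when the two distributions share no category.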
The semantic location similarity of two persons is calculated in the same way as the geographic location similarity, using the Tanimoto coefficient. The formula is as follows:
Figure BDA0001415349190000051
Here sa and sb are the semantic information frequency vectors of the two persons, and w is the vector formed from the JS distances described above.
With the geographic location similarity and the semantic location similarity in hand, the similarity of two persons is defined as their weighted sum:
$sim(A, B) = \alpha \cdot sim_{loc}(A, B) + (1 - \alpha) \cdot sim_{sem}(A, B)$ (6)
where $\alpha$ is a value in the interval $[0, 1]$; it sets the weight of the geographic location similarity and thereby, through $1 - \alpha$, the weight of the semantic information.
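Formula (6) translates directly into code. The sketch below uses the hypothetical name `combined_similarity`; the input values in the usage note are made up for illustration and are not results from the patent.

```python
def combined_similarity(sim_loc, sim_sem, alpha):
    """Weighted sum of geographic and semantic similarity, as in formula (6)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_loc + (1.0 - alpha) * sim_sem
```

For example, with α = 0.3, two persons whose staying areas do not overlap at all (`sim_loc = 0.0`) but whose semantic similarity is 0.4 still receive an overall similarity of about 0.28, because semantic information carries a weight of 0.7.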
Step 3.2: group clustering
Clustering based on shared nearest neighbors (SNN) is adopted. Its key concept is SNN similarity, the number of items that the k-nearest-neighbor lists of two objects have in common. Because of this property, SNN clustering handles noise and outliers well, copes with clusters of different sizes, shapes, and densities, and is especially good at finding tight clusters of strongly related objects.
Group clustering proceeds in three steps. First, an SNN proximity matrix is constructed from the personnel feature data and the similarity measure. Second, an SNN similarity graph is built from the proximity matrix. Third, all connected components of the similarity graph are found; each connected component is a cluster, clusters with only one point are removed, and each remaining cluster is a group. With a reasonable number of nearest neighbors k and SNN similarity threshold γ, key groups of closely related persons are found effectively.
In the proximity-matrix construction algorithm, the k nearest neighbors of each person are computed first, and then the SNN similarity between persons. If the number of nearest neighbors shared by two persons exceeds the threshold eps, the two persons are connected in the graph, so the entry of the proximity matrix corresponding to their two IDs is set to 1. This is repeated for all persons until the SNN proximity matrix is complete. The graph is then built from the SNN proximity matrix and represented as an array of adjacency lists; the adjacency list is a common storage structure for graphs, in which all vertices adjacent to a vertex are stored in the linked list pointed to by that vertex's array element. In the graph construction algorithm, adding an edge between v1 and v2 means adding v1 to the adjacency list of v2 and v2 to the adjacency list of v1; once all edges have been added, the graph is fully constructed. Finally, the connected-component algorithm finds all connected components of the graph. Depth-first search is used because its time and space costs are proportional to V + E, and after this preprocessing connectivity queries on the graph are answered in constant time.
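The three clustering steps can be sketched as follows. This is a compact illustration rather than the patented implementation: it takes a precomputed similarity matrix, builds the adjacency lists directly instead of materializing the 0/1 proximity matrix, and uses an iterative depth-first search. The name `snn_clusters` is hypothetical, and `gamma` stands for the SNN similarity threshold.

```python
def snn_clusters(sim, k, gamma):
    """Shared-nearest-neighbour clustering on a similarity matrix.

    sim: n x n matrix (list of lists), larger values mean more similar.
    k: number of nearest neighbours; gamma: SNN similarity threshold.
    Returns the clusters (sorted lists of indices) with more than one member.
    """
    n = len(sim)
    # Step 1: k nearest neighbours of each person (the k most similar others).
    knn = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        knn.append(set(others[:k]))
    # Step 2: SNN graph as adjacency lists; connect two persons when the
    # number of shared neighbours exceeds gamma.
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if len(knn[i] & knn[j]) > gamma:
                adj[i].append(j)
                adj[j].append(i)
    # Step 3: connected components by depth-first search; drop singletons.
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for u in adj[v]:
                if not seen[u]:
                    seen[u] = True
                    stack.append(u)
        if len(comp) > 1:
            clusters.append(sorted(comp))
    return clusters
```

Persons left out of every returned cluster correspond to the noise points that the method discards.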
Drawings
FIG. 1: a system flow diagram.
FIG. 2: and (4) a person track graph.
FIG. 3: a personnel stay spot diagram.
FIG. 4: and (4) a personnel stay area diagram.
FIG. 5: and (4) a personnel semantic information graph.
FIG. 6: stop position map of people 000 and 003.
FIG. 7: stop position maps of persons 007 and 036.
FIG. 8: stop point location maps for people 006 and 023.
FIG. 9: and (4) a population clustering flow chart.
FIG. 10: silhouette index (contour coefficient) is plotted against the change in k-value.
FIG. 11: dunn index (Dunn index) is plotted as a function of k.
FIG. 12: and (3) a similarity matrix diagram when k is 12 or k is 13.
FIG. 13: the number of clusters is plotted as a function of k value.
FIG. 14: and k is a similarity matrix diagram at 15.
FIG. 15: and (3) a similarity matrix diagram when k is 19.
FIG. 16: and (3) a similarity matrix diagram when k is 21.
Detailed Description
The invention is explained and illustrated below with reference to the accompanying drawings:
The data set used is Geolife, an open project of Microsoft, which collected the GPS trajectory data of 182 volunteers over five years (April 2007 to August 2012). The data set contains 17621 trajectories with a total distance of 1292951 kilometers and a total duration of 50176 hours. Each trajectory records timestamps, latitude and longitude, and altitude. The trajectories were collected by different GPS devices; 91.5% of them are densely sampled, with one point every 5 seconds or every 5 to 10 meters. The data set records a wide range of outdoor activities, including not only daily routines such as going home and going to work but also entertainment and sports such as shopping, sightseeing, dining, hiking, and cycling. Although the data are spread over 30 cities in China, and even some cities in the United States and Europe, most of them lie in Haidian District, Beijing.
The POI data set was collected from Gaode Map (AutoNavi) and contains 156500 objects in Haidian District, Beijing; each object has a name, address, category, and coordinates in three coordinate systems.
Step one: Extract the staying areas and the frequency with which persons visit each staying area from the trajectory data of the persons.
FIG. 2 shows the movement trajectory of one student over several days; each color marks the trajectory of one day. Stay points are extracted with the single-trajectory stay point extraction method, giving the stay points shown in FIG. 3. The SC algorithm is then applied to all of the person's stay points to divide them into staying areas; the result is shown in FIG. 4.
Step two: Based on the extracted staying-area information, further extract the semantic information of each area.
The semantic information of the student's staying areas is computed with the LDA topic model; part of it is shown in FIG. 5.
Step three: Perform group clustering with a data mining method according to the persons' behavior rules and feature similarity, and finally identify key special groups within the target group.
Step 3.1: similarity measure
Eight persons with clear trajectory characteristics are extracted from the data set, and the similarity between them is computed with formula (6) for α = 0.3, as shown in the following table:
Person similarities (α = 0.3)
Person  000      003      006      007      023      036      041      065
000     1.00000  0.53124  0.03872  0.02251  0.01670  0.09282  0.00858  0.01253
003     0.53124  1.00000  0.01927  0.02097  0.00663  0.04312  0.00531  0.00678
006     0.03872  0.01927  1.00000  0.01555  0.31987  0.04317  0.15063  0.11027
007     0.02251  0.02097  0.01555  1.00000  0.22439  0.31128  0.04005  0.04005
023     0.01670  0.00663  0.31987  0.22439  1.00000  0.01172  0.07134  0.04878
036     0.09282  0.04312  0.04317  0.31128  0.01172  1.00000  0.03154  0.04597
041     0.00858  0.00531  0.15063  0.04005  0.07134  0.03154  1.00000  0.15481
065     0.01253  0.00678  0.11027  0.04005  0.04878  0.04597  0.15481  1.00000
From the table, several highly similar pairs can be identified: 000 and 003, 006 and 023, and 007 and 036. The stay points of 000 and 003 are distributed as shown in FIG. 6. The geographic positions visited by the two persons clearly overlap substantially, so their geographic location similarity is high; and because semantic information is derived from the geographic areas, their semantic similarity is also high. Their overall similarity of 0.53 is therefore as expected. The stay points of 007 and 036 are distributed as shown in FIG. 7; as with 000 and 003, the high geographic similarity leads to high semantic similarity, so the overall similarity of the two persons is high. The stay points of 006 and 023 are distributed as shown in FIG. 8. Their staying areas hardly overlap, yet their similarity is 0.32. By the definition of similarity given earlier, this must come from high semantic similarity, and that is indeed the case: the areas where the two persons' stay points are dense lie in Beihang University and Minzu University of China respectively, and the POIs near both locations are mostly of the science, education, and culture service category. This shows that similar persons who cannot be found by geographic location similarity alone can be found by also considering semantic similarity.
Step 3.2: group clustering
The group clustering process is shown in FIG. 9. Taking the trajectory data of all 182 persons in Geolife as an example, the Dunn index and the Silhouette index are used as clustering evaluation criteria.
Dunn index:
Figure BDA0001415349190000081
where $C_i$ is the $i$-th cluster and $d(x, y)$ is the distance between $x$ and $y$. The index rewards a small intra-cluster distance and a large inter-cluster distance, so a larger value indicates a better clustering.
Silhouette index:
Figure BDA0001415349190000082
Wherein
Figure BDA0001415349190000083
$NC$ is the number of clusters and $n_i$ is the number of points in $C_i$. $a(x)$ is the average distance from object $x$ to all other objects in its own cluster. To obtain $b(x)$, the average distance from $x$ to all objects of a given cluster is computed for every cluster that does not contain $x$; $b(x)$ is the minimum of these averages.
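Both evaluation criteria can be sketched from their standard definitions, which are assumed here to match the formula images not reproduced in this text: the Dunn index is the smallest inter-cluster point distance divided by the largest intra-cluster point distance, and the Silhouette index averages $(b(x) - a(x)) / \max(a(x), b(x))$ first within each cluster and then over the $NC$ clusters. The function names are hypothetical, and points are planar tuples so that `math.dist` applies.

```python
import math
from itertools import combinations


def dunn_index(clusters, d=math.dist):
    """Smallest inter-cluster distance over largest intra-cluster distance.

    clusters: list of lists of points; needs at least two clusters and at
    least one cluster with two distinct points.
    """
    min_inter = min(d(x, y)
                    for ca, cb in combinations(clusters, 2)
                    for x in ca for y in cb)
    max_intra = max(d(x, y) for c in clusters for x, y in combinations(c, 2))
    return min_inter / max_intra


def silhouette_index(clusters, d=math.dist):
    """Mean over clusters of the mean silhouette value of their points.

    Every cluster needs at least two points and there must be at least two
    clusters, which matches the removal of single-point clusters.
    """
    total = 0.0
    for i, c in enumerate(clusters):
        s_c = 0.0
        for xi, x in enumerate(c):
            # a(x): average distance to the other members of x's own cluster.
            a = sum(d(x, y) for yi, y in enumerate(c) if yi != xi) / (len(c) - 1)
            # b(x): smallest average distance to the members of another cluster.
            b = min(sum(d(x, y) for y in o) / len(o)
                    for j, o in enumerate(clusters) if j != i)
            s_c += (b - a) / max(a, b)
        total += s_c / len(c)
    return total / len(clusters)
```

Both functions are undefined when fewer than two clusters exist, so parameter settings that produce such clusterings cannot be scored.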
FIG. 10 and FIG. 11 show how the Silhouette index and the Dunn index vary with the number of nearest neighbors k. The value of k ranges over [10, 30], the most representative interval; values of k for which the evaluation criteria cannot be computed are omitted; and the SNN similarity threshold γ is 10. As the figures show, both evaluation criteria reach their maximum when k is 12 or 13. The clustering results for these two values of k are the same: two clusters, one containing persons 032 and 044 and the other containing persons 151 and 162. The similarity matrix is shown in FIG. 12.
This indicates that two groups of highly similar persons were found among the 182 persons, each containing two persons, with all remaining persons treated as noise points. This is the best clustering result according to the evaluation criteria.
FIG. 13 shows how the number of clusters varies with k. For k = 15, k = 19, and k = 21, the similarity matrices of the resulting clusterings are shown in FIG. 14, FIG. 15, and FIG. 16, yielding 4, 11, and 8 clusters respectively. Although none of these three results is optimal under the evaluation criteria, they are still relatively reasonable. This shows that the method can produce relatively reasonable clusterings for different numbers of clusters and can therefore effectively identify highly similar groups within a large population.

Claims (2)

1. A group identification method based on personnel behavior rules and a data mining method, characterized in that the method comprises the following steps:
Step one: extracting the staying areas and the frequency with which persons visit each staying area from the trajectory data of the persons;
Step 1.1: extracting the stay points of a single trajectory; a stay point represents a geographic position where a person stays for a period of time, and each stay point extracted from a person's trajectory is associated with a real geographic position that reflects the person's activities to some extent; a single trajectory is defined as $T = (p_1, p_2, \ldots, p_n)$, where $p_i = (lat_i, lon_i, t_i)$, $0 \le i \le n$, $(lat_i, lon_i)$ are the latitude and longitude at location point $i$, and $t_i$ is the time at location point $i$;
given a segment of the trajectory sequence $t = (p_i, \ldots, p_{i+m})$, if $distance(p_i, p_x) \le \theta_d$ and $|t_i - t_x| \ge \theta_t$ for $i \le x \le i+m$, where $p_x$ is the $x$-th track point in the sequence, $m$ is an integer from 0 to $n-i$, and $\theta_d$ and $\theta_t$ are the geographic distance threshold and the time threshold respectively, then $p(lat, lon)$ is a stay point, where
Figure FDA0002219765930000011
Step 1.2: the number of the personnel in the frequently visited areas is more, and on the contrary, the number of the personnel in the less visited areas is less; the DBSCAN algorithm is applied to the position, the time complexity is high, the input parameters are more, therefore, a simple clustering algorithm (SC) is designed, the speed is high, only one input parameter, namely a distance threshold value tau is needed, each stop point is assigned to a cluster with the distance being less than tau through traversing each stop point, and if no cluster exists and the distance between the point and the cluster is less than tau, the point is used as a new cluster;
each cluster is a dwell area, noted
Figure FDA0002219765930000012
Figure FDA0002219765930000013
denotes all stay points in the staying area, lat and lon are the coordinates of the center of the staying area's point set, and r is the radius of the staying area;
Step two: based on the extracted staying-area information, further extracting the semantic information of each area;
Step 2.1: geographic position alone is sometimes not enough to judge the relationship between persons accurately, and the semantic information of the staying areas is also needed; a POI (Point of Information) describes the spatial and attribute information of a geographic entity; in most cases the semantic information of a staying area is not of a single kind, so the category information in the area cannot simply be reduced to one category; instead, multiple categories and their proportions are recorded as $sem = (\langle catg_1, freq_1 \rangle, \langle catg_2, freq_2 \rangle, \ldots, \langle catg_n, freq_n \rangle)$, $n \ge 1$; $sem$ denotes the semantic information in the staying area, and $\langle catg_1, freq_1 \rangle$ gives the category of the first item of semantic information and the frequency with which the person visits the geographic positions corresponding to it;
the semantic information in the staying areas is modeled with an LDA topic model, in which the POI information in a staying area is treated as a document, the semantic information of the area as the topics, and each POI as a word; to extract the semantic information of each staying area, the model is first trained with the POI information of all persons' staying areas as input data, and the trained model is then used to infer the semantic information of each staying area;
after the semantic information is extracted, the staying area is redefined as
Figure FDA0002219765930000021
Figure FDA0002219765930000022
Representative semantic information within a circle with r as the radius for the dwell region;
step 2.2: removing meaningless semantic information;
the semantic information set of the person A is (< residential area, 150>, < caf, 5>, < gym, 45>), the semantic information set of the person B is (< residential area, 200>, < scientific research institution, 59>, < concert hall, 3>), and two items in parentheses represent semantic position information and frequency of visiting the position; the 'residential area' item has a larger weight in the semantic information sets of the 'residential area' and the 'residential area' has no practical meaning or is an interference item in the aspect of comparing the similarity of the semantic information sets of the 'residential area' and the 'residential area', and the true similarity of the A and the B is very low after the interference item is removed;
generally, semantic information of a residential area is required to be the semantic information commonly owned by people, track semantic information of each person contains the semantic information, and the obvious characteristics of the semantic information are that the visiting frequency is high and the staying time period is fixed; the method for removing the meaningless semantic information comprises the following steps:
1) looping over each item of semantic information and judging from the area's semantic information whether the area may be a residential area; if so, going to 2), and if not, going to 4);
2) judging whether the distribution of the average staying time of all stay points in the staying area conforms to that of a residential area; if so, going to 3), and if not, going to 4);
3) deleting the semantic information from the semantic information set;
4) jumping out of the loop;
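The four steps above can be sketched as a simple filter. This is an illustrative reading, not the claimed implementation: the two predicates are hypothetical placeholders for checks the claim does not spell out, and step 4) is assumed here to mean that the current item is left in place and processing moves on to the next item.

```python
def remove_meaningless(semantic_set, looks_residential, has_fixed_stay_pattern):
    """Return the semantic items that remain after dropping residential-like ones.

    semantic_set: list of (label, visit_frequency) pairs.
    looks_residential, has_fixed_stay_pattern: caller-supplied predicates
    (hypothetical; how these checks are performed is not specified here).
    """
    kept = []
    for label, freq in semantic_set:
        # Steps 1) and 2): both checks must pass for the item to be removed.
        if looks_residential(label) and has_fixed_stay_pattern(label):
            continue  # Step 3): the item is deleted from the set.
        kept.append((label, freq))  # Step 4) (assumed): keep the item, go to the next one.
    return kept

person_a = [("residential area", 150), ("cafe", 5), ("gym", 45)]
filtered = remove_meaningless(
    person_a,
    looks_residential=lambda label: label == "residential area",
    has_fixed_stay_pattern=lambda label: True,  # assumed true for this example
)
```

With the sample data for person A, only the cafe and gym entries remain, so the residential item no longer dominates a later similarity comparison.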
step three: carrying out group clustering by using a data mining method according to the personnel behavior rules and the characteristic similarity, and finally identifying key special groups from target groups;
step 3.1: similarity measure
similarity is calculated from two aspects: geographic location similarity and semantic location similarity;
in a first aspect, geographic location similarity; the Tanimoto coefficient, an extension of cosine similarity, is adopted to compare the similarity of two persons; unlike cosine similarity, it takes into account the influence of frequency and vector length; given person A and person B, whose geographic location frequency vectors are la and lb respectively, it is expressed as:
Figure FDA0002219765930000023
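For illustration, the textbook Tanimoto (extended Jaccard) coefficient of two frequency vectors, T(a, b) = a·b / (|a|² + |b|² − a·b), can be sketched as follows. This is the standard definition and is assumed here; the function name and sample vectors are hypothetical.

```python
def tanimoto(a, b):
    """Standard Tanimoto coefficient of two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

same = tanimoto([1, 2], [1, 2])       # identical vectors
disjoint = tanimoto([1, 0], [0, 1])   # no location in common
scaled = tanimoto([2, 0], [1, 0])     # same direction, different visit frequency
```

The last case shows the difference from cosine similarity: the two vectors point in the same direction, so their cosine similarity would be 1, but the Tanimoto coefficient is lower because the visit frequencies differ.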
when judging whether two geographic positions are the same, because of the error of the positioning equipment, the positional relation of the two is judged according to the degree of overlap of the stay points in the two geographic areas; the degree of overlap, or similarity, of two staying areas is defined as the ratio of the number of stay points of the area containing fewer stay points that fall within the intersection of the two areas to the total number of stay points in that area; this similarity is then added as a weight to the Tanimoto coefficient to form a new weighted geographic location similarity measure; the formula is as follows:
Figure FDA0002219765930000031
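The overlap degree of two staying areas can be sketched as below, assuming each area is a circle given by a center, a radius and its stay points. The tuple layout and function names are hypothetical, and the sketch also assumes that every stay point already lies inside its own area, so membership in the intersection reduces to membership in the other circle.

```python
import math

def in_circle(point, center, radius):
    """True if the point lies inside or on the circle."""
    return math.dist(point, center) <= radius

def overlap_degree(region_a, region_b):
    """Share of the smaller area's stay points that fall in the intersection.

    Each region is (center, radius, stay_points).
    """
    small, large = sorted((region_a, region_b), key=lambda r: len(r[2]))
    small_points = small[2]
    large_center, large_radius, _ = large
    if not small_points:
        return 0.0
    inside = sum(1 for p in small_points if in_circle(p, large_center, large_radius))
    return inside / len(small_points)

region_a = ((0.0, 0.0), 1.0, [(-0.5, 0.0), (0.5, 0.0)])
region_b = ((1.0, 0.0), 1.0, [(1.0, 0.0), (0.6, 0.0), (1.4, 0.0)])
overlap = overlap_degree(region_a, region_b)
```

Here one of the two stay points of the smaller area lies inside the other circle, so the overlap degree is one half.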
in a second aspect, semantic location similarity; the semantic information in a given staying area is sem = (<c1, f1>, <c2, f2>, ..., <cn, fn>), n ≥ 1, where fi represents the probability of ci, with
Figure FDA0002219765930000032
comparing whether the semantic information in two staying areas is the same, like judging whether two geographic positions are the same, requires considering the degree of similarity of the two; sem contains a probability distribution (f1, f2, ..., fn) of the semantic information, so the KL distance is used to measure the distance between the two probability distributions;
in probability theory and information theory, the KL distance (Kullback-Leibler divergence) measures the difference between two probability distributions over the same event space; given that the probability distributions of the semantic information sets in a certain staying area of person A and person B are fa(x) and fb(x) respectively, the KL distance between fa(x) and fb(x) is expressed as:
$$D_{KL}(f_a \,\|\, f_b) = \sum_{x} f_a(x) \log \frac{f_a(x)}{f_b(x)}$$
the KL distance is not symmetric, i.e. D_KL(fa || fb) ≠ D_KL(fb || fa), so it is not a true metric or distance; the JS distance is a symmetric refinement of the KL distance whose value lies in the closed interval [0, 1]; the formula is as follows:
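$$JS(f_a \,\|\, f_b) = \frac{1}{2} D_{KL}\!\left(f_a \,\Big\|\, \frac{f_a + f_b}{2}\right) + \frac{1}{2} D_{KL}\!\left(f_b \,\Big\|\, \frac{f_a + f_b}{2}\right)$$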
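A minimal sketch of the two measures, using their standard definitions, is given below. Base-2 logarithms are assumed so that the JS value stays within [0, 1]; the function names and sample distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0.

    Assumes q_i > 0 wherever p_i > 0 (otherwise the divergence is unbounded).
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 with base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

fa = [0.9, 0.1]
fb = [0.5, 0.5]
forward = kl_divergence(fa, fb)
backward = kl_divergence(fb, fa)
js_ab = js_distance(fa, fb)
js_ba = js_distance(fb, fa)
```

In this sketch `forward` and `backward` differ, which illustrates the asymmetry of the KL distance, while `js_ab` and `js_ba` agree. The mixture `m` is positive wherever either input is, so the JS computation never divides by zero.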
if
Figure FDA0002219765930000035
holds, where δ is the distance threshold for the semantic information of the two and semA and semB are the semantic information sets of all the staying areas of person A and person B respectively, then the semantic information of the two areas is considered similar;
the semantic location similarity of two persons is calculated in the same way as the geographic location similarity, using the Tanimoto coefficient; the formula is as follows:
Figure FDA0002219765930000036
sa and sb are the semantic information frequency vectors of the two persons respectively, and w is the vector formed from the JS distances described above;
given the geographic location similarity and the semantic location similarity, the similarity of two persons is defined as the weighted sum of the two; the formula is as follows:
$$sim(A, B) = \alpha \cdot sim_{loc}(A, B) + (1 - \alpha) \cdot sim_{sem}(A, B) \tag{6}$$
where α is a value in the interval [0, 1] that determines the weight of the semantic information;
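Formula (6) is a plain convex combination, which can be sketched as below. The function name, the default weight and the sample similarity values are hypothetical.

```python
def combined_similarity(sim_loc, sim_sem, alpha=0.5):
    """Weighted sum of geographic and semantic similarity, as in formula (6)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_loc + (1 - alpha) * sim_sem

# alpha = 0.75 puts most weight on geographic similarity, leaving 0.25 for semantics.
score = combined_similarity(sim_loc=0.8, sim_sem=0.2, alpha=0.75)
```

Setting the weight to 1 ignores semantic information entirely, and setting it to 0 ignores geographic position.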
step 3.2: group clustering
clustering based on shared nearest neighbors (SNN) is adopted; its key concept is the SNN similarity, which is the number of common items among the k nearest neighbors of two objects; owing to this property, SNN clustering handles noise and outliers well, can handle clusters of different sizes, shapes and densities, and is especially good at finding compact clusters of strongly related objects;
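As a small illustration of the SNN similarity concept, the sketch below counts the common members of two persons' k-nearest-neighbor lists. It assumes a precomputed pairwise similarity matrix; the matrix values and function names are hypothetical.

```python
def k_nearest(sim, i, k):
    """Indices of the k persons most similar to person i (i itself excluded)."""
    others = [j for j in range(len(sim)) if j != i]
    others.sort(key=lambda j: sim[i][j], reverse=True)
    return set(others[:k])

def snn_similarity(sim, i, j, k):
    """Number of neighbors shared by the k-nearest-neighbor lists of i and j."""
    return len(k_nearest(sim, i, k) & k_nearest(sim, j, k))

# Hypothetical pairwise similarity matrix for four persons.
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
shared_01 = snn_similarity(sim, 0, 1, k=2)
```

With k = 2, persons 0 and 1 have neighbor lists {1, 2} and {0, 2}, so they share one neighbor (person 2).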
in the group clustering, clustering is carried out in three steps: first, an SNN proximity matrix is constructed from the personnel characteristic data and the similarity measurement formula; second, an SNN similarity graph is constructed from the proximity matrix; third, all connected branches of the similarity graph are found, each connected branch being a cluster, the clusters containing only one point are removed, and each remaining cluster is taken as a group; by setting a reasonable number of nearest neighbors k and SNN similarity threshold γ, the key groups of closely related persons in the crowd are effectively found.
2. The group identification method based on the personnel behavior rule and the data mining method as claimed in claim 1, wherein: in the proximity matrix construction algorithm, the k nearest neighbors of each person are first calculated, and then the SNN similarity between persons is calculated; if the number of shared nearest neighbors of two persons exceeds a threshold eps, the two persons are connected in the graph, so the positions corresponding to the IDs of the two persons in the proximity matrix are marked as 1; this is repeated for all persons until the SNN proximity matrix is complete; the graph is then constructed from the SNN proximity matrix and is represented by an array of adjacency lists, the adjacency list being a common storage structure for graphs in which all vertices adjacent to a given vertex are stored in a linked list pointed to by the array element of that vertex; in the SNN similarity graph construction algorithm, adding an edge between v1 and v2 means adding v1 to the adjacency list of v2 and v2 to the adjacency list of v1, and edges are added in this way until the whole graph is constructed; finally, the connected branch finding algorithm searches for all connected branches of the graph, using depth-first search to explore the connectivity of the graph, because the time and space used by depth-first search are proportional to V + E and connectivity queries on the graph are then answered in constant time.
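The pipeline described in this claim (k nearest neighbors, shared-neighbor threshold, adjacency lists, depth-first search for connected branches) can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the similarity matrix is precomputed, Python lists stand in for linked lists, edges are written straight into the adjacency lists rather than through a separate 0/1 proximity matrix, and all names and sample values are hypothetical.

```python
def snn_clusters(sim, k, threshold):
    """Cluster persons by shared nearest neighbors; singleton clusters are dropped."""
    n = len(sim)
    # k nearest neighbors of every person, ranked by similarity (self excluded).
    knn = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        knn.append(set(ranked[:k]))
    # Adjacency lists: connect two persons whose shared-neighbor count exceeds the threshold.
    adjacency = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if len(knn[i] & knn[j]) > threshold:
                adjacency[i].append(j)
                adjacency[j].append(i)
    # Iterative depth-first search: each connected branch is one cluster.
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adjacency[v]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        if len(component) > 1:  # remove clusters containing only one point
            clusters.append(sorted(component))
    return clusters

# Hypothetical data: three tight groups of four persons plus one loosely attached person (12).
groups = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3]
n = len(groups)
sim = [[0.9 if groups[i] == groups[j] else 0.1 for j in range(n)] for i in range(n)]
for j in (0, 4, 8):
    sim[12][j] = sim[j][12] = 0.2
clusters = snn_clusters(sim, k=3, threshold=1)
```

In this example person 12 has one nearest neighbor in each group, shares at most one neighbor with anyone else, and therefore ends up as a singleton that is discarded, leaving the three groups of four.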
CN201710862301.9A 2017-09-21 2017-09-21 Group identification method based on personnel behavior rule and data mining method Active CN107633067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710862301.9A CN107633067B (en) 2017-09-21 2017-09-21 Group identification method based on personnel behavior rule and data mining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710862301.9A CN107633067B (en) 2017-09-21 2017-09-21 Group identification method based on personnel behavior rule and data mining method

Publications (2)

Publication Number Publication Date
CN107633067A CN107633067A (en) 2018-01-26
CN107633067B true CN107633067B (en) 2020-03-27

Family

ID=61102253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710862301.9A Active CN107633067B (en) 2017-09-21 2017-09-21 Group identification method based on personnel behavior rule and data mining method

Country Status (1)

Country Link
CN (1) CN107633067B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764011B (en) * 2018-03-26 2021-05-18 青岛科技大学 Group identification method based on graphical interaction relation modeling
CN109005515B (en) * 2018-09-05 2020-07-24 武汉大学 User behavior mode portrait drawing method based on movement track information
CN109543876A (en) * 2018-10-17 2019-03-29 天津大学 A kind of visual analysis method of urban issues
CN109388684B (en) * 2018-10-23 2020-07-07 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN109743689B (en) * 2019-01-09 2020-11-17 南京航空航天大学 Indoor track staying area discovery method based on stability value
CN110008655B (en) * 2019-03-01 2020-11-17 北京数字融通科技有限公司 Infringement information identification system and method based on distributed network
CN110348133B (en) * 2019-07-15 2022-08-19 西南交通大学 System and method for constructing high-speed train three-dimensional product structure technical effect diagram
CN110457315A (en) * 2019-07-19 2019-11-15 国家计算机网络与信息安全管理中心 A kind of group's accumulation mode analysis method and system based on user trajectory data
CN110837512A (en) * 2019-11-15 2020-02-25 北京市商汤科技开发有限公司 Visitor information management method and device, electronic equipment and storage medium
CN110990455B (en) * 2019-11-29 2023-10-17 杭州数梦工场科技有限公司 Method and system for recognizing house property by big data
CN111460246B (en) * 2019-12-19 2020-12-08 南京柏跃软件有限公司 Real-time activity abnormal person discovery method based on data mining and density detection
CN111312406B (en) * 2020-03-15 2020-11-13 薪得付信息技术(山东)有限公司 Epidemic situation label data processing method and system
CN111428217B (en) * 2020-04-12 2023-07-28 中信银行股份有限公司 Fraudulent party identification method, apparatus, electronic device and computer readable storage medium
CN111639092B (en) * 2020-05-29 2023-09-26 京东城市(北京)数字科技有限公司 Personnel flow analysis method and device, electronic equipment and storage medium
CN111797291A (en) * 2020-06-02 2020-10-20 成都方未科技有限公司 Method, system and storage medium for social function mining by using trajectory data
CN111737387A (en) * 2020-06-11 2020-10-02 南京森根安全技术有限公司 Method and module for discovering specific personnel based on track similarity
CN111832304B (en) * 2020-06-29 2024-02-27 上海巧房信息科技有限公司 Weight checking method and device for building names, electronic equipment and storage medium
CN112765226A (en) * 2020-12-06 2021-05-07 复旦大学 Urban semantic map construction method based on trajectory data mining
CN112738725B (en) * 2020-12-18 2022-09-23 福建新大陆软件工程有限公司 Real-time identification method, device, equipment and medium for target crowd in semi-closed area
CN113792763B (en) * 2021-08-24 2022-08-12 中山大学 Social group behavior recognition method based on electromagnetic spectrum data mining, computer device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866429A (en) * 2010-06-01 2010-10-20 中国科学院计算技术研究所 Training method of multi-moving object action identification and multi-moving object action identification method
CN103413440A (en) * 2013-04-11 2013-11-27 江苏省邮电规划设计院有限责任公司 Fake-licensed vehicle identification method based on smart city data base and identification rule base
CN104750829A (en) * 2015-04-01 2015-07-01 华中科技大学 User position classifying method and system based on signing in features
CN105404890A (en) * 2015-10-13 2016-03-16 广西师范学院 Criminal gang discrimination method considering locus space-time meaning
CN106407519A (en) * 2016-08-31 2017-02-15 浙江大学 Modeling method for crowd moving rule

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7716226B2 (en) * 2005-09-27 2010-05-11 Patentratings, Llc Method and system for probabilistically quantifying and visualizing relevance between two or more citationally or contextually related data objects

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
移动对象子轨迹段分割与聚类算法;张延玲 等;《计算机工程与应用》;20091231;全文 *

Also Published As

Publication number Publication date
CN107633067A (en) 2018-01-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant