CN107633067B

CN107633067B - Group identification method based on personnel behavior rule and data mining method

Info

Publication number: CN107633067B
Application number: CN201710862301.9A
Authority: CN
Inventors: 丁治明; 司云飞; 才智; 曹阳; 迟远英
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-09-21
Filing date: 2017-09-21
Publication date: 2020-03-27
Anticipated expiration: 2037-09-21
Also published as: CN107633067A

Abstract

The invention discloses a group identification method based on personnel behavior rules and data mining. It belongs to the field of data mining and in particular relates to a method for identifying key groups in large-scale events based on personnel behavior rules. The method extracts staying areas, and the frequency with which each person visits each staying area, from the persons' trajectory data; it then extracts semantic information of each area from the staying area information to express user behavior more accurately; finally, it clusters groups with a data mining method according to the persons' behavior rules and feature similarity, and identifies key special groups from the target population.

Description

Group identification method based on personnel behavior rule and data mining method

Technical Field

The invention belongs to the field of data mining, and relates to a method for identifying key groups in large-scale activities based on personnel behavior rules.

Background

With growing economic activity and rising material and cultural living standards, large-scale events are held more and more frequently, which poses serious challenges for running them safely and preventing emergencies. The central problem in security work for large-scale events is how to identify special groups within the target population so that preventive measures can be taken in advance. Meanwhile, the rapid development of wireless communication technology has produced large amounts of moving-object data. These data depict the spatio-temporal dynamics of individuals and groups and contain behavioral information about the moving objects; analyzing the movement data of a target population helps reveal its behavior rules, group trends and so on.

In recent years, technologies such as satellite communication, GPS devices, RFID, wireless sensors, Internet of Things communication and video tracking have developed continuously and come into wide use, so that moving objects of all sizes can be accurately positioned and effectively tracked worldwide. With these technologies, signal receivers collect large amounts of moving-object data from positioning terminals. The data contain rich information such as position and time, and their volume grows ever larger and more complex over time. Moving-object data have also become a new avenue for data analysis: especially before a major event, studying the movement trajectories of relevant groups helps with group identification, understanding group movement and analyzing group behavior rules, so that targeted preventive work can be done for large-scale events.

This technique mines the data with a clustering method from data mining. Similar groups tend to share similar characteristics, so, based on the extracted personal feature data and a formula for the similarity between persons, a suitable clustering algorithm is selected to identify key special groups from the target population.

Disclosure of Invention

The invention provides a group identification method based on personnel behavior rules and data mining. Staying areas, and the frequency with which a person visits each staying area, are extracted from the person's trajectory data; semantic information of each area is then extracted from the staying area information to express user behavior more accurately; finally, groups are clustered with a data mining method that combines behavior rules and feature similarity, and key special groups are identified from the target population.

A group identification method based on a personnel behavior rule and a data mining method comprises the following steps:

Step one: extract the staying areas, and the frequency with which a person visits each staying area, from the person's trajectory data.

Step 1.1: Extract the stay points of a single trajectory. A stay point represents a geographic position where a person stays for a period of time. Each stay point extracted from a person's trajectory is associated with a real geographic position, and these positions reflect the person's activities to some extent. A single trajectory is defined as T = (p_1, p_2, …, p_n), where p_i = (lat_i, lon_i, t_i), 0 ≤ i ≤ n; (lat_i, lon_i) is the latitude and longitude of location point i, and t_i is the time at location point i.

Given a segment of the trajectory sequence t = (p_i, …, p_{i+m}), if distance(p_i, p_x) ≤ θ_d, |t_i − t_x| ≥ θ_t, i ≤ x ≤ i+m, where p_x is the x-th trajectory point in the sequence, m is an integer from 0 to n−i, and θ_d and θ_t are a geographic distance threshold and a time threshold respectively, then p(lat, lon) is the stay point.
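The stay point rule above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, distances use the haversine formula, the time condition is read as "the segment spans at least θ_t", and the stay point coordinates are assumed to be the mean of the segment's points, since the text here does not reproduce the formula for lat and lon.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]  # (lat, lon, t) with t in seconds

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def stay_points(track: List[Point], theta_d: float, theta_t: float) -> List[Tuple[float, float]]:
    """Extract (lat, lon) stay points from a single trajectory."""
    result = []
    i, n = 0, len(track)
    while i < n:
        j = i + 1
        # Grow the segment while points stay within theta_d of the anchor p_i.
        while j < n and haversine_m(track[i][0], track[i][1],
                                    track[j][0], track[j][1]) <= theta_d:
            j += 1
        # The segment is track[i:j]; keep it if it spans at least theta_t seconds.
        if track[j - 1][2] - track[i][2] >= theta_t:
            seg = track[i:j]
            # Assumption: the stay point is the mean position of the segment.
            lat = sum(p[0] for p in seg) / len(seg)
            lon = sum(p[1] for p in seg) / len(seg)
            result.append((lat, lon))
            i = j
        else:
            i += 1
    return result
```

For example, with θ_d = 200 m and θ_t = 900 s, three fixes recorded at the same place over 20 minutes would yield one stay point, while a single fix taken in passing would yield none.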

Step 1.2: A person has more stay points in frequently visited areas and fewer in rarely visited areas. Applying the DBSCAN algorithm here would incur high time complexity and require several input parameters, so a simple clustering algorithm (SC) is designed instead. It is fast and needs only one input parameter, a distance threshold τ. The algorithm traverses the stay points and assigns each one to a cluster whose distance from the point is less than τ; if no cluster lies within τ of the point, the point becomes a new cluster.

Each cluster is a dwell area, noted

For all points in the dwell area, lat and lon are the center points of the set of dwell area points, and r is the radius of the dwell area.
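A minimal Python sketch of the SC algorithm described in Step 1.2 follows. Several details are assumptions made for illustration: the distance from a point to a cluster is measured to the cluster's current centre, the nearest qualifying cluster is chosen, the radius r is the largest centre-to-member distance, and the dictionary keys and function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

LatLon = Tuple[float, float]

def dist_m(a: LatLon, b: LatLon) -> float:
    """Approximate distance in metres (equirectangular), adequate at city scale."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6371000.0 * math.hypot(x, y)

def centroid(points: List[LatLon]) -> LatLon:
    """Mean latitude and longitude of a set of points."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))

def simple_clustering(points: List[LatLon], tau: float) -> List[Dict]:
    """Single pass over the stay points with one parameter, the threshold tau."""
    clusters: List[List[LatLon]] = []
    for p in points:
        best, best_d = None, tau
        for c in clusters:
            d = dist_m(p, centroid(c))
            if d < best_d:  # strictly less than tau, nearest cluster wins
                best, best_d = c, d
        if best is None:
            clusters.append([p])  # no cluster within tau: start a new one
        else:
            best.append(p)
    areas = []
    for c in clusters:
        ctr = centroid(c)
        areas.append({
            "points": c,
            "lat": ctr[0],
            "lon": ctr[1],
            "r": max(dist_m(ctr, q) for q in c),  # assumed definition of the radius
            "visits": len(c),                     # number of stay points in the area
        })
    return areas
```

Because the result of a single pass depends on the order in which stay points are visited, a different traversal order can produce slightly different staying areas; that is the price of having only one parameter and one pass.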

Step two: further extract semantic information of each area based on the extracted staying area information.

Step 2.1: Geographic position information alone is sometimes insufficient to judge the relationship between persons accurately; semantic information about the staying area is also needed. A POI (Point of Information) describes the spatial and attribute information of a geographic entity, such as its name, address, category and coordinates, which greatly enhances the description of the actual geographic position and reflects user behavior to a certain extent. In many cases the semantic information of a staying area is not unique, so the category information in the area cannot simply be reduced to one category; instead, multiple categories and their proportions are recorded as sem = (<catg_1, freq_1>, <catg_2, freq_2>, …, <catg_n, freq_n>), n ≥ 1. Here sem represents the semantic information in the staying area, and <catg_1, freq_1> represents the category of the first semantic information item and the frequency with which the person visits the geographic positions corresponding to that semantic.

The semantic information in a staying area is modeled with the LDA topic model: the POI information in a staying area is treated as a document, the semantic information of the staying area as a topic, and each POI as a word. The model is first trained with the POI information of all persons' staying areas as input data, and the trained model is then used to infer the semantic information of each staying area.
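To make the document/word analogy concrete, the sketch below builds one "document" per staying area from the POIs that fall inside its circle. It only prepares the input for a topic model and does not run LDA itself. The tuple layouts and function names are hypothetical, and using the POI category label as the word token is an assumption; the text says only that each POI is a word.

```python
import math
from collections import Counter
from typing import List, Tuple

Area = Tuple[float, float, float]   # (lat, lon, radius in metres)
Poi = Tuple[str, str, float, float]  # (name, category, lat, lon)

def dist_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Approximate distance in metres (equirectangular), adequate at city scale."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    x = math.radians(lon2 - lon1) * math.cos((p1 + p2) / 2)
    return 6371000.0 * math.hypot(x, p2 - p1)

def poi_documents(areas: List[Area], pois: List[Poi]) -> List[List[str]]:
    """One document per staying area: the tokens of all POIs within its radius."""
    docs = []
    for lat, lon, r in areas:
        # Assumption: the category label stands in for the POI as the "word".
        docs.append([cat for _, cat, plat, plon in pois
                     if dist_m(lat, lon, plat, plon) <= r])
    return docs

def bag_of_words(doc: List[str]) -> Counter:
    """Token counts for one document, the usual input format for LDA."""
    return Counter(doc)
```

The resulting counts could then be passed to any LDA implementation for training on all staying areas and for inference on each area.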

Redefining dwell regions after extraction of semantic information to

Representative semantic information within a circle with r as the radius for the dwell region.

Step 2.2: Remove meaningless semantic information.

The semantic information set of person A is (<residential area, 150>, <café, 5>, <gym, 45>) and that of person B is (<residential area, 200>, <research institution, 59>, <concert hall, 3>). The two items in each pair of angle brackets are the semantic position information (for simplicity, a single type of semantic information represents the area's semantics) and the frequency of visits to that position. In this example the "residential area" item carries a large weight in both sets, yet it has no practical meaning for comparing the similarity of the two sets and even acts as an interference term; once it is removed, the true similarity of A and B is very low.

In general, "residential area" is semantic information that everyone shares: each person's trajectory semantics contain it, and its distinguishing features are a high visit frequency and a fixed staying time period. The method for removing meaningless semantic information is as follows:

1) Loop over each semantic information item and judge from the area's semantic information whether the area may be a residential area; if so, go to 2), otherwise go to 4);

2) Judge whether the distribution of the average staying time of all stay points in the staying area is consistent with a residential area; if so, go to 3), otherwise go to 4);

3) Delete the semantic information item from the semantic information set;

4) Exit the loop.
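The four steps above can be sketched as a filter over a person's semantic information set. The two predicates are hypothetical placeholders supplied by the caller, because the text does not specify how a residential category or a residential staying-time pattern is recognized. The sketch also reads step 4 as "move on to the next item", which is an interpretation.

```python
from typing import Callable, List, Tuple

SemItem = Tuple[str, int]  # (semantic category, visit frequency)

def remove_meaningless(
    items: List[SemItem],
    is_residential_category: Callable[[str], bool],
    has_residential_dwell_pattern: Callable[[str], bool],
) -> List[SemItem]:
    """Drop items that look residential by category AND by staying-time pattern."""
    kept = []
    for category, freq in items:
        # Steps 1 and 2: both checks must pass before the item is deleted.
        if is_residential_category(category) and has_residential_dwell_pattern(category):
            continue  # Step 3: delete the item from the set
        kept.append((category, freq))
    return kept
```

Applied to person A from the example, with predicates that flag only "residential area", the result keeps the café and gym items and drops the residential item.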

Step three: cluster the groups with a data mining method according to the persons' behavior rules and feature similarity, and finally identify key special groups from the target population.

Step 3.1: similarity measure

Similarity is calculated from two aspects: geographic position similarity and semantic position similarity.

First, geographic position similarity. The extended Tanimoto coefficient, an extension of cosine similarity, is used to compare the similarity of two persons; unlike cosine similarity, it takes the influence of frequency and vector length into account. Given person A and person B, their geographic position frequency vectors are la and lb respectively, and the similarity is expressed as:

When judging whether two geographic positions are the same, positioning-device error means the relationship between the two positions must be judged from the degree of overlap of the stay points in the two areas. The degree of overlap, or similarity, of two staying areas is defined as follows: take the area containing fewer stay points; the overlap is the ratio of the number of its stay points that lie in the intersection of the two areas to the number of all its stay points. This similarity is then added to the Tanimoto coefficient as a weight, forming a new weighted geographic position similarity measure. The formula is as follows:
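The weighted formula is not reproduced in this text, so the sketch below does not attempt it. It shows only the two ingredients the paragraph names: the standard extended Tanimoto coefficient, a·b / (|a|² + |b|² − a·b), and the overlap degree of two circular staying areas as defined above. Planar coordinates in metres, the tuple layout and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

XY = Tuple[float, float]
Region = Tuple[XY, float, List[XY]]  # (centre, radius, stay points), planar metres

def tanimoto(a: Sequence[float], b: Sequence[float]) -> float:
    """Extended Tanimoto coefficient of two frequency vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

def overlap_degree(ra: Region, rb: Region) -> float:
    """Share of the smaller region's stay points that lie inside both circles."""
    small = ra if len(ra[2]) <= len(rb[2]) else rb
    if not small[2]:
        return 0.0
    (ca, rad_a, _), (cb, rad_b, _) = ra, rb
    inside = [p for p in small[2]
              if math.dist(p, ca) <= rad_a and math.dist(p, cb) <= rad_b]
    return len(inside) / len(small[2])
```

Unlike cosine similarity, the Tanimoto coefficient is sensitive to vector length: two vectors pointing in the same direction but with different visit counts score below 1.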

Second, semantic position similarity. The semantic information in a staying area is given as sem = (<c_1, f_1>, <c_2, f_2>, …, <c_n, f_n>), n ≥ 1, where f_i represents the probability of c_i. When comparing whether the semantic information in two staying areas is the same, as when judging whether two geographic positions are the same, the degree of similarity of the two must also be considered. Since sem contains a probability distribution (f_1, f_2, …, f_n) of semantic information, the KL distance is used to measure the distance between two probability distributions.

In probability theory and information theory, the KL distance (Kullback-Leibler divergence) measures the difference between two probability distributions over the same event space. Let fa(x) and fb(x) be the probability distributions of the semantic information sets of person A and person B in a given staying area; the KL distance between fa(x) and fb(x) is expressed as:

The KL distance is not symmetric, i.e. D_KL(fa||fb) ≠ D_KL(fb||fa), so it is not a true metric or distance. The JS distance is a symmetric improvement of the KL distance and is defined on the closed interval [0, 1]. The formula is as follows:

If

where δ is the distance threshold for the semantic information of the two areas and semA and semB are the semantic information sets of all staying areas of person A and person B respectively, then the semantic information of the two areas is similar.
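The KL and JS quantities discussed above can be illustrated with their standard textbook definitions. In this sketch logarithms are taken to base 2 so that the JS value falls in [0, 1], and the comparison with δ is assumed to be "JS value not greater than δ", since the condition itself is not reproduced in the text; the function names are hypothetical.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) in bits; q must be positive wherever p is positive."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, and within [0, 1] with base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def semantics_similar(p: Sequence[float], q: Sequence[float], delta: float) -> bool:
    """Assumed reading of the threshold test: similar when JS <= delta."""
    return js_divergence(p, q) <= delta
```

Identical distributions give a JS value of 0, and distributions with disjoint support give 1, which is why a single threshold δ in [0, 1] is enough to decide whether two areas are semantically similar.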

The semantic position similarity of two persons is calculated in the same way as the geographic position similarity, using the Tanimoto coefficient. The formula is as follows:

sa and sb are the two-person semantic information frequency vectors, respectively, and w is the vector formed by the JS distance mentioned above.

Given the geographic position similarity and the semantic position similarity, the similarity of two persons is defined as the weighted sum of the two:

sim(A, B) = α·sim_loc(A, B) + (1 − α)·sim_sem(A, B) (6)

where α is a value in the interval [0, 1] that determines the weight of the semantic information.
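Formula (6) is a simple convex combination, shown here as a short Python function. The function and parameter names are hypothetical; the default α = 0.3 matches the value used in the embodiment.

```python
def combined_similarity(sim_loc: float, sim_sem: float, alpha: float = 0.3) -> float:
    """Formula (6): weighted sum of geographic and semantic similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # alpha weights the geographic term; (1 - alpha) weights the semantic term.
    return alpha * sim_loc + (1.0 - alpha) * sim_sem
```

With α = 0.3 the semantic term contributes 70% of the score, so two persons whose staying areas do not overlap geographically can still reach a noticeable similarity if their semantic profiles match.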

Step 3.2: group clustering

Clustering based on shared nearest neighbors (SNN) is adopted. Its central concept is SNN similarity, the number of items that the k-nearest-neighbor lists of two objects have in common. Because of this property, SNN clustering handles noise and outliers well and can handle clusters of different sizes, shapes and densities; it is especially good at finding tight clusters of strongly related objects.

Group clustering is carried out in three steps. First, an SNN proximity matrix is constructed from the persons' feature data and the similarity measure. Second, an SNN similarity graph is constructed from the proximity matrix. Third, all connected components of the similarity graph are found; each connected component is a cluster, clusters with only one point are removed, and each remaining cluster is a group. By setting a reasonable number of nearest neighbors k and SNN similarity threshold γ, closely related key groups are effectively found among the persons.

The proximity matrix construction algorithm first computes the k nearest neighbors of each person and then the SNN similarity between persons. If the number of nearest neighbors shared by two persons exceeds a threshold eps, the two persons are connected in the graph, so the entry of the proximity matrix corresponding to their IDs is set to 1; this is repeated for all persons until the SNN proximity matrix is complete. The graph is then built from the SNN proximity matrix and represented as an array of adjacency lists. The adjacency list is a common storage structure for graphs: the adjacent vertices of each vertex are stored in a linked list pointed to by the array element for that vertex. In the SNN similarity graph construction algorithm, adding an edge between v1 and v2 means adding v1 to the adjacency list of v2 and v2 to the adjacency list of v1; once all edges are added, the graph is fully constructed. Finally, the connected component algorithm finds all connected components of the graph using depth-first search, because the time and space required by depth-first search are proportional to V + E, and connectivity queries on the graph can then be answered in constant time.
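The three clustering steps can be sketched in Python as below. This is an illustrative reading of the description, assuming a precomputed person-to-person similarity matrix as input; the function name is hypothetical, Python lists stand in for linked lists, and the depth-first search is written iteratively to avoid recursion limits.

```python
from typing import Dict, List, Set

def snn_clusters(sim: List[List[float]], k: int, eps: int) -> List[List[int]]:
    """Shared-nearest-neighbour clustering over a similarity matrix."""
    n = len(sim)
    # Step 1a: k nearest neighbours of each person (highest similarity, self excluded).
    knn: List[Set[int]] = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sim[i][j], reverse=True)
        knn.append(set(order[:k]))
    # Step 1b: SNN proximity matrix, 1 where the shared neighbour count exceeds eps.
    prox = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if len(knn[i] & knn[j]) > eps:
                prox[i][j] = prox[j][i] = 1
    # Step 2: SNN similarity graph as adjacency lists.
    adj: Dict[int, List[int]] = {i: [j for j in range(n) if prox[i][j]] for i in range(n)}
    # Step 3: connected components by depth-first search; singletons are dropped.
    seen: Set[int] = set()
    clusters: List[List[int]] = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if len(comp) > 1:
            clusters.append(sorted(comp))
    return clusters
```

Persons who end up in no returned cluster are the noise points; raising eps or lowering k makes the graph sparser and the surviving groups tighter.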

Drawings

FIG. 1: System flow diagram.

FIG. 2: Person trajectory diagram.

FIG. 3: Person stay point diagram.

FIG. 4: Person staying area diagram.

FIG. 5: Person semantic information diagram.

FIG. 6: Stay point positions of persons 000 and 003.

FIG. 7: Stay point positions of persons 007 and 036.

FIG. 8: Stay point positions of persons 006 and 023.

FIG. 9: Group clustering flow chart.

FIG. 10: Silhouette index as a function of k.

FIG. 11: Dunn index as a function of k.

FIG. 12: Similarity matrix for k = 12 or k = 13.

FIG. 13: Number of clusters as a function of k.

FIG. 14: Similarity matrix for k = 15.

FIG. 15: Similarity matrix for k = 19.

FIG. 16: Similarity matrix for k = 21.

Detailed Description

The invention is explained and illustrated below with reference to the accompanying drawings:

The data set used by the invention is Geolife, an open-source project of Microsoft that collected GPS trajectory data from 182 volunteers over five years (April 2007 to August 2012). The data set contains 17621 trajectories with a total distance of 1292951 kilometers and a total duration of 50176 hours. Each trajectory records timestamps, latitude and longitude, and altitude. The trajectories were recorded by different GPS devices; 91.5% of them are densely sampled, with one point every 5 seconds or every 5-10 meters. The data set records a wide range of outdoor activities, covering not only routines such as going home and going to work but also entertainment and sports activities such as shopping, sightseeing, dining, hiking and cycling. Although the data are spread over more than 30 cities in China and some cities in the United States and Europe, most of the data lie in Haidian District, Beijing.

The POI data set was collected from Amap (Gaode Map) and contains 156500 objects in Haidian District, Beijing; each object has a name, address, category, and coordinates in three coordinate systems.

FIG. 2 shows the movement trajectories of a student over several days; each color marks the person's trajectory for one day. Stay points are extracted with the single-trajectory stay point extraction method, and the extracted stay points are shown in FIG. 3. The SC algorithm is then applied to all of the person's stay points to divide them into staying areas; the result is shown in FIG. 4.

The semantic information of the student's staying areas is computed with the LDA topic model; part of it is shown in FIG. 5.

Step 3.1: similarity measure

Eight persons with clear trajectory characteristics are selected from the data set, and the similarity between them is computed with formula (6) for α = 0.3, as shown in the following table:

Person similarities (α = 0.3)

	000	003	006	007	023	036	041	065
000	1.00000	0.53124	0.03872	0.02251	0.01670	0.09282	0.00858	0.01253
003	0.53124	1.00000	0.01927	0.02097	0.00663	0.04312	0.00531	0.00678
006	0.03872	0.01927	1.00000	0.01555	0.31987	0.04317	0.15063	0.11027
007	0.02251	0.02097	0.01555	1.00000	0.22439	0.31128	0.04005	0.04005
023	0.01670	0.00663	0.31987	0.22439	1.00000	0.01172	0.07134	0.04878
036	0.09282	0.04312	0.04317	0.31128	0.01172	1.00000	0.03154	0.04597
041	0.00858	0.00531	0.15063	0.04005	0.07134	0.03154	1.00000	0.15481
065	0.01253	0.00678	0.11027	0.04005	0.04878	0.04597	0.15481	1.00000

Several pairs of highly similar persons can be found in the table above: 000 and 003, 006 and 023, and 007 and 036. The stay points of 000 and 003 are distributed as shown in FIG. 6. The geographic positions visited by the two persons clearly overlap heavily, so their geographic position similarity is high; because semantic information is derived from the geographic areas, their semantic similarity is also high, and the overall similarity of 0.53 is as expected. The stay points of 007 and 036 are distributed as shown in FIG. 7; as with 000 and 003, the geographic positions are highly similar, which in turn makes the semantic information similar, so the overall similarity of the two persons is high. The stay points of 006 and 023 are distributed as shown in FIG. 8. The staying areas of the two persons hardly overlap, yet their similarity is 0.32. By the definition of similarity above, this means their semantic similarity is high. This is indeed the case: the dense stay-point areas of the two persons lie in Beihang University and Minzu University of China respectively, and the most common POIs near both places belong to the science, education and culture service category. This shows that considering semantic similarity can find similar persons who cannot be found by geographic position similarity alone.

Step 3.2: group clustering

The group clustering process is shown in FIG. 9. Taking the trajectory data of all 181 persons in Geolife as an example, the Dunn index and the Silhouette index are used as clustering evaluation criteria.

Dunn index:

Wherein C is_iTo representThe ith cluster, d (x, y), represents the distance between x and y. It minimizes the intra-cluster distance while maximizing the inter-cluster distance, so that a larger value thereof indicates a better clustering effect.

Sihouette index：

Wherein

NC is the number of clusters, n_iIs C_iThe number of points in (1). a (x) represents the average distance of object x to all other objects in the cluster in which it is located, b (x) represents the minimum found for all clusters for object x and any cluster that does not contain the object, the average distance of the object to all objects in a given cluster being calculated.
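Both indices can be computed from a pairwise distance matrix, as in the sketch below. Since the formulas are not reproduced in this text, the code assumes the commonly used definitions: the Dunn index as the smallest distance between points of different clusters divided by the largest cluster diameter, and the Silhouette index as the average over clusters of the per-cluster mean of (b(x) − a(x)) / max(a(x), b(x)). Function names are hypothetical, and at least two clusters are required.

```python
from typing import List

def dunn_index(dist: List[List[float]], clusters: List[List[int]]) -> float:
    """Minimum inter-cluster distance divided by maximum cluster diameter."""
    inter = min(dist[x][y]
                for a in range(len(clusters))
                for b in range(a + 1, len(clusters))
                for x in clusters[a] for y in clusters[b])
    diam = max(dist[x][y] for c in clusters for x in c for y in c)
    return inter / diam

def silhouette_index(dist: List[List[float]], clusters: List[List[int]]) -> float:
    """Average over clusters of the mean silhouette value of their members."""
    total = 0.0
    for ci, c in enumerate(clusters):
        s_c = 0.0
        for x in c:
            others = [y for y in c if y != x]
            # a(x): mean distance to the other members of x's own cluster.
            a = sum(dist[x][y] for y in others) / len(others) if others else 0.0
            # b(x): smallest mean distance from x to any other cluster.
            b = min(sum(dist[x][y] for y in o) / len(o)
                    for oi, o in enumerate(clusters) if oi != ci)
            s_c += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
        total += s_c / len(c)
    return total / len(clusters)
```

When the input is a similarity matrix such as the one produced by formula (6), one possible conversion is d = 1 − sim; that conversion is also an assumption rather than something stated in the text.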

FIG. 10 and FIG. 11 show how the Silhouette index and the Dunn index vary with the number of nearest neighbors k. k is taken from the interval [10, 30], the most representative range; values of k for which the evaluation indices are not valid are omitted, and the SNN similarity threshold γ is 10. The figures show that both evaluation criteria reach their maximum when k is 12 or 13. The two values of k yield the same clustering result: two clusters, containing persons 032 and 044 and persons 151 and 162 respectively. The similarity matrix is shown in FIG. 12.

This indicates that two groups of highly similar persons were found among the 182 persons, each containing two persons, with all remaining persons treated as noise points. This is the best clustering result according to the evaluation criteria.

FIG. 13 shows how the number of clusters varies with k. For k = 15, k = 19 and k = 21, the similarity matrices of the clustering results are shown in FIG. 14, FIG. 15 and FIG. 16, yielding 4, 11 and 8 clusters respectively. Although none of the three results is optimal under the evaluation criteria, they are still relatively reasonable, which indicates that the method can produce reasonable clustering results for different numbers of clusters and can therefore effectively identify highly similar groups within a large population.

Claims

1. A group identification method based on a personnel behavior rule and a data mining method, characterized in that the method comprises the following steps:

step one: extracting the staying areas, and the frequency with which a person visits each staying area, from the person's trajectory data;

step 1.1: extracting the stay points of a single trajectory; a stay point represents a geographic position where the person stays for a period of time, and each stay point extracted from the person's trajectory is associated with a real geographic position, which reflects the person's activities to some extent; a single trajectory is defined as T = (p_1, p_2, …, p_n), where p_i = (lat_i, lon_i, t_i), 0 ≤ i ≤ n, (lat_i, lon_i) is the latitude and longitude of location point i, and t_i is the time at location point i;

given a segment of the trajectory sequence t = (p_i, …, p_{i+m}), if distance(p_i, p_x) ≤ θ_d, |t_i − t_x| ≥ θ_t, i ≤ x ≤ i+m, where p_x is the x-th trajectory point in the sequence, m is an integer from 0 to n−i, and θ_d and θ_t are a geographic distance threshold and a time threshold respectively, then p(lat, lon) is the stay point;

step 1.2: a person has more stay points in frequently visited areas and fewer in rarely visited areas; applying the DBSCAN algorithm here would incur high time complexity and require several input parameters, so a simple clustering algorithm (SC) is designed, which is fast and needs only one input parameter, a distance threshold τ; the algorithm traverses the stay points and assigns each one to a cluster whose distance from the point is less than τ, and if no cluster lies within τ of the point, the point becomes a new cluster;

each cluster is a dwell area, noted

All points in the staying area are used, lat and lon are the central points of the staying area point set, and r is the radius of the staying area;

step two: based on the extracted information of the personnel staying area, semantic information of each area is further extracted;

step 2.1: geographic position information alone is sometimes insufficient to judge the relationship between persons accurately, and semantic information about the staying area is also needed; a POI (Point of Information) describes the spatial and attribute information of a geographic entity; in most cases the semantic information of a staying area is not unique, so the category information in the area cannot simply be reduced to one category; instead, multiple categories and their proportions are recorded as sem = (<catg_1, freq_1>, <catg_2, freq_2>, …, <catg_n, freq_n>), n ≥ 1; sem represents the semantic information in the staying area, and <catg_1, freq_1> represents the category of the first semantic information item and the frequency with which the person visits the geographic positions corresponding to that semantic;

the semantic information in a staying area is modeled with the LDA topic model, in which the POI information in a staying area is treated as a document, the semantic information of the staying area as a topic, and each POI as a word; the model is first trained with the POI information of all persons' staying areas as input data, and the trained model is then used to infer the semantic information of each staying area;

redefining dwell regions after extraction of semantic information to

Representative semantic information within a circle with r as the radius for the dwell region;

step 2.2: removing meaningless semantic information;

the semantic information set of person A is (<residential area, 150>, <café, 5>, <gym, 45>) and that of person B is (<residential area, 200>, <scientific research institution, 59>, <concert hall, 3>), where the two items in each pair of angle brackets are the semantic position information and the frequency of visits to that position; the "residential area" item carries a large weight in both sets, yet it has no practical meaning for comparing the similarity of the two sets and acts as an interference term, and once it is removed, the true similarity of A and B is very low;

in general, "residential area" is semantic information that everyone shares; each person's trajectory semantics contain it, and its distinguishing features are a high visit frequency and a fixed staying time period; the method for removing meaningless semantic information is as follows:

1) looping over each semantic information item and judging from the area's semantic information whether the area may be a residential area; if so, going to 2), otherwise going to 4);

2) judging whether the distribution of the average staying time of all stay points in the staying area is consistent with a residential area; if so, going to 3), otherwise going to 4);

3) deleting the semantic information item from the semantic information set;

4) exiting the loop;

step three: clustering the groups with a data mining method according to the persons' behavior rules and feature similarity, and finally identifying key special groups from the target population;

step 3.1: similarity measure

similarity is calculated from two aspects: geographic position similarity and semantic position similarity;

first, geographic position similarity; the extended Tanimoto coefficient, an extension of cosine similarity, is used to compare the similarity of two persons, and unlike cosine similarity it takes the influence of frequency and vector length into account; given person A and person B, their geographic position frequency vectors are la and lb respectively, and the similarity is expressed as:

when judging whether two geographic positions are the same, positioning-device error means the relationship between the two positions is judged from the degree of overlap of the stay points in the two areas; the degree of overlap, or similarity, of two staying areas is defined, for the area containing fewer stay points, as the ratio of the number of its stay points that lie in the intersection of the two areas to the number of all its stay points; this similarity is then added to the Tanimoto coefficient as a weight, forming a new weighted geographic position similarity measure; the formula is as follows:

second, semantic position similarity; the semantic information in a staying area is given as sem = (<c_1, f_1>, <c_2, f_2>, …, <c_n, f_n>), n ≥ 1, where f_i represents the probability of c_i;

when comparing whether the semantic information in two staying areas is the same, as when judging whether two geographic positions are the same, the degree of similarity of the two is also considered; since sem contains a probability distribution (f_1, f_2, …, f_n) of semantic information, the KL distance is used to measure the distance between two probability distributions;

in probability theory and information theory, the KL distance (Kullback-Leibler divergence) measures the difference between two probability distributions over the same event space; given that the probability distributions of the semantic information sets of person A and person B in a given staying area are fa(x) and fb(x) respectively, the KL distance between fa(x) and fb(x) is expressed as:

the KL distance is not symmetric, i.e. D_KL(fa||fb) ≠ D_KL(fb||fa), so it is not a true metric or distance; the JS distance is a symmetric improvement of the KL distance and is defined on the closed interval [0, 1]; the formula is as follows:

if

where δ is the distance threshold for the semantic information of the two areas and semA and semB are the semantic information sets of all staying areas of person A and person B respectively, then the semantic information of the two areas is similar;

sa and sb are the semantic information frequency vectors of the two persons respectively, and w is the vector formed by the JS distance mentioned above;

sim(A, B) = α·sim_loc(A, B) + (1 − α)·sim_sem(A, B) (6)

where α is a value in the interval [0, 1] that determines the weight of the semantic information;

step 3.2: group clustering

clustering based on shared nearest neighbors (SNN) is adopted, whose central concept is SNN similarity, the number of items that the k-nearest-neighbor lists of two objects have in common; because of this property, SNN clustering handles noise and outliers well and can handle clusters of different sizes, shapes and densities, and it is especially good at finding tight clusters of strongly related objects;

group clustering is carried out in three steps: first, an SNN proximity matrix is constructed from the persons' feature data and the similarity measure; second, an SNN similarity graph is constructed from the proximity matrix; third, all connected components of the similarity graph are found, each connected component being a cluster, clusters with only one point are removed, and each remaining cluster is a group; by setting a reasonable number of nearest neighbors k and SNN similarity threshold γ, closely related key groups are effectively found among the persons.

2. The group identification method based on the personnel behavior rule and the data mining method as claimed in claim 1, wherein: the proximity matrix construction algorithm first computes the k nearest neighbors of each person and then the SNN similarity between persons; if the number of nearest neighbors shared by two persons exceeds a threshold eps, the two persons are connected in the graph, so the entry of the proximity matrix corresponding to their IDs is set to 1, and this is repeated for all persons until the SNN proximity matrix is complete; the graph is then built from the SNN proximity matrix and represented as an array of adjacency lists, the adjacency list being a common storage structure for graphs in which the adjacent vertices of each vertex are stored in a linked list pointed to by the array element for that vertex; in the SNN similarity graph construction algorithm, adding an edge between v1 and v2 means adding v1 to the adjacency list of v2 and v2 to the adjacency list of v1, and once all edges are added, the graph is fully constructed; finally, the connected component algorithm finds all connected components of the graph using depth-first search, because the time and space required by depth-first search are proportional to V + E, and connectivity queries on the graph can then be answered in constant time.