CN107633067A

CN107633067A - Group identification method based on personnel behavior rules and a data mining method

Info

Publication number: CN107633067A
Application number: CN201710862301.9A
Authority: CN
Inventors: 丁治明; 司云飞; 才智; 曹阳; 迟远英
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-09-21
Filing date: 2017-09-21
Publication date: 2018-01-26
Anticipated expiration: 2037-09-21
Also published as: CN107633067B

Abstract

The invention discloses a group identification method based on personnel behavior rules and a data mining method. It belongs to the field of data mining and relates in particular to a method for identifying key groups in large-scale activities based on personnel behavior rules. The trajectory data of persons is used to extract their stay areas and the frequency with which each person visits each stay area. Based on the extracted stay-area information, the semantic information of each area is then extracted to express user behavior more accurately. Combining personnel behavior rules and feature similarity, group clustering is performed with a data mining method, and key special groups are finally identified from the target population.

Description

Group identification method based on personnel behavior rule and data mining method

Technical Field

The invention belongs to the field of data mining, and relates to a method for identifying key groups in large-scale activities based on a personnel behavior rule.

Background

With the growth of economic activity and the improvement of people's material and cultural living standards, large-scale events are held more and more frequently, which poses serious challenges to running them safely and to preventing emergencies. The most important problem in security work for large-scale events is how to identify special groups within the target population so that preventive work can be done in advance. Meanwhile, the rapid development of wireless communication technology has produced a large amount of moving-object data. These data depict the spatio-temporal dynamics of individuals and groups and contain the behavior information of the moving objects; analyzing the movement data of a target population helps reveal its behavior rules, group trends and the like.

In recent years, technologies such as satellite communication, GPS equipment, RFID, wireless sensors, Internet of Things communication and video tracking have developed continuously and been widely applied, so that moving objects of all sizes anywhere in the world can be positioned accurately and tracked effectively. With these technologies, signal receivers collect large amounts of moving-object data from positioning terminals. The data contain rich information such as position and time, and their volume and complexity grow over time. Moving-object data has thus become a new avenue for data analysis. Especially before major events, studying the movement trajectories of relevant groups helps to identify groups, understand group trends and analyze group behavior rules, so that preventive work for large-scale events can be targeted.

The technology adopts a clustering method from data mining to mine the data. Similar groups often have similar characteristics, so, based on the extracted personnel feature data and a designed formula for the similarity between persons, a suitable clustering algorithm is selected to identify key special groups from the target population.

Disclosure of Invention

The invention provides a group identification method based on personnel behavior rules and a data mining method. It uses the trajectory data of persons to extract their stay areas and the frequency with which each person visits each stay area; then, based on the extracted stay-area information, it further extracts the semantic information of each area so as to express user behavior more accurately; finally, combining personnel behavior rules and feature similarity, it performs group clustering with a data mining method and identifies key special groups from the target population.

A group identification method based on a personnel behavior rule and a data mining method comprises the following steps:

Step one: extract the stay areas of persons and the frequency with which the persons visit each stay area, using the trajectory data of the persons.

Step 1.1: extract the stay points of a single trajectory of a person. A stay point represents a geographic position where the person stays for a period of time; each stay point extracted from the person's trajectory is associated with a real geographic position, which reflects the person's activity to some extent. A single trajectory is defined as $T = (p_1, p_2, \ldots, p_n)$, where $p_i = (lat_i, lon_i, t_i)$, $0 \le i \le n$; $(lat_i, lon_i)$ is the latitude and longitude of location point $i$, and $t_i$ is the time at location point $i$.

Given a trajectory subsequence $t = (p_i, \ldots, p_{i+m})$, if $\mathrm{distance}(p_i, p_x) \le \theta_d$ and $|t_i - t_x| \ge \theta_t$, with $i \le x \le i+m$, where $p_x$ is the $x$-th track point in the trajectory sequence, $m$ is an integer from 0 to $n-i$, and $\theta_d$ and $\theta_t$ are the geographic distance threshold and the time threshold respectively, then $p(lat, lon)$ is the stay point (the formula defining $lat$ and $lon$ is not reproduced in this text).
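The stay-point test of Step 1.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the haversine distance, the use of the time span between the first and last point of the window, and the choice of the mean latitude and longitude as the stay-point coordinates are all assumptions (the source's formula for lat and lon is missing), and the function names are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points (assumed metric)."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def extract_stay_points(track, theta_d, theta_t):
    """track: list of (lat, lon, t) sorted by time t in seconds.

    Returns a list of (lat, lon) stay points.
    """
    stays = []
    i, n = 0, len(track)
    while i < n:
        j = i + 1
        # Grow the window while each point stays within theta_d of the anchor p_i.
        while j < n and haversine_m(track[i][0], track[i][1],
                                    track[j][0], track[j][1]) <= theta_d:
            j += 1
        # The window is track[i:j]; accept it if it spans at least theta_t seconds.
        if track[j - 1][2] - track[i][2] >= theta_t:
            window = track[i:j]
            # Assumption: the stay point is the mean position of the window.
            lat = sum(p[0] for p in window) / len(window)
            lon = sum(p[1] for p in window) / len(window)
            stays.append((lat, lon))
            i = j
        else:
            i += 1
    return stays
```

For example, a trajectory that lingers at one position for 20 minutes and then moves away yields a single stay point with `theta_d = 200` (metres) and `theta_t = 600` (seconds).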

Step 1.2: a person has more stay points in frequently visited areas and fewer in rarely visited areas. Applying the DBSCAN algorithm here would have high time complexity and require several input parameters, so a simple clustering algorithm (SC) is designed instead. It is fast and needs only one input parameter, a distance threshold τ. It traverses the stay points and assigns each one to a cluster whose distance to the point is less than τ; if no cluster lies within τ of the point, the point becomes a new cluster.

Each cluster is a stay area, recorded together with the set of all points in the area; lat and lon give the center of the stay area's point set, and r is the radius of the stay area.
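A single-pass clustering of this kind might look like the sketch below. Several details are assumptions because the text does not fix them: distance to a cluster is measured to its running center, a point joins the first cluster within τ rather than the nearest one, planar Euclidean distance (`math.dist`, Python 3.8+) stands in for a geographic distance, and the radius is taken as the largest distance from the center to a member point. The name `simple_cluster` is hypothetical.

```python
import math

def simple_cluster(points, tau, dist=math.dist):
    """Assign each stay point to the first cluster whose center is closer than tau."""
    clusters = []  # each cluster: {"points": [...], "center": (x, y)}
    for p in points:
        for c in clusters:
            if dist(p, c["center"]) < tau:
                c["points"].append(p)
                n = len(c["points"])
                # Update the running center of the cluster.
                c["center"] = (sum(q[0] for q in c["points"]) / n,
                               sum(q[1] for q in c["points"]) / n)
                break
        else:
            # No existing cluster is within tau: start a new one.
            clusters.append({"points": [p], "center": p})
    for c in clusters:
        # Assumed radius: farthest member from the center.
        c["radius"] = max(dist(q, c["center"]) for q in c["points"])
        # Visit frequency of the stay area = number of stay points in it.
        c["freq"] = len(c["points"])
    return clusters
```

The only tuning parameter is `tau`, which matches the motivation given above for preferring this scheme over DBSCAN; the price is that the result depends on the order in which points are visited.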

Step two: based on the extracted stay-area information, further extract the semantic information of each area.

Step 2.1: geographic position information alone is sometimes insufficient to judge the relationship between persons accurately; the semantic information of the stay areas is also needed. POI (Point of Information) data describes the spatial and attribute information of geographic entities, such as name, address, category and coordinates, which greatly enhances the ability to describe an actual geographic position and reflects user activity to some extent. In most cases the semantic information of a stay area is not single, so all category information in the area cannot simply be reduced to one category; instead, several categories and their proportions are recorded: $sem = (\langle catg_1, freq_1\rangle, \langle catg_2, freq_2\rangle, \ldots, \langle catg_n, freq_n\rangle)$, $n \ge 1$. Here $sem$ represents the semantic information in the stay area, and $\langle catg_1, freq_1\rangle$ represents the category of the first piece of semantic information and the frequency with which the person visits the geographic position corresponding to that semantic.

The semantic information in the stay areas is modeled with an LDA topic model: the POI information in a stay area is treated as a document, the semantic information in the stay area as a topic, and each POI as a word. The model is used to extract the semantic information of each stay area: it is first trained with the POI information of all persons' stay areas as input, and the trained model is then used to infer the semantic information in each stay area.
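To make the document/topic/word analogy concrete, the following is a small collapsed Gibbs sampler for LDA written in plain Python. It is a sketch under assumptions rather than the model used in the patent: each "word" is taken to be a POI category label, the hyperparameters `alpha` and `beta` and the iteration count are arbitrary, and the function returns the topic distribution of the training documents directly instead of running a separate inference pass. The name `lda_gibbs` is hypothetical.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of stay areas, each a list of POI labels (the 'words').

    Returns theta, where theta[d][k] is the share of topic k in stay area d.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]        # topic counts per document
    nkw = [[0] * V for _ in range(n_topics)]    # word counts per topic
    nk = [0] * n_topics                         # total words per topic
    z = []                                      # topic assignment of every word
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w2i[w]] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for pos, w in enumerate(doc):
                k, wi = z[d][pos], w2i[w]
                # Remove the current assignment before resampling it.
                ndk[d][k] -= 1
                nkw[k][wi] -= 1
                nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][pos] = k
                ndk[d][k] += 1
                nkw[k][wi] += 1
                nk[k] += 1
    return [[(ndk[d][k] + alpha) / (len(doc) + n_topics * alpha)
             for k in range(n_topics)] for d, doc in enumerate(docs)]
```

Each row of the result is a probability distribution over topics, which is the form of semantic information that the later similarity computation compares between stay areas.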

After the semantic information is extracted, the stay area is redefined to include $sem$, the representative semantic information within the circle of radius r of the stay area.

Step 2.2: remove meaningless semantic information.

For example, the semantic information set of person A is (<residential area, 150>, <café, 5>, <gym, 45>) and that of person B is (<residential area, 200>, <scientific research institution, 59>, <concert hall, 3>), where the two items in each pair of angle brackets are the semantic location information (for simplicity, only one piece of semantic information is used to represent the area's semantics) and the frequency of visits to that location. In this example the "residential area" item carries a large weight in both sets, yet it has no practical meaning for comparing the similarity of the two sets and even acts as an interference term; once it is removed, the true similarity of A and B is very low.

In general, "residential area" is semantic information shared by everyone: each person's trajectory semantics contains it, and its distinguishing features are a high visit frequency and a fixed stay period. The method for removing meaningless semantic information is as follows:

1) Loop over each piece of semantic information and judge from the area's semantic information whether the area may be a residential area; if so, go to 2), otherwise go to 4);

2) Judge whether the average stay-time distribution of all stay points in the stay area is consistent with a residential area; if so, go to 3), otherwise go to 4);

3) Delete the semantic information from the semantic information set;

4) Exit the loop;
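The four steps above can be sketched as a simple filter. The two predicates are hypothetical placeholders, since the text does not say how "possibly residential" or the stay-time check are decided, and step 4) is interpreted here as moving on to the next piece of semantic information rather than ending the whole loop; that interpretation is an assumption.

```python
def remove_meaningless(sem, looks_residential, stay_time_matches_home):
    """sem: list of (category, frequency) pairs for one person.

    looks_residential(category) -> bool        # step 1, hypothetical predicate
    stay_time_matches_home(category) -> bool   # step 2, hypothetical predicate
    """
    kept = []
    for category, freq in sem:
        # Steps 1 and 2: both checks must pass before the item is dropped.
        if looks_residential(category) and stay_time_matches_home(category):
            continue  # step 3: delete this piece of semantic information
        kept.append((category, freq))  # step 4: keep it and move on
    return kept

# Example with the sets of persons A and B from the text (stub predicates).
is_home = lambda c: c == "residential area"
always = lambda c: True
sem_a = [("residential area", 150), ("café", 5), ("gym", 45)]
sem_b = [("residential area", 200), ("scientific research institution", 59), ("concert hall", 3)]
clean_a = remove_meaningless(sem_a, is_home, always)
clean_b = remove_meaningless(sem_b, is_home, always)
```

After filtering, the two sets share no category, which is the "very low true similarity" the example describes.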

Step three: perform group clustering with a data mining method according to the personnel behavior rules and feature similarity, and finally identify key special groups from the target population.

Step 3.1: similarity measure

Similarity is computed from two aspects: geographic location similarity and semantic location similarity.

In a first aspect, geographic location similarity. The Tanimoto coefficient, an extension of cosine similarity, is used to compare the similarity of two persons; unlike cosine similarity, it takes the influence of frequency and vector length into account. Given person A and person B, the geographic-location frequency vectors of the two persons are $l_a$ and $l_b$ respectively, and are expressed as:

When judging whether two geographic positions are the same, positioning error must be taken into account, so the relationship between the two positions is judged by the degree of overlap of the stay points in the two areas. The degree of overlap, or similarity, of two stay areas is defined as follows: take the area containing fewer stay points, and divide the number of its stay points that lie in the intersection of the two areas by the number of all its stay points. This similarity is then added as a weight to the Tanimoto coefficient to form a new weighted geographic location similarity measure. The formula is as follows:
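The two ingredients named above can be illustrated in Python. The first function uses the standard extended Tanimoto coefficient, $T(l_a, l_b) = \dfrac{l_a \cdot l_b}{\lVert l_a \rVert^2 + \lVert l_b \rVert^2 - l_a \cdot l_b}$, which is presumably what the omitted formula expresses; the second implements the overlap degree as defined in the text. How the overlap weight enters the coefficient is given by a formula that is not reproduced here, so the sketch stops short of the weighted measure. The membership predicates `in_a` and `in_b` are hypothetical.

```python
def tanimoto(a, b):
    """Extended Tanimoto coefficient of two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

def overlap_degree(points_a, points_b, in_a, in_b):
    """Overlap of two stay areas.

    points_a / points_b: stay points of area A / area B.
    in_a / in_b: hypothetical predicates, True when a point lies inside A / B.
    """
    # Pick the area with fewer stay points and test its points against the other area.
    if len(points_a) <= len(points_b):
        small, inside_other = points_a, in_b
    else:
        small, inside_other = points_b, in_a
    if not small:
        return 0.0
    # Points of the smaller area that fall in the intersection, over all its points.
    return sum(1 for p in small if inside_other(p)) / len(small)
```

Identical frequency vectors give a coefficient of 1 and vectors with no common nonzero component give 0, while doubling one vector lowers the score, which is the length sensitivity that cosine similarity lacks.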

In a second aspect, semantic location similarity. The semantic information of a stay area is given as $sem = (\langle c_1, f_1\rangle, \langle c_2, f_2\rangle, \ldots, \langle c_n, f_n\rangle)$, $n \ge 1$, where $f_i$ represents the probability of $c_i$, so that $\sum_{i=1}^{n} f_i = 1$. As with judging whether two geographic positions are the same, comparing the semantic information in two stay areas requires considering their degree of similarity. Since $sem$ contains a probability distribution $(f_1, f_2, \ldots, f_n)$ over the semantic information, the KL distance is used to measure the distance between two such distributions.

In probability theory and information theory, the KL distance (Kullback-Leibler divergence) measures the difference between two probability distributions over the same event space. Let the probability distributions of the semantic information sets of a stay area of person A and of person B be $f_a(x)$ and $f_b(x)$ respectively; the KL distance between $f_a(x)$ and $f_b(x)$ is expressed as:

The KL distance is not symmetric, i.e. $D_{KL}(f_a \| f_b) \ne D_{KL}(f_b \| f_a)$, so it is not a true metric or distance. The JS distance is a symmetric improvement of the KL distance and takes values in the closed interval $[0, 1]$. The formula is as follows:

If the JS distance between the semantic information of two stay areas is within the threshold $\delta$, the semantic information of the two areas is considered similar. Here $\delta$ is the distance threshold for the semantic information of the two areas, and $sem_A$ and $sem_B$ are the semantic information sets of all stay areas of person A and person B respectively.
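For reference, the standard definitions of the two distances are $D_{KL}(f_a \| f_b) = \sum_x f_a(x) \log \dfrac{f_a(x)}{f_b(x)}$ and $JS(f_a, f_b) = \tfrac{1}{2} D_{KL}(f_a \| M) + \tfrac{1}{2} D_{KL}(f_b \| M)$ with $M = \tfrac{1}{2}(f_a + f_b)$; with base-2 logarithms the JS value lies in $[0, 1]$. The sketch below implements these textbook forms, which may differ in notation from the formulas omitted from this text. The threshold value used in the example is purely illustrative.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits.

    Terms with p_i == 0 contribute 0; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetric, and in [0, 1] with base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def semantically_similar(p, q, delta):
    """Two stay areas count as similar when their JS distance is within delta."""
    return js(p, q) <= delta

fa = [0.9, 0.1]
fb = [0.5, 0.5]
print(kl(fa, fb), kl(fb, fa))   # two different values: KL is asymmetric
print(js(fa, fb), js(fb, fa))   # the same value both ways
print(semantically_similar(fa, fb, delta=0.2))  # delta chosen only for illustration
```

Because the mixture $M$ is positive wherever either input is positive, `js` never divides by zero even when the two stay areas share no category.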

The semantic location similarity of two persons is calculated in the same way as the geographic location similarity, using the Tanimoto coefficient. The formula is as follows:

$s_a$ and $s_b$ are the semantic-information frequency vectors of the two persons respectively, and $w$ is the vector composed of the JS distances mentioned above.

Given the geographic location similarity and the semantic location similarity, the similarity of two persons is defined as their weighted sum:

$\mathrm{sim}(A,B) = \alpha \cdot \mathrm{sim}_{loc}(A,B) + (1-\alpha) \cdot \mathrm{sim}_{sem}(A,B) \qquad (6)$

where $\alpha$ is a value in the interval $[0, 1]$; it sets the balance between the two terms, with the semantic information receiving weight $1-\alpha$.

Step 3.2: population clustering

Clustering based on shared nearest neighbors (SNN) is adopted. Its central concept is SNN similarity, the number of common members in the k-nearest-neighbor lists of two objects. Owing to this property, SNN clustering handles noise and outliers well, copes with clusters of different sizes, shapes and densities, and is especially good at finding tight clusters of strongly related objects.

Group clustering proceeds in three steps. First, an SNN proximity matrix is constructed from the personnel feature data and the similarity measure; second, an SNN similarity graph is constructed from the proximity matrix; third, all connected components of the similarity graph are found. Each connected component is a cluster; clusters containing only one point are removed, and each remaining cluster is a group. By setting a reasonable number of nearest neighbors k and SNN similarity threshold γ, closely related key groups can be found effectively within the population.

In the proximity-matrix construction algorithm, the k nearest neighbors of each person are computed first, and then the SNN similarity between persons is computed. If the number of shared nearest neighbors of two persons exceeds the threshold eps, the two persons are connected in the graph, so the entry of the proximity matrix corresponding to the IDs of the two persons is set to 1; this is repeated for all persons until the SNN proximity matrix is complete. The graph is then built from the SNN proximity matrix and represented as an array of adjacency lists. The adjacency list is a common storage structure for graphs: for each vertex, all of its adjacent vertices are stored in a linked list pointed to by the array element for that vertex. In the graph construction algorithm, adding an edge between v1 and v2 means adding v1 to the adjacency list of v2 and v2 to the adjacency list of v1; once all edges have been added, the whole graph is constructed. Finally, the connected-component algorithm finds all connected components of the graph. Depth-first search is used for this connectivity problem because its time and space requirements are proportional to V + E, and after this preprocessing a connectivity query on the graph is answered in constant time.
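The three stages (k-nearest-neighbor lists, SNN graph, connected components) can be sketched compactly. This is an illustrative reading of the description, assuming the input is a precomputed person-by-person similarity matrix and that the SNN threshold is the same quantity the text calls γ or eps; Python lists stand in for linked adjacency lists, and the function name is hypothetical.

```python
def snn_clusters(sim, k, gamma):
    """sim: n x n similarity matrix (list of lists); returns clusters of person indices."""
    n = len(sim)
    # Stage 1: k nearest neighbours of each person by similarity, excluding the person.
    knn = [set(sorted((j for j in range(n) if j != i),
                      key=lambda j: -sim[i][j])[:k])
           for i in range(n)]
    # Stage 2: adjacency lists; connect two persons whose shared-neighbour
    # count exceeds the threshold.
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if len(knn[i] & knn[j]) > gamma:
                adj[i].append(j)
                adj[j].append(i)
    # Stage 3: connected components by iterative depth-first search.
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        if len(comp) > 1:  # single-point clusters are discarded as noise
            clusters.append(sorted(comp))
    return clusters
```

Raising `gamma` makes edges rarer and so yields fewer, tighter groups, while raising `k` enlarges every neighbor list and tends to merge groups.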

Drawings

FIG. 1: System flow chart.

FIG. 2: Person trajectory map.

FIG. 3: Person stay-point map.

FIG. 4: Person stay-area map.

FIG. 5: Person semantic information map.

FIG. 6: Stay-point location map of persons 000 and 003.

FIG. 7: Stay-point location map of persons 007 and 036.

FIG. 8: Stay-point location map of persons 006 and 023.

FIG. 9: Group clustering flow chart.

FIG. 10: Silhouette index (silhouette coefficient) as a function of k.

FIG. 11: Dunn index as a function of k.

FIG. 12: Similarity matrix for k = 12 or k = 13.

FIG. 13: Number of clusters as a function of k.

FIG. 14: Similarity matrix for k = 15.

FIG. 15: Similarity matrix for k = 19.

FIG. 16: Similarity matrix for k = 21.

Detailed Description

The invention is explained and illustrated below with reference to the accompanying drawings:

The data set used by the invention is Microsoft's open Geolife project, which collected GPS trajectory data of 182 volunteers over five years (April 2007 to August 2012). The data set contains 17,621 trajectories with a total distance of 1,292,951 km and a total duration of 50,176 hours. Each trajectory records timestamps, latitude and longitude, and altitude. The trajectories were collected by different GPS devices; 91.5% of them are densely sampled, with one point every 5 seconds or every 5-10 meters. The data set records a wide range of outdoor activities, including not only routines such as going home and going to work but also entertainment and sports activities such as shopping, sightseeing, dining, hiking and cycling. Although the data are spread over more than 30 cities in China, and even some cities in the United States and Europe, most of the data lie in Haidian District, Beijing.

The POI data set was collected from Amap (Gaode Map); it contains 156,500 objects in Haidian District, Beijing, each with a name, address, category, and coordinates in three coordinate systems.

Step one: extract the stay areas of persons and the frequency with which the persons visit each stay area, using the trajectory data of the persons.

Fig. 2 shows the movement trajectories of a student over several days, where each color marks the person's trajectory for one day. Stay points are extracted with the single-trajectory stay-point extraction method; the extracted stay points of this person are shown in Fig. 3. The SC algorithm is then applied to all of the person's stay points to divide them into stay areas; the result is shown in Fig. 4.

The semantic information of the student's stay areas is computed with the LDA topic model; part of it is shown in Fig. 5.

Step three: perform group clustering with a data mining method according to the personnel behavior rules and feature similarity, and finally identify key special groups from the target population.

Step 3.1: similarity measure

Eight persons with clear trajectory characteristics are extracted from the data set, and the pairwise similarities for α = 0.3 are computed according to formula (6), as shown in the following table:

Person similarities (α = 0.3)

	000	003	006	007	023	036	041	065
000	1.00000	0.53124	0.03872	0.02251	0.01670	0.09282	0.00858	0.01253
003	0.53124	1.00000	0.01927	0.02097	0.00663	0.04312	0.00531	0.00678
006	0.03872	0.01927	1.00000	0.01555	0.31987	0.04317	0.15063	0.11027
007	0.02251	0.02097	0.01555	1.00000	0.22439	0.31128	0.04005	0.04005
023	0.01670	0.00663	0.31987	0.22439	1.00000	0.01172	0.07134	0.04878
036	0.09282	0.04312	0.04317	0.31128	0.01172	1.00000	0.03154	0.04597
041	0.00858	0.00531	0.15063	0.04005	0.07134	0.03154	1.00000	0.15481
065	0.01253	0.00678	0.11027	0.04005	0.04878	0.04597	0.15481	1.00000

From the table above, several pairs of highly similar persons can be found: 000 and 003, 006 and 023, 007 and 036. The stay points of 000 and 003 are distributed as shown in Fig. 6. The geographic positions visited by the two persons clearly overlap heavily, so their geographic location similarity is high; and because semantic information is derived from the geographic areas, their semantic similarity is also high. Their overall similarity of 0.53 is therefore as expected. The stay points of 007 and 036 are distributed as shown in Fig. 7; as with 000 and 003, the geographic positions are highly similar, and so is the semantic information, so the overall similarity of the two persons is high. The stay points of 006 and 023 are distributed as shown in Fig. 8: the stay areas of the two persons barely overlap, yet their similarity is 0.32. By the definition of similarity above, this must come from a high semantic similarity. This is indeed the case: the areas where the two persons' stay points are dense lie at Beihang University and Minzu University of China respectively, and most POIs near both locations belong to the science, education and culture services category. This shows that considering semantic similarity finds similar persons who cannot be found through geographic location similarity alone.

Step 3.2: population clustering

The group clustering process is shown in Fig. 9. Taking the trajectory data of all 182 persons in Geolife as an example, the Dunn index and the Silhouette index are used as clustering evaluation criteria.

Dunn index:

where $C_i$ denotes the $i$-th cluster and $d(x, y)$ denotes the distance between $x$ and $y$. The index rewards small intra-cluster distances and large inter-cluster distances, so a larger value indicates a better clustering.

Silhouette index:

where $NC$ is the number of clusters and $n_i$ is the number of points in $C_i$. $a(x)$ is the average distance from object $x$ to all other objects in its own cluster; $b(x)$ is the minimum, over all clusters that do not contain $x$, of the average distance from $x$ to all objects in that cluster.
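The two criteria can be computed as in the sketch below. It uses common textbook forms, which are assumptions about the omitted formulas: the Dunn index as the smallest inter-cluster point distance divided by the largest intra-cluster point distance, and the Silhouette index as $\frac{1}{NC}\sum_i \frac{1}{n_i}\sum_{x \in C_i} \frac{b(x) - a(x)}{\max(a(x), b(x))}$. The code further assumes at least two clusters, each with at least two points, which matches the removal of single-point clusters described earlier.

```python
def dunn_index(clusters, d):
    """clusters: list of lists of objects; d(x, y): distance function."""
    inter = min(d(x, y)
                for i, ci in enumerate(clusters)
                for cj in clusters[i + 1:]
                for x in ci for y in cj)
    intra = max(d(x, y) for c in clusters for x in c for y in c)
    return inter / intra

def silhouette_index(clusters, d):
    total = 0.0
    for i, ci in enumerate(clusters):
        s_sum = 0.0
        for n, x in enumerate(ci):
            # a(x): mean distance to the other members of x's own cluster.
            a = sum(d(x, y) for m, y in enumerate(ci) if m != n) / (len(ci) - 1)
            # b(x): smallest mean distance to the members of any other cluster.
            b = min(sum(d(x, y) for y in cj) / len(cj)
                    for j, cj in enumerate(clusters) if j != i)
            s_sum += (b - a) / max(a, b)
        total += s_sum / len(ci)
    return total / len(clusters)

# Toy 1-D example: two tight, well separated clusters.
toy = [[0.0, 1.0], [10.0, 11.0]]
dist = lambda x, y: abs(x - y)
print(dunn_index(toy, dist))
print(silhouette_index(toy, dist))
```

Both indices are undefined when fewer than two clusters remain, which is one reason some parameter settings cannot be scored at all.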

Figs. 10 and 11 show how the Silhouette index and the Dunn index vary with the number of nearest neighbors k. k ranges over [10, 30], the most representative interval; values of k for which the evaluation criteria cannot be computed are omitted, and the SNN similarity threshold is γ = 10. As the figures show, both evaluation criteria reach their maximum when k is 12 or 13. The clustering results for these two values of k are the same: two clusters, one containing persons 032 and 044 and the other containing persons 151 and 162. The similarity matrix is shown in Fig. 12.

This indicates that two groups of highly similar persons were found among the 182 persons, each containing two persons, with all remaining persons treated as noise points. According to the evaluation criteria, this is the best clustering result.

Fig. 13 shows how the number of clusters changes with k. For k = 15, k = 19 and k = 21, the similarity matrices of the clustering results are shown in Figs. 14, 15 and 16, yielding 4, 11 and 8 clusters respectively. Although none of these three results is optimal under the evaluation criteria, they are still relatively reasonable, which indicates that the method can produce relatively reasonable clusterings for different numbers of clusters and can therefore effectively identify highly similar groups within a large population.

Claims

1. A group identification method based on a personnel behavior rule and a data mining method, characterized in that the method comprises the following steps:

step one: extracting the stay areas of persons and the frequency with which the persons visit each stay area, using the trajectory data of the persons;

step 1.1: extracting the stay points of a single trajectory of a person; a stay point represents a geographic position where the person stays for a period of time, and each stay point extracted from the person's trajectory is associated with a real geographic position, which reflects the person's activity to some extent; a single trajectory is defined as $T = (p_1, p_2, \ldots, p_n)$, where $p_i = (lat_i, lon_i, t_i)$, $0 \le i \le n$, $(lat_i, lon_i)$ is the latitude and longitude of location point $i$, and $t_i$ is the time at location point $i$;

given a trajectory subsequence $t = (p_i, \ldots, p_{i+m})$, if $\mathrm{distance}(p_i, p_x) \le \theta_d$ and $|t_i - t_x| \ge \theta_t$, with $i \le x \le i+m$, where $p_x$ is the $x$-th track point in the trajectory sequence, $m$ is an integer from 0 to $n-i$, and $\theta_d$ and $\theta_t$ are the geographic distance threshold and the time threshold respectively, then $p(lat, lon)$ is the stay point (the formula defining $lat$ and $lon$ is not reproduced in this text);

step 1.2: a person has more stay points in frequently visited areas and fewer in rarely visited areas; applying the DBSCAN algorithm here would have high time complexity and require several input parameters, so a simple clustering algorithm (SC) is designed, which is fast and needs only one input parameter, a distance threshold τ; it traverses the stay points and assigns each one to a cluster whose distance to the point is less than τ, and if no cluster lies within τ of the point, the point becomes a new cluster;

each cluster is a stay area, recorded together with the set of all points in the area; lat and lon give the center of the stay area's point set, and r is the radius of the stay area;

step two: based on the extracted information of the personnel staying area, semantic information of each area is further extracted;

step 2.1: geographic position information alone is sometimes insufficient to judge the relationship between persons accurately, and the semantic information of the stay areas is also needed; POI (Point of Information) data describes the spatial and attribute information of geographic entities; in most cases the semantic information of a stay area is not single, so all category information in the area cannot simply be reduced to one category, and several categories and their proportions are recorded instead: $sem = (\langle catg_1, freq_1\rangle, \langle catg_2, freq_2\rangle, \ldots, \langle catg_n, freq_n\rangle)$, $n \ge 1$; $sem$ represents the semantic information in the stay area, and $\langle catg_1, freq_1\rangle$ represents the category of the first piece of semantic information and the frequency with which the person visits the geographic position corresponding to that semantic;

modeling the semantic information in the stay areas with an LDA topic model, in which the POI information in a stay area is treated as a document, the semantic information in the stay area as a topic, and each POI as a word; the model is used to extract the semantic information of each stay area: it is first trained with the POI information of all persons' stay areas as input, and the trained model is then used to infer the semantic information in each stay area;

after the semantic information is extracted, the stay area is redefined to include $sem$, the representative semantic information within the circle of radius r of the stay area;

step 2.2: removing meaningless semantic information;

the semantic information set of person A is (<residential area, 150>, <café, 5>, <gym, 45>) and that of person B is (<residential area, 200>, <scientific research institution, 59>, <concert hall, 3>), where the two items in each pair of angle brackets are the semantic location information and the frequency of visits to that location; the "residential area" item carries a large weight in both sets, yet it has no practical meaning for comparing the similarity of the two sets or even acts as an interference term, and once it is removed the true similarity of A and B is very low;

in general, "residential area" is semantic information shared by everyone: each person's trajectory semantics contains it, and its distinguishing features are a high visit frequency and a fixed stay period; the method for removing meaningless semantic information is as follows:

1) looping over each piece of semantic information and judging from the area's semantic information whether the area may be a residential area; if so, going to 2), otherwise going to 4);

2) judging whether the average stay-time distribution of all stay points in the stay area is consistent with a residential area; if so, going to 3), otherwise going to 4);

3) deleting the semantic information from the semantic information set;

4) exiting the loop;

step three: carrying out group clustering by using a data mining method according to the personnel behavior rules and the characteristic similarity, and finally identifying key special groups from target groups;

step 3.1: similarity measure

Calculating similarity, namely considering the similarity of a geographical position and the similarity of a semantic position from two aspects;

in a first aspect, geographic location similarity; the expanded Tanimoto coefficient of the cosine similarity is adopted to compare the similarity of two persons, and the expanded Tanimoto coefficient is different from the cosine similarity and takes the influence of frequency and vector length into consideration; given person A and person B, the two-person geographic location frequency vectors are la and lb, respectively, and are expressed as:

when judging whether the two geographic positions are the same, judging the position relation of the two geographic positions according to the overlapping degree of the stop points in the two geographic position areas due to the error of the positioning equipment; the overlapping degree or similarity of the two staying areas is defined as the ratio of the number of staying points in the area containing less staying points and the number of all staying points in the area containing less staying points in the intersecting area of the two areas; then adding the similarity serving as a weight into a Tanimoto coefficient to form a new weighted geographic position similarity measure; the formula is as follows:

in a second aspect, semantic location similarity; semantic information within a certain dwell region is given as sem = (< c) ₁ ,f ₁ ＞,<c ₂ ,f ₂ >,...<c _n ,f _n >) n is greater than or equal to 1, fi represents the probability of ci, so thatComparing whether the semantic information in the two staying areas is the same or not, judging whether the geographic positions are the same or not, and considering the similarity degree of the two staying areas; the sem includes a probability distribution (f) of semantic information ₁ ,f ₂ ,…,f _n ) Therefore, the KL distance is used for measuring the distance between the two probability distributions;

In probability theory and information theory, the KL distance (Kullback-Leibler divergence) measures the difference between two probability distributions over the same event space; given that the probability distributions of the semantic information sets of person A and person B in a certain dwell region are $f_a(x)$ and $f_b(x)$, respectively, the KL distance between $f_a(x)$ and $f_b(x)$ is expressed as:
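For reference, the standard discrete form of the Kullback-Leibler divergence, written with the symbols $f_a$ and $f_b$ used above, is given below. Summing over the semantic categories $x$ of the dwell region is an assumption about how the event space is indexed, and the logarithm base is left unspecified.

```latex
D_{KL}(f_a \,\|\, f_b) = \sum_{x} f_a(x) \, \log \frac{f_a(x)}{f_b(x)}
```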

The KL distance is not symmetric, i.e. $D_{KL}(f_a \| f_b) \neq D_{KL}(f_b \| f_a)$, so it is not a true metric or distance; the JS distance (Jensen-Shannon divergence) is a symmetric improvement of the KL distance, and its value lies in the closed interval $[0,1]$; the formula is as follows:
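The asymmetry of the KL distance and the symmetric, bounded behavior of the JS distance can be checked with a short sketch. It uses the standard Jensen-Shannon divergence, the average of the two KL divergences to the midpoint distribution, with base-2 logarithms so that the result falls in [0, 1]. The function names and the choice of the divergence form (rather than its square root) are assumptions.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing. q_i must be positive wherever
    p_i is positive; this always holds when q is a midpoint distribution.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and within [0, 1] for log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    fa = [0.5, 0.5]
    fb = [0.9, 0.1]
    print(kl_divergence(fa, fb), kl_divergence(fb, fa))  # two different values
    print(js_divergence(fa, fb), js_divergence(fb, fa))  # two equal values
```

Two distributions with no category in common, such as `[1, 0]` and `[0, 1]`, reach the upper bound of 1, and identical distributions give 0.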

if the JS distance between the two is within δ, where δ is the distance threshold for the semantic information of the two, then the semantic information of the two regions is considered similar; semA and semB are the semantic information sets of all dwell regions of person A and person B, respectively;

sa and sb are semantic information frequency vectors of two persons respectively, and w is a vector formed by the JS distance mentioned above;

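$$sim(A,B) = \alpha \cdot sim_{loc}(A,B) + (1-\alpha) \cdot sim_{sem}(A,B) \qquad (6)$$

where $\alpha$ is a value in the interval $[0,1]$ that sets the weight of geographic location similarity, leaving $1-\alpha$ as the weight of semantic similarity;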
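Formula (6) is a convex combination of the two similarity scores. A minimal sketch follows; the function name and the range check on the weight are assumptions added for illustration.

```python
def combined_similarity(sim_loc, sim_sem, alpha):
    """Weighted sum of geographic and semantic similarity, as in formula (6)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_loc + (1.0 - alpha) * sim_sem


if __name__ == "__main__":
    # A weight of 1 uses only geographic similarity, 0 only semantic similarity.
    print(combined_similarity(0.8, 0.4, 1.0))
    print(combined_similarity(0.8, 0.4, 0.0))
    print(combined_similarity(0.8, 0.4, 0.5))
```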

step 3.2: group clustering

Clustering based on shared nearest neighbors (SNN) is adopted; its central concept is SNN similarity, which is the number of common items among the k nearest neighbors of two objects; by its nature, SNN clustering handles noise and outliers well and can handle clusters of different sizes, shapes and densities, and it is especially good at finding compact clusters of strongly related objects;

Group clustering is carried out in three steps; first, an SNN proximity matrix is constructed from the personnel feature data and the similarity measure; second, an SNN similarity graph is constructed from the proximity matrix; third, all connected components (connected branches) of the similarity graph are found, each connected component being one cluster, clusters containing only one point are removed, and each remaining cluster is taken as a group; by setting a reasonable number of nearest neighbors k and SNN similarity threshold γ, closely related key groups are effectively found within the crowd.
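The first of the three steps, building the SNN proximity matrix, can be sketched as follows. The sketch assumes a precomputed pairwise similarity matrix `sim` (for example from formula (6)), marks two persons as connected when their shared-neighbor count is strictly greater than the threshold, and uses hypothetical function names throughout.

```python
def k_nearest_neighbors(sim, k):
    """For each person i, return the set of the k most similar other persons."""
    n = len(sim)
    neighbors = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sim[i][j], reverse=True)
        neighbors.append(set(others[:k]))
    return neighbors


def snn_proximity_matrix(sim, k, gamma):
    """0/1 matrix: entry [i][j] is 1 when i and j share more than gamma neighbors."""
    neighbors = k_nearest_neighbors(sim, k)
    n = len(sim)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbors[i] & neighbors[j])  # SNN similarity
            if shared > gamma:  # strict comparison is an assumption
                matrix[i][j] = matrix[j][i] = 1
    return matrix


if __name__ == "__main__":
    # Illustrative data: persons 0-2 and persons 3-5 form two tight groups.
    sim = [[0.9 if (i < 3) == (j < 3) else 0.1 for j in range(6)]
           for i in range(6)]
    for row in snn_proximity_matrix(sim, k=2, gamma=0):
        print(row)
```

With this toy input the matrix is block-structured: persons within the same group are marked as connected, and no entry links the two groups.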

2. The group identification method based on the personnel behavior rule and the data mining method as claimed in claim 1, wherein: in the proximity matrix construction algorithm, the k nearest neighbors of each person are first calculated, and then the SNN similarity between persons is calculated; if the number of shared nearest neighbors of two persons exceeds a threshold eps, the two persons are connected in the graph, so the positions corresponding to the IDs of the two persons in the proximity matrix are marked as 1; this is repeated for all persons until the construction of the SNN proximity matrix is completed; the graph is then constructed from the SNN proximity matrix and is represented by an array of adjacency lists; the adjacency list is a common storage structure for graphs, in which all vertices adjacent to a given vertex are stored in the linked list pointed to by the array element corresponding to that vertex; in the SNN similarity graph construction algorithm, adding an edge between v1 and v2 means adding v1 to the adjacency list of v2 and v2 to the adjacency list of v1, and this continues until all edges have been added and the whole graph is constructed; finally, the connected component finding algorithm searches for all connected components (branches) of the graph; depth-first search is used to explore the connectivity of the graph because its time and space are proportional to V + E, and connectivity queries on the graph can then be answered in constant time.
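The graph construction and connected component search described in this claim can be sketched as follows, assuming the 0/1 proximity matrix is already available. Python lists stand in for the linked lists of the adjacency-list array, the depth-first search is written recursively (adequate for small graphs only), and all function names are hypothetical.

```python
def build_adjacency_list(matrix):
    """Turn a symmetric 0/1 proximity matrix into an array of adjacency lists."""
    n = len(matrix)
    adj = [[] for _ in range(n)]
    for v1 in range(n):
        for v2 in range(v1 + 1, n):
            if matrix[v1][v2] == 1:
                adj[v1].append(v2)  # add v2 to the adjacency list of v1
                adj[v2].append(v1)  # add v1 to the adjacency list of v2
    return adj


def _dfs(adj, v, cid, comp_id):
    """Label every vertex reachable from v with component id cid."""
    comp_id[v] = cid
    for w in adj[v]:
        if comp_id[w] == -1:
            _dfs(adj, w, cid, comp_id)


def connected_components(adj):
    """Return comp_id, where comp_id[v] is the component index of vertex v."""
    comp_id = [-1] * len(adj)
    count = 0
    for v in range(len(adj)):
        if comp_id[v] == -1:
            _dfs(adj, v, count, comp_id)
            count += 1
    return comp_id


def groups_from_components(comp_id):
    """Collect members per component and drop single-point clusters."""
    clusters = {}
    for person, cid in enumerate(comp_id):
        clusters.setdefault(cid, []).append(person)
    return [members for members in clusters.values() if len(members) > 1]


if __name__ == "__main__":
    # Illustrative matrix with edges 0-1, 1-2, 3-4; person 5 is isolated.
    m = [[0] * 6 for _ in range(6)]
    for a, b in [(0, 1), (1, 2), (3, 4)]:
        m[a][b] = m[b][a] = 1
    comp_id = connected_components(build_adjacency_list(m))
    print(groups_from_components(comp_id))
    print(comp_id[0] == comp_id[2])  # constant-time connectivity query
```

Once `comp_id` has been filled in by a single pass over the vertices and edges, asking whether two persons belong to the same group is a single comparison of two array entries, which is the constant-time query mentioned above.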