CN107633067A - Group identification method based on personnel behavior rules and a data mining method - Google Patents

Group identification method based on personnel behavior rules and a data mining method

Info

Publication number
CN107633067A
Authority
CN
China
Prior art keywords
similarity
semantic information
area
staying
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710862301.9A
Other languages
Chinese (zh)
Other versions
CN107633067B (en)
Inventor
丁治明
司云飞
才智
曹阳
迟远英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201710862301.9A priority Critical patent/CN107633067B/en
Publication of CN107633067A publication Critical patent/CN107633067A/en
Application granted granted Critical
Publication of CN107633067B publication Critical patent/CN107633067B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a group identification method based on personnel behavior rules and a data mining method. It belongs to the field of data mining and in particular relates to a method for identifying key groups in large-scale activities based on personnel behavior rules. The method uses the trajectory data of each person to extract that person's staying areas and the frequency with which the person visits each staying area. Based on the extracted staying areas, it further extracts the semantic information of each area so as to express user behavior more accurately. Combining personnel behavior rules with feature similarity, it then clusters the population with a data mining method and finally identifies key special groups within the target population.

Description

Group identification method based on personnel behavior rule and data mining method
Technical Field
The invention belongs to the field of data mining, and relates to a method for identifying key groups in large-scale activities based on a personnel behavior rule.
Background
With the growth of economic activity and the improvement of people's material and cultural living standards, large-scale events are held more and more frequently, and they pose serious challenges for running events safely and preventing emergencies. The most important problem in security work for large-scale events is how to identify special groups within the target population so that preventive work can be done in advance. Meanwhile, the rapid development of wireless communication technology has produced a large amount of moving object data. These data depict the spatio-temporal dynamics of individuals and groups and contain the behavior information of the moving objects; analyzing the movement data of a target population helps reveal its behavior rules, group trends and so on.
In recent years, technologies such as satellite communication, GPS equipment, RFID, wireless sensors, Internet of Things communication and video tracking have developed continuously and been applied widely, so that moving objects of all sizes can be positioned accurately and tracked effectively anywhere in the world. With these technologies, signal receiving devices collect a large amount of moving object data from positioning terminals. The data contain rich information such as position and time, and their volume and complexity grow steadily over time. Moving object data have thus become a new avenue for data analysis. In particular, before a major event, studying the movement trajectories of the relevant groups helps identify groups, understand group trends and analyze group behavior rules, so that preventive work for the event can be targeted.
The technology mines the data with a clustering method from data mining. Similar groups tend to share similar characteristics, so, given the extracted personnel feature data and a formula for the similarity between persons, a suitable clustering algorithm is selected to identify key special groups within the target population.
Disclosure of Invention
The invention provides a group identification method based on a personnel behavior rule and a data mining method, which utilizes trajectory data information of personnel to extract staying areas and the frequency of the personnel to each staying area, then further extracts semantic information of each area to more accurately express user behavior based on the extracted information of the staying areas of the personnel, combines the personnel behavior rule and the characteristic similarity, utilizes the data mining method to perform group clustering, and finally identifies key special groups from target groups.
A group identification method based on a personnel behavior rule and a data mining method comprises the following steps:
Step one: extract the staying areas and the frequency with which the person visits each staying area from the person's trajectory data.
Step 1.1: extract the stay points of a single trajectory. A stay point represents a geographic position where the person remains for a period of time; each stay point extracted from a trajectory is associated with a real geographic position and therefore reflects the person's activity to some extent. A single trajectory is defined as $T = (p_1, p_2, \ldots, p_n)$, where $p_i = (lat_i, lon_i, t_i)$, $0 \le i \le n$, $(lat_i, lon_i)$ is the latitude and longitude of location point $i$, and $t_i$ is the time at location point $i$.
Given a trajectory sequence $t = (p_i, \ldots, p_{i+m})$, if $distance(p_i, p_x) \le \theta_d$, $|t_i - t_x| \ge \theta_t$, $i \le x \le i+m$, where $p_x$ is the $x$-th point of the trajectory sequence, $m$ is an integer from 0 to $n-i$, and $\theta_d$ and $\theta_t$ are the geographic distance threshold and the time threshold respectively, then $p(lat, lon)$ is a stay point, where [formula defining lat and lon not reproduced in this text].
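The stay point definition above can be sketched in Python as follows. This is only an illustrative sketch: the function names, the use of the haversine distance, and in particular the choice of the arithmetic mean of the window as the stay point coordinates are assumptions, since the defining formula for lat and lon is not available in this text.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def extract_stay_points(track, theta_d, theta_t):
    """track: list of (lat, lon, t) tuples sorted by time t in seconds.
    theta_d: distance threshold in meters; theta_t: time threshold in seconds.
    Returns a list of (lat, lon) stay points."""
    stays = []
    i, n = 0, len(track)
    while i < n:
        j = i + 1
        # Grow the window while each point stays within theta_d of the anchor p_i.
        while j < n and haversine_m(track[i][0], track[i][1],
                                    track[j][0], track[j][1]) <= theta_d:
            j += 1
        window = track[i:j]
        # The window counts as a stay if it spans at least theta_t.
        if window[-1][2] - window[0][2] >= theta_t:
            # Assumption: the stay point is the mean position of the window.
            lat = sum(p[0] for p in window) / len(window)
            lon = sum(p[1] for p in window) / len(window)
            stays.append((lat, lon))
            i = j          # continue after the stay
        else:
            i += 1         # slide the anchor forward
    return stays
```

With thresholds such as 200 m and 20 minutes (hypothetical values), a person who lingers around one building and then travels away yields a single stay point near that building.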
Step 1.2: a person has more stay points in frequently visited areas and fewer in rarely visited ones. Applying the DBSCAN algorithm here would incur high time complexity and require several input parameters, so a simple clustering algorithm (SC) is designed instead. It is fast and needs only one input parameter, a distance threshold τ. It traverses the stay points and assigns each one to a cluster whose distance from the point is less than τ; if no cluster lies within τ of the point, the point becomes a new cluster.
Each cluster is a staying area, denoted [notation not reproduced in this text], comprising the set of all points in the staying area, the center (lat, lon) of that point set, and the radius r of the staying area.
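A minimal sketch of such a single-pass clustering is given below. The text does not say how the distance between a point and a cluster is measured or how r is computed; here the distance to the running cluster center and the largest center-to-member distance are used, both of which are assumptions, as are all names.

```python
import math

def dist_m(p, q):
    """Haversine distance in meters between two (lat, lon) pairs."""
    r = 6371000.0
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlmb = math.radians(q[1] - p[1])
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def simple_clustering(stay_points, tau):
    """Single pass over stay points; tau is the only parameter (meters).
    Returns a list of staying areas as dicts with points, lat, lon, r, freq."""
    clusters = []
    for p in stay_points:
        # Assign the point to the first cluster whose center is closer than tau.
        target = next((c for c in clusters
                       if dist_m(p, (c["lat"], c["lon"])) < tau), None)
        if target is None:
            clusters.append({"points": [p], "lat": p[0], "lon": p[1]})
        else:
            target["points"].append(p)
            k = len(target["points"])
            # Assumption: the center is the mean of the member points.
            target["lat"] = sum(x[0] for x in target["points"]) / k
            target["lon"] = sum(x[1] for x in target["points"]) / k
    for c in clusters:
        # Assumption: radius is the largest center-to-member distance.
        c["r"] = max(dist_m(x, (c["lat"], c["lon"])) for x in c["points"])
        # Number of stay points, used as the visit frequency of the area.
        c["freq"] = len(c["points"])
    return clusters
```

Because points are assigned greedily in traversal order, the result can depend on the order of the stay points; that is the price of the single parameter and the single pass.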
Step two: further extract the semantic information of each area based on the extracted staying areas.
Step 2.1: sometimes the relationship between people cannot be judged accurately from geographic position alone; the semantic information of the staying area is also needed. A POI (Point of Interest) describes the spatial and attribute information of a geographic entity, such as its name, address, category and coordinates, which greatly enhances the description of an actual geographic position and reflects user activity to some extent. In most cases a staying area does not carry a single semantic meaning, so the category information in a staying area cannot simply be reduced to one class; instead, several categories and their proportions are recorded: $sem = (\langle catg_1, freq_1\rangle, \langle catg_2, freq_2\rangle, \ldots, \langle catg_n, freq_n\rangle)$, $n \ge 1$. Here sem is the semantic information of the staying area, and $\langle catg_1, freq_1\rangle$ gives the category of the first piece of semantic information and the frequency with which people visit the geographic positions carrying that meaning.
The semantic information of the staying areas is modeled with an LDA topic model: the POI information of a staying area is treated as a document, the semantic information of the area as a topic, and each POI as a word. The model is first trained with the POI information of all persons' staying areas as input data, and the trained model is then used to infer the semantic information of each staying area.
After the semantic information is extracted, the staying area is redefined as [notation not reproduced in this text], in which the added component is the representative semantic information within the circle of radius r around the staying area.
Step 2.2: remove meaningless semantic information.
Suppose the semantic information set of person A is (<residential area, 150>, <café, 5>, <gym, 45>) and that of person B is (<residential area, 200>, <scientific research institution, 59>, <concert hall, 3>), where the two items in each pair are the semantic position information (for simplicity, a single piece of semantic information stands for the semantics of the area) and the frequency of visits to that position. In this example "residential area" carries a large weight in both sets, yet it has no practical meaning when the similarity of the two sets is compared and even acts as an interference term; once it is removed, the true similarity of A and B is very low.
In general, "residential area" is semantic information that everyone shares: every person's trajectory semantics contain it, and its distinguishing features are a high visit frequency and a fixed stay period. Meaningless semantic information is removed as follows:
1) Loop over the semantic information items. Judge from the area's semantic information whether the area may be a residential area; if so, go to 2), otherwise go to 4);
2) Judge whether the distribution of the average stay times of all stay points in the staying area is consistent with a residential area; if so, go to 3), otherwise go to 4);
3) Delete the item from the semantic information set;
4) Exit the loop.
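The four steps can be sketched as a simple filter. The text does not specify how the stay time distribution is tested, so the night-time ratio test, the 21:00 to 07:00 window, the 0.6 threshold and the data layout below are all hypothetical; step 4) is interpreted here as moving on to the next item.

```python
RESIDENTIAL_LABELS = {"residential area"}  # assumed label set

def is_night(hour):
    """Hypothetical 'at home' window: 21:00 to 07:00."""
    return hour >= 21 or hour < 7

def remove_residential(sem_items, night_ratio=0.6):
    """sem_items: list of dicts with 'category', 'freq' and 'stay_hours'
    (hour of day of each stay in that area). Returns the filtered list."""
    kept = []
    for item in sem_items:
        # 1) Could this area be a residential area at all?
        if item["category"] not in RESIDENTIAL_LABELS:
            kept.append(item)
            continue
        # 2) Does the stay time distribution look like a home? (assumed test)
        hours = item["stay_hours"]
        ratio = sum(is_night(h) for h in hours) / len(hours) if hours else 0.0
        if ratio < night_ratio:
            kept.append(item)
            continue
        # 3) Otherwise the item is dropped from the semantic information set.
    return kept

# Person A from the example above, with made-up stay hours.
person_a = [
    {"category": "residential area", "freq": 150, "stay_hours": [22, 23, 0, 1, 6, 12]},
    {"category": "cafe", "freq": 5, "stay_hours": [15]},
    {"category": "gym", "freq": 45, "stay_hours": [19]},
]
filtered_a = remove_residential(person_a)
```

After filtering, only the café and gym items remain for person A, which is what makes the comparison with person B meaningful.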
Step three: cluster the population with a data mining method according to personnel behavior rules and feature similarity, and finally identify key special groups within the target population.
Step 3.1: similarity measure
Similarity is computed from two aspects: geographic position similarity and semantic position similarity.
First aspect: geographic position similarity. The extended Tanimoto coefficient, an extension of cosine similarity, is used to compare two persons; unlike cosine similarity, it takes the influence of frequency and vector length into account. Given person A and person B, whose geographic position frequency vectors are la and lb respectively, the coefficient is expressed as: [formula not reproduced in this text]
Because of positioning errors, whether two geographic positions are the same must be judged from the degree of overlap of the stay points in the two staying areas. The degree of overlap, or similarity, of two staying areas is defined as follows: take the area that contains fewer stay points, and divide the number of its stay points that fall within the intersection of the two areas by the total number of its stay points. This similarity is then added to the Tanimoto coefficient as a weight, giving a new weighted geographic position similarity measure. The formula is as follows: [formula not reproduced in this text]
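For orientation, the standard extended Tanimoto coefficient and the overlap degree described above can be sketched as follows. How the overlap weight enters the patent's formula is not recoverable from this text; applying the weights to the cross term is purely an assumption made for illustration, as are the planar coordinates and all names.

```python
import math

def tanimoto(a, b, w=None):
    """Extended Tanimoto coefficient a.b / (|a|^2 + |b|^2 - a.b).
    w: optional per-dimension weights in [0, 1] applied to the cross
    term (an assumed weighting, not the patent's formula)."""
    if w is None:
        w = [1.0] * len(a)
    dot = sum(wi * x * y for wi, x, y in zip(w, a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

def overlap_degree(region_a, region_b):
    """Each region: dict with 'center' (x, y), radius 'r' and 'points',
    in planar coordinates (e.g. projected meters). Returns the share of
    the smaller region's stay points lying inside the other region."""
    if len(region_a["points"]) <= len(region_b["points"]):
        small, other = region_a, region_b
    else:
        small, other = region_b, region_a
    inside = sum(1 for p in small["points"]
                 if math.dist(p, other["center"]) <= other["r"])
    return inside / len(small["points"])
```

Identical frequency vectors give a coefficient of 1 and vectors with no common positions give 0; unlike cosine similarity, scaling one vector changes the result, which is how visit frequency enters the measure.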
Second aspect: semantic position similarity. The semantic information of a staying area is given as $sem = (\langle c_1, f_1\rangle, \langle c_2, f_2\rangle, \ldots, \langle c_n, f_n\rangle)$, $n \ge 1$, where $f_i$ is the probability of $c_i$, so that $\sum_{i=1}^{n} f_i = 1$. Comparing whether the semantic information of two staying areas is the same is analogous to judging whether two geographic positions are the same: the degree of similarity between the two must be considered. Since sem contains a probability distribution $(f_1, f_2, \ldots, f_n)$ over the semantic information, the KL distance is used to measure the distance between two such distributions.
In probability theory and information theory, the KL distance (Kullback-Leibler divergence) measures the difference between two probability distributions over the same event space. Given the probability distributions fa(x) and fb(x) of the semantic information sets of a staying area of person A and of person B respectively, the KL distance between fa(x) and fb(x) is:
$$D_{KL}(fa \,\|\, fb) = \sum_{x} fa(x) \log \frac{fa(x)}{fb(x)}$$
The KL distance is not symmetric, i.e. $D_{KL}(fa \,\|\, fb) \ne D_{KL}(fb \,\|\, fa)$, so it is not a true metric or distance. The JS distance (Jensen-Shannon divergence) is a symmetric improvement of the KL distance; with base-2 logarithms it takes values in the closed interval [0, 1]. The formula is as follows:
$$JS(fa \,\|\, fb) = \frac{1}{2} D_{KL}\!\left(fa \,\Big\|\, \frac{fa + fb}{2}\right) + \frac{1}{2} D_{KL}\!\left(fb \,\Big\|\, \frac{fa + fb}{2}\right)$$
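A short sketch of the two divergences, using the standard definitions with base-2 logarithms so that the JS value stays within [0, 1]. The function names are illustrative, and both distributions are assumed to be lists over the same ordered set of categories.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits.
    Terms with p_i = 0 contribute 0; q_i must be > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and within [0, 1] for log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Two staying areas described over the same three categories (made-up values).
area_a = [0.7, 0.2, 0.1]
area_b = [0.1, 0.2, 0.7]
d_ab = js_divergence(area_a, area_b)
d_ba = js_divergence(area_b, area_a)
```

Because the mixture m is positive wherever either input is positive, the JS divergence stays finite even when the two areas share no category, in which case it reaches its maximum of 1.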
If the JS distance between the semantic information of a staying area in semA and that of a staying area in semB falls within the threshold δ, where δ is the distance threshold for the semantic information of the two and semA and semB are the semantic information sets of all staying areas of person A and person B respectively, then the semantic information of the two areas is similar.
The semantic position similarity of two persons is computed in the same way as the geographic position similarity, using the Tanimoto coefficient. The formula is as follows: [formula not reproduced in this text]
Here sa and sb are the semantic information frequency vectors of the two persons, and w is the vector formed from the JS distances described above.
Given the geographic position similarity and the semantic position similarity, the similarity of two persons is defined as the weighted sum of the two:
$$sim(A, B) = \alpha \cdot sim_{loc}(A, B) + (1 - \alpha) \cdot sim_{sem}(A, B) \qquad (6)$$
where α is a value in the interval [0, 1] that sets the balance between the two terms; the semantic similarity receives the weight 1 − α.
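Formula (6) amounts to a linear blend of the two similarities, as the small sketch below shows. The function name and the input values are illustrative only.

```python
def combined_similarity(sim_loc, sim_sem, alpha=0.3):
    """Formula (6): alpha weights the geographic similarity,
    (1 - alpha) weights the semantic similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_loc + (1.0 - alpha) * sim_sem

# Made-up inputs: strong geographic overlap, weak semantic overlap.
example = combined_similarity(0.8, 0.2, alpha=0.3)  # 0.3*0.8 + 0.7*0.2
```

With α = 0.3, as used in the detailed description, semantic similarity dominates, so two persons who never visit the same places can still score as similar if the places they visit have similar semantics.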
Step 3.2: population clustering
Clustering based on shared nearest neighbors (SNN) is adopted. It rests on the key concept of SNN similarity, which is the number of common items among the k nearest neighbors of two objects. Owing to this property, SNN clustering handles noise and outliers well, copes with clusters of different sizes, shapes and densities, and is particularly good at finding tight clusters of strongly related objects.
Group clustering proceeds in three steps. First, an SNN proximity matrix is built from the personnel feature data and the similarity measure. Second, an SNN similarity graph is built from the proximity matrix. Third, all connected components of the similarity graph are found; each connected component is a cluster, clusters containing a single point are discarded, and each remaining cluster is a group. With a reasonable number of nearest neighbors k and SNN similarity threshold γ, closely related key groups are found effectively within the population.
In the proximity matrix construction algorithm, the k nearest neighbors of each person are computed first, and then the SNN similarity between persons. If the number of shared nearest neighbors of two persons exceeds the threshold eps, the two persons are connected in the graph, so the entry of the proximity matrix corresponding to their IDs is set to 1; this is repeated for all persons until the SNN proximity matrix is complete. The graph is then built from the SNN proximity matrix and represented as an array of adjacency lists. The adjacency list is a common storage structure for graphs: all vertices adjacent to a vertex are stored in a linked list pointed to by the array element for that vertex. In the SNN similarity graph construction algorithm, adding an edge between v1 and v2 means adding v1 to the adjacency list of v2 and v2 to the adjacency list of v1; once all edges have been added, the graph is fully built. Finally, the connected component algorithm finds all connected components of the graph with depth-first search, whose time and space costs are proportional to V + E, after which connectivity queries on the graph are answered in constant time.
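The three steps can be sketched compactly as below. This is an illustrative sketch rather than the patent's implementation: it follows the rule as worded in the text (an edge whenever the shared neighbor count exceeds the threshold), builds the adjacency lists directly instead of materializing the 0/1 proximity matrix, and uses an iterative depth-first search; all names are assumptions.

```python
def snn_clusters(sim, k, gamma):
    """sim: symmetric n x n similarity matrix (list of lists).
    k: number of nearest neighbors; gamma: SNN similarity threshold.
    Returns clusters (sorted index lists) with at least two members."""
    n = len(sim)
    # Step 1: k nearest neighbors of each person (most similar first, self excluded).
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sim[i][j], reverse=True)
        knn.append(set(order[:k]))
    # Step 2: SNN similarity graph as adjacency lists; an edge joins two
    # persons whose shared neighbor count exceeds gamma.
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if len(knn[i] & knn[j]) > gamma:
                adj[i].append(j)
                adj[j].append(i)
    # Step 3: connected components by iterative depth-first search.
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for u in adj[v]:
                if not seen[u]:
                    seen[u] = True
                    stack.append(u)
        if len(comp) > 1:  # single-point clusters are treated as noise
            clusters.append(sorted(comp))
    return clusters
```

Note that every person always has k neighbors, however dissimilar, so a threshold that is too low relative to k can link unrelated persons; this is one reason k and γ need to be tuned together.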
Drawings
FIG. 1: System flow chart.
FIG. 2: Trajectory map of a person.
FIG. 3: Stay point map of a person.
FIG. 4: Staying area map of a person.
FIG. 5: Semantic information diagram of a person.
FIG. 6: Stay point positions of persons 000 and 003.
FIG. 7: Stay point positions of persons 007 and 036.
FIG. 8: Stay point positions of persons 006 and 023.
FIG. 9: Group clustering flow chart.
FIG. 10: Silhouette index as a function of k.
FIG. 11: Dunn index as a function of k.
FIG. 12: Similarity matrix when k = 12 or k = 13.
FIG. 13: Number of clusters as a function of k.
FIG. 14: Similarity matrix when k = 15.
FIG. 15: Similarity matrix when k = 19.
FIG. 16: Similarity matrix when k = 21.
Detailed Description
The invention is explained and illustrated below with reference to the accompanying drawings.
The data set used is GeoLife, an open project from Microsoft that collected GPS trajectory data from 182 volunteers over five years (April 2007 to August 2012). The data set contains 17621 trajectories with a total distance of 1292951 km and a total duration of 50176 hours. Each trajectory point contains a timestamp, latitude, longitude and altitude. The trajectories were recorded by different GPS devices, and 91.5% of them are sampled very densely, with one point every 5 seconds or every 5-10 meters. The data set records a wide range of outdoor activities, including not only routines such as going home and going to work but also entertainment and sports activities such as shopping, sightseeing, dining, hiking and cycling. Although the data are spread over more than 30 cities in China and even some cities in the United States and Europe, most of the data lie in Haidian District, Beijing.
The POI data set was collected from Amap (Gaode Map). It contains 156500 objects in Haidian District, Beijing, and each object has a name, address, category and coordinates in three coordinate systems.
Step one: extract the staying areas of each person and the frequency with which the person visits each staying area from the person's trajectory data.
FIG. 2 shows the movement trajectories of a student over several days; each color marks the trajectory of one day. Stay points are extracted with the single-trajectory stay point extraction method, and the person's stay points are shown in FIG. 3. The SC algorithm is then applied to all of the person's stay points to divide them into staying areas; the result is shown in FIG. 4.
Step two: further extract the semantic information of each area based on the extracted staying areas.
The semantic information of the student's staying areas is computed with the LDA topic model; part of it is shown in FIG. 5.
Step three: cluster the population with a data mining method according to personnel behavior rules and feature similarity, and finally identify key special groups within the target population.
Step 3.1: similarity measure
8 persons with clear trajectory characteristics are extracted from the data set, and the similarity between the persons when α =0.3 is compared according to formula (6), as shown in the following table:
Person similarities (α = 0.3)
Person  000      003      006      007      023      036      041      065
000     1.00000  0.53124  0.03872  0.02251  0.01670  0.09282  0.00858  0.01253
003     0.53124  1.00000  0.01927  0.02097  0.00663  0.04312  0.00531  0.00678
006     0.03872  0.01927  1.00000  0.01555  0.31987  0.04317  0.15063  0.11027
007     0.02251  0.02097  0.01555  1.00000  0.22439  0.31128  0.04005  0.04005
023     0.01670  0.00663  0.31987  0.22439  1.00000  0.01172  0.07134  0.04878
036     0.09282  0.04312  0.04317  0.31128  0.01172  1.00000  0.03154  0.04597
041     0.00858  0.00531  0.15063  0.04005  0.07134  0.03154  1.00000  0.15481
065     0.01253  0.00678  0.11027  0.04005  0.04878  0.04597  0.15481  1.00000
The table reveals several pairs of highly similar persons: 000 and 003, 006 and 023, and 007 and 036. The stay points of 000 and 003 are distributed as shown in FIG. 6. The geographic positions visited by the two persons clearly overlap to a large extent, so their geographic position similarity is high; and because semantic information is derived from the geographic areas, their semantic similarity is also high. Their overall similarity of 0.53 is therefore as expected. The stay points of 007 and 036 are distributed as shown in FIG. 7. As with 000 and 003, the geographic positions are highly similar, and so is the semantic information derived from them, so the overall similarity of the two persons is high. The stay points of 006 and 023 are distributed as shown in FIG. 8. The staying areas of the two persons hardly overlap at all, yet their similarity is 0.32. By the definition of similarity given earlier, this means that the semantic similarity of the two persons is high. This is indeed the case: the dense parts of the two persons' stay points lie at Beihang University and Minzu University of China respectively, and most POIs near both locations belong to the science, education and culture services category. This shows that taking semantic similarity into account finds similar persons who cannot be found by geographic position similarity alone.
Step 3.2: population clustering
The group clustering process is shown in FIG. 9. Taking the trajectory data of all 181 persons in GeoLife as an example, the Dunn index and the Silhouette index are used as clustering evaluation criteria.
Dunn index (standard form):
$$Dunn = \min_{i}\left\{\min_{j \ne i}\left(\frac{\min_{x \in C_i,\, y \in C_j} d(x, y)}{\max_{k}\left\{\max_{x, y \in C_k} d(x, y)\right\}}\right)\right\}$$
where $C_i$ denotes the i-th cluster and $d(x, y)$ denotes the distance between x and y. The index rewards small intra-cluster distances and large inter-cluster distances, so a larger value indicates a better clustering.
Silhouette index (standard form):
$$S = \frac{1}{NC}\sum_{i}\left\{\frac{1}{n_i}\sum_{x \in C_i}\frac{b(x) - a(x)}{\max[b(x), a(x)]}\right\}$$
where NC is the number of clusters and $n_i$ is the number of points in $C_i$. $a(x)$ is the average distance from object x to all other objects in its own cluster. $b(x)$ is obtained by computing, for every cluster that does not contain x, the average distance from x to all objects in that cluster, and taking the minimum over those clusters.
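Both indices can be computed directly from the cluster lists and a distance function, as in the sketch below. It uses the standard textbook definitions (cluster diameter as the largest pairwise distance within a cluster) and assumes at least two clusters with at least two members each, which matches the removal of single-point clusters; the names are illustrative.

```python
def dunn_index(clusters, d):
    """Smallest inter-cluster distance divided by the largest cluster diameter."""
    inter = min(d(x, y)
                for i, ci in enumerate(clusters)
                for cj in clusters[i + 1:]
                for x in ci for y in cj)
    diameter = max(d(x, y) for c in clusters for x in c for y in c)
    return inter / diameter

def silhouette_index(clusters, d):
    """Mean over clusters of the mean silhouette value of their members."""
    total = 0.0
    for i, ci in enumerate(clusters):
        s_sum = 0.0
        for xi, x in enumerate(ci):
            # a(x): mean distance to the other members of x's own cluster.
            a = sum(d(x, y) for yi, y in enumerate(ci) if yi != xi) / (len(ci) - 1)
            # b(x): smallest mean distance to the members of any other cluster.
            b = min(sum(d(x, y) for y in cj) / len(cj)
                    for j, cj in enumerate(clusters) if j != i)
            s_sum += (b - a) / max(a, b)
        total += s_sum / len(ci)
    return total / len(clusters)

# Toy example on a line: two tight, well separated clusters.
toy = [[0, 1], [10, 11]]
dist = lambda x, y: abs(x - y)
toy_dunn = dunn_index(toy, dist)
toy_sil = silhouette_index(toy, dist)
```

When working from the person similarity of formula (6), a distance such as 1 − sim(A, B) would be one possible choice for d, though the text does not state which distance was used.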
FIG. 10 and FIG. 11 show how the Silhouette index and the Dunn index vary with the number of nearest neighbors k. The number of nearest neighbors k ranges over [10, 30], which is the most representative interval; values of k for which the evaluation indices are not valid are omitted, and the SNN similarity threshold is γ = 10. As the figures show, both evaluation criteria reach their maximum when k is 12 or 13. The clustering results for these two values of k are identical: two clusters, one containing persons 032 and 044 and the other containing persons 151 and 162. The similarity matrix is shown in FIG. 12.
This means that two groups of highly similar persons, each containing two persons, were found among the 182, with all remaining persons treated as noise points. This is the best clustering result according to the evaluation criteria.
FIG. 13 shows how the number of clusters varies with k. For k = 15, k = 19 and k = 21, the similarity matrices of the clustering results are shown in FIG. 14, FIG. 15 and FIG. 16, giving 4, 11 and 8 clusters respectively. Although none of these three results is optimal under the evaluation criteria, they are still fairly reasonable. This indicates that the method yields reasonable clustering results for different numbers of clusters and can therefore effectively identify highly similar groups within a large population.

Claims (2)

1. A group identification method based on a personnel behavior rule and a data mining method is characterized in that: the method comprises the following steps:
step one: extracting the staying areas and the frequency with which the person visits each staying area from the trajectory data of the person;
step 1.1: extracting the stay points of a single trajectory; a stay point represents a geographic position where the person stays for a period of time, and each stay point extracted from the trajectory of the person is associated with a real geographic position and can therefore reflect the activity of the person to some extent; a single trajectory is defined as $T = (p_1, p_2, \ldots, p_n)$, where $p_i = (lat_i, lon_i, t_i)$, $0 \le i \le n$, $(lat_i, lon_i)$ represents the latitude and longitude at location point $i$, and $t_i$ represents the time at location point $i$;
given a trajectory sequence $t = (p_i, \ldots, p_{i+m})$, if $distance(p_i, p_x) \le \theta_d$, $|t_i - t_x| \ge \theta_t$, $i \le x \le i+m$, where $p_x$ represents the $x$-th point of the trajectory sequence, $m$ is an integer from 0 to $n-i$, and $\theta_d$ and $\theta_t$ are the geographic distance threshold and the time threshold respectively, then $p(lat, lon)$ is a stay point, where [formula defining lat and lon not reproduced in this text];
step 1.2: the person has more stay points in frequently visited areas and fewer stay points in rarely visited areas; applying the DBSCAN algorithm here has high time complexity and requires many input parameters, so a simple clustering algorithm (SC) is designed, which is fast and needs only one input parameter, namely a distance threshold τ; the stay points are traversed and each stay point is assigned to a cluster whose distance from it is less than τ, and if no cluster lies within τ of the point, the point becomes a new cluster;
each cluster is a staying area, denoted [notation not reproduced in this text], comprising the set of all points in the staying area, the center (lat, lon) of that point set, and the radius r of the staying area;
step two: further extracting the semantic information of each area based on the extracted staying areas;
step 2.1: sometimes the relationship between people cannot be judged accurately from geographic position information alone, and the semantic information of the staying area is also needed; a POI (Point of Interest) describes the spatial and attribute information of a geographic entity; in most cases a staying area does not carry a single semantic meaning, so the category information in the staying area cannot simply be reduced to one class, and instead several categories and their proportions are recorded: $sem = (\langle catg_1, freq_1\rangle, \langle catg_2, freq_2\rangle, \ldots, \langle catg_n, freq_n\rangle)$, $n \ge 1$; sem represents the semantic information of the staying area, and $\langle catg_1, freq_1\rangle$ represents the category of the first piece of semantic information and the frequency with which people visit the geographic positions carrying that meaning;
modeling the semantic information of the staying areas with an LDA topic model, in which the POI information of a staying area is treated as a document, the semantic information of the staying area as a topic, and each POI as a word; the model is first trained with the POI information of all staying areas of the persons as input data, and the trained model is then used to infer the semantic information of each staying area;
after the semantic information is extracted, the staying area is redefined as [notation not reproduced in this text], in which the added component is the representative semantic information within the circle of radius r around the staying area;
step 2.2: removing meaningless semantic information;
the semantic information set of person A is (<residential area, 150>, <café, 5>, <gym, 45>), the semantic information set of person B is (<residential area, 200>, <scientific research institution, 59>, <concert hall, 3>), and the two items in each pair represent the semantic position information and the frequency of visits to that position; the "residential area" item carries a large weight in both semantic information sets, yet it has no practical meaning, or even acts as an interference term, when the similarity of the two sets is compared, and the true similarity of A and B is very low once the interference term is removed;
in general, "residential area" is semantic information that everyone shares, the trajectory semantic information of every person contains it, and its distinguishing features are a high visit frequency and a fixed stay period; the method for removing meaningless semantic information comprises the following steps:
1) looping over the semantic information items, and judging from the area's semantic information whether the area may be a residential area; if so, going to 2), otherwise going to 4);
2) judging whether the distribution of the average stay times of all stay points in the staying area is consistent with a residential area; if so, going to 3), otherwise going to 4);
3) deleting the item from the semantic information set;
4) exiting the loop;
step three: carrying out group clustering by using a data mining method according to the personnel behavior rules and the characteristic similarity, and finally identifying key special groups from target groups;
step 3.1: similarity measure
Calculating similarity, namely considering the similarity of a geographical position and the similarity of a semantic position from two aspects;
in a first aspect, geographic location similarity; the expanded Tanimoto coefficient of the cosine similarity is adopted to compare the similarity of two persons, and the expanded Tanimoto coefficient is different from the cosine similarity and takes the influence of frequency and vector length into consideration; given person A and person B, the two-person geographic location frequency vectors are la and lb, respectively, and are expressed as:
when judging whether two geographic locations are the same, the error of the positioning equipment must be allowed for, so the positional relation of the two locations is judged from the degree of overlap of the staying points in the two location areas; the degree of overlap, or similarity, of two staying areas is defined as the ratio of the number of staying points of the area containing fewer staying points that fall inside the intersection of the two areas to the total number of staying points in that area; this similarity is then added to the Tanimoto coefficient as a weight to form a new weighted geographic location similarity measure; the formula is as follows:
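The two ingredients described above, the extended Tanimoto coefficient and the overlap degree of two staying areas, can be sketched separately in Python. This is an illustration under stated assumptions, not the weighted formula of the patent, which is not reproduced in this text: the function names are hypothetical, and each staying area is assumed to be the axis-aligned bounding box of its staying points, which the source does not specify.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def tanimoto(a: Sequence[float], b: Sequence[float]) -> float:
    """Extended Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def bounding_box(points: List[Point]) -> Tuple[float, float, float, float]:
    """Assumed area shape: the smallest axis-aligned box around the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def overlap_degree(area_a: List[Point], area_b: List[Point]) -> float:
    """Share of the smaller area's staying points lying in the intersection."""
    if not area_a or not area_b:
        return 0.0
    smaller = area_a if len(area_a) <= len(area_b) else area_b
    ax0, ay0, ax1, ay1 = bounding_box(area_a)
    bx0, by0, bx1, by1 = bounding_box(area_b)
    # Intersection rectangle of the two boxes.
    x0, y0 = max(ax0, bx0), max(ay0, by0)
    x1, y1 = min(ax1, bx1), min(ay1, by1)
    if x0 > x1 or y0 > y1:
        return 0.0  # the two areas do not intersect
    inside = sum(1 for x, y in smaller if x0 <= x <= x1 and y0 <= y <= y1)
    return inside / len(smaller)
```

How the overlap degree enters the coefficient as a weight is left open here on purpose, since that depends on the formula of the patent.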
in the second aspect, semantic location similarity: the semantic information within a staying area is given as $sem = (\langle c_1, f_1\rangle, \langle c_2, f_2\rangle, \ldots, \langle c_n, f_n\rangle)$, $n \ge 1$, where $f_i$ is the probability of $c_i$, so that $\sum_{i=1}^{n} f_i = 1$; comparing whether the semantic information of two staying areas is the same, like judging whether two geographic locations are the same, must consider the degree of similarity of the two staying areas; since sem contains a probability distribution $(f_1, f_2, \ldots, f_n)$ over the semantic information, the KL distance is used to measure the distance between two such probability distributions;
in probability theory and information theory, the KL distance (Kullback-Leibler divergence) measures the difference between two probability distributions over the same event space; given that the probability distributions of the semantic information sets of person A and person B in a staying area are $f_a(x)$ and $f_b(x)$ respectively, the KL distance between $f_a(x)$ and $f_b(x)$ is expressed as:
$$D_{KL}(f_a \| f_b) = \sum_{x} f_a(x) \log \frac{f_a(x)}{f_b(x)}$$
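the KL distance is not symmetric, i.e. $D_{KL}(f_a \| f_b) \neq D_{KL}(f_b \| f_a)$, so it is not a true metric or distance; the JS distance (Jensen-Shannon divergence) is a symmetric improvement of the KL distance and takes values in the closed interval [0,1] when logarithms are taken to base 2; the formula is as follows:
$$D_{JS}(f_a \| f_b) = \frac{1}{2} D_{KL}(f_a \| m) + \frac{1}{2} D_{KL}(f_b \| m), \qquad m = \frac{1}{2}(f_a + f_b)$$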
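A short Python sketch may make the relation between the KL distance and the JS distance concrete. It uses the standard definitions with base-2 logarithms, which is what bounds the JS value to [0,1]; the function names and the helper `align`, which puts two semantic information sets on the same event space, are illustrative assumptions rather than names from the patent.

```python
import math
from typing import Dict, List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence D(p || q) in bits; infinite if q is 0 where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # a zero-probability term contributes nothing
        if qi == 0.0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, and in [0, 1] with log base 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def align(
    sem_a: Dict[str, float], sem_b: Dict[str, float]
) -> Tuple[List[float], List[float]]:
    """Put two {category: probability} sets on one shared, ordered key list."""
    keys = sorted(set(sem_a) | set(sem_b))
    return [sem_a.get(k, 0.0) for k in keys], [sem_b.get(k, 0.0) for k in keys]
```

Because the mixture m is positive wherever either input is positive, the JS value stays finite even when the two sets share no category, a case in which the plain KL distance is infinite.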
if the JS distance between the semantic information of a staying area of person A and that of a staying area of person B is within δ, the semantic information of the two areas is regarded as similar, where δ is the distance threshold for the semantic information of the two, and semA and semB are the semantic information sets of all the staying areas of person A and person B respectively;
the semantic location similarity of two persons is calculated in the same way as the geographic location similarity, using the Tanimoto coefficient; the formula is as follows:
where sa and sb are the semantic information frequency vectors of the two persons respectively, and w is the vector formed from the JS distances mentioned above;
with the geographic location similarity and the semantic location similarity available, the similarity of two persons is defined as the weighted sum of the two; the formula is as follows:
$$sim(A,B) = \alpha \cdot sim_{loc}(A,B) + (1-\alpha) \cdot sim_{sem}(A,B) \qquad (6)$$
where α is a value in the interval [0,1]; it sets the weight of the geographic location similarity, and 1−α is the weight of the semantic location similarity;
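Formula (6) is a plain convex combination, which a few lines of Python can illustrate. The function name `combined_similarity` and the range check are assumptions added for the example; the two input similarities are taken as already computed.

```python
def combined_similarity(sim_loc: float, sim_sem: float, alpha: float) -> float:
    """Formula (6): alpha * sim_loc + (1 - alpha) * sim_sem, alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in the interval [0, 1]")
    return alpha * sim_loc + (1.0 - alpha) * sim_sem
```

Setting the weight to 1 uses only the geographic location similarity and setting it to 0 uses only the semantic location similarity, so the parameter slides the measure between the two views.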
step 3.2: group clustering
Clustering based on shared nearest neighbors (SNN) is adopted; its central concept is the SNN similarity, which is the number of items that the k-nearest-neighbor lists of two objects have in common; by its nature, SNN clustering handles noise and outliers well, can handle clusters of different sizes, shapes and densities, and is particularly good at finding compact clusters of strongly related objects;
the group clustering is carried out in three steps: first, an SNN proximity matrix is constructed from the personnel feature data and the similarity measurement formula; second, an SNN similarity graph is constructed from the proximity matrix; third, all connected components of the similarity graph are found, each connected component being a cluster, the clusters containing only one point are removed, and the remaining clusters are taken as groups; by setting a reasonable number of nearest neighbors k and a reasonable SNN similarity threshold γ, the closely related key groups within the population are effectively found.
2. The group identification method based on the personnel behavior rule and the data mining method as claimed in claim 1, wherein: in the proximity-matrix construction algorithm, the k nearest neighbors of each person are calculated first, and then the SNN similarity between persons is calculated; if the number of shared nearest neighbors of two persons exceeds a threshold eps, the two persons are connected in the graph, so the positions corresponding to the IDs of the two persons in the proximity matrix are marked as 1; this is repeated for all persons until the construction of the SNN proximity matrix is complete; then a graph is constructed from the SNN proximity matrix and represented as an array of adjacency lists, the adjacency list being one of the common storage structures for graphs, in which all vertices adjacent to a vertex are stored in a linked list pointed to by the array element corresponding to that vertex; in the SNN similarity graph construction algorithm, adding an edge between v1 and v2 means adding v1 to the adjacency list of v2 and v2 to the adjacency list of v1, and this continues until all edges have been added and the whole graph is built; finally, the connected-component finding algorithm searches for all connected components of the graph, using depth-first search to explore the connectivity of the graph, because the time and space required by depth-first search are proportional to V + E, and connectivity queries on the graph can afterwards be answered in constant time.
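The three clustering steps described above can be sketched in Python as follows. This is a minimal illustration under assumptions, not the claimed implementation: the input is taken to be a precomputed symmetric similarity matrix `sim`, the function names are hypothetical, ties among equally similar neighbors are broken by index order, and the proximity matrix is folded directly into the adjacency lists instead of being stored separately.

```python
from typing import List, Set


def knn_sets(sim: List[List[float]], k: int) -> List[Set[int]]:
    """For each person, the indices of the k most similar other persons."""
    n = len(sim)
    result = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )
        result.append(set(others[:k]))
    return result


def snn_clusters(sim: List[List[float]], k: int, eps: int) -> List[List[int]]:
    """Cluster persons by shared nearest neighbors; singletons are dropped."""
    n = len(sim)
    neighbours = knn_sets(sim, k)

    # Steps 1 and 2: link two persons when they share more than eps neighbors.
    adj: List[List[int]] = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if len(neighbours[i] & neighbours[j]) > eps:
                adj[i].append(j)
                adj[j].append(i)

    # Step 3: connected components by iterative depth-first search.
    seen = [False] * n
    clusters: List[List[int]] = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        if len(component) > 1:  # remove clusters with only one point
            clusters.append(sorted(component))
    return clusters
```

An explicit stack is used for the depth-first search so that large graphs do not hit Python's recursion limit. Recording a component label for every vertex during the search would give the constant-time connectivity queries mentioned in the text.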
CN201710862301.9A 2017-09-21 2017-09-21 Group identification method based on personnel behavior rule and data mining method Active CN107633067B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710862301.9A CN107633067B (en) 2017-09-21 2017-09-21 Group identification method based on personnel behavior rule and data mining method

Publications (2)

Publication Number Publication Date
CN107633067A (en) 2018-01-26
CN107633067B (en) 2020-03-27

Family

ID=61102253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710862301.9A Active CN107633067B (en) 2017-09-21 2017-09-21 Group identification method based on personnel behavior rule and data mining method

Country Status (1)

Country Link
CN (1) CN107633067B (en)

Also Published As

Publication number Publication date
CN107633067B (en) 2020-03-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant