CN107341239B

CN107341239B - Cluster data analysis method and device

Info

Publication number: CN107341239B
Application number: CN201710541642.6A
Authority: CN
Inventors: 程良伦; 傅应龙; 王卓薇
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2017-07-05
Filing date: 2017-07-05
Publication date: 2020-08-07
Anticipated expiration: 2037-07-05
Also published as: CN107341239A

Abstract

The application discloses a cluster data analysis method and a cluster data analysis device, wherein the method comprises the steps of selecting mobile cluster object data corresponding to time points which are separated by preset time intervals in a preset time period; establishing an abnormal data dynamic table; classifying the mobile cluster object data of each time point and the abnormal data points in the abnormal data dynamic table to obtain an initial classification result, and storing the unclassified mobile cluster object data serving as the abnormal data points in the abnormal data dynamic table; and analyzing the change of the initial classification result of each time point and the initial classification result of the previous time point from the first time point, and identifying the change condition of the initial classification result of each time according to the change condition to obtain the classification result. The abnormal data dynamic table capable of storing the unclassified data is established to store the abnormal data, so that the loss of useful data is avoided, and meanwhile, the abnormal data is also included in classification, so that the accuracy of the data analysis process is higher.

Description

Cluster data analysis method and device

Technical Field

The present application relates to the field of big data mobile data analysis, and in particular, to a cluster data analysis method and apparatus.

Background

With the spread of big data technology, big data applications have become common in daily life. One important application is targeted delivery: a data provider analyzes big data and pushes highly targeted content, such as advertisements and message notifications, to the most suitable objects. Meanwhile, the growth of mobile data, namely data containing the motion knowledge and position information of objects, makes it possible to sell products to objects in a more targeted way. Mobile data can also be used to study traffic congestion prediction and animal migration. However, when mining the patterns of moving objects from mobile data, the object data contain diverse types and the requirement for real-time analysis is high, which poses a challenge for mobile data pattern mining.

Pattern mining of mobile data is commonly applied to, for example, traffic management, logistics distribution, and crowd detection. These applications require analysis of cluster changes, and in particular of the nature of those changes: whether a cluster corresponding to a group of cars has simply disappeared or its members have migrated to other clusters, and whether a newly appearing cluster reflects new vehicles, the appearance of a new target cluster, or a change in the preferences of existing customers.

Therefore, the study of cluster changes analyzes how cluster data change within a period of time: the original data are first divided into classes so that clusters serve as the unit of study, and the changes of the clusters at different time points are then judged from the differences between the clusters. This is also the general method currently used for analyzing cluster data.

However, when the current analysis method is applied to a small amount of data, the error between the result and the real situation is small; when the amount of data increases, the pattern analysis result deviates greatly from the real situation and does not meet expectations.

Therefore, how to reduce the large error of the cluster data analysis method is a problem of great concern to those skilled in the art.

Disclosure of Invention

The invention aims to provide a cluster data analysis method and a cluster data analysis device.

In order to solve the above technical problem, the present application provides a cluster data analysis method, including:

selecting moving cluster object data corresponding to time points which are separated by preset time intervals in a preset time period;

establishing an abnormal data dynamic table;

classifying the moving cluster object data of each time point and abnormal data points in the abnormal data dynamic table to obtain an initial classification result, and storing the moving cluster object data which is not classified as the abnormal data points in the abnormal data dynamic table;

and analyzing the initial classification result of each time point and the change of the initial classification result of the time point before the time point from the first time point, and identifying the change condition of the initial classification result of each time according to the change condition to obtain the classification result.

Optionally, the method further includes:

determining the relation between the classes of each time point according to the classification result, and constructing a mobile cluster pattern tree;

and determining related mobile cluster frequent information according to the mobile cluster pattern tree.

Optionally, the identifier of the change condition specifically includes:

retention, merging, separation, expansion, contraction, disappearance and appearance.

Optionally, the creating an abnormal data dynamic table includes:

establishing the abnormal data dynamic table;

setting relevant processing parameters; wherein the processing parameters include a dynamic change time and an update time.

Optionally, the step of taking the moving cluster object data that is not classified in the classification as the abnormal data point and storing the abnormal data point into an abnormal data dynamic table further includes:

judging whether the existence time of the abnormal data points exceeds the updating time or not according to the processing parameters;

and if so, updating the abnormal data point.

The present application further provides a cluster data analysis device, the device includes:

the data selecting module is used for selecting the mobile cluster object data corresponding to the time points which are separated by the preset time interval in the preset time period;

the table building module is used for building an abnormal data dynamic table;

the initial classification module is used for classifying the moving cluster object data of each time point and abnormal data points in the abnormal data dynamic table to obtain an initial classification result, and storing the moving cluster object data which is not classified as the abnormal data points into the abnormal data dynamic table;

and the change identification module is used for analyzing the initial classification result of each time point and the change of the initial classification result of the time point before the time point from the first time point, and identifying the change condition of the initial classification result of each time according to the change condition to obtain the classification result.

Optionally, the device further includes:

the tree building module is used for determining the relation between the classes of each time point according to the classification result and building a mobile cluster mode tree;

and the mining module is used for determining the related mobile cluster frequent information according to the mobile cluster pattern tree.

Optionally, the table building module includes:

a table building unit for building the abnormal data dynamic table;

and a parameter setting unit for setting relevant processing parameters; wherein the processing parameters include a dynamic change time and an update time.

Optionally, the initial classification module further includes: an update unit, wherein the update unit comprises:

the time judging subunit is used for judging whether the existence time of the abnormal data points exceeds the updating time or not according to the processing parameters;

and the updating subunit is used for updating the abnormal data point when the existence time of the abnormal data point exceeds the updating time.

In existing cluster data analysis methods, all unclassified data are discarded during classification. For data spanning a time period, however, the unclassified abnormal data at the current moment benefit the classification result at the next moment. Discarding them therefore leads to a larger error in the analysis result, and the description of the real situation does not meet the expected requirement.

Therefore, the cluster data analysis method provided by the application comprises the steps of selecting moving cluster object data corresponding to time points which are separated by preset time intervals in a preset time period; establishing an abnormal data dynamic table; classifying the moving cluster object data of each time point and abnormal data points in the abnormal data dynamic table to obtain an initial classification result, and storing the moving cluster object data which is not classified as the abnormal data points in the abnormal data dynamic table; and analyzing the initial classification result of each time point and the change of the initial classification result of the time point before the time point from the first time point, and identifying the change condition of the initial classification result of each time according to the change condition to obtain the classification result.

The abnormal data dynamic table capable of storing the unclassified data is established to store the abnormal data, so that the loss of useful data is avoided, and meanwhile, the abnormal data is also included in classification, so that the accuracy of the data analysis process is higher. The application also provides a cluster data analysis device, which has the beneficial effects, and the details are not repeated herein.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flowchart of a cluster data analysis method provided in an embodiment of the present application;

FIG. 2 is a detailed flow chart of data analysis provided by an embodiment of the present application;

FIG. 3 is a partial flow diagram of a classification process provided by an embodiment of the present application;

FIG. 4 is a flow chart of an analysis mode provided by an embodiment of the present application;

FIG. 5 is a diagram of building a pattern tree according to an embodiment of the present application;

FIG. 6 is a flowchart of creating a dynamic table according to an embodiment of the present application;

FIG. 7 is a flowchart of updating a dynamic table according to an embodiment of the present application;

FIG. 8 is a block diagram of a cluster data analysis apparatus according to an embodiment of the present application;

FIG. 9 is a block diagram of a construction pattern tree provided by an embodiment of the present application;

FIG. 10 is a block diagram of a table building module provided in an embodiment of the present application.

Detailed Description

The core of the application is to provide a cluster data analysis method, and by establishing an abnormal data dynamic table, storing abnormal data and updating stored data, the method avoids larger analysis result errors caused by losing useful data, and improves the accuracy of the analysis method.

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Referring to fig. 1, fig. 1 is a flowchart of a cluster data analysis method according to an embodiment of the present disclosure.

The embodiment may include:

S100, selecting moving cluster object data corresponding to time points which are separated by preset time intervals in a preset time period;

S200, establishing an abnormal data dynamic table;

It should be noted that steps S100 and S200 do not depend on each other, so there is no required order of execution: step S200 may be executed before step S100, or the two steps may be executed at the same time, which is not limited herein.

The predetermined time period in step S100 is the time period to be analyzed, and depends on the actual situation under study. For example, to study the cluster data of vehicles on a certain road segment from 5 pm to 7 pm, a time period that contains this window should be selected. The selected period need not coincide exactly with the window: because changing data are studied, the changes at the beginning and the end of the window must also be observed, so a suitable margin is added before the start and after the end so that the data within the window can be analyzed comprehensively.

Meanwhile, the predetermined time interval is the spacing between consecutive sampling points within the time period, and can be determined by the situation being analyzed. An important parameter of the sampling is the number of sampling points in the time period: a large amount of data has to be analyzed, and each additional point increases the data volume to be analyzed, so an appropriate number of sampling points is needed to obtain an accurate result. For example, when studying the cluster data of vehicles on a certain road from 5 pm to 7 pm, it is common knowledge that traffic flow is heavy and vehicle speed is slow at this time, so the number of sampling points can be reduced appropriately. When studying the cluster data of vehicles on a certain road from 5 am to 7 am, traffic flow is light, vehicle speed is high, and the vehicles on the road change quickly, so the number of sampling points can be increased appropriately.
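The selection of time points described above can be illustrated with a minimal sketch. The function name sample_time_points, the margin parameter, and the concrete date and 30-minute values are assumptions chosen for illustration only; the application does not prescribe them.

```python
from datetime import datetime, timedelta


def sample_time_points(start, end, interval, margin=timedelta(0)):
    """Return time points spaced by `interval` over [start - margin, end + margin].

    `margin` models the extra time reserved before and after the window so
    that changes at its beginning and end can also be observed.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    points = []
    current = start - margin
    last = end + margin
    while current <= last:
        points.append(current)
        current += interval
    return points


# Hypothetical evening window, 17:00 to 19:00, sampled every 30 minutes
# with a 30-minute margin on each side.
points = sample_time_points(
    datetime(2017, 7, 5, 17, 0),
    datetime(2017, 7, 5, 19, 0),
    interval=timedelta(minutes=30),
    margin=timedelta(minutes=30),
)
```

A larger interval yields fewer sampling points, which suits slow, heavy traffic; a smaller interval yields more points, which suits fast-changing traffic.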

After the time points are determined, the moving cluster object data corresponding to each time point are selected. The movement data of one moving object at a given time point are represented as the record O:

O = (oid, p(x, y), t)

where oid is a data type identifier, p(x, y) is the longitude and latitude of the moving object at time point t, x is the longitude, y is the latitude, and t is the time of the time point.

Define Ω(t) as the set of moving object data at time point t, with O ∈ Ω(t); Ω(t) is called the moving object location coordinate set.
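As a rough sketch of this data model, each record O can be held in a small named tuple and the sets Ω(t) built by grouping records by time point. The names MovingObject and build_snapshots, the integer time index, and the sample coordinates are hypothetical and not taken from the application.

```python
from collections import defaultdict
from typing import NamedTuple


class MovingObject(NamedTuple):
    oid: str   # identifier (the text calls it a data type identifier)
    x: float   # longitude
    y: float   # latitude
    t: int     # time point, here simply an index 0, 1, 2, ...


def build_snapshots(records):
    """Group records into Omega(t): a dict mapping each time point to its records."""
    omega = defaultdict(list)
    for record in records:
        omega[record.t].append(record)
    return dict(omega)


# Made-up sample records for illustration.
records = [
    MovingObject("car-1", 113.39, 23.04, 0),
    MovingObject("car-2", 113.40, 23.05, 0),
    MovingObject("car-1", 113.41, 23.04, 1),
]
omega = build_snapshots(records)
```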

The abnormal data dynamic table established in step S200 is a data table created for the data analysis that supports operations such as storing, modifying and deleting data. In this embodiment the dynamic table is named F-list.

S300, classifying the moving cluster object data of each time point and abnormal data points in the abnormal data dynamic table to obtain an initial classification result, and storing the moving cluster object data which is not classified as the abnormal data points in the abnormal data dynamic table;

It should be noted that the moving cluster object data may be classified with an existing method such as DBSCAN, KNN or K-means; the method may be selected according to the performance and accuracy requirements of the data analysis, which is not limited in this embodiment.

Unclassified data may appear during classification, and must be stored as abnormal data in the abnormal data dynamic table. Accordingly, the classification is applied to all data, that is, both the data of the time point being classified and the data in the abnormal data dynamic table.

Therefore, the method and the device have the advantages that the abnormal data dynamic table capable of storing unclassified data is established, the abnormal data is stored, the loss of useful data is avoided, meanwhile, the abnormal data is also contained in classification, and the accuracy of the data analysis process can be higher.
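Step S300 can be sketched as follows, assuming a DBSCAN-style density clustering (one of the methods named above) written in plain Python. The function names, the eps and min_pts parameters, and the sample coordinates are illustrative assumptions; the application does not fix a particular classifier or parameter values.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, -1 meaning unclassified."""
    labels = [None] * len(points)          # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                 # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


def classify_with_f_list(current, f_list, eps, min_pts):
    """Cluster the current points together with the stored abnormal points.

    Returns (clusters, new_abnormal), where new_abnormal holds the current
    points that stayed unclassified and should be stored in the F-list.
    """
    pool = list(current) + list(f_list)
    labels = dbscan(pool, eps, min_pts)
    clusters, new_abnormal = {}, []
    for index, (point, label) in enumerate(zip(pool, labels)):
        if label != -1:
            clusters.setdefault(label, []).append(point)
        elif index < len(current):
            new_abnormal.append(point)
    return clusters, new_abnormal


current = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
f_list = [(1.0, 0.0)]  # abnormal point kept from an earlier time point
clusters, new_abnormal = classify_with_f_list(current, f_list, eps=1.5, min_pts=3)
alone_clusters, alone_abnormal = classify_with_f_list(current, [], eps=1.5, min_pts=3)
```

In this toy input, the stored abnormal point supplies the third neighbor needed to form a cluster; without it, all three current points remain unclassified.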

S400, starting from the first time point, analyzing the change of the initial classification result of each time point and the initial classification result of the time point before the time point, and identifying the change condition of the initial classification result of each time according to the change condition to obtain a classification result.

The initial classification result obtained according to the above process is a classification result of each time point, and since the evolution mode of the cluster data object needs to be analyzed, the classification results of the data of each time point need to be related and analyzed to obtain a correlation relationship. Therefore, the initial classification result of each time point and the initial classification result of the previous time point of the time point need to be analyzed, and a classification category is obtained and a change condition is identified by correlating according to the initial classification results of the two time points.

In this embodiment, the change between two adjacent time points is judged using the Jaccard similarity, and each change is assigned to the corresponding change condition category and identified. The Jaccard similarity involves a confidence threshold: the change in the initial cluster classification results of adjacent time points is judged from the proportion of data at the later time point that is similar to the data at the earlier time point. This proportion needs to be determined empirically and is not limited herein.
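For reference, the standard Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below assumes that cluster membership is compared by object identifiers, which the application does not state explicitly; the identifiers are made up.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of identifiers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Two clusters at adjacent time points sharing 2 of 4 distinct members.
similarity = jaccard({"car-1", "car-2", "car-3"}, {"car-2", "car-3", "car-4"})
```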

The categories of change conditions are generally determined by the data being analyzed. The data usually correspond to a specific practical problem, from which the changes in the data and their categories can be roughly determined. For example, in a simple analysis the data generally only merge, separate, disappear and appear, and the change conditions can be divided into just these categories, which is not limited herein.

In this embodiment, the practical problem selected is the analysis of road traffic conditions, so the following seven change condition categories are selected: survives (retention), merged (merging), splits (separation), expands (expansion), shrinks (contraction), disappears (disappearance) and appears (appearance).
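One possible way to assign the seven labels is sketched below. It is an assumption-laden illustration, not the rule used in the application: instead of a single Jaccard threshold it uses overlap shares between clusters of adjacent time points, and the function name label_changes, the min_share value of 0.5, and the size-based survives/expands/shrinks test are all hypothetical.

```python
def label_changes(prev, curr, min_share=0.5):
    """Label each current cluster against the previous time point.

    prev, curr: dict mapping cluster name -> non-empty set of object ids.
    Returns (labels of current clusters, names of previous clusters that disappeared).
    """
    # parents[c]: previous clusters that sent at least min_share of their members into c
    parents = {
        c: [p for p, pm in prev.items() if len(pm & cm) / len(pm) >= min_share]
        for c, cm in curr.items()
    }
    # children[p]: current clusters drawing at least min_share of their members from p
    children = {
        p: [c for c, cm in curr.items() if len(pm & cm) / len(cm) >= min_share]
        for p, pm in prev.items()
    }
    labels = {}
    for c, cm in curr.items():
        if len(parents[c]) >= 2:
            labels[c] = "merged"
            continue
        if parents[c]:
            source = parents[c][0]
        else:
            source = next((p for p in prev if c in children[p]), None)
        if source is None:
            labels[c] = "appears"
        elif len(children[source]) >= 2:
            labels[c] = "splits"
        elif len(cm) > len(prev[source]):
            labels[c] = "expands"
        elif len(cm) < len(prev[source]):
            labels[c] = "shrinks"
        else:
            labels[c] = "survives"
    linked = {p for ps in parents.values() for p in ps}
    disappeared = [p for p in prev if not children[p] and p not in linked]
    return labels, disappeared


# Made-up member ids for illustration.
prev = {"C1": {1, 2, 3}, "C2": {4, 5}, "C3": {6, 7}, "C4": {8, 9}, "C9": {20, 21}}
curr = {"C1'": {1, 2, 3, 4, 5}, "C3'": {6, 7, 10}, "C4'": {8, 9}, "C6": {30, 31}}
labels, disappeared = label_changes(prev, curr)

split_labels, _ = label_changes(
    {"C1''": {1, 2, 3, 4, 5, 6}},
    {"C1'''": {1, 2, 3}, "C5": {4, 5, 6}},
)
```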

Referring to fig. 2, fig. 2 is a specific flowchart of data analysis provided in the embodiment of the present application.

Here the predetermined time period is denoted by T, the predetermined time interval by Δt, and the initial time point by t.

Referring to fig. 3, fig. 3 is a partial flowchart of a classification process according to an embodiment of the present disclosure.

The flow of part of the classification process is as follows. Because of space limitations the complete classification flowchart cannot be shown, so a partial flowchart is shown here as an example; the complete flowchart can be obtained by simply extending the partial one and is therefore not described in full herein.

The number of time points in the time period is set to 6 and the time interval is Δt; classification analysis is carried out at the 6 time points starting from t, namely t, t + Δt, t + 2Δt, t + 3Δt, t + 4Δt and t + 5Δt.

At t, the classified classes are denoted C1, C2, C3 and C4; all 4 classes are identified as appears (appearance), and the points that cannot be classified at this time are stored in the abnormal data dynamic table F-list.

At t + Δt, classification shows that C1 and C2 of the previous time point have merged into one class C1', which is therefore identified as merged (merging); C3' contains more cluster members than C3 and is identified as expands (expansion); C4 remains unchanged and is identified as survives (retention). The points that cannot be classified at this time continue to be stored in the abnormal data dynamic table F-list.

At t + 2Δt, C3' and C4 merge into one large class C3'', so C3'' is identified as merged. C1' combines with some data from the abnormal data dynamic table to form C1''; this is identified not as merged but as expands. The points that cannot be classified continue to be stored in the abnormal data dynamic table F-list.

At t + 3Δt, because the abnormal data dynamic table F-list became full at the previous time point t + 2Δt, the table is updated and the points that cannot be classified at this time are stored in it. C1''' and C5 both separate from C1'' of the previous time point, so both are identified as splits (separation); C3''' is smaller than C3'' of the previous time point and is identified as shrinks (contraction).

At t + 4Δt, the class descended from C1''' remains unchanged and is identified as survives; the class descended from C3''' is smaller than at the previous time point and is identified as shrinks (contraction); the data of C5 have disappeared completely, so C5 is identified as disappears (disappearance). The points that cannot be classified at this time continue to be stored in the abnormal data dynamic table F-list.

At t + 5Δt, the classes descended from C1 and C3 show no change from the previous time point and are identified as survives (retention).

Referring to fig. 4 and fig. 5, fig. 4 is a flowchart of pattern analysis provided in an embodiment of the present application, and fig. 5 is a pattern tree diagram provided in an embodiment of the present application.

Based on the above embodiment, this embodiment may further include:

S500, determining the relation between the classes of each time point according to the classification result, and constructing a mobile cluster pattern tree;

S600, determining the related mobile cluster frequent information according to the mobile cluster pattern tree.

The mobile cluster pattern tree is constructed from the change condition categories identified at each time point. Starting from the first empty node under the root, the classes of C1 at each time point are inserted in sequence to form the first branch, and the change condition at each step of the branch is marked. A second empty node is then inserted and a second branch is constructed from it; according to the classification result and the change conditions, C2 merges into C1 at the second time point, so this change and the merging process are marked in the tree. The remaining branches are constructed in the same way to form the complete pattern tree.

Then, in light of the actual situation, a suitable information mining method is selected to determine the frequent information of the related mobile clusters and obtain the frequently occurring associated movement patterns.

For example, for an actual road section, the time period from 5 pm to 7 pm on an overpass is selected. If the pattern tree analysis shows that merging (merged) and expansion (expands) occur frequently, the vehicle conditions of this time period can be characterized accordingly, which is of important guiding significance for traffic regulation.
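Steps S500 and S600 can be sketched roughly as below, assuming the tree is stored as nested dictionaries with one branch per initial class and that the frequent information is simply the count of each change label. Both choices, and the names build_pattern_tree and count_changes, are illustrative assumptions; the application leaves the mining method open. The branches reproduce the first three time points of the worked example above.

```python
from collections import Counter


def build_pattern_tree(branches):
    """Build a tree with one branch per initial class.

    branches: dict mapping an initial class to its time-ordered list of
    (class name, change label) steps.
    """
    root = {"name": "root", "change": None, "children": []}
    for steps in branches.values():
        parent = root
        for name, change in steps:
            node = {"name": name, "change": change, "children": []}
            parent["children"].append(node)
            parent = node
    return root


def count_changes(node, counts=None):
    """Count how often each change label occurs over all branch steps."""
    counts = Counter() if counts is None else counts
    if node["change"] is not None:
        counts[node["change"]] += 1
    for child in node["children"]:
        count_changes(child, counts)
    return counts


# Time points t, t + Δt and t + 2Δt of the worked example.
branches = {
    "C1": [("C1", "appears"), ("C1'", "merged"), ("C1''", "expands")],
    "C2": [("C2", "appears"), ("C1'", "merged")],
    "C3": [("C3", "appears"), ("C3'", "expands"), ("C3''", "merged")],
    "C4": [("C4", "appears"), ("C4", "survives"), ("C3''", "merged")],
}
tree = build_pattern_tree(branches)
counts = count_changes(tree)
```

Note that a merged class is counted once per branch that leads into it; a real mining step might deduplicate such nodes or apply a frequent pattern algorithm instead of plain counting.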

Referring to fig. 6, fig. 6 is a flowchart for establishing a dynamic table according to an embodiment of the present application.

Based on the foregoing embodiment, the creating of the abnormal data dynamic table in this embodiment may include:

S210, establishing the abnormal data dynamic table;

S220, setting relevant processing parameters; wherein the processing parameters include a dynamic change time and an update time.

After setting the relevant processing parameters to the abnormal data dynamic table, the abnormal data dynamic table is expressed as follows:

F-list(τ,θ)

where τ = T/n, n = 1, 2, 3, ..., denotes the period of time for which an abnormal data point should be stored; and θ = τ/n, n = 1, 2, 3, ..., denotes the existence sub-period of the abnormal data points selected for update.

The parameters can be set according to the data and the actual situation. Their values affect the amount of data scanned in subsequent classification and the accuracy of the result: if the values are too large, too much data exists at the same time, which increases the load of the classification scan and slows data processing; if the values are too small, useful data may be cleared too early, which increases the error of subsequent analysis. The parameter values are therefore not limited herein.

In this embodiment, τ is set to 3, that is, the dynamic table is full once it holds data from 3 time points and is then updated; θ is set to 2, that is, the update deletes the data stored at the first two time points.

Referring to fig. 7, fig. 7 is a flowchart of updating a dynamic table according to an embodiment of the present application.

Based on the foregoing embodiment, this embodiment may further include:

S321, judging whether the existence time of the abnormal data points exceeds the update time according to the processing parameters;

S322, if so, updating the abnormal data point.

Corresponding to the above embodiment, a corresponding determination process needs to be performed in the processing process, and when it is determined that the abnormal data point exceeds the update time, that is, the τ value, the data stored at the previous two time points are updated.

The data are updated by specifying an update time and performing the update operation once that time has elapsed, in order to avoid storing excessive redundant data in the abnormal data dynamic table, which would increase the amount of data scanned during classification and the machine load. The update operation may be a complete deletion or a partial deletion after comparison; alternatively, instead of being deleted, timed-out data may be moved to another table for subsequent use.

In the present embodiment, the deletion operation is selected for the data that has timed out, in order to reduce the amount of data that needs to be scanned each time, while reducing the load on the machine.
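The behavior of F-list(τ, θ) in this embodiment (τ = 3, θ = 2, update by deletion) might be sketched as follows. The class name FList, its methods, and the use of integer time point indices are assumptions for illustration; the application does not specify a data structure.

```python
from collections import OrderedDict


class FList:
    """Sketch of the abnormal data dynamic table F-list(tau, theta).

    tau:   number of time points whose abnormal data the table may hold
    theta: number of oldest time points deleted when the table is updated
    """

    def __init__(self, tau=3, theta=2):
        if not 0 < theta <= tau:
            raise ValueError("expected 0 < theta <= tau")
        self.tau = tau
        self.theta = theta
        self._batches = OrderedDict()  # time point index -> abnormal points

    def update(self):
        """Delete the data stored at the theta oldest time points."""
        for _ in range(min(self.theta, len(self._batches))):
            self._batches.popitem(last=False)

    def store(self, time_index, points):
        """Store unclassified points, updating first if the table is full."""
        if time_index not in self._batches and len(self._batches) >= self.tau:
            self.update()
        self._batches.setdefault(time_index, []).extend(points)

    def points(self):
        """All abnormal points currently held, oldest time point first."""
        return [p for batch in self._batches.values() for p in batch]


f_list = FList(tau=3, theta=2)
f_list.store(0, ["a"])
f_list.store(1, ["b"])
f_list.store(2, ["c"])          # table is now full (3 time points)
before = f_list.points()
f_list.store(3, ["d"])          # triggers the update, dropping time points 0 and 1
after = f_list.points()
```

Moving timed-out batches to another table instead of deleting them would only require changing the update method.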

The embodiment of the application provides a cluster data analysis method, and the abnormal data occurring in the classification process is stored by establishing an abnormal data dynamic table, so that the condition of losing useful data is avoided, and the accuracy of the analysis method is improved.

In the following, the cluster data analysis device provided in the embodiment of the present application is introduced, and the cluster data analysis device described below and the cluster data analysis method described above may be referred to correspondingly.

Referring to fig. 8, fig. 8 is a block diagram of a cluster data analysis apparatus according to an embodiment of the present disclosure.

The present embodiment provides a cluster data analysis device, which may include:

a data selecting module 100, configured to select moving cluster object data corresponding to time points spaced by a predetermined time interval within a predetermined time period;

the table building module 200 is used for building an abnormal data dynamic table;

an initial classification module 300, configured to classify the moving cluster object data at each time point and an abnormal data point in the abnormal data dynamic table to obtain an initial classification result, and store the non-classified moving cluster object data as the abnormal data point in the abnormal data dynamic table;

a change identification module 400, configured to analyze, starting from the first time point, a change of the initial classification result at each time point and the initial classification result at the time point before the time point, and perform change condition identification on the initial classification result at each time according to a change condition to obtain a classification result.

Referring to fig. 9, fig. 9 is a block diagram of constructing a pattern tree according to an embodiment of the present application.

Based on the above embodiment, this embodiment may further include:

a tree building module 500, configured to determine a relationship between classes at each time point according to the classification result, and build a mobile cluster pattern tree;

and the mining module 600 is configured to determine the frequent information of the relevant mobile cluster according to the mobile cluster pattern tree.

Referring to fig. 10, fig. 10 is a block diagram of a table building module according to an embodiment of the present disclosure.

Based on the above embodiments, the table building module 200 may include:

a table building unit 210 for building the abnormal data dynamic table;

and a parameter setting unit 220 for setting relevant processing parameters; wherein the processing parameters include a dynamic change time and an update time.

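Based on the above embodiment, the initial classification module 300 of this embodiment may further include an update unit, wherein the update unit may include:

a time judging subunit, configured to judge, according to the processing parameters, whether the existence time of the abnormal data points exceeds the update time;

and an updating subunit, configured to update the abnormal data point when the existence time of the abnormal data point exceeds the update time.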

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above provides a detailed description of a cluster data analysis method and apparatus provided by the present application. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims

1. A method for cluster data analysis, the method comprising:

establishing an abnormal data dynamic table;

starting from the first time point, analyzing the initial classification result of each time point and the change of the initial classification result of the time point before the time point, and identifying the change condition of the initial classification result of each time according to the change condition to obtain a classification result;

further comprising:

2. The method according to claim 1, wherein the identification of the change condition specifically includes:

3. The method of claim 2, wherein the establishing an abnormal data dynamic table comprises:

establishing the abnormal data dynamic table;

4. The method of claim 3, wherein storing the unclassified moving cluster object data as the abnormal data point into the abnormal data dynamic table further comprises:

and if so, updating the abnormal data point.

5. A cluster data analysis apparatus, the apparatus comprising:

the table building module is used for building an abnormal data dynamic table;

a change identification module, configured to analyze, starting from a first time point, a change of the initial classification result at each time point and the initial classification result at a time point before the time point, and perform change condition identification on the initial classification result at each time according to a change condition to obtain a classification result;

further comprising:

6. The apparatus of claim 5, wherein the table building module comprises:

a table building unit for building the abnormal data dynamic table

7. The apparatus of claim 6, wherein the initial classification module further comprises: an update unit, wherein the update unit comprises: