CN104820708B - Big data clustering method and device based on a cloud computing platform - Google Patents

Big data clustering method and device based on a cloud computing platform

Info

Publication number
CN104820708B
Authority
CN
China
Prior art keywords
data
hypergraph
class
clustering
edge
Prior art date
Legal status
Active
Application number
CN201510249032.XA
Other languages
Chinese (zh)
Other versions
CN104820708A (en)
Inventor
马泳宇
Current Assignee
Nanning First Station Network Technology Co Ltd
Original Assignee
Chengdu Rui Feng Science And Technology Ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Rui Feng Science And Technology Ltd filed Critical Chengdu Rui Feng Science And Technology Ltd
Priority to CN201510249032.XA priority Critical patent/CN104820708B/en
Publication of CN104820708A publication Critical patent/CN104820708A/en
Application granted granted Critical
Publication of CN104820708B publication Critical patent/CN104820708B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The present invention proposes a big data clustering method based on a cloud computing platform, comprising: step S100, big data preprocessing; step S200, big data segmentation and management; step S300, establishing a hypergraph model for clustering; step S400, big data mapping, in which the segmented data blocks are each mapped to a hypergraph H = (V, E), i.e., each data block is mapped to one hypergraph; step S500, clustering each data block using the hypergraph; step S600, clustering again the per-block clustering results obtained in step S500 to obtain the final clustering result. The invention combines a cloud platform with hypergraph theory to mine and cluster big data, achieving fast, real-time, and accurate analysis and processing of big data.

Description

Big data clustering method and device based on cloud computing platform
Technical Field
The invention relates to the field of data mining, in particular to a big data clustering method and device based on a cloud computing platform.
Background
As computer technology has become fully integrated into social life over the past half century, the accumulated information explosion has begun to trigger a transformation. The world not only holds more information than ever before; that information is also growing faster. Disciplines facing an information explosion, such as astronomy and genomics, coined the concept of "big data", a concept that today applies to nearly every area of human intelligence and development. The 21st century is an era of great growth in data: the mobile internet, social networks, e-commerce, and the like have greatly expanded the boundaries and applications of the internet, and all kinds of data are expanding rapidly and becoming big. The internet (social networking, search, e-commerce), the mobile internet (microblogs), the Internet of Things (sensors, smart earth), connected vehicles, GPS, medical imaging, security monitoring, finance (banking, stock markets, insurance), and telecommunications (calls, text messages) all generate data at a furious pace. In 2006, when individual users were just stepping into the TB era, roughly 180 EB of new data was generated globally; by 2011 that figure had reached 1.8 ZB. Market research institutes predict that by 2020 the world's total data will have grown 44-fold, reaching 35.2 ZB (1 ZB = 1 billion TB).
Big data means not only a sharp increase in data volume (from early ERP/CRM data, gradually expanding to internet data and then to data from Internet of Things sensors and the like) but also an increase in data complexity. Big data can be described as a large-scale qualitative change that emerges once volume accumulates past a certain point. Big data comes in rich and varied types: besides structured information such as traditional database records, it includes unstructured information such as text and video, and the demands on the speed of data acquisition and processing keep rising.
Big data includes the meaning of "mass data" but goes beyond it in content; in short, big data is "mass data" plus complex data types. Big data covers all data sets, including transactional and interactive data sets, whose size or complexity exceeds the ability of conventional techniques to capture, manage, and process them at reasonable cost and within reasonable time.
Big data arises from the convergence of three main technical trends:
Mass transaction data: in online transaction processing (OLTP) and analytics systems, from ERP applications to data warehouse applications, traditional relational data as well as unstructured and semi-structured information continue to grow. This area becomes more complex as more data and business processes move to public and private clouds. Internal business transaction information consists mainly of online transaction data and online analytical data, i.e., structured, static, historical data managed and accessed through a relational database. From these data we can understand what happened in the past.
Mass interactive data: this new force consists of social media data from Facebook, Twitter, LinkedIn, and other sources. It includes call detail records (CDRs), device and sensor information, GPS and geolocation mapping data, mass image files transferred via managed file transfer protocols, web text and clickstream data, scientific information, email, and so on. These data can tell us what will happen in the future.
Processing mass data: various lightweight databases receive data from clients; the data are then imported into a centralized large-scale distributed database or distributed storage cluster, on which ordinary queries, classification, summarization, and the like are performed over the stored mass data, satisfying most common analysis needs; data mining carried out on top of such queries can also satisfy higher-level analysis needs. For example, YunTable is a new-generation distributed database developed on the basis of conventional distributed databases and new NoSQL technology; with it, a distributed cluster of hundreds of nodes can be built to manage PB-scale data.
Faced with this onslaught of data, traditional data-processing approaches cope with ever greater difficulty; without effective tools and means, one often stands before what amounts to a gold mine and can only sigh over the "data". The difficulties traditional analysis techniques face with big data are mainly:
because of the limitations of the analytical means, not all data can be fully utilized;
analysis capabilities are limited, and answers to complex questions cannot be obtained;
simple modeling techniques have to be used because of time constraints;
because there is not enough time to compute, model accuracy is compromised.
Given the current state of data mining and clustering research, most existing methods for mining big data clusters sample the data or select representative data and then perform pointwise cluster analysis. When processing big data, methods based on sample-extraction probability are generally used, but such sampling does not account for the global relative distances between data points or intervals, non-uniform data distributions, or overly hard interval partitioning. Although clustering, fuzzy concepts, cloud models, and the like were later introduced to soften hard interval partitioning, with good results, these methods do not consider the different roles that big data points play in knowledge discovery tasks. Therefore, to make the mined clustering rules more effective and faster to obtain, the different roles of data points must be taken into account and cluster analysis studied more deeply. Cloud computing, proposed for processing among large real-world data sets, provides a strong theoretical basis for mining more effective clustering rules.
Disclosure of Invention
In order to solve the problems in the prior art, the invention discloses a big data clustering method and device based on a cloud computing platform.
MapReduce is a programming model developed by Google primarily for large-scale (TB-level) data file processing. Its main idea is to build basic operation units from the concepts of Map and Reduce: the Map program first cuts the data into unrelated blocks and dispatches them to a large number of computers for processing, achieving distributed computation; the Reduce program then summarizes the results and outputs them, so that mass data is processed in parallel. It generally takes the form:
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v2)
Briefly, the Map-Reduce programming model divides the input data file into M independent data slices (splits); it then assigns a number of Workers to start M Map functions that execute in parallel and write their results, as key/value pairs, to intermediate files (local writes). The intermediate key/value pairs are grouped by key and the Reduce functions are executed: based on intermediate-file location information obtained from the Master, Reduce commands are sent to the nodes holding the intermediate files, the final results are computed and output, and the MapReduce output is stored in R output files. This placement further reduces the bandwidth needed to transfer intermediate files.
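By way of illustration only, the Map/Reduce form above can be simulated on a single machine; the word-count task, function names, and data in this Python sketch are assumptions made for the example, not part of the patented method.

```python
from collections import defaultdict

def map_fn(k1, v1):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the slice.
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    # Reduce(k2, list(v2)) -> list(v2): sum the counts for one key.
    return [sum(values)]

def run_mapreduce(splits):
    # Shuffle: group the intermediate key/value pairs by key, as the model requires.
    groups = defaultdict(list)
    for k1, v1 in splits:                  # M map tasks, here run sequentially
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce each group; on a real cluster this runs on the reducer nodes.
    return {k2: reduce_fn(k2, vs) for k2, vs in groups.items()}

splits = [("split0", "big data big cluster"), ("split1", "data cluster data")]
print(run_mapreduce(splits))   # {'big': [2], 'data': [3], 'cluster': [2]}
```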
MapReduce depends on HDFS for its implementation. Generally, MapReduce divides the data to be processed into many small blocks; HDFS replicates each block several times to ensure system reliability, and it places the blocks on different machines in the cluster according to certain rules so that MapReduce can compute most conveniently on the machines where the data reside. HDFS is an open-source counterpart of Google's GFS: a highly fault-tolerant distributed file system that provides high-throughput data access and is suitable for storing large files (typically over 64 MB) in large volumes (PB-scale).
In this method, a clustering-integration algorithm is designed with the MapReduce programming model. The big data blocks are stored in the distributed file system HDFS of the cloud platform, Hadoop is responsible for managing the block data, and the key of each block is its data-block identifier Di. Each computer in the compute cluster applies a clustering algorithm to the blocks stored locally to obtain base clustering results, then applies a consensus scheme in a Reduce step (key: machine number; value: clustering result) over the clustering results on the same machine to obtain that machine's final integrated clustering result, thereby processing big data effectively in parallel and further improving data-processing performance and efficiency.
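A minimal single-process sketch of this clustering-integration design follows; the block-to-machine placement, the trivial base clusterer, and the per-point majority-vote consensus are illustrative assumptions standing in for the unspecified base clustering algorithm and consistency scheme.

```python
from collections import defaultdict, Counter

def map_phase(block_id, block, base_cluster):
    # Map: the stored block is keyed by its identifier Di; emit the base
    # clustering of the block under the number of the machine holding it.
    machine = hash(block_id) % 4                # toy block-to-machine placement
    return machine, base_cluster(block)

def reduce_phase(labelings):
    # Reduce (key: machine number, value: clustering results): toy consensus
    # by per-point majority vote (label alignment is ignored for brevity).
    n = len(labelings[0])
    return [Counter(l[i] for l in labelings).most_common(1)[0][0]
            for i in range(n)]

blocks = {"D1": [[0.1], [0.2], [5.0]], "D2": [[0.0], [0.3], [4.9]]}
base = lambda b: [0 if x[0] < 2.5 else 1 for x in b]    # trivial base clusterer
per_machine = defaultdict(list)
for bid, blk in blocks.items():
    machine, labels = map_phase(bid, blk, base)
    per_machine[machine].append(labels)
print({m: reduce_phase(ls) for m, ls in per_machine.items()})
```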
In order to achieve the purpose, the invention provides the following technical scheme:
a big data clustering method based on a cloud computing platform comprises the following steps:
step S100, big data preprocessing: cleaning real-world data by filling in missing values, smoothing noisy data, and identifying and deleting outliers, and normalizing data from different data sources into a standard format;
step S200, big data segmentation and management: the big data is segmented into a plurality of data blocks, which are stored in the distributed file system (HDFS) of the cloud platform, with Hadoop responsible for managing the segmented data blocks;
step S300, establishing a hypergraph model for clustering, which specifically comprises the following steps:
establishing a weighted hypergraph H = (V, E), where V is the set of vertices and E is the set of hyperedges; each hyperedge can connect more than two vertices; the vertices of the hypergraph represent the data items used for clustering, and the hyperedges represent the association among the data items represented by the vertices they connect; w(e_m) is the weight corresponding to each hyperedge e_m ∈ E and measures the degree of correlation among the related data items connected by the hyperedge;
the weight of a hyperedge e_m can be determined in either of two ways:
(1) using the support of the association rule of each hyperedge e_m as the weight of the hyperedge;
(2) using the average of the confidences of all essential association rules of each hyperedge e_m as the weight of the hyperedge; an essential association rule is a rule whose expression has only one set of data items on the right-hand side and which includes all the data items associated with the hyperedge e_m (a sketch of hypergraph construction follows).
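As an illustration of way (1), the following sketch builds a weighted hypergraph from a toy transaction database, taking the support of each frequent item set as the weight of the corresponding hyperedge; the data structures, thresholds, and edge-size cap are assumptions made for the example.

```python
from itertools import combinations

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the item set.
    return sum(itemset <= t for t in transactions) / len(transactions)

def build_hypergraph(transactions, min_supp=0.4, max_size=3):
    # Vertices are data items; every frequent item set of size >= 2 becomes a
    # hyperedge e_m with weight w(e_m) = support of its association rule.
    V = sorted(set().union(*transactions))
    w = {}
    for k in range(2, max_size + 1):
        for combo in combinations(V, k):
            s = support(set(combo), transactions)
            if s >= min_supp:
                w[frozenset(combo)] = s
    return V, list(w), w                    # vertex set, hyperedges, weights

T = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
V, E, w = build_hypergraph(T)
print(w)   # e.g. {frozenset({'a','b'}): 0.75, frozenset({'b','c'}): 0.75, ...}
```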
Step S400, big data mapping: the segmented data blocks are each mapped to a hypergraph H = (V, E), i.e., each data block is mapped to one hypergraph;
step S500, clustering each data block using the hypergraph:
for the hypergraph H = (V, E), C is a set of classes of the vertices V; each class c_i ∈ C is a subset of V, and any two classes c_i and c_j are disjoint, c_i ∩ c_j = φ; for a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ φ, then e_m and c_i are related; this relation, denoted HC(e_m, c_i), is computed from |e_m|, |c_i|, and |e_m ∩ c_i| (the formula appears only as an image in the source), where |e_m| denotes the number of vertices in hyperedge e_m, |c_i| the number of vertices in class c_i, and |e_m ∩ c_i| the number of vertices present in both e_m and c_i; classes c_i and c_j are merged into c_ij = c_i ∪ c_j; for a hyperedge e_m with e_m ∩ c_i ≠ φ, if HC(e_m, c_i) > HC(e_m, c_ij), then class c_j is noise with respect to hyperedge e_m, and the change of the HC value represents the similarity between c_i and c_j relative to hyperedge e_m; the quality Q(c_i) of class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} w(e_m) · HC(e_m, c_i),
i.e., the sum of the HC(e_m, c_i) values of all hyperedges e_m ∈ E, weighted by w(e_m);
the combination index f is defined as:
f(c_i, c_j) = Q(c_ij) − [Q(c_i) − Q(c_j)];
the specific clustering procedure comprises the following steps:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their combination index, i.e., for which f(c_i, c_j) takes its largest value; if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) constructing a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no more classes are merged (a sketch of this merge loop follows);
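The merge loop of steps (1) to (4) can be sketched as follows. Because the HC formula appears only as an image in the source, the Jaccard-style definition below, built from the three quantities |e_m|, |c_i|, and |e_m ∩ c_i| named in the text, is an assumption; Q and f follow the definitions given above.

```python
def HC(e, c):
    # Assumed connectivity measure: Jaccard overlap of hyperedge e and class c,
    # using the quantities |e_m|, |c_i| and |e_m ∩ c_i| defined in the text.
    inter = len(e & c)
    return inter / (len(e) + len(c) - inter) if inter else 0.0

def Q(c, E, w):
    # Quality of class c: sum over all hyperedges of w(e_m) * HC(e_m, c).
    return sum(w[e] * HC(e, c) for e in E)

def f(ci, cj, E, w):
    # Combination index as printed: f = Q(c_ij) - [Q(c_i) - Q(c_j)].
    # (A merge-gain variant would subtract Q(c_i) + Q(c_j) instead.)
    return Q(ci | cj, E, w) - (Q(ci, E, w) - Q(cj, E, w))

def cluster(V, E, w, max_rounds=10):
    C = [frozenset([v]) for v in V]            # (1) one class per vertex
    for _ in range(max_rounds):
        merged, used, new_C = False, set(), []
        for ci in C:                           # (2) best partner for each class
            if ci in used:
                continue
            best, best_f = None, 0.0
            for cj in C:
                if cj is not ci and cj not in used:
                    fij = f(ci, cj, E, w)
                    if fij > best_f:
                        best, best_f = cj, fij
            if best is not None:               # merge c_i and c_j into c_ij
                new_C.append(ci | best)
                used.update((ci, best))
                merged = True
            else:
                new_C.append(ci)
                used.add(ci)
        C = new_C                              # (3) classes of the new hypergraph
        if not merged:                         # (4) stop when nothing merges
            break
    return C

V = ["a", "b", "c", "d"]
E = [frozenset("ab"), frozenset("abc"), frozenset("cd")]
w = {E[0]: 0.8, E[1]: 0.5, E[2]: 0.7}
print(cluster(V, E, w))
```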
the clustering procedure may alternatively comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their combination index, i.e., for which f(c_i, c_j) takes its largest value; if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) constructing a new hypergraph from all the merged classes;
(4) the new hypergraph corresponds to k partitions {G_1, G_2, …, G_k}; for each partition, the weighted average μ_i of all hyperedges in the i-th partition and their weighted mean square error σ_i are computed (the defining formulas appear only as images in the source), where i = 1, 2, …, k, e denotes a hyperedge of the hypergraph, G_i the i-th partition, w(e) the weight corresponding to hyperedge e, and n_i(e) the number of vertices of hyperedge e in partition G_i;
(5) judging whether σ_i is greater than a first threshold; if so, repeating the clustering process of steps (1) to (4); otherwise, ending the clustering process (a sketch of the per-partition statistics follows).
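The defining formulas for the per-partition statistics appear only as images in the source; the sketch below therefore assumes the usual weighted mean and weighted mean square error, with each partition's hyperedges already restricted to that partition's vertex set.

```python
import math

def partition_stats(partition_edges, w):
    # partition_edges: hyperedges of G_i restricted to G_i's vertices, so that
    # len(e) plays the role of n_i(e). Assumed definitions:
    #   mu_i    = sum(w(e) * n_i(e)) / sum(w(e))
    #   sigma_i = sqrt(sum(w(e) * (n_i(e) - mu_i)**2) / sum(w(e)))
    total_w = sum(w[e] for e in partition_edges)
    mu = sum(w[e] * len(e) for e in partition_edges) / total_w
    var = sum(w[e] * (len(e) - mu) ** 2 for e in partition_edges) / total_w
    return mu, math.sqrt(var)

def needs_reclustering(partitions, w, first_threshold):
    # Step (5): repeat steps (1)-(4) if sigma_i exceeds the first threshold
    # for any partition; otherwise the clustering process ends.
    return any(partition_stats(p, w)[1] > first_threshold for p in partitions)
```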
In step S500, the clustering of each data block using the hypergraph may also adopt the following method:
(1) Coarsening: based on the hypergraph H = (V, E), construct a minimal hypergraph such that, for any partition made on the minimal hypergraph, the projection of that partition onto the initial hypergraph has better quality than a partition made directly on the initial hypergraph in the same time;
in the hypergraph coarsening stage, a series of successively smaller hypergraphs is constructed. The purpose of coarsening is to build a minimal hypergraph such that, for any partition made on it, the projection of that partition onto the initial hypergraph has better quality than a partition made directly on the initial hypergraph in the same time. Coarsening also reduces the size of the hyperedges: after several levels of coarsening, large hyperedges are compressed into small hyperedges connecting only a few vertices. This matters because the refinement heuristic is based on the Kernighan-Lin algorithm, which is very effective for small hyperedges but performs poorly on hyperedges that belong to different partitions and contain many vertices. When building the next-level coarsened hypergraph, different methods may be chosen to compress a group of vertices into a single vertex: from the viewpoint of node selection, FC (First Choice), GFC (Greedy First Choice), HFC (Hybrid First Choice), and the like; from the viewpoint of node merging, EDGE, HEDGE (Hyper-EDGE), MHEDGE (Modified Hyper-EDGE), and the like (a sketch of one coarsening level follows).
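One level of coarsening can be sketched as follows; the pairwise-overlap scoring used to choose which vertices to compress is an illustrative stand-in for the FC/GFC/HFC selection and EDGE/HEDGE/MHEDGE merging schemes named above.

```python
from collections import defaultdict

def coarsen_once(V, E, w):
    # Score each vertex pair by the total weight of hyperedges containing both.
    score = defaultdict(float)
    for e in E:
        for u in e:
            for v in e:
                if u < v:
                    score[(u, v)] += w[e]
    merged, mapping = set(), {}
    # Greedily compress the heaviest pairs into single coarse vertices.
    for (u, v), _ in sorted(score.items(), key=lambda kv: -kv[1]):
        if u not in merged and v not in merged:
            mapping[u] = mapping[v] = u + "+" + v   # coarse vertex named by pair
            merged.update((u, v))
    for x in V:                                     # unmatched vertices carry over
        mapping.setdefault(x, x)
    # Project hyperedges onto coarse vertices; hyperedge sizes shrink, as described.
    cw = {}
    for e in E:
        ce = frozenset(mapping[x] for x in e)
        if len(ce) > 1:                             # drop edges collapsed to a point
            cw[ce] = cw.get(ce, 0.0) + w[e]
    return sorted(set(mapping.values())), list(cw), cw

V = ["a", "b", "c", "d"]
E = [frozenset("ab"), frozenset("abc"), frozenset("cd")]
w = {E[0]: 0.8, E[1]: 0.5, E[2]: 0.7}
print(coarsen_once(V, E, w))   # 4 vertices -> 2; the 3-vertex edge -> a 2-vertex edge
```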
(2) Initial partitioning: bisect the hypergraph coarsened in step (1);
in the initial partitioning stage, the coarsened hypergraph is bisected. Because the hypergraph now contains few vertices (typically fewer than 100), many different algorithms can be used without unduly affecting runtime or quality. Multiple random bisections may be used, as may combinatorial methods, spectral methods, and cellular-automata methods.
(3) Migration optimization: use the partition of the minimal hypergraph to obtain a more refined hypergraph partition;
in the migration optimization stage, the partition of the minimal hypergraph is used to obtain a more refined hypergraph partition. This is achieved by projecting the partition onto the next, more refined hypergraph and applying a partition-refinement algorithm, which reduces the number of repartitionings and thereby improves partition quality; refinement achieves higher quality because the more refined hypergraph at the next level has more degrees of freedom. The idea of the V-cycle refinement algorithm is to use the multilevel paradigm to further improve the quality of the bisection. It comprises two parts, a coarsening phase and a migration optimization phase. The coarsening phase preserves the initial partition as the input to the algorithm; we call this a restricted coarsening scheme: the groups of vertices merged to form the vertices of the coarsened graph may each belong to only one side of the bisection. As a result, the original bisection is preserved through the coarsening process and becomes the initial partition to be refined in the migration optimization phase. The migration optimization phase of V-cycle refinement is identical to the migration optimization phase of the multilevel hypergraph partitioning method described above: it moves vertices between partitions to improve partition quality. Notably, the various coarsened representations of the original hypergraph allow refinement to further improve quality by helping it escape local minima.
(4) Finally, the partitioning result is taken as the clustering result.
Step S600, clustering again the clustering results of each data block obtained in step S500 to obtain the final clustering result;
this re-clustering of the results obtained in step S500 may be implemented with various clustering methods, such as k-means clustering or hypergraph-based clustering.
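As one example of the re-clustering named above, the following minimal plain-Python k-means clusters the per-block results, summarized here as centroids; representing each base cluster by its centroid is an assumption made for the example.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # initial centers
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:                       # assign each point to its nearest center
            nearest = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centers[j])))
            groups[nearest].append(p)
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]   # recompute (keep empty centers)
    return centers, groups

# Each block's clusters, summarized by their centroids, are clustered again globally.
block_centroids = [(0.1, 0.2), (0.15, 0.1), (5.0, 5.2), (4.8, 5.1)]
centers, groups = kmeans(block_centroids, k=2)
print(centers)
```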
The method combines the cloud platform with hypergraph theory to mine and cluster big data, achieving fast, real-time, and accurate analysis and processing of big data.
The invention also provides a big data clustering device based on the cloud computing platform, which comprises the following components:
The big data preprocessing device is used for cleaning real-world data by filling in missing values, smoothing noisy data, and identifying and deleting outliers, and for normalizing data from different data sources into a standard format;
the big data segmentation and management device is used for segmenting the big data to obtain a plurality of segmented data blocks, storing the data blocks into a distributed file system (HDFS) of the cloud platform, and managing the segmented data blocks by Hadoop;
the hypergraph model device for establishing the clustering is specifically used for:
establishing a weighted hypergraph H = (V, E), where V is the set of vertices and E is the set of hyperedges; each hyperedge can connect more than two vertices; the vertices of the hypergraph represent the data items used for clustering, and the hyperedges represent the association among the data items represented by the vertices they connect; w(e_m) is the weight corresponding to each hyperedge e_m ∈ E and measures the degree of correlation among the related data items connected by the hyperedge;
the weight of a hyperedge e_m can be determined in either of two ways:
(1) using the support of the association rule of each hyperedge e_m as the weight of the hyperedge;
(2) using the average of the confidences of all essential association rules of each hyperedge e_m as the weight of the hyperedge; an essential association rule is a rule whose expression has only one set of data items on the right-hand side and which includes all the data items associated with the hyperedge e_m.
The big data mapping device is used for mapping the segmented data blocks each to a hypergraph H = (V, E), i.e., each data block is mapped to one hypergraph;
a clustering device for clustering each data block by using hypergraph,
for the hypergraph H = (V, E), C is a set of classes of the vertices V; each class c_i ∈ C is a subset of V, and any two classes c_i and c_j are disjoint, c_i ∩ c_j = φ; for a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ φ, then e_m and c_i are related; this relation, denoted HC(e_m, c_i), is computed from |e_m|, |c_i|, and |e_m ∩ c_i| (the formula appears only as an image in the source), where |e_m| denotes the number of vertices in hyperedge e_m, |c_i| the number of vertices in class c_i, and |e_m ∩ c_i| the number of vertices present in both e_m and c_i; classes c_i and c_j are merged into c_ij = c_i ∪ c_j; for a hyperedge e_m with e_m ∩ c_i ≠ φ, if HC(e_m, c_i) > HC(e_m, c_ij), then class c_j is noise with respect to hyperedge e_m, and the change of the HC value represents the similarity between c_i and c_j relative to hyperedge e_m; the quality Q(c_i) of class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} w(e_m) · HC(e_m, c_i),
i.e., the sum of the HC(e_m, c_i) values of all hyperedges e_m ∈ E, weighted by w(e_m);
the combination index f is defined as:
f(c_i, c_j) = Q(c_ij) − [Q(c_i) − Q(c_j)];
the specific clustering procedure comprises the following steps:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their combination index, i.e., for which f(c_i, c_j) takes its largest value; if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) constructing a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no more classes are merged;
the clustering procedure may alternatively comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their combination index, i.e., for which f(c_i, c_j) takes its largest value; if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) constructing a new hypergraph from all the merged classes;
(4) the new hypergraph corresponds to k partitions {G_1, G_2, …, G_k}; for each partition, the weighted average μ_i of all hyperedges in the i-th partition and their weighted mean square error σ_i are computed (the defining formulas appear only as images in the source), where i = 1, 2, …, k, e denotes a hyperedge of the hypergraph, G_i the i-th partition, w(e) the weight corresponding to hyperedge e, and n_i(e) the number of vertices of hyperedge e in partition G_i;
(5) judging whether σ_i is greater than a first threshold; if so, repeating the clustering process of steps (1) to (4); otherwise, ending the clustering process.
The final clustering device is used for clustering again the clustering results of each data block obtained by the clustering processing device to obtain the final clustering result;
the clustering result obtained by the clustering processing device is clustered again, and can be realized by adopting various clustering methods, such as a k-means clustering method, a hypergraph-based clustering method and the like.
Drawings
FIG. 1 is a flow chart of the big data clustering method of the present invention;
FIG. 2 is a block diagram of the big data clustering device of the present invention.
Detailed Description
The technical solution of the present invention will be described clearly and completely below with reference to the accompanying drawings. Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The embodiments described below do not represent all embodiments consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Referring to fig. 1, the big data clustering method based on the cloud computing platform provided by the invention comprises the following steps:
step S100, big data preprocessing: cleaning real-world data by filling in missing values, smoothing noisy data, and identifying and deleting outliers, and normalizing data from different data sources into a standard format;
data preprocessing refers to some processing of data prior to the main processing. The method provides clean, accurate and concise data for the information processing process, improves the information processing efficiency and accuracy, and is a very important link in the information processing. The real world data is very different, and in order to realize the unified processing of the data, the data must be preprocessed into standard data meeting requirements.
Step S200, big data segmentation and management: the big data is segmented into a plurality of data blocks, which are stored in the distributed file system (HDFS) of the cloud platform, with Hadoop responsible for managing the segmented data blocks;
hadoop is implemented as an open source of Google's MapReduce algorithm, and can divide an application program into many small units of work, each of which can be executed or repeatedly executed on any cluster node. In addition, hadoop provides a distributed file system for storing data on various compute nodes and provides high throughput for reading and writing data. Many single machine algorithms are realized again on the Hadoop, and high availability and expandability are provided for various algorithms to process mass data.
Step S300, establishing a hypergraph model for clustering, which specifically comprises the following steps:
establishing a weighted hypergraph H = (V, E), where V is the set of vertices and E is the set of hyperedges; each hyperedge can connect more than two vertices; the vertices of the hypergraph represent the data items used for clustering, and the hyperedges represent the association among the data items represented by the vertices they connect; w(e_m) is the weight corresponding to each hyperedge e_m ∈ E and measures the degree of correlation among the related data items connected by the hyperedge;
the weight of a hyperedge e_m can be determined in either of two ways:
(1) using the support of the association rule of each hyperedge e_m as the weight of the hyperedge;
(2) using the average of the confidences of all essential association rules of each hyperedge e_m as the weight of the hyperedge; an essential association rule is a rule whose expression has only one set of data items on the right-hand side and which includes all the data items associated with the hyperedge e_m.
To facilitate an understanding of the invention, some concepts related to the hypergraph are presented below.
Data item and data item set: let I = {i_1, i_2, …, i_m} be a set of m distinct items; each i_k (k = 1, 2, …, m) is called a data item (Item). The set of data items I is called a data item set (Itemset), or simply an item set, and the number of its elements is called the length of the item set. An item set of length k is called a k-dimensional item set, or simply a k-item set.
Transaction: transaction T (Transaction) is a subset of the set of data items I, i.e.Each transaction has associated with it a unique identifier TID, and the totality of the different transactions constitutes the totality of transactions D (i.e. the transaction database).
Support of an item set: if X is an item set, B is the number of transactions in database D that contain X, and A is the number of all transactions in D, then the support of item set X is Support(X) = B / A. The support of item set X describes the importance of the item set X.
Association rule: an association rule may be expressed as R: X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = φ; it means that if item set X occurs in a certain transaction, item set Y inevitably also occurs in the same transaction. X is called the antecedent of the rule and Y the consequent.
Support of an association rule: for an association rule R: X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = φ, the support of rule R is the ratio of the number of transactions in database D containing both item set X and item set Y to the number of all transactions.
Confidence of an association rule: for an association rule R: X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = φ, the confidence of rule R is Confidence(R) = Support(X ∪ Y) / Support(X), i.e., the probability that item set Y also occurs in those transactions of database D in which item set X occurs.
The support and confidence of an association rule are two measures of the rule's interestingness. Confidence measures the accuracy of the rule, i.e., its strength; support measures the importance of the rule, i.e., how frequently it occurs. If support and confidence are not considered, a database contains a vast number of association rules; in practice, people are generally interested only in rules that satisfy certain support and confidence levels. Therefore, to find meaningful association rules, the user needs to give two basic thresholds: the minimum support and the minimum confidence.
Minimum support and frequent item sets: the minimum support (minsupp) is the minimum support threshold that a data item set must meet for an association rule to be found; it represents the lowest statistical significance of a data item set. Only item sets meeting the minimum support can appear in association rules. An item set whose support is greater than the minimum support is called a frequent item set or strong item set (large item set); otherwise it is called an infrequent item set or weak item set (small item set).
Minimum confidence: the minimum confidence (minconf) is the minimum confidence that an association rule must satisfy; it represents the minimum reliability of an association rule.
Strong association rule: if Support(R) ≥ minsupp and Confidence(R) ≥ minconf, then the association rule R: X → Y is called a strong association rule.
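The definitions above translate directly into code; the toy transaction database and thresholds below are illustrative.

```python
def support(X, D):
    # Support(X) = B / A: fraction of transactions in D that contain X.
    return sum(X <= t for t in D) / len(D)

def confidence(X, Y, D):
    # Confidence(R: X -> Y) = Support(X ∪ Y) / Support(X).
    return support(X | Y, D) / support(X, D)

def is_strong(X, Y, D, minsupp, minconf):
    # Strong association rule: Support(R) >= minsupp and Confidence(R) >= minconf.
    return support(X | Y, D) >= minsupp and confidence(X, Y, D) >= minconf

D = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk"}]
print(support({"milk", "bread"}, D))                 # 0.5
print(confidence({"milk"}, {"bread"}, D))            # 0.666...
print(is_strong({"milk"}, {"bread"}, D, 0.4, 0.6))   # True
```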
Hypergraph: H = (V, E), with vertex set V = {v_1, v_2, …, v_n} and edge set E = {e_1, e_2, …, e_m}. Let a_ij denote the number of edges directly joining vertices v_i and v_j, which may be 0, 1, 2, …; the resulting n × n matrix A = (a_ij), with a_ij ∈ {0, 1, 2, …}, is the adjacency matrix of the hypergraph.
Since the definition of the hypergraph adjacency matrix extends the definition of the simple-graph adjacency matrix, the properties of the hypergraph adjacency matrix follow from the defined properties of the adjacency matrix:
(1) A(H) is a symmetric matrix;
(2) a necessary and sufficient condition for two graphs G and H to be isomorphic is the existence of a permutation matrix P such that A(H) = P^T A(G) P (a sketch of the adjacency matrix follows).
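A sketch of the adjacency matrix just defined follows; counting, as a_ij, the hyperedges that contain both v_i and v_j is one reading of "edges directly joining" the two vertices and is an assumption, and the symmetry property (1) is checked at the end.

```python
def adjacency(V, E):
    # a[i][j] = number of hyperedges containing both v_i and v_j (i != j).
    idx = {v: i for i, v in enumerate(V)}
    n = len(V)
    a = [[0] * n for _ in range(n)]
    for e in E:
        for u in e:
            for v in e:
                if u != v:
                    a[idx[u]][idx[v]] += 1
    return a

V = ["v1", "v2", "v3"]
E = [frozenset({"v1", "v2", "v3"}), frozenset({"v1", "v2"})]
A = adjacency(V, E)
print(A)   # [[0, 2, 1], [2, 0, 1], [1, 1, 0]]
assert all(A[i][j] == A[j][i] for i in range(3) for j in range(3))  # A(H) symmetric
```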
Step S400, big data mapping: the segmented data blocks are each mapped to a hypergraph H = (V, E), i.e., each data block is mapped to one hypergraph;
step S500, clustering each data block using the hypergraph:
for the hypergraph H = (V, E), C is a set of classes of the vertices V; each class c_i ∈ C is a subset of V, and any two classes c_i and c_j are disjoint, c_i ∩ c_j = φ; for a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ φ, then e_m and c_i are related; this relation, denoted HC(e_m, c_i), is computed from |e_m|, |c_i|, and |e_m ∩ c_i| (the formula appears only as an image in the source), where |e_m| denotes the number of vertices in hyperedge e_m, |c_i| the number of vertices in class c_i, and |e_m ∩ c_i| the number of vertices present in both e_m and c_i; classes c_i and c_j are merged into c_ij = c_i ∪ c_j; for a hyperedge e_m with e_m ∩ c_i ≠ φ, if HC(e_m, c_i) > HC(e_m, c_ij), then class c_j is noise with respect to hyperedge e_m, and the change of the HC value represents the similarity between c_i and c_j relative to hyperedge e_m; the quality Q(c_i) of class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} w(e_m) · HC(e_m, c_i),
i.e., the sum of the HC(e_m, c_i) values of all hyperedges e_m ∈ E, weighted by w(e_m);
the combination index f is defined as:
f(c_i, c_j) = Q(c_ij) − [Q(c_i) − Q(c_j)];
the specific clustering procedure comprises the following steps:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their combination index, i.e., for which f(c_i, c_j) takes its largest value; if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) constructing a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no more classes are merged;
the clustering procedure may alternatively comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their combination index, i.e., for which f(c_i, c_j) takes its largest value; if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) constructing a new hypergraph from all the merged classes;
(4) the new hypergraph corresponds to k partitions {G_1, G_2, …, G_k}; for each partition, the weighted average μ_i of all hyperedges in the i-th partition and their weighted mean square error σ_i are computed (the defining formulas appear only as images in the source), where i = 1, 2, …, k, e denotes a hyperedge of the hypergraph, G_i the i-th partition, w(e) the weight corresponding to hyperedge e, and n_i(e) the number of vertices of hyperedge e in partition G_i;
(5) judging whether σ_i is greater than a first threshold; if so, repeating the clustering process of steps (1) to (4); otherwise, ending the clustering process.
In step S500, the clustering of each data block using the hypergraph may also adopt the following method:
(1) Coarsening: based on the hypergraph H = (V, E), construct a minimal hypergraph such that, for any partition made on the minimal hypergraph, the projection of that partition onto the initial hypergraph has better quality than a partition made directly on the initial hypergraph in the same time;
in the hypergraph coarsening stage, a series of successively smaller hypergraphs is constructed. The purpose of coarsening is to build a minimal hypergraph such that, for any partition made on it, the projection of that partition onto the initial hypergraph has better quality than a partition made directly on the initial hypergraph in the same time. Coarsening also reduces the size of the hyperedges: after several levels of coarsening, large hyperedges are compressed into small hyperedges connecting only a few vertices. This matters because the refinement heuristic is based on the Kernighan-Lin algorithm, which is very effective for small hyperedges but performs poorly on hyperedges that belong to different partitions and contain many vertices. When building the next-level coarsened hypergraph, different methods may be chosen to compress a group of vertices into a single vertex: from the viewpoint of node selection, FC (First Choice), GFC (Greedy First Choice), HFC (Hybrid First Choice), and the like; from the viewpoint of node merging, EDGE, HEDGE (Hyper-EDGE), MHEDGE (Modified Hyper-EDGE), and the like.
(2) Initial partitioning: bisect the hypergraph coarsened in step (1);
in the initial partitioning stage, the coarsened hypergraph is bisected. Because the hypergraph now contains few vertices (typically fewer than 100), many different algorithms can be used without unduly affecting runtime or quality. Multiple random bisections may be used, as may combinatorial methods, spectral methods, and cellular-automata methods.
(3) Migration optimization: use the partition of the minimal hypergraph to obtain a more refined hypergraph partition;
in the migration optimization stage, the partition of the minimal hypergraph is used to obtain a more refined hypergraph partition. This is achieved by projecting the partition onto the next, more refined hypergraph and applying a partition-refinement algorithm, which reduces the number of repartitionings and thereby improves partition quality; refinement achieves higher quality because the more refined hypergraph at the next level has more degrees of freedom. The idea of the V-cycle refinement algorithm is to use the multilevel paradigm to further improve the quality of the bisection. It comprises two parts, a coarsening phase and a migration optimization phase. The coarsening phase preserves the initial partition as the input to the algorithm; we call this a restricted coarsening scheme: the groups of vertices merged to form the vertices of the coarsened graph may each belong to only one side of the bisection. As a result, the original bisection is preserved through the coarsening process and becomes the initial partition to be refined in the migration optimization phase. The migration optimization phase of V-cycle refinement is identical to the migration optimization phase of the multilevel hypergraph partitioning method described above: it moves vertices between partitions to improve partition quality. Notably, the various coarsened representations of the original hypergraph allow refinement to further improve quality by helping it escape local minima.
(4) Finally, the partitioning result is taken as the clustering result.
Step S600, clustering again the clustering results of each data block obtained in step S500 to obtain the final clustering result;
this re-clustering of the results obtained in step S500 may be implemented with various clustering methods, such as k-means clustering or hypergraph-based clustering.
The method combines the cloud platform with hypergraph theory to mine and cluster big data, achieving fast, real-time, and accurate analysis and processing of big data.
Referring to fig. 2, the present invention further provides a big data clustering apparatus based on a cloud computing platform, including:
The big data preprocessing device is used for cleaning real-world data by filling in missing values, smoothing noisy data, and identifying and deleting outliers, and for normalizing data from different data sources into a standard format;
Data preprocessing refers to processing applied to data before the main processing. It provides clean, accurate, and concise data for the information-processing stage, improves processing efficiency and accuracy, and is a very important link in information processing. To process data uniformly, they must be preprocessed into standard data that meet the requirements.
The big data segmentation and management device is used for segmenting the big data into a plurality of data blocks, storing them in the distributed file system (HDFS) of the cloud platform, with Hadoop managing the segmented data blocks;
Hadoop is an open-source implementation of Google's MapReduce algorithm. It can divide an application into many small units of work, each of which can be executed or re-executed on any cluster node. In addition, Hadoop provides a distributed file system that stores data on the compute nodes and delivers high aggregate throughput for reading and writing data. Many single-machine algorithms have been re-implemented on Hadoop, giving all kinds of algorithms high availability and scalability for processing mass data.
The hypergraph-model device for establishing the clustering model is specifically used for:
establishing a weighted hypergraph H = (V, E), where V is the set of vertices and E is the set of hyperedges; each hyperedge can connect more than two vertices; the vertices of the hypergraph represent the data items used for clustering, and the hyperedges represent the association among the data items represented by the vertices they connect; w(e_m) is the weight corresponding to each hyperedge e_m ∈ E and measures the degree of correlation among the related data items connected by the hyperedge;
the weight of a hyperedge e_m can be determined in either of two ways:
(1) using the support of the association rule of each hyperedge e_m as the weight of the hyperedge;
(2) using the average of the confidences of all essential association rules of each hyperedge e_m as the weight of the hyperedge; an essential association rule is a rule whose expression has only one set of data items on the right-hand side and which includes all the data items associated with the hyperedge e_m.
The big data mapping device is used for mapping the segmented data blocks each to a hypergraph H = (V, E), i.e., each data block is mapped to one hypergraph;
a clustering device for clustering each data block by using a hypergraph,
for the hypergraph H = (V, E), C is a set of classes of the vertices V; each class c_i ∈ C is a subset of V, and any two classes c_i and c_j are disjoint, c_i ∩ c_j = φ; for a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ φ, then e_m and c_i are related; this relation, denoted HC(e_m, c_i), is computed from |e_m|, |c_i|, and |e_m ∩ c_i| (the formula appears only as an image in the source), where |e_m| denotes the number of vertices in hyperedge e_m, |c_i| the number of vertices in class c_i, and |e_m ∩ c_i| the number of vertices present in both e_m and c_i; classes c_i and c_j are merged into c_ij = c_i ∪ c_j; for a hyperedge e_m with e_m ∩ c_i ≠ φ, if HC(e_m, c_i) > HC(e_m, c_ij), then class c_j is noise with respect to hyperedge e_m, and the change of the HC value represents the similarity between c_i and c_j relative to hyperedge e_m; the quality Q(c_i) of class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} w(e_m) · HC(e_m, c_i),
i.e., the sum of the HC(e_m, c_i) values of all hyperedges e_m ∈ E, weighted by w(e_m);
the combination index f is defined as:
f(c_i, c_j) = Q(c_ij) − [Q(c_i) − Q(c_j)];
the specific clustering procedure comprises the following steps:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their combination index, i.e., for which f(c_i, c_j) takes its largest value; if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) constructing a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no more classes are merged;
the clustering procedure may alternatively comprise:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their combination index, i.e., for which f(c_i, c_j) takes its largest value; if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) constructing a new hypergraph from all the merged classes;
(4) the new hypergraph corresponds to k partitions {G_1, G_2, …, G_k}; for each partition, the weighted average μ_i of all hyperedges in the i-th partition and their weighted mean square error σ_i are computed (the defining formulas appear only as images in the source), where i = 1, 2, …, k, e denotes a hyperedge of the hypergraph, G_i the i-th partition, w(e) the weight corresponding to hyperedge e, and n_i(e) the number of vertices of hyperedge e in partition G_i;
(5) judging whether σ_i is greater than a first threshold; if so, repeating the clustering process of steps (1) to (4); otherwise, ending the clustering process.
The final clustering device is used for clustering again the clustering results of each data block obtained by the clustering processing device to obtain the final clustering result;
this re-clustering of the results obtained by the clustering processing device may be implemented with various clustering methods, such as k-means clustering or hypergraph-based clustering.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (4)

1. A big data clustering method based on a cloud computing platform, comprising the following steps:
step S100, big data preprocessing: cleaning real-world data by filling in missing values, smoothing noisy data, and identifying and deleting outliers, and normalizing data from different data sources into a standard format;
step S200, big data segmentation and management: the big data is segmented into a plurality of data blocks, which are stored in the distributed file system (HDFS) of the cloud platform, with Hadoop responsible for managing the segmented data blocks;
step S300, establishing a hypergraph model for clustering;
step S400, big data mapping: the segmented data blocks are each mapped to a hypergraph H = (V, E), i.e., each data block is mapped to one hypergraph;
step S500, clustering each data block using the hypergraph, specifically comprising:
for the hypergraph H = (V, E), C is a set of classes of the vertices V; each class c_i ∈ C is a subset of V, and any two classes c_i and c_j are disjoint, c_i ∩ c_j = φ; for a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ φ, then e_m and c_i are related; this relation, denoted HC(e_m, c_i), is computed from |e_m|, |c_i|, and |e_m ∩ c_i| (the formula appears only as an image in the source), where |e_m| denotes the number of vertices in hyperedge e_m, |c_i| the number of vertices in class c_i, and |e_m ∩ c_i| the number of vertices present in both e_m and c_i; classes c_i and c_j are merged into c_ij = c_i ∪ c_j; for a hyperedge e_m with e_m ∩ c_i ≠ φ, if HC(e_m, c_i) > HC(e_m, c_ij), then class c_j is noise with respect to hyperedge e_m, and the change of the HC value represents the similarity between c_i and c_j relative to hyperedge e_m; the quality Q(c_i) of class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} w(e_m) · HC(e_m, c_i),
i.e., the sum of the HC(e_m, c_i) values of all hyperedges e_m ∈ E, weighted by w(e_m);
the combination index f is defined as:
f(c_i, c_j) = Q(c_ij) − [Q(c_i) − Q(c_j)];
the specific clustering procedure comprises the following steps:
(1) initializing the class set C so that each class in C corresponds to one vertex in V;
(2) traversing all classes in the hypergraph and, for each class c_i, finding the class c_j that maximizes their combination index, i.e., for which f(c_i, c_j) takes its largest value; if f(c_i, c_j) > 0, merging class c_i and class c_j into class c_ij;
(3) constructing a new hypergraph from all the merged classes;
(4) repeating steps (1) to (3) until no more classes are merged;
step S600, clustering the clustering result of each data block obtained in the step S500 again to obtain a final clustering result;
in step S300, establishing a hypergraph model for clustering specifically includes:
establishing a weighted hypergraph H = (V, E), where V is the set of vertices and E is the set of hyperedges; each hyperedge can connect more than two vertices; the vertices of the hypergraph represent the data items used for clustering, and the hyperedges represent the association among the data items represented by the vertices they connect; w(e_m) is the weight corresponding to each hyperedge e_m ∈ E and measures the degree of correlation among the related data items connected by the hyperedge;
wherein the weight of the hyperedge e_m is:
the support of the association rule of each hyperedge e_m, used as the weight of the hyperedge;
association rule: an association rule is expressed as R: X → Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = φ, meaning that if item set X occurs in a transaction, item set Y inevitably occurs in the same transaction; X is called the antecedent of the rule and Y the consequent;
support of an association rule: for an association rule R: X → Y, the support of rule R is the ratio of the number of transactions in database D containing both item set X and item set Y to the number of all transactions.
2. The cloud computing platform-based big data clustering method according to claim 1, wherein the weight of the hyperedge e_m is:
the average of the confidences of all essential association rules of each hyperedge e_m, used as the weight of the hyperedge; an essential association rule is a rule whose expression has only one set of data items on the right-hand side and which includes all the data items associated with the hyperedge e_m.
3. A big data clustering device based on a cloud computing platform, comprising:
the big data preprocessing device, used for cleaning real-world data by filling in missing values, smoothing noisy data, and identifying and deleting outliers, and for normalizing data from different data sources into a standard format;
the big data segmentation and management device is used for segmenting the big data to obtain a plurality of segmented data blocks, storing the data blocks into a distributed file system (HDFS) of the cloud platform, and managing the segmented data blocks by Hadoop;
the hypergraph model device for clustering is used for building a hypergraph model for clustering;
the big data mapping device is used for mapping the segmented data blocks to the hypergraphs H = (V, E), namely each data block is mapped to one hypergraph;
the clustering processing device, utilize the hypergraph to carry out clustering processing respectively to every data block, include specifically:
for the hypergraph H = (V, E), C is a set of classes over the vertices V, each class c_i ∈ C being a subset of V, and any two classes c_i and c_j satisfying c_i ∩ c_j = Φ; for a hyperedge e_m and a class c_i, if e_m ∩ c_i ≠ Φ, then there is a relationship HC(e_m, c_i) between e_m and c_i;
wherein |e_m| denotes the number of vertices in the hyperedge e_m, |c_i| denotes the number of vertices in the class c_i, and |e_m ∩ c_i| denotes the number of vertices present in both e_m and c_i; the classes c_i and c_j are combined into c_ij, c_ij = c_i ∪ c_j; for a hyperedge e_m with e_m ∩ c_i ≠ Φ, if HC(e_m, c_i) > HC(e_m, c_ij), the change of the HC value represents the similarity between c_i and c_j relative to the hyperedge e_m; the quality Q(c_i) of the class c_i is defined as:
Q(c_i) = Σ_{e_m ∈ E} w(e_m) · HC(e_m, c_i), i.e., the sum of the weighted HC(e_m, c_i) values over all hyperedges e_m ∈ E of the class c_i;
defining the combination index f as:
f(c_i, c_j) = Q(c_ij) - [Q(c_i) - Q(c_j)];
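The HC formula itself appears in the filing only as an image and does not survive in the text, so the sketch below substitutes a Jaccard-style overlap built from the three counts the claim does define (|e_m|, |c_i|, |e_m ∩ c_i|); that substitution is an assumption, not the patented formula. Q and f follow the claim directly, with f implemented exactly as printed; note the bracketed term reads oddly, and a symmetric variant Q(c_ij) - [Q(c_i) + Q(c_j)] may be what was intended.

```python
def hc(edge, cls):
    """Stand-in for HC(e_m, c_i): overlap of hyperedge and class,
    normalised by their union (assumed form, not the claim's image)."""
    inter = len(edge & cls)
    return inter / (len(edge) + len(cls) - inter) if inter else 0.0

def quality(cls, hyperedges, weights):
    """Q(c_i): sum of w(e_m) * HC(e_m, c_i) over all hyperedges e_m."""
    return sum(w * hc(e, cls) for e, w in zip(hyperedges, weights))

def combination_index(ci, cj, hyperedges, weights):
    """f(c_i, c_j) = Q(c_ij) - [Q(c_i) - Q(c_j)], as printed above."""
    q = lambda c: quality(c, hyperedges, weights)
    return q(ci | cj) - (q(ci) - q(cj))
```

Partially applying the hypergraph, e.g. f = lambda a, b: combination_index(a, b, hyperedges, weights), yields the two-argument index used by the merge loop sketched earlier.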
the specific clustering process comprises the following steps:
(1) Initializing a class set C, and enabling each class in C to correspond to each vertex in V;
(2) Traversing all classes in the hypergraph, and for each class c_i finding the class c_j that maximizes their combination index, i.e., for which f(c_i, c_j) takes its largest value; if f(c_i, c_j) > 0, merging class c_i and class c_j into the class c_ij;
(3) Constructing a new hypergraph by using all the merged classes;
(4) Repeating the steps (1) to (3) until no more classes are merged;
the final clustering device is used for clustering again the clustering results of the data blocks obtained by the clustering processing device, to obtain the final clustering result;
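The claim does not spell out how the per-block results are clustered again, so the following is only one plausible reading, offered as an assumption: treat every class produced for any block as a candidate cluster and merge classes whose member overlap is high; the threshold value is arbitrary.

```python
def recluster(block_results, threshold=0.5):
    """Merge per-block classes whose Jaccard overlap exceeds `threshold`
    to produce the final clustering (assumed second-stage procedure)."""
    classes = [set(c) for result in block_results for c in result]
    merged = True
    while merged:
        merged = False
        for i in range(len(classes)):
            for j in range(i + 1, len(classes)):
                union = len(classes[i] | classes[j])
                if union and len(classes[i] & classes[j]) / union > threshold:
                    classes[i] |= classes.pop(j)
                    merged = True
                    break
            if merged:
                break
    return classes
```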
the hypergraph model device for clustering is specifically used for:
establishing a weighted hypergraph H = (V, E), wherein V is the set of vertices and E is the set of hyperedges; each hyperedge can connect two or more vertices; the vertices of the hypergraph represent the data items used for clustering, and the hyperedges represent the association among the data items represented by the vertices they connect; w(e_m) is the weight corresponding to each hyperedge e_m in E, e_m ∈ E, and w(e_m) measures the degree of correlation among the related data items connected by the hyperedge;
wherein the weight of the hyperedge e_m is:
the support of the association rule of each hyperedge e_m, used as the weight of that hyperedge;
association rule: an association rule is expressed as R: X → Y, wherein X and Y are item sets with X ∩ Y = Φ, meaning that if the item set X occurs in a transaction, the item set Y will inevitably also occur in the same transaction; X is called the antecedent of the rule, and Y is called the consequent of the rule;
support of an association rule: for the association rule R: X → Y, the support of the rule R is the ratio of the number of transactions in the database D that contain both the item set X and the item set Y to the number of all transactions.
4. The cloud computing platform-based big data clustering device according to claim 3, wherein the weight of the hyperedge e_m is:
the average confidence of all the necessary association rules of each hyperedge e_m, used as the weight of that hyperedge; a necessary association rule is a rule whose expression has only one data item set on the right-hand side and which involves all the data items connected by the hyperedge e_m.
CN201510249032.XA 2015-05-15 2015-05-15 A kind of big data clustering method and device based on cloud computing platform Active CN104820708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510249032.XA CN104820708B (en) 2015-05-15 2015-05-15 A kind of big data clustering method and device based on cloud computing platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510249032.XA CN104820708B (en) 2015-05-15 2015-05-15 A kind of big data clustering method and device based on cloud computing platform

Publications (2)

Publication Number Publication Date
CN104820708A CN104820708A (en) 2015-08-05
CN104820708B true CN104820708B (en) 2018-02-09

Family

ID=53731003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510249032.XA Active CN104820708B (en) 2015-05-15 2015-05-15 A kind of big data clustering method and device based on cloud computing platform

Country Status (1)

Country Link
CN (1) CN104820708B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809242B (en) * 2015-05-15 2018-03-02 成都睿峰科技有限公司 A kind of big data clustering method and device based on distributed frame
CN105354298A (en) * 2015-11-01 2016-02-24 长春理工大学 Hadoop based method for analyzing large-scale social network and analysis platform thereof
CN106203516B (en) * 2016-07-13 2019-04-09 中南大学 A kind of subspace clustering visual analysis method based on dimension correlation
CN106503086A (en) * 2016-10-11 2017-03-15 成都云麒麟软件有限公司 The detection method of distributed local outlier
CN106874367A (en) * 2016-12-30 2017-06-20 江苏号百信息服务有限公司 A kind of sampling distribution formula clustering method based on public sentiment platform
CN111507365A (en) * 2019-09-02 2020-08-07 中南大学 Confidence rule automatic generation method based on fuzzy clustering
CN111125198A (en) * 2019-12-27 2020-05-08 南京航空航天大学 Computer data mining clustering method based on time sequence
CN112613562B (en) * 2020-12-24 2023-05-12 广州禧闻信息技术有限公司 Data analysis system and method based on multi-center cloud computing
CN112948640B (en) * 2021-03-10 2022-03-15 成都工贸职业技术学院 Big data clustering method and system based on cloud computing platform
CN113255278B (en) * 2021-05-17 2022-07-15 福州大学 Integrated circuit clustering method based on time sequence driving
CN113988817B (en) * 2021-11-11 2024-04-12 重庆邮电大学 Dirty data cleaning method based on intelligent data platform

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809242A (en) * 2015-05-15 2015-07-29 成都睿峰科技有限公司 Distributed-structure-based big data clustering method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809242A (en) * 2015-05-15 2015-07-29 成都睿峰科技有限公司 Distributed-structure-based big data clustering method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"HGHD:一种基于超图的高维空间数据聚类算法";沙金等;《微电子学与计算机》;20061231;第23卷(第6期);全文 *
"一种基于超图模式的数据聚类方法";刘丽娜;《石家庄铁道职业技术学院学报》;20051231;第4卷(第4期);摘要、第3-4节 *
"一种基于超图模式的高维空间数据聚类的方法";张蓉;《计算机工程》;20070731;第28卷(第7期);全文 *
"基于分布式的大数据集聚类分析";贾俊芳等;《计算机工程与应用》;20081001;第44卷(第28期);摘要、第4节 *

Also Published As

Publication number Publication date
CN104820708A (en) 2015-08-05

Similar Documents

Publication Publication Date Title
CN104820708B (en) A kind of big data clustering method and device based on cloud computing platform
CN104809244B (en) Data digging method and device under a kind of big data environment
CN104809242B (en) A kind of big data clustering method and device based on distributed frame
CN104077723B (en) A kind of social networks commending system and method
Hariharakrishnan et al. Survey of pre-processing techniques for mining big data
Vijayarani et al. Research in big data: an overview
Zhang et al. Optimization and improvement of data mining algorithm based on efficient incremental kernel fuzzy clustering for large data
Salloum et al. An asymptotic ensemble learning framework for big data analysis
Sun et al. Data intensive parallel feature selection method study
Satish et al. Big data processing with harnessing hadoop-MapReduce for optimizing analytical workloads
Karim et al. Spatiotemporal Aspects of Big Data.
Ibrahim Hayatu et al. Big data clustering techniques: Recent advances and survey
Ali et al. Distributed data mining systems: techniques, approaches and algorithms
Vanka et al. Big data technologies: a case study
Hanmanthu et al. Parallel optimal grid-clustering algorithm exploration on mapreduce framework
Tazeen et al. A Survey on Some Big Data Applications Tools and Technologies
Kaur et al. Comparison study of big data processing systems for IoT cloud environment
Pratap Analysis of big data technology and its challenges
Li et al. DSS: a scalable and efficient stratified sampling algorithm for large-scale datasets
Soltani et al. MovePattern: Interactive framework to provide scalable visualization of movement patterns
Al-Shomar et al. A novel framework for remote management of social media big data analytics
Narayanapppa et al. Need of Hadoop and Map Reduce for Processing and Managing Big Data
Gautam et al. A review of big data environment, tools and challenges
Singh et al. Big-Data-Based Techniques for Predictive Intelligence
Vignesh Predictive analytics in data mining with big data: A literature survey

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190729

Address after: 400000 Chongqing Jiangbei District Haier Road 319 2-2-1-61 (Two Road Cuntan Bonded Port Area)

Patentee after: Chongqing steady Technology Co., Ltd.

Address before: 610041 East Building, Ladfans Building, 1480 Tianfu Avenue North Section, Chengdu High-tech Zone, Sichuan Province, 10th floor

Patentee before: Chengdu Rui Feng Science and Technology Ltd.

TR01 Transfer of patent right

Effective date of registration: 20190829

Address after: 530000 Room 5308, 5th floor, Kechuang Building, 25-1 Keyuan Avenue, Nanning City, Guangxi Zhuang Autonomous Region

Patentee after: Nanning Kehang Jinqiao Enterprise Consulting Co., Ltd.

Address before: 400000 Chongqing Jiangbei District Haier Road 319 2-2-1-61 (Two Road Cuntan Bonded Port Area)

Patentee before: Chongqing steady Technology Co., Ltd.

TR01 Transfer of patent right

Effective date of registration: 20190926

Address after: 530007 No. 1 Headquarters Road, Nanning City, Guangxi Zhuang Autonomous Region, China-ASEAN Science and Technology Enterprise Incubation Base Phase I C1 Building 607

Patentee after: Nanning First Station Network Technology Co., Ltd.

Address before: 530000 Room 5308, 5th floor, Kechuang Building, 25-1 Keyuan Avenue, Nanning City, Guangxi Zhuang Autonomous Region

Patentee before: Nanning Kehang Jinqiao Enterprise Consulting Co., Ltd.

TR01 Transfer of patent right