CN116383464A

CN116383464A - Correlation big data clustering method and device based on stream computing

Info

Publication number: CN116383464A
Application number: CN202310375886.7A
Authority: CN
Inventors: 李佳; 刘晓蕾
Original assignee: Telephase Technology Development Beijing Co ltd
Current assignee: Telephase Technology Development Beijing Co ltd
Priority date: 2023-04-10
Filing date: 2023-04-10
Publication date: 2023-07-04

Abstract

The invention relates to the technical field of streaming computing, and discloses a method and a device for clustering relevance big data based on streaming computing, wherein the method comprises the following steps: generating a data tuple of preset streaming data according to the time stamp and the characterization data item; performing coarse clustering on the streaming data according to the vector tuples to obtain a coarse aggregation data set; calculating a first center position of each coarse aggregation subset in the coarse aggregation data set; calculating clustering distances between the real-time streaming data and the first center position one by one, carrying out fine clustering on the real-time streaming data according to the clustering distances to obtain a fine-aggregate data set, and carrying out time identification on the fine-aggregate data set to obtain an identification fine-aggregate data set; determining a second center position of the fine-aggregate data set according to the identification fine-aggregate data set, determining the number of data tuples in the fine-aggregate data set according to the identification fine-aggregate data set, and clustering the relevance big data according to the second center position and the number of the data tuples. The invention can improve the accuracy of the relevance big data clustering.

Description

Correlation big data clustering method and device based on stream computing

Technical Field

The present invention relates to the field of streaming computing technologies, and in particular, to a method and an apparatus for clustering relevance big data based on streaming computing.

Background

With the development of network and information technology, data is no longer expressed only in static forms such as files and databases. Real-time streaming data guarantees the continuity and integrity of data writing and greatly improves the efficiency of information writing; however, clustering real-time streaming data with high quality and high efficiency requires repeated iterative analysis of the data.

Existing relevance big data clustering techniques first store the streaming data and then cluster it with a clustering algorithm. In practice, however, streaming data is generated dynamically in real time. Clustering only the statically stored streaming data makes the clustering approach too limited, so the accuracy of relevance big data clustering is low.

Disclosure of Invention

The invention provides a relevance big data clustering method and device based on stream computing, and mainly aims to solve the problem of low accuracy in relevance big data clustering.

In order to achieve the above object, the present invention provides a method for clustering big data of relevance based on stream computation, comprising:

S1, acquiring preset streaming data, and generating a data tuple of the streaming data according to a preset time stamp and a preset characterization data item;

S2, carrying out vector conversion on the data tuples to obtain vector tuples, and carrying out coarse clustering on the streaming data according to the vector tuples by using a preset real-time coarse clustering algorithm to obtain a coarse aggregation data set;

S3, carrying out subset division on the coarse aggregation data set to obtain a divided coarse aggregation data set, and calculating a first center position of each coarse aggregation subset in the divided coarse aggregation data set by using a preset center algorithm;

S4, calculating the clustering distance between the preset real-time streaming data and the first center position one by one, carrying out fine clustering on the real-time streaming data according to the clustering distance by using a preset clustering algorithm to obtain a fine-aggregate data set, and carrying out time identification on the fine-aggregate data set to obtain an identification fine-aggregate data set, wherein the step of calculating the clustering distance between the preset real-time streaming data and the first center position one by one comprises the following steps:

S41, carrying out vector conversion on the preset real-time streaming data to obtain a real-time data vector;

S42, unifying vector dimensions of the real-time data vectors to obtain unified real-time data vectors;

S43, calculating the clustering distance between the unified real-time data vector and the first center position one by using the following distance formula:

wherein $D_{uv}$ is the clustering distance between the $u$-th unified real-time data vector and the $v$-th first center position, $x_u$ is the abscissa of the $u$-th unified real-time data vector, $y_u$ is the ordinate of the $u$-th unified real-time data vector, $x_v$ is the abscissa of the $v$-th first center position, $y_v$ is the ordinate of the $v$-th first center position, $m$ is the number of unified real-time data vectors, and $k$ is the number of first center positions;

S5, determining a second center position of the fine-aggregate data set according to the identification fine-aggregate data set, determining the number of data tuples in the fine-aggregate data set according to the identification fine-aggregate data set, and clustering the relevance big data according to the second center position and the number of the data tuples.

Optionally, the generating the data tuple of the streaming data according to the preset timestamp and the preset characterization data item includes:

determining a data item field corresponding to the streaming data according to the characterization data item;

generating a data tuple of the streaming data according to the timestamp and the data item field, wherein the data tuple is:

$$\mathrm{tuple}(t) = \langle p_1(t), p_2(t), \ldots, p_n(t) \rangle$$

wherein $\mathrm{tuple}(t)$ is the data tuple with timestamp $t$, and $p_n(t)$ is the $n$-th data item field when the timestamp is $t$.

Optionally, the coarse clustering of the streaming data according to the vector tuple by using a preset real-time coarse clustering algorithm to obtain a coarse aggregation data set includes:

acquiring a preset first distance threshold and a preset second distance threshold;

any vector tuple is selected as a target center point, and a target distance between the target center point and a preset target set is calculated;

adding the target center point to the target set when the target distance is smaller than the first distance threshold, and deleting the target center point when the target distance is smaller than the second distance threshold;

when vector tuples remain, returning to the step of selecting any vector tuple as a target center point, until no vector tuples remain;

and when no vector tuples remain, generating the coarse aggregation data set according to the target set by utilizing the real-time coarse clustering algorithm.

Optionally, the acquiring the preset first distance threshold and the preset second distance threshold includes:

obtaining a maximum dimension value and a minimum dimension value in the vector tuple;

calculating a dimension standard deviation according to the maximum dimension value and the minimum dimension value;

calculating the first distance threshold according to the maximum dimension value, the minimum dimension value and the dimension standard deviation by using the following distance threshold calculation formula:

wherein $T_1$ is the first distance threshold, $\max^p$ is the maximum dimension value of the $p$-th dimension in the vector tuple, $\min^p$ is the minimum dimension value of the $p$-th dimension in the vector tuple, $S^p$ is the dimension standard deviation of the $p$-th dimension in the vector tuple, $n$ is the dimension, and $u_1$ is the first distance weight coefficient;

determining the second distance threshold according to the first distance threshold, wherein the second distance threshold is:

$$T_2 = u_2 T_1$$

wherein $T_2$ is the second distance threshold, $T_1$ is the first distance threshold, and $u_2$ is the second distance weight coefficient.

Optionally, the sub-dividing the coarse aggregation data set to obtain a divided coarse aggregation data set includes:

dividing the coarse aggregation data set by utilizing a preset sliding window to obtain an initial divided data set;

extracting an inflow time point and an outflow time point of each data tuple in the initial partition data set;

calculating the average sliding time of the sliding window according to the inflow time point and the outflow time point, and updating the sliding window according to the average sliding time to obtain an updated sliding window;

and carrying out division updating on the initial divided data set according to the updated sliding window to obtain the divided coarse aggregation data set.

Optionally, the calculating, with a preset center algorithm, the first center position of each coarse aggregation subset in the partitioned coarse aggregation data set includes:

any one data tuple in the coarse aggregation subset is selected as an initial clustering center;

calculating the similarity between the unselected data tuples and the initial clustering center;

dividing the data tuples into cluster sets corresponding to the initial cluster centers according to the similarity, and calculating cluster average values of all the data tuples in the cluster sets;

calculating a first center position of each coarse aggregation subset according to the clustering mean value by using the center algorithm, wherein the center algorithm is as follows:

wherein $h_i$ is the first center position of the $i$-th coarse aggregation subset, $F_i$ is the associated feature in the $i$-th coarse aggregation subset, and $m_i$ is the cluster mean of the $i$-th coarse aggregation subset.

Optionally, the performing time identification on the fine aggregate data set to obtain an identified fine aggregate data set includes:

calculating an average timestamp in the fine aggregate dataset using the time formula:

wherein $S_i$ is the average timestamp in the $i$-th fine-aggregate data set, $G_i$ is the associated feature of the $i$-th fine-aggregate data set, $t$ is the total timestamp, and $s_n$ is the number of fine-aggregate data sets;

calculating the standard deviation of the time stamp in the fine-aggregate data set according to the average time stamp, wherein the calculation formula of the standard deviation of the time stamp is as follows:

wherein $S_\sigma$ is the timestamp standard deviation, $G_i$ is the associated feature of the $i$-th fine-aggregate data set, $t$ is the total timestamp, $s_n$ is the number of fine-aggregate data sets, and $S_i$ is the average timestamp of the $i$-th fine-aggregate data set;

and carrying out time identification on the fine aggregate data set according to the standard deviation of the time stamp to obtain the identification fine aggregate data set.
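As an illustration only, the time identification step might be sketched in Python as follows. The time formula and the standard deviation formula are not reproduced in this text, so the sketch assumes a plain arithmetic mean and a population standard deviation of the timestamps and leaves out the associated feature $G_i$. The name `time_identify` and the label keys are hypothetical.

```python
import statistics
from typing import Dict, List


def time_identify(timestamps: List[float]) -> Dict[str, float]:
    """Label one fine-aggregate data set with timestamp statistics.

    Assumption: plain mean and population standard deviation stand in
    for the formulas that are missing from the text.
    """
    mean_ts = statistics.fmean(timestamps)
    std_ts = statistics.pstdev(timestamps, mu=mean_ts)
    return {"mean_timestamp": mean_ts, "timestamp_std": std_ts}


label = time_identify([2, 4, 4, 4, 5, 5, 7, 9])
```

The returned dictionary plays the role of the time identification attached to a fine-aggregate data set; a small standard deviation would indicate tuples that arrived close together in time.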

Optionally, the determining a second center position of the fine-aggregate dataset according to the identifying fine-aggregate dataset includes:

calculating the local density of each data tuple in the identification fine-aggregate data set one by one;

and selecting the data tuple with the maximum local density as a second center position of the fine aggregate data set.
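A minimal Python sketch of selecting the second center position is given below. The text does not define local density, so the sketch assumes it is the number of other data tuples lying within a cutoff radius. The names `local_density` and `second_center` and the `cutoff` parameter are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def local_density(points: List[Point], index: int, cutoff: float) -> int:
    """Assumed definition: count of other tuples closer than `cutoff`."""
    p = points[index]
    return sum(
        1
        for j, q in enumerate(points)
        if j != index and math.dist(p, q) < cutoff
    )


def second_center(points: List[Point], cutoff: float) -> Point:
    """Return the tuple with the largest local density (first one on ties)."""
    densities = [local_density(points, i, cutoff) for i in range(len(points))]
    return points[densities.index(max(densities))]
```

For example, `second_center([(0, 0), (1, 0), (0, 1), (5, 5)], cutoff=1.2)` selects `(0, 0)`, since it is the only tuple with two neighbours inside the cutoff radius.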

Optionally, the clustering the relevance big data according to the second center position and the number of the data tuples includes:

calculating a real-time distance between the real-time streaming data and the second central location;

counting sequence number identifiers of the real-time streaming data;

and clustering the real-time streaming data into a target set corresponding to the second center position when the real-time distance is smaller than a preset distance threshold value and the sequence number identification is smaller than the number of the data tuples, so as to obtain a final clustering set.
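The final clustering rule above can be sketched in Python as follows. The sketch assumes a Euclidean real-time distance and a 0-based sequence number identifier taken from the arrival order; neither detail is specified in the text, and the name `final_cluster` is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def final_cluster(stream: List[Point], center: Point,
                  distance_threshold: float, tuple_count: int) -> List[Point]:
    """Collect stream items that satisfy both conditions of the rule."""
    target_set = []
    for seq, point in enumerate(stream):  # seq: assumed sequence number identifier
        close_enough = math.dist(point, center) < distance_threshold
        if close_enough and seq < tuple_count:
            target_set.append(point)
    return target_set
```

With `stream = [(0, 0), (1, 1), (9, 9), (0.5, 0.5)]`, `center = (0, 0)`, a distance threshold of 2 and a tuple count of 3, the third item is rejected for distance and the fourth for its sequence number.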

In order to solve the above problems, the present invention further provides a device for clustering big data of relevance based on stream computation, the device comprising:

the data tuple generation module is used for acquiring preset streaming data and generating a data tuple of the streaming data according to a preset time stamp and a preset characterization data item;

the stream data coarse clustering module is used for carrying out vector conversion on the data tuples to obtain vector tuples, and carrying out coarse clustering on the stream data according to the vector tuples by utilizing a preset real-time coarse clustering algorithm to obtain a coarse aggregation data set;

the first center position calculation module is used for carrying out subset division on the coarse aggregation data set to obtain a divided coarse aggregation data set, and calculating a first center position of each coarse aggregation subset in the divided coarse aggregation data set by using a preset center algorithm;

the stream data fine clustering module is used for calculating the clustering distance between preset real-time stream data and the first center position one by one, carrying out fine clustering on the real-time stream data according to the clustering distance by using a preset clustering algorithm to obtain a fine-aggregate data set, and carrying out time identification on the fine-aggregate data set to obtain an identification fine-aggregate data set;

and the data clustering module is used for determining a second center position of the fine-aggregate data set according to the identification fine-aggregate data set, determining the number of data tuples in the fine-aggregate data set according to the identification fine-aggregate data set, and clustering the relevance big data according to the second center position and the number of the data tuples.

According to the embodiment of the invention, the streaming data is represented by the data tuples through the preset time stamp and the representation data item, so that the data tuples are subjected to coarse clustering to obtain a coarse aggregation data set, and the data tuples arriving in real time are subjected to coarse clustering, so that the initial coarse clustering of the data tuples through a small number of data dimensions is facilitated, and the data clustering efficiency is improved; and a fine clustering process is carried out on the coarse-aggregate data set, the clustering distance between the real-time streaming data and the central position is calculated through a clustering algorithm, the fine-aggregate data set of the real-time streaming data is further determined according to the clustering distance, and the relevance big data is clustered according to the second central position of the fine-aggregate data set and the number of the data tuples, so that the accuracy of data clustering is improved. Therefore, the relevance big data clustering method and device based on stream computing can solve the problem of lower accuracy in relevance big data clustering.

Drawings

FIG. 1 is a flow chart of a method for clustering big data of relevance based on stream computation according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of coarse clustering streaming data according to an embodiment of the present invention;

FIG. 3 is a flow chart illustrating a method for partitioning a coarse aggregate data set according to an embodiment of the present invention;

FIG. 4 is a functional block diagram of a correlation big data clustering device based on stream computation according to an embodiment of the present invention;

The achievement of the objects, functional features and advantages of the present invention will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The embodiment of the application provides a relevance big data clustering method based on stream computing. The execution subject of the method includes, but is not limited to, at least one of a server, a terminal and the like that can be configured to execute the method provided by the embodiment of the application. In other words, the method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, or a cloud server cluster. The server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.

Referring to FIG. 1, a flow chart of a relevance big data clustering method based on stream computing according to an embodiment of the present invention is shown. In this embodiment, the relevance big data clustering method based on stream computing includes:

S1, acquiring preset streaming data, and generating a data tuple of the streaming data according to a preset time stamp and a preset characterization data item.

In the embodiment of the invention, the streaming data refers to a data sequence consisting of data which is infinitely increased and continued according to time sequence, the streaming data is represented as a continuous, uninterrupted and unstructured data message queue, and individual data items in the streaming data appear in the form of tuples, so that the streaming data can be regarded as a directed unbounded data stream consisting of tuple units.

In detail, the preset streaming data can be obtained through computer statements (such as Java statements or Python statements) that have a data-capture function.

In the embodiment of the present invention, the data tuples are elements ordered in a specific time sequence, and each data tuple in the data stream can be regarded as consisting of two parts, a timestamp and a characterization data item, arranged in a fixed order.

In the embodiment of the present invention, the generating the data tuple of the streaming data according to the preset timestamp and the preset characterization data item includes:

$$\mathrm{tuple}(t) = \langle p_1(t), p_2(t), \ldots, p_n(t) \rangle$$

In detail, the data item fields include data attributes, data elements, data arrival times, data structures and the like, and the data item fields contained in a data tuple can be determined from the characterizing features in the characterization data item. Furthermore, a data tuple generally consists of a number of data items and their actual content. Each data tuple is composed of $n$ different data items, the timestamp $t$ identifies the arrival time of the data stream, and the data tuple at each timestamp $t$ can be expressed as $\mathrm{tuple}(t) = \langle p_1(t), p_2(t), \ldots, p_n(t) \rangle$.
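For illustration, a data tuple made of a timestamp and n data item fields could be represented in Python as below. This is only a sketch: the class `DataTuple`, the function `make_data_tuple`, and the sample record are hypothetical, and missing fields are filled with `None` as a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Sequence, Tuple


@dataclass(frozen=True)
class DataTuple:
    """One stream element: a timestamp t plus its n data item fields."""
    timestamp: float
    fields: Tuple[Any, ...]


def make_data_tuple(record: Mapping[str, Any], timestamp: float,
                    item_names: Sequence[str]) -> DataTuple:
    """Pick the fields named by the characterization data item, in order."""
    # Assumption: a field absent from the record is kept as None,
    # so every tuple has exactly n entries.
    return DataTuple(timestamp, tuple(record.get(name) for name in item_names))


record = {"sensor": "s-17", "value": 3.2, "unit": "kPa", "debug": "ignored"}
example = make_data_tuple(record, timestamp=1681084800.0,
                          item_names=["sensor", "value", "unit"])
```

Here `item_names` stands in for the data item fields determined from the characterization data item, and fields that are not listed (such as `debug`) are dropped.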

Specifically, compared with static data, relevance real-time streaming data has the following characteristics: time ordering, in that the real-time streaming data arrives at the system in time order and the arrival order is independent; real-time behavior, in that each tuple in the real-time streaming data reaches the system in real time and changes continuously over time; and unboundedness, in that the real-time streaming data reaches the system without interruption, its scale is huge, and its maximum size cannot be predicted. Accordingly, the data tuples arriving in real time need to be processed accordingly to achieve relevance data clustering.

S2, carrying out vector conversion on the data tuples to obtain vector tuples, and carrying out coarse clustering on the stream data according to the vector tuples by using a preset real-time coarse clustering algorithm to obtain a coarse aggregation data set.

In the embodiment of the invention, the vectorization processing is carried out on the data tuples in the relevance real-time streaming big data in the cloud environment, so that the distance between each data tuple in the streaming data can be accurately calculated later.

In detail, the data tuples may be vector-converted by a preset vector conversion model to obtain vector tuples, where the vector conversion model includes, but is not limited to, the word2vec model and the BERT model.

In the embodiment of the invention, the coarse aggregation data set is a set of mutually distinct macro clusters, each of which records only its corresponding features. Because coarse aggregation is performed in real time, the clustering is implemented with an efficient algorithm, and the macro clusters are obtained quickly from the calculated related clusters, thereby producing the coarse aggregation data set.

In the embodiment of the present invention, referring to fig. 2, the coarse clustering of the streaming data according to the vector tuple by using a preset real-time coarse clustering algorithm to obtain a coarse aggregation data set includes:

S21, acquiring a preset first distance threshold and a preset second distance threshold;

S22, any vector tuple is selected as a target center point, and a target distance between the target center point and a preset target set is calculated;

S23, adding the target center point to the target set when the target distance is smaller than the first distance threshold, and deleting the target center point when the target distance is smaller than the second distance threshold;

S24, when vector tuples remain, returning to the step of selecting any vector tuple as a target center point, until no vector tuples remain;

S25, when no vector tuples remain, generating the coarse aggregation data set according to the target set by utilizing the real-time coarse clustering algorithm.

In detail, the first distance threshold and the second distance threshold may be preset according to the situation, or may be dynamically set according to the statistical characteristics of the data set.

In the embodiment of the present invention, the obtaining the preset first distance threshold and the preset second distance threshold includes:

wherein $T_1$ is the first distance threshold, $\max^p$ is the maximum dimension value of the $p$-th dimension in the vector tuple, $\min^p$ is the minimum dimension value of the $p$-th dimension in the vector tuple, $S^p$ is the dimension standard deviation of the $p$-th dimension in the vector tuple, $n$ is the dimension, and $u_1$ is the first distance weight coefficient;

$$T_2 = u_2 T_1$$

In detail, when vector conversion is performed on the data tuples, the dimensions of the resulting vectors may be inconsistent. Therefore, based on the statistical characteristics of the vector tuples, the largest vector value in each dimension (the maximum dimension value) and the smallest vector value in each dimension (the minimum dimension value) are collected, and the dimension standard deviation is calculated from the maximum dimension value and the minimum dimension value in each dimension. In addition, the user-defined first distance weight coefficient and second distance weight coefficient help ensure the accuracy of the distance threshold calculation, wherein 01.

Specifically, any vector tuple is selected as a target center point, and the target distance between the target center point and the center point of a preset target set is calculated. If no target set currently exists, the target center point itself is taken as a target set. If the distance between the target center point and the target set is within the first distance threshold, the target center point is added to the target set. If the distance is within the second distance threshold, the target center point is deleted; that is, the target center point is considered close enough to the target set that it can no longer serve as the center of a target set.

Further, while vector tuples remain, the method returns to the step of selecting any vector tuple as a target center point, continues to calculate the distance between each vector tuple and the target set, and decides whether the target center point corresponding to the vector tuple is added or deleted, until no vector tuples remain. When no vector tuples remain, the coarse aggregation data set is generated from the target set by the real-time coarse clustering algorithm; that is, the target center points added to the target set are taken as the coarse aggregation data set. The real-time coarse clustering algorithm is a fast clustering algorithm designed for massive data. The sample data set is first divided into a number of partially overlapping subsets using a distance measure of low computational complexity. Finer clusters are then generated from the cluster information in the target set using a distance measure of higher complexity, so that traditional clustering methods can meet the clustering performance requirements of high-dimensional massive data sources.
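One way to read steps S21 to S25 is as a canopy-style procedure, sketched below in Python. This is an interpretation rather than the algorithm as claimed: the text does not name the algorithm, the Euclidean distance is an assumption, and the sketch further assumes that the first distance threshold `t1` is the looser one (`t1 >= t2`). The function name `coarse_cluster` is hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def coarse_cluster(vectors: List[Vector], t1: float, t2: float) -> List[Dict]:
    """Canopy-style coarse clustering; assumes t1 >= t2."""
    remaining = list(vectors)
    target_sets: List[Dict] = []
    while remaining:                    # S24: repeat while vector tuples remain
        center = remaining.pop(0)       # S22: pick a tuple as target center point
        members = [center]
        still_remaining = []
        for v in remaining:
            d = math.dist(center, v)
            if d < t1:                  # S23: close enough to join this target set
                members.append(v)
            if d >= t2:                 # tuples within t2 are deleted from the pool
                still_remaining.append(v)
        remaining = still_remaining
        target_sets.append({"center": center, "members": members})
    return target_sets                  # S25: the coarse aggregation data set
```

Because a tuple that lies between `t2` and `t1` from a center joins that target set and also stays in the pool, the resulting subsets can overlap, which matches the description of partially overlapping subsets.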

Further, after the coarse aggregate data set is determined, a final clustering result needs to be obtained according to the related information of the macro clusters generated in the coarse clusters, so that further analysis is performed on the coarse aggregate data set to improve the accuracy of data clustering.

S3, carrying out subset division on the coarse aggregation data set to obtain a divided coarse aggregation data set, and calculating a first center position of each coarse aggregation subset in the divided coarse aggregation data set by using a preset center algorithm.

In the embodiment of the invention, if a data set containing n associated real-time streaming data objects exists in a cloud computing environment, the data set is divided into k subsets, and each subset represents a cluster, so that the coarse aggregation data set needs to be divided into subsets to prepare for fine aggregation division.

In the embodiment of the present invention, referring to fig. 3, the performing subset division on the coarse aggregate data set to obtain a divided coarse aggregate data set includes:

S31, dividing the coarse aggregation data set by utilizing a preset sliding window to obtain an initial divided data set;

S32, extracting an inflow time point and an outflow time point of each data tuple in the initial divided data set;

S33, calculating the average sliding time of the sliding window according to the inflow time point and the outflow time point, and updating the sliding window according to the average sliding time to obtain an updated sliding window;

S34, carrying out division updating on the initial divided data set according to the updated sliding window to obtain the divided coarse aggregation data set.

In detail, the size of the sliding window is set first, and the coarse aggregation data set is divided according to that size to obtain an initial divided data set based on a fixed sliding window. Then, using the data inflow time point and outflow time point marked in each data tuple of the initial divided data set, the average sliding time of the sliding window is determined according to the ratio of the sum of the inflow time point and the outflow time point to the sliding window time.

Specifically, the fixed sliding window time is replaced by a dynamic sliding window time. The sliding time of each window is set to its corresponding average sliding time, and the initial divided data set is then divided again according to the average sliding time, so that a more accurate divided coarse aggregation data set is obtained.
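The window update in steps S31 to S34 could be sketched in Python as follows. The exact formula for the average sliding time is not given in this text, so the sketch assumes it is the mean time each tuple spends between inflow and outflow; it also uses non-overlapping windows keyed by inflow time for simplicity. The names `TimedTuple`, `partition`, `average_sliding_time` and `repartition` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class TimedTuple(NamedTuple):
    inflow: float    # time the tuple entered the window
    outflow: float   # time the tuple left the window
    payload: object


def partition(items: List[TimedTuple], window: float) -> Dict[int, List[TimedTuple]]:
    """Group tuples into consecutive windows of the given length."""
    if window <= 0:
        raise ValueError("window must be positive")
    buckets: Dict[int, List[TimedTuple]] = {}
    for item in items:
        buckets.setdefault(int(item.inflow // window), []).append(item)
    return buckets


def average_sliding_time(items: List[TimedTuple]) -> float:
    """Assumed definition: mean of (outflow - inflow) over all tuples."""
    return sum(i.outflow - i.inflow for i in items) / len(items)


def repartition(items: List[TimedTuple],
                initial_window: float) -> Tuple[float, Dict[int, List[TimedTuple]]]:
    initial = partition(items, initial_window)                  # S31
    flat = [i for bucket in initial.values() for i in bucket]   # S32
    updated_window = average_sliding_time(flat)                 # S33
    return updated_window, partition(flat, updated_window)      # S34
```

The sketch expects a non-empty input; an empty list would need separate handling before the average is taken.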

Further, in order to perform more accurate data clustering on the coarse-aggregate data set, the coarse-aggregate data set is divided into a plurality of subsets, and then a clustering center of each subset needs to be determined so as to realize finer clustering.

In the embodiment of the invention, the center algorithm randomly selects k tuples from the n data tuples as initial clustering centers. Each of the remaining tuples is assigned to the corresponding cluster according to its similarity to the clustering centers, the mean of all objects in each cluster is calculated, and the resulting mean is taken as the cluster center.

In the embodiment of the present invention, the calculating, by using a preset center algorithm, the first center position of each rough aggregation subset in the partitioned rough aggregation data set includes:

In detail, a plurality of different data tuples are selected from the coarse aggregation subset as initial clustering centers thereof, and similarities between the unselected data tuples and the initial clustering centers corresponding to the selected data tuples are calculated respectively, wherein the similarities can be calculated by using a preset similarity algorithm, and the similarity algorithm comprises, but is not limited to, a cosine distance algorithm, a Euclidean distance algorithm and the like.

Specifically, according to the similarity between each data tuple and the initial clustering centers, the data tuples are divided into the corresponding cluster sets, the cluster mean of all tuples in each cluster set is calculated, and the center position of each coarse aggregation subset is determined from the cluster mean. The associated features in each coarse aggregation subset are determined from the similarity, specifically from the variance of the similarity.
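The assign-then-average part of the center algorithm can be illustrated with the short Python sketch below. It covers only the cluster mean $m_i$; the weighting by the associated feature $F_i$ is left out because the center formula is not reproduced in this text. Euclidean distance is used as the similarity measure, which is one of the options the description names, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> Tuple[float, ...]:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def cluster_means(subset: List[Vector],
                  center_indices: List[int]) -> List[Tuple[float, ...]]:
    """Assign every unselected tuple to its nearest initial center, then average."""
    centers = [subset[i] for i in center_indices]
    clusters = [[c] for c in centers]       # each center starts its own cluster set
    chosen = set(center_indices)
    for index, vector in enumerate(subset):
        if index in chosen:
            continue
        nearest = min(range(len(centers)),
                      key=lambda i: math.dist(vector, centers[i]))
        clusters[nearest].append(vector)
    return [mean_vector(cluster) for cluster in clusters]
```

For instance, with `subset = [(0, 0), (0, 2), (10, 10), (10, 12)]` and `center_indices = [0, 2]`, the two cluster means are `(0.0, 1.0)` and `(10.0, 11.0)`.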

Further, the first central position of each coarse aggregation subset is used as an initial value in the fine clustering process, and data are clustered more finely according to the initial central position of the coarse aggregation subset, so that the accuracy of data clustering is ensured.

S4, calculating clustering distances between preset real-time streaming data and the first center position one by one, carrying out fine clustering on the real-time streaming data according to the clustering distances by using a preset clustering algorithm to obtain a fine-aggregate data set, and carrying out time identification on the fine-aggregate data set to obtain an identification fine-aggregate data set.

In the embodiment of the invention, the clustering distance refers to the calculated distance between the real-time streaming data and a first center position.

In the embodiment of the present invention, the calculating the clustering distance between the preset real-time streaming data and the first center position one by one includes:

performing vector conversion on the preset real-time streaming data to obtain a real-time data vector;

unifying vector dimensions of the real-time data vectors to obtain unified real-time data vectors;

calculating the clustering distance between each unified real-time data vector and each first center position one by one using the following distance formula:

wherein $D_{uv}$ is the clustering distance between the u-th unified real-time data vector and the v-th first center position, $x_u$ is the abscissa of the u-th unified real-time data vector, $y_u$ is the ordinate of the u-th unified real-time data vector, $x_v$ is the abscissa of the v-th first center position, $y_v$ is the ordinate of the v-th first center position, m is the number of unified real-time data vectors, and k is the number of first center positions.

In detail, the real-time streaming data is first subjected to vector conversion so that its data structure is unified and its distance to the first center position can be calculated. The conversion can be performed by a preset vector conversion model to obtain vector tuples, wherein the vector conversion model includes but is not limited to a word2vec model and a BERT model.

Specifically, vector conversion of real-time streaming data produces vectors whose dimensions are not uniform, so each real-time data vector needs to be unified into a two-dimensional vector: a zero is appended to a one-dimensional vector to form a two-dimensional vector, and for a vector with more than two dimensions the values of the extra dimensions are set to zero to form a two-dimensional vector, yielding the unified real-time data vector. The clustering distance between each streaming data item that arrives in real time and each first center position is then calculated one by one using the distance formula.
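The dimension unification and the one-by-one distance calculation can be sketched as follows. The distance formula itself is not reproduced in this text, so the sketch assumes a plain two-dimensional Euclidean distance over the abscissa and ordinate; it also assumes that zeroing the extra dimensions amounts to keeping the first two components. The names `unify_to_2d` and `clustering_distances` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def unify_to_2d(vec: Sequence[float]) -> Point:
    """Pad a short vector with zeros; keep only the first two components otherwise."""
    head = [float(x) for x in vec[:2]]
    head += [0.0] * (2 - len(head))
    return (head[0], head[1])


def clustering_distances(vectors: Sequence[Sequence[float]],
                         first_centers: Sequence[Point]) -> List[List[float]]:
    """Return an m x k matrix whose entry [u][v] is the distance D_uv.

    Assumption: D_uv is the Euclidean distance between the u-th unified
    vector (x_u, y_u) and the v-th first center position (x_v, y_v).
    """
    unified = [unify_to_2d(v) for v in vectors]
    return [[math.hypot(xu - xv, yu - yv) for (xv, yv) in first_centers]
            for (xu, yu) in unified]
```

For m real-time vectors and k first center positions, the result holds one row per vector and one column per center.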

Further, the clustering distance between each real-time streaming data item and each first center position is calculated, and the real-time streaming data is assigned in sequence to the cluster whose first center position is nearest, thereby realizing fine clustering of the real-time streaming data.

In the embodiment of the invention, the clustering algorithm uses the first center position of each coarse aggregation subset as its initial value, so that neither the number of clusters nor the initial center points need to be set during fine clustering and the clustering step does not need to be iterated, which improves clustering efficiency. Using the clustering algorithm, the real-time streaming data is divided, according to the nearest clustering distance, into the clustering set corresponding to the nearest first center position, thereby obtaining the fine-aggregate data set.
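A minimal sketch of this non-iterative assignment is given below, assuming already unified two-dimensional points and Euclidean distance. The function name `fine_cluster` and the dictionary layout of the result are illustrative assumptions.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

Point = Tuple[float, float]


def fine_cluster(stream: Iterable[Point],
                 first_centers: Sequence[Point]) -> Dict[int, List[Point]]:
    """Single pass: place each arriving point in the set of its nearest first center."""
    clusters: Dict[int, List[Point]] = {v: [] for v in range(len(first_centers))}
    for point in stream:
        # The centers are fixed, so no re-estimation or iteration is performed.
        nearest = min(range(len(first_centers)),
                      key=lambda v: math.dist(point, first_centers[v]))
        clusters[nearest].append(point)
    return clusters
```

Because the centers come from the coarse stage, each point is handled once on arrival, which is what makes the step suitable for a stream.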

Further, because the arrival of real-time streaming data is subject to unavoidable delays, the time identification must be performed according to the characteristics of the cluster so that the identified time offsets the delay.

In the embodiment of the present invention, the performing time identification on the fine aggregate data set to obtain an identified fine aggregate data set includes:

In detail, each fine-aggregate data set is time-stamped. Although the data tuples in the streaming data arrive over time, the real-time streaming data has a certain delay in its streaming time, and clusters of relevance data likewise have a certain time delay. Each fine-aggregate data set is therefore identified by taking its average timestamp and its timestamp standard deviation as the identification time.

Specifically, the average timestamp of each fine-aggregate data set can be calculated from the ratio of the correlation characteristic of the data tuples in that fine-aggregate data set to the total time of all fine-aggregate data sets, and the timestamp standard deviation of the fine-aggregate data set can be calculated from the average timestamp, wherein the correlation characteristic of the data tuples is determined according to the average value of the similarity among the data tuples in each fine-aggregate data set.
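The time identification can be illustrated with a simplified sketch. It assumes a plain arithmetic mean of the tuple timestamps rather than the ratio-based calculation described above, whose exact form is not given in this text, and it uses the population standard deviation. The function name `identify_time` and the result keys are hypothetical.

```python
from statistics import fmean, pstdev
from typing import Dict, Sequence


def identify_time(timestamps: Sequence[float]) -> Dict[str, float]:
    """Label one fine-aggregate data set with its mean timestamp and spread.

    Assumption: an unweighted mean stands in for the average timestamp.
    """
    mean_ts = fmean(timestamps)
    # The standard deviation is computed around the average timestamp.
    std_ts = pstdev(timestamps, mu=mean_ts)
    return {"mean_timestamp": mean_ts, "timestamp_std": std_ts}
```

A small standard deviation indicates that the tuples of a set arrived close together in time, while a large one indicates that the set spans delayed arrivals.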

Further, after time identification is performed on each fine aggregate data set, the center position in each fine aggregate data set and the number of data tuples in the fine aggregate data set need to be analyzed so as to perform data clustering more accurately.

In the embodiment of the invention, the identified fine-aggregate data set includes the timestamp standard deviation of each fine-aggregate data set. The arrival time of each real-time streaming data item can be determined more accurately according to the timestamp standard deviation, which improves data clustering efficiency, and the center position of the data clustering is determined according to the identified fine-aggregate data set.

In an embodiment of the present invention, the determining, according to the identified fine-aggregate data set, the second center position of the fine-aggregate data set includes:

and selecting the data tuple with the maximum local density as the second center position of the fine-aggregate data set.

In detail, a cluster center has a higher density than its neighboring samples, lies at the center of a dense area, and is thereby distinguished from other centers. Thus, the local density of each data tuple is calculated from the percentage of data tuples in the identified fine-aggregate data set that fall within its nearest neighborhood, and the data tuple with the largest local density is selected as the second center position of that fine-aggregate data set.
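A minimal sketch of the density-based selection follows. It assumes that the nearest neighborhood is a fixed Euclidean radius around each tuple and that local density is the fraction of the other tuples falling inside that radius. The `radius` parameter and the function names are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def local_densities(points: Sequence[Point], radius: float) -> List[float]:
    """Fraction of the other tuples lying within `radius` of each tuple."""
    n = len(points)
    if n < 2:
        return [0.0] * n
    densities: List[float] = []
    for i, p in enumerate(points):
        neighbours = sum(1 for j, q in enumerate(points)
                         if j != i and math.dist(p, q) <= radius)
        densities.append(neighbours / (n - 1))
    return densities


def second_center(points: Sequence[Point], radius: float) -> Point:
    """Pick the tuple with the largest local density (first one on ties)."""
    densities = local_densities(points, radius)
    best = max(range(len(points)), key=densities.__getitem__)
    return points[best]
```

Unlike a mean, the selected second center is always an actual data tuple of the set, so an isolated outlier cannot pull it away from the dense area.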

Specifically, the number of data tuples in each fine-aggregate data set is determined according to the clustering capacity of each identified fine-aggregate data set. The clustering capacity is obtained by taking the sum of mean square deviations as the measurement standard and continuing until the clustering result no longer changes or converges to a specified value.

Further, the final result of the data clustering is determined according to the second center position and the number of the data tuples, so that more accurate data clustering is realized.

In the embodiment of the present invention, the clustering the big data of relevance according to the second center position and the number of the data tuples includes:

counting sequence number identifiers of the real-time streaming data;

In detail, the real-time streaming data is clustered further and more accurately: the real-time distance between the real-time streaming data and the second center position is calculated, and the sequence number identifier of the real-time streaming data in the data stream is counted. When the real-time distance is smaller than a preset distance threshold and the sequence number identifier is within the number of data tuples, the real-time streaming data is clustered into the final clustering set.

For example, when the sequence number identifier of the real-time streaming data is 5, the real-time distance is 3, the number of data tuples is 8, and the preset distance threshold is 8, the real-time distance is smaller than the distance threshold and the sequence number identifier is smaller than the number of data tuples, so the real-time streaming data is clustered into the target set corresponding to the second center position to obtain the final clustering set. If the sequence number identifier of the real-time streaming data is larger than the number of data tuples, the real-time streaming data is clustered into the target set corresponding to the next second center position.
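The decision rule in this example can be sketched as below. The sketch assumes that "within the number of data tuples" means less than or equal to the count, and that moving on to the next second center position means testing the remaining centers in order with the same rule. What happens to data that matches no center is not specified above, so the sketch returns `None` in that case. The names `belongs` and `route` are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def belongs(real_time_distance: float, seq_id: int,
            tuple_count: int, distance_threshold: float) -> bool:
    """True when the distance is under the threshold and seq_id is within the count."""
    return real_time_distance < distance_threshold and seq_id <= tuple_count


def route(point: Point, seq_id: int, second_centers: Sequence[Point],
          tuple_counts: Sequence[int], distance_threshold: float) -> Optional[int]:
    """Return the index of the first second center that accepts the point."""
    for idx, (center, count) in enumerate(zip(second_centers, tuple_counts)):
        if belongs(math.dist(point, center), seq_id, count, distance_threshold):
            return idx
    return None  # assumption: unmatched data is left unassigned
```

With the values from the example, `belongs(3, 5, 8, 8)` evaluates to `True`, whereas a sequence number identifier of 9 against a count of 8 causes the next center to be tried.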

Fig. 4 is a functional block diagram of a correlation big data clustering device based on stream computation according to an embodiment of the present invention.

The relevance big data clustering device 100 based on stream computing can be installed in electronic equipment. Depending on the implementation, the streaming computation-based relevance big data clustering device 100 may include a data tuple generating module 101, a streaming data coarse clustering module 102, a first central location computing module 103, a streaming data fine clustering module 104, and a data clustering module 105. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.

In the present embodiment, the functions concerning the respective modules/units are as follows:

the data tuple generating module 101 is configured to obtain preset streaming data, and generate a data tuple of the streaming data according to a preset timestamp and a preset characterization data item;

the streaming data coarse clustering module 102 is configured to perform vector conversion on the data tuples to obtain vector tuples, and perform coarse clustering on the streaming data according to the vector tuples by using a preset real-time coarse clustering algorithm to obtain a coarse aggregation data set;

the first central location computing module 103 is configured to perform subset division on the coarse aggregation data set to obtain a divided coarse aggregation data set, and calculate a first center position of each coarse aggregation subset in the divided coarse aggregation data set by using a preset center algorithm;

the streaming data fine clustering module 104 is configured to calculate a clustering distance between preset real-time streaming data and the first center position one by one, perform fine clustering on the real-time streaming data according to the clustering distance by using a preset clustering algorithm to obtain a fine-aggregate data set, and perform time identification on the fine-aggregate data set to obtain an identified fine-aggregate data set;

the data clustering module 105 is configured to determine a second center position of the fine-aggregate data set according to the identified fine-aggregate data set, determine a number of data tuples in the fine-aggregate data set according to the identified fine-aggregate data set, and cluster the relevance big data according to the second center position and the number of data tuples.

In detail, each module in the stream-based relevance big data clustering device 100 in the embodiment of the present invention adopts the same technical means as the stream-based relevance big data clustering method described in fig. 1 to 3, and can generate the same technical effects, which are not described herein.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules is merely a logical function division, and other manners of division are possible in an actual implementation.

The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in the form of hardware, or in the form of hardware plus software functional modules.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.

The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.

The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.

Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. Multiple units or means as set forth in the system embodiments may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims

1. A relevance big data clustering method based on stream computation, the method comprising:

2. The method for clustering associative big data based on stream computation according to claim 1, wherein generating the data tuple of the stream data according to the predetermined timestamp and the predetermined characterization data item comprises:

tuple(t)＝

3. The method for clustering relevance big data based on stream computation according to claim 1, wherein the coarse clustering the stream data according to the vector tuple by using a preset real-time coarse clustering algorithm to obtain a coarse aggregation data set comprises:

4. The method for clustering relevance big data based on stream computation as claimed in claim 3, wherein the step of obtaining a preset first distance threshold and a preset second distance threshold comprises:

wherein $T_1$ is the first distance threshold, $\max^P$ is the maximum dimension value of the P-th dimension in the vector tuple, $\min^P$ is the minimum dimension value of the P-th dimension in the vector tuple, $S^P$ is the dimension standard deviation of the P-th dimension in the vector tuple, n is the number of dimensions, and $u_1$ is the first distance weight coefficient;

$T_2 = u_2 T_1$

5. The method for clustering relevance big data based on stream computation according to claim 1, wherein the sub-dividing the coarse aggregate data set to obtain a divided coarse aggregate data set includes:

6. The method for clustering relevance big data based on streaming computation according to any one of claims 1 to 5, wherein the computing the first center position of each coarse subset in the partitioned coarse aggregate data using a preset center algorithm includes:

7. The method for clustering relevance big data based on stream computation according to claim 1, wherein the performing time identification on the fine-aggregate data set to obtain an identified fine-aggregate data set includes:

8. The streaming based associative big data clustering method according to claim 1, wherein the determining the second center position of the fine-aggregate dataset according to the identifying fine-aggregate dataset comprises:

9. The method for clustering big data of relevance based on stream computation according to claim 1, wherein the clustering big data of relevance according to the second center position and the number of data tuples comprises:

counting sequence number identifiers of the real-time streaming data;

10. A stream computation-based relevance big data clustering device, the device comprising: