CN111061559A

CN111061559A - Distributed data mining and statistical method based on data deduplication

Info

Publication number: CN111061559A
Application number: CN201911106504.0A
Authority: CN
Inventors: 邓金祥; 王炜; 代先勇; 谷峰; 曾海刚; 佘朝裕; 刘洋
Original assignee: Chengdu Ansi Technology Co ltd
Current assignee: Chengdu Ansi Technology Co ltd
Priority date: 2019-11-13
Filing date: 2019-11-13
Publication date: 2020-04-24

Abstract

The invention discloses a distributed data mining and statistical method based on data deduplication. A distributed server cluster traverses all data in the data packets and, according to the data aggregation mining conditions configured by a user, each server node in the cluster judges whether the data are duplicated: if not, the data are retained; otherwise, the redundant duplicates are deleted. Because the original data are processed in a distributed manner, the order of magnitude of the data packets is greatly reduced, complex mining conditions become much easier for a user to configure, data mining becomes more targeted, and mining efficiency is greatly improved.

Description

Distributed data mining and statistical method based on data deduplication

Technical Field

The invention belongs to the technical field of network traffic analysis, and particularly relates to a distributed data mining and statistical method based on data deduplication.

Background

With the development of computer technology and the internet, rising broadband rates and falling costs have tied people's lives and work ever more closely to the network, and the number of network data packets is growing geometrically. In current network traffic analysis, even an experienced analyst can analyze at most about 1,700,000 packets a day. According to statistics, the data packets generated by a single household in one week can reach the 100 million level, which is not easy to analyze. If the traffic of an endpoint at a company, school, government body, or similar organization is collected, the number of packets is astronomical. Large data volumes, high data complexity, and slow data processing seriously reduce the efficiency of data mining and statistics.

Disclosure of Invention

The invention aims to solve the above problems and provides a distributed data mining and statistical method based on data deduplication, which comprises the following steps:

S1, the system acquires the network traffic data packets;

S2, the central server performs load balancing, then splits the data packets and sends them to the distributed server cluster;

S3, the user configures data aggregation mining conditions, and each server node in the distributed server cluster matches and mines the data according to those conditions;

S4, the central server merges the matched and mined data;

S5, the user configures data aggregation mining conditions, the system traverses all data sessions, and the system matches and mines the data again;

S6, the system outputs the aggregated mining result.

The invention has the following beneficial effects. The original data are processed in a distributed manner, so the order of magnitude of the data packets is greatly reduced and analysis efficiency is improved; the larger the data volume, the more pronounced the effect. Mining is configured through aggregation conditions that remove duplicate data and is supported by intelligent multi-task scheduling and multi-channel concurrent mining on the back-end server cluster. This greatly reduces the difficulty of configuring complex mining conditions, makes data mining more targeted, and greatly improves mining efficiency.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a system diagram of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

As shown in FIG. 1, the distributed data mining and statistical method based on data deduplication of the present invention includes the following steps:

S1, the system acquires the network traffic data packets;

S2, the central server performs load balancing, then splits the data packets and sends them to the distributed server cluster;

S3, the distributed server cluster traverses all data in the data packets, and each server node in the cluster judges, according to the data aggregation mining conditions configured by the user, whether the data are duplicated; if not, the data are retained, otherwise the redundant duplicates are deleted;

S4, the central server merges the deduplicated data;

S5, the central server traverses the merged deduplicated data and judges, according to the data aggregation mining conditions configured by the user, whether the data are duplicated; if not, the data are retained, otherwise the redundant duplicates are deleted;

S6, the system outputs the aggregated mining result.

Further, in S2, the central server performing the load balancing allocates the task amount according to the hardware configuration performance and the current load of each server node in the distributed server cluster.

Further, the matching and mining process in S3 matches the seven-tuple information in the traffic data packets according to the user-configured conditions.

Further, the data merging by the central server in S5 is an asynchronous operation: a merge calculation is performed as soon as a result returned from any one distributed node is received.

Further, S5 specifically includes: the user configures a data aggregation mining condition, and the system traverses all data sessions and judges whether each session is one that the system encounters for the first time; if so, the session is kept, otherwise it is discarded.

Further, the user-configured data mining condition is the source address and source port of a session.

The system acquires the original network traffic data packets and splits them proportionally. The split is matched to the state of each server node in the server cluster along two dimensions: hardware configuration and current load. For example, suppose there are 1,000,000 data records in total and three servers A, B, and C for analysis. Servers A and C have better hardware, but A's current load is lower than C's, so A is assigned 50% of the records and C is assigned 40%; server B has slightly weaker hardware and is assigned only 20%.
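The proportional split can be illustrated with a minimal Python sketch. The source names only the two dimensions (hardware configuration and current load) and gives no formula, so the weighting below (hardware score multiplied by idle fraction), the `Node` fields, and the sample scores are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    hardware_score: float  # hypothetical: higher means better hardware
    current_load: float    # hypothetical: fraction of capacity in use, 0.0 to 1.0


def split_counts(total: int, nodes: list[Node]) -> dict[str, int]:
    """Divide `total` records among nodes in proportion to spare capacity."""
    # Assumed weighting: hardware score scaled by the idle fraction.
    weights = [n.hardware_score * (1.0 - n.current_load) for n in nodes]
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("no node has spare capacity")
    counts = [int(total * w / weight_sum) for w in weights]
    # Hand out records lost to rounding down, one each, largest weight first.
    remainder = total - sum(counts)
    order = sorted(range(len(nodes)), key=lambda i: weights[i], reverse=True)
    for i in order[:remainder]:
        counts[i] += 1
    return {n.name: c for n, c in zip(nodes, counts)}


nodes = [
    Node("A", hardware_score=10.0, current_load=0.2),
    Node("B", hardware_score=5.0, current_load=0.2),
    Node("C", hardware_score=10.0, current_load=0.4),
]
shares = split_counts(1_000_000, nodes)
```

Normalizing by the sum of the weights guarantees that the shares always add up to the full record count, whatever scores are chosen.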

Each node server in the cluster performs data matching on its share of the split data. The matching condition is seven-tuple data applied to the traffic data packets; the session data that meet the condition are extracted, and only the MAX or MIN data in each session are retained, yielding several groups of values. For example, if the matching condition is the sum of the outgoing traffic of sessions whose source IP is 1.1.1.1, all sessions are first matched against the condition "source IP is 1.1.1.1" to obtain every session with that source IP, and then only the outgoing traffic data of those sessions are extracted to obtain the final matching result.
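The match-then-extract example can be sketched in Python as follows. The source does not list the seven fields of the tuple, so the `Session` record below carries only a few hypothetical fields, and `match_sessions` and `total_outgoing` are illustrative names rather than part of the described system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    # Hypothetical subset of session fields; the source does not enumerate
    # the seven-tuple.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    bytes_out: int  # outgoing traffic of the session, in bytes


def match_sessions(sessions, **conditions):
    """Return the sessions whose fields equal every given condition."""
    return [
        s for s in sessions
        if all(getattr(s, field) == value for field, value in conditions.items())
    ]


def total_outgoing(sessions, **conditions):
    """Match first, then keep only the outgoing-traffic value of each hit."""
    matched = match_sessions(sessions, **conditions)
    return sum(s.bytes_out for s in matched)


sessions = [
    Session("1.1.1.1", 5000, "8.8.8.8", 53, "udp", 120),
    Session("1.1.1.1", 5001, "9.9.9.9", 443, "tcp", 300),
    Session("2.2.2.2", 6000, "8.8.8.8", 53, "udp", 80),
]
result = total_outgoing(sessions, src_ip="1.1.1.1")  # 120 + 300
```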

The central server merges the data: as soon as a result returned from any one distributed node is received, a merge calculation is performed.
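One way to picture this merge-on-arrival behavior is the sketch below, which uses Python's standard `concurrent.futures` module. Threads stand in for the distributed nodes and `node_task` is a hypothetical placeholder for the per-node work; the source does not describe the actual transport or scheduling.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def node_task(chunk):
    """Stand-in for the work done on one node: deduplicate its chunk."""
    return set(chunk)


def merge_as_results_arrive(chunks):
    merged = set()
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        futures = [pool.submit(node_task, chunk) for chunk in chunks]
        # as_completed yields each future when it finishes, in completion
        # order, so merging starts without waiting for the slowest node.
        for future in as_completed(futures):
            merged |= future.result()
    return merged


chunks = [
    [("1.1.1.1", 5000), ("1.1.1.1", 5000)],
    [("2.2.2.2", 6000), ("1.1.1.1", 5000)],
    [("3.3.3.3", 7000)],
]
merged = merge_as_results_arrive(chunks)
```

Because set union is order-independent, the merged result is the same whichever node reports first.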

For example, suppose there are one million sessions in total and the deduplication condition is the source address and source port. First, the system load-balances the one million sessions according to the configuration file and, after automatic splitting, distributes them to the servers A, B, and C. Second, each server automatically traverses all the sessions assigned to it and deduplicates them on source address and source port: only the first session encountered with a given source address and source port is retained, and later sessions with the same source address and source port are discarded. Then the deduplication results of servers B and C are gathered on server A, which traverses the combined results again, retains the first session encountered for each source address and source port, and discards those encountered later. Finally, server A extracts only the source address and source port values from the remaining deduplicated sessions to form a new list, which is the final result and is displayed on the user terminal in an intuitive form.
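The two-stage deduplication in this example can be sketched in Python as below. Sessions are modeled as plain dictionaries with hypothetical `src_ip` and `src_port` keys, and the partitions are processed in a single process for brevity; in the described system each partition would be handled on its own server.

```python
def dedup_first_seen(sessions, key):
    """Keep the first session for each key value; drop later ones."""
    seen = set()
    kept = []
    for s in sessions:
        k = key(s)
        if k not in seen:
            seen.add(k)
            kept.append(s)
    return kept


def source_key(session):
    """Deduplication condition: source address and source port."""
    return (session["src_ip"], session["src_port"])


def two_stage_dedup(partitions):
    # Stage 1: each node deduplicates its own partition.
    per_node = [dedup_first_seen(p, source_key) for p in partitions]
    # Stage 2: the central server combines the node results and deduplicates
    # again, because the same key can survive on more than one node.
    combined = [s for node_result in per_node for s in node_result]
    final = dedup_first_seen(combined, source_key)
    # Only the key values are reported in the final list.
    return [source_key(s) for s in final]


partitions = [
    [{"src_ip": "1.1.1.1", "src_port": 5000, "dst_ip": "8.8.8.8"},
     {"src_ip": "1.1.1.1", "src_port": 5000, "dst_ip": "9.9.9.9"}],
    [{"src_ip": "1.1.1.1", "src_port": 5000, "dst_ip": "7.7.7.7"},
     {"src_ip": "2.2.2.2", "src_port": 6000, "dst_ip": "8.8.8.8"}],
]
result = two_stage_dedup(partitions)
```

The second pass is what makes the result globally unique: a key that appears on two nodes passes each local check and is only caught when the partial results are combined.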

According to the invention, the original data are processed in a distributed manner, so the order of magnitude of the data packets is greatly reduced and analysis efficiency is improved; the larger the data volume, the more pronounced the effect. Mining is configured through aggregation conditions that remove duplicate data and is supported by intelligent multi-task scheduling and multi-channel concurrent mining on the back-end server cluster. This greatly reduces the difficulty of configuring complex mining conditions, makes data mining more targeted, and greatly improves mining efficiency.

The technical solution of the present invention is not limited to the specific embodiments above; all technical modifications made according to the technical solution of the present invention fall within its protection scope.

Claims

1. A distributed data mining and statistical method based on data deduplication is characterized by comprising the following steps:

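S1, the system acquires the network traffic data packets;

S2, the central server performs load balancing, then splits the data packets and sends them to the distributed server cluster;

S3, the distributed server cluster traverses all data in the data packets, and each server node in the cluster judges, according to the data aggregation mining conditions configured by the user, whether the data are duplicated; if not, the data are retained, otherwise the redundant duplicates are deleted;

S4, the central server merges the deduplicated data;

S5, the central server traverses the merged deduplicated data and judges, according to the data aggregation mining conditions configured by the user, whether the data are duplicated; if not, the data are retained, otherwise the redundant duplicates are deleted;

S6, the system outputs the aggregated mining result.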

2. The method as claimed in claim 1, wherein in S2 the central server performing the load balancing allocates the task amount according to the hardware configuration performance and the current load of each server node in the distributed server cluster.

3. The distributed data mining and statistical method based on data deduplication as claimed in claim 1, wherein the matching and mining process in S3 matches the seven-tuple information in the traffic data packets according to the user-configured conditions.

4. The method as claimed in claim 1, wherein the data merging by the central server in S5 is an asynchronous operation, and a merge calculation is performed as soon as a result returned from any one distributed node is received.

5. The distributed data mining and statistical method based on data deduplication as claimed in claim 1, wherein S5 specifically includes: the user configures a data aggregation mining condition, and the system traverses all data sessions and judges whether each session is one that the system encounters for the first time; if so, the session is kept, otherwise it is discarded.

6. The distributed data mining and statistical method based on data deduplication as claimed in claim 1, wherein the user-configured data mining condition in S5 is the source address and source port of a session.