CN111061559A - Distributed data mining and statistical method based on data deduplication - Google Patents


Info

Publication number
CN111061559A
Authority
CN
China
Prior art keywords
data
mining
distributed
server
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911106504.0A
Other languages
Chinese (zh)
Inventor
邓金祥
王炜
代先勇
谷峰
曾海刚
佘朝裕
刘洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Ansi Technology Co ltd
Original Assignee
Chengdu Ansi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Ansi Technology Co ltd
Priority to CN201911106504.0A
Publication of CN111061559A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers

Abstract

The invention discloses a distributed data mining and statistical method based on data deduplication. A distributed server cluster traverses all data in the data packets; according to the data aggregation mining conditions configured by a user, each server node in the cluster judges whether each item of data is a duplicate, retains it if it is not, and otherwise deletes the redundant duplicate. Because the original data is processed in a distributed manner, the order of magnitude of the data packets is greatly reduced, complex mining conditions become much easier for the user to configure, data mining becomes more targeted, and mining efficiency is greatly improved.

Description

Distributed data mining and statistical method based on data deduplication
Technical Field
The invention belongs to the technical field of network traffic analysis, and in particular relates to a distributed data mining and statistical method based on data deduplication.
Background
With the development of computer technology and the Internet, rising broadband speeds and falling costs have tied people's life and work ever more closely to the network, and the number of network data packets is growing geometrically. In current network traffic analysis, even an experienced analyst can analyze at most about 1,700,000 packets a day. According to statistics, the packets generated by a single household in one week can reach the order of 100 million, which is not easy to analyze. If the traffic at an endpoint of a company, school, or government body is collected, the number of packets becomes astronomical. Large data volume, high data complexity, and slow data processing seriously reduce the efficiency of data mining and statistics.
Disclosure of Invention
The invention aims to solve the above problems and provides a distributed data mining and statistical method based on data deduplication, comprising the following steps:
S1, the system acquires network traffic data packets;
S2, after load balancing, the central server splits the data packets and sends them to the distributed server cluster;
S3, the user configures data aggregation mining conditions, and each server node in the distributed server cluster matches and mines the data according to those conditions;
S4, the central server merges the matched and mined data;
S5, the user configures data aggregation mining conditions, and the system traverses all data sessions and matches and mines the data again;
S6, the system outputs the aggregated mining result.
The beneficial effects of the invention are as follows. Because the original data is processed in a distributed manner, the order of magnitude of the data packets is greatly reduced and analysis efficiency is improved, and the larger the data volume, the more pronounced the effect. Deduplication is performed under user-configured aggregation conditions and is supported by intelligent multi-task scheduling and multi-channel concurrent mining in the back-end server cluster, so that complex mining conditions become much easier for the user to configure, data mining becomes more targeted, and mining efficiency is greatly improved through intelligent scheduling and multi-channel concurrency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a system diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the distributed data mining and statistical method based on data deduplication of the present invention comprises the following steps:
S1, the system acquires network traffic data packets;
S2, after load balancing, the central server splits the data packets and sends them to the distributed server cluster;
S3, the distributed server cluster traverses all data in the data packets, and each server node in the cluster judges, according to the data aggregation mining conditions configured by the user, whether each item of data is a duplicate; if not, the data is retained, and otherwise the redundant duplicate is deleted;
S4, the central server merges the deduplicated data;
S5, the central server traverses the merged, deduplicated data and judges, according to the data aggregation mining conditions configured by the user, whether each item of data is a duplicate; if not, the data is retained, and otherwise the redundant duplicate is deleted;
S6, the system outputs the aggregated mining result.
Further, in the load balancing process in S2, the central server allocates the task amount according to the hardware configuration performance and the current load of each server node in the distributed server cluster.
Further, the matching and mining process in S3 matches the seven-tuple information in the traffic data packets according to the user-configured conditions.
Further, the merging of data by the central server in S5 is an asynchronous operation: a merge calculation is performed as soon as a result returned from any one distributed node is received.
Further, S5 specifically comprises: the user configures a data aggregation mining condition, and the system traverses all data sessions and judges whether each session is one that the system encounters for the first time; if so, the session is kept, and otherwise it is discarded.
Further, the user-configured data mining condition is the source address and source port of a session.
The system acquires the original network traffic data packets and splits them proportionally. The split is matched to the state of each server node in the server cluster along two main dimensions: hardware configuration and current load. For example, suppose there are 1,000,000 data records in total and three servers, A, B, and C, for analysis. Servers A and C have better hardware, but A's current load is lower than C's, so A is assigned 50% of the records and C is assigned 40%; server B has slightly worse hardware and is assigned only 20%.
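A minimal sketch of such a proportional split, assuming each node's share is proportional to a hardware score multiplied by its idle fraction. The source names only the two dimensions (hardware configuration and current load); the weighting formula, the `Node` type, the `split_counts` function, and all figures below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: float  # hypothetical hardware score (higher is better)
    load: float      # current load as a fraction in [0, 1)


def split_counts(total: int, nodes: list[Node]) -> dict[str, int]:
    """Split `total` records across nodes in proportion to spare capacity."""
    # Assumed weighting: hardware score scaled by the idle fraction.
    weights = [n.capacity * (1.0 - n.load) for n in nodes]
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("no node has spare capacity")
    counts = [int(total * w / weight_sum) for w in weights]
    # Records lost to rounding down go, one each, to the highest-weighted nodes.
    remainder = total - sum(counts)
    order = sorted(range(len(nodes)), key=lambda i: weights[i], reverse=True)
    for i in order[:remainder]:
        counts[i] += 1
    return {n.name: c for n, c in zip(nodes, counts)}


# A and C share the better hardware score; C is busier; B has weaker hardware.
nodes = [Node("A", 10.0, 0.2), Node("B", 5.0, 0.2), Node("C", 10.0, 0.5)]
shares = split_counts(1_000_000, nodes)
```

With these illustrative inputs A receives the largest share, C the next, and B the smallest, and the shares always add up to the full record count.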
Each node server in the cluster performs data matching on the split data it receives. The matching condition applies the seven-tuple data to the traffic data packets: the session data satisfying the condition is extracted, and only the MAX or MIN data in each session is retained, yielding several groups of values. For example, if the matching condition is the sum of the outgoing traffic of sessions whose source IP is 1.1.1.1, all sessions are first matched against the condition "source IP is 1.1.1.1" to obtain every session with that source IP, and then only the outgoing-traffic data in those sessions is extracted to obtain the final matching result.
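The source-IP example can be sketched as a filter followed by a field extraction. The source does not list the seven-tuple fields, so the `Session` field names, the `match_sessions` helper, and the sample data below are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    # Hypothetical field names; the source does not enumerate the seven-tuple.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    bytes_out: int


def match_sessions(sessions, **conditions):
    """Return the sessions whose fields equal every given condition."""
    return [
        s for s in sessions
        if all(getattr(s, field) == value for field, value in conditions.items())
    ]


sessions = [
    Session("1.1.1.1", 5000, "8.8.8.8", 53, "udp", 120),
    Session("2.2.2.2", 5001, "8.8.8.8", 53, "udp", 300),
    Session("1.1.1.1", 5002, "9.9.9.9", 443, "tcp", 880),
]

# Step 1: match on the condition "source IP is 1.1.1.1".
matched = match_sessions(sessions, src_ip="1.1.1.1")
# Step 2: keep only the outgoing-traffic field of the matched sessions.
outgoing = [s.bytes_out for s in matched]
# Step 3: apply the aggregate named in the condition (here, a sum).
total_out = sum(outgoing)
```

Here `outgoing` is `[120, 880]` and `total_out` is `1000`. Passing conditions as keyword arguments keeps the sketch independent of which tuple fields a user actually configures.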
The central server merges the data asynchronously: as soon as a result returned from any one distributed node is received, a merge calculation is performed.
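One way to picture this merge-on-arrival behaviour is with Python's standard `concurrent.futures` module. In this sketch worker threads stand in for remote server nodes, and `node_dedup` and `merge_as_completed` are hypothetical names; the source does not describe the transport or the merge data structure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def node_dedup(records):
    """Stand-in for one node's work: drop duplicates, keep first-seen order."""
    return list(dict.fromkeys(records))


def merge_as_completed(partitions):
    """Fold each node's result into the merged set as soon as it arrives."""
    merged = set()
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        futures = [pool.submit(node_dedup, part) for part in partitions]
        # as_completed yields futures in completion order, not submission order,
        # so the central merge never waits for the slowest node before starting.
        for future in as_completed(futures):
            merged.update(future.result())
    return merged


partitions = [["a", "b", "a"], ["b", "c"], ["d", "d"]]
merged = merge_as_completed(partitions)
```

The merged result is the same whatever order the nodes finish in, which is what makes an incremental merge safe for a set-style deduplication.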
For example, suppose there are one million sessions in total and the deduplication condition is the source address and source port. First, the system load-balances the one million sessions according to the configuration file and, after automatic splitting, distributes them to the servers (A, B, and C). Second, each server automatically traverses all the sessions assigned to it and deduplicates them on source address and source port: only the first session encountered with a given source address and source port is retained, and subsequent sessions with the same source address and source port are discarded. Then the deduplication results of servers B and C are collected on server A, which traverses the combined data again, retains the first session encountered for each source address and source port, and discards any later ones. Finally, server A extracts only the source address and source port values from the deduplicated sessions to form a new list, which is the final result, and displays it on the user terminal in an intuitive form.
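The two-stage deduplication in this example can be sketched as follows. The dictionary-based session records, the `dedup_first_seen` helper, and the sample addresses are assumptions for illustration; only the first-seen rule on (source address, source port) comes from the text.

```python
def dedup_first_seen(sessions, key=lambda s: (s["src_ip"], s["src_port"])):
    """Keep the first session seen for each key and discard later ones."""
    seen = set()
    kept = []
    for s in sessions:
        k = key(s)
        if k not in seen:
            seen.add(k)
            kept.append(s)
    return kept


# Hypothetical partitions already assigned to servers A, B, and C.
part_a = [{"src_ip": "10.0.0.1", "src_port": 80}, {"src_ip": "10.0.0.1", "src_port": 80}]
part_b = [{"src_ip": "10.0.0.1", "src_port": 80}, {"src_ip": "10.0.0.2", "src_port": 22}]
part_c = [{"src_ip": "10.0.0.2", "src_port": 22}, {"src_ip": "10.0.0.3", "src_port": 443}]

# Stage 1: each node deduplicates its own partition.
local = [dedup_first_seen(p) for p in (part_a, part_b, part_c)]

# Stage 2: the central server concatenates the local results and deduplicates
# again, because the same key can survive on more than one node.
merged = dedup_first_seen([s for part in local for s in part])

# Final list: only the (source address, source port) values.
result = [(s["src_ip"], s["src_port"]) for s in merged]
```

Here `result` is `[("10.0.0.1", 80), ("10.0.0.2", 22), ("10.0.0.3", 443)]`. The second pass is needed because local deduplication cannot see duplicates held by other nodes.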
In summary, because the original data is processed in a distributed manner, the order of magnitude of the data packets is greatly reduced and analysis efficiency is improved, and the larger the data volume, the more pronounced the effect. Deduplication is performed under user-configured aggregation conditions and is supported by intelligent multi-task scheduling and multi-channel concurrent mining in the back-end server cluster, so that complex mining conditions become much easier for the user to configure, data mining becomes more targeted, and mining efficiency is greatly improved.
The technical solution of the present invention is not limited to the specific embodiments described above; all technical modifications made according to the technical solution of the present invention fall within the protection scope of the present invention.

Claims (6)

1. A distributed data mining and statistical method based on data deduplication, characterized by comprising the following steps:
S1, the system acquires network traffic data packets;
S2, after load balancing, the central server splits the data packets and sends them to the distributed server cluster;
S3, the distributed server cluster traverses all data in the data packets, and each server node in the cluster judges, according to the data aggregation mining conditions configured by the user, whether each item of data is a duplicate; if not, the data is retained, and otherwise the redundant duplicate is deleted;
S4, the central server merges the deduplicated data;
S5, the central server traverses the merged, deduplicated data and judges, according to the data aggregation mining conditions configured by the user, whether each item of data is a duplicate; if not, the data is retained, and otherwise the redundant duplicate is deleted;
S6, the system outputs the aggregated mining result.
2. The method as claimed in claim 1, wherein in the load balancing process in S2 the central server allocates the task amount according to the hardware configuration performance and the current load of each server node in the distributed server cluster.
3. The distributed data mining and statistical method based on data deduplication as claimed in claim 1, wherein the matching and mining process in S3 matches the seven-tuple information in the traffic data packets according to the user-configured conditions.
4. The method as claimed in claim 1, wherein the merging of data by the central server in S5 is an asynchronous operation, and a merge calculation is performed as soon as a result returned from any one distributed node is received.
5. The distributed data mining and statistical method based on data deduplication as claimed in claim 1, wherein S5 specifically comprises: the user configures a data aggregation mining condition, and the system traverses all data sessions and judges whether each session is one that the system encounters for the first time; if so, the session is kept, and otherwise it is discarded.
6. The distributed data mining and statistical method based on data deduplication as claimed in claim 1, wherein the user-configured data mining condition in S5 is the source address and source port of a session.
CN201911106504.0A 2019-11-13 2019-11-13 Distributed data mining and statistical method based on data deduplication Pending CN111061559A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911106504.0A CN111061559A (en) 2019-11-13 2019-11-13 Distributed data mining and statistical method based on data deduplication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911106504.0A CN111061559A (en) 2019-11-13 2019-11-13 Distributed data mining and statistical method based on data deduplication

Publications (1)

Publication Number Publication Date
CN111061559A true CN111061559A (en) 2020-04-24

Family

ID=70298551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911106504.0A Pending CN111061559A (en) 2019-11-13 2019-11-13 Distributed data mining and statistical method based on data deduplication

Country Status (1)

Country Link
CN (1) CN111061559A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5434994A (en) * 1994-05-23 1995-07-18 International Business Machines Corporation System and method for maintaining replicated data coherency in a data processing system
US20020091734A1 (en) * 2000-11-13 2002-07-11 Digital Door, Inc. Data security system and method
CN101411122A (en) * 2006-01-10 2009-04-15 方毅 System and method for P2P network data flow direction and flow measurement, and commerce mode based on the technology
CN105912572A (en) * 2016-03-30 2016-08-31 深圳市金立通信设备有限公司 Data management method and terminal
CN106603317A (en) * 2017-02-20 2017-04-26 山东浪潮商用系统有限公司 Alarm monitoring strategy analysis method based on data mining technology
CN106599230A (en) * 2016-12-19 2017-04-26 北京天元创新科技有限公司 Method and system for evaluating distributed data mining model
CN107025288A (en) * 2017-04-14 2017-08-08 四川九鼎瑞信软件开发有限公司 Distributed data digging method and system
CN108121706A (en) * 2016-11-28 2018-06-05 央视国际网络无锡有限公司 A kind of optimization method of distributed reptile
CN108121810A (en) * 2017-12-26 2018-06-05 北京锐安科技有限公司 A kind of data duplicate removal method, system, central server and distributed server
CN109165363A (en) * 2018-08-27 2019-01-08 成都深思科技有限公司 A kind of configuration method of network data snapshot
CN110188093A (en) * 2019-05-21 2019-08-30 江苏锐天信息科技有限公司 A kind of data digging system being directed to AIS information source based on big data platform

Similar Documents

Publication Publication Date Title
CN111309776A (en) Distributed network flow aggregation dimension reduction statistical method based on data sorting
US7509408B2 (en) System analysis apparatus and method
CN106789242B (en) Intelligent identification application analysis method based on mobile phone client software dynamic feature library
US8463928B2 (en) Efficient multiple filter packet statistics generation
CN109710731A (en) A kind of multidirectional processing system of data flow based on Flink
CN103618733B (en) A kind of data filtering system and method for being applied to mobile Internet
CN104361031B (en) A kind of government data pre-processing system and processing method
CN102104544A (en) Order preserving method for fragmented message flow in IP (Internet Protocol) tunnel of multi-nuclear processor with accelerated hardware
CN107391606A (en) Log processing method and device based on Storm
CN103024819A (en) Data distribution method of third-generation mobile communication core network based on user terminal IP (Internet Protocol)
Kadianakis et al. Extrapolating network totals from hidden-service statistics
CN108280018A (en) A kind of node workflow communication overhead efficiency analysis optimization method and system
CN105610992A (en) Task allocation load balancing method of distributive flow computing system
CN114697391A (en) Data processing method, device, equipment and storage medium
CN104954257A (en) Message forwarding system and method
CN108134746B (en) Method and device for processing rail transit data
CN105007200B (en) The analysis method and system of network packet
Cai et al. Flow identification and characteristics mining from internet traffic with hadoop
CN104378419A (en) High-speed data push method and system
CN111061559A (en) Distributed data mining and statistical method based on data deduplication
CN111211939A (en) Device and method for realizing efficient flow table counting based on network processor
CN103227730A (en) Method and system for analyzing large log
CN109857563A (en) Task executing method, device and task execution system
CN111221877A (en) Multi-channel concurrent data packet mining and statistical method
CN110795600A (en) Aggregation dimension reduction statistical method for distributed network flow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination