CN102664789A - Method and system for processing large-scale data - Google Patents

Method and system for processing large-scale data

Info

Publication number
CN102664789A
Authority
CN
China
Prior art keywords
flow
mirror image
cluster
sub
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012101024112A
Other languages
Chinese (zh)
Other versions
CN102664789B (en)
Inventor
贺艳军
李婷婷
周宇
石婧岚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210102411.2A
Publication of CN102664789A
Application granted
Publication of CN102664789B
Legal status: Active
Anticipated expiration

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a method and a system for processing large-scale data. The system comprises a flow collection subsystem and a flow processing subsystem. The flow collection subsystem collects data traffic, mirrors the collected traffic, and splits the resulting mirrored traffic into P sub-flows that are sent to a flow storage cluster in the flow processing subsystem, where P is an integer greater than 1. The flow storage cluster consists of M storage servers, each with N attached disks, where M is a positive integer, N is an integer greater than 1, and M × N ≥ P. Each storage server receives the sub-flows assigned to it and writes them to its N attached disks using a load-balancing technique. This reduces the sustained write pressure on the disks and better solves the problem of storing large-scale data.

Description

Method and system for processing large-scale data
[Technical Field]
The present invention relates to computer network technology, and in particular to a method and a system for processing large-scale data.
[Background]
With the continuous growth in the number of network users, the volume of data on the Internet is increasing explosively, and people have developed a new understanding of network transmission speed and of data security and reliability. User data is widely distributed across many locations; storage and backup that are not well managed pose hidden risks to business operations; and the speed and quality of data transmission affect the user experience. In addition, with the gradual rise and spread of cloud services, processing demands such as the storage, statistics, and analysis of large-scale data have become problems in urgent need of solution. However, existing data processing systems and methods are limited by performance and cannot meet the processing demands of large-scale data. For example, if existing data processing systems and methods were applied directly to the storage of large-scale data, they would bring unbearable data read/write pressure.
[Summary of the Invention]
The invention provides a method and a system for processing large-scale data, so as to meet the processing demands of large-scale data.
The specific technical solution is as follows:
A system for processing large-scale data, the system comprising a flow collection subsystem and a flow processing subsystem;
The flow collection subsystem is used to collect data traffic, mirror the collected data traffic, and split the resulting mirrored traffic into P sub-flows that are sent to the flow storage cluster in the flow processing subsystem, where P is an integer greater than 1;
The flow storage cluster consists of M storage servers, each with N attached disks, where M is a positive integer, N is an integer greater than 1, and M × N ≥ P; each storage server receives the sub-flows assigned to it and writes them to its N attached disks using a load-balancing technique.
According to a preferred embodiment of the present invention, the flow collection subsystem comprises:
a flow collection unit, used to collect the data traffic at the egress of the external-network core switch and mirror the collected data traffic, and
a flow-splitting unit, used to split the mirrored traffic into the sub-flows using a load-balancing technique.
According to a preferred embodiment of the present invention, the flow collection unit consists of an optical splitter and an optical amplifier;
The optical splitter splits the data traffic at the egress of the external-network core switch, and the optical amplifier optically amplifies the split data traffic to form the mirrored traffic.
According to a preferred embodiment of the present invention, the flow-splitting unit is a splitter switch that uses trunk mode and a load-balancing technique to split the mirrored traffic into P sub-flows.
According to a preferred embodiment of the present invention, multiple processes run on each storage server; each process corresponds to a subset of the N disks, and each process is responsible for receiving a portion of the sub-flows and writing the received sub-flows to its corresponding disks in turn, in units of a preset time length.
According to a preferred embodiment of the present invention, the flow processing subsystem further comprises a real-time analysis cluster;
The flow collection subsystem mirrors the collected data traffic to obtain two paths of mirrored traffic, one of which undergoes the splitting described above while the other is sent to the real-time analysis cluster;
The real-time analysis cluster is used to compute traffic statistics on the mirrored traffic it receives and to generate an analysis file from the statistics.
According to a preferred embodiment of the present invention, the real-time analysis cluster comprises a real-time receiving module, made up of a server cluster, and a summary statistics module;
Several servers in the real-time receiving module receive the mirrored traffic and write the computed traffic statistics to log files;
The summary statistics module downloads the log files generated by those servers, aggregates the traffic statistics in each log file, and obtains and outputs an analysis file, where the download period is longer than the period at which the real-time receiving module writes traffic statistics to the log files.
According to a preferred embodiment of the present invention, the flow processing subsystem further comprises a non-real-time analysis cluster, used to aggregate the sub-flows stored by the flow storage cluster and then analyze them, the analysis comprising: mining of attack behavior or extraction of required data.
A method for processing large-scale data, applied to a large-scale data processing system comprising a flow collection subsystem and a flow processing subsystem, where the flow storage cluster in the flow processing subsystem consists of M storage servers, each with N attached disks, the method comprising:
The flow collection subsystem collects data traffic, mirrors the collected data traffic, and splits the resulting mirrored traffic into P sub-flows that are sent to the flow storage cluster, where P is an integer greater than 1;
Each storage server receives the sub-flows assigned to it and writes them to its N attached disks using a load-balancing technique, where M is a positive integer, N is an integer greater than 1, and M × N ≥ P.
According to a preferred embodiment of the present invention, collecting data traffic is specifically: collecting the data traffic of the external-network core switch.
According to a preferred embodiment of the present invention, mirroring the collected data traffic is specifically:
Using an optical splitter to split the collected data traffic, and using an optical amplifier to optically amplify the split data traffic to form the mirrored traffic.
According to a preferred embodiment of the present invention, splitting the resulting mirrored traffic into P sub-flows is specifically:
Using the trunk mode of a splitter switch, with a load-balancing technique, to split the mirrored traffic into P sub-flows.
According to a preferred embodiment of the present invention, writing the assigned sub-flows to the N attached disks using a load-balancing technique is specifically: multiple processes run on each storage server; each process corresponds to a subset of the N disks, and each process is responsible for receiving a portion of the sub-flows and writing the received sub-flows to its corresponding disks in turn, in units of a preset time length.
According to a preferred embodiment of the present invention, when mirroring the collected data traffic, the flow collection subsystem obtains two paths of mirrored traffic, one of which undergoes the splitting described above while the other is sent to the real-time analysis cluster of the flow processing subsystem;
The real-time analysis cluster computes traffic statistics on the mirrored traffic it receives and generates an analysis file from the statistics.
According to a preferred embodiment of the present invention, computing traffic statistics on the received mirrored traffic and generating an analysis file from the statistics is specifically:
Several servers in the real-time analysis cluster receive the mirrored traffic and write the computed traffic statistics to log files;
The summary statistics module in the real-time analysis cluster downloads the log files generated by those servers, aggregates the traffic statistics in each log file, and obtains and outputs an analysis file, where the download period is longer than the period at which the traffic statistics are written to the log files.
According to a preferred embodiment of the present invention, the method further comprises:
The non-real-time analysis cluster aggregates the sub-flows stored by the flow storage cluster and then analyzes them, the analysis comprising: mining of attack behavior or extraction of required data.
As can be seen from the above technical solution, in the system and method provided by the invention, the flow collection subsystem first mirrors the collected data traffic and then splits the resulting mirrored traffic into multiple sub-flows that are sent to the flow storage cluster of the flow processing subsystem. The flow storage cluster consists of several storage servers, and each storage server writes the sub-flows it receives to its multiple attached disks using a load-balancing technique. This reduces the sustained write pressure on the disks and better solves the problem of storing massive data, while also improving disk utilization and effectively saving server cost.
[Brief Description of the Drawings]
Fig. 1 is a schematic diagram of the large-scale data processing system provided by an embodiment of the invention;
Fig. 2 is a diagram of an example system provided by an embodiment of the invention;
Fig. 3 is a flowchart of the large-scale data processing method provided by an embodiment of the invention.
[Detailed Description]
To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described below with reference to the drawings and specific embodiments.
The large-scale data processing system provided by the invention is described first. As shown in Fig. 1, the system can comprise a flow collection subsystem 100 and a flow processing subsystem 200.
The flow collection subsystem 100 is used to collect data traffic and mirror the collected data traffic to the server clusters in the flow processing subsystem 200.
Specifically, it can comprise a flow collection unit 110, used to collect data traffic and mirror the collected data traffic, and can further comprise a flow-splitting unit 120, used to split the mirrored traffic into the sub-flows using a load-balancing technique.
When collecting data traffic, the flow collection unit 110 can have its collection points deployed at the egress of the external-network core switch. On the one hand, this deployment captures all of the traffic data without loss; on the other hand, it achieves the goal with fewer collection points, which saves cost and reduces engineering difficulty. The flow collection unit 110 can collect and mirror data traffic in either of the following two ways:
First, port mirroring: the data of one or more ports of the external-network core switch is mirrored to one or more other ports to collect the data traffic. This is prior art and is not described in detail here.
Second, optical-splitting mirroring: an optical splitter first splits the egress data of the external-network core switch. Because signal strength attenuates after splitting, the split traffic can be further optically amplified so that its signal strength is sufficient and the data remains complete and reliable. Compared with port mirroring, optical-splitting mirroring offers higher stability and reliability: port mirroring can affect the core switch itself, and for online services a core switch failure is fatal. Optical-splitting mirroring is therefore the preferred way to collect data traffic.
One path of the mirrored traffic can be sent to the real-time analysis cluster in the flow processing subsystem 200 for real-time analysis, and the other path can be sent to the flow-splitting unit 120 for further processing. The flow-splitting unit 120 can be implemented with a splitter switch. Splitting can use trunk mode: the splitter switch uses a load-balancing technique to split the received mirrored traffic into multiple sub-flows and sends them to the server cluster in the flow processing subsystem 200 so that the same processing is applied to each sub-flow; here that processing is mainly storing each sub-flow separately. Take handling 10G of data traffic as an example: one 10-gigabit port of the switch serves as the ingress port and receives the 10G of data traffic, while eight gigabit ports on the egress side together form one trunk. The traffic of the ingress port can then be distributed evenly across the eight gigabit ports using round-robin scheduling, which achieves the first load balancing of the high-speed traffic.
Fig. 2 is a schematic diagram of one embodiment of the flow collection subsystem 100: an optical splitter splits the egress traffic of the external-network core switch, an optical amplifier optically amplifies the split traffic, and a splitter switch then splits the traffic into sub-flows. One path of traffic obtained after optical splitting can be sent to the real-time analysis cluster in the flow processing subsystem 200; the sub-flows obtained from the other path after processing by the splitter switch can be sent to the flow storage cluster in the flow processing subsystem 200 for subsequent non-real-time analysis.
The real-time analysis cluster 210 and the flow storage cluster 220 in the flow processing subsystem 200 are described in detail below.
The real-time analysis cluster 210 computes traffic statistics on the traffic it receives and generates an analysis file from the statistics. Specifically, the real-time analysis cluster 210 can comprise a real-time receiving module and a summary statistics module (neither is shown in Fig. 1).
The real-time receiving module can be made up of a server cluster. Each server in this cluster runs the same packet capture and statistics programs and writes the statistics to a log file. Taking 10-gigabit servers as an example, each 10-gigabit server supports two 10-gigabit network cards and can process 20G of data traffic simultaneously. The packet capture program receives packets efficiently from the 10-gigabit network cards, and the statistics program computes traffic statistics separately for each destination IP. The statistics can include, but are not limited to: TCP, UDP, and ICMP traffic volumes, generally in bps; TCP, UDP, and ICMP packet rates, generally in pps; the number of accesses per second to non-service ports; the number of HTTP GET requests per second and the length of GET packets; and the number of packets per second returned for the main HTTP status codes. The statistics can then be written to the log file in binary format.
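As a rough illustration of per-destination-IP statistics, the sketch below aggregates traffic volume (bps) and packet rate (pps) for one measurement interval. The packet records, the field names `dst_ip`, `proto`, and `length`, and the function `per_destination_stats` are assumptions made for the example; the patent does not specify the statistics program's data structures.

```python
from collections import defaultdict


def per_destination_stats(packets, interval_seconds):
    """Aggregate bps and pps per (destination IP, protocol) for one interval."""
    totals = defaultdict(lambda: {"bytes": 0, "packets": 0})
    for pkt in packets:
        key = (pkt["dst_ip"], pkt["proto"])
        totals[key]["bytes"] += pkt["length"]
        totals[key]["packets"] += 1
    return {
        key: {
            "bps": value["bytes"] * 8 / interval_seconds,  # bits per second
            "pps": value["packets"] / interval_seconds,    # packets per second
        }
        for key, value in totals.items()
    }


# Hypothetical packets observed during a one-second interval.
sample = [
    {"dst_ip": "10.0.0.1", "proto": "tcp", "length": 1500},
    {"dst_ip": "10.0.0.1", "proto": "tcp", "length": 500},
    {"dst_ip": "10.0.0.1", "proto": "udp", "length": 250},
    {"dst_ip": "10.0.0.2", "proto": "icmp", "length": 100},
]
stats = per_destination_stats(sample, interval_seconds=1)
print(stats[("10.0.0.1", "tcp")])
```

Keying the counters by destination IP and protocol mirrors the grouping described above; HTTP-level counters would need payload parsing and are left out of this sketch.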
The summary statistics module downloads the log files generated by the server cluster of the real-time receiving module; the download period is usually longer than the period at which the real-time receiving module writes traffic statistics to the log files. The module then aggregates the traffic statistics in the log files to obtain an analysis file and outputs it. For example, the traffic statistics for the same destination IP across the log files can be aggregated.
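Aggregation by destination IP might look like the following sketch. It assumes the binary log files have already been downloaded and parsed into dictionaries; the counter names and the function `merge_logs` are hypothetical.

```python
from collections import Counter


def merge_logs(log_records):
    """Sum the counters of every server's log, grouped by destination IP."""
    merged = {}
    for records in log_records:
        for dst_ip, counters in records.items():
            # Counter.update adds the values of matching keys.
            merged.setdefault(dst_ip, Counter()).update(counters)
    return merged


# Hypothetical parsed logs from two servers of the real-time receiving module.
server_a = {"10.0.0.1": {"tcp_bytes": 100, "udp_bytes": 10}}
server_b = {"10.0.0.1": {"tcp_bytes": 50}, "10.0.0.2": {"icmp_bytes": 7}}
report = merge_logs([server_a, server_b])
print(dict(report["10.0.0.1"]))
```

Because the download period is longer than the write period, each download can cover several completed write cycles of every server before the merge runs.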
The flow storage cluster 220 is a cluster of M storage servers, where M is a positive integer. Its main function is to write the received traffic to disk for storage efficiently and reliably. Because a massive number of packets is received, and online applications typically handle traffic at rates of tens or even hundreds of G/s, large-scale traffic must be stored on slow disks at low cost. In the present invention each storage server has N attached disks, where N is an integer greater than 1 and M × N ≥ P, with P being the number of sub-flows obtained after splitting by the flow-splitting unit 120. A storage server receives the traffic assigned to it and writes it to its disks using a load-balancing technique; specifically, it can write to the disks in turn, in units of a preset time length. Multiple processes can run on each storage server, with each process corresponding to a subset of the disks; each process is responsible for receiving one of the sub-flows and writing that sub-flow to its corresponding disks in turn, in units of a preset time length.
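The sizing constraint M × N ≥ P can be checked with a few lines of Python. The patent does not say how sub-flows are assigned to servers, so the round-robin assignment below is purely an assumption for illustration, as is the function name `assign_subflows`.

```python
def assign_subflows(m_servers, n_disks, p_subflows):
    """Validate M, N, P and assign sub-flows to servers (assumed round-robin)."""
    if m_servers < 1 or n_disks <= 1 or p_subflows <= 1:
        raise ValueError("need M >= 1, N > 1 and P > 1")
    if m_servers * n_disks < p_subflows:
        raise ValueError("constraint M * N >= P is violated")
    assignment = {server: [] for server in range(m_servers)}
    for flow in range(p_subflows):
        # Assumption: sub-flow k is handled by server k mod M.
        assignment[flow % m_servers].append(flow)
    return assignment


# Two servers with eight disks each and eight sub-flows: 2 * 8 >= 8 holds.
plan = assign_subflows(m_servers=2, n_disks=8, p_subflows=8)
print(plan)
```

The constraint guarantees that there are at least as many disks in the cluster as sub-flows, so no sub-flow has to share a disk's write slot with another.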
As an example, suppose the flow storage cluster comprises two storage servers, each with four PCI-Express ports and eight attached disks of 1 TB each. The flow-splitting unit 120 produces eight sub-flows. Four independent processes run simultaneously on each storage server, each receiving traffic from one of the four PCI-Express ports; each server is thus responsible for four of the sub-flows, and each process corresponds to two disks. A load-balancing strategy is applied again when each process writes traffic to disk, which is the second load balancing: the process can write to its two disks in turn in units of one minute. The first minute of traffic is written to the first disk, the second minute to the second disk, the third minute to the first disk, the fourth minute to the second disk, and so on. This load-balancing strategy makes full use of the independence of the processes and disks, reduces the sustained write pressure on the disks, and better solves the problem of storing massive data, while also improving disk utilization and effectively saving server cost.
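The minute-by-minute rotation in this example can be expressed as a small Python sketch. The zero-based disk and process numbering and the function names are assumptions for illustration.

```python
def disks_for_process(process_index, disks_per_process=2):
    """Return the disk indices owned by one process (zero-based)."""
    start = process_index * disks_per_process
    return list(range(start, start + disks_per_process))


def disk_for_timestamp(timestamp, disks, slice_seconds=60):
    """Pick the disk that receives traffic arriving at the given time."""
    slot = int(timestamp // slice_seconds)  # index of the current time slice
    return disks[slot % len(disks)]         # rotate through the owned disks


# Process 1 on a server owns disks 2 and 3 under this numbering.
owned = disks_for_process(1)
schedule = [disk_for_timestamp(minute * 60, owned) for minute in range(4)]
print(schedule)  # [2, 3, 2, 3]
```

Each disk is written for one slice and then idle for the next, which is how the rotation relieves sustained write pressure on any single disk.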
In addition, the flow processing subsystem 200 can further comprise a non-real-time analysis cluster 230, used to aggregate the traffic stored by the flow storage cluster 220 and then analyze it. The analysis includes, but is not limited to, mining of attack behavior or extraction of required data.
When mining attack behavior, the traffic of the attack period can be extracted and the attack behavior analyzed from the characteristics of the extracted traffic. For example, common network attacks mainly include network-layer bandwidth attacks, TCP-layer SYN flood and ACK flood attacks, and application-layer distributed request attacks. These attacks affect the stable operation of products. Based on the historical traffic stored in the flow storage cluster 220, attack characteristics can be analyzed in depth to support product-line defense and attack forensics. For network-layer bandwidth attacks, commonly UDP flood and ICMP flood, the traffic of the attack period is extracted and the volume of each traffic type in that period is counted to determine the attack type and scale. For TCP-layer attacks that exhaust protocol stack resources, the traffic of the attack period is extracted and the packet rate for each type of TCP flag in that period is counted to determine the attack type and scale. For application-layer distributed request attacks, the packets of the attack period are extracted and the fields of the HTTP request headers in that period are counted, including host, url, cookie, User-Agent, and referer, to determine the attack type and, further, the product line and pages under attack; the request characteristics of the HTTP headers are also summarized to provide identifying signatures for blocking policies.
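Counting packet rates by TCP flag type could be sketched as follows. The input is assumed to be the TCP flags byte of each packet extracted from the attack period; the category names and the function `tcp_flag_rates` are hypothetical, and any thresholds for declaring an attack are left out because the patent does not give them.

```python
from collections import Counter

SYN = 0x02  # TCP SYN flag bit
ACK = 0x10  # TCP ACK flag bit


def tcp_flag_rates(flag_bytes, period_seconds):
    """Return packets per second for each coarse TCP flag category."""
    counts = Counter()
    for flags in flag_bytes:
        has_syn = bool(flags & SYN)
        has_ack = bool(flags & ACK)
        if has_syn and not has_ack:
            counts["syn"] += 1       # connection attempts (SYN flood indicator)
        elif has_ack and not has_syn:
            counts["ack"] += 1       # ACK set without SYN (ACK flood indicator)
        elif has_syn and has_ack:
            counts["syn_ack"] += 1
        else:
            counts["other"] += 1
    return {name: count / period_seconds for name, count in counts.items()}


# Hypothetical flags bytes from a two-second attack window.
window = [0x02] * 6 + [0x10] * 3 + [0x12]
rates = tcp_flag_rates(window, period_seconds=2)
print(rates)
```

A SYN rate far above the ACK and SYN-ACK rates in the same window would be one hint of a SYN flood; the actual decision rule is a policy choice outside this sketch.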
Current operations need past access records for purposes that include tracing problems and offline testing of products, and the extraction of required data serves this need. Specifically, based on the traffic stored in the flow storage cluster 220, the non-real-time analysis cluster 230 extracts, according to the destination IP of a product line, the packets with that destination IP from the stored traffic and saves them in a file format such as packet capture (pcap), so that the packets can later be provided to the business party that requested them.
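A minimal sketch of extracting packets by destination IP into a pcap stream is shown below, using only the Python standard library. The in-memory packet records and their field names (`dst_ip`, `ts_sec`, `ts_usec`, `raw`) are assumptions; how the stored traffic is actually indexed and read is not described in the patent.

```python
import io
import struct

# Classic pcap global header: magic, version 2.4, timezone offset, sigfigs,
# snapshot length, and link type 1 (Ethernet), little-endian.
PCAP_GLOBAL_HEADER = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)


def extract_to_pcap(packets, wanted_dst_ips, out):
    """Write packets whose destination IP is wanted to a pcap stream."""
    out.write(PCAP_GLOBAL_HEADER)
    written = 0
    for pkt in packets:
        if pkt["dst_ip"] not in wanted_dst_ips:
            continue
        data = pkt["raw"]
        # Per-packet record header: seconds, microseconds, captured and
        # original length (equal here because nothing is truncated).
        out.write(struct.pack("<IIII", pkt["ts_sec"], pkt["ts_usec"],
                              len(data), len(data)))
        out.write(data)
        written += 1
    return written


# Hypothetical stored packets; the payload bytes are placeholders.
stored = [
    {"dst_ip": "10.0.0.1", "ts_sec": 1, "ts_usec": 0, "raw": b"\x00" * 60},
    {"dst_ip": "10.0.0.2", "ts_sec": 2, "ts_usec": 0, "raw": b"\x00" * 60},
    {"dst_ip": "10.0.0.1", "ts_sec": 3, "ts_usec": 0, "raw": b"\x00" * 42},
]
buffer = io.BytesIO()
count = extract_to_pcap(stored, {"10.0.0.1"}, buffer)
print(count, len(buffer.getvalue()))
```

Passing an opened binary file instead of `io.BytesIO` would produce a file in the pcap layout that standard capture-analysis tools read.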
The large-scale data processing method implemented on the basis of the above processing system can be as shown in Fig. 3 and mainly comprises the following steps:
Step 301: The flow collection subsystem collects data traffic and mirrors the collected data traffic. Step 302 is performed on one path of the resulting mirrored traffic; the other path is sent to the real-time analysis cluster in the flow processing subsystem, and step 305 is performed.
When collecting data traffic, the collection points can be deployed at the egress of the external-network core switch, that is, the data traffic of the external-network core switch is collected.
The collected data traffic can be mirrored in either of the following two ways:
First, port mirroring: the data of one or more ports of the external-network core switch is mirrored to one or more other ports to collect the data traffic. This is prior art and is not described in detail here.
Second, optical-splitting mirroring: an optical splitter first splits the egress data of the external-network core switch. Because signal strength attenuates after splitting, the split traffic can be further optically amplified so that its signal strength is sufficient and the data remains complete and reliable. Compared with port mirroring, optical-splitting mirroring offers higher stability and reliability: port mirroring can affect the core switch itself, and for online services a core switch failure is fatal. Optical-splitting mirroring is therefore the preferred way to collect data traffic.
Step 302: The mirrored traffic is split into P sub-flows that are sent to the flow storage cluster in the flow processing subsystem, where P is an integer greater than 1.
The splitting in this step can be performed by a splitter switch, which uses trunk mode and a load-balancing technique to split the mirrored traffic into P sub-flows.
Step 303: The M storage servers in the flow storage cluster each receive the sub-flows assigned to them and write those sub-flows to their N attached disks using a load-balancing technique, where M is a positive integer, N is an integer greater than 1, and M × N ≥ P.
The load balancing used in this step can consist of writing to the disks in turn, in units of a preset time length. Multiple processes can run on each storage server, with each process corresponding to a subset of the disks; each process is responsible for receiving a portion of the sub-flows and writing that portion to its corresponding disks in turn, in units of a preset time length. This load-balancing strategy makes full use of the independence of the processes and disks, reduces the sustained write pressure on the disks, and better solves the problem of storing massive data, while also improving disk utilization and effectively saving server cost.
Step 304: The non-real-time analysis cluster in the flow processing subsystem aggregates the sub-flows stored by the flow storage cluster and then analyzes them. The analysis includes, but is not limited to, mining of attack behavior or extraction of required data.
When mining attack behavior, the traffic of the attack period can be extracted and the attack behavior analyzed from the characteristics of the extracted traffic. For example, common network attacks mainly include network-layer bandwidth attacks, TCP-layer SYN flood and ACK flood attacks, and application-layer distributed request attacks. These attacks affect the stable operation of products. Based on the historical traffic stored in the flow storage cluster, attack characteristics can be analyzed in depth to support product-line defense and attack forensics. For network-layer bandwidth attacks, commonly UDP flood and ICMP flood, the traffic of the attack period is extracted and the volume of each traffic type in that period is counted to determine the attack type and scale. For TCP-layer attacks that exhaust protocol stack resources, the traffic of the attack period is extracted and the packet rate for each type of TCP flag in that period is counted to determine the attack type and scale. For application-layer distributed request attacks, the packets of the attack period are extracted and the fields of the HTTP request headers in that period are counted, including host, url, cookie, User-Agent, and referer, to determine the attack type and, further, the product line and pages under attack; the request characteristics of the HTTP headers are also summarized to provide identifying signatures for blocking policies.
Current operations need past access records for purposes that include tracing problems and offline testing of products, and the extraction of required data serves this need. Specifically, based on the traffic stored in the flow storage cluster, the non-real-time analysis cluster extracts, according to the destination IP of a product line, the packets with that destination IP from the stored traffic and saves them in a file format such as pcap, so that the packets can later be provided to the business party that requested them.
Step 305: the real-time analysis cluster carries out the statistics of flow information to the mirror image flow that receives, and utilizes statistics to generate Study document.
In this step, several servers in the real-time analysis cluster receive the mirror image flow, and the flow information of adding up is write the log file.Tabulate statistics module in the real-time analysis cluster is downloaded the log file that above-mentioned several servers generated then; The flow information that gathers in each journal file obtains and exports Study document, and the Cycle Length of wherein tabulate statistics module download log file writes the flow information of statistics greater than above-mentioned several servers the Cycle Length of journal file.
The identical bag of above-mentioned several server operations is caught and statistics program; The bag prize procedure can be accomplished the efficient packet receiving from ten thousand Broadcoms; Statistics program is the statistics respectively that unit carries out flow information with purpose ip; The content of statistics can include but not limited to: tcp flow value, udp flow value, icmp flow value etc., and unit is generally bps; Tcp packet rate, udp packet rate, icmp packet rate etc., unit is generally pps; The access times of non-serve port per second; The get request number of http per second, get length of data package; The information such as packet number that the main conditional code per second of http is responded.Can statistics be write the log file with binary format then.
Said system provided by the invention and method; Through traffic mirroring, storage server cluster and to descend to hang the flow memory load of disk balanced; Realized the storage demand of large-scale data; Further large-scale mirror image flow is realized the real-time analysis demand, analyze cluster through non real-time the data of storage server cluster storage are carried out the non real-time analyze demands of Macro or mass analysis realization to large-scale data through the real-time analysis cluster.Empirical tests, the present invention can be good at the data flow that processing bandwidth surpasses 100G, and data possess integrality and stability, and network equipment cost aspect is with the obvious advantage.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (16)

1. A system for processing large-scale data, characterized in that the system comprises a flow collection subsystem and a flow processing subsystem;
the flow collection subsystem is configured to collect data traffic, mirror the collected data traffic, split the resulting mirrored traffic into P sub-flows, and send the sub-flows to a flow storage cluster in the flow processing subsystem, where P is an integer greater than 1;
the flow storage cluster consists of M storage servers, each with N disks attached, where M is a positive integer, N is an integer greater than 1, and M × N ≥ P; each storage server receives the sub-flows distributed to it and uses a load-balancing technique to write them to its N attached disks.
2. The system according to claim 1, characterized in that the flow collection subsystem comprises:
a flow collection unit configured to collect the data traffic at the egress of an external network core switch and to mirror the collected data traffic, and
a splitting processing unit configured to split the mirrored traffic into the sub-flows using a load-balancing technique.
3. The system according to claim 2, characterized in that the flow collection unit consists of an optical splitter and an optical amplifier;
the optical splitter performs optical splitting on the data traffic at the egress of the external network core switch, and the optical amplifier optically amplifies the split data traffic to form the mirrored traffic.
4. The system according to claim 2, characterized in that the splitting processing unit is a traffic-splitting switch that, in trunk mode, uses a load-balancing technique to split the mirrored traffic into P sub-flows.
5. The system according to claim 1, characterized in that multiple processes run on each storage server, each process corresponds to a subset of the N disks, and each process is responsible for receiving a portion of the sub-flows and writing the received sub-flows to its corresponding disks in rotation, in units of a preset time span.
6. The system according to claim 1, characterized in that the flow processing subsystem further comprises a real-time analysis cluster;
the flow collection subsystem mirrors the collected data traffic to obtain two streams of mirrored traffic, one of which undergoes the splitting and the other of which is sent to the real-time analysis cluster;
the real-time analysis cluster is configured to compile traffic statistics on the mirrored traffic it receives and to generate an analysis file from the statistics.
7. The system according to claim 6, characterized in that the real-time analysis cluster comprises a real-time receiving module, composed of a server cluster, and a summary statistics module;
several servers in the real-time receiving module receive the mirrored traffic and write the traffic statistics they compute to log files;
the summary statistics module downloads the log files generated by these servers, aggregates the traffic information in each log file, and outputs an analysis file, wherein the period of the download is longer than the period at which the real-time receiving module writes traffic statistics to the log files.
8. The system according to claim 1, characterized in that the flow processing subsystem further comprises a non-real-time analysis cluster configured to aggregate the sub-flows stored by the flow storage cluster and then analyze them, the analysis comprising: mining of attacks or extraction of required data.
9. A method for processing large-scale data, characterized in that the method is applied to a large-scale data processing system comprising a flow collection subsystem and a flow processing subsystem, a flow storage cluster in the flow processing subsystem consists of M storage servers, each with N disks attached, and the method comprises:
the flow collection subsystem collecting data traffic, mirroring the collected data traffic, splitting the resulting mirrored traffic into P sub-flows, and sending the sub-flows to the flow storage cluster, where P is an integer greater than 1;
each storage server receiving the sub-flows distributed to it and using a load-balancing technique to write them to its N attached disks; wherein M is a positive integer, N is an integer greater than 1, and M × N ≥ P.
10. The method according to claim 9, characterized in that collecting data traffic specifically comprises: collecting the data traffic of an external network core switch.
11. The method according to claim 9, characterized in that mirroring the collected data traffic specifically comprises:
performing optical splitting on the collected data traffic with an optical splitter, and optically amplifying the split data traffic with an optical amplifier to form the mirrored traffic.
12. The method according to claim 9, characterized in that splitting the resulting mirrored traffic into P sub-flows specifically comprises:
using the trunk mode of a traffic-splitting switch, with a load-balancing technique, to split the mirrored traffic into P sub-flows.
13. The method according to claim 9, characterized in that using a load-balancing technique to write the distributed sub-flows to the N attached disks specifically comprises: running multiple processes on each storage server, each process corresponding to a subset of the N disks, and each process being responsible for receiving a portion of the sub-flows and writing the received sub-flows to its corresponding disks in rotation, in units of a preset time span.
14. The method according to claim 9, characterized in that, when mirroring the collected data traffic, the flow collection subsystem obtains two streams of mirrored traffic, one of which undergoes the splitting and the other of which is sent to a real-time analysis cluster of the flow processing subsystem;
the real-time analysis cluster compiles traffic statistics on the mirrored traffic it receives and generates an analysis file from the statistics.
15. The method according to claim 14, characterized in that compiling traffic statistics on the received mirrored traffic and generating an analysis file from the statistics specifically comprises:
several servers in the real-time analysis cluster receiving the mirrored traffic and writing the traffic statistics they compute to log files;
a summary statistics module in the real-time analysis cluster downloading the log files generated by these servers, aggregating the traffic information in each log file, and outputting an analysis file, wherein the period of the download is longer than the period at which the traffic statistics are written to the log files.
16. The method according to claim 9, characterized in that the method further comprises:
a non-real-time analysis cluster aggregating the sub-flows stored by the flow storage cluster and then analyzing them, the analysis comprising: mining of attacks or extraction of required data.
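The disk-writing scheme of claims 5 and 13, in which each process owns a subset of the N disks and writes to them in rotation by time slice, can be sketched as follows. The mount paths, the function names, the interleaved partitioning, and the 60-second slice are hypothetical; the claims specify only that each process corresponds to a subset of the disks and rotates among them in units of a preset time span.

```python
def partition_disks(disks, num_processes):
    """Give each process a disjoint subset of the server's disks."""
    if not 1 <= num_processes <= len(disks):
        raise ValueError("need between 1 and len(disks) processes")
    # Interleaved split (an assumed policy): process i gets disks i, i+k, ...
    return [disks[i::num_processes] for i in range(num_processes)]


def disk_for(process_disks, timestamp, slice_seconds):
    """Pick the disk a process writes to during the slice containing timestamp."""
    slice_no = int(timestamp // slice_seconds)
    return process_disks[slice_no % len(process_disks)]


# Hypothetical server with N = 6 disks and 2 writer processes.
disks = [f"/data{i}" for i in range(6)]
subsets = partition_disks(disks, 2)

# Process 0 moves to the next of its disks every 60 seconds.
schedule = [disk_for(subsets[0], t, 60) for t in (0, 59, 60, 120, 180)]
```

Under this rotation each disk is written during only one slice out of every `len(process_disks)`, which is how the sustained write pressure on any single disk is reduced.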
CN201210102411.2A 2012-04-09 2012-04-09 Method and system for processing large-scale data Active CN102664789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210102411.2A CN102664789B (en) 2012-04-09 2012-04-09 Method and system for processing large-scale data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210102411.2A CN102664789B (en) 2012-04-09 2012-04-09 Method and system for processing large-scale data

Publications (2)

Publication Number Publication Date
CN102664789A true CN102664789A (en) 2012-09-12
CN102664789B CN102664789B (en) 2016-08-17

Family

ID=46774207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210102411.2A Active CN102664789B (en) 2012-04-09 2012-04-09 Method and system for processing large-scale data

Country Status (1)

Country Link
CN (1) CN102664789B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103051552A (en) * 2012-12-04 2013-04-17 恒安嘉新(北京)科技有限公司 Intelligent management and control method and system based on separation of tandem connection blockage and side channel analysis
CN103051612A (en) * 2012-12-13 2013-04-17 华为技术有限公司 Firewall and method for preventing network attack
CN104461385A (en) * 2014-12-02 2015-03-25 国电南瑞科技股份有限公司 Multi-hard disk balanced storage method with self-adaptive port traffic
CN105893628A (en) * 2016-05-17 2016-08-24 中国农业银行股份有限公司 Real-time data collection system and method
CN108093048A (en) * 2017-12-19 2018-05-29 北京盖娅互娱网络科技股份有限公司 A kind of method and apparatus for obtaining using interaction data
CN108989101A (en) * 2018-07-04 2018-12-11 北京奇艺世纪科技有限公司 A kind of log output system, method and electronic equipment
CN110881058A (en) * 2018-09-06 2020-03-13 阿里巴巴集团控股有限公司 Request scheduling method, device, server and storage medium
CN111061431A (en) * 2019-11-28 2020-04-24 曙光信息产业股份有限公司 Distributed storage method, server and client
CN114363346A (en) * 2020-02-14 2022-04-15 北京百度网讯科技有限公司 IP mounting and data processing method and device
CN114978884A (en) * 2022-07-27 2022-08-30 北京搜狐新媒体信息技术有限公司 Data packet processing method and device
CN115544781A (en) * 2022-10-18 2022-12-30 南方电网科学研究院有限责任公司 Construction method and device of large power grid test system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1599356A (en) * 2004-09-21 2005-03-23 北京锐安科技有限公司 Flow equilization processing method and device based on connection pair
CN101795211A (en) * 2010-01-13 2010-08-04 北京中创信测科技股份有限公司 Data storage method and system
US8010829B1 (en) * 2005-10-20 2011-08-30 American Megatrends, Inc. Distributed hot-spare storage in a storage cluster

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
苏红环等: "MSP系统对TD GPRS核心网Iu-PS接口的监控方案讨论", 《移动通信 》 *
谭子军: "分布式加密存储系统的文件数据分布于磁盘配额管理技术研究", 《国防科学技术大学工程硕士学位论文 》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103051552B (en) * 2012-12-04 2015-06-17 恒安嘉新(北京)科技有限公司 Intelligent management and control method and system based on separation of tandem connection blockage and side channel analysis
CN103051552A (en) * 2012-12-04 2013-04-17 恒安嘉新(北京)科技有限公司 Intelligent management and control method and system based on separation of tandem connection blockage and side channel analysis
CN103051612A (en) * 2012-12-13 2013-04-17 华为技术有限公司 Firewall and method for preventing network attack
CN103051612B (en) * 2012-12-13 2015-09-30 华为技术有限公司 Fire compartment wall and prevent method of network attack
CN104461385A (en) * 2014-12-02 2015-03-25 国电南瑞科技股份有限公司 Multi-hard disk balanced storage method with self-adaptive port traffic
CN105893628A (en) * 2016-05-17 2016-08-24 中国农业银行股份有限公司 Real-time data collection system and method
CN108093048B (en) * 2017-12-19 2021-04-02 北京盖娅互娱网络科技股份有限公司 Method and device for acquiring application interaction data
CN108093048A (en) * 2017-12-19 2018-05-29 北京盖娅互娱网络科技股份有限公司 A kind of method and apparatus for obtaining using interaction data
CN108989101A (en) * 2018-07-04 2018-12-11 北京奇艺世纪科技有限公司 A kind of log output system, method and electronic equipment
CN110881058A (en) * 2018-09-06 2020-03-13 阿里巴巴集团控股有限公司 Request scheduling method, device, server and storage medium
CN110881058B (en) * 2018-09-06 2022-04-12 阿里巴巴集团控股有限公司 Request scheduling method, device, server and storage medium
CN111061431A (en) * 2019-11-28 2020-04-24 曙光信息产业股份有限公司 Distributed storage method, server and client
CN111061431B (en) * 2019-11-28 2023-06-23 曙光信息产业股份有限公司 Distributed storage method, server and client
CN114363346A (en) * 2020-02-14 2022-04-15 北京百度网讯科技有限公司 IP mounting and data processing method and device
CN114978884A (en) * 2022-07-27 2022-08-30 北京搜狐新媒体信息技术有限公司 Data packet processing method and device
CN114978884B (en) * 2022-07-27 2022-12-13 北京搜狐新媒体信息技术有限公司 Data packet processing method and device
CN115544781A (en) * 2022-10-18 2022-12-30 南方电网科学研究院有限责任公司 Construction method and device of large power grid test system

Also Published As

Publication number Publication date
CN102664789B (en) 2016-08-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant