CN106452868A

CN106452868A - Network traffic statistics implementation method supporting multi-dimensional aggregation classification

Info

Publication number: CN106452868A
Application number: CN201610888795.3A
Authority: CN
Inventors: 谭齐; 莫娴; 田永春
Original assignee: CETC 30 Research Institute
Current assignee: CETC 30 Research Institute
Priority date: 2016-10-12
Filing date: 2016-10-12
Publication date: 2017-02-22
Anticipated expiration: 2036-10-12
Also published as: CN106452868B

Abstract

The invention discloses a network traffic statistics implementation method supporting multi-dimensional aggregation classification. The method uses multi-level storage and an extensible data format; it is flexible to use, highly extensible, occupies little storage space, and can meet different application requirements. Original packets are stored in a database and original flows in binary files, which meets the requirements for storing large volumes of data and keeping them complete. Flow linked lists are held in memory, which speeds up user access to the data and supports traffic statistics analysis at multiple granularities. A third-order HASH aggregated-traffic linked-list matrix is introduced to speed up traffic query and access, so that users can quickly locate the information they query. The method defines a multi-dimensional traffic aggregation classification algorithm: it can perform dynamic traffic analysis over the time periods chosen by an application, and it also supports aggregated analysis of service traffic as the groupings of applications formed on source and destination addresses change, thereby realizing multi-dimensional traffic aggregation, classification and statistical analysis over flexible combinations of five keywords.

Description

A network traffic statistics implementation method supporting multi-dimensional aggregation classification

Technical field

The present invention relates to a network traffic statistics implementation method supporting multi-dimensional aggregation classification.

Background art

Network traffic statistics and analysis are the basis for understanding network behavior and an effective way to analyze network conditions and characterize traffic. By collecting data or tracing packets, the activity patterns of different network applications can be determined quantitatively, giving network operators a basis for controlling and differentiating network traffic. In addition, statistics and analysis of the time series that make up service traffic reveal the behavioral characteristics of user services, and parameters that describe the traffic can be extracted from them to support modeling, simulation and performance evaluation of service traffic and prediction of future user demand. A thorough understanding of the composition of network traffic, its root causes, and the trends of modern network services is therefore important for network planning, network operation and maintenance, QoS guarantees, network security, and so on.

Most current traffic statistics designs are based on NetFlow. NetFlow defines a flow by the destination IP address, source IP address, destination port number, source port number, protocol number, ToS, and input or output interface of an IPv4 packet; packets whose seven-tuple is identical belong to the same flow. Its statistics are therefore based on IP flows: from this information, statistics and trend analysis can be produced per protocol (application), host IP (user), service, and so on, enabling network traffic monitoring, user application monitoring, network security, network planning, abnormal traffic detection, and similar functions. However, NetFlow keeps only summary statistics of an IP flow, such as its packet count and total packet length, and not the characteristics of each individual packet in the flow (for example packet length and inter-packet transmission time), so fine-grained traffic characteristics are hard to obtain. Moreover, many existing kinds of service traffic are hard to distinguish by the seven-tuple alone; they must be identified through application-layer protocol identification, by analyzing the application-layer header.

Another key problem in network traffic statistics and analysis is how to organize and store the collected data, especially the massive data produced in high-speed, high-volume network environments. The raw information is not only large but also complex, which makes processing the original traffic data and organizing its storage difficult. Effective organization and storage of the original data is therefore the basis for traffic identification at different granularities and for comprehensive traffic statistics analysis.

Traffic statistics analysis usually has to serve the user's analysis needs, running a variety of traffic queries in order to analyze traffic trends. Querying the database directly is often very slow, because the statistics are scattered across tables, and system efficiency is low. It is therefore necessary to aggregate and classify network traffic. Traffic aggregation means merging original flow records according to some aggregation condition, so that several flows are merged into one. Because there are far fewer aggregated flow records than original flow records, user queries become much more efficient; a flexible and efficient traffic aggregation method is thus a key means of improving data query efficiency, that is, traffic analysis efficiency. One-dimensional and two-dimensional traffic aggregation classification algorithms are currently common. To satisfy all of a user's aggregation queries with them, every aggregation the user might request has to be prepared in advance, one by one, so preliminary data preparation takes a long time and occupies a great deal of storage. Current algorithms that support high-dimensional (more than two dimensions) aggregation classification either require too much memory to meet low-cost requirements or classify and query too slowly to meet the needs of traffic statistics analysis applications.

A method of processing and organizing aggregated traffic data is therefore needed that occupies little storage space while adapting flexibly to users' various traffic aggregation queries.

Summary of the invention

To overcome the above shortcomings of the prior art, the present invention provides a network traffic statistics implementation method supporting multi-dimensional aggregation classification, intended to meet the needs of traffic identification at different protocol layers. The invention proposes an extensible storage format for original packet information. This format contains not only the seven-tuple information of the IP data but also the application-layer protocol header of the original packet, so it serves both seven-tuple IP flow identification and application-layer DPI traffic identification (Deep Packet Inspection, a protocol identification technique based on the application layer). On this basis, a data structure for original data flows is proposed. It stores the statistics of the original flow as well as the length and transmission time of each packet that makes up the flow. From the resulting time series of the flow, the probabilistic characteristics of the data traffic can be mined in depth (for example by correlation analysis or cluster analysis) and fitted, which meets the needs of fine-grained traffic statistics analysis. In addition, based on an analysis of the aggregation queries users commonly make, a traffic aggregation processing algorithm is designed in which the third-order aggregated traffic matrix occupies little storage, flexibly satisfies users' multi-dimensional aggregation queries, and makes queries fast and efficient.

The technical solution adopted by the invention to solve its technical problem is a network traffic statistics implementation method supporting multi-dimensional aggregation classification, comprising the following:

1. Store the original IP packet information in a database.

2. Store the original data flows:

1) Define the data structure of an original flow:

One flow record node records the information of one flow, and all flow record nodes are connected in a linked list. A flow record node comprises a flow head list node and the packet linked list linked to it.

2) Store the original data flows:

When fine-grained traffic statistics are required, the flow head linked list and the packet linked list are kept in memory to await user queries and reads. When fine-grained traffic statistics are not required, the flow head list information is stored in a file.

3. Aggregate traffic:

1) Aggregate by time granularity: the aggregation items include the traffic value and the packet count; a flow head linked list is generated and kept in memory.

2) Aggregate by aggregation condition: traffic is aggregated according to the user's query conditions, generating a third-order aggregated-traffic linked-list matrix.

4. Query traffic:

1) Obtain the aggregated traffic linked lists for the period ts to te that satisfy the condition of each order.

2) Initialize the parameters.

3) Find the flow nodes that fall in the period t to t+T in the three aggregated traffic linked lists Slink, Dlink and Plink.

4) Find the nodes common to the three aggregated traffic linked lists in the period t to t+T.

5) Advance in a loop with step T; when t >= te the query ends and the result is output.
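The query steps above can be sketched in C as follows. This is only an illustrative sketch under stated assumptions, not the patent's implementation: the reduced `flow_head` layout, the millisecond timestamps, the function name `query_windows`, and the choice to detect a "common node" by pointer identity (all three lists reference the same time-aggregated flow record) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced flow record: only the fields this sketch needs (assumed layout). */
struct flow_head {
    uint64_t start_ms;     /* flow start time in milliseconds */
    uint32_t lengthBytes;  /* total bytes of the flow */
};

/* Node of an aggregated traffic list; data refers to a shared flow record. */
struct flow_Link {
    struct flow_head *data;
    struct flow_Link *next;
};

/* Returns 1 if the list holds a node that references flow f. */
static int contains(const struct flow_Link *list, const struct flow_head *f)
{
    for (; list != NULL; list = list->next)
        if (list->data == f)
            return 1;
    return 0;
}

/*
 * For each window [t, t+T) from ts up to te, sums the bytes of the flows
 * that start in the window and are referenced by all three lists.
 * out must have room for one entry per window.
 * Returns the number of windows written.
 */
size_t query_windows(const struct flow_Link *Slink,
                     const struct flow_Link *Dlink,
                     const struct flow_Link *Plink,
                     uint64_t ts, uint64_t te, uint64_t T,
                     uint64_t *out)
{
    size_t n = 0;
    assert(T > 0 && out != NULL);
    for (uint64_t t = ts; t < te; t += T) {          /* step 5: advance by T */
        uint64_t bytes = 0;
        for (const struct flow_Link *s = Slink; s != NULL; s = s->next) {
            const struct flow_head *f = s->data;
            if (f->start_ms < t || f->start_ms - t >= T)
                continue;                            /* step 3: outside window */
            if (contains(Dlink, f) && contains(Plink, f))
                bytes += f->lengthBytes;             /* step 4: common node */
        }
        out[n++] = bytes;
    }
    return n;
}
```

The nested scan keeps the sketch short; since each list is already restricted to one source address, destination address or port/protocol key, the lists being intersected are expected to be small.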

Compared with the prior art, the positive effects of the invention are as follows. The invention uses multi-level storage and an extensible data format; it is flexible to use, highly extensible, occupies little storage space, and can meet different application requirements. Original packets are stored in a database and original flows in binary files, which guarantees the storage and integrity of large volumes of data. Flow linked lists are held in memory, which speeds up user access to the data and supports traffic statistics analysis at multiple granularities. In addition, the invention defines a traffic aggregation classification algorithm that supports multiple dimensions: it can perform dynamic traffic analysis over the time periods chosen by an application, and it also supports aggregated analysis of service traffic as the groupings of applications formed on source and destination addresses change, realizing multi-dimensional traffic aggregation, classification and statistical analysis over flexible combinations of five keywords. The algorithm introduces a third-order HASH aggregated-traffic linked-list matrix, which speeds up traffic query and access and helps users quickly find the information they query. It is also very flexible: changing the dimension information used for aggregation classification changes the information held in memory, memory usage is small, and a wide range of user traffic analysis needs can be met easily.

Brief description of the drawings

The invention is described by way of example with reference to the accompanying drawings, in which:

Fig. 1 is a schematic diagram of the workflow of the invention;

Fig. 2 is a schematic diagram of the linked-list structure of data flow record nodes;

Fig. 3 is a schematic diagram of the third-order aggregated traffic matrix;

Fig. 4 is a schematic diagram of the multi-dimensional aggregated traffic query process.

Detailed description of the invention

The purpose of the invention is to design a flexible and efficient method for storing and organizing network traffic statistics data. It defines a data format that is highly extensible and complete in its information, and it combines a database, files and memory in a multi-level storage scheme to meet the requirements of identifying and querying network traffic at different protocol layers and different granularities. It builds a third-order aggregated traffic matrix that is highly reusable, highly extensible and fast to search, meeting users' needs for querying and retrieving many kinds of aggregated data traffic. The method occupies little space, executes quickly, and is flexible and efficient in use, which makes it particularly suitable for traffic statistics analysis on large-scale networks and massive data.

As shown in Fig. 1, the workflow of the technical solution adopted by the invention is divided into four steps, described in detail below:

1. Storage of original IP packet information:

The volume of original IP packet information obtained from the traffic collection point of a router is very large. The collected data must be preliminarily classified and organized before it can be saved, and the saved content must be complete to facilitate later statistical analysis. A database is therefore used for storage. The original packet information database is the central place for information exchange: the network traffic statistics analysis system accesses the database to obtain the most original traffic data of a given test and carries out various statistical analyses according to the user's needs. The design of the database table structure is therefore very important for the performance of statistical analysis.

The database stores the most original service traffic information: the information of each IP packet and its application-layer message is stored in time order. The packet information storage format is defined in the following table:

2. Generation and storage of original data flows:

After service packet data has been collected and stored in the database, the next task is to identify and classify IP flows from the original packet information, that is, to generate the original data flows. A flow in a network is a unidirectional sequence of packets between a given source and destination. More broadly, network traffic is the sequence of packets passing through the same router that satisfy the same characteristic conditions, where the characteristic conditions include attributes of the network protocol layers such as port number, protocol number, and application-layer protocol features. A flow can be defined by the packet header (for example the protocol type, ToS, or part of the destination address), by the packet itself (for example its size), and by the result of packet processing (for example the router output port of the packet). There are many ways to classify IP flows, and the invention does not study them specifically; following common practice, a data flow is characterized by seven commonly used key fields. That is, all packets with the same source/destination IP address, source/destination port, protocol, type of service, and input/output logical port of the network device are grouped into the same flow, and the packets and bytes in that flow are counted.

1) Definition of the data structure of an original flow

A flow record node (flow node) records the information of one flow, and all flow record nodes are connected in a linked list. A flow record node comprises a flow_head list node and a pkt_inf linked list, as shown in Fig. 2. Each flow head (flow_head) contains the classification features and the statistics of the flow. The packet linked list (pkt_inf) linked to it records the details of every packet in the flow: its length, its IP identification (with which the packet's application-layer data can be found in the database), the inter-packet interval, and so on. This gives good support for fine-grained statistical analysis of original traffic, service behavior analysis, and simulation and fitting of service traffic.

The data structure of flow_head is defined as follows:

struct flow_head {
    uint srcIP;                    // source IP of the flow record
    uint dstIP;                    // destination IP of the flow record
    ushort srcPort;                // source port of the flow record
    ushort dstPort;                // destination port of the flow record
    uchar protocol;                // protocol type of the flow record
    ushort input;                  // input interface
    ushort output;                 // output interface
    uchar tos;                     // type of service
    uchar appType;                 // application-layer service type; 0 means no application-layer identification has been performed
    /* flow information */
    struct Time startTime;         // flow start time (system time, millisecond precision)
    struct Time endTime;           // flow end time (system time, millisecond precision)
    uint pktsNum;                  // number of packets in the flow
    uint lengthBytes;              // total bytes of the flow
    struct pkt_inf *pkt_inf_ptr;   // head of the packet linked list of this flow
    struct flow_head *next;        // next flow record node
};

Note that the flow_head structure defines the variable appType, which is used when traffic has to be classified and identified at the application layer. The user first defines the appType values for the protocol signature words used in application-layer traffic identification; the signature words can be mapped one-to-one to appType values through an array. Application-layer identification then searches the AppDataLen and AppData fields in the database for a predefined protocol signature word, obtains the corresponding appType value, and writes it into the flow linked list.
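As an illustration of the array-based mapping just described, the following C sketch looks for predefined signature words in an application-layer payload and returns the matching appType. The signature strings, the numbering (array index plus one) and the function name `classify_app` are hypothetical examples chosen for the sketch; the source does not list concrete signature words.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature table: entry i corresponds to appType i + 1.
 * appType 0 is reserved for "not identified", as in flow_head. */
static const char *const app_signatures[] = {
    "HTTP/1.",  /* appType 1 (example only) */
    "SSH-",     /* appType 2 (example only) */
    "220 ",     /* appType 3 (example only) */
};

/* Scans app_data (AppData, with length AppDataLen) for any known signature
 * word and returns its appType, or 0 if none is found. */
unsigned char classify_app(const unsigned char *app_data, size_t app_data_len)
{
    size_t count = sizeof app_signatures / sizeof app_signatures[0];
    assert(app_data != NULL || app_data_len == 0);
    for (size_t i = 0; i < count; i++) {
        size_t sig_len = strlen(app_signatures[i]);
        if (sig_len > app_data_len)
            continue;
        for (size_t off = 0; off + sig_len <= app_data_len; off++)
            if (memcmp(app_data + off, app_signatures[i], sig_len) == 0)
                return (unsigned char)(i + 1);
    }
    return 0;
}
```

A real classifier would likely anchor each signature at a fixed offset or use a multi-pattern matcher; the linear scan is kept only to make the mapping idea visible.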

The data structure of pkt_inf is defined as follows:

struct pkt_inf {
    ushort pktID;                  // IP identification of the packet
    ushort pktSize;                // packet size in bytes
    double intervalTime;           // inter-packet interval, millisecond precision
    struct pkt_inf *next;          // next packet of the same flow
};

Recording every packet of a flow in the pkt_inf packet linked list makes it possible to obtain probabilistic characteristics of the flow, such as the time series of its packets and the distribution of packet sizes, and so supports finer-grained traffic analysis. When the user needs this fine-grained analysis, the original flow linked list of Fig. 2 (including all flow_head and pkt_inf information) is kept in memory to await user queries and reads. Otherwise, only the flow_head information of the in-memory original flow linked list is stored in a file for later traffic aggregation analysis.
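To make the fine-grained use of the packet list concrete, the sketch below walks a pkt_inf list and derives a packet count, a mean packet size and the flow duration reconstructed from the intervals. The `pkt_stats` type and the function `summarize_packets` are assumptions introduced for the example, and fixed-width integer types stand in for the document's `ushort`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packet node with the same fields as pkt_inf in the text. */
struct pkt_inf {
    uint16_t pktID;        /* IP identification of the packet */
    uint16_t pktSize;      /* packet size in bytes */
    double intervalTime;   /* interval since the previous packet, in ms */
    struct pkt_inf *next;
};

/* Hypothetical summary type for this sketch. */
struct pkt_stats {
    size_t count;          /* number of packets in the list */
    double mean_size;      /* mean packet size in bytes */
    double duration_ms;    /* time from first to last packet */
};

struct pkt_stats summarize_packets(const struct pkt_inf *head)
{
    struct pkt_stats st = {0, 0.0, 0.0};
    unsigned long total = 0;
    for (const struct pkt_inf *p = head; p != NULL; p = p->next) {
        assert(p->intervalTime >= 0.0);
        total += p->pktSize;
        /* The first packet has no predecessor, so its interval is skipped. */
        if (st.count > 0)
            st.duration_ms += p->intervalTime;
        st.count++;
    }
    if (st.count > 0)
        st.mean_size = (double)total / (double)st.count;
    return st;
}
```

The same traversal could fill a histogram of packet sizes or an array of arrival offsets, which is the kind of time-series input the text mentions for correlation or cluster analysis.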

2) Generation of original data flows

(1) According to the selected router and time period, search the database for packets that meet the conditions; traffic statistics are based on the seven key fields. Read the first packet that meets the conditions and create the first flow head list node (flow_head).

(2) Continue searching the database for packets that match the time period and router ID. For each packet found, search the flow_head linked list by the seven key fields to check whether the attributes of the newly arrived packet match an existing flow in flow_head. If they do, update the flow statistics in that flow_head node (flow end time, packet count, total bytes of the flow), create a new packet list node (pkt_inf), record in it the packet length, the interval since the previous packet, the packet's IP identification and similar information, and insert it at the tail of the pkt_inf linked list.

(3) If no matching flow is found in the flow_head linked list, create a new flow (a flow_head node) and insert the node at the tail of the flow_head linked list.

(4) Repeat step (2) until the database search is complete.

(5) Write each flow record node to the original flow data file in the order of the flow record node linked list.
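Steps (1) to (4) can be sketched as a single "add packet" routine. The sketch is a simplified illustration under several assumptions: packets arrive as in-memory `struct packet` values rather than database rows, times are plain millisecond doubles instead of `struct Time`, only the input interface is used as the seventh key field, and a tail pointer is added to each list so that appending does not require a traversal.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Seven key fields (assumption: the input interface is the seventh). */
struct flow_key {
    uint32_t srcIP, dstIP;
    uint16_t srcPort, dstPort, input;
    uint8_t protocol, tos;
};

struct packet { struct flow_key key; uint16_t ip_id, size; double time_ms; };

struct pkt_inf {
    uint16_t pktID, pktSize;
    double intervalTime;
    struct pkt_inf *next;
};

struct flow_head {
    struct flow_key key;
    double startTime, endTime;
    uint32_t pktsNum, lengthBytes;
    struct pkt_inf *pkt_head, *pkt_tail;
    struct flow_head *next;
};

struct flow_list { struct flow_head *head, *tail; };

static int same_key(const struct flow_key *a, const struct flow_key *b)
{
    return a->srcIP == b->srcIP && a->dstIP == b->dstIP &&
           a->srcPort == b->srcPort && a->dstPort == b->dstPort &&
           a->input == b->input && a->protocol == b->protocol &&
           a->tos == b->tos;
}

/* Adds one packet: updates the matching flow or appends a new flow node.
 * Returns 0 on success, -1 if memory allocation fails. */
int add_packet(struct flow_list *fl, const struct packet *pk)
{
    assert(fl != NULL && pk != NULL);
    struct flow_head *f = fl->head;
    while (f != NULL && !same_key(&f->key, &pk->key))
        f = f->next;                          /* step (2): search by key */

    struct pkt_inf *pi = malloc(sizeof *pi);
    if (pi == NULL)
        return -1;
    pi->pktID = pk->ip_id;
    pi->pktSize = pk->size;
    pi->next = NULL;

    if (f == NULL) {                          /* step (3): new flow at tail */
        f = malloc(sizeof *f);
        if (f == NULL) { free(pi); return -1; }
        f->key = pk->key;
        f->startTime = pk->time_ms;
        f->pktsNum = 0;
        f->lengthBytes = 0;
        f->pkt_head = f->pkt_tail = NULL;
        f->next = NULL;
        if (fl->tail != NULL) fl->tail->next = f; else fl->head = f;
        fl->tail = f;
        pi->intervalTime = 0.0;
    } else {
        pi->intervalTime = pk->time_ms - f->endTime;
    }
    if (f->pkt_tail != NULL) f->pkt_tail->next = pi; else f->pkt_head = pi;
    f->pkt_tail = pi;
    f->endTime = pk->time_ms;                 /* update flow statistics */
    f->pktsNum++;
    f->lengthBytes += pk->size;
    return 0;
}
```

Fields are compared one by one rather than with `memcmp`, because struct padding bytes are not guaranteed to hold equal values.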

3) Storage of original data flows

As mentioned above, original data flows are first generated in memory. When fine-grained traffic statistics analysis is not needed, the flow head list information (the list information comprising flow_head and pkt_inf) can be stored in files; the benefit is that storage is not limited by memory size and the data is easy to reuse for offline statistical analysis. The original flow data supports the subsequent traffic aggregation and is not queried directly by users. If flow data were saved in the usual database manner, database insertion efficiency and file size would both become bottlenecks for traffic statistics analysis. Memory is fast to read, but its size is limited and cannot hold the huge volume of original flow information. The simplest solution is to save flow data in binary files: the valuable fields of each flow are written to the flow data file in binary form, in time order. Saving all data in a single file is impossible, and it would also make data search during statistical analysis extremely inefficient. Because retrieval of flow data is normally restricted to the traffic of a particular router within a particular period, the most convenient and feasible approach is to split the flow data files by time period. For each period (which the user can configure according to the data volume being analyzed and the network running time), the flow data in the buffer is saved to a physical file. The naming rule of the physical output file is DevId_YYYYMMddhhmm.data, where DevId is the number of the router where the traffic is located and YYYYMMddhhmm is the start time of the flow data in the file. Assuming time is segmented by 1 hour, the file name 1001_201501011810.data indicates that the file contains all original flow records produced by the router numbered 1001 between 18:10 and 19:10 on 1 January 2015. Note that no aggregation or similar operation is applied to the flow data in these files. The files simply store the flow_head information of the original data flows, which is generated by reading the original packet data from the database and filtering it by the seven-keyword rule or by application-layer feature information; each flow record occupies one line. In later traffic analysis, this "raw material" is read and then aggregated and classified according to the user's analysis needs.
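The naming rule DevId_YYYYMMddhhmm.data can be produced with standard C formatting, as in the sketch below. The function name `flow_file_name` and the use of `struct tm` for the segment start time are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Writes "DevId_YYYYMMddhhmm.data" into buf for the given router number and
 * segment start time. Returns 0 on success, -1 if buf is too small. */
int flow_file_name(char *buf, size_t buf_len, unsigned dev_id,
                   const struct tm *start)
{
    assert(buf != NULL && start != NULL);
    int n = snprintf(buf, buf_len, "%u_%04d%02d%02d%02d%02d.data",
                     dev_id,
                     start->tm_year + 1900,  /* tm_year counts from 1900 */
                     start->tm_mon + 1,      /* tm_mon is zero-based */
                     start->tm_mday, start->tm_hour, start->tm_min);
    return (n < 0 || (size_t)n >= buf_len) ? -1 : 0;
}
```

For router 1001 and a segment starting at 18:10 on 1 January 2015, this yields the file name used as the example in the text.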

3. Multi-dimensional traffic aggregation and storage

The volume of original flow data is huge; if users queried and analyzed the original record of every flow directly, efficiency would be low. Traffic aggregation sums the traffic of multiple original data flows that satisfy the same aggregation condition and time granularity while retaining the aggregation items, thereby compressing and organizing the original flows.

Traffic aggregation has three key elements: the aggregation condition, the aggregation items, and the time granularity. An aggregation condition is derived from a combination of key fields of the original data flows, and multi-dimensional aggregation means aggregation under several different aggregation conditions. The main purpose of aggregation is to improve data query efficiency, that is, traffic analysis efficiency, so the aggregation strategy should take actual needs fully into account. If aggregation conditions are designed from the query conditions commonly used in traffic analysis, or query conditions are used directly as aggregation conditions, the statistics needed for traffic analysis are obtained as a by-product of aggregation. The analysis system then obtains the statistics by querying the aggregation result table directly, avoiding large numbers of table joins and statistical computations. Common forms of aggregation at present are aggregation by protocol, by address (source IP address and destination IP address), and by port (source port and destination port). The invention therefore takes these five dimensions as the maximum set of supported aggregation dimensions. Aggregation items are the quantities accumulated over flows that satisfy the same aggregation condition, such as total packet count and total packet length; they can be set according to user needs. Time granularity is the period over which traffic is aggregated (for example 5 minutes, 1 hour, or 1 day). The larger the time granularity, the greater the data compression but the more traffic detail is lost, so the time granularity is chosen according to how coarse or fine the user's traffic analysis needs to be.
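The role of the time granularity and the aggregation items can be illustrated with a small bucketing routine. This is a sketch under stated assumptions: timestamps are milliseconds, the aggregation items are total bytes and total packets, and `agg_item` and `aggregate_by_time` are names invented for the example. A full implementation would additionally separate flows by aggregation condition before summing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Aggregation items accumulated per time bucket (assumed choice). */
struct agg_item {
    uint64_t bytes;
    uint64_t packets;
};

/* Adds one flow's totals to the bucket that contains start_ms.
 * Buckets cover [origin_ms + i*granularity_ms, origin_ms + (i+1)*granularity_ms).
 * Returns the bucket index, or -1 if the flow lies outside the bucket array. */
long aggregate_by_time(struct agg_item *buckets, size_t n_buckets,
                       uint64_t origin_ms, uint64_t granularity_ms,
                       uint64_t start_ms,
                       uint32_t flow_bytes, uint32_t flow_pkts)
{
    assert(buckets != NULL && granularity_ms > 0);
    if (start_ms < origin_ms)
        return -1;
    uint64_t idx = (start_ms - origin_ms) / granularity_ms;
    if (idx >= n_buckets)
        return -1;
    buckets[idx].bytes += flow_bytes;
    buckets[idx].packets += flow_pkts;
    return (long)idx;
}
```

With a 5-minute granularity (300000 ms), flows starting at 0 ms and 299999 ms fall in the same bucket, while a flow starting at 300000 ms opens the next one; a larger granularity merges more flows per bucket and loses the corresponding detail.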

1) Generation of aggregated traffic

Traffic aggregation is in essence a multi-objective classification optimization problem rather than a single-objective one. Speed cannot be pursued alone at the cost of an explosion in storage; a compromise must be found among time, space and flexibility. Such a compromise speeds up aggregation classification queries, avoids memory explosion, and meets the flexibility needs of users' various traffic queries. A hash table maps a key to a value quickly, which is very effective for fast in-memory traffic aggregation. A hash table is therefore used as the storage structure for fast matching of aggregated flows, and a third-order traffic aggregation algorithm based on hashing is designed. (A matrix of higher order would improve speed and flexibility but would inevitably occupy too much space, so three orders is a suitable choice.) The aggregation proceeds in two steps:

(1) Aggregate by time granularity. Mainly according to the network router the user is interested in and the traffic start time, select the corresponding original flow data files, read the original flow information from them, and aggregate it by time granularity. The aggregation items include the traffic value and the packet count, and the result is a flow_head linked list. Because the information of individual packets no longer needs to be kept after aggregation by time granularity, pkt_inf_ptr in the flow_head structure is null; that is, the time-aggregated flow linked list holds no pkt_inf lists. The time-aggregated flow linked list is kept in memory in preparation for aggregation by aggregation condition.

(2) Aggregate by aggregation condition. Aggregation by aggregation condition builds on the time-granularity aggregation: traffic is aggregated according to the user's query conditions (such as protocol number, source/destination address, and port number) to generate the third-order aggregated-traffic linked-list matrix of Fig. 3. In the HASH aggregated-traffic linked lists of Fig. 3 generated from the aggregation conditions, the aggregated traffic nodes denoted by letters have the following data structure:

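struct flow_Link {
    struct flow_head *data;        // flow_head node in the time-aggregated flow linked list
    struct flow_Link *next;        // next node of this HASH aggregated-traffic linked list
};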

In this structure, the data pointer holds the memory address of a flow_head node in the time-aggregated flow linked list that satisfies the aggregation condition, and the next pointer points to the next node of the HASH aggregated-traffic linked list. Three HASH flow linked lists are thus generated from three HASH functions.
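Because a flow_Link node stores only a pointer, the same flow_head record can be referenced from several chains without being copied. The sketch below shows this with two helper functions, `link_flow` and `chain_bytes`, which are names assumed for the example; the reduced `flow_head` keeps only the aggregation items.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reduced time-aggregated flow record (assumed layout for the sketch). */
struct flow_head {
    uint32_t pktsNum;
    uint32_t lengthBytes;
    struct flow_head *next;
};

struct flow_Link {
    struct flow_head *data;   /* shared flow record, not a copy */
    struct flow_Link *next;
};

/* Pushes a reference to flow f onto the chain whose head is *chain.
 * Returns 0 on success, -1 if memory allocation fails. */
int link_flow(struct flow_Link **chain, struct flow_head *f)
{
    assert(chain != NULL && f != NULL);
    struct flow_Link *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->data = f;
    n->next = *chain;
    *chain = n;
    return 0;
}

/* Sums the bytes of all flows referenced by a chain. */
uint64_t chain_bytes(const struct flow_Link *chain)
{
    uint64_t total = 0;
    for (; chain != NULL; chain = chain->next)
        total += chain->data->lengthBytes;
    return total;
}
```

Each extra aggregation dimension therefore costs one small link node per flow rather than a full copy of the flow record, which is consistent with the low memory footprint the text claims for the matrix.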

2) design of three rank polymerization traffic chained list matrixes

Have been based on 7 critical fielies (7 classifying rules territories in other words) in original traffic generating, carry out dividing of flow Class, but reality during traffic statistics analysis, be typically based on source/destination IP, source/destination port and protocol number five pass Key word.And commonly used business is defined by agreement and port in IP layer feature.The polymerization point of communication network service flow During alanysis, it will usually consider to carry out based on the relation of client, service traffic partition, and apply relation to be typically and IP ground Location is related, thus the present invention first design based in the sorting technique of the three rank flow chained list matrixes of HASH.Such as Fig. 3 institute Show, first carry out the HASH chained list of traffic classification with source address and destination address, and as the X of three rank flow chained list matrixes Axle and Y-axis；And the critical field using purpose/source port and agreement comes together jointly to construct Lothrus apterus HASH chained list (as three rank Z axis in matrix), owing to the combined number of this three critical field is considerably less, it is to avoid Space Explosion.Need exist for explanation be In above-mentioned three rank flow chained list matrixes, the keyword on three rank is to choose according to conventional flow polymerization demand, it is also possible to root Being adjusted according to the particular demands of user and changing, the method for flow polymerization and the structure of three rank polymerization traffic matrixes are constant 's.

As shown in Fig. 3, the present invention uses three hash tables to store, respectively, the aggregated flow information for identical srcIP, for identical dstIP, and for identical source port number, destination port number and protocol number. The hash table at the top of Fig. 3 stores aggregated flow information based on the source IP address, and its table space is at most Xmax; the hash table at the bottom of Fig. 3 stores aggregated flow information based on the destination IP address, and its table space is at most Ymax; the hash table on the left of Fig. 3 stores aggregated flow information based on the three keywords source port, destination port and protocol number, and its table space is at most Zmax.

Taking the source IP address hash table as an example, all flows with the same srcIP are mapped to the same table entry. Because of hash collisions, flows with different srcIP values may also be mapped to the same entry. To solve this problem, each hash table entry holds a linked list containing all srcIP keys mapped to that entry; in Fig. 3, the keys of source IP addresses 192.168.5.1, 10.16.55.7 and 66.77.88.50 are all mapped to the entry with hash value 253. In addition, each source IP address node holds another linked list that stores all flow information under that source address (srcIP). The destination IP address hash table has a similar structure: each entry contains a linked list of all dstIP keys mapped to that entry (in Fig. 3, destination IP addresses 192.0.15.1 and 210.200.15.7 are mapped to the entry with hash value 123), and each destination address node contains a linked list of all flow information under that destination address.

Likewise, for the hash table keyed on source port, destination port and protocol number, a collision-free hash function is constructed so that each key of source port, destination port and protocol number maps to a unique table entry, and each entry stores, as a linked list, all aggregated traffic information with the same source port, destination port and protocol number. In Fig. 3, the entry with hash value 56 stores the aggregated traffic information for source port 21, destination port 22 and protocol number 55.
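The layout described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the patent's implementation: the names `Flow`, `IpNode`, `IpTable`, `add_flow` and `toy_ip_hash` are hypothetical, the table size of 8 is arbitrary, and a plain dictionary stands in for the collision-free port/protocol table. What it is meant to show is that a single flow record is referenced from all three tables.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)                 # eq=False: flow records are compared by identity
class Flow:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int


@dataclass
class IpNode:                        # one node in a table entry's collision chain
    ip: str
    flows: list = field(default_factory=list)   # all flows under this address


class IpTable:
    def __init__(self, size, hash_fn):
        self.entries = [[] for _ in range(size)]  # each entry is a chain of IpNode
        self.hash_fn = hash_fn

    def node_for(self, ip):
        """Return the IpNode for ip, appending it to the chain if it is new."""
        chain = self.entries[self.hash_fn(ip)]
        for node in chain:
            if node.ip == ip:
                return node
        node = IpNode(ip)
        chain.append(node)
        return node


def toy_ip_hash(ip):                 # placeholder hash, NOT the function from the text
    return sum(int(part) for part in ip.split(".")) % 8


src_table = IpTable(8, toy_ip_hash)  # top table in Fig. 3
dst_table = IpTable(8, toy_ip_hash)  # bottom table in Fig. 3
port_table = {}                      # stand-in for the collision-free left table


def add_flow(flow):
    """Reference the same flow record from all three tables."""
    src_table.node_for(flow.src_ip).flows.append(flow)
    dst_table.node_for(flow.dst_ip).flows.append(flow)
    key = (flow.src_port, flow.dst_port, flow.proto)
    port_table.setdefault(key, []).append(flow)


f = Flow("192.168.5.1", "192.0.15.1", 21, 22, 55)
add_flow(f)
```

Because the three lists hold references to the same record rather than copies, a later query can recognise a common flow by comparing references instead of comparing field values.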

The hash functions of the three hash tables are defined below:

3) Definition of the hash function based on source port, destination port and protocol number

In general, the source and destination ports (32 bits) take only a small number of the values in 0-65535, and the protocol field (16 bits) takes only a few of the values in 0-255. The number of distinct combinations of port and protocol values (or value ranges) in actual filtering rules is therefore very limited and changes little, so the three fields are combined into a single order. After this order reduction, the three key fields serve as the input key of a hash(s, d, p) function whose output is mapped to the head address of the hash-chained flow table that satisfies the condition. The method still supports traffic aggregation classification over the separate dimensions of source/destination port and protocol number, so the third-order hash-based aggregated traffic linked-list matrix defined by the present invention supports traffic aggregation over five dimensions: source/destination address, source/destination port and protocol number. The hash(s, d, p) function is defined as follows:

hash(s, d, p) = P × S × d + P × s + p

where 0 ≤ s ≤ S-1, 0 ≤ d ≤ D-1, 0 ≤ p ≤ P-1

Here s, d and p are the source port, destination port and protocol number respectively; S and D are the maximum source and destination port numbers, at most 65535; and P is the maximum protocol number, at most 255. It can be proved by mathematical derivation that this function is collision-free. Collision-free means that the key formed by a given source port number, destination port number and protocol number yields a unique function value under the hash function, and therefore maps to exactly one entry of the hash table.
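One way to see why the function is collision-free is that the formula is a mixed-radix encoding of (d, s, p): as long as s and p stay within their stated ranges, the triple can be recovered from the hash value, so two different triples cannot share a value. The sketch below illustrates this in Python. The names `hash_sdp` and `unhash_sdp` are hypothetical, and the radices S = D = 65536 and P = 256 are an assumption chosen so that every 16-bit port and every protocol value in 0-255 falls inside the ranges 0..S-1 and 0..P-1; the text itself quotes the maxima 65535 and 255.

```python
S = 65536   # assumed number of source-port values (0..65535)
D = 65536   # assumed number of destination-port values (0..65535)
P = 256     # assumed number of protocol values (0..255)


def hash_sdp(s, d, p):
    """hash(s, d, p) = P*S*d + P*s + p, with range checks on the inputs."""
    if not (0 <= s < S and 0 <= d < D and 0 <= p < P):
        raise ValueError("source port, destination port or protocol out of range")
    return P * S * d + P * s + p


def unhash_sdp(h):
    """Recover (s, d, p) from a hash value; this inverse shows the map is one-to-one."""
    p = h % P
    s = (h // P) % S
    d = h // (P * S)
    return s, d, p


h = hash_sdp(21, 22, 55)          # the port/protocol combination used in Fig. 3
assert unhash_sdp(h) == (21, 22, 55)
```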

4) Definition of the hash function based on the source IP address

Traffic aggregated by source IP address is also stored in a hash table. From the earlier analysis of hash lookup performance, search performance can be improved by suitably enlarging the table space to reduce the load factor of the hash table, by designing the hash function so that it distributes keys as uniformly as possible, and by choosing a suitable collision-handling method.

(1) Determining the hash address space (table space)

An IP address is displayed in the form XXX.XXX.XXX.XXX and is stored in network packets as a 32-bit unsigned integer, giving 4294967296 possible values. The number of user IP addresses in a large enterprise network is usually in the tens of thousands. To perform the hash computation quickly, and to reduce the load factor of the table by suitably enlarging the range of hash addresses, the hash value can be set to 16 bits, so that the hash values of IP addresses are distributed between 0 and 65535. The hash table space is therefore determined to be 65536 entries (addresses 0 to 65535).

(2) Collision-handling method

Reducing the hash table space inevitably introduces hash collisions. As shown in Fig. 3, the present invention handles collisions by the chaining method (separate chaining): each table entry is a dynamic linked list that holds a single item when there is no collision; when a collision occurs, a sub-item is dynamically added to the entry at that hash address.

(3) Definition of the hash function

To hash the 4-byte IP address into a 2-byte address space while keeping the hash function simple to use, the folding method and the division-remainder method are combined to hash the IP address. The formula is as follows:

hash(k1, k2, k3, k4) = ((k1 + k3) × 256 + k2 + k4) mod 65536

In the formula, k1, k2, k3 and k4 are the four fields (bytes) of the source IP address, and mod denotes taking the remainder after division by 65536.
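As a rough illustration of the folding formula and of chained collision handling, the following Python sketch hashes dotted-quad addresses into a 65536-entry table. The names `hash_ip`, `table` and `insert` are hypothetical, and the addresses 1.2.3.4 and 3.2.1.4 were picked only because they fold to the same value; they do not come from the patent.

```python
def hash_ip(ip: str) -> int:
    """Fold the four bytes of an IPv4 address into a 16-bit hash value."""
    k1, k2, k3, k4 = (int(part) for part in ip.split("."))
    return ((k1 + k3) * 256 + k2 + k4) % 65536


# One chain (here a Python list) per hash address, as in the chaining method.
table = [[] for _ in range(65536)]


def insert(ip: str) -> None:
    """Add ip to the chain at its hash address unless it is already present."""
    chain = table[hash_ip(ip)]
    if ip not in chain:
        chain.append(ip)


for address in ("1.2.3.4", "3.2.1.4", "192.168.5.1"):
    insert(address)

# 1.2.3.4 and 3.2.1.4 fold to the same address, so they share one chain.
collided = table[hash_ip("1.2.3.4")]
```

The modulo step matters because the folded sum can exceed 16 bits: for 255.255.255.255 the sum before reduction is 131070.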

Fourth, flow query

A flow query means that the user enters the traffic query conditions of interest and the network traffic statistics analysis system outputs the traffic results that satisfy them. In essence, a flow query finds, within a massive volume of traffic data, the traffic data that satisfies the query conditions. A query on original traffic searches memory or the original traffic file for traffic data that meets the conditions; this is simply a linked-list search and a file read, and is not described in detail here.

The following focuses on the query process for multi-dimensional (multi-key-field) aggregated traffic. The aggregation conditions of the multi-dimensional aggregated traffic classification of the present invention can be any combination chosen from the five keywords source/destination IP, source/destination port and protocol number. For one-dimensional aggregated traffic, the query computes the key from the aggregation condition with the hash function of the corresponding table, maps it to a table entry, and retrieves the flow information that meets the condition. For multi-dimensional aggregated traffic, the query first uses the hash functions to find the aggregated traffic linked list that satisfies the aggregation condition of each order, and then looks for the nodes common to those lists; these common nodes are the flow information that satisfies the multi-dimensional aggregation condition.

Taking a two-dimensional aggregated traffic query as an example, as shown in Fig. 3, all flows with the same srcIP and dstIP lie in the common part of the srcIP linked list and the dstIP linked list; for instance, for all flows with srcIP 192.168.5.1 and dstIP 192.0.15.1, the common flow nodes are D and K. If the aggregation condition of source port 32, destination port 54 and protocol number 87 is added, only flow node K remains.
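The D and K example can be mimicked with a small Python sketch. It assumes each one-dimensional list holds references to shared flow records, so a flow belongs to the result when the same object appears in every list. Only the membership of D and K follows the text; the other labels and the helper name `common` are hypothetical.

```python
class Flow:
    """A flow record; instances are compared by identity, not by content."""

    def __init__(self, label):
        self.label = label


flows = {name: Flow(name) for name in "ABCDGHK"}

# Hypothetical one-dimensional lists; only D and K are taken from the text.
src_list = [flows[n] for n in "ABDK"]   # flows with srcIP 192.168.5.1
dst_list = [flows[n] for n in "DGHK"]   # flows with dstIP 192.0.15.1
port_list = [flows[n] for n in "CK"]    # flows with ports 32/54, protocol 87


def common(first, *others):
    """Return the flows of `first` that also appear (same object) in every other list."""
    return [f for f in first
            if all(any(f is g for g in other) for other in others)]


two_dim = common(src_list, dst_list)               # srcIP + dstIP
five_dim = common(src_list, dst_list, port_list)   # all five keywords
print([f.label for f in two_dim], [f.label for f in five_dim])
```

The nested `any`/`all` comparison is the plain multi-loop approach; it does not yet use the time ordering of the lists.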

It can be seen that querying multi-dimensional aggregated traffic linked lists is a process of comparing several one-dimensional aggregated traffic linked lists to find their common nodes. The simplest approach is to read the node information in the lists and compare it item by item to find the identical part. For a two-dimensional aggregation, two aggregated traffic linked lists are compared in a double loop, and the common part is the aggregated traffic information that satisfies the two-dimensional aggregation condition; for a five-dimensional aggregation, three aggregated traffic linked lists are compared in a triple loop, and the common part is the aggregated traffic information that satisfies the five-dimensional aggregation condition.

Since all aggregated traffic information is first aggregated by time granularity, the aggregated traffic linked lists can be assumed to be arranged in chronological order (the original flow information is stored in order of acquisition time). This property can be exploited: the times are compared first, and the multi-loop comparison is then carried out only on flow nodes at the same time point, which greatly reduces the number of loop iterations and improves query efficiency. The query procedure for multi-dimensional aggregated traffic is shown in Fig. 4. Assume that the query covers the period from ts (start time) to te (end time), that the aggregation time granularity is T, and that the aggregation is over the dimensions of source and destination address, source and destination port number and protocol number. The procedure is described in detail below:

1) Obtain the head pointers of the three aggregated flow linked lists:

Compute the key from the source address srcIP, find its entry in the source IP address hash table with the hash function, traverse the source IP address linked list to find the IP address node corresponding to srcIP, and obtain the aggregated flow linked-list pointer Slink for all flows with that srcIP;

Compute the key from the destination address dstIP, find its entry in the destination IP address hash table with the hash function, traverse the destination IP address linked list to find the IP address node corresponding to dstIP, and obtain the aggregated flow linked-list pointer Dlink for all flows with that dstIP;

Compute the key from the destination/source port and protocol number, find its entry in the destination/source port and protocol hash table with the hash function, and obtain the aggregated flow linked-list pointer Plink for flows that satisfy this destination/source port and protocol number condition.

2) Parameter initialization: record the start time of the current query, and use three groups of pointers to record the query positions in the three linked lists: t=ts, SptrStart=Slink, SptrEnd=SptrStart, DptrStart=Dlink, DptrEnd=DptrStart, PptrStart=Plink, PptrEnd=PptrStart;

3) Find the flow nodes in the Slink aggregated traffic linked list within the period t to t+T, as follows:

3-1) Set the flow node counter Scount=0;

3-2) Take the data field from the linked-list node (flow_Link data structure) pointed to by SptrEnd, and compare SptrEnd->data->StartTime with t;

3-3) If SptrEnd->data->StartTime is less than t, then SptrStart=SptrStart->next, SptrEnd=SptrStart, and step 3-2) is executed again;

3-4) If SptrEnd->data->StartTime is greater than or equal to t and less than t+T, this node is a flow node at time point t: Scount++, SptrEnd=SptrEnd->next, and step 3-4) continues; otherwise step 3-5) is executed;

3-5) If SptrEnd->data->StartTime is greater than or equal to t+T, check Scount: if it equals zero, execute step 7); if it is greater than zero, execute step 4);

4) Find the flow nodes in the Dlink aggregated traffic linked list within the period t to t+T, as follows:

4-1) Set the flow node counter Dcount=0;

4-2) Take the data field from the linked-list node (flow_Link data structure) pointed to by DptrEnd, and compare DptrEnd->data->StartTime with t;

4-3) If DptrEnd->data->StartTime is less than t, then DptrStart=DptrStart->next, DptrEnd=DptrStart, and step 4-2) is executed again;

4-4) If DptrEnd->data->StartTime is greater than or equal to t and less than t+T, this node is a flow node at time point t: Dcount++, DptrEnd=DptrEnd->next, and step 4-4) continues; otherwise step 4-5) is executed;

4-5) If DptrEnd->data->StartTime is greater than or equal to t+T, check Dcount: if it equals zero, execute step 7); if it is greater than zero, execute step 5);

5) Find the flow nodes in the Plink aggregated traffic linked list within the period t to t+T, as follows:

5-1) Set the flow node counter Pcount=0;

5-2) Take the data field from the linked-list node (flow_Link data structure) pointed to by PptrEnd, and compare PptrEnd->data->StartTime with t;

5-3) If PptrEnd->data->StartTime is less than t, then PptrStart=PptrStart->next, PptrEnd=PptrStart, and step 5-2) is executed again;

5-4) If PptrEnd->data->StartTime is greater than or equal to t and less than t+T, this node is a flow node at time point t: Pcount++, PptrEnd=PptrEnd->next, and step 5-4) continues; otherwise step 5-5) is executed;

5-5) If PptrEnd->data->StartTime is greater than or equal to t+T, check Pcount: if it equals zero, execute step 7); if it is greater than zero, execute step 6);

6) Find the common nodes of the three flow linked lists within the period t to t+T;

Filtering by time comparison yields three flow linked lists within the period t to t+T: the list that starts at SptrStart and ends at SptrEnd; the list that starts at DptrStart and ends at DptrEnd; and the list that starts at PptrStart and ends at PptrEnd. These three lists contain far fewer nodes than the original aggregated traffic linked lists (Slink, Dlink and Plink), so the multi-loop comparison needs far fewer iterations, which improves query efficiency and speed. When identical nodes are sought in the three lists, the data fields of the list nodes all point to the same storage, so two nodes are the same node as long as their data addresses are identical. The traffic data can therefore be found by address comparison alone, which reduces memory accesses.

7) Advance the loop with T as the step size: t=t+T, SptrStart=SptrEnd, DptrStart=DptrEnd, PptrStart=PptrEnd; when t >= te, execute step 8); otherwise execute step 3);

8) The query ends; output the result.
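Steps 2) to 8) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented code: the names `Flow`, `FlowLink`, `build`, `window`, `nodes` and `query` are hypothetical, the end-of-list checks (`is not None`) are an addition that the steps do not spell out, and Python's `is` operator stands in for the address comparison of step 6).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Flow:                       # shared flow record (the "data" field)
    start_time: int
    label: str


@dataclass(eq=False)
class FlowLink:                   # one node of an aggregated flow linked list
    data: Flow
    next: Optional["FlowLink"] = None


def build(flows):
    """Build a linked list from flows already sorted by start_time."""
    head = None
    for flow in reversed(flows):
        head = FlowLink(flow, head)
    return head


def window(start, t, T):
    """Steps x-1 to x-5: return (start, end, count) for nodes in [t, t+T)."""
    while start is not None and start.data.start_time < t:
        start = start.next                      # x-3: skip nodes before t
    end, count = start, 0
    while end is not None and end.data.start_time < t + T:
        end, count = end.next, count + 1        # x-4: count nodes in window
    return start, end, count


def nodes(start, end):
    """Yield the flow records from start up to, but excluding, end."""
    while start is not end:
        yield start.data
        start = start.next


def query(slink, dlink, plink, ts, te, T):
    starts = [slink, dlink, plink]              # step 2: initial positions
    results, t = [], ts
    while t < te:
        ends = list(starts)
        windows = []
        for i in range(3):                      # steps 3, 4 and 5 in turn
            starts[i], ends[i], count = window(starts[i], t, T)
            if count == 0:
                break                           # empty window: go to step 7
            windows.append(list(nodes(starts[i], ends[i])))
        else:                                   # step 6: common nodes by identity
            s_win, d_win, p_win = windows
            results += [f for f in s_win
                        if any(f is g for g in d_win)
                        and any(f is h for h in p_win)]
        t += T                                  # step 7: advance by one granule
        starts = ends
    return results                              # step 8


A, B, C = Flow(0, "A"), Flow(5, "B"), Flow(0, "C")
D, G, K = Flow(0, "D"), Flow(10, "G"), Flow(10, "K")
found = query(build([A, D, B, K]), build([D, G, K]), build([C, K]),
              ts=0, te=15, T=5)
```

In this toy data set only K lies in all three lists within the same time window, so it is the only flow the query is expected to return.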

Claims

1. A network traffic statistics implementation method supporting multi-dimensional aggregation classification, characterized in that it comprises the following:

First, a database is used to store the original IP packet information;

Second, storage of the original data flows:

1) Define the data structure of an original flow:

One flow record node records the information of one flow, and all flow record nodes are connected as a linked list; a flow record node comprises a flow header linked list and the packet linked list linked to it;

2) Storage of the original data traffic:

When fine-grained traffic statistics are needed, the flow header linked list and the packet linked list are stored in memory to await user queries and reads; when fine-grained traffic statistics are not needed, the flow header linked-list information is stored in a file;

Third, traffic aggregation:

1) Aggregate traffic by time granularity: the aggregated items include the traffic value and the packet count; a flow header linked list is generated and stored in memory;

2) Aggregate traffic by aggregation condition: traffic is aggregated according to the user's query conditions to generate the third-order aggregated traffic linked-list matrix;

Fourth, flow query:

2) Parameter initialization;

2. The network traffic statistics implementation method supporting multi-dimensional aggregation classification according to claim 1, characterized in that: the flow header linked list comprises traffic classification features and traffic statistics; the packet linked list records the length, IP identification and inter-packet interval of each packet in the flow.

3. The network traffic statistics implementation method supporting multi-dimensional aggregation classification according to claim 2, characterized in that: the traffic classification features include the source IP address, destination IP address, source port, destination port, protocol, type of service, and the input and output logical network ports of the network device.

4. The network traffic statistics implementation method supporting multi-dimensional aggregation classification according to claim 1, characterized in that: the X axis and Y axis of the third-order aggregated traffic linked-list matrix are hash linked lists that classify traffic by source address and destination address respectively, and the Z axis is a collision-free hash linked list constructed jointly from the key fields destination port, source port and protocol.

5. The network traffic statistics implementation method supporting multi-dimensional aggregation classification according to claim 4, characterized in that: the hash functions of the hash linked lists include:

1) The hash function hash(s, d, p) based on source port, destination port and protocol number:

hash(s, d, p) = P × S × d + P × s + p

where 0 ≤ s ≤ S-1, 0 ≤ d ≤ D-1, 0 ≤ p ≤ P-1

Here s, d and p are the source port, destination port and protocol number respectively; S and D are the maximum source and destination port numbers, namely 65535; and P is the maximum protocol number, namely 255;

2) The hash function based on the source IP address:

hash(k1, k2, k3, k4) = ((k1 + k3) × 256 + k2 + k4) mod 65536

In the formula, k1, k2, k3 and k4 are the four fields of the source IP address, and mod denotes taking the remainder after division by 65536.

6. The network traffic statistics implementation method supporting multi-dimensional aggregation classification according to claim 1, characterized in that: the original data traffic is generated as follows:

(1) First, according to the selected router and time period, search the database for packets that meet the conditions and compile traffic statistics by key field; read the first packet that meets the conditions and create the first flow header linked-list node;

(2) Continue searching the database for packets that match the time and router ID; when one is found, search the flow header linked list by key field and determine whether the attributes of the newly arrived packet match those of an existing flow in the flow header linked list:

a) If they match, update the traffic statistics in the existing flow header linked-list node, create a new packet linked-list node that records information such as the packet length, the interval since the previous packet and the packet IP identification, and insert it at the tail of the packet linked list;

b) If they do not match, create a new flow header linked-list node and insert it at the tail of the flow header linked list;

(3) Repeat step (2) until the database search is complete;

(4) Write each flow record node to the original flow data file in the order of the flow record node linked list.

7. The network traffic statistics implementation method supporting multi-dimensional aggregation classification according to claim 1, characterized in that: the steps for finding the flow nodes in the Slink aggregated traffic linked list within the period t to t+T are as follows:

3-1) Set the flow node counter Scount=0;

3-2) Take the data field from the linked-list node pointed to by SptrEnd, and compare SptrEnd->data->StartTime with t;

3-4) If SptrEnd->data->StartTime is greater than or equal to t and less than t+T, this node is a flow node at time point t: Scount++, SptrEnd=SptrEnd->next, and step 3-4) continues; otherwise step 3-5) is executed;

3-5) If SptrEnd->data->StartTime is greater than or equal to t+T, check Scount: if it equals zero, execute step 5); if it is greater than zero, find the flow nodes in the Dlink aggregated traffic linked list within the period t to t+T.

8. The network traffic statistics implementation method supporting multi-dimensional aggregation classification according to claim 1, characterized in that: the steps for finding the flow nodes in the Dlink aggregated traffic linked list within the period t to t+T are as follows:

4-1) Set the flow node counter Dcount=0;

4-2) Take the data field from the linked-list node pointed to by DptrEnd, and compare DptrEnd->data->StartTime with t;

4-4) If DptrEnd->data->StartTime is greater than or equal to t and less than t+T, this node is a flow node at time point t: Dcount++, DptrEnd=DptrEnd->next, and step 4-4) continues; otherwise step 4-5) is executed;

4-5) If DptrEnd->data->StartTime is greater than or equal to t+T, check Dcount: if it equals zero, execute step 5); if it is greater than zero, find the flow nodes in the Plink aggregated traffic linked list within the period t to t+T.

9. The network traffic statistics implementation method supporting multi-dimensional aggregation classification according to claim 1, characterized in that: the steps for finding the flow nodes in the Plink aggregated traffic linked list within the period t to t+T are as follows:

5-1) Set the flow node counter Pcount=0;

5-2) Take the data field from the linked-list node pointed to by PptrEnd, and compare PptrEnd->data->StartTime with t;

5-4) If PptrEnd->data->StartTime is greater than or equal to t and less than t+T, this node is a flow node at time point t: Pcount++, PptrEnd=PptrEnd->next, and step 5-4) continues; otherwise step 5-5) is executed;

5-5) If PptrEnd->data->StartTime is greater than or equal to t+T, check Pcount: if it equals zero, execute step 5); if it is greater than zero, execute step 4).