Data integrity merging system
Technical Field
The invention relates to the field of Internet technology, and in particular to a data integrity merging system.
Background
Existing network traffic analysis products can typically only handle complete traffic. For example, if a complete session consists of 10 packets and only 5 of them are captured, the packet count statistics for that session become unreliable. Two scenarios can lead to incomplete traffic capture and thereby affect subsequent analysis.
The first scenario: an acquisition point suffers serious packet loss, but the captures from several acquisition points can complement one another to form a more complete flow.
The second scenario: because of the routing design, uplink and downlink traffic may take different routing paths, so a packet capture device deployed at a particular network position may also see an incomplete session. The data from the two acquisition points then needs to be merged.
Chinese patent document CN 105740361A, published on July 6, 2016, discloses a method and apparatus for detecting the integrity of full data, wherein the method comprises: extracting a first IP list from the full data together with the access track data of the first IP list; loading reference data, and extracting a second IP list from the reference data together with the access track data of the second IP list; matching and verifying the access track data of the first IP list against the access track data of the second IP list; and calculating the integrity of the full data according to the matching verification result.
Although the method and apparatus disclosed in that patent document can improve the accuracy and reliability of integrity detection for full data on the Internet, and can not only evaluate the integrity of the full data but also locate where data was lost, they cannot deduplicate and gap-fill the incomplete data of different acquisition points, merge it centrally, or merge data continuously.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a data integrity merging system that can deduplicate and gap-fill incomplete data from different acquisition points, merge it centrally, evaluate the integrity of the traffic before and after merging and display it as a percentage, and merge data continuously.
The invention is realized by the following technical scheme:
A data integrity merging system comprises a storage module, a merging module and an integrity evaluation module, and is characterized in that: the storage module stores traffic grouped by session, using the IP and port four-tuple as the KEY, and stores the original traffic and the merged traffic separately; the merging module builds four-tuple flow tables for UDP and TCP and assigns data packets of the same session to that session by the four-tuple and the protocol identifier, which reduces the merging of all traffic to the merging of single sessions; it computes MD5 over the full packet and uses the MD5 value for deduplication: if a data packet with the same MD5 value already exists in the session, the packet is treated as a duplicate and discarded; a packet that is not a duplicate is classified as UDP or TCP; and the integrity evaluation module estimates the integrity of the overall traffic by a weighted calculation over the integrity of all valid TCP sessions within the whole event range.
UDP packets are added directly to the session stream, which completes their merging.
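The flow table and MD5 deduplication described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `session_key` and `FlowTable` are hypothetical, the key is made direction-independent by sorting the two endpoints, and `raw` stands for whatever bytes the system treats as the "full packet".

```python
import hashlib
from collections import defaultdict


def session_key(proto, src_ip, src_port, dst_ip, dst_port):
    """Build a direction-independent key from the protocol and the four-tuple.

    Sorting the two (ip, port) endpoints is an assumption of this sketch, so
    that C->S and S->C packets of one session map to the same key.
    """
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (proto, a, b)


class FlowTable:
    """Hypothetical per-session store with MD5-based deduplication."""

    def __init__(self):
        self.sessions = defaultdict(list)  # key -> packets in arrival order
        self.digests = defaultdict(set)    # key -> MD5 digests already seen

    def add(self, proto, src_ip, src_port, dst_ip, dst_port, raw):
        """Add one packet; return False if it was discarded as a duplicate."""
        key = session_key(proto, src_ip, src_port, dst_ip, dst_port)
        digest = hashlib.md5(raw).digest()
        if digest in self.digests[key]:
            return False  # same MD5 already present in this session: discard
        self.digests[key].add(digest)
        # UDP packets are complete once appended; TCP packets would be passed
        # on to the sequence/acknowledgement completeness check afterwards.
        self.sessions[key].append(raw)
        return True
```

Feeding the captures of two acquisition points through the same `FlowTable` keeps one copy of each packet per session, which is the gap-filling effect the text describes.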
TCP packets are first deduplicated and then checked for completeness: the TCP sequence numbers and acknowledgement numbers are calculated and compared to determine whether they are continuous.
Each end of a TCP session maintains a 32-bit sequence number that tracks the amount of data that end has sent; every packet carries a sequence number, and the receiving end uses the acknowledgement number to notify the sending end that data has been received successfully.
The sequence numbers and acknowledgement numbers in TCP packets from client to server and from server to client are tracked separately; together with the payload length field, five fields in total take part in the calculation.
The integrity of the overall traffic is the ratio of the number of complete TCP sessions to the number of all TCP sessions.
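The ratio above can be expressed as a percentage with a few lines of code. The sketch below is illustrative only: the function name `overall_integrity` is hypothetical, and returning 0.0 when there are no TCP sessions is an assumption that the source does not address.

```python
def overall_integrity(session_complete):
    """Return the share of complete TCP sessions as a percentage.

    session_complete: iterable of booleans, one per valid TCP session,
    True when that session was judged complete.
    """
    flags = list(session_complete)
    if not flags:
        return 0.0  # assumption: no sessions means nothing to score
    return 100.0 * sum(flags) / len(flags)


# Hypothetical figures: integrity before and after merging two captures.
before_merge = overall_integrity([True, False, False, True])
after_merge = overall_integrity([True, True, False, True])
```

Computing the value once on the original traffic and once on the merged traffic gives the before and after percentages that the system displays.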
UDP is the User Datagram Protocol; TCP is the Transmission Control Protocol; KEY is a key value; MD5 is a message-digest algorithm; bit is a binary digit.
The beneficial effects of the invention are mainly shown in the following aspects:
1. In the invention, the storage module stores traffic grouped by session, using the IP and port four-tuple as the KEY, and stores the original traffic and the merged traffic separately; the merging module builds four-tuple flow tables for UDP and TCP and assigns data packets of the same session to that session by the four-tuple and the protocol identifier, which reduces the merging of all traffic to the merging of single sessions; it computes MD5 over the full packet and uses the MD5 value for deduplication: if a data packet with the same MD5 value already exists in the session, the packet is treated as a duplicate and discarded; a packet that is not a duplicate is classified as UDP or TCP; and the integrity evaluation module estimates the integrity of the overall traffic by a weighted calculation over the integrity of all valid TCP sessions within the whole event range. Taken as a complete technical scheme, and compared with the prior art, the system can deduplicate and gap-fill incomplete data from different acquisition points and merge it centrally; it can evaluate the integrity of the traffic before and after merging and display it as a percentage, so that data integrity before and after merging is backed by explicit figures; and data can be stored while still incomplete and merged further once supplementary traffic data is obtained, thereby improving integrity.
2. By repeatedly supplementing the data and scoring its integrity as a percentage, a large amount of low-quality data that in the prior art could not be analyzed accurately, or could not be analyzed at all, can be turned into high-quality data, which yields more effective analysis and improves analysis accuracy and reliability.
Drawings
The invention will be further described in detail with reference to the drawings and the detailed description, wherein:
FIG. 1 is a block flow diagram of the present invention.
Detailed Description
Example 1
Referring to FIG. 1, a data integrity merging system includes a storage module, a merging module and an integrity evaluation module. The storage module stores traffic grouped by session, using the IP and port four-tuple as the KEY, and stores the original traffic and the merged traffic separately. The merging module builds four-tuple flow tables for UDP and TCP and assigns data packets of the same session to that session by the four-tuple and the protocol identifier, which reduces the merging of all traffic to the merging of single sessions; it computes MD5 over the full packet and uses the MD5 value for deduplication: if a data packet with the same MD5 value already exists in the session, the packet is treated as a duplicate and discarded; a packet that is not a duplicate is classified as UDP or TCP. The integrity evaluation module estimates the integrity of the overall traffic by a weighted calculation over the integrity of all valid TCP sessions within the whole event range.
Taken as a complete technical scheme, and compared with the prior art, this combination of storage module, merging module and integrity evaluation module can deduplicate and gap-fill incomplete data from different acquisition points and merge it centrally; it can evaluate the integrity of the traffic before and after merging and display it as a percentage, so that data integrity before and after merging is backed by explicit figures; and data can be stored while still incomplete and merged further once supplementary traffic data is obtained, thereby improving integrity.
Example 2
Referring to FIG. 1, a data integrity merging system includes a storage module, a merging module and an integrity evaluation module. The storage module stores traffic grouped by session, using the IP and port four-tuple as the KEY, and stores the original traffic and the merged traffic separately. The merging module builds four-tuple flow tables for UDP and TCP and assigns data packets of the same session to that session by the four-tuple and the protocol identifier, which reduces the merging of all traffic to the merging of single sessions; it computes MD5 over the full packet and uses the MD5 value for deduplication: if a data packet with the same MD5 value already exists in the session, the packet is treated as a duplicate and discarded; a packet that is not a duplicate is classified as UDP or TCP. The integrity evaluation module estimates the integrity of the overall traffic by a weighted calculation over the integrity of all valid TCP sessions within the whole event range.
UDP packets are added directly to the session stream, which completes their merging.
Example 3
Referring to FIG. 1, a data integrity merging system includes a storage module, a merging module and an integrity evaluation module. The storage module stores traffic grouped by session, using the IP and port four-tuple as the KEY, and stores the original traffic and the merged traffic separately. The merging module builds four-tuple flow tables for UDP and TCP and assigns data packets of the same session to that session by the four-tuple and the protocol identifier, which reduces the merging of all traffic to the merging of single sessions; it computes MD5 over the full packet and uses the MD5 value for deduplication: if a data packet with the same MD5 value already exists in the session, the packet is treated as a duplicate and discarded; a packet that is not a duplicate is classified as UDP or TCP. The integrity evaluation module estimates the integrity of the overall traffic by a weighted calculation over the integrity of all valid TCP sessions within the whole event range.
UDP packets are added directly to the session stream, which completes their merging.
TCP packets are first deduplicated and then checked for completeness: the TCP sequence numbers and acknowledgement numbers are calculated and compared to determine whether they are continuous.
Example 4
Referring to FIG. 1, a data integrity merging system includes a storage module, a merging module and an integrity evaluation module. The storage module stores traffic grouped by session, using the IP and port four-tuple as the KEY, and stores the original traffic and the merged traffic separately. The merging module builds four-tuple flow tables for UDP and TCP and assigns data packets of the same session to that session by the four-tuple and the protocol identifier, which reduces the merging of all traffic to the merging of single sessions; it computes MD5 over the full packet and uses the MD5 value for deduplication: if a data packet with the same MD5 value already exists in the session, the packet is treated as a duplicate and discarded; a packet that is not a duplicate is classified as UDP or TCP. The integrity evaluation module estimates the integrity of the overall traffic by a weighted calculation over the integrity of all valid TCP sessions within the whole event range.
UDP packets are added directly to the session stream, which completes their merging.
TCP packets are first deduplicated and then checked for completeness: the TCP sequence numbers and acknowledgement numbers are calculated and compared to determine whether they are continuous.
Each end of a TCP session maintains a 32-bit sequence number that tracks the amount of data that end has sent; every packet carries a sequence number, and the receiving end uses the acknowledgement number to notify the sending end that data has been received successfully.
Example 5
Referring to FIG. 1, a data integrity merging system includes a storage module, a merging module and an integrity evaluation module. The storage module stores traffic grouped by session, using the IP and port four-tuple as the KEY, and stores the original traffic and the merged traffic separately. The merging module builds four-tuple flow tables for UDP and TCP and assigns data packets of the same session to that session by the four-tuple and the protocol identifier, which reduces the merging of all traffic to the merging of single sessions; it computes MD5 over the full packet and uses the MD5 value for deduplication: if a data packet with the same MD5 value already exists in the session, the packet is treated as a duplicate and discarded; a packet that is not a duplicate is classified as UDP or TCP. The integrity evaluation module estimates the integrity of the overall traffic by a weighted calculation over the integrity of all valid TCP sessions within the whole event range.
UDP packets are added directly to the session stream, which completes their merging.
TCP packets are first deduplicated and then checked for completeness: the TCP sequence numbers and acknowledgement numbers are calculated and compared to determine whether they are continuous.
Each end of a TCP session maintains a 32-bit sequence number that tracks the amount of data that end has sent; every packet carries a sequence number, and the receiving end uses the acknowledgement number to notify the sending end that data has been received successfully.
The sequence numbers and acknowledgement numbers in TCP packets from client to server and from server to client are tracked separately; together with the payload length field, five fields in total take part in the calculation.
Example 6
Referring to FIG. 1, a data integrity merging system includes a storage module, a merging module and an integrity evaluation module. The storage module stores traffic grouped by session, using the IP and port four-tuple as the KEY, and stores the original traffic and the merged traffic separately. The merging module builds four-tuple flow tables for UDP and TCP and assigns data packets of the same session to that session by the four-tuple and the protocol identifier, which reduces the merging of all traffic to the merging of single sessions; it computes MD5 over the full packet and uses the MD5 value for deduplication: if a data packet with the same MD5 value already exists in the session, the packet is treated as a duplicate and discarded; a packet that is not a duplicate is classified as UDP or TCP. The integrity evaluation module estimates the integrity of the overall traffic by a weighted calculation over the integrity of all valid TCP sessions within the whole event range.
UDP packets are added directly to the session stream, which completes their merging.
TCP packets are first deduplicated and then checked for completeness: the TCP sequence numbers and acknowledgement numbers are calculated and compared to determine whether they are continuous.
Each end of a TCP session maintains a 32-bit sequence number that tracks the amount of data that end has sent; every packet carries a sequence number, and the receiving end uses the acknowledgement number to notify the sending end that data has been received successfully.
The sequence numbers and acknowledgement numbers in TCP packets from client to server and from server to client are tracked separately; together with the payload length field, five fields in total take part in the calculation.
The integrity of the overall traffic is the ratio of the number of complete TCP sessions to the number of all TCP sessions.
By repeatedly supplementing the data and scoring its integrity as a percentage, a large amount of low-quality data that in the prior art could not be analyzed accurately, or could not be analyzed at all, can be turned into high-quality data, which yields more effective analysis and improves analysis accuracy and reliability.
The following details the entire data integrity merge:
The initial sequence number is random and may be any value between 0 and 4,294,967,295; the demonstration below uses sequence numbers relative to the initial sequence number.
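A relative sequence number can be obtained by subtracting the initial sequence number from the raw value. The sketch below is an illustration, not part of the described system: the function name `relative_seq` is hypothetical, and the modulo step (so that a counter wrapping past 4,294,967,295 still yields a small positive offset) is an assumption based on the 32-bit width of the field.

```python
SEQ_MODULO = 2 ** 32  # sequence numbers are 32-bit values


def relative_seq(seq, initial_seq):
    """Offset of a raw sequence number from the initial sequence number.

    The modulo handles the case where the 32-bit counter wraps around.
    """
    return (seq - initial_seq) % SEQ_MODULO
```

For instance, with an initial sequence number of 4,294,967,290, a raw value of 5 corresponds to a relative value of 11.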
The description starts with the fourth packet.
First, the traffic is divided into the two directions C->S and S->C.
The fourth packet is a request packet from C; it contains a GET request and is 725 bytes in total.
The fifth packet is an acknowledgement packet from S; its ACK acknowledges the 725 bytes sent by C, and Len is 0, indicating that no data is transmitted.
The sixth packet is sent by S; Len is 1448 bytes, indicating that S has begun returning data in response to C's request.
The seventh packet is an acknowledgement packet from C; the ACK is updated to 1449, acknowledging the data sent by S, and the Seq is updated to 726, reflecting S's acknowledgement that the data of the fourth packet was received.
The eighth and ninth packets have the same meaning as the sixth and seventh packets.
The session traffic actually captured in the C->S direction is calculated by accumulating the Len fields of the C->S packets:
SumC2SLen = LenPacket4 + LenPacket7 + LenPacket9 = 725 + 0 + 0 = 725
The session traffic actually captured in the S->C direction is calculated by accumulating the Len fields of the S->C packets:
SumS2CLen = LenPacket5 + LenPacket6 + LenPacket8 = 0 + 1448 + 1448 = 2896
Next, the theoretical total traffic that should be present in each direction is calculated. The first packet of the session is packet 4 and the last packet is packet 9.
The confirmed C->S transmission quantity is ConfirmC2S = SeqPacket9 - SeqPacket4 = 726 - 1 = 725
The confirmed S->C transmission quantity is ConfirmS2C = ACKPacket9 - ACKPacket4 = 2897 - 1 = 2896
If the actually captured quantity SumC2SLen is greater than or equal to ConfirmC2S, the session is complete in the C->S direction.
If the actually captured quantity SumS2CLen is greater than or equal to ConfirmS2C, the session is complete in the S->C direction.
If both directions are complete, the whole session is judged to be complete.
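The calculation above can be reproduced with a short sketch. It is illustrative rather than the implementation of the invention: the `Packet` type and `check_session` function are hypothetical names, the first and last packets are taken to be the first and last C->S packets as in the demonstration, and the Seq/ACK values of packets 5, 6 and 8 are assumed (the text does not state them and the calculation does not use them).

```python
from dataclasses import dataclass


@dataclass
class Packet:
    number: int
    direction: str  # "C2S" or "S2C"
    seq: int        # relative sequence number
    ack: int        # relative acknowledgement number
    length: int     # payload length (Len)


# Packets 4 to 9 of the demonstration; values marked "assumed" are not given
# in the text and do not enter the calculation.
PACKETS = [
    Packet(4, "C2S", 1, 1, 725),
    Packet(5, "S2C", 1, 726, 0),        # seq assumed
    Packet(6, "S2C", 1, 726, 1448),     # seq and ack assumed
    Packet(7, "C2S", 726, 1449, 0),
    Packet(8, "S2C", 1449, 726, 1448),  # seq and ack assumed
    Packet(9, "C2S", 726, 2897, 0),
]


def check_session(packets):
    """Compare captured payload bytes with the confirmed byte counts."""
    c2s = [p for p in packets if p.direction == "C2S"]
    s2c = [p for p in packets if p.direction == "S2C"]
    if not c2s:
        return {"c2s_complete": False, "s2c_complete": False, "complete": False}
    first, last = c2s[0], c2s[-1]
    sum_c2s = sum(p.length for p in c2s)   # SumC2SLen
    sum_s2c = sum(p.length for p in s2c)   # SumS2CLen
    confirm_c2s = last.seq - first.seq     # ConfirmC2S
    confirm_s2c = last.ack - first.ack     # ConfirmS2C
    c2s_ok = sum_c2s >= confirm_c2s
    s2c_ok = sum_s2c >= confirm_s2c
    return {
        "sum_c2s": sum_c2s, "confirm_c2s": confirm_c2s,
        "sum_s2c": sum_s2c, "confirm_s2c": confirm_s2c,
        "c2s_complete": c2s_ok, "s2c_complete": s2c_ok,
        "complete": c2s_ok and s2c_ok,
    }


result = check_session(PACKETS)
# Simulate a capture that missed packet 6.
lossy = check_session([p for p in PACKETS if p.number != 6])
```

With all six packets the session is judged complete; when packet 6 is missing, only 1448 of the 2896 confirmed S->C bytes are captured and the session is judged incomplete.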