WO2024031972A1

WO2024031972A1 - Method, system and apparatus for identifying repeated data, and storage medium and product

Info

Publication number: WO2024031972A1
Application number: PCT/CN2023/079344
Authority: WO
Inventors: 罗来胜
Original assignee: 中兴通讯股份有限公司
Priority date: 2022-08-12
Filing date: 2023-03-02
Publication date: 2024-02-15
Also published as: CN117640586A

Abstract

Provided in the embodiments of the present application are a method for identifying repeated data, a system for identifying repeated data, an apparatus for identifying repeated data, and a computer-readable storage medium and a computer program product. The method for identifying repeated data comprises: acquiring source-destination IP address pair information, wherein the source-destination IP address pair information comprises sending-end IP address information and receiving-end IP address information (S1100); acquiring repeated-access type information, wherein the repeated-access type information is used for representing an IP address access type, which causes the generation of a repeated data packet (S1200); acquiring repeated-access time information, wherein the repeated-access time information is used for representing time information of repeated data packet access (S1300); and according to at least one of the source-destination IP address pair information, the repeated-access type information and the repeated-access time information, performing repeated-data identification on the generated repeated data packet (S1400).

Description

Duplicate data identification methods, systems, devices, storage media and products

Cross-references to related applications

This application is filed based on a Chinese patent application with application number 202210968336.1 and a filing date of August 12, 2022, and claims the priority of the Chinese patent application. The entire content of the Chinese patent application is hereby incorporated into this application as a reference.

Technical field

This application relates to but is not limited to the field of communications, and in particular, to a method, system, device, storage medium and product for identifying duplicate data.

Background

In modern communication networks, data collection equipment collects and analyzes the packets transmitted between network element devices in an IP communication network in order to learn users' activity status and the transmission quality of the network. For example, when network communication is abnormal or of poor quality, application data packets time out and are retransmitted. Data collection devices can analyze network quality by collecting these application retransmission data packets.

In practice, incorrect network layout or misconfiguration of the data collection equipment can cause the equipment to collect the same data packets repeatedly, and the repeatedly collected data packets interfere with the assessment of real network quality. How to identify the above two kinds of data packets and diagnose their source is an urgent problem to be solved.

Summary

The following is an overview of the topics described in detail in this article. This summary is not intended to limit the scope of the claims.

Embodiments of the present application provide a method, system, device, storage medium and product for identifying duplicate data.

In a first aspect, embodiments of the present application provide a method for identifying duplicate data, which includes: obtaining source-destination IP address pair information, where the source-destination IP address pair information includes sending-end IP address information and receiving-end IP address information; obtaining repeated access type information, where the repeated access type information characterizes the IP address access type that causes duplicate data packets to be generated; obtaining repeated access time information, where the repeated access time information represents the time information of duplicate data packet access; and, based on at least one of the source-destination IP address pair information, the repeated access type information and the repeated access time information, performing duplicate data identification on the generated duplicate data packets.

In a second aspect, embodiments of the present application provide a system for identifying duplicate data. The system includes: a configuration module configured to configure a comparison table between IP and maximum survival time; a data packet receiving module configured to receive data packets and send them to a data packet analysis module; the data packet analysis module, configured to parse the data packets to obtain IP identification information and to identify duplicate data packets according to the IP identification information to obtain repeated access type information, wherein the IP identification information includes the sending-end IP address information and the receiving-end IP address information; and a statistics module configured to store the sending-end IP address information, the receiving-end IP address information, the repeated access type information of the duplicate data packets and the access time information of the duplicate data packets.

In a third aspect, embodiments of the present application provide a device for identifying duplicate data, including: at least one processor; and at least one memory for storing at least one program; when at least one of the programs is executed by at least one of the processors, the method for identifying duplicate data as described in the first aspect is implemented.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium in which a processor-executable program is stored. When executed by a processor, the processor-executable program implements the method for identifying duplicate data as described in the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product, including a computer program or computer instructions. The computer program or computer instructions are stored in a computer-readable storage medium. A processor of a computer device reads the computer program or the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the method for identifying duplicate data as described in the first aspect.

Brief description of the drawings

Figure 1 is a network architecture diagram of a method for identifying duplicate data provided by an embodiment of the present application;

Figure 2 is a flow chart of a method for identifying duplicate data provided by an embodiment of the present application;

Figure 3 is a flow chart of a method for identifying duplicate data provided by an embodiment of the present application;

Figure 4 is a flow chart of a method for identifying duplicate data provided by an embodiment of the present application;

Figure 5 is a flow chart of a method for identifying duplicate data provided by an embodiment of the present application;

Figure 6 is a module block diagram of a method for identifying duplicate data provided by an embodiment of the present application;

Figure 7 is a flow chart of a method for identifying duplicate data provided by an embodiment of the present application;

Figure 8 is a flow chart of a method for identifying duplicate data provided by an embodiment of the present application;

Figure 9 is a flow chart of a method for identifying duplicate data provided by an embodiment of the present application;

Figure 10 is a working flow chart of the aging timer in the method for identifying duplicate data provided by an embodiment of the present application.

Detailed description

In order to make the purpose, technical solutions and advantages of the present application clearer, the present application will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here are only used to explain the present application and are not used to limit the present application.

It should be noted that although functional modules are divided in the device schematic diagrams and a logical sequence is shown in the flow charts, in some cases the modules may be divided differently from the device diagrams, or the steps shown or described may be performed in an order different from that in the flow charts. The terms "first", "second", etc. in the description, claims and above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

In the description of the embodiments of this application, unless otherwise explicitly limited, words such as setting, installation and connection should be understood in a broad sense, and those skilled in the art can reasonably determine their specific meaning in the embodiments of this application based on the content of the technical solution. In the embodiments of this application, words such as "further", "exemplarily" or "optionally" are used to give examples or illustrations, and should not be interpreted as indicating that something is preferable to, or more advantageous than, other embodiments or designs. The use of the words "further", "exemplarily" or "optionally" is intended to present relevant concepts.

The embodiments of this application can be applied to any communication network based on IP protocol.

Generally speaking, data packets in IP communication systems are divided into two types: ordinary data packets and application retransmission data packets. Ordinary data packets are data packets transmitted in the normal state; application retransmission data packets are data packets that time out and are retransmitted because of abnormal network communication or poor quality between network element devices.

Currently, collecting and analyzing the data packets transmitted between network element devices in an IP communication network to identify user activity status and network transmission quality is widely used around the world. In practice, incorrect network layout or misconfiguration of the data collection equipment (for example, a change in the network structure of the IP communication system, an upgrade or change of the data aggregation equipment, or an upgrade or change of the data collection system) can lead to abnormal situations such as repeated optical-splitting access on one side of the same network element and repeated collection across multiple network elements. The data collection equipment therefore collects four types of data packets: ordinary data packets, application retransmission data packets, single-sided repeated optical-splitting data packets and double-sided repeatedly collected data packets. The latter two are produced by misoperation. If they are not handled, they not only increase the load on the data collection equipment but also interfere with the analysis of actual network quality.

In some solutions, after the data collection device discovers a data anomaly, the anomaly is traced backwards and the front-end devices are checked one by one. If an abnormal device is found, the access policy is readjusted to ensure the uniqueness of data access. This approach is too passive, inefficient and unintelligent, and it negatively affects product quality and user experience.

Another solution is to perform a repeatability check on received data packets at the front end of the data collection device and, if the current data packet is a duplicate, filter it out directly. Although this solution discards duplicates so that they are not sent to the data collection device, the repeated access of data packets still exists and the user cannot locate the problem device.

Based on this, embodiments of the present application provide a duplicate data identification method, system, device, storage medium and product. The duplicate data identification method in this application applies to all systems based on IP networks, including both the IPv4 and IPv6 versions, and is not limited to any form of network structure. The embodiments of this application identify the source and destination IP address pair information, the repeated access type and the repeated access time information, thereby helping with troubleshooting and ensuring the accuracy of network quality analysis, while also helping to improve the performance of data collection systems and analysis systems.

Embodiments of the present application can be applied to the network architecture shown in Figure 1. The network architecture includes network element equipment, a data packet aggregator, a data collection system, a data analysis system and a duplicate data identification system. The network element equipment refers to the communication clients and routing devices. The data packet aggregator aggregates the data of each network element device for collection by subsequent equipment. The data collection system collects data packets and transmits them to the data analysis system. The data analysis system analyzes and processes the received data packets to evaluate the network transmission quality between network element devices. In the embodiments of this application, a duplicate data identification system is added between the data packet aggregator and the data collection system to identify repeatedly accessed data packets, and fault diagnosis can be performed based on the identified duplicate data packets. The duplicate data identification method in the embodiments of the present application is implemented by the duplicate data identification system.

Related technologies are further elaborated below.

The method for identifying duplicate data will be described below in conjunction with Figure 2.

As shown in Figure 2, Figure 2 shows a flow chart of a method for identifying duplicate data provided by an embodiment of the present application. The method of identifying duplicate data in an embodiment of the present application at least includes but is not limited to the following steps:

Step S1100: Obtain the source and destination IP address pair information, which includes the sending end IP address information and the receiving end IP address information;

Step S1200: Obtain repeated access type information. The repeated access type information is used to characterize the IP address access type that causes duplicate data packets to be generated;

Step S1300: Obtain repeated access time information, which is used to represent the time information of repeated data packet access;

In one embodiment, in step S1100, step S1200, and step S1300, the source and destination IP address pair information, repeated access type information, and repeated access time information are obtained through the repeated access statistics table.

In one embodiment, the fields in the repeated access statistics table are configured in advance, and the repeated access statistics table is used to perform statistical calculations on repeated access information of repeated data packets. When duplicate data packets are accessed, the information in the duplicate access statistics table will be updated.

In one embodiment, as shown in Figure 3, the steps of obtaining the repeated access statistics table include but are not limited to step S2110, step S2120 and step S2130.

Step S2110: Obtain duplicate data packets;

In one embodiment, a preset data packet information table is obtained. The data packet information table is used to cache IP layer information of data packets and to determine whether the current data packet is a repeatedly accessed data packet.

In one embodiment, when a new data packet is obtained from a communication client or routing device, the IP layer information of the currently received data packet is parsed to obtain the IP identification information of the current data packet. The IP identification information is the key of the data packet information table. The data packet information table is queried according to the IP identification information of the current data packet; if the table contains an entry matching that IP identification information, the currently accessed data packet is a duplicate data packet, that is, a repeatedly collected data packet.

In one implementation, if the data packet information table contains no entry matching the IP identification information of the current data packet, the currently accessed data packet is an initial data packet; its IP identification information is stored in the data packet information table, and the other corresponding fields in the table are updated.
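The lookup described above can be sketched as a dictionary keyed by the IP identification information. This is a minimal illustration only: the names `PacketKey` and `is_duplicate`, and the use of an in-memory `dict` as the data packet information table, are assumptions made for the example and do not come from the application.

```python
from typing import Dict, NamedTuple


class PacketKey(NamedTuple):
    """IPv4-style IP identification information used as the table key (hypothetical layout)."""
    src_ip: str
    dst_ip: str
    identification: int
    frag_flags: int
    frag_offset: int
    protocol: int
    total_length: int


# Hypothetical in-memory data packet information table: key -> other cached fields.
PacketInfoTable = Dict[PacketKey, dict]


def is_duplicate(table: PacketInfoTable, key: PacketKey) -> bool:
    """Return True if the key is already cached; otherwise cache it as an initial packet."""
    if key in table:
        return True
    # Initial data packet: store the key; other fields would be filled in here.
    table[key] = {}
    return False
```

The first packet with a given key is cached and reported as an initial data packet; any later packet with the same key is reported as a duplicate.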

Step S2120: Obtain the source and destination IP address pair information, repeated access type information and repeated access time information based on the repeated data packet;

In one embodiment, obtaining the source and destination IP address pair information based on the repeated data packets includes: parsing the repeated data packets, obtaining the sending end IP address information and the receiving end IP address information of the repeated data packets, and determining the sending end IP address information and the receiving end IP address information as the source and destination IP address pair information;

In one embodiment, the sending end IP address information and the receiving end IP address information are obtained from IP layer protocol parsing of the repeated data packet.

In one embodiment, obtaining the repeated access type information based on the repeated data packets includes: identifying the repeated access type of the repeated data packets and obtaining the repeated access type information of the repeated data packets;

In one embodiment, as shown in Figure 4, identifying the repeated access type of the repeated data packet to obtain its repeated access type information may include but is not limited to step S2121, step S2122, step S2123 and step S2124:

Step S2121: Analyze the received current data packet to obtain the IP identification information used to identify the current data packet;

In one implementation, taking IPv4 as an example, the IP identification information includes the sending end IP address information, the receiving end IP address information, IP identification information, IP fragmentation mark information, IP fragmentation offset information, protocol type and IP packet total length information. It should be noted that in IPv4 the sending end IP address information corresponds to the Source Address field of the IP protocol layer, the receiving end IP address information corresponds to the Destination Address field, the IP identification information corresponds to the Identification field, the IP fragmentation mark information corresponds to the Flags field, the IP fragmentation offset information corresponds to the Fragment Offset field, the protocol type corresponds to the Protocol field, the IP packet total length information corresponds to the Total Length field, and the survival time corresponds to the Time To Live field. The time to live is referred to as TTL.
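As an illustration of how these IPv4 fields might be extracted, the following sketch unpacks the fixed 20-byte IPv4 header with Python's standard `struct` module. The function name `parse_ipv4_header` and the dictionary keys are hypothetical; the sketch relies on the standard header layout, in which the 3-bit Flags and the 13-bit Fragment Offset share one 16-bit word, and it ignores IP options.

```python
import socket
import struct


def parse_ipv4_header(header: bytes) -> dict:
    """Extract the identification fields and TTL from the first 20 bytes of an IPv4 header."""
    if len(header) < 20:
        raise ValueError("IPv4 header needs at least 20 bytes")
    (ver_ihl, _tos, total_length, identification, flags_frag,
     ttl, protocol, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", header[:20])
    if ver_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    return {
        "src_ip": socket.inet_ntoa(src),       # Source Address
        "dst_ip": socket.inet_ntoa(dst),       # Destination Address
        "identification": identification,      # Identification
        "frag_flags": flags_frag >> 13,        # Flags (upper 3 bits)
        "frag_offset": flags_frag & 0x1FFF,    # Fragment Offset (lower 13 bits)
        "protocol": protocol,                  # Protocol
        "total_length": total_length,          # Total Length
        "ttl": ttl,                            # Time To Live (survival time)
    }
```

The TTL is returned alongside the key fields but would not itself be part of the lookup key, since its value changes at every routing hop.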

In one embodiment, taking IPv6 as an example, the IP identification information includes the sending end IP address information, the receiving end IP address information, flow label information, payload length information and next header information. It should be noted that in IPv6 the sending end IP address information corresponds to the Source IP Address field of the IP protocol layer, the receiving end IP address information corresponds to the Destination IP Address field, the flow label information corresponds to the Flow Label field, and the payload length information corresponds to the Payload Length field.

Here, the flow label information in IPv6 plays a role similar to the IP identification information in IPv4, the payload length information in IPv6 is similar to the payload portion of the IP packet total length information in IPv4, and the Hop Limit field in IPv6 is effectively equivalent to the Time To Live field in IPv4. Time To Live and Hop Limit are both referred to as the survival time in the embodiments of this application.
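A corresponding sketch for IPv6 unpacks the fixed 40-byte header. Again, `parse_ipv6_header` and the dictionary keys are hypothetical names for illustration; the sketch relies on the standard layout in which the version, traffic class and 20-bit flow label share the first 32-bit word, and it does not follow extension headers.

```python
import ipaddress
import struct


def parse_ipv6_header(header: bytes) -> dict:
    """Extract the identification fields and Hop Limit from the fixed 40-byte IPv6 header."""
    if len(header) < 40:
        raise ValueError("IPv6 header needs at least 40 bytes")
    first_word, payload_length, next_header, hop_limit, src, dst = struct.unpack(
        "!IHBB16s16s", header[:40]
    )
    if first_word >> 28 != 6:
        raise ValueError("not an IPv6 packet")
    return {
        "src_ip": str(ipaddress.IPv6Address(src)),
        "dst_ip": str(ipaddress.IPv6Address(dst)),
        "flow_label": first_word & 0xFFFFF,   # lower 20 bits of the first word
        "payload_length": payload_length,
        "next_header": next_header,
        "ttl": hop_limit,                     # Hop Limit, treated as the survival time
    }
```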

It should be noted that, according to the IP standard protocol, to prevent data messages from circulating between routing devices without ever reaching the destination, each message carries a survival time value. The maximum survival time is set by the device corresponding to the sending IP address when it constructs the message and is denoted by N; N is generally 64. Each time a message passes through a routing device, the survival time value is reduced by 1, and when it reaches 0 the message is discarded. Therefore, the survival time value of every data packet should be less than or equal to N.

Step S2122: Query and obtain historical access source tag information based on the IP identification information;

In one implementation, a query is performed based on a preset data packet information table to obtain historical access source tag information.

In one embodiment, in the IPv4 version, the fields in the data packet information table include at least the sending end IP address information field, the receiving end IP address information field, the IP identification information field, the IP fragmentation mark information field, the IP fragmentation offset information field and the IP packet total length information field; in the IPv6 version, the fields in the data packet information table include at least the sending end IP address information field, the receiving end IP address information field, the flow label information field, the payload length information field and the next header information field.

Step S2123: Determine the access source of the duplicate data packet and obtain the access source determination result;

In one embodiment, step S2123 includes: obtaining the maximum survival time and the current survival time of the duplicate data packet; if the current survival time matches the maximum survival time, the access source judgment result is sending-end access; if the current survival time does not match the maximum survival time, the access source judgment result is receiving-end access. In other words, if the current survival time matches the maximum survival time, the duplicate data packet has not passed through a routing device, which indicates that it was collected on the sending-end IP side; if the current survival time does not match the maximum survival time, the duplicate data packet has passed through a routing device and its survival time value has been reduced, which indicates that it was collected on the receiving-end IP side.

In one embodiment, the current survival time matching the maximum survival time means that the two are equal, and the two not matching means that they are not equal. In the embodiments of the present application, the current survival time not being equal to the maximum survival time means that the current survival time is less than the maximum survival time.
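The access source judgment of step S2123 reduces to a single comparison, sketched below. The function name `judge_access_source` and the string labels `"sender"` and `"receiver"` are assumptions chosen for the example; rejecting a survival time above the maximum is likewise an illustrative choice based on the constraint that the value never exceeds N.

```python
def judge_access_source(current_ttl: int, max_ttl: int) -> str:
    """Return "sender" if the packet has not crossed a router yet, else "receiver"."""
    if current_ttl > max_ttl:
        # A survival time above the configured maximum contradicts the TTL rule.
        raise ValueError("current survival time exceeds the maximum survival time")
    return "sender" if current_ttl == max_ttl else "receiver"
```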

Step S2124: Determine the repeated access type information of the repeated data packet based on the access source judgment result and historical access source mark information.

In one embodiment, the historical access source tag information includes historical sending end tag information and historical receiving end tag information;

In one implementation, historical sender tag information and historical receiver tag information are stored in a data packet information table.

In one implementation, the historical access source tag information is obtained through the following steps: obtaining the maximum survival time and the initial data packet, where the initial data packet is the data packet obtained for the first time; parsing the initial data packet to obtain the initial survival time; if the initial survival time matches the maximum survival time, setting the historical sending-end tag information to 1 and the historical receiving-end tag information to 0; if the initial survival time does not match the maximum survival time, setting the historical sending-end tag information to 0 and the historical receiving-end tag information to 1.

In one implementation, if the initial survival time matches the maximum survival time, the initial data packet was collected from the sending-end IP address; if the initial survival time does not match the maximum survival time, the initial data packet was collected from the receiving-end IP address.

In one embodiment, the initial survival time matching the maximum survival time means that the two are equal, and the two not matching means that they are not equal. In the embodiments of the present application, the initial survival time not being equal to the maximum survival time means that the initial survival time is less than the maximum survival time.
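The initialization of the historical access source tag information from the initial data packet can be sketched as follows. The `SourceFlags` dataclass and its field names are hypothetical stand-ins for the historical sending-end and receiving-end tag fields of the data packet information table.

```python
from dataclasses import dataclass


@dataclass
class SourceFlags:
    """Historical access source tags: 1 means a packet was already seen on that side."""
    sender_seen: int = 0
    receiver_seen: int = 0


def init_source_flags(initial_ttl: int, max_ttl: int) -> SourceFlags:
    """Set the tags for an initial data packet based on its survival time."""
    if initial_ttl == max_ttl:
        # Collected on the sending-end side: no router has decremented the TTL yet.
        return SourceFlags(sender_seen=1, receiver_seen=0)
    # Collected after at least one routing hop, i.e. on the receiving-end side.
    return SourceFlags(sender_seen=0, receiver_seen=1)
```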

In one embodiment, obtaining the maximum survival time includes: obtaining a preconfigured IP and maximum survival time comparison table; querying the IP and maximum survival time comparison table according to the sender IP address information to obtain the maximum survival time.

In one embodiment, the IP and maximum survival time comparison table includes at least the following fields: the sending end IP address information field and the maximum survival time field. The sending end IP address information field is the key of the table, so the corresponding maximum survival time can be queried through the sending end IP address information.

In one implementation, the mapping relationship between the sending end IP address information and the maximum lifetime is obtained from the network element equipment vendor, which usually exists in the form of a configuration interface or configuration file.

It should be noted that when a network element device is connected to the network, its sending IP address information and corresponding maximum survival time are already stored in the IP and maximum survival time comparison table. The survival time in this table does not change as data packets are transmitted between network element devices.

In one embodiment, based on the sender IP address information in the IP identification information, a query is performed in the IP and maximum survival time comparison table to obtain the corresponding maximum survival time.
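A sketch of the IP and maximum survival time comparison table is given below, assuming it is loaded from a configuration file. The file format (one `ip ttl` pair per line, with `#` comments), the function name `load_max_ttl_table` and the sample addresses and values are all assumptions for illustration; the application only states that the mapping exists as a configuration interface or configuration file.

```python
from typing import Dict


def load_max_ttl_table(text: str) -> Dict[str, int]:
    """Parse a hypothetical "ip ttl" configuration text into a lookup table."""
    table: Dict[str, int] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        ip, ttl = line.split()
        table[ip] = int(ttl)
    return table


# Illustrative configuration; addresses and values are made up for the example.
SAMPLE_CONFIG = """
# sender IP   maximum survival time
10.0.0.1      64
10.0.0.9      128
"""

max_ttl_table = load_max_ttl_table(SAMPLE_CONFIG)
```

A query then uses the sending end IP address as the key, for example `max_ttl_table.get("10.0.0.1")`; an address that was never configured yields `None`, and how that case is handled is left open here.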

In one embodiment, in step S2124, determining the repeated access type information of the repeated data packet according to the access source judgment result and historical access source mark information includes:

When the access source judgment result is sending-end access, the historical sending-end tag information is checked. If the historical sending-end tag information is 1, the repeated access type of the repeated data packet is determined to be the sending-end IP repeated access type;

When the access source judgment result is receiving-end access, the historical receiving-end tag information is checked. If the historical receiving-end tag information is 1, the repeated access type of the repeated data packet is determined to be the receiving-end IP repeated access type;

When the access source judgment result is sending-end access, the historical sending-end tag information is checked. If the historical sending-end tag information is 0, the repeated access type of the repeated data packet is determined to be the sender-receiver IP repeated access type;

When the access source judgment result is receiving-end access, the historical receiving-end tag information is checked. If the historical receiving-end tag information is 0, the repeated access type of the repeated data packet is determined to be the sender-receiver IP repeated access type.

In one embodiment, the sending-end IP repeated access type is referred to as the SS type (Source IP repeated access), the receiving-end IP repeated access type is referred to as the DD type (Destination IP repeated access), and the sender-receiver IP repeated access type is referred to as the SD type (both Source IP and Destination IP repeated access).
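The four rules of step S2124 can be condensed into a small decision function. This is a sketch only: `classify_repeat` and the `"sender"`/`"receiver"` labels are hypothetical names, while the returned `"SS"`, `"DD"` and `"SD"` strings follow the type abbreviations given above.

```python
def classify_repeat(source: str, sender_flag: int, receiver_flag: int) -> str:
    """Map the access source judgment and the historical tags to SS, DD or SD."""
    if source == "sender":
        # Already seen on the sending-end side -> repeated on that side (SS);
        # otherwise the earlier copy came from the receiving-end side (SD).
        return "SS" if sender_flag == 1 else "SD"
    if source == "receiver":
        return "DD" if receiver_flag == 1 else "SD"
    raise ValueError(f"unknown access source: {source!r}")
```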

The method in the embodiment of the present application also includes: updating the historical access source tag information according to the current access source judgment result to obtain current access source tag information; and using the current access source tag information as the updated historical access source tag information.

In one embodiment, when the access source judgment result is sending-end access, the historical sending-end tag information is checked; if it is 0, it is updated and its value is set to 1. When the access source judgment result is receiving-end access, the historical receiving-end tag information is checked; if it is 0, it is updated and its value is set to 1.
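The tag update can be sketched as a pure function that returns the new pair of tags. The name `update_flags` and the tuple representation are assumptions for illustration; in the described system the tags would live in the data packet information table.

```python
from typing import Tuple


def update_flags(source: str, sender_flag: int, receiver_flag: int) -> Tuple[int, int]:
    """Return the (sender, receiver) tags after seeing a packet from `source`."""
    if source == "sender":
        return 1, receiver_flag   # a 0 becomes 1; a 1 stays 1
    if source == "receiver":
        return sender_flag, 1
    raise ValueError(f"unknown access source: {source!r}")
```

Under this sketch the classification would be computed from the tags before they are updated, so that the first packet seen on a new side is classified against the history rather than against itself.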

In one embodiment, the access time of repeated data packets is recorded to obtain repeated access time information;

In one embodiment, as shown in Figure 5, the step of recording the access time of repeated data packets and obtaining the repeated access time information may include but is not limited to step S2125, step S2126, and step S2127:

Step S2125: Obtain the repeated access start time, which is the time when repeated access occurs for the first time;

Step S2126: Obtain the latest time of repeated access, which is the latest time when repeated access occurs;

In one embodiment, when a duplicate data packet is accessed, the time information in the repeated access statistics table is updated according to the access time of the current data packet. The update covers either both the repeated access start time field and the repeated access latest time field, or only the repeated access latest time field.

In one implementation, if the current data packet is a repeated data packet for the first time, the repeated access start time and the repeated access latest time are both set to the access time of the current repeated data packet.

In one embodiment, if the current data packet is not the first repeated data packet, the repeated access start time remains unchanged, and the latest repeated access time is set as the access time of the current repeated data packet.

In one embodiment, the latest time of repeated access is represented by the time of the latest access of a repeated data packet.

Step S2127: Use the difference between the latest repeated access time and the repeated access start time as repeated access time information.

In one embodiment, according to the repeated access time information, it can be known in which time period the repeated data packets are generated.
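Steps S2125 to S2127 can be sketched as follows in Python. The class name `RepeatTimes` and the use of floating-point seconds as timestamps are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepeatTimes:
    start_time: Optional[float] = None   # repeated access start time (S2125)
    latest_time: Optional[float] = None  # repeated access latest time (S2126)

    def record(self, access_time: float) -> None:
        """Record the access time of one repeated data packet."""
        if self.start_time is None:
            # First repeated data packet: the start time is fixed here.
            self.start_time = access_time
        # Every repeated data packet, first or not, refreshes the latest time.
        self.latest_time = access_time

    def duration(self) -> float:
        """Repeated access time information (S2127): latest minus start."""
        if self.start_time is None or self.latest_time is None:
            return 0.0
        return self.latest_time - self.start_time
```

For example, recording access times 100.0 and then 130.5 leaves the start time at 100.0 and yields a duration of 30.5.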

Step S2130: Update the corresponding information in the repeated access statistics table according to the source and destination IP address pair information, repeated access type information, and repeated access time information to obtain an updated repeated access statistics table.

In one implementation, the repeated access time information can be understood as the access time of repeated data packets.

Step S1400: Identify duplicate data generated in duplicate data packets based on at least one of the source and destination IP address pair information, duplicate access type information, and duplicate access time information.

In one implementation, by combining the source and destination IP address pair information, the repeated access type information and the repeated access time information, the user can learn, for example, which source and destination IP address pair is involved, the number of accesses of each repeated access type, and the repeated access time information. The user can then use this information to locate the device corresponding to the duplicate data packets and to monitor the subsequent elimination of the fault. The fault elimination scenarios are described in detail below.

As shown in Figure 6, the method for identifying duplicate data in the embodiment of the present application also includes but is not limited to the following steps:

Step S1500: Construct a counter corresponding to each repeated access type based on the repeated access type information;

In one implementation, the SS type counter is constructed according to the repeated access type SS type, the DD type counter is constructed based on the repeated access type DD type, and the SD type counter is constructed based on the repeated access type SD type.

Step S1600: Display the source and destination IP address pair information, numerical information corresponding to the counter, and repeated access time information.

In one embodiment, the repeated access statistics table is updated according to the source and destination IP address pair information and the repeated access type of the current data packet.

The update includes querying the counter corresponding to the repeated access type based on the current source and destination IP address pair information, and adding 1 to the value of the counter obtained by the query. If the repeated access type is SS type, 1 is added to the value of the SS type counter; if the repeated access type is DD type, 1 is added to the value of the DD type counter; if the repeated access type is SD type, 1 is added to the value of the SD type counter.

In one embodiment, the source and destination IP address pair information, the numerical information corresponding to the counters, and the repeated access time are displayed. The display allows users to intuitively understand the source and destination IP address pairs of repeated data packets, the number of SS-type accesses, the number of DD-type accesses, the number of SD-type accesses, the latest time of repeated access and the start time of repeated access, so that users can accurately identify the network element equipment corresponding to the repeated data packets and quickly locate repeated access faults.
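Steps S1500 and S1600 can be illustrated with the following Python sketch. The function names `build_counters` and `display_row`, and the one-line text layout, are assumptions; the application does not prescribe a display format.

```python
from collections import Counter
from typing import Iterable

REPEAT_TYPES = ("SS", "DD", "SD")


def build_counters(observed_types: Iterable[str]) -> Counter:
    """Step S1500: one counter per repeated access type, fed by observed types."""
    counts = Counter({t: 0 for t in REPEAT_TYPES})
    for repeat_type in observed_types:
        if repeat_type not in REPEAT_TYPES:
            raise ValueError(f"unknown repeated access type: {repeat_type}")
        counts[repeat_type] += 1
    return counts


def display_row(src_ip: str, dst_ip: str, counts: Counter,
                start_time: str, latest_time: str) -> str:
    """Step S1600: format one address pair as a single line of text."""
    return (f"{src_ip} -> {dst_ip} | "
            f"SS={counts['SS']} DD={counts['DD']} SD={counts['SD']} | "
            f"start={start_time} latest={latest_time}")
```

Each line shows one source and destination IP address pair together with its three counter values and the two time fields.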

In one embodiment, fault diagnosis is described. For a given source-destination IP address pair, fault elimination falls into three scenarios: not eliminated, partially eliminated, and completely eliminated. To determine whether the fault of a source-destination IP address pair is partially or completely eliminated, a time threshold needs to be set. Partial elimination refers to the following case: the SS type counter and the DD type counter of a source-destination IP address pair are both constantly increasing; after the time threshold, the SS type counter no longer increments, but the DD type counter is still increasing. This indicates that, for this source-destination IP address pair, the SS type repeated access fault has been eliminated, but the DD type repeated access fault has not. Complete elimination means that, after the time threshold, the SS type counter, the DD type counter and the SD type counter all no longer increase.
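One possible way to express this judgment is to compare two snapshots of the counters taken one time threshold apart. The following Python sketch is only an illustration: the function name `elimination_state`, the returned strings, and the rule that "not eliminated" means every previously faulty type is still increasing are all assumptions.

```python
from typing import Dict

REPEAT_TYPES = ("SS", "DD", "SD")


def elimination_state(before: Dict[str, int], after: Dict[str, int]) -> str:
    """Classify one address pair from counter snapshots one threshold apart."""
    # Types for which at least one repeated access has been counted.
    faulty = {t for t in REPEAT_TYPES if after.get(t, 0) > 0}
    # Faulty types whose counter increased during the threshold interval.
    growing = {t for t in faulty if after.get(t, 0) > before.get(t, 0)}
    if not faulty:
        return "no fault recorded"
    if not growing:
        return "completely eliminated"
    if growing == faulty:
        return "not eliminated"
    return "partially eliminated"
```

With `before = {"SS": 5, "DD": 3}` and `after = {"SS": 5, "DD": 7}`, the SS type counter has stopped while the DD type counter is still increasing, so the result is "partially eliminated".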

In one embodiment, in step S1600, the data in the updated repeated access statistics table is displayed on the terminal.

An embodiment of the present application also provides a duplicate data identification system, including a configuration module, a data packet receiving module and a data packet analysis module. The configuration module is configured to configure a comparison table between IP and maximum survival time. The data packet receiving module is configured to receive data packets and send the data packets to the data packet analysis module. The data packet analysis module is configured to parse the data packets to obtain IP identification information, and to identify repeated data packets based on the IP identification information to obtain repeated access type information, where the IP identification information includes the sender IP address information and the receiver IP address information. The system also includes a statistics module, which is configured to store the sender IP address information, the receiver IP address information, the repeated access type and the repeated access time information of the duplicate data packets.

In one embodiment, the configuration module is further configured to configure the turning on or off of the duplicate data identification function.

The duplicate data identification system in the embodiment of the present application is applicable to any communication network based on the IP protocol, including IPV4 and IPV6 versions, but is not limited to any form of network structure.

In one implementation, the configuration module can be a configurator, the data packet receiving module can be a data packet receiver, the data packet analysis module can be a data packet analyzer, and the statistics module can be a repeated access statistician.

In one implementation, as shown in Figure 7, the duplicate data identification system includes a configurator, a data packet receiver, a data packet analyzer, a data packet sender, a repeated access statistician and a repeated access presenter, each of which is introduced in detail below.

The configurator is responsible for maintaining configuration information. The current configuration information includes, but is not limited to, turning on and off the duplicate data identification function, configuring the maximum cache duration of data packets, maintaining the comparison table between IP and maximum survival time, and the maximum duration for fault resolution.

In one implementation, the default state of the duplicate data identification function is on, and the maximum cache duration of data packets is configured as 10 seconds.

In one embodiment, if the switch of the duplicate data identification function is turned on, the identified duplicate data packets are directly discarded, so that duplicate data packets are not sent to the data collection system and do not increase its workload.

In one embodiment, if the switch of the duplicate data identification function is turned off, none of the received data packets are identified, and the data packets are directly forwarded to the data collection system. This avoids data transmission delay in the case where there are no duplicate data packets.
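The effect of the switch can be sketched in Python as follows. The names `Config` and `dispatch`, and the callables `is_duplicate` and `forward`, are hypothetical; how a duplicate is actually recognized is outside the scope of this sketch. The default values follow the text (function on, maximum cache duration of 10 seconds).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Config:
    dedup_enabled: bool = True       # default state of the function is on
    max_cache_seconds: float = 10.0  # maximum cache duration of data packets


def dispatch(packet: bytes, config: Config,
             is_duplicate: Callable[[bytes], bool],
             forward: Callable[[bytes], None]) -> bool:
    """Return True if the packet was forwarded, False if it was discarded."""
    if not config.dedup_enabled:
        # Switch off: no identification at all, forward immediately.
        forward(packet)
        return True
    if is_duplicate(packet):
        # Switch on and packet identified as a duplicate: discard it.
        return False
    forward(packet)
    return True
```

When the switch is off, `is_duplicate` is never called, which reflects the stated goal of avoiding identification delay.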

In one implementation, the comparison relationship between IP and maximum survival time can be obtained through parameter information provided by the collected device.

Table 1 shows the fields in the comparison table between IP and maximum survival time, including the sending end IP address information field and the maximum survival time field. There is a mapping relationship between the sending end IP address information and the maximum survival time. The sending end IP address information is a keyword that uniquely determines a record in the comparison table between IP and maximum survival time. According to the sending end IP address information, the value of the maximum survival time of the corresponding device can be queried.

According to the IP protocol, in order to prevent messages from looping between routing devices without ever reaching the destination, each message carries a survival time value. The maximum survival time value is set by the device at the sending IP address when constructing the message. For example, let N denote the initial value of the survival time; N is generally 64. Each time the message passes through a routing device, the survival time value is reduced by 1, and the message is discarded when the value is reduced to 0. The survival time value of a data packet therefore satisfies: the survival time is less than or equal to N.
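The behaviour of the survival time value can be illustrated with a short Python sketch. The function name `ttl_after_hops` and the constant name are assumptions for this example; the value 64 is the typical initial value N mentioned above.

```python
from typing import Optional

DEFAULT_INITIAL_TTL = 64  # N in the text, "generally 64"


def ttl_after_hops(hops: int, initial_ttl: int = DEFAULT_INITIAL_TTL) -> Optional[int]:
    """Return the survival time left after `hops` routing devices.

    Returns None if the message is discarded because the value reached 0.
    """
    if hops < 0:
        raise ValueError("hops must be non-negative")
    ttl = initial_ttl
    for _ in range(hops):
        ttl -= 1  # each routing device reduces the value by 1
        if ttl == 0:
            return None  # discarded by the routing device
    return ttl
```

A packet captured before its first routing device still carries N, while a packet captured after any routing device carries a smaller value; this difference is what the comparison with the maximum survival time relies on.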

Table 1

The data packet receiver is mainly responsible for establishing and maintaining the communication link between the duplicate data identification system and the data packet aggregator. It is also responsible for receiving all data packets from the data packet aggregator and transferring the received data packets to the data packet analyzer for analysis and processing.

The data packet sender is mainly responsible for establishing and maintaining the communication link between the duplicate data identification system and the data collection system. It is also responsible for sending the data packets that need to be forwarded after processing by the data packet analyzer to the data collection system.

The data packet analyzer is mainly responsible for parsing the information of each protocol layer of the data packet, and identifying the repeated access type of the data packet based on the IP information of the protocol layer and the field information in the data packet information table.

In the IPV4 version, the IP layer protocol information that data packets need to parse is as shown in Table 2, including sender IP address information, receiver IP address information, IP identification information, IP fragmentation mark information, IP fragmentation offset information, protocol Type, IP packet total length information and TTL. The IP protocol layer fields corresponding to the information in Table 2 have been listed above.

Table 2

In the IPV6 version, the IP layer protocol information that data packets need to parse is as shown in Table 3, including the sending IP address information, the receiving IP address information, flow label information, payload length information, next header information, and hop limit information (Hop Limit). The IP protocol layer fields corresponding to the information in Table 3 have been listed above.

Table 3

The packet analyzer is also responsible for maintaining the packet information table in order to cache the IP layer protocol information of the packets received within the maximum cache time. The packet analyzer also performs regular detection on the packet information table. If the data packets are not received for a specified period of time, the packet analyzer clears the records of the duplicate data packets in a timely manner. The specified time can be configured by the user.

In one embodiment, as shown in Figure 10, an aging time is set, and an aging timer is used to regularly detect the records in the data packet information table. The existence time of each record in the table is checked in a loop to determine whether it is equal to or greater than the aging time; if the existence time of a record is equal to or greater than the aging time, the record is deleted.
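One pass of the aging check can be sketched in Python as follows. The function name `purge_aged` and the representation of the table as a dictionary from record key to creation time are assumptions; in a running system the function would be invoked periodically by the aging timer.

```python
from typing import Dict, Hashable, List


def purge_aged(created_at: Dict[Hashable, float], now: float,
               aging_time: float) -> List[Hashable]:
    """Delete every record whose existence time has reached the aging time.

    `created_at` maps a record key to the time the record was created.
    Returns the keys that were deleted.
    """
    expired = [key for key, created in created_at.items()
               if now - created >= aging_time]
    for key in expired:
        del created_at[key]
    return expired
```

The keys to delete are collected first and removed afterwards, so the dictionary is never modified while it is being iterated.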

Table 4 shows the fields in the packet information table of the IPV4 version. The sending end IP address information, the receiving end IP address information, IP fragmentation mark information, IP fragmentation offset information, protocol type, and IP packet total length information are used as keywords for each record. A unique record in the data packet information table can be determined through these keywords.

Table 4

Table 5 shows the fields in the packet information table of the IPV6 version. The sender IP address information, receiver IP address information, flow label information, payload length information, and next header information are used as keywords for each record. A unique record in the data packet information table can be determined through these keywords.

The historical sender mark information and the historical receiver mark information are explained as follows:

From the comparison table between IP and maximum survival time, the maximum survival time corresponding to the current sender IP address information is queried. The initial values of the historical sender mark information and the historical receiver mark information are 0. The algorithm for the access source tag information is: if the survival time carried by the current data packet is equal to the maximum survival time, the value of the historical sender mark information is set to 1; if the survival time carried by the current data packet is not equal to the maximum survival time, the value of the historical receiver mark information is set to 1.
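The keywords of the two versions of the data packet information table can be modelled as hashable tuples, as in the following Python sketch. The type names `IPv4Key` and `IPv6Key`, the field names and the stored mark values are illustrative assumptions; the field lists follow the keywords named in the text.

```python
from typing import Dict, NamedTuple, Tuple


class IPv4Key(NamedTuple):
    src_ip: str        # sending end IP address information
    dst_ip: str        # receiving end IP address information
    frag_flags: int    # IP fragmentation mark information
    frag_offset: int   # IP fragmentation offset information
    protocol: int      # protocol type
    total_length: int  # IP packet total length information


class IPv6Key(NamedTuple):
    src_ip: str          # sender IP address information
    dst_ip: str          # receiver IP address information
    flow_label: int      # flow label information
    payload_length: int  # payload length information
    next_header: int     # next header information


# Data packet information table: keyword -> (sender mark, receiver mark).
packet_info_table: Dict[tuple, Tuple[int, int]] = {}

first = IPv4Key("10.0.0.1", "10.0.0.2", 0, 0, 6, 1500)
packet_info_table[first] = (1, 0)

# A later packet with identical keyword fields maps to the same record.
again = IPv4Key("10.0.0.1", "10.0.0.2", 0, 0, 6, 1500)
is_known = again in packet_info_table
```

Because the tuples compare by value, a later data packet carrying the same keyword fields finds the record stored for the earlier one.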

Referring to Figure 8, the access source tag information of an initial data packet is explained as an example. The initial survival time of the initial data packet is compared with the maximum survival time. If the initial survival time is equal to the maximum survival time, the initial data packet was collected from the sending end IP address; the historical sending end mark information is set to 1 and the historical receiving end mark information is set to 0. If the initial survival time is not equal to the maximum survival time, the initial data packet was collected from the receiving end IP address; the historical sending end mark information is set to 0 and the historical receiving end mark information is set to 1. After the above operations are completed, the initial data packet is sent to the data collection system for network quality analysis.

Table 5

The repeated access statistician is mainly responsible for maintaining the repeated access statistics table, which is used to store the sending IP address and receiving IP address of repeated data packets and to count the number of repeated accesses and other information, so that the information in the repeated access statistics table can be sent to the repeated access presenter for display.

Table 6 shows the fields in the repeated access statistics table, including the sending end IP address information field, the receiving end IP address information field, the SS type counter field, the DD type counter field, the SD type counter field, the repeated access start time field and the repeated access latest time field. The three counter fields are used to record the number of accesses of the three repeated access types.
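The comparison for an initial data packet can be sketched in Python as follows. The function name `initial_marks` and the dictionary `max_ttl_by_ip`, which stands in for the comparison table between IP and maximum survival time, are assumptions; handling of a sending end IP address that is missing from the table is not covered.

```python
from typing import Dict, Tuple


def initial_marks(sender_ip: str, packet_ttl: int,
                  max_ttl_by_ip: Dict[str, int]) -> Tuple[int, int]:
    """Return (sending end mark, receiving end mark) for an initial packet."""
    max_ttl = max_ttl_by_ip[sender_ip]  # lookup keyed by sending end IP
    if packet_ttl == max_ttl:
        # Survival time untouched: collected at the sending end.
        return (1, 0)
    # Survival time already reduced: collected at the receiving end.
    return (0, 1)
```

With a table entry of 64 for the sending end, a data packet carrying 64 yields `(1, 0)` and a data packet carrying 61 yields `(0, 1)`.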

In the repeated access statistics table, the sending end IP address information and the receiving end IP address information are used as keywords for each record. These keywords uniquely determine a record in the repeated access statistics table.

Table 6

From the repeated access statistics table, the following information can be obtained:

According to the repeated access statistics table, you can know the source and destination IP address pairs with repeated access.

According to the repeated access statistics table, you can know the number of SS-type repeated accesses, the number of DD-type repeated accesses, and the number of SD-type repeated accesses corresponding to a certain source-destination IP address pair information.

According to the repeated access statistics table, you can know the start time of repeated access for a certain source and destination IP address pair.

According to the repeated access statistics table, you can know the time it takes for a certain source and destination IP address pair to completely resolve the repeated access fault.

According to the repeated access statistics table, you can know whether a fault of a certain repeated access type in a source-destination IP address pair has been resolved. For example, a time threshold is set; if the value of the SS type counter is greater than 0 and the value no longer increases within the time threshold, the SS type repeated access fault has been resolved.

According to the repeated access statistics table, you can know the repeated access time information of a certain source and destination IP address pair, that is, the time period during which repeated access exists. The repeated access time period is calculated as the difference between the latest repeated access time and the repeated access start time.

In one embodiment, the SS type counter, the DD type counter and the SD type counter can each be set to a repeated access latest time to respectively monitor the elimination time of the three types of repeated access.

According to the repeated access statistics table, you can intuitively understand which source and destination IP address pairs have repeated access, and whether the repeated access problem of which source and destination IP address pairs has been resolved.

The repeated access presenter is mainly responsible for displaying the information in the repeated access statistics table in the form of a visual interface, allowing users to intuitively understand information such as the source and destination IP address pairs of repeated data packets, the number of SS type repeated accesses, the number of DD type repeated accesses, the number of SD type repeated accesses, the repeated access start time and the latest repeated access time, so that users can accurately identify the network element equipment corresponding to the repeated data packets and quickly locate the fault.

Referring to Figure 9, the updating of the repeated access statistics table is explained. The data packet analyzer passes the sending end IP address information, the receiving end IP address information and the repeated access type information of the repeated data packets to the repeated access statistician. The repeated access statistician queries the repeated access statistics table according to the sending end IP address information and the receiving end IP address information.

If no record corresponding to the sender IP address information and the receiver IP address information is found in the repeated access statistics table, a new record is added to the repeated access statistics table, and the sender IP address information and the receiver IP address information are added to the record. According to the repeated access type information, the value of the corresponding counter is set to 1 and the values of the other counters are set to 0, and the repeated access start time and the latest repeated access time are both set to the current time.

If a record corresponding to the sender IP address information and the receiver IP address information is found in the repeated access statistics table, the value of the counter corresponding to the repeated access type information is increased by 1, so as to count the number of accesses of that repeated access type, and the latest repeated access time is updated to the current time, indicating the access time of the latest repeated data packet.
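The two cases above (record not found, record found) can be sketched as a single insert-or-update routine in Python. The names `RepeatRecord` and `upsert`, and the use of a dictionary keyed by the address pair as the repeated access statistics table, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class RepeatRecord:
    ss_count: int = 0
    dd_count: int = 0
    sd_count: int = 0
    start_time: float = 0.0   # repeated access start time
    latest_time: float = 0.0  # latest repeated access time


_COUNTER_FIELD = {"SS": "ss_count", "DD": "dd_count", "SD": "sd_count"}


def upsert(table: Dict[Tuple[str, str], RepeatRecord], src_ip: str,
           dst_ip: str, repeat_type: str, now: float) -> RepeatRecord:
    """Insert or update the record for one repeated data packet."""
    field_name = _COUNTER_FIELD[repeat_type]  # KeyError on an unknown type
    key = (src_ip, dst_ip)
    record = table.get(key)
    if record is None:
        # Not found: new record, both times set to the current time.
        record = RepeatRecord(start_time=now, latest_time=now)
        table[key] = record
    else:
        # Found: only the latest repeated access time moves forward.
        record.latest_time = now
    # Matching counter becomes 1 for a new record, or is increased by 1.
    setattr(record, field_name, getattr(record, field_name) + 1)
    return record
```

The start time is written only when the record is created, so it continues to mark the first repeated access of the address pair.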

The query conditions, query fields and update fields of the data packet information table and the repeated access statistics table are described below. The data packet analyzer queries and updates the data packet information table. The query conditions are the keywords in Table 4 and Table 5; the query fields are the historical sender mark information and the historical receiver mark information; the update fields in the data packet information table are likewise the historical sender mark information and the historical receiver mark information.

The data packet analyzer sends a statistical message to the repeated access statistician. The content of the message is the sending end IP address information, the receiving end IP address information and the repeated access type information.

The repeated access statistician queries and updates the repeated access statistics table. The query conditions are the sending IP address information and the receiving IP address information. The query fields are the SS type counter, the DD type counter and the SD type counter. The fields that need to be updated are the SS type counter, the DD type counter, the SD type counter, the repeated access latest time and the repeated access start time.

The following is a brief description of the processing flow for repeated data packets:

The packet receiver receives all packets from the packet aggregator and forwards the received packets to the packet analyzer for processing. The packet analyzer performs protocol analysis on the received data packets in order to obtain relevant information of the data packets. In the IPV4 version, this includes the sending IP address information, receiving IP address information, IP fragmentation mark information, IP fragmentation offset information, IP packet total length information and TTL; in the IPV6 version, it includes the sender IP address information, receiver IP address information, flow label information, payload length information, next header information and hop limit information.

After protocol analysis of the data packet, whether the current data packet is a duplicate data packet is identified based on the information parsed from the current data packet and the information stored in the data packet information table. If it is a duplicate data packet, its repeated access type is further identified.

After the repeated access type of the repeated data packet is identified, the sender IP address information, receiving end IP address information and repeated access type information of the repeated data packet are sent to the repeated access statistician to count the number of repeated accesses.

In one implementation, if the current data packet is not a duplicate data packet, the current data packet is transferred to the data packet sender; if it is a duplicate data packet, the duplicate data packet is discarded.

When the data packet sender receives the data packet, it forwards the data packet to the data acquisition system.

When the repeated access statistician receives the statistics request from the packet analyzer, it updates the information in the repeated access statistics table and sends the information in the repeated access statistics table to the repeated access presenter.

By analyzing the received data packets, the embodiment of the present application can accurately determine which IP addresses have repeated access of data packets and which repeated access type the repeated data packets belong to, and can further determine at which IP ends the duplicate data packets exist.

The embodiment of this application counts duplicate data packets in units of source and destination IP address pairs. It can accurately determine the number of duplicate data packet accesses for each source and destination IP address pair, with the three types of repeated access counted separately, and can determine the start time and the latest access time of duplicate data packets for each source and destination IP address pair.

The embodiment of the present application intuitively displays the information of repeated data packets to the user through the repeated access presenter, so that the user can quickly locate the faulty device and troubleshoot the fault accurately and quickly. Through the changes in the three types of counters in the repeated access presenter, it can be determined whether the repeated access fault is partially or completely eliminated. When the values of the three counters in the repeated access presenter are no longer incrementing, the access time of the latest repeated data packet indicates the point in time at which the repeated access fault was completely eliminated.

By filtering duplicate data packets, the embodiments of the present application avoid sending these abnormal data packets to the data collection system and data analysis system, thereby improving system performance and the accuracy of data analysis.

The embodiment of the present application provides a duplicate data identification system that identifies source and destination IP address pair information, repeated access type information and repeated access time information. This helps with troubleshooting and ensures the accuracy of network quality analysis, while also helping to improve the performance of data collection systems and analysis systems.

One embodiment of the present application also provides a device for identifying duplicate data. The device includes: a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the computer program to implement the above-mentioned method for identifying duplicate data.

As a non-transitory computer-readable storage medium, memory can be used to store non-transitory software programs and non-transitory computer executable programs. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory may include memory located remotely from the processor, and the remote memory may be connected to the processor through a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

The non-transitory software programs and instructions required to implement the duplicate data identification method of the above embodiment are stored in the memory. When they are executed by the processor, the duplicate data identification method in the above embodiment is performed.

The network element embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

One embodiment of the present application also provides a computer-readable storage medium that stores computer-executable instructions, and the computer-executable instructions are used to execute the duplicate data identification method of the above embodiment.

In addition, embodiments of the present application also provide a computer program product, which includes a computer program or computer instructions. The computer program or computer instructions are stored in a computer-readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium. Program or computer instructions, the processor executes the computer program or computer instructions, so that the computer device performs the above duplicate data identification method.

The mobile communication device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place, or may be distributed to multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art can understand that all or some steps and systems in the methods disclosed above can be implemented as software, firmware, hardware, and appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disk (DVD) or other optical disk storage, magnetic cassettes, tapes, disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Additionally, it is known to those of ordinary skill in the art that communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

The above has described several embodiments of the present application, but the present application is not limited to the above-mentioned embodiments. Those skilled in the art can also make various equivalent modifications or substitutions without departing from the spirit of the present application, and all such equivalent modifications or substitutions are included in the scope defined by the claims of this application.

Claims

A method for identifying duplicate data, including:

Obtain the source and destination IP address pair information, which includes the sending end IP address information and the receiving end IP address information;

Obtain repeated access type information, which is used to characterize the IP address access type that causes duplicate data packets to be generated;

Obtain repeated access time information, which is used to represent the time information of repeated data packet access;

According to at least one of the source and destination IP address pair information, repeated access type information and repeated access time information, duplicate data is identified for generated duplicate data packets.
The method for identifying duplicate data according to claim 1, wherein the method further includes:

Construct a counter corresponding to each repeated access type according to the repeated access type information;

The source and destination IP address pair information, the numerical information corresponding to the counter, and the repeated access time information are displayed.
The method for identifying duplicate data according to claim 1, wherein the obtained source and destination IP address pair information, repeated access type information and repeated access time information are obtained through a repeated access statistics table.
The method for identifying duplicate data according to claim 3, wherein the step of obtaining the duplicate access statistics table includes:

Get duplicate packets;

According to the repeated data packet, source and destination IP address pair information, repeated access type information and repeated access time information are obtained;

According to the source and destination IP address pair information, the repeated access type information and the repeated access time information, the corresponding information in the repeated access statistics table is updated to obtain the updated repeated access statistics table.
The method for identifying duplicate data according to claim 4, wherein said obtaining source and destination IP address pair information based on the duplicate data packet includes:

The repeated data packet is analyzed to obtain the sending end IP address information and the receiving end IP address information of the repeated data packet, and the sending end IP address information and the receiving end IP address information are determined as the source and destination IP address pair information.
The method for identifying duplicate data according to claim 4, wherein said obtaining a duplicate access type according to the duplicate data packet includes:

Repeated access type identification is performed on the repeated data packet to obtain repeated access type information of the repeated data packet.
The method for identifying duplicate data according to claim 4, wherein said obtaining the duplicate access time according to the duplicate data packet includes:

The access time of the repeated data packet is recorded to obtain repeated access time information.
The method for identifying duplicate data according to claim 6, wherein the step of identifying the duplicate access type of the duplicate data packet to obtain the duplicate access type information of the duplicate data packet includes:

Analyze the current data packet received to obtain the IP identification information used to identify the current data packet;

According to the IP identification information, query and obtain historical access source tag information;

Determine the access source of the repeated data packet and obtain the access source judgment result;

The repeated access type information of the repeated data packet is determined according to the access source judgment result and the historical access source mark information.
The method for identifying duplicate data according to claim 8, wherein said judging the access source of the duplicate data packet to obtain the access source judgment result includes:

Obtain the maximum survival time and the current survival time of the duplicate data packet;

If the current survival time matches the maximum survival time, the access source judgment result is access by the sending end;

If the current survival time does not match the maximum survival time, the access source judgment result is access by the receiving end.
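
By way of a minimal, non-limiting sketch, the access source judgment above can be expressed as a single comparison. The sketch assumes that "survival time" corresponds to the IP time-to-live (TTL) field and that "matches" means exact equality; both readings are assumptions, as are the names `AccessSource` and `judge_access_source`.

```python
from enum import Enum


class AccessSource(Enum):
    SENDER = "sender"      # packet observed with its survival time unchanged
    RECEIVER = "receiver"  # packet observed with a different survival time


def judge_access_source(current_ttl: int, max_ttl: int) -> AccessSource:
    """Compare the packet's current survival time with the configured maximum.

    Assumption: "matches" is interpreted as exact equality.
    """
    if current_ttl == max_ttl:
        return AccessSource.SENDER
    return AccessSource.RECEIVER
```

For example, with a configured maximum of 64, a packet carrying a survival time of 64 would be judged as sending-end access, and one carrying 63 as receiving-end access.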
The method for identifying duplicate data according to claim 9, wherein the historical access source mark information includes historical sending end mark information and historical receiving end mark information;

Determining the repeated access type information of the repeated data packet based on the access source judgment result and the historical access source mark information includes:

When the access source judgment result is access by the sending end, the historical sending end mark information is judged. If the historical sending end mark information is 1, it is determined that the repeated access type of the repeated data packet is the sending end IP repeated access type;

When the access source judgment result is access by the receiving end, the historical receiving end mark information is judged. If the historical receiving end mark information is 1, it is determined that the repeated access type of the repeated data packet is the receiving end IP repeated access type;

When the access source judgment result is access by the sending end, the historical sending end mark information is judged. If the historical sending end mark information is 0, it is determined that the repeated access type of the repeated data packet is the sending end-receiving end IP repeated access type;

When the access source judgment result is access by the receiving end, the historical receiving end mark information is judged. If the historical receiving end mark information is 0, it is determined that the repeated access type of the repeated data packet is the sending end-receiving end IP repeated access type.
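
The four cases above reduce to a small decision function. The following Python sketch is illustrative only; the string labels and the function name `classify_repeat_type` are hypothetical, and the access source is passed in as a plain string for brevity.

```python
def classify_repeat_type(access_source: str,
                         hist_sender_mark: int,
                         hist_receiver_mark: int) -> str:
    """Map the current access source and the historical marks to a repeat type.

    access_source is "sender" or "receiver" (hypothetical labels).
    """
    if access_source == "sender":
        # Same side seen before -> sending end repeat; otherwise both sides.
        if hist_sender_mark == 1:
            return "sender_ip_repeat"
        return "sender_receiver_ip_repeat"
    if access_source == "receiver":
        if hist_receiver_mark == 1:
            return "receiver_ip_repeat"
        return "sender_receiver_ip_repeat"
    raise ValueError(f"unknown access source: {access_source!r}")
```

In other words, a repeat observed on the same side as the recorded history is attributed to that side alone, while a repeat observed on the opposite side is attributed to the sending end-receiving end type.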
The method for identifying duplicate data according to claim 10, wherein the method further includes:

According to the current access source judgment result, update the historical access source tag information to obtain the current access source tag information;

The current access source tag information is used as the updated historical access source tag information.
The method for identifying duplicate data according to claim 8, wherein the historical access source mark information includes historical sending end mark information and historical receiving end mark information; the historical access source mark information is obtained through the following steps:

Obtain the maximum survival time and the initial data packet, and the initial data packet is the data packet obtained for the first time;

Analyze the initial data packet to obtain the initial survival time;

If the initial survival time matches the maximum survival time, the historical sending end tag information is set to 1, and the historical receiving end tag information is set to 0;

If the initial survival time does not match the maximum survival time, the historical sending end tag information is set to 0, and the historical receiving end tag information is set to 1.
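
A minimal sketch of this initialization, again assuming that "matches" means equality of the two survival time values, might look as follows. The function name `init_history_marks` and the tuple layout are hypothetical.

```python
from typing import Tuple


def init_history_marks(initial_ttl: int, max_ttl: int) -> Tuple[int, int]:
    """Return (historical sending end mark, historical receiving end mark).

    Assumption: the first packet counts as sending-end access when its
    survival time equals the configured maximum survival time.
    """
    if initial_ttl == max_ttl:
        return (1, 0)
    return (0, 1)
```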
The method for identifying duplicate data according to claim 9 or 12, wherein said obtaining the maximum survival time includes:

Obtain the pre-configured IP and maximum survival time comparison table;

According to the IP address information of the sending end, a query is performed in the comparison table between the IP and the maximum survival time to obtain the maximum survival time.
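
For illustration only, the comparison table between IP and maximum survival time can be modelled as a mapping keyed by the sending end IP address. The table contents, the name `IP_MAX_TTL_TABLE`, and the behaviour of returning `None` for an unconfigured address are assumptions; the application does not specify what happens when no entry exists.

```python
from typing import Dict, Optional

# Hypothetical pre-configured table: sending end IP -> maximum survival time.
# The addresses and values are illustrative only.
IP_MAX_TTL_TABLE: Dict[str, int] = {
    "192.0.2.10": 64,
    "192.0.2.20": 128,
}


def lookup_max_ttl(sender_ip: str,
                   table: Dict[str, int] = IP_MAX_TTL_TABLE) -> Optional[int]:
    """Query the maximum survival time configured for sender_ip, if any."""
    return table.get(sender_ip)
```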
The method for identifying duplicate data according to claim 7, wherein said recording the access time of the duplicate data packet to obtain repeated access time information includes:

Obtain the repeated access start time, which is the time when the repeated access occurs for the first time;

Obtain the latest time of repeated access, where the latest time of repeated access is the latest time when repeated access occurs;

The difference between the latest repeated access time and the repeated access start time is used as repeated access time information.
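
The time computation above can be sketched as follows. The class name `RepeatTimer`, its method names, and the use of floating-point seconds as timestamps are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepeatTimer:
    """Tracks the first and latest times at which repeated access occurred."""
    start: Optional[float] = None   # repeated access start time
    latest: Optional[float] = None  # latest time of repeated access

    def observe(self, timestamp: float) -> None:
        """Record one occurrence of repeated access."""
        if self.start is None:
            self.start = timestamp
        self.latest = timestamp

    def duration(self) -> float:
        """Latest repeated access time minus repeated access start time."""
        if self.start is None or self.latest is None:
            return 0.0  # assumption: no repeated access observed yet
        return self.latest - self.start
```

Calling `observe` with timestamps 10.0 and then 12.5 would give a repeated access time of 2.5 under these assumptions.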
A duplicate data identification system, including:

The configuration module is set to configure the comparison table between IP and maximum survival time;

a data packet receiving module configured to receive data packets and send the data packets to the data packet analysis module;

The data packet analysis module is configured to parse the data packet to obtain IP identification information, identify duplicate data packets according to the IP identification information and obtain repeated access type information, wherein the IP identification information includes the sending end IP address information and the receiving end IP address information;

The statistics module is configured to store the sending end IP address information, the receiving end IP address information, the repeated access type information and the access time information of the repeated data packets.
The duplicate data identification system according to claim 15, wherein:

The configuration module is also configured to configure the opening or closing of the duplicate data identification function.
A device for identifying duplicate data, including:

at least one processor;

At least one memory for storing at least one program;

The method for identifying duplicate data according to any one of claims 1 to 14 is implemented when at least one of the programs is executed by at least one of the processors.
A computer-readable storage medium in which a processor-executable program is stored, wherein the processor-executable program, when executed by a processor, implements the method for identifying duplicate data according to any one of claims 1 to 14.
A computer program product comprising a computer program or computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer program or the computer instructions from the computer-readable storage medium, and the processor executes the computer program or the computer instructions, so that the computer device performs the method for identifying duplicate data according to any one of claims 1 to 14.