CN113452621A

CN113452621A - Simple and efficient multilink data deduplication method

Info

Publication number: CN113452621A
Application number: CN202110652823.2A
Authority: CN
Inventors: 张凯; 郑应强; 刘同鹤
Original assignee: Beijing LSSEC Technology Co Ltd
Current assignee: Beijing LSSEC Technology Co Ltd
Priority date: 2021-06-11
Filing date: 2021-06-11
Publication date: 2021-09-28
Anticipated expiration: 2041-06-11
Also published as: CN113452621B

Abstract

The invention discloses a simple and efficient multilink data deduplication method, which comprises the following steps: acquiring a plurality of link data packets received by a receiving end; determining the type of each link data packet and classifying the link data packets into a first link data packet set and a second link data packet set, wherein the first link data packet set comprises a plurality of first link data packets, which are link fragment data, and the second link data packet set comprises a plurality of second link data packets, which are link complete data; judging whether each first link data packet in the first link data packet set is a redundant packet, screening out the first link data packets that are redundant packets, and performing deduplication processing; and judging whether each second link data packet in the second link data packet set is a redundant packet, screening out the second link data packets that are redundant packets, and performing deduplication processing. The deduplication efficiency is thereby improved.

Description

Simple and efficient multilink data deduplication method

Technical Field

The invention relates to the technical field of data processing, in particular to a simple and efficient multilink data deduplication method.

Background

With the continuous development of equipment redundancy technology, it has come to be widely applied in the communication field. During data communication, to make link transmission as reliable as possible, multilink equipment may perform three-level redundant transmission when sending original user data, so that the data received by the receiving end is complete and data loss caused by packet loss during transmission is avoided. However, when the receiving end reassembles the data, three-level redundant transmission causes it to receive duplicate fragment data; the excess redundant fragments occupy more memory and increase memory garbage-collection time. In more extreme cases, to ensure maximum transmission reliability, the sending ends of multiple links may make several copies of the original user data and send them all to the receiving end of the multilink device. If all of these data packets are output to the service port of the receiving end, the downstream processing load increases and the processing results of the service software may be affected.

In the prior art, deduplication of the redundant data acquired by the receiving end is incomplete, inefficient and inaccurate. Inaccurate deduplication can cause incomplete data to be reassembled, so that the system performs erroneous control based on the incomplete data, which affects the safety and reliability of the system.

Disclosure of Invention

The present invention aims to solve, at least to some extent, one of the technical problems in the art described above. To this end, the invention provides a simple and efficient multilink data deduplication method in which the type of each data packet is accurately determined, whether a data packet is a redundant packet is accurately judged according to its class, and a different deduplication method is applied to each class, thereby improving deduplication efficiency. On the one hand, the system resources occupied at the receiving end, such as memory and CPU processing time, are reduced as far as possible. On the other hand, from the user's point of view, no matter how the sending end sends the data, the receiving end ultimately delivers only one copy of the original data, so the user's service layer is not adversely affected. The completeness and accuracy of the data reassembled at the receiving end are guaranteed and the quality of data transmission is improved. At the same time, the method avoids incomplete data caused by inaccurate deduplication, and thus avoids the system performing erroneous control based on incomplete data, which improves the safety and reliability of the system and makes multilink data deduplication simpler and more efficient.

To achieve the above object, an embodiment of the present invention provides a simple and efficient method for removing duplicate data in multiple links, including:

acquiring a plurality of link data packets received by a receiving end;

respectively determining the types of a plurality of link data packets, classifying the link data packets, and determining the link data packets as a first link data packet set and a second link data packet set; the first link data packet set comprises a plurality of first link data packets, and the first link data packets are link fragment data; the second link data packet set comprises a plurality of second link data packets, and the second link data packets are link complete data;

respectively judging whether a first link data packet in a first link data packet set is a redundant packet, screening out the first link data packet which is the redundant packet, and performing deduplication processing;

and respectively judging whether the second link data packets in the second link data packet set are redundant packets, screening out the second link data packets which are the redundant packets, and performing deduplication processing.
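The classification step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `LinkPacket` type and its `is_fragment` flag are hypothetical, since the text does not say how a packet's type is encoded or detected.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class LinkPacket:
    source_id: int
    seq: int
    # Hypothetical flag: the source text does not specify how the
    # packet type (fragment vs. complete) is actually determined.
    is_fragment: bool
    payload: bytes = b""


def classify(packets: Iterable[LinkPacket]) -> Tuple[List[LinkPacket], List[LinkPacket]]:
    """Split packets into (first set: fragments, second set: complete data)."""
    first_set: List[LinkPacket] = []
    second_set: List[LinkPacket] = []
    for packet in packets:
        # Arrival order is preserved inside each set.
        (first_set if packet.is_fragment else second_set).append(packet)
    return first_set, second_set


packets = [
    LinkPacket(source_id=1, seq=10, is_fragment=True),
    LinkPacket(source_id=1, seq=10, is_fragment=False),
    LinkPacket(source_id=2, seq=11, is_fragment=True),
]
fragments, completes = classify(packets)
```

Each resulting set would then be passed to its own redundancy check, as the two judging steps describe.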

According to some embodiments of the present invention, respectively determining whether a first link data packet in a first link data packet set is a redundant packet, screening out a first link data packet that is a redundant packet, and performing deduplication processing, includes:

acquiring a source id of a historical first link data packet, and establishing a deduplication red-black tree;

the deduplication red-black tree comprises a plurality of tree nodes, and each tree node stores a source port, a destination port and a historical first link data packet sequence number;

establishing a first hash table according to a source port, a destination port and a historical first link data packet serial number stored in each tree node;

the first hash table comprises a plurality of first nodes, and history information is stored in each first node;

obtaining a source id of a first link data packet, and determining a target tree node on the deduplication red-black tree according to the source id of the first link data packet;

quickly positioning to a corresponding position in a first hash table according to a first target packet serial number stored in a target tree node, and determining a first target node;

comparing the first to-be-stored information of the current first link data packet with the historical information stored by the first target node, and judging whether the time difference between the historical information stored by the first target node and the first to-be-stored information of the current first link data packet is smaller than a preset time difference or not when the first to-be-stored information and the historical information stored by the first target node are consistent;

and when the time difference between the historical information stored by the first target node and the first to-be-stored information of the current first link data packet is determined to be smaller than the preset time difference, the first link data packet is represented as a redundant packet, and the deduplication processing is carried out without entering the subsequent data reassembly processing flow.
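A minimal sketch of the fragment check described above, under several assumptions. Python has no built-in red-black tree, so a plain `dict` keyed by source id stands in for the deduplication red-black tree. The per-source table is keyed by `(seq, offset)`, which is an assumption about how fragments sharing a sequence number are told apart. The bitmap field and the source and destination ports are omitted. All class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

PRESET_TIME_DIFF_S = 3.0  # the 3 s value given in one embodiment


@dataclass(frozen=True)
class FragmentRecord:
    seq: int            # link data packet sequence number
    offset: int         # offset of the fragment within the data to be transmitted
    length: int         # fragment length
    packet_id: int      # unique identification ID
    received_at: float  # receive time, in seconds

    def identity(self) -> Tuple[int, int, int, int]:
        """Fields compared for consistency; the receive time is excluded."""
        return (self.seq, self.offset, self.length, self.packet_id)


class FragmentDeduper:
    def __init__(self) -> None:
        # source id -> {(seq, offset) -> last stored record}.
        # Assumption: a dict replaces the red-black tree of the text.
        self._by_source: Dict[int, Dict[Tuple[int, int], FragmentRecord]] = {}

    def is_redundant(self, source_id: int, rec: FragmentRecord) -> bool:
        table = self._by_source.setdefault(source_id, {})
        key = (rec.seq, rec.offset)
        old = table.get(key)
        if (
            old is not None
            and old.identity() == rec.identity()
            and rec.received_at - old.received_at < PRESET_TIME_DIFF_S
        ):
            return True  # redundant: dropped before reassembly
        table[key] = rec  # new or expired: remember it and let it through
        return False


dedup = FragmentDeduper()
first = FragmentRecord(seq=7, offset=0, length=100, packet_id=42, received_at=10.0)
copy = FragmentRecord(seq=7, offset=0, length=100, packet_id=42, received_at=11.5)
late = FragmentRecord(seq=7, offset=0, length=100, packet_id=42, received_at=14.0)
results = [dedup.is_redundant(1, r) for r in (first, copy, late)]
```

In this sketch a redundant packet does not refresh the stored record, so the time window is measured from the first accepted copy; the text does not say which behavior is intended.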

According to some embodiments of the present invention, respectively determining whether a second link data packet in a second link data packet set is a redundant packet, screening out a second link data packet that is a redundant packet, and performing deduplication processing, includes:

acquiring a source id of a historical second link data packet, and establishing a second hash table and a third hash table; the second hash table takes the source id of the historical second link data packet as its key; establishing an association between the second hash table and the third hash table;

acquiring a source id of a second link data packet, and quickly searching a second target node in a second hash table;

quickly positioning to a corresponding position in a third hash table according to a second target packet serial number stored in the second target node, and determining a third target node;

when the third target packet sequence number stored in the third target node is determined to be consistent with the second link data packet sequence number, judging whether the time difference between the third target node historically storing the third target packet sequence number and the second link data packet received this time is smaller than a preset time difference;

and when the time difference between the third target node historically storing the third target packet serial number and the current received second link data packet is determined to be smaller than the preset time difference, the second link data packet is represented as a redundant packet, and the duplicate removal processing is carried out without entering the subsequent data recombination processing flow.
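The two-level lookup for complete packets might look like the sketch below. It assumes the primary table maps a source id to a fixed-size secondary table indexed by `seq % SLOTS`; the slot count and the modulo indexing are illustrative choices that the text does not specify, and the names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

PRESET_TIME_DIFF_S = 3.0  # the 3 s value given in one embodiment
SLOTS = 256               # hypothetical size of each secondary table


class CompletePacketDeduper:
    def __init__(self) -> None:
        # Primary table: source id -> secondary table.
        # Each secondary slot holds (sequence number, time) of the last packet let through.
        self._primary: Dict[int, List[Optional[Tuple[int, float]]]] = {}

    def is_redundant(self, source_id: int, seq: int, now: float) -> bool:
        secondary = self._primary.setdefault(source_id, [None] * SLOTS)
        index = seq % SLOTS
        entry = secondary[index]
        if entry is not None:
            last_seq, last_time = entry
            if last_seq == seq and now - last_time < PRESET_TIME_DIFF_S:
                return True  # redundant: dropped before reassembly
        secondary[index] = (seq, now)  # record the packet that is output
        return False


dedup = CompletePacketDeduper()
outcomes = [
    dedup.is_redundant(5, 100, 0.0),          # first copy
    dedup.is_redundant(5, 100, 1.0),          # duplicate inside the window
    dedup.is_redundant(5, 100, 3.5),          # same number, window expired
    dedup.is_redundant(5, 100 + SLOTS, 4.0),  # different number, same slot
    dedup.is_redundant(6, 100, 4.0),          # different source
]
```

Only the packet with the same sequence number arriving within the window is treated as redundant; a slot collision simply overwrites the older entry.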

According to some embodiments of the invention, the time difference is 3 s.

According to some embodiments of the present invention, the history information includes a history first link data packet sequence number, a bit map value for reconstructing data, a length of the history first link data packet, an offset position of the history first link data packet in a data packet to be transmitted, a unique identifier ID of the history first link data packet, and a received time of the history first link data packet.

According to some embodiments of the present invention, after performing deduplication processing on the first link data packet screened out as the redundant packet, the method further includes:

randomly selecting two first link data packets from the first link data packet set subjected to the duplicate removal processing, wherein the two first link data packets are a first link data packet A and a first link data packet B respectively;

calculating the similarity between the first link data packet A and the first link data packet B, judging whether the similarity is greater than a preset similarity, and when the similarity is determined to be greater than the preset similarity, indicating that the duplicate removal processing of the first link data packet in the first link data packet set is unqualified and the duplicate removal processing needs to be carried out again;

calculating the similarity between the first link data packet A and the first link data packet B, including:

acquiring the sub-link data (A₁, A₂, ..., A_m) included in the first link data packet A;

acquiring the sub-link data (B₁, B₂, ..., B_n) included in the first link data packet B;

calculating the similarity S(A, B) of the first link data packet A and the first link data packet B based on formula (1):

[formula (1) is not reproduced in this text]

wherein m is the number of sub-link data included in the first link data packet A; n is the number of sub-link data included in the first link data packet B; i denotes the i-th sub-link data in the first link data packet A; j denotes the j-th sub-link data in the first link data packet B; k_ij is a judgment coefficient indicating whether the i-th sub-link data in the first link data packet A is the same as the j-th sub-link data in the first link data packet B: when they are the same, k_ij = 1; otherwise, k_ij = 0.
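Because formula (1) itself does not survive in this text, the sketch below computes only the judgment coefficients k_ij as defined above and does not guess how they are combined into S(A, B). The function name and the use of `bytes` for sub-link data are assumptions made for illustration.

```python
from typing import List, Sequence


def judgment_coefficients(a: Sequence[bytes], b: Sequence[bytes]) -> List[List[int]]:
    """Return the m x n matrix k, where k[i][j] is 1 if a[i] equals b[j], else 0."""
    return [[1 if a_i == b_j else 0 for b_j in b] for a_i in a]


sub_a = [b"x", b"y", b"z"]  # m = 3 sub-link data of packet A
sub_b = [b"y", b"x"]        # n = 2 sub-link data of packet B
k = judgment_coefficients(sub_a, sub_b)
```

Row i of `k` corresponds to the i-th sub-link data of packet A and column j to the j-th sub-link data of packet B.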

According to some embodiments of the present invention, before determining and classifying the types of the plurality of link packets, the method further includes:

respectively judging whether the plurality of link data packets comprise the marked images, and determining the link data packets with the marked images as to-be-detected link data packets;

carrying out gray processing on the marked image in the link data packet to be detected, and calculating to obtain a gray conversion function;

calculating the average power of noise signals in the marked image according to the gray scale transformation function, judging whether the average power is greater than a preset average power or not, and performing noise reduction processing on the marked image when the average power is determined to be greater than the preset average power;

calculating the average power of noise signals in the marked image according to the gray scale transformation function, comprising:

calculating the average power of the noise signal in the length direction in the marked image:

[formula not reproduced in this text]

wherein N is the length of the marked image; m is the width of the marked image; f(x, y) is the gray scale transformation function with respect to a pixel point (x, y) on the marked image;

calculating the average power of the noise signal in the width direction in the marked image:

[formula not reproduced in this text]

calculating the average power of the noise signal in the marked image according to the average power of the noise signal in the length direction and the average power of the noise signal in the width direction in the marked image:

[formula not reproduced in this text]

wherein [symbol not reproduced in this text] is the average power of the noise signal in the marked image.

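According to some embodiments of the present invention, before determining and classifying the types of the plurality of link data packets, the method further includes:

determining the ratio of the number of the link data packets received by the receiving end to the number of the link data packets sent by the sending end, and judging whether the ratio is smaller than a preset ratio or not;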

when the ratio is determined to be smaller than the preset ratio, detecting the duty ratio of a preset network node in a use link for sending the link data packet at the time;

determining the load capacity of the preset network nodes according to the duty ratio, screening out the preset network nodes with the load capacity larger than the preset load capacity as marked preset network nodes;

determining link information comprising the marked preset network node by taking the marked preset network node as an extension point;

and setting the priority level of the used link according to the link information to ensure that the ratio of the number of the link data packets received again by the receiving end to the number of the link data packets sent again by the sending end is greater than or equal to a preset ratio.
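The link-quality check above can be sketched as follows. The text does not say how load capacity is derived from the duty ratio or how the priority level is set, so this sketch takes each node's load value as an input and stops at identifying the links that pass through marked nodes. All names, thresholds and the link topology are hypothetical.

```python
from typing import Dict, List, Set


def receive_ratio(received: int, sent: int) -> float:
    """Ratio of link data packets received to link data packets sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return received / sent


def mark_nodes(load_by_node: Dict[str, float], preset_load: float) -> Set[str]:
    """Nodes whose load value exceeds the preset value become marked nodes."""
    return {node for node, load in load_by_node.items() if load > preset_load}


def links_through(links: Dict[str, List[str]], marked: Set[str]) -> List[str]:
    """Names of the links that include at least one marked node."""
    return sorted(name for name, nodes in links.items() if marked.intersection(nodes))


PRESET_RATIO = 0.9  # hypothetical threshold
ratio = receive_ratio(received=80, sent=100)
affected: List[str] = []
if ratio < PRESET_RATIO:
    # Assumption: the load value is taken to be the measured duty ratio itself.
    marked = mark_nodes({"n1": 0.95, "n2": 0.40, "n3": 0.70}, preset_load=0.9)
    affected = links_through({"linkA": ["n1", "n2"], "linkB": ["n2", "n3"]}, marked)
```

The list of affected links would then feed whatever priority policy the system uses for retransmission.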

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

fig. 1 is a flow chart of a simple and efficient method for removing duplicate multi-link data according to an embodiment of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

As shown in fig. 1, an embodiment of the present invention provides a simple and efficient method for removing duplicate data from multiple links, including steps S1-S4:

S1, acquiring a plurality of link data packets received by the receiving end;

S2, determining the types of a plurality of link data packets respectively, classifying the link data packets, and determining the link data packets as a first link data packet set and a second link data packet set; the first link data packet set comprises a plurality of first link data packets, and the first link data packets are link fragment data; the second link data packet set comprises a plurality of second link data packets, and the second link data packets are link complete data;

S3, respectively judging whether the first link data packet in the first link data packet set is a redundant packet, screening out the first link data packet which is the redundant packet and carrying out deduplication processing;

S4, respectively judging whether the second link data packets in the second link data packet set are redundant packets, screening out the second link data packets which are redundant packets, and performing deduplication processing.

The working principle of the technical scheme is as follows: acquiring a plurality of link data packets received by a receiving end; respectively determining the types of a plurality of link data packets, classifying the link data packets, and determining the link data packets as a first link data packet set and a second link data packet set; the first link data packet set comprises a plurality of first link data packets, and the first link data packets are link fragment data; the second link data packet set comprises a plurality of second link data packets, and the second link data packets are link complete data; namely, if the first link data packet is a redundant packet, the first link data packet is link fragment data redundancy; if the second link data packet is a redundant packet, the second link data packet is link complete data redundancy; respectively judging whether a first link data packet in a first link data packet set is a redundant packet, screening out the first link data packet which is the redundant packet, and performing deduplication processing; and respectively judging whether the second link data packets in the second link data packet set are redundant packets, screening out the second link data packets which are the redundant packets, and performing deduplication processing.

The beneficial effects of the above technical scheme are as follows: the type of each link data packet is accurately determined, whether a link data packet is a redundant packet is accurately judged according to its class, and a different deduplication method is applied to each class, thereby improving deduplication efficiency. On the one hand, the system resources occupied at the receiving end, such as memory and CPU processing time, are reduced as far as possible. On the other hand, from the user's point of view, no matter how the sending end sends the data, the receiving end ultimately delivers only one copy of the original data, so the user's service layer is not adversely affected. The completeness and accuracy of the data reassembled at the receiving end are guaranteed and the quality of data transmission is improved. At the same time, the scheme avoids incomplete data caused by inaccurate deduplication, and thus avoids the system performing erroneous control based on incomplete data, which improves the safety and reliability of the system and makes multilink data deduplication simpler and more efficient.

The working principle of the technical scheme is as follows: a red-black tree is a self-balancing binary search tree, a data structure used in computer science that is typically used to implement associative arrays. When judging whether a first link data packet in the first link data packet set is a redundant packet, a source id of a historical first link data packet is acquired and a deduplication red-black tree is established; the deduplication red-black tree comprises a plurality of tree nodes, and each tree node stores a source port, a destination port and a historical first link data packet sequence number; the source port is the start of the transmission link and the destination port is the end of the transmission link. A hash table is a data structure accessed directly by key value; that is, it accesses a record by mapping the key value to a location in the table, which speeds up lookup. A first hash table is established according to the source port, destination port and historical first link data packet sequence number stored in each tree node; the first hash table comprises a plurality of first nodes, and each first node stores history information. The source id of the first link data packet is obtained, and a target tree node on the deduplication red-black tree is determined according to it; the corresponding position in the first hash table is quickly located according to the first target packet sequence number stored in the target tree node, and a first target node is determined. The first to-be-stored information of the current first link data packet is compared with the history information stored in the first target node, and, when the two are consistent, it is judged whether the time difference between the history information stored in the first target node and the first to-be-stored information of the current first link data packet is smaller than a preset time difference. When the time difference is determined to be smaller than the preset time difference, the first link data packet is a redundant packet; it is deduplicated and does not enter the subsequent data reassembly processing flow. The historical first link data packet is historical data of the same type as the first link data packet.

The beneficial effects of the above technical scheme are that: the method comprises the steps of carrying out quick searching and positioning on a first link data packet based on a duplicate removal red-black tree established by a historical first link data packet, determining whether the first link data packet is recorded in a historical process, and accurately determining whether the first link data packet is a redundant packet based on two key factors of comparing first to-be-stored information of the current first link data packet with historical information stored by a first target node and comparing whether the time difference between the historical information stored by the first target node and the first to-be-stored information of the current first link data packet is smaller than a preset time difference.

The working principle of the technical scheme is as follows: when judging whether a second link data packet in the second link data packet set is a redundant packet, a source id of a historical second link data packet is acquired, and a second hash table and a third hash table are established; the second hash table takes the source id of the historical second link data packet as its key, and an association is established between the second hash table and the third hash table. The source id of the second link data packet is acquired, and a second target node is quickly looked up in the second hash table; the corresponding position in the third hash table is quickly located according to the second target packet sequence number stored in the second target node, and a third target node is determined. When the third target packet sequence number stored in the third target node is determined to be consistent with the sequence number of the second link data packet, it is judged whether the time difference between the time at which the third target node stored the third target packet sequence number and the time at which the current second link data packet was received is smaller than a preset time difference. When the time difference is determined to be smaller than the preset time difference, the second link data packet is a redundant packet; it is deduplicated and does not enter the subsequent data reassembly processing flow. The second hash table serves as the primary table and the third hash table serves as the secondary table. The second hash table comprises a plurality of second nodes; the third hash table comprises a plurality of third nodes; each node maintains the sequence number and time of the most recently output link data packet recorded at the corresponding location. The historical second link data packet is historical data of the same type as the second link data packet.

The beneficial effects of the above technical scheme are that: and a two-stage hash table is established based on the historical second link data packet, so that whether the received second link data packet is a redundant packet or not is accurately judged, and the method is simpler and more efficient.

According to some embodiments of the invention, the time difference is 3 s.

And accurately judging whether the first link data packet is a redundant data packet or not based on the unique identification ID of the historical first link data packet and the first link data packet received this time.

wherein m is the number of sub-link data included in the first link data packet A; n is the number of sub-link data included in the first link data packet B; i denotes the i-th sub-link data in the first link data packet A; j denotes the j-th sub-link data in the first link data packet B; k_ij is a judgment coefficient indicating whether the i-th sub-link data in the first link data packet A is the same as the j-th sub-link data in the first link data packet B: when they are the same, k_ij = 1; otherwise, k_ij = 0.

The working principle and the beneficial effects of the technical scheme are as follows: after the selected first link data packets which are the redundant packets are subjected to deduplication processing, detecting the deduplication effect of a first link data packet set subjected to deduplication processing, specifically, randomly selecting two first link data packets which are a first link data packet A and a first link data packet B from the first link data packet set subjected to deduplication processing; calculating the similarity between the first link data packet A and the first link data packet B, judging whether the similarity is greater than a preset similarity, and when the similarity is determined to be greater than the preset similarity, indicating that the duplicate removal processing of the first link data packet in the first link data packet set is unqualified and the duplicate removal processing needs to be carried out again; in another embodiment, the two first link data packets may be randomly selected for multiple times, the similarity between the two first link data packets may be calculated, the mean value of the similarity may be calculated, and the accuracy of the detection of the deduplication effect may be improved according to the mean value of the similarity. The similarity between the first link data packet A and the first link data packet B is calculated, the sub-link data included in the two first link data packets are respectively compared, the arrangement sequence of the sub-link data included in the same first link data packet is considered, the calculated similarity is more reasonable, the similarity between the first link data packet A and the first link data packet B is accurately calculated according to a formula (1), and the accuracy of judging the similarity and the preset similarity is improved.

calculating the average power of the noise signal in the width direction in the marked image:

[formula not reproduced in this text]

wherein [symbol not reproduced in this text] is the average power of the noise signal in the marked image.

The working principle and the beneficial effects of the technical scheme are as follows: before determining and classifying the types of a plurality of link data packets respectively, the method further comprises the following steps: respectively judging whether the plurality of link data packets comprise the marked images, and determining the link data packets with the marked images as to-be-detected link data packets; the marking image is a sending identification for marking the link data packet, so that whether the link data packet is lost in the transmission process is conveniently judged, and when the link data packet is lost, the lost link data packet is determined according to the marking image, so that the link data packet is conveniently sent again, and the accuracy of data transmission is improved. Carrying out gray processing on the marked image in the link data packet to be detected, and calculating to obtain a gray conversion function; calculating the average power of noise signals in the marked image according to the gray scale transformation function, judging whether the average power is greater than a preset average power or not, and performing noise reduction processing on the marked image when the average power is determined to be greater than the preset average power; the method and the device facilitate the improvement of the identification accuracy of the marked image, and improve the determination accuracy for determining the types of the plurality of link data packets and classifying the link data packets in the subsequent steps. The gray scale transformation function is prior art and will not be described herein. 
According to the average power of the noise signals in the marked image in the length direction and the average power of the noise signals in the width direction, the average power of the noise signals in the marked image is calculated, the accuracy of calculating the total average power of the noise signals in the marked image is improved, and the accuracy of judging the average power and the preset average power is further improved.

The working principle and the beneficial effects of the technical scheme are as follows: before the types of the plurality of link data packets are determined and the packets are classified, the sending success rate and the number of lost packets of the sending end's link data packets are checked, so that, when the receiving end reassembles the data, the received data packets can completely form the data to be transmitted as determined by the sending end. The ratio of the number of link data packets received by the receiving end to the number of link data packets sent by the sending end is determined, and it is judged whether the ratio is smaller than a preset ratio. A ratio smaller than the preset ratio indicates that too many packets have been lost, so the quality of the link used to send the link data packets needs to be checked; in that case, the duty ratio of the preset network nodes in the link used for this transmission is detected. The load capacity of each preset network node is determined according to its duty ratio, and the preset network nodes whose load capacity is greater than a preset load capacity are screened out as marked preset network nodes. Taking each marked preset network node as an extension point, the link information that includes the marked preset network node is determined. The priority level of the used link is then set according to the link information, so that the ratio of the number of link data packets received again by the receiving end to the number sent again by the sending end is greater than or equal to the preset ratio. Setting the priority level of the used link gives priority to transmission of the current link data packets, reduces the number of lost packets, secures the number of link data packets received by the receiving end, and improves the accuracy of data transmission.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A simple and efficient multilink data deduplication method is characterized by comprising the following steps:

acquiring a plurality of link data packets received by a receiving end:

2. The simple and efficient method for removing duplicate data in multiple links according to claim 1, wherein determining whether the first link data packet in the first link data packet set is a redundant packet, respectively, and screening out the first link data packet that is the redundant packet and performing the duplicate removal process includes:

3. The simple and efficient method for removing duplicate data in multiple links according to claim 1, wherein determining whether the second link data packets in the second link data packet set are redundant packets, respectively, and screening out the second link data packets that are redundant packets and performing the duplicate removal process includes:

4. A simple and efficient multilink data deduplication method as recited in claim 2 or 3, wherein the time difference is 3 s.

5. The simple and efficient multilink data deduplication method of claim 2, wherein the historical information comprises a historical first link data packet sequence number, a bit map value for reassembly data, a length of the historical first link data packet, an offset position of the historical first link data packet within a data packet to be transmitted, a unique identification ID of the historical first link data packet, and a received time of the historical first link data packet.

6. The simple and efficient method for removing duplicate data in multiple links according to claim 1, wherein after the first link data packet screened out as the redundant packet is subjected to the duplicate removal process, the method further comprises:

7. The simple and efficient method for removing duplicate content in multilink data according to claim 1, further comprising, before determining and classifying the types of the plurality of link packets, respectively:

calculating the average power of the noise signal in the width direction in the marked image:

[formula not reproduced in this text]

wherein [symbol not reproduced in this text] is the average power of the noise signal in the marked image.

8. The simple and efficient method for removing duplicate content in multilink data according to claim 1, further comprising, before determining and classifying the types of the plurality of link packets, respectively: