CN110019054B - Log duplicate removal method and system, and content distribution network system

Info

Publication number: CN110019054B
Application number: CN201711487741.7A
Authority: CN
Inventors: 高顺路
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2023-01-31
Anticipated expiration: 2037-12-29
Also published as: CN110019054A

Abstract

The application provides a log deduplication method and system and a content distribution network system. The log deduplication method comprises the following steps: acquiring a log data set; determining the log data set name and the sequence number in the log data set identifier of the log data set, wherein the sequence number is used for distinguishing different log data sets with the same log data set name; and, if the storage state identifier corresponding to the sequence number, in the storage state identifier array corresponding to the log data set name, indicates that the log data set has already been stored, refusing to store the log data set. The application does not deduplicate log data entry by entry; instead, it performs the deduplication operation on whole log data sets. A log data set can include at most 4096 pieces of log data, so a single deduplication operation covers multiple pieces of log data at once. Therefore, the CPU resource consumption of the log collection device can be reduced, and the log deduplication operation efficiency can be improved.

Description

Log duplicate removal method and system, and content distribution network system

Technical Field

The present application relates to the field of communications technologies, and in particular, to a log deduplication method and system, and a content distribution network system.

Background

Currently, most systems include a log collecting device and a plurality of log generating devices, and the log generating devices can collect log data generated by themselves and transmit the log data to the log collecting device, so that the log collecting device performs other processes using the log data.

However, the log generating device may repeatedly transmit the same log data due to network jitter or the like, which may cause the log collecting device to contain repeated log data. In order to make it possible to subsequently use the accurate and valid log data, a deduplication operation may be performed on the log data.

At present, log deduplication works as follows: the log generating device adds an identifier (the ID of the log data) to each piece of log data, so that the log collecting device can deduplicate the log data entry by entry by detecting duplicate identifiers.

Because the volume of log data on the log collection device is huge, deduplicating the log data entry by entry not only consumes a large amount of CPU resources of the log collection device, but also makes the deduplication operation inefficient.

Disclosure of Invention

In view of this, the present application provides a log deduplication method and system, which can perform deduplication operations on a log data set, thereby reducing CPU resource consumption of a log collection device and improving log deduplication operation efficiency.

In order to achieve the above object, the present application provides the following technical features:

a log deduplication method is applied to log collection equipment and comprises the following steps:

acquiring a log data set; the log data set comprises log data generated by the same process of the log generation equipment in a first time period and a log data set identifier;

determining the name and the sequence number of the log data set in the log data set identifier of the log data set; the sequence numbers are used for distinguishing different log data sets with the same log data set name;

determining a storage state identifier corresponding to the sequence number in a storage state identifier array corresponding to the name of the log data set;

and if the storage state identification corresponding to the sequence number represents the state of the stored log data set, refusing to store the log data set.

Optionally, the determining, in the storage state identifier array corresponding to the log data set name, the storage state identifier corresponding to the sequence number includes:

searching whether the name of the log data set exists in a log data set name list;

if the name of the log data set is found, determining the storage state identification corresponding to the sequence number in the storage state identification array corresponding to the name of the log data set.

Optionally, the method further includes:

if the name of the log data set is not found, adding the name of the log data set into the name list of the log data set;

constructing a storage state identification array corresponding to the name of the log data set, wherein the storage state identifications in the storage state identification array all represent the states of the log data set which is not stored;

and storing the log data set, and changing the storage state identifier corresponding to the sequence number in the storage state identifier array so that the storage state identifier represents the state of the stored log data set.

Optionally, one first time period corresponds to one log data set, and a plurality of first time periods form a second time period, and the names of the log data sets corresponding to the plurality of first time periods in the second time period are consistent;

the sequence number in the log data set identification is used to distinguish the log data sets with the same log data set name in the second time period.

Optionally, the name of the log data set in the log data set identifier includes:

a device identifier of the log generation device;

a process identifier of the process in the log generation device;

a system timestamp corresponding to the second time period;

and the system timestamp is the product of an integer and the second time period, and the integer is obtained by rounding down the quotient of the system timestamp corresponding to the equipment log data set and the second time period.

Optionally, the sequence number in the log data set identifier includes:

the log generation equipment determines a numerical value of the log data set between 1 and the maximum numerical value, and different log data sets correspond to different serial numbers in the second time period;

and the maximum value is obtained by rounding up the quotient of the second time period and the first time period.

Optionally, the storage state identifier array includes a bit array, bits in the bit array correspond to the sequence number, and data values of the bits indicate the storage state identifiers.

A log deduplication method is applied to log generation equipment and comprises the following steps:

acquiring a log data set generated by the equipment in the same process in a first time period;

adding a log data set identifier to the log data set; the log data set identification comprises a log data set name and a serial number used for distinguishing different log data sets with the same log data set name;

sending a log data set to log collection equipment, so that the log collection equipment determines a name and a sequence number of the log data set in a log data set identifier of the log data set, and determines a storage state identifier corresponding to the sequence number in a storage state identifier array corresponding to the name of the log data set; and if the storage state identification corresponding to the sequence number represents the state of the stored log data set, refusing to store the log data set.

Optionally, adding a log data set identifier to the log data set includes:

determining the name of the log data set corresponding to the log data set;

determining a sequence number corresponding to the log data set;

determining a character string consisting of the name of the log data set and the sequence number as the identifier of the log data set;

and adding the log data set identification to the log data set.

Optionally, the determining the name of the log data set corresponding to the log data set includes:

acquiring a device identifier of the device, a process identifier of the process and a system timestamp;

calculating a quotient value of the system timestamp and the second time period, rounding the quotient value downwards to obtain an integer, and calculating a product of the integer and the second time period;

and determining a character string formed by the device identification, the process identification and the product as the name of the log data set.

Optionally, the determining the sequence number corresponding to the log data set includes:

in the second time period, starting from a first initial count value, increasing the count value by 1 each time a log data set is generated, until the count value has been increased a preset number of times; or,

in the second time period, starting from a second initial count value, decreasing the count value by 1 each time a log data set is generated, until the count value has been decreased the preset number of times;

and the preset number of times is obtained by rounding up the quotient of the second time period and the first time period.
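The incrementing variant described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `SequenceCounter`, the initial count value of 1, and the choice to restart the count whenever the system timestamp enters a new second time period are hypothetical and are not specified by the application.

```python
class SequenceCounter:
    """Hands out sequence numbers 1..max_count within one second time period."""

    def __init__(self, first_period: int, second_period: int):
        self.second_period = second_period
        # Preset number of times: quotient of the two periods, rounded up
        # (integer ceiling division, no floating point involved).
        self.max_count = -(-second_period // first_period)
        self._current_period = None  # index of the second time period in use
        self._next = 1               # assumed first initial count value

    def next_sequence(self, system_timestamp: int) -> int:
        period_index = system_timestamp // self.second_period
        if period_index != self._current_period:
            # Assumption: counting restarts when a new second time period begins.
            self._current_period = period_index
            self._next = 1
        if self._next > self.max_count:
            raise OverflowError("more log data sets than the preset number of times")
        sequence = self._next
        self._next += 1
        return sequence
```

With a first time period of 30 seconds and a second time period of 600 seconds, `max_count` is 20, so at most 20 sequence numbers are issued per log data set name.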

A content distribution network system comprising: the system comprises a central node and edge nodes connected with the central node, wherein each edge node comprises a plurality of servers;

the server is used for collecting a log data set generated by the same process in a first time period, adding a log data set identifier to the log data set, and sending the log data set to a streaming message system; the log data set identifier comprises a log data set name shared by a plurality of log data sets and a sequence number used for distinguishing different log data sets with the same log data set name;

the central node is used for acquiring a log data set from the streaming message system and determining the name and the sequence number of the log data set in a log data set identifier of the log data set; determining a storage state identifier corresponding to the sequence number in a storage state identifier array corresponding to the name of the log data set; and if the storage state identification corresponding to the sequence number represents the state of the stored log data set, refusing to store the log data set.

A log deduplication system, comprising:

the log generation device is used for collecting log data sets generated by the same process in a first time period, adding a log data set identifier to the log data sets and sending the log data sets to the log collection device; the log data set identification comprises log data set names adopted by a plurality of log data sets and serial numbers used for distinguishing different log data sets with the same log data set name;

the log collection equipment is used for acquiring a log data set and determining the name and the sequence number of the log data set in a log data set identifier of the log data set; determining a storage state identifier corresponding to the sequence number in a storage state identifier array corresponding to the name of the log data set; and if the storage state identification corresponding to the sequence number represents the state of the stored log data set, refusing to store the log data set.

acquiring a log data set; the log data set comprises log data generated by the same process of the log generation equipment in a first time period and a log data set name;

determining a log data set name for the log data set;

searching whether the name of the log data set exists in an existing log data set name list;

and if the name of the log data set is found, refusing to store the log data set.

Optionally, the method further includes:

if the name of the log data set is not found, storing the log data set;

and adding the name of the log data set in the name list of the log data set.

Optionally, the determining a name of the log data set includes: extracting the log dataset name in the log dataset;

wherein the log dataset name comprises:

a device identifier of the log generation device;

a process identifier of the process in the log generation device;

a system timestamp taken when the log generation device generates the log data set.

generating a name of a log data set, and adding the name of the log data set to the log data set;

sending a log data set to log collection equipment, so that the log collection equipment can search whether the name of the log data set exists in an existing log data set name list, and if the name of the log data set is found, refusing to store the log data set.

Optionally, adding a log data set name to the log data set includes:

acquiring the equipment identifier of the equipment, the process identifier of the process and a system timestamp;

determining a character string consisting of the device identifier, the process identifier and the system timestamp as the name of the log data set;

and adding the name of the log data set to the log data set.

A log deduplication system, comprising:

the log generation device is used for collecting log data sets generated by the same process in a first time period and adding log data set names to the log data sets; sending the log data set to a log collection device;

the log collection device is used for acquiring a log data set sent by the log generation device; searching whether a name of a log data set in the log data set exists in an existing log data set name list; and if so, refusing to store the log data set.

the server is used for collecting the log data sets generated by the same process in a first time period, adding the log data set names to the log data sets and sending the log data sets to the streaming message system;

the central node is used for acquiring a log data set from a streaming message system; searching whether a name of a log data set in the log data set exists in an existing log data set name list; and if so, refusing to store the log data set.

A log deduplication device integrated in a log collection device comprises:

a first acquisition unit configured to acquire a log data set; the log data set comprises log data and a log data set name, wherein the log data set comprises the log data and the log data set name which are generated by the same process of the log generation equipment in a first time period;

a determining unit configured to determine a log data set name of the log data set;

the searching unit is used for searching whether the name of the log data set exists in an existing log data set name list;

and the rejection storage unit is used for rejecting to store the log data set if the name of the log data set is found.

A log deduplication device integrated with a log generation device comprises:

the second acquisition unit is used for acquiring a log data set generated by the equipment in the same process in a first time period;

the first adding identification unit is used for generating a log data set name and adding the log data set name to the log data set;

the first sending unit is configured to send the log data set including the log data set name to the log collecting device, so that the log collecting device searches whether the log data set name exists in a log data set name list, and if the log data set name is found, the log data set is rejected to be stored.

A log deduplication device integrated in a log collection apparatus comprises:

a third acquisition unit configured to acquire a log data set; the log data set comprises log data generated by the same process of the log generation equipment in a first time period and a log data set identifier;

the determining unit is used for determining the name and the sequence number of the log data set in the log data set identifier of the log data set; wherein the sequence number is used for distinguishing different log data sets with the same log data set name;

a storage state identification determining unit, configured to determine, in a storage state identification array corresponding to the name of the log data set, a storage state identification corresponding to the sequence number;

and the rejection storage unit is used for rejecting to store the log data set if the storage state identifier corresponding to the sequence number represents the state of the stored log data set.

A log deduplication device integrated with a log generation device comprises:

the fourth acquisition unit is used for acquiring a log data set generated by the equipment in the same process in a first time period;

a second adding identification unit, configured to add a log data set identification to the log data set; the log data set identification comprises a log data set name and serial numbers used for distinguishing different log data sets with the same log data set name;

the second sending unit is used for sending the log data set to the log collection equipment so that the log collection equipment can determine the name and the sequence number of the log data set in the log data set identifier of the log data set, and determine the storage state identifier corresponding to the sequence number in the storage state identifier array corresponding to the name of the log data set; and if the storage state identifier corresponding to the sequence number represents the state of the stored log data set, refusing to store the log data set.

Through the above technical means, the following beneficial effects can be realized:

The present application does not deduplicate log data entry by entry; instead, it performs the deduplication operation on whole log data sets. A log data set can include at most 4096 pieces of log data, so a single deduplication operation covers multiple pieces of log data at once. Therefore, the CPU resource consumption of the log collection equipment can be reduced, and the log deduplication operation efficiency can be improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a schematic structural diagram of a log deduplication system disclosed in an embodiment of the present application;

FIG. 2 is a flowchart of a log deduplication method disclosed in an embodiment of the present application;

FIG. 3 is a flowchart of another log deduplication method disclosed in an embodiment of the present application;

FIG. 4 is a flowchart of another log deduplication method disclosed in an embodiment of the present application;

FIG. 5 is a schematic diagram of a bit array disclosed in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a content distribution network system disclosed in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The system timestamp refers to the total number of seconds elapsed from 00:00:00 Greenwich Mean Time on January 1, 1970 (08:00:00 Beijing time on January 1, 1970) to the present. For example, a system timestamp of 1501925027 seconds corresponds to August 5, 2017, Beijing time.
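As a quick check of the example above, the conversion can be sketched in Python. The function name `to_beijing_date` is hypothetical, and Beijing time is modeled here simply as a fixed UTC+8 offset.

```python
from datetime import datetime, timedelta, timezone

# Beijing time modeled as a fixed UTC+8 offset (no time zone database needed).
BEIJING = timezone(timedelta(hours=8))


def to_beijing_date(system_timestamp: int) -> str:
    """Convert seconds since 1970-01-01 00:00:00 GMT to a Beijing-time date."""
    moment = datetime.fromtimestamp(system_timestamp, tz=BEIJING)
    return moment.strftime("%Y-%m-%d")


print(to_beijing_date(1501925027))  # 2017-08-05
```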

The method and the device do not perform duplicate removal operation on the log data one by one, but perform log duplicate removal operation on the log data set. The log data set can include 4096 log data at most, so deduplication operations can be performed on multiple pieces of log data in the log data set at once. Therefore, the CPU resource consumption of the log collection device can be reduced, and the log deduplication operation efficiency can be improved.

To facilitate understanding by those skilled in the art, a log deduplication system is described first; it includes a plurality of log generating devices 100 and a log collecting device 200.

According to an embodiment provided by the present application, a first embodiment of a log deduplication method is provided. Referring to fig. 2, the method specifically includes the following steps:

Step S201: the log generation device 100 obtains a log data set generated by a process in a first time period.

The log generating device may run a plurality of processes at the same time, and the procedure is the same for every process, so one process is taken as an example for explanation.

The log generation device can continuously collect and cache log data generated by the process, and after the collection time reaches a first time period, a plurality of pieces of log data of the process in the first time period form a log data set. A log data set may include up to 4096 logs.

For example, taking the first time period as 30 seconds as an example, the log generating device may collect log data of one process within 30 seconds, and form a log data set with a plurality of pieces of log data within 30 seconds.

Step S202: the log generation device 100 builds a log data set identification for the log data set. Wherein the log data set identification comprises a log data set name.

The log data generated by a process of the log generating device differs from one first time period to another. Therefore, in order to uniquely identify the log data set generated by the process within a given first time period, the log generating device 100 may obtain the device identifier of the device, the process identifier of the process, and the current system timestamp.

And then, determining a character string consisting of the equipment identifier, the process identifier and the system timestamp as the name of the log data set.

For example, if the device identifier is represented by H, the process identifier is represented by P, and the system timestamp is represented by T, the log dataset name = CONCAT (H, P, T); wherein CONCAT is a composition string function.
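A minimal sketch of this naming rule follows. The function name `dataset_name` and the underscore separator are assumptions made for illustration; the application only states that the three values are combined into one character string.

```python
def dataset_name(device_id: str, process_id: int, system_timestamp: int,
                 sep: str = "_") -> str:
    """Build CONCAT(H, P, T); the separator is an illustrative assumption."""
    return sep.join((device_id, str(process_id), str(system_timestamp)))


print(dataset_name("host-01", 4321, 1501925027))  # host-01_4321_1501925027
```

A separator is used here so that, for example, device "a1" with process 23 and device "a" with process 123 do not produce the same string.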

Step S203: the log generating device 100 adds a log data set identification to the log data set.

The log generating device 100 adds the determined log data set identification to the log data set.

Step S204: the log generation device 100 sends the log data set containing the log data set identification to the log collection device 200.

Step S205: the log collection device 200 searches whether the log data set name exists in the log data set name list, if so, the step S206 is performed, and if not, the step S207 is performed.

For each log data set it has stored, the log collection device 200 records the log data set name in a log data set name list. That is, if the log data set name list includes a given log data set name, the log collection device 200 has already stored the log data set corresponding to that name.

Therefore, the log collection device 200 searches the log data set name list for whether there is a log data set name, and if so, it indicates that the log collection device has already stored the log data set, and the process proceeds to step S206. If not, it indicates that the log collection device has not stored the log data set, and the process proceeds to step S207.

Step S206: the log collection device 200 refuses to store the log data set.

To avoid storing the same log data set twice, the log collection device 200 may refuse to store the log data set. For example, the log collection device may refuse storage by discarding or deleting the log data set.

Step S207: the log collection device 200 stores the log data set corresponding to the log data set name, and adds the log data set name to the log data set name list.

In the event that the log collection device 200 does not store the log data set, the log collection device will store the log data set. In order to avoid storing the log data set again later, the log data set name is added to the log data set name list.

In the embodiment, log data sets are used for log deduplication, and the log data sets can contain 4096 log data at most, so that deduplication operation can be performed on a plurality of pieces of log data in the log data sets at one time. Therefore, the CPU resource consumption of the log collection device can be reduced, and the log deduplication operation efficiency can be improved.
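Steps S205 to S207 on the log collection device can be sketched as below. This is a simplified illustration, not the application's implementation: the class name `NameListDeduplicator` is hypothetical, a Python `set` stands in for the log data set name list, and an in-memory list stands in for real storage.

```python
class NameListDeduplicator:
    """Deduplicates whole log data sets by their log data set name."""

    def __init__(self):
        self._stored_names = set()  # stand-in for the log data set name list
        self.storage = []           # stand-in for the real log store

    def offer(self, name: str, records) -> bool:
        """Return True if the log data set was stored, False if refused."""
        if name in self._stored_names:            # S205: name found
            return False                          # S206: refuse to store
        self.storage.append((name, list(records)))  # S207: store the set
        self._stored_names.add(name)              # S207: record its name
        return True


dedup = NameListDeduplicator()
print(dedup.offer("host-01_4321_1501925027", ["line 1", "line 2"]))  # True
print(dedup.offer("host-01_4321_1501925027", ["line 1", "line 2"]))  # False
```

One membership test covers every piece of log data in the set, which is the source of the CPU saving described above.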

The present application further provides a log deduplication device integrated in a log collection device 200, including:

a first acquisition unit configured to acquire a log data set; the log data set comprises a log data set name and log data generated by the same process of the device in a first time period;

a storage unit configured to store the log data set if the name of the log data set is not found, and to add the name of the log data set to the log data set name list.

Wherein the log data set name comprises: a device identifier of the device; a process identifier of the process in the device; and a system timestamp taken when the device generates the log data set.

The present application further provides a log deduplication device, including:

and the first sending unit is used for sending the log data set containing the name of the log data set.

The first addition identification unit is specifically configured to: acquiring a device identifier of the device, a process identifier of the process and a system timestamp; determining a character string consisting of the device identifier, the process identifier and the system timestamp as the name of the log data set; and adding the name of the log data set to the log data set.

For specific implementation of the log deduplication device, reference may be made to the embodiment shown in fig. 2, and details are not described here.

In the first embodiment, the log collection device 200 needs to maintain the log data set name list, and the number of log data set names is huge because the number of log data sets is huge, which results in a large storage space occupied by the log data set name list.

In order to reduce the storage space occupied by the log data set name list, the present application provides a second embodiment of a log deduplication method.

Referring to fig. 3, the method specifically includes the following steps:

Step S301: the log generating device 100 obtains a log data set generated by one process of the device during a first time period. This step is the same as step S201 and is not described again here.

Step S302: the log generating device 100 adds a log data set identifier to the log data set; the log data set identification comprises a log data set name and a sequence number, and the sequence number is used for distinguishing log data sets with the same log data set name in a second time period. Wherein the second time period comprises a number of the first time periods.

In the first embodiment, each first time period corresponds to one log data set name. In order to reduce the number of log data set names, a second time period is set in this embodiment, and the second time period includes a plurality of first time periods. In this embodiment, each second time period corresponds to one log data set name; that is, the log data sets of all first time periods within the same second time period share the same log data set name.

For example, taking the first time period as 30 seconds and the second time period as 600 seconds as an example, 600/30=20 log data set names need to be set for the second time period of 600 seconds in the first embodiment. In the present embodiment, 1 log data set name is set in the second period, and 20 log data sets each adopt the log data set name.

For the purpose of enabling all log data sets in the second time period to adopt the same log data set name, the embodiment provides a process for determining the log data set name.

Referring to fig. 4, the method comprises the following steps:

step S401: the log generating device 100 acquires the device identification of the device, the process identification of the process, and the system timestamp.

The log generation device obtains the device identifier, the process identifier of the process in the device, and obtains the current system timestamp after obtaining the log data set.

Step S402: the log generating device 100 calculates a quotient of the system timestamp and the second time period, and calculates a product of the integer and the second time period, where the integer is obtained by rounding down the quotient.

In the second embodiment, the current system timestamp is not used directly. Instead, the quotient of the system timestamp and the second time period is calculated and rounded down to obtain an integer; after rounding down, the integer is the same for every system timestamp within the same second time period.

Then, the product of the integer and the second time period is calculated. The product is taken as the uniform system timestamp for the second time period. Expressed as a formula: [T/D] × D, where T is the system timestamp, D is the second time period, and [x] denotes rounding x down.

For example, take the first time period as 30 seconds and the second time period as the 600 seconds from 15019200 to 15019799 (15019200 = 25032 × 600, 15019799 = 25033 × 600 - 1). The process of generating a log data set within 30 seconds may be performed a plurality of times within these 600 seconds.

Assuming that the current system timestamp corresponding to the log data set is 15019248 seconds (25032 × 600 < 15019248 < 25033 × 600), then [15019248/600] × 600 = 25032 × 600 = 15019200. That is, within the 600 seconds from 15019200 to 15019799, whatever the system timestamp is, the calculation in this step yields 15019200 (25032 × 600); this value is the uniform system timestamp of the second time period.
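The calculation in step S402 maps directly onto integer floor division. The sketch below uses a hypothetical function name, `unified_timestamp`, and reproduces the numbers from the example.

```python
def unified_timestamp(system_timestamp: int, second_period: int) -> int:
    """Compute [T/D] x D: round T down to the start of its second time period."""
    return (system_timestamp // second_period) * second_period


print(unified_timestamp(15019248, 600))  # 15019200
print(unified_timestamp(15019799, 600))  # 15019200
print(unified_timestamp(15019800, 600))  # 15019800
```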

Step S403: the log generating device 100 determines a character string composed of the device identifier, the process identifier, and the product as the log data set name.

For example, taking device identification H, process identification P, and product [ T/D ] × D as an example, the log dataset name = CONCAT (H, P, [ T/D ] × D); wherein CONCAT is a composition string function.

Since the device identifications are consistent, the process identifications are consistent, and the product is also consistent, the log data set names determined according to this step are also consistent. That is, the names of the respective log data sets within the second time period are all consistent.
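The consistency argument can be illustrated with a short sketch. The function name `bucketed_dataset_name` and the underscore separator are assumptions for illustration only.

```python
def bucketed_dataset_name(device_id: str, process_id: int,
                          system_timestamp: int, second_period: int,
                          sep: str = "_") -> str:
    """Build CONCAT(H, P, [T/D] x D); the separator is an assumption."""
    product = (system_timestamp // second_period) * second_period
    return sep.join((device_id, str(process_id), str(product)))


# Two log data sets generated 30 seconds apart in the same 600-second period
# receive the same name; a set from the next period receives a different one.
print(bucketed_dataset_name("H", 7, 15019248, 600))  # H_7_15019200
print(bucketed_dataset_name("H", 7, 15019278, 600))  # H_7_15019200
print(bucketed_dataset_name("H", 7, 15019800, 600))  # H_7_15019800
```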

Because the names of the plurality of log data sets are consistent in the second time period, in order to distinguish the log data sets, an attribute is added to the log data set identifier: a serial number. The sequence number is used to distinguish log data sets having the same log data set name generated during the second time period.

A plurality of log data sets may be generated in the second time period; the specific number is obtained by rounding up the quotient of the second time period and the first time period. The log collection device subsequently constructs a storage state identification array according to the number of the log data sets. The number of storage state identifications is greater than or equal to the number of the log data sets.

The storage state identification array comprises a plurality of storage state identifications. Each storage state identification has two states: one represents the stored log data set state, and the other represents the unstored log data set state.

In the initial situation, the storage state identification corresponding to each sequence number in the storage state identification array represents the unstored log data set state.

After receiving the log data set, the log collection device determines the name and the sequence number of the log data set; determines the storage state identification corresponding to the sequence number in the storage state identification array corresponding to the name of the log data set; and, if the storage state identification corresponding to the sequence number represents the stored log data set state, refuses to store the log data set.

The storage state identification array may be implemented using a bit array, the number of bits of the bit array being greater than or equal to the number of log data sets. Subsequently, taking the bit array as an example, a specific implementation of the storage state identification array is described. See steps S306-S308 for the use of bit arrays.

For example, if 20 log data sets can be generated in the second time period, the minimum number of bits in the subsequently constructed bit array is 20; that is, a bit array of 20 or more bits can be constructed.
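The minimum number of bits follows directly from the two time periods. The sketch below uses a hypothetical helper name, `dataset_count`, and computes the rounded-up quotient with integer arithmetic.

```python
def dataset_count(second_period: int, first_period: int) -> int:
    """Rounded-up quotient of the second time period and the first time period."""
    # -(-a // b) is ceiling division for positive integers.
    return -(-second_period // first_period)


# 600-second second time period, 30-second first time period.
print(dataset_count(600, 30))
```

For 600 and 30 seconds the result is 20, matching the example; a first time period that does not divide the second time period evenly (for example 45 seconds) rounds up to 14.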

The application provides two implementation ways for determining the sequence number corresponding to the log data set:

The first implementation: in the second time period, starting from a first initial count value, the count value is increased by 1 each time a log data set is generated, until the count value has been increased a preset number of times. The preset number of times is obtained by rounding up the quotient of the second time period and the first time period.

For example, taking a 20-bit bit array as an example, the first initial count value is 0. Each time a log data set is generated, the count value is incremented by 1 and the resulting value is assigned to that log data set as its sequence number; when the next log data set is generated, the count value is incremented by 1 again and assigned to that log data set, and so on.

Since 20 log data sets are generated in the second time period when the second time period is 600 seconds and the first time period is 30 seconds, after the sequence number has been incremented 20 times, each log data set in the second time period has been allocated a different sequence number.

That is, log data set 1 is assigned sequence number 1, log data set 2 is assigned sequence number 2, ..., and log data set 20 is assigned sequence number 20, so that each log data set has a different sequence number.

The second implementation: in the second time period, starting from a second initial count value, the count value is decreased by 1 each time a log data set is generated, until the count value has been decreased a preset number of times. The preset number of times is obtained by rounding up the quotient of the second time period and the first time period.

For example, taking a 20-bit bit array as an example, log data set 1 is assigned sequence number 20, log data set 2 is assigned sequence number 19, ..., and log data set 20 is assigned sequence number 1, so that each log data set has a different sequence number.

This method is similar to the first implementation, except that counting is performed in a descending manner, and the process is not described again.

Of course, in addition to the above two implementation manners, the sequence number may also be implemented in other manners as long as the log data sets generated in the second time period and having the same log data set name can be distinguished.

It will be appreciated that after a second time period has elapsed, i.e. after counting a predetermined number of times from the initial count value, another second time period is started. In another second time period, the sequence numbers are counted again from the initial counting value, so that different sequence numbers are assigned to different log data sets in another second time period.
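Both counting implementations, together with the restart at the beginning of each new second time period, can be sketched as follows. The class name `SequenceCounter`, the use of the system timestamp to detect a new second time period, and the error raised on overflow are assumptions for illustration and are not prescribed by the text.

```python
import math


class SequenceCounter:
    """Hands out sequence numbers that are unique within one second time period."""

    def __init__(self, first_period: int, second_period: int, descending: bool = False):
        self.second_period = second_period
        # Preset number of times: rounded-up quotient of the two time periods.
        self.max_count = math.ceil(second_period / first_period)
        self.descending = descending
        self._bucket = None  # index of the second time period being counted
        self._issued = 0     # sequence numbers issued in that period so far

    def next_seq(self, system_ts: int) -> int:
        bucket = system_ts // self.second_period
        if bucket != self._bucket:
            # Another second time period has started: count again from the start.
            self._bucket = bucket
            self._issued = 0
        if self._issued >= self.max_count:
            raise RuntimeError("more log data sets than the preset number of times")
        self._issued += 1
        if self.descending:
            return self.max_count - self._issued + 1  # 20, 19, ..., 1
        return self._issued                           # 1, 2, ..., 20


ascending = SequenceCounter(first_period=30, second_period=600)
print([ascending.next_seq(15019200 + 30 * i) for i in range(20)])
```

Passing `descending=True` gives the second implementation; in either mode the first data set of the next second time period receives the initial sequence number again.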

After determining the log data set name and the sequence number corresponding to the log data set, the log generation device 100 determines a character string formed by the log data set name and the sequence number as the log data set identifier. Denoting the log data set name as name and the sequence number as seq, the log data set identifier = CONCAT(name, seq).

Then, returning to fig. 3, the flow proceeds to step S303: the log generation device 100 sends the log data set containing the log data set identification to the log collection device 200.

Step S304: the log collection device 200 obtains a log data set containing a log data set identifier, and parses the log data set to obtain a log data set name and a log data set sequence number.

The log collection device 200 obtains a log dataset that includes a log dataset identifier, where the log dataset identifier includes a log dataset name and a sequence number seq.
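The identifier round trip (CONCAT on the log generation device, parsing in step S304 on the log collection device) might look like the sketch below. The "#" separator and the function names `make_identifier` and `parse_identifier` are assumptions; the text does not specify how name and seq are delimited.

```python
def make_identifier(name: str, seq: int, sep: str = "#") -> str:
    """Log generation side: CONCAT(name, seq) with an assumed separator."""
    return f"{name}{sep}{seq}"


def parse_identifier(identifier: str, sep: str = "#") -> tuple:
    """Log collection side (step S304): recover the name and the sequence number."""
    # Split on the last separator so the name itself may contain the separator.
    name, seq = identifier.rsplit(sep, 1)
    return name, int(seq)


identifier = make_identifier("H_P_15019200", 3)
print(parse_identifier(identifier))
```

Splitting from the right keeps parsing unambiguous as long as the sequence number is a plain integer.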

Step S305: the log collection device 200 searches the log data set name list for the log data set name; if the log data set name is not found, the process proceeds to step S306, and if the log data set name is found, the process proceeds to step S308.

The log collection device 200 searches for the log data set name in the log data set name list. If the log data set name is not found, the log data set name has not been stored, and the process goes to step S306; if the log data set name is found, the log data set name has already been stored, and the process proceeds to step S308.

Step S306: if the bit array corresponding to the name of the log data set does not exist, store the log data set and construct the bit array corresponding to the name of the log data set, wherein the initial value of each bit in the bit array represents the unstored log data set state; then modify the value of the bit corresponding to the sequence number in the bit array so that the modified value represents the stored log data set state.

If the bit array corresponding to the name of the log data set does not exist, the log data set corresponding to the name of the log data set is not stored, and therefore the log data set is stored.

Then, according to a preset bit number (the bit number of the bit array is greater than or equal to the number of the log data sets in the second time period), a bit array corresponding to the name of the log data set is constructed, and an initial value representing the state that the log data set is not stored is given to each bit.

For example, a bit array of 20 bits corresponding to the name of the log dataset is constructed, and the 20 bits are each assigned to "0"; where a "1" indicates a stored log data set state and a "0" indicates an unstored log data set state.

Then, the sequence number in the log data set identifier is acquired, and the value of the bit corresponding to the sequence number in the bit array is set so that the modified value represents the stored log data set state. For example, taking the sequence number 1 as an example, in order to indicate that the log data set corresponding to that sequence number under the log data set name is already stored, the 1st bit of the bit array is set to "1".
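One possible realization of the bit array keeps all bits in a single integer. The class name `BitArray` and its method names are hypothetical; the mapping of sequence number 1 to the lowest bit is likewise an assumption, since the text only requires that each sequence number correspond to one bit.

```python
class BitArray:
    """Fixed-size bit array held in one Python int; sequence number 1 is the lowest bit."""

    def __init__(self, size: int = 20):
        self.size = size
        self.bits = 0  # every bit starts at "0": log data set not stored

    def _mask(self, seq: int) -> int:
        if not 1 <= seq <= self.size:
            raise ValueError(f"sequence number {seq} outside 1..{self.size}")
        return 1 << (seq - 1)

    def is_stored(self, seq: int) -> bool:
        return bool(self.bits & self._mask(seq))

    def mark_stored(self, seq: int) -> None:
        self.bits |= self._mask(seq)

    def __str__(self) -> str:
        # Render bit 1 first so the string reads left to right by sequence number.
        return "".join("1" if self.is_stored(s) else "0"
                       for s in range(1, self.size + 1))


bits = BitArray(20)
bits.mark_stored(1)
print(bits)
```

After sequence number 1 is marked, the printed string starts with "1" followed by nineteen "0" characters.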

Step S307: the log collection device 200 adds the log data set name to the log data set name list, and constructs a correspondence between the log data set name and the bit array.

The log collection device adds the log data set name to the log data set name list to indicate that the log data set name has been stored once and that a bit array corresponding to the log data set name exists.

The correspondence between the log data set name and the bit array is constructed so that the bit array can subsequently be found based on the log data set name.

Step S308: if the log collection device 200 finds the name of the log data set in the log data set name list, it determines the bit array corresponding to the name of the log data set, and determines the value of the bit corresponding to the serial number in the bit array.

The log collection device 200 finds the log data set name in the log data set name list, and determines a bit array corresponding to the log data set name based on a pre-established correspondence. Then, the serial number in the log data set identifier is obtained, and the numerical value of the bit corresponding to the serial number is determined from the bit array.

Step S309: refuse to store the log data set when the value of the bit corresponding to the sequence number represents the stored log data set state.

For example, if the value of the bit corresponding to the sequence number is "1", this indicates that the log data set has already been stored, and the log data set is rejected from being stored again in order to avoid repeated storage of the log data set.

Step S310: storing the log data set under the condition that the numerical value of the bit corresponding to the sequence number indicates that the log data set is not stored; and modifying the numerical value of the bit corresponding to the sequence number in the bit array so that the modified numerical value represents the state of the stored log data set.
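Steps S305-S310 can be summarized in a short sketch of the collection side. The class name `LogCollector`, the use of a dictionary as both the name list and the name-to-bit-array correspondence, and the in-memory `stored` list are assumptions for illustration only.

```python
class LogCollector:
    """Sketch of steps S305-S310 on the log collection device."""

    def __init__(self, bit_count: int = 20):
        self.bit_count = bit_count
        self.bit_arrays = {}  # log data set name -> bit array kept as an int
        self.stored = []      # stands in for the real storage backend

    def receive(self, name: str, seq: int, data_set) -> bool:
        """Return True if the data set was stored, False if it was refused."""
        if not 1 <= seq <= self.bit_count:
            raise ValueError("sequence number outside the bit array")
        mask = 1 << (seq - 1)
        if name not in self.bit_arrays:
            # S305 not found -> S306/S307: build an all-zero bit array for the name.
            self.bit_arrays[name] = 0
        elif self.bit_arrays[name] & mask:
            # S308/S309: the bit is already "1", so this is a duplicate.
            return False
        # S306 or S310: store the data set and set the bit to "1".
        self.stored.append(data_set)
        self.bit_arrays[name] |= mask
        return True


collector = LogCollector()
print(collector.receive("H_P_15019200", 1, "data set 1"))
print(collector.receive("H_P_15019200", 1, "data set 1"))
```

The first call stores the data set and the second, identical call is refused, without inspecting the individual log entries inside the data set.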

For ease of understanding, reference is made to fig. 5, which illustrates, by way of example:

The technician determines the first time period, the second time period and the number of bits of the bit array in advance. Take a first time period of 30 seconds, a second time period of 600 seconds and a 20-bit bit array as an example:

The log generating device 100 generates a log data set 1, whose log data set identifier 1 is (name, seq = 1), in the first 30 seconds of the second time period, and sends the log data set 1 to the log collecting device.

The log collection device 200 receives the log data set 1 and finds, upon checking, that there is no bit array corresponding to the name of the log data set. It therefore constructs a 20-bit bit array and establishes a correspondence between the bit array and the name of the log data set; the initial value of each bit in the bit array is "0", indicating that none of the log data sets in the second time period has been stored.

Then, the log collection device stores the log data set 1, and sets the 1 st bit (1 is the sequence number of the log data set 1) in the bit array to "1", indicating that the log data set 1 in the second time period has been stored.

When the log collection device 200 receives the log data set 1 again and determines that the value corresponding to the sequence number 1 in the bit array is "1", the log collection device 200 knows that the log data set 1 is already stored, and therefore, the log collection device does not store the log data set 1 any more.

In each subsequent 30 seconds, the log generating device 100 generates a further log data set: a log data set 2 with a log data set identifier 2 of (name, seq = 2), ..., and a log data set 20 with a log data set identifier 20 of (name, seq = 20). The log generating device sends log data set 1 through log data set 20 one by one.

Subsequently, after receiving the log data set 2, the log collection device sets the 2nd bit of the bit array to "1"; after receiving the log data set 3, it sets the 3rd bit of the bit array to "1"; ...; and after receiving the log data set 20, it sets the 20th bit of the bit array to "1".

A content distribution network system is described below, and with reference to fig. 6, the content distribution network system specifically includes: a central node 300 and an edge node 400 connected to the central node 300, the edge node 400 comprising a number of servers 401.

Two implementations of the log deduplication method in the content distribution network system are described below:

The first implementation:

the server 401 is configured to collect a log data set generated by one process in a first time period, and add a log data set identifier to the log data set; wherein the log data set identification comprises a log data set name; sending the log data set containing the log data set identification to a streaming message system;

the central node 300 is configured to obtain a log data set including a log data set identifier from a streaming message system; searching whether the name of the log data set exists in an existing log data set name list; and if so, refusing to store the log data set.

In this embodiment, the server is equivalent to the log generating device in the first embodiment shown in fig. 2, and the central node is equivalent to the log collecting device. Therefore, for a specific implementation of the first implementation, reference may be made to the description of the first embodiment shown in fig. 2, and details are not described here again.
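The name-only check performed by the central node in this first implementation can be sketched with a set acting as the log data set name list. The class name `NameOnlyDeduplicator` is hypothetical, and the streaming message system is not modeled; the data sets are simply passed in directly.

```python
class NameOnlyDeduplicator:
    """Sketch of the central node in the first implementation (name check only)."""

    def __init__(self):
        self.seen_names = set()  # existing log data set name list
        self.stored = []         # stands in for the real storage backend

    def receive(self, name: str, data_set) -> bool:
        """Return True if the data set was stored, False if it was refused."""
        if name in self.seen_names:
            # The name already exists in the list: refuse to store the data set.
            return False
        # Name not found: store the data set and record its name.
        self.stored.append(data_set)
        self.seen_names.add(name)
        return True


node = NameOnlyDeduplicator()
print(node.receive("H_P_15019200", "data set"))
print(node.receive("H_P_15019200", "data set"))
```

Because no sequence number is involved here, each log data set name can be stored only once, which is the case the second implementation relaxes.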

The second implementation:

the server 401 is configured to collect a log data set generated by the same process in a first time period, add a log data set identifier to the log data set, and send the log data set to a streaming message system; the log data set identification comprises log data set names adopted by a plurality of log data sets and serial numbers used for distinguishing different log data sets with the same log data set name;

the central node 300 is configured to obtain a log data set from the streaming message system, and determine a name and a sequence number of the log data set in a log data set identifier of the log data set; determining a storage state identifier corresponding to the sequence number in a storage state identifier array corresponding to the name of the log data set; and if the storage state identification corresponding to the sequence number represents the state of the stored log data set, refusing to store the log data set.

In this embodiment, the server corresponds to the log generating device in the second embodiment shown in fig. 3, and the central node corresponds to the log collecting device. Therefore, for a specific implementation of the second implementation, reference may be made to the description of the second embodiment shown in fig. 3, and details are not described here again.

The application provides a log deduplication device, comprising:

the determining unit is used for determining the name and the serial number of the log data set in the log data set identifier of the log data set; wherein the sequence number is used for distinguishing different log data sets with the same log data set name;

a storage state identification determining unit, configured to determine a storage state identification corresponding to the sequence number in a storage state identification array corresponding to the log data set name;

and a rejection storage unit, configured to refuse to store the log data set if the storage state identification corresponding to the sequence number represents the stored log data set state. The storage state identification determining unit specifically includes: a searching unit, configured to search whether the name of the log data set exists in a log data set name list; and a storage state identification determining unit, configured to determine the storage state identification corresponding to the sequence number in the storage state identification array corresponding to the log data set name if the log data set name is found.

The storage unit is used for adding the name of the log data set into the name list of the log data set if the name of the log data set is not found; constructing a storage state identification array corresponding to the name of the log data set, wherein the storage state identifications in the storage state identification array all represent the states of the log data set which is not stored; storing the log data set; and changing the storage state identifier corresponding to the sequence number in the storage state identifier array so that the storage state identifier represents the stored log data set state.

Wherein one first time period corresponds to one log data set, a plurality of first time periods form a second time period, and the names of the log data sets corresponding to the respective first time periods within the second time period are consistent; the sequence number in the log data set identifier is used to distinguish the log data sets with the same log data set name in the second time period.

Wherein the log data set name in the log data set identifier comprises: a device identification of the log generation device; a process identifier of the process in the log generation device; and a system timestamp corresponding to the second time period, wherein the system timestamp is the product of an integer and the second time period, and the integer is obtained by rounding down the quotient of the system timestamp corresponding to the log data set and the second time period.

The sequence number in the log data set identifier comprises: a value between 1 and a maximum value that the log generation device determines for the log data set, wherein different log data sets correspond to different sequence numbers within the second time period; and the maximum value is obtained by rounding up the quotient of the second time period and the first time period.

The storage state identification array comprises a bit array, bits in the bit array correspond to the sequence numbers, and data values of the bits represent storage state identifications.

The application also provides a log duplicate removal device, which is characterized by comprising:

a second adding identification unit, configured to add a log data set identification to the log data set; the log data set identification comprises a log data set name and a serial number used for distinguishing different log data sets with the same log data set name;

the second sending unit is used for sending the log data set to the log collection equipment so that the log collection equipment can determine the name and the serial number of the log data set in the log data set identifier of the log data set, and determine the storage state identifier corresponding to the serial number in the storage state identifier array corresponding to the name of the log data set; and if the storage state identification corresponding to the sequence number represents the state of the stored log data set, refusing to store the log data set.

Wherein the second adding identification unit includes:

determining a log data set name corresponding to the log data set; determining a sequence number corresponding to the log data set; determining a character string consisting of the name of the log data set and the sequence number as the identifier of the log data set; and adding the log data set identification to the log data set.

Wherein the determining the name of the log data set corresponding to the log data set comprises:

acquiring a device identifier of the device, a process identifier of the process and a system timestamp; calculating a quotient value of the system timestamp and the second time period, rounding the quotient value downwards to obtain an integer, and calculating a product of the integer and the second time period; and determining a character string formed by the device identification, the process identification and the product as the name of the log data set.

Wherein the determining the sequence number corresponding to the log data set includes: in the second time period, starting from the first initial count value, increasing by 1 each time a log data set is generated until the count value has been increased a preset number of times; or, in the second time period, starting from a second initial count value, decreasing by 1 each time a log data set is generated until the count value has been decreased a preset number of times.

For specific implementation of the log deduplication device, reference may be made to the embodiment shown in fig. 3, which is not described herein again.

The functions described in the method of the present embodiment, if implemented in the form of software functional units and sold or used as independent products, may be stored in a storage medium readable by a computing device. Based on such understanding, part of the technical solutions or portions of the embodiments contributing to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In the present specification, the embodiments are described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same or similar parts between the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A log deduplication method is characterized in that,

applied to a log collection device, the method comprises the following steps:

determining the name and the sequence number of the log data set in the log data set identifier of the log data set; wherein the sequence number is used for distinguishing different log data sets with the same log data set name;

2. The method of claim 1,

determining the storage state identifier corresponding to the sequence number in the storage state identifier array corresponding to the log data set name, including:

and if the name of the log data set is found, determining the storage state identifier corresponding to the sequence number in the storage state identifier array corresponding to the name of the log data set.

3. The method of claim 2,

further comprising:

constructing a storage state identification array corresponding to the name of the log data set, wherein the storage state identifications in the storage state identification array all represent the states of the log data set which are not stored;

4. The method of claim 1,

wherein one first time period corresponds to one log data set, a plurality of first time periods form a second time period, and the names of the log data sets corresponding to the respective first time periods within the second time period are consistent;

the sequence number in the log data set identifier is used to distinguish the log data sets with the same log data set name in the second time period.

5. The method of claim 4,

the log data set name in the log data set identifier includes:

a device identification of the log generation device;

a process identifier of the process in the log generation device;

a system timestamp corresponding to the second time period;

6. The method of claim 4,

the sequence number in the log dataset identification comprises:

7. The method of claim 1,

8. A log deduplication method is characterized in that,

applied to a log generating device, the method comprising:

9. The method of claim 8,

adding a log data set identifier to the log data set comprises:

determining the name of the log data set corresponding to the log data set;

determining a sequence number corresponding to the log data set;

and adding the log data set identification to the log data set.

10. The method of claim 9,

the determining the name of the log data set corresponding to the log data set includes:

calculating a quotient value of the system timestamp and a second time period, rounding the quotient value downwards to obtain an integer, and calculating a product of the integer and the second time period; wherein one first time period corresponds to one log data set, a plurality of first time periods form the second time period, and the names of the log data sets corresponding to the respective first time periods within the second time period are consistent; the sequence number in the log data set identifier is used for distinguishing log data sets with the same log data set name in the second time period;

11. The method of claim 9,

the determining the sequence number corresponding to the log data set includes:

in a second time period, starting from the first initial count value, increasing by 1 each time a log data set is generated until the count value has been increased a preset number of times; wherein one first time period corresponds to one log data set, a plurality of first time periods form the second time period, and the names of the log data sets corresponding to the respective first time periods within the second time period are consistent; the sequence number in the log data set identifier is used for distinguishing log data sets with the same log data set name in the second time period; or,

12. A content distribution network system characterized in that,

comprising:

the system comprises a central node and edge nodes connected with the central node, wherein the edge nodes comprise a plurality of servers;

the server is used for collecting the log data set generated by the same process in a first time period, adding a log data set identifier to the log data set and sending the log data set to the streaming message system; the log data set identification comprises log data set names adopted by a plurality of log data sets and serial numbers used for distinguishing different log data sets with the same log data set name;

13. A log deduplication system is provided,

comprising:

the log generation device is used for collecting log data sets generated in the same process in a first time period, adding a log data set identifier to the log data sets and sending the log data sets to the log collection device; the log data set identification comprises log data set names adopted by a plurality of log data sets and serial numbers used for distinguishing different log data sets with the same log data set name;

the log collection device is used for acquiring a log data set and determining the name and the serial number of the log data set in a log data set identifier of the log data set; determining a storage state identifier corresponding to the sequence number in a storage state identifier array corresponding to the name of the log data set; and if the storage state identification corresponding to the sequence number represents the state of the stored log data set, refusing to store the log data set.

14. A log deduplication method is characterized in that,

applied to a log collection device, the method comprising:

determining a log data set name for the log data set;

15. The method of claim 14,

further comprising:

if the name of the log data set is not found, storing the log data set;

and adding the name of the log data set in the name list of the log data set.

16. The method of claim 15,

the determining a log data set name of the log data set includes: extracting the log dataset name in the log dataset;

wherein the log data set name comprises:

a device identification of the log generation device;

a process identifier of the process in the log generation device;

17. A log deduplication method is characterized in that,

applied to a log generating device, the method comprising:

sending a log data set containing a log data set name to log collection equipment so that the log collection equipment can search whether the log data set name exists in an existing log data set name list, and if the log data set name is found, refusing to store the log data set.

18. The method of claim 17,

adding a log data set name to the log data set comprises:

and adding the name of the log data set to the log data set.

19. A log deduplication system,

comprising:

the log generation device is used for collecting log data sets generated by the same process in a first time period and adding log data set names to the log data sets;

sending the log data set to a log collection device;

the log collection device is used for acquiring a log data set sent by the log generation device;

searching whether a name of a log data set in the log data set exists in an existing log data set name list; and if so, refusing to store the log data set.

20. A content distribution network system characterized in that,

comprising:

the system comprises a central node and edge nodes connected with the central node, wherein each edge node comprises a plurality of servers;

the central node is used for acquiring a log data set from a streaming message system;

21. A log deduplication device is characterized in that,

integrated into a log collection device, comprising:

22. A log deduplication device is characterized in that,

integrated in a log generation device, comprising:

23. A log deduplication device is characterized in that,

integrated into a log collection device, comprising:

24. A log deduplication device is characterized in that,

integrated in a log generation device, comprising: