CN106649344B

CN106649344B - Weblog compression method and device

Info

Publication number: CN106649344B
Application number: CN201510728041.7A
Authority: CN
Inventors: 才宇东
Original assignee: Huawei Digital Technologies Suzhou Co Ltd
Current assignee: Huawei Digital Technologies Suzhou Co Ltd
Priority date: 2015-10-31
Filing date: 2015-10-31
Publication date: 2020-01-10
Anticipated expiration: 2035-10-31
Also published as: CN106649344A

Abstract

The invention discloses a weblog compression method and device for solving the problem that conventional weblog compression methods have a low compression ratio. The method comprises the following steps: analyzing an acquired weblog to determine at least one feature contained in the weblog; if the service type union of an existing first data set does not contain the first feature of the weblog, determining the similarity between the feature set of the weblog and the feature set of the first data set; if the similarity between the feature set of the weblog and the feature set of the first data set is greater than a set threshold, merging the weblog into the first data set; if the similarity is not greater than the set threshold, creating a second data set and merging the weblog into the second data set; and compressing and storing each data set. The number of compressed packets is thereby effectively reduced, which further reduces the storage space.

Description

Weblog compression method and device

Technical Field

The present invention relates to the field of network technologies, and in particular, to a method and an apparatus for compressing weblogs.

Background

With the internet now highly developed, weblog collection and query systems are widely used. IT systems, network devices and security devices generate large volumes of weblogs whose formats often differ greatly, and this large amount of unstructured data must be adapted to a weblog collection and query system so that service analysis can be performed. Faced with massive unstructured data, collected weblogs are generally compressed before being stored, which saves storage resources and reduces the cost of purchasing storage devices.

One commonly used method for compressing weblogs is as follows: all collected weblogs are first stored uniformly, and the stored weblogs are then compressed and stored a second time. Because the weblogs are stored, then compressed, and the resulting compressed packets are finally written to disk, the process comprises one write, one read and another write, which wastes input/output (I/O) resources. In addition, different weblogs typically have different features, referred to here as heterogeneous features. Because of the large number of heterogeneous features, the similarity between the weblogs being compressed is low, and the compression ratio is therefore low.

Another commonly used method for compressing weblogs is as follows: all collected weblogs are first compressed uniformly, and the resulting compressed packets are then written to disk, so the process comprises one read and one write. Although one write is saved, a large amount of field data with heterogeneous features is still present during compression, and the compression ratio remains low.

A third commonly used method for compressing weblogs is as follows: the collected weblogs are classified by service type, and the weblogs of each service type are compressed and stored separately. Although the compression ratio is better than with the first two methods, weblogs have many service types, and compressing and storing the weblogs of each service type separately still requires a large storage space, so the overall compression ratio is still low.

In summary, as the number of weblogs grows ever larger, the low compression ratio of the existing weblog compression methods means that compressed weblogs occupy a large storage space.

Disclosure of Invention

The embodiment of the invention provides a weblog compression method and device, which are used for solving the problem of low compression ratio of the existing weblog compression method.

In a first aspect, a weblog compression method is provided, the method comprising:

analyzing the acquired weblog to determine at least one feature contained in the weblog;

if the service type union of an existing first data set does not contain the first feature of the weblog, determining the similarity between the feature set of the weblog and the feature set of the first data set, wherein the first feature is the feature, among the at least one feature, that represents the service type of the weblog, the service type union of the first data set is the union of the service types of the weblogs in the first data set, the feature set of the weblog is the set formed by the features of the weblog, and the feature set of the first data set is the union of the features of all the weblogs in the first data set;

if the similarity between the characteristic set of the weblog and the characteristic set of the first data set is determined to be larger than a set threshold value, merging the weblog into the first data set; if the similarity between the characteristic set of the weblog and the characteristic set of the first data set is not larger than a set threshold value, creating a second data set, and merging the weblog into the second data set;

compressing and storing each data set, wherein if the data set comprises the first data set, the first data set is compressed and stored; and if the data set comprises the first data set and the second data set, respectively compressing and storing the first data set and the second data set.

In the method of the embodiment of the present invention, when the service type union of the existing first data set does not contain the first feature of the weblog, the weblog is classified according to the similarity between the feature set of the weblog and the feature set of the first data set. The merging scheme provided by the invention can classify weblogs that have different service types but high similarity into the same class, thereby effectively reducing the number of compressed packets and further reducing the storage space.

In a possible implementation manner, determining the similarity between the feature set of the weblog and the feature set of the first data set includes:

determining a first numerical value and a second numerical value, wherein the first numerical value is the number of features in the intersection of the feature set of the weblog and the feature set of the first data set, and the second numerical value is the number of features in the union of the feature set of the weblog and the feature set of the first data set;

and determining the similarity between the feature set of the weblog and the feature set of the first data set according to the first numerical value and the second numerical value, wherein the similarity between the feature set of the weblog and the feature set of the first data set is the ratio of the first numerical value to the second numerical value.
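Written as a formula, the ratio described above takes the following form. The symbols F_w (feature set of the weblog) and F_D (feature set of the first data set) are introduced here only for illustration and do not appear in the original text; the ratio coincides with what is commonly called the Jaccard index of two sets.

```latex
\mathrm{sim}(F_w, F_D) \;=\; \frac{\lvert F_w \cap F_D \rvert}{\lvert F_w \cup F_D \rvert}
```

The numerator is the first numerical value and the denominator is the second numerical value, so the similarity always lies between 0 and 1.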

In a possible implementation manner, after merging the weblog into the first data set, the method further includes:

determining a union of the feature set of the weblog and the feature set of the first data set as the feature set of the first data set.

In a possible implementation, the compressing and storing process performed on each data set includes:

after the number of stored weblogs reaches a set first threshold, compressing and storing each data set; or

after the sum of the data amounts of the stored weblogs reaches a set second threshold, compressing and storing each data set; or

when a set compression period arrives, compressing and storing each data set.

In a possible implementation manner, each data set is compressed and stored in a columnar storage mode. Because columnar storage is used for compression and storage, a higher compression ratio can be obtained.

In a possible implementation manner, after determining at least one feature contained in the weblog, the method further includes:

if it is determined, according to the first feature of the weblog, that the service type union of the existing first data set contains the first feature, merging the weblog into the first data set.

In a possible implementation manner, after compressing and storing each data set, the method further includes:

forming a third data set according to at least one characteristic contained in the weblog collected in a set time period;

if the service type union of the third data set is a subset of the service type union of the first data set, replacing the first data set with the third data set, wherein the service type union of the third data set is the union of the service types of the weblogs in the third data set;

and if the data set comprises the first data set and the second data set, and the service type union of the third data set is a subset of the service type union of the second data set, replacing the second data set with the third data set.

In a second aspect, a weblog compression apparatus is provided, the apparatus comprising:

a feature analysis module, configured to analyze the acquired weblog and determine at least one feature contained in the weblog;

a first processing module, configured to determine, if the service type union of an existing first data set does not contain the first feature of the weblog, the similarity between the feature set of the weblog and the feature set of the first data set, where the first feature is the feature, among the at least one feature, that represents the service type of the weblog, the service type union of the first data set is the union of the service types of the weblogs in the first data set, the feature set of the weblog is the set formed by the features of the weblog, and the feature set of the first data set is the union of the features of all the weblogs in the first data set;

a second processing module, configured to merge the weblog into the first data set if it is determined that a similarity between the feature set of the weblog and the feature set of the first data set is greater than a set threshold; if the similarity between the characteristic set of the weblog and the characteristic set of the first data set is not larger than a set threshold value, creating a second data set, and merging the weblog into the second data set;

the compression module is used for compressing and storing each data set, wherein if the data set comprises the first data set, the first data set is compressed and stored; and if the data set comprises the first data set and the second data set, respectively compressing and storing the first data set and the second data set.

In the apparatus of the embodiment of the present invention, when the service type union of the existing first data set does not contain the first feature of the weblog, the weblog is classified according to the similarity between the feature set of the weblog and the feature set of the first data set. The merging scheme provided by the invention can classify weblogs that have different service types but high similarity into the same class, thereby effectively reducing the number of compressed packets and further reducing the storage space.

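In a possible implementation manner, when determining the similarity between the feature set of the weblog and the feature set of the first data set, the first processing module is specifically configured to:

determine a first numerical value and a second numerical value, where the first numerical value is the number of features in the intersection of the feature set of the weblog and the feature set of the first data set, and the second numerical value is the number of features in the union of the two feature sets; and determine the similarity as the ratio of the first numerical value to the second numerical value.

In a possible implementation manner, after merging the weblog into the first data set, the second processing module is further configured to:

determine the union of the feature set of the weblog and the feature set of the first data set as the feature set of the first data set.

In a possible implementation manner, when the compression module performs compression and storage processing on each data set, the compression module is specifically configured to:

compress and store each data set after the number of stored weblogs reaches a set first threshold; or

compress and store each data set after the sum of the data amounts of the stored weblogs reaches a set second threshold; or

compress and store each data set when a set compression period arrives.

In a possible implementation manner, the first processing module is further configured to:

merge the weblog into the first data set if it is determined, according to the first feature of the weblog, that the service type union of the existing first data set contains the first feature.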

In a possible implementation manner, the apparatus further includes:

the optimization module is used for forming a third data set according to at least one characteristic contained in the weblog collected in a set time period; if the service type union of the third data set is a subset of the service type union of the first data set, replacing the first data set with the third data set, wherein the service type union of the third data set is the union of the service types of the weblogs in the third data set; and if the data set comprises the first data set and the second data set, and the service type union of the third data set is a subset of the service type union of the second data set, replacing the second data set with the third data set.

In a third aspect, a server comprises: the system comprises a processor, an input interface, an output interface, a memory and a system bus; wherein:

when the server runs, the processor reads the program in the memory and executes the method embodiment.

The memory is used for storing data used by the processor when executing operations;

the input interface is used for reading in data under the control of the processor;

the output interface is used for outputting data under the control of the processor.

In the server according to the embodiment of the present invention, when the service type union of the existing first data set does not contain the first feature of the weblog, the weblog is classified according to the similarity between the feature set of the weblog and the feature set of the first data set. The merging scheme provided by the invention can classify weblogs that have different service types but high similarity into the same class, thereby effectively reducing the number of compressed packets and further reducing the storage space.

Drawings

Fig. 1 is a schematic diagram of a weblog compression method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of another weblog compression method according to an embodiment of the present invention;

Fig. 3 is a diagram of a classification tree formed in accordance with an embodiment of the present invention;

Fig. 4 is a schematic diagram of a weblog compression apparatus according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of another weblog compression apparatus according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of a server according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention will be described in further detail with reference to the drawings attached hereto. It is to be understood that the embodiments described herein are merely illustrative and explanatory of the invention and are not restrictive thereof.

As shown in fig. 1, a method for compressing a weblog according to an embodiment of the present invention includes:

S11, analyzing the collected weblog to determine the features contained in the weblog.

The features of a weblog are its fields, which store different contents, such as srcip (source IP), dstip (destination IP), srcport (source port), dspport (destination port), and so on.
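As a minimal sketch of S11, the snippet below extracts the field names of one weblog. The text does not specify a log format, so the space-separated key=value layout, the function name `extract_features`, and the sample line are all assumptions made for illustration.

```python
def extract_features(log_line: str) -> set:
    """Return the field names found in a key=value style log line.

    The key=value layout is an assumed format; real weblogs may need
    a different parser per device type.
    """
    features = set()
    for token in log_line.split():
        name, separator, _value = token.partition("=")
        if separator and name:  # skip tokens that are not key=value pairs
            features.add(name)
    return features


# Hypothetical sample line using field names from the description.
sample = "eventType=IPS srcip=10.0.0.1 dstip=10.0.0.2 srcport=1234 dspport=80"
print(sorted(extract_features(sample)))
```

Only the field names are kept, because the classification steps that follow compare which fields a weblog has, not the values stored in them.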

S12, if the service type union of the existing first data set does not contain the first feature of the weblog, determining the similarity between the feature set of the weblog and the feature set of the first data set.

In this embodiment of the present invention, the first feature is a feature used to indicate a service type of the weblog in the at least one feature.

For example, the first feature of the weblog is the eventType field in the weblog, which stores the service type of the weblog, such as an Intrusion Prevention System (IPS) service type, a LOGIN service type, a Distributed Denial of Service (DDoS) service type, and the like.

In this embodiment of the present invention, the union of the service types of the first data set is a union of the service types of the weblogs in the first data set.

For example, assuming that weblog 1 in a data set belongs to the IPS service type, weblog 2 also belongs to the IPS service type, weblog 3 belongs to the LOGIN service type, and weblog 4 belongs to the DDoS service type, the service type union corresponding to the data set is {IPS service type, LOGIN service type, DDoS service type}.

In the embodiment of the present invention, the feature set of the weblog is a set composed of the features of the weblog.

In this embodiment of the present invention, the feature set of the first data set is a union of features of all weblogs in the first data set.

For example, it is assumed that the first data set includes two weblogs, and the characteristics of the first weblog include srcip, dstip, srcport, dspport, natsrcip, natdspip, username, and descriptor; the characteristics of the second weblog comprise srcip, dstip, srcport, dspport, username, appname and domain; the feature set of the first data set is then:

{srcip, dstip, srcport, dspport, natsrcip, natdspip, username, descriptor, appname, domain}.
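The union in this example can be reproduced with ordinary set operations. The sketch below uses the field names from the example above; the variable names are illustrative only.

```python
# Feature sets of the two weblogs in the example above.
first_weblog = {"srcip", "dstip", "srcport", "dspport",
                "natsrcip", "natdspip", "username", "descriptor"}
second_weblog = {"srcip", "dstip", "srcport", "dspport",
                 "username", "appname", "domain"}

# The feature set of the data set is the union over all member weblogs.
data_set_features = set().union(first_weblog, second_weblog)

print(len(data_set_features))  # 10
print(sorted(data_set_features))
```

Features shared by both weblogs are counted once, which is why eight plus seven fields yield a feature set of ten.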

S13A, if the similarity between the characteristic set of the weblog and the characteristic set of the first data set is determined to be larger than a set threshold value, the weblog is merged to the first data set.

S13B, if the similarity between the characteristic set of the weblog and the characteristic set of the first data set is not larger than the set threshold value, creating a second data set, and merging the weblog into the second data set.

S14, compressing and storing each data set; wherein: if the data set comprises the first data set, compressing and storing the first data set; and if the data set comprises the first data set and the second data set, respectively compressing and storing the first data set and the second data set.

In the embodiment of the invention, each data set is compressed and stored by taking the data set as a unit.

For example, if the data set includes first data sets, each first data set is compressed and stored; and if the data set comprises a first data set and a second data set, respectively compressing and storing the first data set and the second data set.

In this embodiment of the present invention, when the service type union of the existing first data set does not contain the first feature of the weblog, the weblog is classified according to the similarity between the feature set of the weblog and the feature set of the first data set, specifically: if the similarity between the feature set of the weblog and the feature set of the first data set is greater than a set threshold, the weblog is merged into the first data set; if the similarity is not greater than the set threshold, a second data set is created and the weblog is merged into the second data set. The merging scheme provided by the invention can classify weblogs that have different service types but high similarity into the same class, thereby effectively reducing the number of compressed packets and further reducing the storage space.

In this embodiment of the present invention, as another optional implementation manner, as shown in fig. 2, after S11, the method further includes:

S15, if it is determined, according to the first feature of the weblog, that the service type union corresponding to the existing first data set contains the first feature, merging the weblog into the first data set.
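Steps S11, S15, S12, S13A and S13B can be sketched together as one classification routine. This is an illustrative sketch, not the patented implementation: the `DataSet` class, the `classify` function, the default threshold of 0.5, and the choice to take the first sufficiently similar data set when several exist are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    service_types: set = field(default_factory=set)  # service type union
    features: set = field(default_factory=set)       # union of member feature sets
    logs: list = field(default_factory=list)


def similarity(a: set, b: set) -> float:
    """Ratio of shared features to all features (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(log: dict, data_sets: list, threshold: float = 0.5,
             type_field: str = "eventType") -> DataSet:
    features = set(log)                 # S11: field names act as features
    service_type = log.get(type_field)  # the "first feature"

    # S15: a data set that already covers this service type takes the log.
    target = next((ds for ds in data_sets
                   if service_type in ds.service_types), None)
    if target is None:
        # S12 + S13A: otherwise apply the similarity test to each data set.
        target = next((ds for ds in data_sets
                       if similarity(features, ds.features) > threshold), None)
    if target is None:
        # S13B: nothing is similar enough, so open a new data set.
        target = DataSet()
        data_sets.append(target)

    target.logs.append(log)
    target.service_types.add(service_type)
    target.features |= features         # keep the feature set up to date
    return target


data_sets = []
a = classify({"eventType": "IPS", "srcip": "10.0.0.1",
              "dstip": "10.0.0.2", "srcport": "1234"}, data_sets)
b = classify({"eventType": "LOGIN", "srcip": "10.0.0.1",
              "dstip": "10.0.0.2", "username": "alice"}, data_sets)
c = classify({"eventType": "DDoS", "netid": "7", "localinfo": "x"}, data_sets)
print(len(data_sets))  # 2
```

In this run the LOGIN weblog shares three of five fields with the IPS data set (similarity 0.6) and is merged into it, while the DDoS weblog shares only one of seven fields and starts a data set of its own.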

In this embodiment of the present invention, determining, in S12, the similarity between the feature set of the weblog and the feature set of the first data set includes: determining a first numerical value and a second numerical value, where the first numerical value is the number of features in the intersection of the feature set of the weblog and the feature set of the first data set, and the second numerical value is the number of features in the union of the two feature sets; and determining the similarity as the ratio of the first numerical value to the second numerical value.

In a specific implementation, a knowledge base may be preset, where the knowledge base is a feature sequence formed by arranging the features in the feature sets of all weblogs according to a set ordering rule. To determine the first numerical value and the second numerical value, the features in the feature set of the weblog are first arranged according to the set ordering rule to form a first feature sequence, and the features in the feature set of the first data set are arranged according to the same rule to form a second feature sequence. The first feature sequence and the second feature sequence are then each compared with the knowledge base to form a first marker sequence and a second marker sequence. Both marker sequences have the same length as the knowledge base and are bit sequences containing only 0 and 1. In the first marker sequence, a bit with value 1 corresponds to a feature contained in the weblog, and a bit with value 0 corresponds to a feature not contained in the weblog. In the second marker sequence, a bit with value 1 corresponds to a feature contained in the feature set of the first data set, and a bit with value 0 corresponds to a feature not contained in that feature set.

For example, assume that a first feature sequence formed by a feature set of a weblog according to a set sorting rule is as follows: srcip, dstip, srcport, dspport, natsrcip, natdspip, username, descriptor;

the second characteristic sequence formed by the characteristic set of the first data set according to the set sorting rule is as follows: srcip, dstip, srcport, dspport, username, appname, domain;

the set knowledge base is as follows: srcip, dstip, srcport, dspport, natsrcip, natdspip, username, descriptor, appname, domain, netid, localinfo;

Then the first marker sequence, formed by comparing the first feature sequence with the set knowledge base, is: 1,1,1,1,1,1,1,1,0,0,0,0; and the second marker sequence, formed by comparing the second feature sequence with the set knowledge base, is: 1,1,1,1,0,0,1,0,1,1,0,0. The number of positions at which both marker sequences are 1 is 5 (the first numerical value); the number of positions at which at least one of the two marker sequences is 1 is 10 (the second numerical value). The similarity between the feature set of the weblog and the feature set of the first data set is therefore 5/10 = 0.5.
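The marker-sequence calculation in this example can be checked with a short sketch. The knowledge base and both feature lists are taken from the example above; the function names `marker_sequence` and `marker_similarity` are illustrative.

```python
KNOWLEDGE_BASE = ["srcip", "dstip", "srcport", "dspport", "natsrcip", "natdspip",
                  "username", "descriptor", "appname", "domain", "netid", "localinfo"]


def marker_sequence(features, knowledge_base=KNOWLEDGE_BASE):
    """One bit per knowledge-base entry: 1 if the feature is present, else 0."""
    return [1 if name in features else 0 for name in knowledge_base]


def marker_similarity(first, second):
    """First numerical value (both bits 1) over second (either bit 1)."""
    both = sum(a & b for a, b in zip(first, second))
    either = sum(a | b for a, b in zip(first, second))
    return both / either if either else 0.0


first = marker_sequence({"srcip", "dstip", "srcport", "dspport",
                         "natsrcip", "natdspip", "username", "descriptor"})
second = marker_sequence({"srcip", "dstip", "srcport", "dspport",
                          "username", "appname", "domain"})

print(first)   # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(second)  # [1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
print(marker_similarity(first, second))  # 0.5
```

Representing each feature set as a fixed-length bit sequence turns the intersection and union counts into position-wise AND and OR operations.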

Optionally, after the weblog is merged into the first data set in S13A, the feature set of the first data set is updated; that is, the union of the feature set of the weblog and the feature set of the first data set is determined as the new feature set of the first data set.

In the embodiment of the present invention, classification in the above manner forms the classification tree shown in Fig. 3. In this tree, the first classification, the second classification and so on are parent nodes, each representing a formed data set; service class 1, service class 2 and so on are child nodes, each representing weblogs contained in the data set.

In the embodiment of the present invention, the compression and storage processing of each data set in S14 may be triggered in any of the following three modes:

Mode 1, event A trigger: compression and storage processing is triggered after the number of stored weblogs reaches a set first threshold. Specifically:

after the number of stored weblogs reaches the set first threshold (for example, the first threshold may be 1000), each data set is compressed and stored.

Mode 2, event B trigger: compression and storage processing is triggered after the sum of the data amounts of the stored weblogs reaches a set second threshold. Specifically:

after the sum of the data amounts of the stored weblogs reaches the set second threshold (for example, the second threshold may be 100 MB), each data set is compressed and stored.

Mode 3, periodic trigger: compression and storage processing is triggered each time a set compression period arrives. Specifically:

when a set compression period arrives, each data set is compressed and stored.
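The three trigger modes can be sketched as a small bookkeeping class. The description presents the modes as alternatives; checking all three together, the class name `CompressionTrigger`, the one-hour default period, and reading 100 MB as 100 x 1024 x 1024 bytes are assumptions of this sketch.

```python
import time


class CompressionTrigger:
    """Tracks stored weblogs and reports when compression should run."""

    def __init__(self, max_logs=1000, max_bytes=100 * 1024 * 1024,
                 period_s=3600.0, clock=time.monotonic):
        self.max_logs = max_logs    # Mode 1: first threshold (log count)
        self.max_bytes = max_bytes  # Mode 2: second threshold (data amount)
        self.period_s = period_s    # Mode 3: compression period (assumed value)
        self._clock = clock
        self.reset()

    def reset(self):
        """Call after each compression run to start a new accounting window."""
        self.count = 0
        self.size = 0
        self.started = self._clock()

    def record(self, log_bytes: int):
        """Account for one stored weblog of the given size."""
        self.count += 1
        self.size += log_bytes

    def should_compress(self) -> bool:
        return (self.count >= self.max_logs
                or self.size >= self.max_bytes
                or self._clock() - self.started >= self.period_s)


trigger = CompressionTrigger(max_logs=2)
trigger.record(120)
print(trigger.should_compress())  # False
trigger.record(80)
print(trigger.should_compress())  # True
```

Passing the clock in as a parameter keeps the periodic mode testable without waiting for real time to elapse.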

Based on any of the above embodiments, optionally, in S14 each data set is compressed and stored in a columnar storage mode, which can yield a higher compression ratio.

Of course, the embodiment of the present invention is not limited to columnar storage; each data set may also be compressed and stored in other manners known in the art, such as a row-based storage manner.
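A minimal sketch of columnar compression for one data set follows. The description does not name a compression algorithm or serialization, so the use of zlib and JSON, and the function names, are assumptions made for illustration.

```python
import json
import zlib


def compress_columnar(logs, feature_names):
    """Compress each feature's values as a separate column.

    Returns a dict mapping feature name to compressed bytes. Weblogs that
    lack a feature contribute None (JSON null) to that column.
    """
    columns = {}
    for name in sorted(feature_names):
        values = [log.get(name) for log in logs]
        columns[name] = zlib.compress(json.dumps(values).encode("utf-8"))
    return columns


def decompress_column(blob):
    """Recover the list of values stored in one compressed column."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


logs = [
    {"srcip": "10.0.0.1", "dstip": "10.0.0.9", "username": "alice"},
    {"srcip": "10.0.0.1", "dstip": "10.0.0.9", "appname": "ftp"},
]
packets = compress_columnar(logs, {"srcip", "dstip", "username", "appname"})
print(decompress_column(packets["srcip"]))     # ['10.0.0.1', '10.0.0.1']
print(decompress_column(packets["username"]))  # ['alice', None]
```

Grouping values of the same field places similar data next to each other, which is the property the description relies on for a higher compression ratio.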

Based on any of the above embodiments, optionally, after each data set is compressed and stored in S14, a compressed packet corresponding to each data set is obtained, and each compressed packet is stored in TLV format, where T represents a feature identifier (such as srcip, dstip, srcport, etc.), L represents the length of the compressed packet, and V represents the compressed packet itself.

For example, a TLV is a triplet whose name stands for Type, Length and Value. The T and L fields usually have a fixed length (typically 1 to 4 bytes), and the V field has a variable length. In the embodiment of the present invention, T represents a feature identifier (that is, it indicates which feature of the weblog is stored), L represents the length of the stored compressed packet, and V represents the stored compressed packet.
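The TLV layout can be sketched with Python's `struct` module. The exact field widths and byte order are not given in the description; the 2-byte type, 4-byte length, big-endian header and the numeric identifiers in `FEATURE_IDS` are assumptions of this sketch.

```python
import struct

# Assumed header layout: 2-byte type, 4-byte length, big-endian, no padding.
HEADER = struct.Struct(">HI")

# Hypothetical mapping from feature names to numeric type identifiers.
FEATURE_IDS = {"srcip": 1, "dstip": 2, "srcport": 3}


def encode_tlv(type_id: int, value: bytes) -> bytes:
    """Prefix a compressed packet with its type identifier and length."""
    return HEADER.pack(type_id, len(value)) + value


def decode_tlvs(buffer: bytes):
    """Yield (type_id, value) pairs from a concatenation of TLV records."""
    offset = 0
    while offset < len(buffer):
        type_id, length = HEADER.unpack_from(buffer, offset)
        offset += HEADER.size
        yield type_id, buffer[offset:offset + length]
        offset += length


stream = (encode_tlv(FEATURE_IDS["srcip"], b"packet-a")
          + encode_tlv(FEATURE_IDS["dstip"], b"pkt-b"))
print(list(decode_tlvs(stream)))  # [(1, b'packet-a'), (2, b'pkt-b')]
```

Because each record carries its own length, a reader can skip to the packet of a particular feature without decompressing the others.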

Based on any of the above embodiments, after each data set is compressed and stored in S14, the service type union of each data set may be optimized. Specifically, a third data set is formed according to the features contained in the weblogs collected within a set time period, and an existing data set is replaced with the third data set if the service type union of the third data set is a subset of the service type union of that existing data set.

For example, after the compression and storage processing of the weblogs is completed, the currently formed classification tree may be optimized as follows. A new data set (namely, a third data set) is formed according to the features contained in the weblogs collected within a set time period, for example within 1 day before the current time, so as to form an optimized classification tree. If the service type union of the third data set is a subset of the service type union of the first data set, the first data set is replaced with the third data set. If the data sets comprise the first data set and the second data set, and the service type union of the third data set is a subset of the service type union of the second data set, the second data set is replaced with the third data set. In this way the original classification tree is replaced by the optimized classification tree.
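The subset test behind this replacement can be sketched as follows. For brevity each data set is represented only by its service type union; the function name `replace_with_third`, the rule of replacing only the first matching data set, and the "FTP" service type are assumptions of this sketch.

```python
def replace_with_third(service_type_unions, third):
    """Replace the first data set whose service type union contains `third`.

    `service_type_unions` is a list of sets, one per existing data set.
    Returns the index that was replaced, or None if no data set qualified.
    """
    for index, current in enumerate(service_type_unions):
        if third <= current:  # subset test on the service type unions
            service_type_unions[index] = set(third)
            return index
    return None


existing = [{"IPS", "LOGIN", "DDoS"}, {"FTP"}]  # "FTP" is a made-up type
print(replace_with_third(existing, {"IPS", "LOGIN"}))  # 0
print(sorted(existing[0]))                            # ['IPS', 'LOGIN']
```

A third data set built from recent weblogs therefore narrows an existing data set to the service types that are still being observed.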

The above method process flow may be implemented by a software program, which may be stored in a storage medium, and when the stored software program is called, the above method steps are performed.

Based on the same inventive concept, an embodiment of the present invention further provides a weblog compression apparatus, where a principle of the apparatus to solve the problem is similar to that of the above-mentioned weblog compression method, and parts of the apparatus that are the same as the above-mentioned method are specifically referred to in the description of the embodiment shown in fig. 1 and fig. 2, and are not described again here.

A weblog compression apparatus according to an embodiment of the present invention, as shown in Fig. 4, includes:

the feature analysis module 41 is configured to analyze the collected weblog to determine at least one feature included in the weblog;

a first processing module 42, configured to determine, if an existing service type union of a first data set does not include a first feature of the weblog, a similarity between a feature set of the weblog and the feature set of the first data set, where the first feature is a feature of the at least one feature, which is used to represent the service type of the weblog, the service type union of the first data set is a union of the service types of the weblogs in the first data set, the feature set of the weblog is a set formed by the features of the weblog, and the feature set of the first data set is a union of the features of all the weblogs in the first data set;

a second processing module 43, configured to merge the weblog into the first data set if it is determined that the similarity between the feature set of the weblog and the feature set of the first data set is greater than a set threshold; if the similarity between the characteristic set of the weblog and the characteristic set of the first data set is not larger than a set threshold value, creating a second data set, and merging the weblog into the second data set;

a compression module 44, configured to perform compression and storage processing on each data set, where if the data set includes the first data set, the compression and storage processing is performed on the first data set; and if the data set comprises the first data set and the second data set, respectively compressing and storing the first data set and the second data set.

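In the embodiment of the invention, when the service type union of the existing first data set does not contain the first feature of the weblog, the weblog is classified according to the similarity between the feature set of the weblog and the feature set of the first data set. The merging scheme provided by the invention can classify weblogs that have different service types but high similarity into the same class, thereby effectively reducing the number of compressed packets and further reducing the storage space.

Optionally, when the first processing module 42 determines the similarity between the feature set of the weblog and the feature set of the first data set, it is specifically configured to:

determine a first numerical value and a second numerical value, where the first numerical value is the number of features in the intersection of the feature set of the weblog and the feature set of the first data set, and the second numerical value is the number of features in the union of the two feature sets; and determine the similarity as the ratio of the first numerical value to the second numerical value.

Based on any of the above embodiments, optionally, after the second processing module 43 merges the weblog into the first data set, the second processing module 43 is further configured to:

determine the union of the feature set of the weblog and the feature set of the first data set as the feature set of the first data set.

Optionally, the compression module 44 is specifically configured to:

compress and store each data set after the number of stored weblogs reaches a set first threshold; or

compress and store each data set after the sum of the data amounts of the stored weblogs reaches a set second threshold; or

compress and store each data set when a set compression period arrives.

As another optional implementation manner, the first processing module 42 is further configured to:

merge the weblog into the first data set if it is determined, according to the first feature of the weblog, that the service type union of the existing first data set contains the first feature.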

Based on any one of the above embodiments, optionally, as shown in fig. 5, the apparatus further includes:

an optimization module 45, configured to form a third data set according to at least one feature included in the weblog collected within a set time period; if the service type union of the third data set is a subset of the service type union of the first data set, replacing the first data set with the third data set, wherein the service type union of the third data set is the union of the service types of the weblogs in the third data set; and if the data set comprises the first data set and the second data set, and the service type union of the third data set is a subset of the service type union of the second data set, replacing the second data set with the third data set.

In the embodiment of the present invention, the method in the embodiments shown in fig. 1 and fig. 2 may be implemented by a server, as shown in fig. 6, where the server includes: a processor 61, an input interface 62, an output interface 63, a memory 64, and a system bus 65; wherein:

the processor 61 is responsible for logical operations and processing. When the server runs, the processor 61 reads the program in the memory 64 and executes the above method embodiment, specifically: the processor 61 performs the above-described steps S11, S12, S13A, S13B, and S14. Optionally, the processor 61 may also execute the step S15.

The memory 64 includes an internal memory and a hard disk, and can store data (such as the first data set, the second data set, and the compressed packets obtained by compressing the data sets) used by the processor 61 when performing operations. The input interface 62 is used for reading in data (such as weblogs) under the control of the processor 61, and the output interface 63 is used for outputting data (such as compressed packets) under the control of the processor 61.

The bus architecture may include any number of interconnected buses and bridges, with one or more processors, represented by processor 61, and various circuits, represented by memory 64 and the hard disk, linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method of weblog compression, the method comprising:

2. The method of claim 1, wherein determining the similarity of the feature set of the weblog to the feature set of the first data set comprises:

3. The method of claim 1 or 2, wherein after merging the weblog into the first data set, the method further comprises:

4. The method of claim 1, wherein compressing and storing each data set comprises:

when a set compression period arrives, compressing and storing each data set.

5. The method of claim 1 or 4, wherein compressing and storing each data set comprises:

compressing and storing each data set in a columnar storage format.

6. The method of claim 1, wherein determining at least one feature included in the weblog further comprises:

7. The method of any one of claims 1, 2, 4, and 6, wherein after compressing and storing each data set, further comprising:

8. An apparatus for compressing a weblog, the apparatus comprising:

9. The apparatus of claim 8, wherein the first processing module is specifically configured to:

10. The apparatus of claim 8 or 9, wherein the second processing module, after merging the weblog into the first data set, is further configured to:

11. The apparatus of claim 8, wherein the compression module is specifically configured to:

compress and store each data set when a set compression period arrives.

12. The apparatus of claim 8, wherein the first processing module is further configured to:

13. The apparatus of any one of claims 8, 9, 11, 12, further comprising: