CN106649344B - Weblog compression method and device - Google Patents

Info

Publication number
CN106649344B
Authority
CN
China
Prior art keywords
data set
weblog
feature
data
service type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510728041.7A
Other languages
Chinese (zh)
Other versions
CN106649344A (en)
Inventor
才宇东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Digital Technologies Suzhou Co Ltd
Original Assignee
Huawei Digital Technologies Suzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Digital Technologies Suzhou Co Ltd filed Critical Huawei Digital Technologies Suzhou Co Ltd
Priority to CN201510728041.7A priority Critical patent/CN106649344B/en
Publication of CN106649344A publication Critical patent/CN106649344A/en
Application granted granted Critical
Publication of CN106649344B publication Critical patent/CN106649344B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a weblog compression method and device, which address the low compression ratio of conventional weblog compression methods. The method comprises the following steps: analyzing a collected weblog to determine at least one feature contained in the weblog; if the service type union of an existing first data set does not contain the first feature of the weblog, determining the similarity between the feature set of the weblog and the feature set of the first data set; if the similarity is greater than a set threshold, merging the weblog into the first data set; if the similarity is not greater than the set threshold, creating a second data set and merging the weblog into the second data set; and compressing and storing each data set. This effectively reduces the number of compressed packets and thereby the storage space required.

Description

Weblog compression method and device
Technical Field
The present invention relates to the field of network technologies, and in particular, to a method and an apparatus for compressing weblogs.
Background
With the internet now highly developed, weblog collection and query systems are widely used. IT systems, network devices, and security devices generate large numbers of weblogs whose formats often differ greatly, and this large volume of unstructured data must be adapted to a weblog collection and query system so that service analysis can be performed. Faced with massive unstructured data, collected weblogs are generally compressed before being stored, which saves storage resources and reduces the cost to the user of purchasing storage equipment.
One commonly used method for compressing weblogs is as follows: all the collected weblogs are first stored together, and the stored weblogs are then compressed and stored a second time. Because the weblogs are stored, then compressed, and the resulting compressed packets are finally written to disk, the process consists of one write, one read, and another write, which wastes input/output (IO) resources. Moreover, different weblogs typically differ in their features; such differing features are referred to as heterogeneous features. Because a large number of heterogeneous features are present when the weblogs are compressed, the similarity between weblogs is low, and so is the compression ratio.
Another commonly used method is as follows: all the collected weblogs are first compressed together, and the resulting compressed packets are then written to disk, so the process consists of one read and one write. Although one write is saved, a large amount of field data with heterogeneous features is still present during compression, and the compression ratio remains low.
A third commonly used method is as follows: the collected weblogs are classified according to their service types, and the weblogs of each service type are compressed and stored separately. Although the compression ratio is better than in the first two methods, weblogs have many service types and the weblogs of each service type are compressed and stored separately, so a large storage space is still required and the overall compression ratio is still low.
In summary, as the number of weblogs grows ever larger, the low compression ratio of existing weblog compression methods means that compressed weblogs occupy a large storage space.
Disclosure of Invention
The embodiment of the invention provides a weblog compression method and device, which are used for solving the problem of low compression ratio of the existing weblog compression method.
In a first aspect, a method for compressing a weblog, the method comprising:
analyzing the acquired weblog to determine at least one characteristic contained in the weblog;
if the service type union of an existing first data set does not contain the first feature of the weblog, determining the similarity between the feature set of the weblog and the feature set of the first data set, wherein the first feature is the feature, among the at least one feature, that represents the service type of the weblog, the service type union of the first data set is the union of the service types of the weblogs in the first data set, the feature set of the weblog is the set formed by the features of the weblog, and the feature set of the first data set is the union of the features of all the weblogs in the first data set;
if the similarity between the characteristic set of the weblog and the characteristic set of the first data set is determined to be larger than a set threshold value, merging the weblog into the first data set; if the similarity between the characteristic set of the weblog and the characteristic set of the first data set is not larger than a set threshold value, creating a second data set, and merging the weblog into the second data set;
compressing and storing each data set, wherein if the data set comprises the first data set, the first data set is compressed and stored; and if the data set comprises the first data set and the second data set, respectively compressing and storing the first data set and the second data set.
In the method of the embodiment of the present invention, when the service type union of the existing first data set does not contain the first feature of the weblog, the weblog is classified according to the similarity between the feature set of the weblog and the feature set of the first data set. This merging scheme can place weblogs that have different service types but high similarity into the same data set, thereby effectively reducing the number of compressed packets and further reducing the storage space.
In a possible implementation manner, determining the similarity between the feature set of the weblog and the feature set of the first data set includes:
determining a first numerical value and a second numerical value, wherein the first numerical value is the number of features in the intersection of the feature set of the weblog and the feature set of the first data set, and the second numerical value is the number of features in the union of the feature set of the weblog and the feature set of the first data set;
and determining the similarity between the feature set of the weblog and the feature set of the first data set according to the first numerical value and the second numerical value, wherein the similarity between the feature set of the weblog and the feature set of the first data set is the ratio of the first numerical value to the second numerical value.
In a possible implementation manner, after merging the weblog into the first data set, the method further includes:
determining a union of the feature set of the weblog and the feature set of the first data set as the feature set of the first data set.
In a possible implementation, the compressing and storing process performed on each data set includes:
after the number of stored weblogs reaches a set first threshold, compressing and storing each data set; or
after the total data amount of the stored weblogs reaches a set second threshold, compressing and storing each data set; or
when a set compression period arrives, compressing and storing each data set.
In a possible implementation, the compressing and storing process performed on each data set includes:
and compressing and storing each data set in a columnar storage mode. Because the column type storage mode is adopted for compression and storage, a higher compression ratio can be obtained.
In a possible implementation manner, after determining at least one feature contained in the weblog, the method further includes:
when it is determined, according to the first feature of the weblog, that the service type union of the first data set contains the first feature, merging the weblog into the first data set.
In a possible implementation manner, after compressing and storing each data set, the method further includes:
forming a third data set according to at least one characteristic contained in the weblog collected in a set time period;
if the service type union of the third data set is a subset of the service type union of the first data set, replacing the first data set with the third data set, wherein the service type union of the third data set is the union of the service types of the weblogs in the third data set;
and if the data set comprises the first data set and the second data set, and the service type union of the third data set is a subset of the service type union of the second data set, replacing the second data set with the third data set.
In a second aspect, an apparatus for compressing a weblog, the apparatus comprising:
the characteristic analysis module is used for analyzing the acquired weblog and determining at least one characteristic contained in the weblog;
a first processing module, configured to determine, if the service type union of an existing first data set does not contain the first feature of the weblog, the similarity between the feature set of the weblog and the feature set of the first data set, where the first feature is the feature, among the at least one feature, that represents the service type of the weblog, the service type union of the first data set is the union of the service types of the weblogs in the first data set, the feature set of the weblog is the set formed by the features of the weblog, and the feature set of the first data set is the union of the features of all the weblogs in the first data set;
a second processing module, configured to merge the weblog into the first data set if it is determined that a similarity between the feature set of the weblog and the feature set of the first data set is greater than a set threshold; if the similarity between the characteristic set of the weblog and the characteristic set of the first data set is not larger than a set threshold value, creating a second data set, and merging the weblog into the second data set;
the compression module is used for compressing and storing each data set, wherein if the data set comprises the first data set, the first data set is compressed and stored; and if the data set comprises the first data set and the second data set, respectively compressing and storing the first data set and the second data set.
In the apparatus of the embodiment of the present invention, when the service type union of the existing first data set does not contain the first feature of the weblog, the weblog is classified according to the similarity between the feature set of the weblog and the feature set of the first data set. This merging scheme can place weblogs that have different service types but high similarity into the same data set, thereby effectively reducing the number of compressed packets and further reducing the storage space.
In a possible implementation manner, when determining the similarity between the feature set of the weblog and the feature set of the first data set, the first processing module is specifically configured to:
determining a first numerical value and a second numerical value, wherein the first numerical value is the number of features in the intersection of the feature set of the weblog and the feature set of the first data set, and the second numerical value is the number of features in the union of the feature set of the weblog and the feature set of the first data set;
and determining the similarity between the feature set of the weblog and the feature set of the first data set according to the first numerical value and the second numerical value, wherein the similarity between the feature set of the weblog and the feature set of the first data set is the ratio of the first numerical value to the second numerical value.
In a possible implementation manner, after merging the weblog into the first data set, the second processing module is further configured to:
determining a union of the feature set of the weblog and the feature set of the first data set as the feature set of the first data set.
In a possible implementation manner, when the compression module performs compression and storage processing on each data set, the compression module is specifically configured to:
after the number of stored weblogs reaches a set first threshold, compressing and storing each data set; or
after the total data amount of the stored weblogs reaches a set second threshold, compressing and storing each data set; or
when a set compression period arrives, compressing and storing each data set.
In a possible implementation manner, the first processing module is further configured to:
when it is determined, according to the first feature of the weblog, that the service type union of the first data set contains the first feature, merge the weblog into the first data set.
In a possible implementation manner, the apparatus further includes:
the optimization module is used for forming a third data set according to at least one characteristic contained in the weblog collected in a set time period; if the service type union of the third data set is a subset of the service type union of the first data set, replacing the first data set with the third data set, wherein the service type union of the third data set is the union of the service types of the weblogs in the third data set; and if the data set comprises the first data set and the second data set, and the service type union of the third data set is a subset of the service type union of the second data set, replacing the second data set with the third data set.
In a third aspect, a server comprises: the system comprises a processor, an input interface, an output interface, a memory and a system bus; wherein:
when the server runs, the processor reads the program in the memory and executes the method embodiment.
The memory is used for storing data used by the processor when executing operations;
the input interface is used for reading in data under the control of the processor;
an output interface outputs data under control of the processor.
In the server according to the embodiment of the present invention, when the service type union of the existing first data set does not contain the first feature of the weblog, the weblog is classified according to the similarity between the feature set of the weblog and the feature set of the first data set. This merging scheme can place weblogs that have different service types but high similarity into the same data set, thereby effectively reducing the number of compressed packets and further reducing the storage space.
Drawings
Fig. 1 is a schematic diagram of a weblog compression method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of another weblog compression method according to an embodiment of the present invention;
Fig. 3 is a diagram of a classification tree formed in accordance with an embodiment of the present invention;
Fig. 4 is a schematic diagram of a weblog compression apparatus according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of another weblog compression apparatus according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a server according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in further detail with reference to the drawings attached hereto. It is to be understood that the embodiments described herein are merely illustrative and explanatory of the invention and are not restrictive thereof.
As shown in fig. 1, a method for compressing a weblog according to an embodiment of the present invention includes:
S11, analyzing the collected weblog to determine the features contained in the weblog.
The features of a weblog are its fields, each of which stores different content, such as srcip (source IP), dstip (destination IP), srcport (source port), dspport (destination port), and so on.
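As an illustration of S11, the following minimal sketch extracts the feature set of a weblog. The `key=value` line format, the sample values, and the function name `extract_features` are assumptions made for the example; the embodiment does not prescribe a log format.

```python
def extract_features(log_line):
    """Split a 'key=value key=value' log line into a field dictionary."""
    fields = {}
    for token in log_line.split():
        key, sep, value = token.partition("=")
        if sep and key:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


# Hypothetical weblog line; the field names follow the examples in the text.
line = "eventType=IPS srcip=10.0.0.1 dstip=10.0.0.2 srcport=4431 dspport=80"
fields = extract_features(line)
feature_set = set(fields)  # the features are the field names, not the values
```

Only the field names enter the feature set; the field values are what is later compressed.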
S12, if the service type union of the existing first data set does not contain the first feature of the weblog, determining the similarity between the feature set of the weblog and the feature set of the first data set.
In this embodiment of the present invention, the first feature is a feature used to indicate a service type of the weblog in the at least one feature.
For example, the first feature of the weblog is the eventType field in the weblog, which stores the service type of the weblog, such as an Intrusion Prevention System (IPS) service type, a LOGIN service type, a Distributed Denial of Service (DDoS) service type, and the like.
In this embodiment of the present invention, the union of the service types of the first data set is a union of the service types of the weblogs in the first data set.
For example, assuming that in a data set weblog 1 belongs to the IPS service type, weblog 2 also belongs to the IPS service type, weblog 3 belongs to the LOGIN service type, and weblog 4 belongs to the DDoS service type, the service type union corresponding to the data set is {IPS service type, LOGIN service type, DDoS service type}.
In the embodiment of the present invention, the feature set of the weblog is the set composed of the features of the weblog.
In this embodiment of the present invention, the feature set of the first data set is a union of features of all weblogs in the first data set.
For example, it is assumed that the first data set includes two weblogs, and the characteristics of the first weblog include srcip, dstip, srcport, dspport, natsrcip, natdspip, username, and descriptor; the characteristics of the second weblog comprise srcip, dstip, srcport, dspport, username, appname and domain; the feature set of the first data set is then:
{srcip,dstip,srcport,dspport,natsrcip,natdspip,username,descriptor,appname,domain}.
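The two unions defined above can be sketched with Python sets. The variable names are illustrative only, and the feature and service type names are taken from the examples in the text.

```python
first_weblog = {"srcip", "dstip", "srcport", "dspport",
                "natsrcip", "natdspip", "username", "descriptor"}
second_weblog = {"srcip", "dstip", "srcport", "dspport",
                 "username", "appname", "domain"}

# Feature set of the data set: union of the feature sets of its weblogs.
data_set_features = first_weblog | second_weblog

# Service type union: the distinct service types of the weblogs in a data set
# (here the four weblogs of the earlier IPS / LOGIN / DDoS example).
service_types_of_weblogs = ["IPS", "IPS", "LOGIN", "DDoS"]
service_type_union = set(service_types_of_weblogs)
```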
S13A, if the similarity between the characteristic set of the weblog and the characteristic set of the first data set is determined to be larger than a set threshold value, the weblog is merged to the first data set.
S13B, if the similarity between the characteristic set of the weblog and the characteristic set of the first data set is not larger than the set threshold value, creating a second data set, and merging the weblog into the second data set.
S14, compressing and storing each data set; wherein: if the data set comprises the first data set, compressing and storing the first data set; and if the data set comprises the first data set and the second data set, respectively compressing and storing the first data set and the second data set.
In the embodiment of the invention, each data set is compressed and stored by taking the data set as a unit.
For example, if only the first data set exists, the first data set is compressed and stored; if both a first data set and a second data set exist, the first data set and the second data set are compressed and stored separately.
In this embodiment of the present invention, when the service type union of the existing first data set does not contain the first feature of the weblog, the weblog is classified according to the similarity between the feature set of the weblog and the feature set of the first data set, specifically: if the similarity is greater than a set threshold, the weblog is merged into the first data set; if the similarity is not greater than the set threshold, a second data set is created and the weblog is merged into the second data set. This merging scheme can place weblogs that have different service types but high similarity into the same data set, thereby effectively reducing the number of compressed packets and further reducing the storage space.
In this embodiment of the present invention, as another optional implementation manner, as shown in fig. 2, after S11, the method further includes:
S15, when it is determined, according to the first feature of the weblog, that the service type union of the existing first data set contains the first feature, merging the weblog into the first data set.
In this embodiment of the present invention, the determining, in S12, a similarity between the feature set of the weblog and the feature set of the first data set includes:
determining a first numerical value and a second numerical value, wherein the first numerical value is the number of features in the intersection of the feature set of the weblog and the feature set of the first data set, and the second numerical value is the number of features in the union of the feature set of the weblog and the feature set of the first data set;
and determining the similarity between the feature set of the weblog and the feature set of the first data set according to the first numerical value and the second numerical value, wherein the similarity between the feature set of the weblog and the feature set of the first data set is the ratio of the first numerical value to the second numerical value.
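The ratio described above (size of the intersection divided by size of the union) is the Jaccard index of the two feature sets. A minimal sketch follows; the function name and the handling of two empty sets are assumptions, since the text does not address that case.

```python
def similarity(weblog_features, data_set_features):
    """Ratio of the first numerical value to the second numerical value."""
    first_value = len(weblog_features & data_set_features)   # intersection size
    second_value = len(weblog_features | data_set_features)  # union size
    if second_value == 0:
        return 0.0  # assumption: two empty feature sets are treated as dissimilar
    return first_value / second_value
```

The result always lies between 0 and 1, so the set threshold is naturally chosen in that range.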
In a specific implementation, a knowledge base may be preset, where the knowledge base is a feature sequence formed by features in feature sets of all weblogs according to a set ordering rule. When the first numerical value and the second numerical value are determined, firstly, forming a first characteristic sequence by the characteristics in the characteristic set of the weblog according to a set sorting rule, and forming a second characteristic sequence by the characteristics in the characteristic set of the first data set according to the set sorting rule; comparing the first characteristic sequence and the second characteristic sequence with the set knowledge base respectively to form a first mark sequence and a second mark sequence, wherein the lengths of the first mark sequence and the second mark sequence are both the same as the length of the set knowledge base, and the first mark sequence and the second mark sequence are both bit sequences only including 0 and 1, wherein the characteristic corresponding to the bit with the bit value of 1 in the first mark sequence is a characteristic included in the weblog, and the characteristic corresponding to the bit with the bit value of 0 is a characteristic not included in the weblog; the features corresponding to the bits with the bit value of 1 in the second marker sequence are the features contained in the feature set of the first data set, and the features corresponding to the bits with the bit value of 0 are the features not contained in the feature set of the first data set.
For example, assume that a first feature sequence formed by a feature set of a weblog according to a set sorting rule is as follows: srcip, dstip, srcport, dspport, natsrcip, natdspip, username, descriptor;
the second characteristic sequence formed by the characteristic set of the first data set according to the set sorting rule is as follows: srcip, dstip, srcport, dspport, username, appname, domain;
the set knowledge base is as follows: srcip, dstip, srcport, dspport, natsrcip, natdspip, username, descriptor, appname, domain, netid, localinfo;
Then the first marker sequence, formed by comparing the first feature sequence with the set knowledge base, is 1,1,1,1,1,1,1,1,0,0,0,0, and the second marker sequence, formed by comparing the second feature sequence with the set knowledge base, is 1,1,1,1,0,0,1,0,1,1,0,0. The number of positions at which both marker sequences have a 1 is 5 (i.e., the first numerical value); the number of positions at which at least one of the two marker sequences has a 1 is 10 (i.e., the second numerical value). The similarity between the feature set of the weblog and the feature set of the first data set is therefore 5/10 = 0.5.
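The marker-sequence computation in this example can be reproduced with the sketch below. The knowledge base and the two feature sequences are copied from the example; the helper name `marker_sequence` is an assumption, and each marker sequence has one bit per knowledge-base entry (twelve here).

```python
KNOWLEDGE_BASE = ["srcip", "dstip", "srcport", "dspport", "natsrcip", "natdspip",
                  "username", "descriptor", "appname", "domain", "netid", "localinfo"]


def marker_sequence(features):
    """Return one bit per knowledge-base entry: 1 if the feature is present."""
    present = set(features)
    return [1 if name in present else 0 for name in KNOWLEDGE_BASE]


weblog = ["srcip", "dstip", "srcport", "dspport",
          "natsrcip", "natdspip", "username", "descriptor"]
data_set = ["srcip", "dstip", "srcport", "dspport", "username", "appname", "domain"]

first_marker = marker_sequence(weblog)
second_marker = marker_sequence(data_set)

# First value: positions where both sequences have a 1 (bitwise AND).
first_value = sum(a & b for a, b in zip(first_marker, second_marker))
# Second value: positions where at least one sequence has a 1 (bitwise OR).
second_value = sum(a | b for a, b in zip(first_marker, second_marker))
similarity = first_value / second_value
```

Because both sequences are aligned to the same knowledge base, the comparison reduces to bitwise AND and OR, which could equally be done on packed integers.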
Optionally, after merging the weblog into the first data set in S13A, the method further includes:
determining a union of the feature set of the weblog and the feature set of the first data set as the feature set of the first data set.
Specifically, after merging the weblog into the first data set, the feature set of the first data set needs to be updated, that is, a union of the feature set of the weblog and the feature set of the first data set is determined as the feature set of the first data set.
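Steps S15, S12, S13A and S13B, together with the feature-set update, can be sketched as a single assignment routine. This is only an illustration under several assumptions: weblogs are dictionaries keyed by field name, the `eventType` field counts as one of the features, the threshold value 0.5 is arbitrary, and when several data sets exist the most similar one is tried (the text describes the comparison against a single first data set).

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    service_types: set = field(default_factory=set)
    features: set = field(default_factory=set)
    weblogs: list = field(default_factory=list)

    def merge(self, weblog):
        self.weblogs.append(weblog)
        self.service_types.add(weblog["eventType"])
        self.features |= set(weblog)  # feature set becomes the union


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign(weblog, data_sets, threshold=0.5):
    service_type = weblog["eventType"]
    # S15: a data set whose service type union contains the type takes the weblog.
    for ds in data_sets:
        if service_type in ds.service_types:
            ds.merge(weblog)
            return ds
    # S12 / S13A: otherwise compare feature sets (assumption: pick the best match).
    features = set(weblog)
    best = max(data_sets, key=lambda ds: jaccard(features, ds.features), default=None)
    if best is not None and jaccard(features, best.features) > threshold:
        best.merge(weblog)
        return best
    # S13B: similarity not greater than the threshold, so create a new data set.
    new_set = DataSet()
    new_set.merge(weblog)
    data_sets.append(new_set)
    return new_set
```

Updating `features` inside `merge` keeps the data set's feature set equal to the union of the features of all weblogs merged so far, so later similarity checks see the current state.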
In the embodiment of the present invention, classifying in the above manner forms the classification tree shown in fig. 3, in which the first classification, the second classification, and so on are parent nodes that represent the formed data sets, and service class 1, service class 2, and so on are child nodes that represent the weblogs contained in those data sets.
In the embodiment of the present invention, the compression and storage processing performed on each data set in S14 can be triggered in the following three ways:
Mode 1, event A triggering: compression and storage processing is triggered after the number of stored weblogs reaches a set first threshold, specifically:
after the number of stored weblogs reaches the set first threshold (for example, the first threshold may be 1000), each data set is compressed and stored.
Mode 2, event B triggering: compression and storage processing is triggered after the total data amount of the stored weblogs reaches a set second threshold, specifically:
after the total data amount of the stored weblogs reaches the set second threshold (for example, the second threshold may be 100 MB), each data set is compressed and stored.
Mode 3, periodic triggering: compression and storage processing is triggered each time a set compression period arrives, specifically:
when a set compression period arrives, each data set is compressed and stored.
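The three trigger modes can be sketched as one small helper. The text presents them as alternatives; combining them in a single class, the class and parameter names, the one-hour period, and reading 100 MB as 100 × 1024 × 1024 bytes are all assumptions made for illustration.

```python
import time


class CompressionTrigger:
    def __init__(self, max_count=1000, max_bytes=100 * 1024 * 1024,
                 period_seconds=3600.0, clock=time.monotonic):
        self.max_count = max_count            # mode 1: first threshold
        self.max_bytes = max_bytes            # mode 2: second threshold
        self.period_seconds = period_seconds  # mode 3: compression period
        self._clock = clock
        self.reset()

    def reset(self):
        """Call after each compression run to start a new accumulation window."""
        self.count = 0
        self.total_bytes = 0
        self.last_run = self._clock()

    def record(self, weblog_bytes):
        """Account for one newly stored weblog."""
        self.count += 1
        self.total_bytes += len(weblog_bytes)

    def should_compress(self):
        return (self.count >= self.max_count
                or self.total_bytes >= self.max_bytes
                or self._clock() - self.last_run >= self.period_seconds)
```

A caller would invoke `record` for every stored weblog, check `should_compress`, compress each data set when it returns true, and then call `reset`.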
Based on any of the above embodiments, optionally, the compressing and storing process performed on each data set in S14 includes:
and compressing and storing each data set in a columnar storage mode. Because the column type storage mode is adopted for compression and storage, a higher compression ratio can be obtained.
Of course, the embodiment of the present invention is not limited to columnar storage; each data set may also be compressed and stored in other manners known in the art, such as row storage.
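A minimal sketch of columnar compression of one data set is given below. The use of `zlib` and JSON, the representation of missing fields as `None`, and the function names are assumptions chosen for the example; the embodiment does not name a compression algorithm or serialization format.

```python
import json
import zlib


def compress_columnar(weblogs):
    """Compress a data set column by column; returns {feature: compressed bytes}."""
    columns = sorted({name for log in weblogs for name in log})
    packets = {}
    for name in columns:
        # A missing field is stored as None so each column has one entry per weblog.
        values = [log.get(name) for log in weblogs]
        packets[name] = zlib.compress(json.dumps(values).encode("utf-8"))
    return packets


def decompress_columnar(packets):
    """Rebuild the list of weblogs from the per-feature compressed columns."""
    columns = {name: json.loads(zlib.decompress(blob).decode("utf-8"))
               for name, blob in packets.items()}
    row_count = len(next(iter(columns.values()), []))
    return [{name: col[i] for name, col in columns.items() if col[i] is not None}
            for i in range(row_count)]
```

Grouping the values of one feature together places similar strings (for example, many IP addresses) next to each other, which is the usual reason columnar layouts compress well; the more features the weblogs in a data set share, the fewer `None` placeholders each column carries.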
Based on any of the above embodiments, optionally, after compressing and storing each data set in S14, a compressed packet corresponding to each data set is obtained, and each compressed packet is stored in a TLV format, where T represents a feature identifier (such as srcip, dstip, srcport, etc.), L represents a length of the compressed packet, and V represents the compressed packet itself.
TLV is a triplet that stands for Type, Length, and Value. The T and L fields are usually of fixed length (typically 1 to 4 bytes), and the V field is of variable length. In the embodiment of the present invention, T represents a feature identifier (i.e., it indicates which feature of the weblog is stored), L represents the length of the stored compressed packet, and V represents the stored compressed packet itself.
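The TLV layout can be sketched with Python's `struct` module. The 2-byte T field, 4-byte L field, big-endian byte order, and the numeric identifiers in `FEATURE_IDS` are assumptions for illustration; the text only states that T and L are usually 1 to 4 bytes each.

```python
import struct

# Hypothetical mapping from feature names to numeric identifiers (the T field).
FEATURE_IDS = {"srcip": 1, "dstip": 2, "srcport": 3}
HEADER = struct.Struct(">HI")  # assumed layout: 2-byte T, 4-byte L, big-endian


def encode_tlv(feature, packet):
    """Prefix a compressed packet with its feature identifier and length."""
    return HEADER.pack(FEATURE_IDS[feature], len(packet)) + packet


def decode_tlvs(buffer):
    """Walk a buffer of concatenated TLV records and return (feature, packet) pairs."""
    names = {value: key for key, value in FEATURE_IDS.items()}
    offset = 0
    records = []
    while offset < len(buffer):
        type_id, length = HEADER.unpack_from(buffer, offset)
        offset += HEADER.size
        records.append((names[type_id], buffer[offset:offset + length]))
        offset += length
    return records
```

Because L gives the exact length of V, records can be concatenated back to back and a reader can skip any feature it does not need without decompressing it.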
Based on any of the above embodiments, after each data set is compressed and stored in S14, the service types of each data set are optimized, specifically:
forming a third data set according to at least one characteristic contained in the weblog collected in a set time period;
if the service type union of the third data set is a subset of the service type union of the first data set, replacing the first data set with the third data set, wherein the service type union of the third data set is the union of the service types of the weblogs in the third data set;
and if the data set comprises the first data set and the second data set, and the service type union of the third data set is a subset of the service type union of the second data set, replacing the second data set with the third data set.
For example, after the compression and storage processing of the weblog is completed, the currently formed classification tree may be optimized, specifically: after the compression and storage processing of the weblogs is completed, forming a new data set (namely, a third data set) according to the characteristics contained in the weblogs collected within a set time period, for example, according to the characteristics contained in the weblogs collected within 1 day before the current time, so as to form an optimized classification tree; for the third data set, if the service type union of the third data set is a subset of the service type union of the first data set, replacing the first data set with the third data set; and if the data set comprises the first data set and the second data set and the service type union of the third data set is a subset of the service type union of the second data set, replacing the second data set with the third data set, so that the original classification tree is replaced by the optimized classification tree.
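The replacement rule of this optimization step can be sketched as a subset test on the service type unions. The dictionary representation of a data set, the function name, and the guard against an empty third data set are assumptions; the text does not specify what happens when the third data set matches more than one existing data set, and this sketch simply replaces every match.

```python
def optimize(data_sets, third):
    """Replace each data set whose service type union contains that of `third`."""
    third_types = set(third["service_types"])
    if not third_types:
        return list(data_sets)  # assumption: an empty third data set replaces nothing
    return [third if third_types <= set(ds["service_types"]) else ds
            for ds in data_sets]


first = {"name": "first", "service_types": {"IPS", "LOGIN", "DDoS"}}
second = {"name": "second", "service_types": {"VPN"}}
third = {"name": "third", "service_types": {"IPS", "LOGIN"}}

optimized = optimize([first, second], third)
```

Here the third data set replaces the first one because {IPS, LOGIN} is a subset of {IPS, LOGIN, DDoS}, while the second data set is left unchanged.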
The above method process flow may be implemented by a software program, which may be stored in a storage medium, and when the stored software program is called, the above method steps are performed.
Based on the same inventive concept, an embodiment of the present invention further provides a weblog compression apparatus. The principle by which the apparatus solves the problem is similar to that of the above weblog compression method; for the parts of the apparatus that are the same as the method, refer to the description of the embodiments shown in fig. 1 and fig. 2, which is not repeated here.
A weblog compression apparatus according to an embodiment of the present invention, as shown in fig. 4, includes:
a feature analysis module 41, configured to analyze the collected weblog to determine at least one feature included in the weblog;
a first processing module 42, configured to determine, if an existing service type union of a first data set does not include a first feature of the weblog, a similarity between a feature set of the weblog and the feature set of the first data set, where the first feature is a feature of the at least one feature, which is used to represent the service type of the weblog, the service type union of the first data set is a union of the service types of the weblogs in the first data set, the feature set of the weblog is a set formed by the features of the weblog, and the feature set of the first data set is a union of the features of all the weblogs in the first data set;
a second processing module 43, configured to merge the weblog into the first data set if it is determined that the similarity between the feature set of the weblog and the feature set of the first data set is greater than a set threshold; and, if the similarity between the feature set of the weblog and the feature set of the first data set is not greater than the set threshold, create a second data set and merge the weblog into the second data set;
a compression module 44, configured to perform compression and storage processing on each data set, where if the data set includes the first data set, the compression and storage processing is performed on the first data set; and if the data set comprises the first data set and the second data set, respectively compressing and storing the first data set and the second data set.
In the embodiment of the invention, when the service type union of the existing first data set does not contain the first feature of the weblog, the weblog is classified according to the similarity between the feature set of the weblog and the feature set of the first data set. The merging scheme provided by the invention can therefore classify weblogs that have different service types but high similarity into the same class, which effectively reduces the number of compressed packets and, in turn, the storage space.
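The classification behaviour of the modules above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the apparatus: a weblog is modelled as a dict whose keys are its features, the key `service_type` stands in for the first feature, the threshold default of 0.5 is arbitrary, and generalizing from one existing data set to a list (choosing the most similar one) goes beyond what the text specifies.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    service_types: set = field(default_factory=set)  # service type union
    features: set = field(default_factory=set)       # union of features of all weblogs
    logs: list = field(default_factory=list)


def similarity(a: set, b: set) -> float:
    """Ratio of the intersection size to the union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(data_sets: list, log: dict, threshold: float = 0.5) -> DataSet:
    service_type = log["service_type"]  # hypothetical name of the first feature
    log_features = set(log)             # feature set of this weblog
    # If a data set already covers this service type, merge directly.
    target = next((d for d in data_sets if service_type in d.service_types), None)
    if target is None and data_sets:
        # Otherwise fall back to feature set similarity (assumed: best match wins).
        best = max(data_sets, key=lambda d: similarity(log_features, d.features))
        if similarity(log_features, best.features) > threshold:
            target = best
    if target is None:
        # Not similar enough to any existing data set: create a new one.
        target = DataSet()
        data_sets.append(target)
    target.service_types.add(service_type)
    target.features |= log_features     # feature set becomes the union
    target.logs.append(log)
    return target
```

Under these assumptions, an `https` weblog that shares most of its fields with earlier `http` weblogs lands in the same data set, so the two service types share one set of compressed packets.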
Optionally, when the first processing module 42 determines the similarity between the feature set of the weblog and the feature set of the first data set, it is specifically configured to:
determining a first numerical value and a second numerical value, wherein the first numerical value is the number of features in the intersection of the feature set of the weblog and the feature set of the first data set, and the second numerical value is the number of features in the union of the feature set of the weblog and the feature set of the first data set;
and determining the similarity between the feature set of the weblog and the feature set of the first data set according to the first numerical value and the second numerical value, wherein the similarity between the feature set of the weblog and the feature set of the first data set is the ratio of the first numerical value to the second numerical value.
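The ratio defined above can be worked through with a small example. The feature names and the function name `feature_similarity` are hypothetical, and returning zero when both sets are empty is an assumption, since the text does not cover that case.

```python
from fractions import Fraction


def feature_similarity(log_features: set, data_set_features: set) -> Fraction:
    first_value = len(log_features & data_set_features)   # features in the intersection
    second_value = len(log_features | data_set_features)  # features in the union
    if second_value == 0:
        return Fraction(0)  # assumption: two empty feature sets are treated as dissimilar
    return Fraction(first_value, second_value)


log_features = {"time", "src_ip", "dst_ip", "url"}
data_set_features = {"time", "src_ip", "dst_ip", "domain", "ttl"}
print(feature_similarity(log_features, data_set_features))  # prints 1/2
```

Here the intersection holds 3 features and the union holds 6, so the similarity is 3/6 = 1/2, which would then be compared against the set threshold.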
Based on any of the above embodiments, optionally, after the second processing module 43 merges the weblog into the first data set, the second processing module is further configured to:
determining a union of the feature set of the weblog and the feature set of the first data set as the feature set of the first data set.
Optionally, the compression module 44 is specifically configured to:
after the number of the stored weblogs reaches a set first threshold value, compressing and storing each data set; or
after the sum of the data amount of the stored weblogs reaches a set second threshold value, compressing and storing each data set; or
when a set compression period comes, compressing and storing each data set.
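The three trigger conditions can be sketched as one small helper. The text presents them as alternatives; checking all three in a single class, the default values, and the name `CompressionTrigger` are assumptions for illustration.

```python
import time


class CompressionTrigger:
    """Tracks stored weblogs and reports when compression should run."""

    def __init__(self, max_logs=10_000, max_bytes=64 * 1024 * 1024,
                 period_s=3600.0, clock=time.monotonic):
        self.max_logs = max_logs      # set first threshold (count of weblogs)
        self.max_bytes = max_bytes    # set second threshold (total data amount)
        self.period_s = period_s      # set compression period, in seconds
        self.clock = clock            # injectable clock, to make testing easy
        self.reset()

    def reset(self) -> None:
        """Call after each compression run to start a new window."""
        self.count = 0
        self.total_bytes = 0
        self.started = self.clock()

    def record(self, log: bytes) -> None:
        """Account for one newly stored weblog."""
        self.count += 1
        self.total_bytes += len(log)

    def should_compress(self) -> bool:
        return (self.count >= self.max_logs
                or self.total_bytes >= self.max_bytes
                or self.clock() - self.started >= self.period_s)
```

A caller would invoke `record` for each stored weblog, run the compression and storage step when `should_compress` returns true, and then call `reset`.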
As another optional implementation manner, the first processing module 42 is further configured to:
when it is determined, according to the first feature of the weblog, that the service type union of the first data set includes the first feature, merge the weblog into the first data set.
Based on any one of the above embodiments, optionally, as shown in fig. 5, the apparatus further includes:
an optimization module 45, configured to form a third data set according to at least one feature included in the weblog collected within a set time period; if the service type union of the third data set is a subset of the service type union of the first data set, replacing the first data set with the third data set, wherein the service type union of the third data set is the union of the service types of the weblogs in the third data set; and if the data set comprises the first data set and the second data set, and the service type union of the third data set is a subset of the service type union of the second data set, replacing the second data set with the third data set.
In the embodiment of the present invention, the method in the embodiments shown in fig. 1 and fig. 2 may be implemented by a server, as shown in fig. 6, where the server includes: a processor 61, an input interface 62, an output interface 63, a memory 64, and a system bus 65; wherein:
the processor 61 is responsible for logical operations and processing. When the server runs, the processor 61 reads the program in the memory 64 and executes the above method embodiment, specifically: the processor 61 performs the above-described steps S11, S12, S13A, S13B, and S14. Optionally, the processor 61 may also execute the step S15.
The memory 64 includes an internal memory and a hard disk, and can store data (such as the first data set, the second data set, and the compressed packets obtained by compressing the data sets) used by the processor 61 when performing operations. The input interface 62 is used for reading in data (such as weblogs) under the control of the processor 61, and the output interface 63 is used for outputting data (such as compressed packets) under the control of the processor 61.
The bus architecture may include any number of interconnected buses and bridges, which link together various circuits including one or more processors, represented by the processor 61, and the internal memory and hard disk, represented by the memory 64. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further herein.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (13)

1. A method of weblog compression, the method comprising:
analyzing the collected weblog to determine at least one feature contained in the weblog;
if the service type union of an existing first data set does not contain a first feature of the weblog, determining the similarity between the feature set of the weblog and the feature set of the first data set, wherein the first feature is the feature, among the at least one feature, that represents the service type of the weblog, the service type union of the first data set is a union of the service types of the weblogs in the first data set, the feature set of the weblog is a set formed by the features of the weblog, and the feature set of the first data set is a union of the features of all the weblogs in the first data set;
if the similarity between the feature set of the weblog and the feature set of the first data set is determined to be greater than a set threshold value, merging the weblog into the first data set; if the similarity between the feature set of the weblog and the feature set of the first data set is not greater than the set threshold value, creating a second data set, and merging the weblog into the second data set;
compressing and storing each data set, wherein if the data set comprises the first data set, the first data set is compressed and stored; and if the data set comprises the first data set and the second data set, respectively compressing and storing the first data set and the second data set.
2. The method of claim 1, wherein determining the similarity of the feature set of the weblog to the feature set of the first data set comprises:
determining a first numerical value and a second numerical value, wherein the first numerical value is the number of features in the intersection of the feature set of the weblog and the feature set of the first data set, and the second numerical value is the number of features in the union of the feature set of the weblog and the feature set of the first data set;
and determining the similarity between the feature set of the weblog and the feature set of the first data set according to the first numerical value and the second numerical value, wherein the similarity between the feature set of the weblog and the feature set of the first data set is the ratio of the first numerical value to the second numerical value.
3. The method of claim 1 or 2, wherein after merging the weblog into the first dataset, further comprising:
determining a union of the feature set of the weblog and the feature set of the first data set as the feature set of the first data set.
4. The method of claim 1, wherein compressing and storing each data set comprises:
after the number of the stored weblogs reaches a set first threshold value, compressing and storing each data set; or
after the sum of the data amount of the stored weblogs reaches a set second threshold value, compressing and storing each data set; or
when a set compression period comes, compressing and storing each data set.
5. The method of claim 1 or 4, wherein compressing and storing each data set comprises:
and compressing and storing each data set in a columnar storage mode.
6. The method of claim 1, wherein after determining at least one feature contained in the weblog, the method further comprises:
when it is determined, according to the first feature of the weblog, that the service type union of the first data set includes the first feature, merging the weblog into the first data set.
7. The method of any one of claims 1, 2, 4, and 6, wherein after compressing and storing each data set, further comprising:
forming a third data set according to at least one feature contained in the weblogs collected within a set time period;
if the service type union of the third data set is a subset of the service type union of the first data set, replacing the first data set with the third data set, wherein the service type union of the third data set is the union of the service types of the weblogs in the third data set;
and if the data set comprises the first data set and the second data set, and the service type union of the third data set is a subset of the service type union of the second data set, replacing the second data set with the third data set.
8. A weblog compression apparatus, the apparatus comprising:
a feature analysis module, configured to analyze the collected weblog and determine at least one feature contained in the weblog;
a first processing module, configured to determine, if the service type union of an existing first data set does not include a first feature of the weblog, a similarity between a feature set of the weblog and the feature set of the first data set, where the first feature is the feature, among the at least one feature, that represents the service type of the weblog, the service type union of the first data set is a union of the service types of the weblogs in the first data set, the feature set of the weblog is a set formed by the features of the weblog, and the feature set of the first data set is a union of the features of all the weblogs in the first data set;
a second processing module, configured to merge the weblog into the first data set if it is determined that the similarity between the feature set of the weblog and the feature set of the first data set is greater than a set threshold; and, if the similarity between the feature set of the weblog and the feature set of the first data set is not greater than the set threshold, create a second data set and merge the weblog into the second data set;
a compression module, configured to compress and store each data set, wherein if the data set comprises the first data set, the first data set is compressed and stored; and if the data set comprises the first data set and the second data set, the first data set and the second data set are respectively compressed and stored.
9. The apparatus of claim 8, wherein the first processing module is specifically configured to:
determining a first numerical value and a second numerical value, wherein the first numerical value is the number of features in the intersection of the feature set of the weblog and the feature set of the first data set, and the second numerical value is the number of features in the union of the feature set of the weblog and the feature set of the first data set;
and determining the similarity between the feature set of the weblog and the feature set of the first data set according to the first numerical value and the second numerical value, wherein the similarity between the feature set of the weblog and the feature set of the first data set is the ratio of the first numerical value to the second numerical value.
10. The apparatus of claim 8 or 9, wherein the second processing module, after merging the weblog into the first dataset, is further to:
determining a union of the feature set of the weblog and the feature set of the first data set as the feature set of the first data set.
11. The apparatus of claim 8, wherein the compression module is specifically configured to:
after the number of the stored weblogs reaches a set first threshold value, compressing and storing each data set; or
after the sum of the data amount of the stored weblogs reaches a set second threshold value, compressing and storing each data set; or
when a set compression period comes, compressing and storing each data set.
12. The apparatus of claim 8, wherein the first processing module is further to:
when it is determined, according to the first feature of the weblog, that the service type union of the first data set includes the first feature, merge the weblog into the first data set.
13. The apparatus of any one of claims 8, 9, 11, 12, further comprising:
an optimization module, configured to form a third data set according to at least one feature contained in the weblogs collected within a set time period; if the service type union of the third data set is a subset of the service type union of the first data set, replace the first data set with the third data set, wherein the service type union of the third data set is the union of the service types of the weblogs in the third data set; and if the data set comprises the first data set and the second data set, and the service type union of the third data set is a subset of the service type union of the second data set, replace the second data set with the third data set.
CN201510728041.7A 2015-10-31 2015-10-31 Weblog compression method and device Active CN106649344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510728041.7A CN106649344B (en) 2015-10-31 2015-10-31 Weblog compression method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510728041.7A CN106649344B (en) 2015-10-31 2015-10-31 Weblog compression method and device

Publications (2)

Publication Number Publication Date
CN106649344A CN106649344A (en) 2017-05-10
CN106649344B true CN106649344B (en) 2020-01-10

Family

ID=58809347

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510728041.7A Active CN106649344B (en) 2015-10-31 2015-10-31 Weblog compression method and device

Country Status (1)

Country Link
CN (1) CN106649344B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108897500B (en) * 2018-05-23 2021-10-26 联想图像(天津)科技有限公司 Data transmission method and device and electronic equipment
CN109189763A (en) * 2018-09-17 2019-01-11 北京锐安科技有限公司 A kind of date storage method, device, server and storage medium
CN112559618B (en) * 2020-12-23 2023-07-11 光大兴陇信托有限责任公司 External data integration method based on financial wind control business
CN113535654B (en) * 2021-06-11 2023-10-31 安徽安恒数智信息技术有限公司 Log processing method, system, electronic device and storage medium
CN113553589B (en) * 2021-07-30 2022-09-02 江苏易安联网络技术有限公司 Extraction method, device and application of malicious software propagation characteristics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1842021A (en) * 2005-03-28 2006-10-04 华为技术有限公司 Log information storage method
WO2012031269A1 (en) * 2010-09-03 2012-03-08 Loglogic, Inc. Random access data compression
CN102541863A (en) * 2010-12-14 2012-07-04 联芯科技有限公司 Webpage compression method applied to mobile terminal
CN102609491A (en) * 2012-01-20 2012-07-25 东华大学 Column-storage oriented area-level data compression method

Also Published As

Publication number Publication date
CN106649344A (en) 2017-05-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant