CN117112549B

CN117112549B - Big data merging method based on bloom filter

Info

Publication number: CN117112549B
Application number: CN202311365012.XA
Authority: CN
Inventors: 代颖超; 张仑; 梁思杰; 牛威
Original assignee: Zhongke Xingtu Measurement And Control Technology Co ltd
Current assignee: Zhongke Xingtu Measurement And Control Technology Co ltd
Priority date: 2023-10-20
Filing date: 2023-10-20
Publication date: 2024-03-26
Anticipated expiration: 2043-10-20
Also published as: CN117112549A

Abstract

The invention discloses a big data merging method based on a bloom filter, comprising the following steps: S1, using Redis to cache, in batches, syslog log data sent by different devices/hosts; S2, consuming the syslog log data cached in batches in Redis, and parsing the consumed syslog log data to obtain the merge field encryption value of the syslog log data; S3, screening the merge field encryption values of the syslog log data with a bloom filter, and transferring the screened syslog log data to a database. By caching syslog log data in batches in Redis and computing merge field encryption values during parsing, the invention uses the merge field to eliminate a large amount of redundant data in the syslog log data, which saves storage space, reduces the cost of using the database, and improves the efficiency of the database.

Description

Big data merging method based on bloom filter

Technical Field

The invention relates to the technical field of big data merging and storing, in particular to a big data merging method based on a bloom filter.

Background

The rapid development of the internet in recent years has brought humanity into an era of explosive growth in information. Everyone's life is filled with structured and unstructured data. As human life shifts wholesale onto the internet, the big data age is inevitable. As a leading-edge concept of the global internet, big data has two main characteristics: on the one hand, the amount of information in society as a whole has grown drastically; on the other hand, the information available to individuals has grown exponentially. From the perspective of technological development, "big data" is an inevitable product of the trend toward "datafication". As this trend deepens, we will soon enter an era in which "everything is recorded and everything is digitized".

In the big data age, the amount of data generated in every field is increasing explosively. Data accumulates at a staggering rate from social media, sensors, online transactions and cloud storage. These data contain valuable information and insight, so storing big data efficiently and analyzing it well is becoming increasingly urgent. Data analysis capability determines the quality, and the success or failure, of value discovery in big data. The most important difference between data collection, analysis and storage in the big data age and earlier data analysis is the dramatic increase in data volume, which in turn rapidly raises the demands on data storage, querying and analysis. The big data age therefore requires efficient data processing and analysis methods. The traditional approach, which runs from data reception and preprocessing through to data merging and storage, carries risks of redundant data, data loss, cache penetration and service downtime.

Patent document CN103116599A discloses a method for fast redundancy removal from urban mass data streams based on an improved Bloom Filter structure, and thus relates to removing redundant data with a Bloom Filter structure. However, it focuses on using the Bloom Filter structure to remove redundancy after the data set has been stored, and does not take advantage of Bloom Filter screening.

Disclosure of Invention

The invention aims to provide a bloom filter-based big data merging method that solves the problems of reduced data processing efficiency, data loss, cache penetration and service downtime caused by large amounts of redundant data in big data processing.

The aim of the invention can be achieved by the following technical scheme: a big data merging method based on a bloom filter, comprising the following steps:

S1, using Redis to cache, in batches, syslog log data sent by different devices/hosts;

S2, consuming the syslog log data cached in batches in Redis, and parsing the consumed syslog log data to obtain the merge field encryption value of the syslog log data;

S3, screening the merge field encryption values of the syslog log data with a bloom filter, and transferring the screened syslog log data to a database.

Further, the step in S3 of screening the merge field encryption values of the syslog log data with a bloom filter is as follows:

S31, for each syslog log record passing through it, the bloom filter looks up whether the corresponding merge field encryption value exists in the bloom filter;

S32, when the merge field encryption value does not exist in the bloom filter, searching the database for data with the same merge field; if such data exists, updating it, and if it does not, writing the data to the database; and storing the merge field encryption value in the bloom filter and in Redis;

S33, when the merge field encryption value exists in the bloom filter, updating the data in the database that has the same merge field;

S34, repeating S31-S33 until screening of the consumed syslog log data is complete.
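The decision flow of S31-S34 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: a Python `set` stands in for the bloom filter (so this sketch has no false positives), plain dictionaries stand in for Redis and the database, SHA-256 stands in for the unspecified "encryption value", and the names `digest_of` and `screen` are hypothetical.

```python
import hashlib

def digest_of(merge_field):
    # Assumption: SHA-256 stands in for the unspecified "encryption value".
    return hashlib.sha256(merge_field.encode("utf-8")).hexdigest()

def screen(records, bloom, redis_cache, database):
    """Run S31-S33 for every consumed record; the loop itself is S34."""
    for merge_field, payload in records:
        key = digest_of(merge_field)
        if key not in bloom:                      # S31: value not in the filter
            if key in database:                   # S32: same merge field stored?
                database[key].update(payload)     #      yes: update it
            else:
                database[key] = dict(payload)     #      no: write a new row
            bloom.add(key)                        # S32: remember the value ...
            redis_cache[key] = merge_field        #      ... in the filter and Redis
        else:                                     # S33: value seen before
            database.setdefault(key, {}).update(payload)
    return database
```

Calling `screen(records, set(), {}, {})` on records that repeat a merge field leaves one row per distinct merge field, with the most recent payload values.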

Further, the step in S31 of looking up whether the corresponding merge field encryption value exists in the bloom filter for the syslog log data passing through it is as follows:

S311, the bloom filter converts the encryption value into a hash value;

S312, the bloom filter compares the byte array positions corresponding to the hash value;

S313, if the compared hash value is not present at the byte array positions, a null value is returned and the corresponding merge field encryption value is judged not to exist in the bloom filter.
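A minimal sketch of the lookup in S311-S313 is given below. The text does not specify the hash functions, the array size, or the number of positions checked per value, so the salted SHA-256 hashing and the `size_bits` and `num_hashes` defaults are assumptions, and the class and method names are hypothetical. Positions are stored as individual bits packed into a Python `bytearray`.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)   # all positions start unset

    def _positions(self, value):
        # S311: convert the value into hash values, one array position per hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(h[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def lookup(self, value):
        # S312: compare the array positions that correspond to the hash values.
        for pos in self._positions(value):
            if not self.bits[pos // 8] & (1 << (pos % 8)):
                return None        # S313: a position is unset, value is absent
        return value               # every position is set: value may be present
```

A `None` result is definitive, while a non-`None` result only means the value may have been added, which is why a hit is confirmed separately.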

Further, the step in S33 of updating the data in the database that has the same merge field, when the merge field encryption value exists in the bloom filter, is as follows:

S331, when the merge field encryption value exists in the bloom filter, querying the merge field encryption value data stored in Redis to confirm whether the value really exists;

S332, if Redis holds data with the same merge field, updating that data and updating the data with the same merge field in the database;

S333, if Redis does not hold data with the same merge field, inserting the data into the database.
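The confirmation in S331-S333 can be illustrated with the sketch below, in which plain dictionaries stand in for Redis and the database. The function name `handle_bloom_hit` and its return strings are hypothetical and serve only to make the two branches visible.

```python
def handle_bloom_hit(digest, payload, redis_cache, database):
    """Confirm a bloom-filter hit against Redis before touching the database."""
    if digest in redis_cache:            # S331: does the value really exist?
        redis_cache[digest] = payload    # S332: update the copy held in Redis ...
        database[digest] = payload       # ... and the matching database row
        return "updated"
    database[digest] = payload           # S333: not in Redis, insert into database
    return "inserted"
```

The second branch is the case where the bloom filter reported a value that Redis does not actually hold, so the record is handled as new data rather than as a duplicate.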

The invention has the beneficial effects that:

1. The invention caches syslog log data in batches in Redis and computes the merge field encryption value during parsing. Merge field processing eliminates a large amount of redundant data in the syslog log data, which saves storage space, reduces the cost of using the database, and improves the efficiency of the database.

2. The invention screens the merge field encryption values with a bloom filter. The bloom filter screens quickly, noticeably faster than searching Redis for data with the same merge field. This fast screening keeps request data from continuously querying Redis, which would otherwise slow Redis down and cause Redis cache penetration. The bloom filter therefore solves the Redis cache penetration problem and speeds up the removal of redundant data.

3. In addition to screening with the bloom filter, the invention also screens with Redis: by querying the merge field encryption value again in Redis, redundant data is removed more accurately and the redundant-data removal rate is higher.

4. The invention encrypts the data merge fields in a uniform way, which effectively prevents malicious attacks and service downtime.

5. The invention uses the bloom filter to keep Redis cache penetration within a tolerable range. The bloom filter can be used to pre-cache the primary keys of data queries: the merge field encryption values are cached in the bloom filter, and when data is queried by merge field encryption value, the bloom filter first judges whether the value exists. If it exists, processing continues; if it does not, the query returns directly. Cache penetration is thereby kept within a tolerable range.

Drawings

FIG. 1 is a flow chart of a bloom filter-based big data merging method of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, and examples of them are illustrated in the accompanying drawings, in which like or similar reference symbols denote like or similar elements, or elements having like or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.

As shown in fig. 1, the invention discloses a bloom filter-based big data merging method, which comprises the following steps:

S1, Redis caches the data using its message queue, Redis Stream, which stores event stream data as an ordered, continuously growing log sequence. Each event is a message containing several fields, and messages are appended to the tail of the Redis Stream. The Redis Stream receives, over UDP, the syslog log data sent by different devices/hosts and caches it in batches. Redis Stream provides persistence and master-slave replication, so any client can access and consume the syslog log data at any time, and the position each client has reached in the syslog log data is stored. A client can adjust its consumption speed dynamically according to its own processing capacity, which ensures reliable processing of the data and effectively prevents data loss.
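The buffering behaviour described here (an append-only log of field messages, with a stored read position per client) can be illustrated with a small in-memory stand-in. This is not the Redis API and it has none of Redis's persistence or replication; in Redis itself, entries are appended with the `XADD` command and consumer groups read with `XREADGROUP`. The class and method names below are hypothetical.

```python
class StreamStandIn:
    """In-memory stand-in for an append-only stream; not the Redis API."""

    def __init__(self):
        self.entries = []     # ordered, continuously growing log of messages
        self.offsets = {}     # consumer name -> index of its next unread entry

    def append(self, fields):
        # Each message is a set of fields and is added at the tail of the log.
        self.entries.append(dict(fields))
        return len(self.entries) - 1

    def read_batch(self, consumer, count):
        # Each consumer resumes from its own stored position, at its own pace.
        start = self.offsets.get(consumer, 0)
        batch = self.entries[start:start + count]
        self.offsets[consumer] = start + len(batch)
        return batch
```

Because the position is kept per consumer, a slow client can ask for small batches while a fast one asks for large batches, and neither skips messages.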

S2, a client consumes the syslog log data cached in batches in the message queue Redis Stream, parses it, and obtains the merge field encryption value of the syslog log data.

The merge field is formed by extracting characteristic values from the syslog log data, for example device information, time information and content information, and merging them in a unified format. Merge field processing exposes a large amount of redundant data in the syslog log data. Identical redundant data has no use value and occupies a large amount of storage space, which increases the cost of using the database and reduces its efficiency.

The parsed syslog log data is encrypted according to its merge field to obtain an encryption value, which helps prevent malicious attacks.
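As an illustration of merge field extraction and encryption, the sketch below parses a BSD-style syslog line, joins selected fields in a fixed format, and digests the result. Several choices here are assumptions not stated in the text: the line layout matched by `SYSLOG_RE`, the default choice of `host` and `message` as the merged fields, the `|` separator, and the use of SHA-256 as the "encryption value". The function name is hypothetical.

```python
import hashlib
import re

# Assumed layout: <priority>Mon dd hh:mm:ss host message
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<message>.*)$"
)

def merge_field_digest(line, fields=("host", "message")):
    match = SYSLOG_RE.match(line)
    if match is None:
        raise ValueError(f"not a BSD-style syslog line: {line!r}")
    # Merge the chosen characteristic values in one unified format.
    merge_field = "|".join(match.group(name).strip() for name in fields)
    # Assumption: a SHA-256 digest plays the role of the encryption value.
    return hashlib.sha256(merge_field.encode("utf-8")).hexdigest()
```

With the default fields, two lines from the same host with the same message but different timestamps produce the same digest, which is what lets later screening treat them as redundant. Passing `fields=("timestamp", "host", "message")` would instead make the timestamp part of the merge field.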

S3, to screen out redundant data in the syslog log data, a bloom filter is used to screen the merge field encryption values of the syslog log data, and the screened syslog log data is then transferred to a database.

As shown in fig. 1, in particular:

S31, for each syslog log record passing through it, the bloom filter looks up whether the corresponding merge field encryption value exists in the bloom filter. The specific steps of the lookup are as follows:

S311, the bloom filter converts the encryption value into a hash value;

When data whose merge field encryption value does not exist is requested, the bloom filter converts the encryption value to be compared into a hash value as the data passes through, and compares the byte array positions corresponding to that hash value. If the hash value is not present at those byte array positions, the value is known immediately not to exist and a null value is returned directly. The time this takes is almost negligible, and it is noticeably faster than searching Redis for data with the same merge field.

When redundant data with the same merge field is checked for duplicates, the fast screening of the bloom filter keeps request data from continuously querying Redis, which would otherwise slow Redis down and cause Redis cache penetration. Setting up the bloom filter therefore solves the Redis cache penetration problem.

S32, the merge field encryption values of the consumed syslog log data are kept synchronized between the bloom filter and Redis. If the bloom filter does not contain a merge field encryption value, it can be concluded that Redis does not contain it either; the syslog log record can then be treated as new data and is written to the database.

S33, when the merge field encryption value exists in the bloom filter, the data in the database that has the same merge field is updated. The update may proceed as follows:

The bloom filter screens a merge field encryption value by converting it into a hash value and comparing the byte array positions corresponding to that hash value, which differs from how Redis screens the merge field encryption value. As a result, when the bloom filter reports that a merge field encryption value exists, data with the same merge field may or may not exist in Redis. If Redis holds data with the same merge field, that data is updated, and the data with the same merge field in the database is updated as well. If Redis does not hold data with the same merge field, the data is inserted into the database.
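The reason a bloom filter hit may not correspond to data in Redis can be made quantitative with the standard bloom filter estimate. The symbols below do not appear in the original text and are introduced only for illustration: m is the number of positions in the array, k is the number of hash functions, and n is the number of distinct merge field encryption values that have been added. Assuming the hash functions behave as independent and uniform, the probability that a value never added is nevertheless reported as present is approximately:

```latex
p_{\text{false positive}} \approx \left(1 - e^{-kn/m}\right)^{k}
```

A lookup that finds any position unset is always correct, so only hits can be wrong. This is consistent with Redis being queried again only when the bloom filter reports that the value exists.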

By querying the consumed syslog log data again in Redis, redundant data is removed more accurately and the redundant-data removal rate is higher.

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical scheme and inventive concept, shall be covered by the scope of protection of the present invention.

It is to be understood that the terms "center," "longitudinal," "transverse," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counter-clockwise," "axial," "radial," "circumferential," and the like are directional or positional relationships as indicated based on the drawings, merely to facilitate describing the invention and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be configured and operated in a particular orientation, and therefore should not be construed as limiting the invention.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.

In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; may be mechanically connected, may be electrically connected or may be in communication with each other; either directly or indirectly, through intermediaries, or both, may be in communication with each other or in interaction with each other, unless expressly defined otherwise. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.

In the present invention, unless expressly stated or limited otherwise, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact via an intervening medium. Moreover, a first feature being "above," "over" or "on" a second feature may mean that the first feature is directly above or obliquely above the second feature, or simply that the first feature is at a higher level than the second feature. A first feature being "under," "below" or "beneath" a second feature may mean that the first feature is directly under or obliquely below the second feature, or simply that the first feature is at a lower level than the second feature.

Claims

1. The big data merging method based on the bloom filter is characterized by comprising the following steps of:

S3, screening the merge field encryption values of the syslog log data with a bloom filter, and transferring the screened syslog log data to a database;

the step of screening the encrypted value of the syslog log data merging field by using a bloom filter in the S3 is as follows:

2. The bloom filter-based big data merging method of claim 1, wherein: the step of searching whether the bloom filter has the corresponding merging field encryption value for the passed syslog log data in the S31 is as follows:

S311, the bloom filter converts the encryption value into a hash value;

3. The bloom filter-based big data merging method of claim 1, wherein: in S33, when the merging field encryption value exists in the bloom filter, the step of updating the data identical to the merging field in the database is as follows: