CN115499244A - Streaming data safe transmission and storage method based on data lake - Google Patents

Info

Publication number
CN115499244A
Authority
CN
China
Prior art keywords
data
transmission
encrypted
streaming
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211429921.0A
Other languages
Chinese (zh)
Inventor
姜杰
姜自成
姜开德
张学德
李萌萌
徐莹莹
莫言田
陈源
姜宝永
管绍朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jianghua Group Co ltd
Original Assignee
Jianghua Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jianghua Group Co ltd
Priority to CN202211429921.0A
Publication of CN115499244A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/04 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • H04L63/0442 Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload wherein the sending and receiving network entities apply asymmetric encryption, i.e. different keys for encryption and decryption
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/02 Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227 Filtering policies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/12 Applying verification of the received information
    • H04L63/123 Applying verification of the received information received data contents, e.g. message integrity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/18 Network architectures or network communication protocols for network security using different networks or channels, e.g. using out of band channels
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/30 Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy
    • H04L9/3066 Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy involving algebraic varieties, e.g. elliptic or hyper-elliptic curves

Abstract

The invention relates to the technical field of data processing, and particularly discloses a streaming data safe transmission and storage method based on a data lake. According to the method, Event data is acquired at a plurality of data source ends, the Event data is filtered and cleaned to obtain key data, and the key data is encrypted to obtain encrypted data; a multi-source streaming data transmission channel is constructed by using Flume and Kafka, and the encrypted data of the plurality of data source ends is transmitted in encrypted form through the multi-source streaming data transmission channel; and a data lake is constructed by using Hadoop, and the encrypted data obtained by encrypted transmission is compressed and stored in partitions based on the LZO compression technology. The method filters and encrypts key information at the data source end, ensures the security of data during transmission, stores streaming data safely and efficiently without destroying the original structure of the data, and increases the query speed of the data while reducing the occupied storage space.

Description

Streaming data safe transmission and storage method based on data lake
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a streaming data safe transmission and storage method based on a data lake.
Background
Big data refers to a data set that cannot be captured, managed and processed by conventional software tools within a given time range. It is a massive information asset that becomes more valuable once new processing modes are adopted, and it is characterized by large volume, varied types and strong timeliness. Research on big data generally covers data acquisition, data storage, data analysis and the like, among which data storage sits at the core of data processing: data must be stored safely and efficiently before it can be analyzed and mined.
Streaming data consists of a real-time, continuous and ordered sequence of data items, and is characterized by large scale and poor predictability. In addition, the format of streaming data is complex, and the data is at risk of being damaged, tampered with or leaked during transmission and storage. The transmission and storage of streaming data therefore face greater security challenges than those of static data, and the protection of streaming data needs to cover the whole process from the source of data production to the storage system.
Streaming data faces three main security threats during transmission and storage: (1) Before transmission, data that has not been protected is exposed to the risk of leakage. After streaming data is generated, it is first cached locally and then transmitted through the network. Unencrypted data is stored on the local server in plaintext, so an unauthorized user can read the data by inspecting the cache and can even tamper with it. (2) During transmission, data faces threats such as eavesdropping and tampering. An attacker can obtain data in transit by monitoring the data transmission port, and can intercept or tamper with it. (3) In the storage system, streaming data faces the risk of loss. The scale of streaming data grows continuously over time, and traditional data storage systems cannot meet the storage requirements of massive streaming data. In addition, streaming data generally includes structured, semi-structured and unstructured data, and the traditional structured database storage mode is not suitable for streaming data of such complex types.
Disclosure of Invention
The embodiment of the invention aims to provide a streaming data safe transmission and storage method based on a data lake, so as to solve the problems described in the background art.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
A streaming data safe transmission and storage method based on a data lake specifically comprises the following steps:
acquiring Event data at a plurality of data source ends, filtering and cleaning the Event data to obtain key data, and encrypting the key data to obtain encrypted data;
constructing a multi-source streaming data transmission channel by using Flume and Kafka, and performing encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel;
and constructing a data lake by using Hadoop, and compressing the encrypted data obtained by encrypted transmission and storing it in partitions based on the LZO compression technology.
As a further limitation of the technical solution of the embodiment of the present invention, the acquiring Event data at a plurality of data source ends, filtering and cleaning the Event data to obtain key data, and encrypting the key data to obtain encrypted data specifically includes the following steps:
acquiring Event data of a plurality of data source ends in real time;
filtering and cleaning the Event data acquired in real time in a stream processing mode to obtain key data;
and encrypting the key data based on an ECC (elliptic curve cryptography) lightweight encryption algorithm to obtain encrypted data.
As a further limitation of the technical solution of the embodiment of the present invention, the encrypting the key data based on the ECC lightweight encryption algorithm to obtain the encrypted data specifically includes the following steps:
generating an elliptic curve E based on the ECC lightweight encryption algorithm, acquiring the elliptic curve group, and completing initialization of the encryption algorithm;
calculating all points satisfying the elliptic curve E and obtaining a base point $G$;
mapping the byte information of the key data onto the elliptic curve E to realize data encryption processing and obtain encrypted data.
As a further limitation of the technical solution of the embodiment of the present invention, the elliptic curve E is defined as follows:
$E: y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$
where $a_1, a_2, a_3, a_4, a_6 \in K$ and $\Delta \neq 0$; $K$ is the field of rational numbers, and $\Delta$ is the discriminant of the elliptic curve equation, which is defined as:
$\Delta = -d_2^2 d_8 - 8 d_4^3 - 27 d_6^2 + 9 d_2 d_4 d_6$, with $d_2 = a_1^2 + 4 a_2$, $d_4 = 2 a_4 + a_1 a_3$, $d_6 = a_3^2 + 4 a_6$, $d_8 = a_1^2 a_6 + 4 a_2 a_6 - a_1 a_3 a_4 + a_2 a_3^2 - a_4^2$
As a further limitation of the technical solution of the embodiment of the present invention, the simplified formula of the elliptic curve E is as follows:
$y^2 = x^3 + a x + b$
where $a, b \in F_p$; $F_p$ is a finite field and $p$ is a large prime number.
As a further limitation of the technical solution of the embodiment of the present invention, the constructing a multi-source streaming data transmission channel by using Flume and Kafka, and the performing encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel specifically includes the following steps:
a Flume proxy server at the data receiving end generates a private key;
the Flume proxy server at the data receiving end generates a certificate signed by a trusted key;
the Flume receiving-end proxy server creates a trust repository;
a PKCS12 file is created from the key and the certificate to encrypt the Flume channel, and the encrypted data of the plurality of data source ends is transmitted over the encrypted channel.
As a further limitation of the technical solution of the embodiment of the present invention, constructing a multi-source streaming data transmission channel by using Flume and Kafka, and performing encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel further includes:
collecting and aggregating the Event data of the corresponding data source ends by using a plurality of Flume processes, and storing the Event data in the big data storage system in a centralized manner.
As a further limitation of the technical solution of the embodiment of the present invention, constructing a multi-source streaming data transmission channel by using Flume and Kafka, and performing encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel further includes:
using Kafka to receive the encrypted data transmitted by the Flume channel, serializing the encrypted data, and caching the data when the data transmission volume is large.
As a further limitation of the technical solution of the embodiment of the present invention, the constructing a multi-source streaming data transmission channel by using Flume and Kafka, and the performing encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel specifically includes:
in the intermediate stage of the transmission of the encrypted data, using Kafka to buffer the data, temporarily storing the encrypted data in Kafka, and writing the data to the downstream storage system at a set processing speed.
As a further limitation of the technical solution of the embodiment of the present invention, the constructing a data lake by using Hadoop, and compressing the encrypted data obtained by encrypted transmission and storing it in partitions based on the LZO compression technology specifically includes the following steps:
constructing a data lake by using Hadoop, wherein HDFS is adopted for the underlying storage;
taking the timestamp of data transmission as the partition name, and storing the encrypted data in partitions;
and recompiling the LZO algorithm with the Hadoop source code, and compressing the encrypted data stored in the partitions with the recompiled LZO algorithm.
Compared with the prior art, the invention has the beneficial effects that:
(1) The data source end filters and encrypts the data in a stream processing mode, so that key information is not leaked while the transmission efficiency of the data is maintained;
(2) A multi-data-source safe transmission channel based on Flume is constructed, so that the security of streaming data during network transmission is ensured;
(3) A safe and efficient data lake storage system is constructed by using big data components such as Flume, Kafka and Hadoop, and the integrity of the data flow is guaranteed by adopting Kafka.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention.
Fig. 1 shows a schematic diagram of security threats presented by streaming data.
Fig. 2 shows a schematic diagram of a method provided by an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a schematic diagram of the security threats faced by streaming data. In the prior art, streaming data faces three main security threats during transmission and storage: (1) Before transmission, data that has not been protected is exposed to the risk of leakage. After streaming data is generated, it is first cached locally and then transmitted through a network. Unencrypted data is stored on the local server in plaintext, so an unauthorized user can read the data by inspecting the cache and can even tamper with it. (2) During transmission, data faces threats such as eavesdropping and tampering. An attacker can obtain data in transit by monitoring the data transmission port, and can intercept or tamper with it.
(3) In the storage system, streaming data is exposed to the risk of loss. The scale of streaming data grows continuously over time, and traditional data storage systems cannot meet the storage requirements of massive streaming data. In addition, streaming data generally includes structured, semi-structured and unstructured data, and the traditional structured database storage mode is not suitable for streaming data of such complex types.
In order to solve these problems, the embodiment of the invention acquires Event data at a plurality of data source ends, filters and cleans the Event data to obtain key data, and encrypts the key data to obtain encrypted data; constructs a multi-source streaming data transmission channel by using Flume and Kafka, and performs encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel; and constructs a data lake by using Hadoop, and compresses the encrypted data obtained by encrypted transmission and stores it in partitions based on the LZO compression technology. The method filters and encrypts key information at the data source end, ensures the security of data during transmission, stores streaming data safely and efficiently without destroying the original structure of the data, and increases the query speed of the data while reducing the occupied storage space.
Fig. 2 shows a schematic diagram of a method provided by an embodiment of the invention.
Specifically, in a preferred embodiment provided by the present invention, a streaming data safe transmission and storage method based on a data lake specifically includes the following steps:
Step one: acquire Event data at a plurality of data source ends, filter and clean the Event data to obtain key data, and encrypt the key data to obtain encrypted data.
In the embodiment of the invention, Event data is acquired at a plurality of data source ends in real time, and the Event data acquired in real time is filtered and cleaned in a stream processing mode to obtain key data. An elliptic curve E is then generated based on the ECC lightweight encryption algorithm, and can be defined by the formula:
$E: y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$
where $a_1, a_2, a_3, a_4, a_6 \in K$ and $\Delta \neq 0$; $K$ is the field of rational numbers, and $\Delta$ is the discriminant of the elliptic curve equation, which is defined as:
$\Delta = -d_2^2 d_8 - 8 d_4^3 - 27 d_6^2 + 9 d_2 d_4 d_6$, with $d_2 = a_1^2 + 4 a_2$, $d_4 = 2 a_4 + a_1 a_3$, $d_6 = a_3^2 + 4 a_6$, $d_8 = a_1^2 a_6 + 4 a_2 a_6 - a_1 a_3 a_4 + a_2 a_3^2 - a_4^2$
When the elliptic curve E satisfies the discriminant condition, the equation of E is called a Weierstrass equation. Simplifying the Weierstrass equation gives the general expression of an elliptic curve:
$y^2 = x^3 + a x + b$
where $a, b \in F_p$; $F_p$ is a finite field and $p$ is a large prime number.
ECC encryption first requires selecting an elliptic curve $E_p(a, b)$ and taking a point on the curve as the base point $G$. A private key $k$ is then selected and the public key $K = kG$ is generated, where $K$ and $G$ are points on the elliptic curve $E_p(a, b)$. Finally, the plaintext is embedded into a point of the elliptic curve and encrypted with the public key $K$, which completes the data encryption processing and yields the encrypted data. The ECC encryption algorithm (Algorithm 1) is as follows:
Input: data to be encrypted
Output: encrypted data
Generate the elliptic curve $E_p(a, b)$
Obtain the elliptic curve group
while (x <= p - 1) do
    compute y from $y^2 \equiv x^3 + a x + b \pmod{p}$
    get all points $(x, y)$ that satisfy the equation
    obtain a base point $G$
    getBody: obtain the event body content
    map the byte information onto the elliptic curve to realize encryption
end while
return encrypted data
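The ECC steps above (enumerate the points of $y^2 = x^3 + ax + b$ over $F_p$, pick a base point, derive $K = kG$, map bytes to points and encrypt) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the patent: the toy parameters p = 1009, a = 2, b = 3, the table-lookup byte embedding, the choice of the first enumerated point as base point, and the use of an ElGamal-style scheme (ciphertext pair $(rG, M + rK)$) as the public-key encryption step. The field is far too small to offer any security.

```python
import secrets

# Toy parameters (assumptions for illustration; far too small to be secure).
P, A, B = 1009, 2, 3            # curve y^2 = x^3 + 2x + 3 over F_1009

def is_nonsingular(a, b, p):
    """Discriminant condition for the short form: 4a^3 + 27b^2 != 0 (mod p)."""
    return (4 * a**3 + 27 * b**2) % p != 0

def curve_points(a, b, p):
    """Enumerate every affine point, mirroring the loop over x = 0 .. p-1."""
    roots = {}
    for y in range(p):
        roots.setdefault(y * y % p, []).append(y)
    return [(x, y) for x in range(p)
            for y in roots.get((x**3 + a * x + b) % p, [])]

def add(p1, p2, a=A, p=P):
    """Elliptic-curve group law; None stands for the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def neg(pt, p=P):
    """Additive inverse of a point."""
    return None if pt is None else (pt[0], -pt[1] % p)

def mul(k, pt):
    """Double-and-add scalar multiplication k * pt."""
    result = None
    while k:
        if k & 1:
            result = add(result, pt)
        pt = add(pt, pt)
        k >>= 1
    return result

POINTS = curve_points(A, B, P)
BYTE_TO_POINT = POINTS[:256]                 # hypothetical byte-to-point embedding
POINT_TO_BYTE = {pt: i for i, pt in enumerate(BYTE_TO_POINT)}
G = POINTS[0]                                # base point (arbitrary choice here)

def keygen():
    k = 1 + secrets.randbelow(P - 1)         # private key k
    return k, mul(k, G)                      # public key K = kG

def encrypt(data: bytes, pub):
    """Encrypt each byte as the pair (rG, M + rK) with a fresh random r."""
    out = []
    for byte in data:
        r = 1 + secrets.randbelow(P - 1)
        out.append((mul(r, G), add(BYTE_TO_POINT[byte], mul(r, pub))))
    return out

def decrypt(cipher, k):
    """Recover M = C2 - k * C1 and map each point back to its byte."""
    return bytes(POINT_TO_BYTE[add(c2, neg(mul(k, c1)))] for c1, c2 in cipher)
```

A round trip such as `k, K = keygen(); decrypt(encrypt(b"event body", K), k)` returns the original bytes. A production system would rely on a vetted cryptographic library and a standardized curve rather than hand-written arithmetic.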
It can be understood that an Event, as the basic unit of data transmitted by Flume, is composed of a header and a body, and a sequence of continuous Events constitutes the streaming data in Flume. The body of an Event contains the key information to be transmitted. The encryption interceptor acquires the key information and calls the encryption algorithm to encrypt it, and the encrypted data continues to be transmitted in the form of a data stream. The interceptor algorithm is:
Input: Events
Output: Encrypted events
List<Event>: create an Event collection object to store the incoming data stream
if event != null then
    add(event): add the event to the Event collection object
    while List<Event> != null (the Event collection object is not empty) do
        call Algorithm 1 to encrypt the event body data
    end while
end if
return Encrypted events
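The interceptor logic above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Event` class, the `intercept` function and the `xor_placeholder` cipher are hypothetical names introduced here, and the XOR routine merely stands in for the ECC encryption of Algorithm 1 so that the sketch stays self-contained. A real Flume interceptor would be written against Flume's Java interceptor interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

@dataclass
class Event:
    """Hypothetical stand-in for a Flume event: headers plus a byte body."""
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""

def intercept(events: Iterable[Optional[Event]],
              encrypt_body: Callable[[bytes], bytes]) -> List[Event]:
    """Collect non-null events and encrypt only the body of each one."""
    encrypted = []
    for event in events:
        if event is None:          # skip null events, as in the pseudocode
            continue
        # Headers are left readable for routing; only the body is encrypted.
        encrypted.append(Event(dict(event.headers), encrypt_body(event.body)))
    return encrypted

def xor_placeholder(body: bytes) -> bytes:
    """Placeholder cipher (NOT secure); stands in for the ECC algorithm."""
    return bytes(b ^ 0x5A for b in body)

stream = [Event({"source": "sensor-1"}, b"temperature=21"), None,
          Event({"source": "sensor-2"}, b"temperature=19")]
for ev in intercept(stream, xor_placeholder):
    print(ev.headers, ev.body)
```

Passing the cipher in as a callable keeps the interceptor independent of the encryption scheme, so the placeholder can be swapped for a real implementation without changing the event-handling loop.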
Step two: construct a multi-source streaming data transmission channel by using Flume and Kafka, and perform encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel.
In the embodiment of the invention, the Flume proxy server at the data receiving end generates a private key and a certificate signed by a trusted key, which ensures that the data receiving end is trustworthy. The Flume receiving-end proxy server creates a trust repository to verify the authenticity of the key. The key and the certificate are then used to create a PKCS12 file that encrypts the transmission channel, over which the encrypted data of the plurality of data source ends is transmitted. The workflow of the encrypted channel is as follows:
while channel connection do
    openssl genrsa -des3 -out flume.key    (create the private key for Flume)
    if a certificate signing request exists then
        create a self-signed certificate
    else
        generate a certificate signing request from the key, then create a self-signed certificate
    end if
    create a trust store for Flume to verify the authenticity of the key
    openssl pkcs12 -export -in Flume02.crt    (use the key and certificate to create a PKCS12 file that encrypts the Flume channel)
end while
Streaming data generally comes from a plurality of data sources. To transmit and store the data more efficiently, a plurality of Flume processes are used to collect and aggregate the Event data of the plurality of data sources and store it in the big data storage system in a centralized manner. Kafka receives the encrypted data transmitted by the Flume channel, serializes it, and caches it when the data transmission volume is large, so that the streaming data can be stored in the distributed storage system efficiently and stably. The core of source data storage is a reasonable docking scheme between Flume and Kafka, which ensures that the transmission channel writes data into the storage system quickly, stably, at high throughput and in a specific order. Kafka transmits streaming data in serialized form; in addition, when the data volume is large, Kafka acts as a cache and reduces the probability of HDFS write failures at data peaks. The core configuration for connecting Flume to Kafka is as follows:
# Define the sink type
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = ****
# Topic name, used to distinguish the collected messages
a1.sinks.k1.kafka.topic = test
a1.sinks.k1.serializer.class = kafka.serializer.StringEncoder
# Memory-based channel
a1.channels.c1.type = memory
Thus, the Flume and Kafka docking scheme serves the following functions in transmitting data:
(1) Buffering and peak clipping: when the data transmitted by the upstream Flume has burst traffic, it can exceed the throughput of the downstream storage system and cause data loss. Kafka is used in the intermediate stage of data transmission as a buffer: the data is temporarily stored in Kafka, and the downstream storage system reads it at a set processing speed, so that the data is transmitted smoothly to the storage system;
(2) Robustness: the Kafka message queue can accumulate requests, so a short failure of the downstream storage system does not cause data loss;
(3) Ensuring the integrity of the data: to ensure that data sent by a producer is delivered completely and reliably to the specified topic, each partition of the topic sends an ack (acknowledgement of receipt) to the producer after receiving the producer's data. After the producer sends a message, the followers of the specified topic partition synchronize the data received by the leader, and once all followers have completed synchronization, the ack is sent to the producer to confirm that all the data has been received. If the producer receives the ack, it proceeds to send the next batch of data; if no ack is received, the producer concludes that the data was not completely delivered and resends it until an ack is received, which ensures the integrity and reliability of the streaming data transmission.
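The acknowledgement-and-resend behaviour described in item (3) can be illustrated with a small in-memory simulation. The sketch below does not use Kafka or any Kafka client library; the `Partition` class, the `send_with_retry` function, the loss rate and the retry limit are all hypothetical and chosen only to make the control flow visible.

```python
import random

class Partition:
    """Simulated topic partition: one leader log plus follower replicas."""

    def __init__(self, followers=2, loss_rate=0.3, seed=0):
        self.leader = []
        self.followers = [[] for _ in range(followers)]
        self._rng = random.Random(seed)
        self._loss_rate = loss_rate

    def receive(self, record):
        """Return True (ack) only after leader and all followers hold the record."""
        if self._rng.random() < self._loss_rate:
            return False                      # simulated loss: nothing stored, no ack
        self.leader.append(record)
        for replica in self.followers:        # followers synchronize from the leader
            replica.append(record)
        return True

def send_with_retry(partition, record, max_retries=50):
    """Resend the record until an ack arrives; return the number of attempts."""
    for attempt in range(1, max_retries + 1):
        if partition.receive(record):
            return attempt
    raise RuntimeError("no ack received after %d attempts" % max_retries)

partition = Partition()
records = [f"event-{i}".encode() for i in range(20)]
attempts = [send_with_retry(partition, r) for r in records]
print("total send attempts:", sum(attempts))
```

Because the producer moves on to the next record only after an ack, the leader and every follower end up holding the same records in the same order, which is the integrity property the text describes. The simulation drops a request before anything is stored, so it does not model duplicate delivery caused by a lost ack.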
It can be understood that Flume is a distributed data acquisition system for efficiently collecting, aggregating and transmitting massive streaming data from a plurality of different data sources to a designated storage system. Its design is based on streaming data: log information, event data and the like can be collected from various website servers and stored in centralized storage systems such as HDFS and HBase. The Flume architecture processes streaming data by means of events and agents.
It can be understood that Kafka is a distributed publish-subscribe messaging system that processes streaming data efficiently. Subscribers subscribe to the topics they are interested in and receive the messages of the specified topics. Compared with conventional messaging systems, Kafka preserves the ordering of streaming data well and can also store data temporarily.
Step three: construct a data lake by using Hadoop, and compress the encrypted data obtained by encrypted transmission and store it in partitions based on the LZO compression technology.
In the embodiment of the invention, Hadoop is used to construct the data lake, and HDFS is adopted for the underlying storage. To ensure that the streaming data stored in the data lake can later be searched efficiently, a time-based partitioning scheme is designed, in which the timestamp of data transmission is used as the partition name. In addition, the space occupied by the data is reduced through the LZO compression technology. In distributed storage, the most commonly used compression formats are LZO and Snappy. When Hadoop processes data, splitting the input is one of the operations that improve its efficiency; Snappy does not support splitting, whereas LZO supports splitting and has a higher compression speed. However, Hadoop does not support LZO compression out of the box, so the LZO algorithm needs to be recompiled with the Hadoop source code, and the Hadoop functions are redefined according to the partitioning and compression requirements so that the data can be compressed more efficiently. The core configuration after recompilation is as follows:
# sink description
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://Hadoop1:9000/flume/%Y%m%d/%H
# prefix of uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# whether to roll folders by time
a2.sinks.k2.hdfs.round = true
# number of time units after which a new folder is created
a2.sinks.k2.hdfs.roundValue = 1
# the time unit
a2.sinks.k2.hdfs.roundUnit = hour
# whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# file type that supports compression
a2.sinks.k2.hdfs.fileType = CompressedStream
a2.sinks.k2.hdfs.codeC = lzop
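To make the time-based partitioning concrete, the following sketch shows how a timestamp would map to a partition directory under the path pattern `%Y%m%d/%H` with rounding down to one hour, as configured above. It is an illustration only: the function `partition_path` is hypothetical and does not call Flume or HDFS; the base URI is copied from the configuration.

```python
from datetime import datetime


def partition_path(ts: datetime, base: str = "hdfs://Hadoop1:9000/flume") -> str:
    """Round ts down to the hour and expand the %Y%m%d/%H pattern."""
    rounded = ts.replace(minute=0, second=0, microsecond=0)
    return f"{base}/{rounded.strftime('%Y%m%d/%H')}"


# Two events from the same hour land in the same partition directory.
early = partition_path(datetime(2022, 11, 16, 14, 0, 0))
late = partition_path(datetime(2022, 11, 16, 14, 37, 5))
print(late)
print(early == late)
```

With one directory per hour, a query restricted to a time range only needs to read the matching directories rather than scan the whole data lake.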
It can be understood that a data lake is a novel big data storage scheme that can store massive raw data, supports any data format, and offers good data analysis and processing capability. Its main characteristics are low data storage cost, high data fidelity, good accessibility and flexible data analysis; raw data can be stored directly in the data lake without specifying the purpose of the data in advance.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the sequence indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium and which, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several embodiments of the present invention, and their description is specific and detailed, but they are not to be understood as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A method for secure transmission and storage of streaming data based on a data lake, characterized by comprising the following steps:
acquiring Event data at a plurality of data source ends, filtering and cleaning the Event data to obtain key data, and encrypting the key data to obtain encrypted data;
constructing a multi-source streaming data transmission channel by using Flume and Kafka, and transmitting the encrypted data of the plurality of data source ends in encrypted form through the multi-source streaming data transmission channel;
and constructing a data lake by using Hadoop, and compressing and storing in partitions the encrypted data obtained by encrypted transmission, based on LZO compression technology.
2. The streaming data safe transmission and storage method based on the data lake as claimed in claim 1, wherein the steps of acquiring Event data at a plurality of data source ends, filtering and cleaning the Event data to obtain key data, and encrypting the key data to obtain encrypted data specifically include:
acquiring Event data of a plurality of data source ends in real time;
filtering and cleaning Event data acquired in real time in a stream processing mode to obtain key data;
and based on an ECC (elliptic curve cryptography) lightweight encryption algorithm, performing data encryption on the key data to obtain encrypted data.
3. The method for secure transmission and storage of streaming data based on a data lake according to claim 2, wherein performing data encryption on the key data based on the ECC lightweight encryption algorithm to obtain encrypted data specifically comprises the following steps:
generating an elliptic curve E based on the ECC lightweight encryption algorithm, acquiring an elliptic group, and completing initialization of the encryption algorithm;
calculating all points satisfying the elliptic curve E and obtaining a base point [Figure DEST_PATH_IMAGE001];
mapping byte information of the key data onto the elliptic curve E to perform data encryption and obtain encrypted data.
4. The method for secure transmission and storage of streaming data based on a data lake according to claim 3, wherein the elliptic curve E is defined as:
[Figure DEST_PATH_IMAGE002]
wherein [Figure DEST_PATH_IMAGE003] and [Figure DEST_PATH_IMAGE004]; [Figure DEST_PATH_IMAGE005] is a field of rational numbers; and [Figure DEST_PATH_IMAGE006] is the discriminant of the elliptic curve equation, which is defined as:
[Figure DEST_PATH_IMAGE007]
5. The method for secure transmission and storage of streaming data based on a data lake according to claim 4, wherein the simplified formula of the elliptic curve E is:
[Figure DEST_PATH_IMAGE008]
wherein [Figure DEST_PATH_IMAGE009]; [Figure DEST_PATH_IMAGE010] is a finite field and p is a large prime number.
6. The method for secure transmission and storage of streaming data based on a data lake according to claim 1, wherein constructing a multi-source streaming data transmission channel by using Flume and Kafka, and transmitting the encrypted data of the plurality of data source ends in encrypted form through the multi-source streaming data transmission channel, specifically comprises the following steps:
generating a private key by a Flume proxy server at the data receiving end;
generating, by the Flume proxy server at the data receiving end, a certificate signed by a trusted key;
establishing a trust repository by the Flume proxy server at the data receiving end;
and creating a PKCS12 file from the key and the certificate to encrypt the Flume channel, and transmitting the encrypted data of the plurality of data source ends in encrypted form.
7. The method for secure transmission and storage of streaming data based on a data lake according to claim 6, wherein constructing a multi-source streaming data transmission channel by using Flume and Kafka, and transmitting the encrypted data of the plurality of data source ends in encrypted form through the multi-source streaming data transmission channel, further comprises:
collecting and aggregating the Event data of the corresponding data source ends by using a plurality of Flume processes, and storing the Event data centrally in the big data storage system.
8. The method for secure transmission and storage of streaming data based on a data lake according to claim 6, wherein constructing a multi-source streaming data transmission channel by using Flume and Kafka, and transmitting the encrypted data of the plurality of data source ends in encrypted form through the multi-source streaming data transmission channel, further comprises:
using Kafka to receive the encrypted data transmitted over the Flume channel, serializing the encrypted data, and buffering the data when the data transmission volume is large.
9. The method for secure transmission and storage of streaming data based on a data lake according to claim 6, wherein constructing a multi-source streaming data transmission channel by using Flume and Kafka, and transmitting the encrypted data of the plurality of data source ends in encrypted form through the multi-source streaming data transmission channel, specifically comprises:
in the intermediate stage of the transmission of the encrypted data, using Kafka to buffer the data, temporarily storing the encrypted data in Kafka, and storing the data in a downstream storage system at a set processing speed.
10. The streaming data safe transmission and storage method based on the data lake as claimed in claim 1, wherein the step of constructing the data lake by using Hadoop and compressing and storing the encrypted data obtained by encrypted transmission in a partitioned manner based on LZO compression technology specifically comprises the following steps:
constructing a data lake by using Hadoop, and adopting HDFS for bottom storage;
taking a timestamp of data transmission as a partition name, and performing partition storage on encrypted data;
and recompiling the LZO algorithm by using a Hadoop source code, and compressing the encrypted data stored in the partition according to the recompiled LZO algorithm.
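The curve initialization steps recited in claims 3 to 5 (generate a curve, enumerate the points that satisfy it, pick a base point) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the formula images are not reproduced in this text, so the sketch assumes the common short form y^2 = x^3 + ax + b over a prime field, and it uses the textbook toy parameters p = 17, a = 2, b = 2 with base point (5, 1). A real deployment would use a standardized curve with a large prime, and the sketch does not implement the byte-to-point mapping or the encryption itself.

```python
P_MOD, A, B = 17, 2, 2   # toy parameters (assumed), far too small for real use

# Non-singularity check: 4a^3 + 27b^2 must be non-zero modulo p.
assert (4 * A**3 + 27 * B**2) % P_MOD != 0


def on_curve(x: int, y: int) -> bool:
    return (y * y - (x**3 + A * x + B)) % P_MOD == 0


def add(p1, p2):
    """Add two curve points; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                      # p2 is the inverse of p1
    if p1 == p2:                         # tangent slope (point doubling)
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                                # chord slope
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k: int, point):
    """Compute k * point with double-and-add."""
    result, addend = None, point
    while k:
        if k & 1:
            result = add(result, addend)
        addend = add(addend, addend)
        k >>= 1
    return result


# All affine points satisfying the curve equation.
points = [(x, y) for x in range(P_MOD) for y in range(P_MOD) if on_curve(x, y)]
G = (5, 1)                               # assumed base point
order = next(n for n in range(1, len(points) + 2) if scalar_mult(n, G) is None)
print(len(points), G, order)
```

In this toy group the 18 affine points plus the point at infinity give a group of prime order 19, so the chosen base point generates every point on the curve.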
CN202211429921.0A 2022-11-16 2022-11-16 Streaming data safe transmission and storage method based on data lake Pending CN115499244A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211429921.0A CN115499244A (en) 2022-11-16 2022-11-16 Streaming data safe transmission and storage method based on data lake

Publications (1)

Publication Number Publication Date
CN115499244A true CN115499244A (en) 2022-12-20

Family

ID=85115757

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211429921.0A Pending CN115499244A (en) 2022-11-16 2022-11-16 Streaming data safe transmission and storage method based on data lake

Country Status (1)

Country Link
CN (1) CN115499244A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20221220)