CN115499244A - Streaming data safe transmission and storage method based on data lake - Google Patents
- Publication number
- CN115499244A (application CN202211429921.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- transmission
- encrypted
- streaming
- source
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/04—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
- H04L63/0428—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
- H04L63/0442—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload wherein the sending and receiving network entities apply asymmetric encryption, i.e. different keys for encryption and decryption
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24552—Database cache management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24568—Data stream processing; Continuous queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/64—Protecting data integrity, e.g. using checksums, certificates or signatures
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/02—Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
- H04L63/0227—Filtering policies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/12—Applying verification of the received information
- H04L63/123—Applying verification of the received information received data contents, e.g. message integrity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/18—Network architectures or network communication protocols for network security using different networks or channels, e.g. using out of band channels
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/30—Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy
- H04L9/3066—Public key, i.e. encryption algorithm being computationally infeasible to invert or user's encryption keys not requiring secrecy involving algebraic varieties, e.g. elliptic or hyper-elliptic curves
Abstract
The invention relates to the technical field of data processing, and in particular discloses a method for the secure transmission and storage of streaming data based on a data lake. In the method, Event data are acquired at a plurality of data source ends, the Event data are filtered and cleaned to obtain key data, and the key data are encrypted to obtain encrypted data; a multi-source streaming data transmission channel is constructed using Flume and Kafka, and the encrypted data of the plurality of data source ends are transmitted in encrypted form through this channel; and a data lake is constructed using Hadoop, in which the transmitted encrypted data are compressed with the LZO compression technology and stored in partitions. The method filters and encrypts key information at the data source end, protects the data during transmission, stores streaming data securely and efficiently without destroying the original structure of the data, and speeds up data queries while reducing the storage space occupied.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a streaming data safe transmission and storage method based on a data lake.
Background
Big data refers to a data set that cannot be captured, managed and processed by conventional software tools within a certain time range. It is a massive information asset that only yields greater value when new processing modes are adopted, and it is characterized by large volume, many types and strong timeliness. Research on big data generally covers data acquisition, data storage, data analysis and the like. Data storage sits at the core of data processing: data must be stored securely and efficiently before it can be analyzed and mined.
Streaming data consists of a real-time, continuous and ordered sequence of data items, and is large in scale and difficult to predict. In addition, the format of streaming data is complex, and the data are at risk of being damaged, tampered with or leaked during transmission and storage. The transmission and storage of streaming data therefore face greater security challenges than those of static data, and the protection of streaming data needs to cover the whole process from the source where the data are produced to the storage system.
The security threats faced during streaming data transmission and storage fall into three main aspects: (1) Before data are transmitted, they are exposed to the risk of leakage if they are not protected. After streaming data are generated, they are first cached locally and then transmitted through the network. Unencrypted data are stored on the local server in plaintext, so an unauthorized user can read the data by inspecting the cache and can even tamper with them. (2) The data transmission process faces security threats such as eavesdropping and tampering. By monitoring the data transmission port, an attacker can obtain the data in transit and intercept or tamper with them. (3) Streaming data face the risk of loss in the storage system. As time goes on, the scale of streaming data keeps growing, and traditional data storage systems cannot meet the storage requirements of massive streaming data. In addition, streaming data generally include structured, semi-structured and unstructured data, and the traditional structured database storage mode is not suitable for streaming data of such complex types.
Disclosure of Invention
The embodiment of the invention aims to provide a streaming data safe transmission and storage method based on a data lake, and aims to solve the problems in the background art.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
a streaming data safe transmission and storage method based on a data lake specifically comprises the following steps:
acquiring Event data at a plurality of data source ends, filtering and cleaning the Event data to obtain key data, and encrypting the key data to obtain encrypted data;
constructing a multi-source streaming data transmission channel by using Flume and Kafka, and transmitting the encrypted data of the plurality of data source ends in encrypted form through the multi-source streaming data transmission channel;
and constructing a data lake by using Hadoop, and compressing and storing the encrypted data obtained by encryption transmission in a partitioning manner based on an LZO compression technology.
As a further limitation of the technical solution of the embodiment of the present invention, the acquiring Event data at a plurality of data source ends, filtering and cleaning the Event data to obtain key data, and encrypting the key data to obtain encrypted data specifically includes the following steps:
acquiring Event data of a plurality of data source ends in real time;
filtering and cleaning Event data acquired in real time in a stream processing mode to obtain key data;
and encrypting the key data based on an ECC (elliptic curve cryptography) lightweight encryption algorithm to obtain encrypted data.
As a further limitation of the technical solution of the embodiment of the present invention, the encrypting the key data based on the ECC lightweight encryption algorithm to obtain the encrypted data specifically includes the following steps:
generating an elliptic curve E based on an ECC lightweight encryption algorithm, acquiring an elliptic group, and finishing initialization of the encryption algorithm;
and mapping the byte information of the key data onto the elliptic curve E to encrypt the data and obtain encrypted data.
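As a further limitation of the technical solution of the embodiment of the present invention, the elliptic curve E is defined as follows:
$$E:\; y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$$
wherein $a_1, a_2, a_3, a_4, a_6 \in \mathbb{Q}$ and $\Delta \neq 0$; $\mathbb{Q}$ is the field of rational numbers, and $\Delta$ is the discriminant of the elliptic curve equation, which is defined as:
$$\Delta = -b_2^2 b_8 - 8 b_4^3 - 27 b_6^2 + 9 b_2 b_4 b_6$$
with $b_2 = a_1^2 + 4a_2$, $b_4 = 2a_4 + a_1 a_3$, $b_6 = a_3^2 + 4a_6$ and $b_8 = a_1^2 a_6 + 4 a_2 a_6 - a_1 a_3 a_4 + a_2 a_3^2 - a_4^2$.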
As a further limitation of the technical solution of the embodiment of the present invention, the simplified formula of the elliptic curve E is as follows:
$$y^2 = x^3 + a x + b \pmod{p}$$
wherein $a, b \in F_p$, $F_p$ is a finite field and $p$ is a large prime number.
As a further limitation of the technical solution of the embodiment of the present invention, the constructing a multi-source streaming data transmission channel by using Flume and Kafka, and the performing encrypted transmission on the encrypted data at the multiple data sources through the multi-source streaming data transmission channel specifically includes the following steps:
a Flume proxy server at the data receiving end generates a private key;
the Flume proxy server at the data receiving end generates a certificate signed by a trusted key;
the Flume proxy server at the receiving end creates a trust store;
and a PKCS12 file is created from the key and the certificate to encrypt the Flume channel, through which the encrypted data of the plurality of data source ends are transmitted in encrypted form.
As a further limitation of the technical solution of the embodiment of the present invention, constructing a multi-source streaming data transmission channel by using Flume and Kafka, and performing encrypted transmission on encrypted data at multiple data source ends through the multi-source streaming data transmission channel further includes:
collecting and aggregating the Event data corresponding to the plurality of data source ends by using a plurality of Flume processes, and storing the Event data centrally in the big data storage system.
As a further limitation of the technical solution of the embodiment of the present invention, constructing a multi-source streaming data transmission channel by using Flume and Kafka, and performing encrypted transmission on encrypted data at multiple data source ends through the multi-source streaming data transmission channel further includes:
using Kafka to receive the encrypted data transmitted through the Flume channel, serializing the encrypted data, and caching the data when the data transmission volume is large.
As a further limitation of the technical solution of the embodiment of the present invention, the constructing a multi-source streaming data transmission channel by using Flume and Kafka, and the performing encryption transmission on the encrypted data at the multiple data source ends through the multi-source streaming data transmission channel specifically includes:
and in the intermediate stage of the transmission of the encrypted data, using Kafka to buffer the data, temporarily storing the encrypted data in the Kafka, and storing the data in a downstream storage system according to a set processing speed.
As a further limitation of the technical solution of the embodiment of the present invention, the step of constructing a data lake by using Hadoop and compressing and performing partitioned storage processing on encrypted data obtained by encryption transmission based on LZO compression technology specifically includes the following steps:
constructing a data lake by using Hadoop, wherein HDFS is adopted for bottom storage;
taking a timestamp of data transmission as a partition name, and performing partition storage on encrypted data;
and recompiling the LZO algorithm by using a Hadoop source code, and compressing the encrypted data stored in the partition according to the recompiled LZO algorithm.
Compared with the prior art, the invention has the beneficial effects that:
(1) The data source end filters and encrypts the data in a stream processing mode, so that the transmission efficiency of the data is ensured while key information is not leaked;
(2) A multi-data-source safe transmission channel based on Flume is constructed, so that the safety of streaming data in the network transmission process is ensured;
(3) A secure and efficient data lake storage system is constructed using big data components such as Flume, Kafka and Hadoop, and Kafka is used to guarantee the integrity of the data stream.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention.
Fig. 1 shows a schematic diagram of security threats presented by streaming data.
Fig. 2 shows a schematic diagram of a method provided by an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It can be understood that Fig. 1 is a schematic diagram of the security threats to which streaming data are exposed. In the prior art, the security threats faced during the transmission and storage of streaming data fall into three main aspects: (1) Before data are transmitted, they are exposed to the risk of leakage if they are not protected. After streaming data are generated, they are first cached locally and then transmitted through a network. Unencrypted data are stored on the local server in plaintext, so an unauthorized user can read the data by inspecting the cache and can even tamper with them. (2) The data transmission process faces security threats such as eavesdropping and tampering. By monitoring the data transmission port, an attacker can obtain the data in transit and intercept or tamper with them.
(3) Streaming data are exposed to the risk of loss in storage systems. As time passes, the scale of streaming data keeps growing, and traditional data storage systems cannot meet the storage requirements of massive streaming data. In addition, streaming data generally include structured, semi-structured and unstructured data, and the traditional structured database storage mode is not suitable for streaming data of such complex types.
To solve these problems, the embodiment of the invention acquires Event data at a plurality of data source ends, filters and cleans the Event data to obtain key data, and encrypts the key data to obtain encrypted data; constructs a multi-source streaming data transmission channel using Flume and Kafka, and transmits the encrypted data of the plurality of data source ends in encrypted form through this channel; and constructs a data lake using Hadoop, in which the transmitted encrypted data are compressed with the LZO compression technology and stored in partitions. The method filters and encrypts key information at the data source end, protects the data during transmission, stores streaming data securely and efficiently without destroying the original structure of the data, and speeds up data queries while reducing the storage space occupied.
Fig. 2 shows a schematic diagram of a method provided by an embodiment of the invention.
Specifically, in a preferred embodiment provided by the present invention, a streaming data secure transmission and storage method based on a data lake specifically includes the following steps:
Step 1: acquiring Event data at a plurality of data source ends, filtering and cleaning the Event data to obtain key data, and encrypting the key data to obtain encrypted data.
In the embodiment of the invention, Event data are acquired at a plurality of data source ends in real time, and the Event data acquired in real time are filtered and cleaned in a stream processing mode to obtain key data. An elliptic curve E is then generated based on the ECC lightweight encryption algorithm, and can be defined by the formula:
$$E:\; y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$$
wherein $a_1, a_2, a_3, a_4, a_6 \in \mathbb{Q}$ and $\Delta \neq 0$; $\mathbb{Q}$ is the field of rational numbers, and $\Delta$ is the discriminant of the elliptic curve equation, which is defined as:
$$\Delta = -b_2^2 b_8 - 8 b_4^3 - 27 b_6^2 + 9 b_2 b_4 b_6$$
with $b_2 = a_1^2 + 4a_2$, $b_4 = 2a_4 + a_1 a_3$, $b_6 = a_3^2 + 4a_6$ and $b_8 = a_1^2 a_6 + 4 a_2 a_6 - a_1 a_3 a_4 + a_2 a_3^2 - a_4^2$.
When the elliptic curve E satisfies the discriminant condition, it can be written in the simplified form:
$$y^2 = x^3 + a x + b \pmod{p}$$
wherein $a, b \in F_p$, $F_p$ is a finite field, and $p$ is a large prime number.
ECC encryption first requires selecting an elliptic curve $E_p(a, b)$ and taking a point on the curve as the base point $G$. A private key $k$ is then selected and the public key $K = kG$ is generated, where $K$ and $G$ are points on the elliptic curve $E_p(a, b)$. Finally, the plaintext is embedded into a point of the elliptic curve and encrypted with the public key $K$, which completes the data encryption and yields the encrypted data. The ECC encryption algorithm is as follows:
Algorithm 1: ECC encryption
Input: data to be encrypted
Output: encrypted data
while (x <= p - 1) do
    getBody()    // obtain the Event body content
    map the byte information onto the elliptic curve    // this realizes the encryption
end while
return encrypted data
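As an illustration only, the mapping-and-encryption step of Algorithm 1 can be sketched in Python as a toy elliptic-curve ElGamal scheme over the simplified curve form. The curve parameters (P = 4099, A = 2, B = 3, for which 4A^3 + 27B^2 is non-zero modulo P), the base point G = (3, 6), the Koblitz-style byte embedding and all function names are assumptions chosen for readability and are not taken from the embodiment; a field this small provides no real security.

```python
import secrets

# Toy curve y^2 = x^3 + A*x + B over F_P (all parameters are assumptions).
P, A, B = 4099, 2, 3      # P is prime and P % 4 == 3, so square roots are easy
G = (3, 6)                # base point: 6^2 = 36 = 3^3 + 2*3 + 3
TRIES = 16                # embedding tries x = byte * TRIES + j, j < TRIES
INF = None                # point at infinity (identity of the group)

def add(p1, p2):
    """Add two curve points with the chord-and-tangent rule."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return INF                       # p2 is the inverse of p1
    if p1 == p2:
        s = (3 * x1 * x1 + A) * pow(2 * y1 % P, -1, P) % P
    else:
        s = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (s * s - x1 - x2) % P
    return (x3, (s * (x1 - x3) - y1) % P)

def mul(k, pt):
    """Scalar multiplication k*pt by double-and-add."""
    acc = INF
    while k:
        if k & 1:
            acc = add(acc, pt)
        pt = add(pt, pt)
        k >>= 1
    return acc

def embed(byte):
    """Map one byte to a curve point (Koblitz-style embedding)."""
    for j in range(TRIES):
        x = byte * TRIES + j
        rhs = (x * x * x + A * x + B) % P
        y = pow(rhs, (P + 1) // 4, P)    # candidate square root of rhs
        if y * y % P == rhs:
            return (x, y)
    raise ValueError("no curve point found for this byte")

def keygen():
    k = 1 + secrets.randbelow(P - 1)     # private key k
    return k, mul(k, G)                  # public key K = kG

def encrypt(data, pub):
    """Encrypt each byte as the pair (rG, M + rK) with a fresh random r."""
    out = []
    for byte in data:
        r = 1 + secrets.randbelow(P - 1)
        out.append((mul(r, G), add(embed(byte), mul(r, pub))))
    return out

def decrypt(cipher, k):
    """Recover M = C2 - k*C1 and read the byte back from its x coordinate."""
    plain = bytearray()
    for c1, c2 in cipher:
        s = mul(k, c1)
        m = add(c2, INF if s is INF else (s[0], -s[1] % P))
        plain.append(m[0] // TRIES)
    return bytes(plain)
```

In this sketch a sender would call `encrypt(body, pub)` with the receiver's public key, and the receiver would call `decrypt(cipher, k)` with the matching private key. A production system would rely on a vetted cryptographic library and a standardized curve instead.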
It can be understood that an Event, the basic unit of data transmitted by Flume, consists of a header and a body, and that a sequence of consecutive Events constitutes the streaming data in Flume. The body of an Event contains the key information to be transmitted. The encryption interceptor obtains this key information and calls the encryption algorithm to encrypt it, and the encrypted data continue to be transmitted as a data stream. The interceptor algorithm is as follows:
Input: Events
Output: Encrypted events
List<Event> events    // create an Event collection object to store the incoming data stream
if event != null then
    events.add(event)    // add the event to the Event collection
    while events != null do    // the Event collection is not empty
        call Algorithm 1 to encrypt the Event body data
    end while
end if
return Encrypted events
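The interceptor logic above can be sketched in Python as follows. This is a minimal illustration rather than the Flume interceptor API: the `Event` class, the `intercept` function and the `encrypt_body` callback are hypothetical names, and any byte-level cipher can be plugged in for `encrypt_body`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

@dataclass
class Event:
    """Hypothetical stand-in for a Flume Event: a header map plus a body."""
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""

def intercept(events: Iterable[Optional[Event]],
              encrypt_body: Callable[[bytes], bytes]) -> List[Event]:
    """Encrypt the body of every non-null event and keep its headers."""
    encrypted = []
    for event in events:
        if event is None:          # skip missing events, as in the null check
            continue
        encrypted.append(Event(dict(event.headers), encrypt_body(event.body)))
    return encrypted
```

Only the body is transformed, so the headers remain available to downstream components for routing.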
Step 2: constructing a multi-source streaming data transmission channel by using Flume and Kafka, and transmitting the encrypted data of the plurality of data source ends in encrypted form through the multi-source streaming data transmission channel.
In the embodiment of the invention, the Flume proxy server at the data receiving end generates a private key and a certificate signed by a trusted key, which ensures that the data receiving end is trustworthy. The Flume proxy server at the receiving end creates a trust store to verify the authenticity of the key, and the key and the certificate are used to create a PKCS12 file that encrypts the transmission channel, through which the encrypted data of the plurality of data source ends are transmitted. The workflow of the encrypted channel is as follows:
while channel connection do
    openssl genrsa -des3 -out flume.key    // create the private key
    if a certificate signing request exists then
        create a self-signed certificate
    else
        generate a certificate signing request from the key, then create a self-signed certificate
    end if
    create a trust store for Flume    // used to verify the authenticity of the key
    openssl pkcs12 -export -in Flume02.crt    // use the key and the certificate to create the PKCS12 file that encrypts the Flume channel
end while
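One possible way to spell out this workflow is the Python sketch below, which only builds the command lines and does not run them. The file names, the 2048-bit key size, the 365-day validity, the use of `keytool` for the trust store and the function name `channel_tls_commands` are all assumptions made for illustration; the embodiment itself only names the `genrsa` and `pkcs12` steps.

```python
def channel_tls_commands(name="flume", days=365):
    """Return the command lines for key, certificate, trust store and PKCS12."""
    key, csr, crt, p12 = (f"{name}.{ext}" for ext in ("key", "csr", "crt", "p12"))
    return [
        # 1. private key protected with Triple DES
        ["openssl", "genrsa", "-des3", "-out", key, "2048"],
        # 2. certificate signing request generated from the key
        ["openssl", "req", "-new", "-key", key, "-out", csr],
        # 3. self-signed certificate created from the request
        ["openssl", "x509", "-req", "-days", str(days),
         "-in", csr, "-signkey", key, "-out", crt],
        # 4. trust store holding the certificate
        ["keytool", "-importcert", "-alias", name,
         "-file", crt, "-keystore", f"{name}-truststore.jks"],
        # 5. PKCS12 bundle of key and certificate for the encrypted channel
        ["openssl", "pkcs12", "-export", "-in", crt, "-inkey", key, "-out", p12],
    ]
```

Each list could be passed to `subprocess.run` on a host where OpenSSL and a JDK are installed. The commands prompt for passphrases, so they are better run interactively or extended with explicit password options.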
Streaming data generally come from a plurality of data sources. To transmit and store the data more efficiently, a plurality of Flume processes are used to collect and aggregate the Event data of the plurality of data sources and store them centrally in the big data storage system. Kafka receives the encrypted data transmitted through the Flume channel and serializes them, and it caches the data when the transmission volume is large, so that the streaming data can be stored in the distributed storage system efficiently and stably. The core of multi-source data storage is a well-designed docking scheme between Flume and Kafka, which ensures that the transmission channel stores data into the storage system quickly, stably, with high throughput and in a specific order. Kafka transmits streaming data in serialized form and, when the data volume is large, also acts as a cache, which reduces the probability that HDFS writes fail during data peaks. The core configuration for docking Flume with Kafka is as follows:
# Define the sink type
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = ****
# Topic name, used to distinguish the collected messages
a1.sinks.k1.kafka.topic = test
a1.sinks.k1.serializer.class = kafka.serializer.StringEncoder
# Memory-based channel
a1.channels.c1.type = memory
Thus, the Flume and Kafka docking scheme has the following functions in transmitting data:
(1) Buffering and peak clipping: when the data transmitted by the upstream Flume arrive in a burst, they may exceed the throughput of the downstream storage system and cause data loss. Kafka is therefore used in the middle stage of data transmission as a buffer: the data are temporarily stored in Kafka, and the downstream storage system stores them at a set processing speed, so that the data are delivered smoothly to the storage system;
(2) Robustness: the Kafka message queue can accumulate requests, so a short failure of the downstream storage system does not cause data loss;
(3) Ensuring the integrity of the data: to ensure that data sent by a producer are delivered completely and reliably to the specified topic, each partition of the topic sends an ack (acknowledgement) to the producer after receiving the data. After the producer sends a message, the Followers of the specified topic synchronize the data received by the Leader, and once all Followers have completed synchronization, the topic sends the ack to the producer to confirm that all data have been received. If the producer receives the ack, it continues with the next round of data; if it does not, it concludes that the data were not completely delivered and resends them until an ack is received, which ensures the integrity and reliability of the streaming data transmission.
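The acknowledgement-and-resend behaviour described in item (3) can be illustrated with a small in-memory simulation. This is a sketch under simplifying assumptions, not Kafka client code: the `Partition` class, the `drop_first` parameter that simulates lost sends, and the `produce` function are hypothetical. In a real producer, this behaviour corresponds roughly to requesting acknowledgement from all in-sync replicas (`acks=all`) with retries enabled.

```python
class Partition:
    """Hypothetical partition with one Leader log and several Follower logs."""

    def __init__(self, followers, drop_first=0):
        self.leader = []
        self.followers = [[] for _ in range(followers)]
        self.drops_left = drop_first     # number of sends to lose (simulation)

    def append(self, msg):
        """Return True (ack) only once the Leader and all Followers hold msg."""
        if self.drops_left > 0:
            self.drops_left -= 1
            return False                 # message lost in transit: no ack
        self.leader.append(msg)
        for log in self.followers:
            log.append(msg)              # Follower synchronizes from the Leader
        return True

def produce(partition, messages, max_retries=3):
    """Send messages in order, resending each one until it is acknowledged."""
    sends = 0
    for msg in messages:
        for _ in range(1 + max_retries):
            sends += 1
            if partition.append(msg):
                break                    # ack received: move to the next message
        else:
            raise RuntimeError(f"no ack received for {msg!r}")
    return sends
```

Because a message is only appended after a successful send, resending in this simulation never creates duplicates. A real deployment would additionally need idempotent producers or downstream de-duplication, since an ack can be lost after the data were written.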
It can be understood that Flume is a distributed data acquisition system that efficiently collects, aggregates and transmits massive streaming data from a plurality of different data sources to a designated storage system. Its design is based on streaming data: log information, event data and the like can be collected from various website servers and stored in a centralized storage system such as HDFS or HBase. The Flume architecture processes streaming data by means of events (Events) and agents (Agents).
It can be understood that Kafka is a distributed publish-subscribe messaging system that processes streaming data efficiently. Subscribers subscribe to the Topics they are interested in and receive the messages of those Topics. Compared with conventional messaging systems, Kafka preserves the ordering of streaming data well and can store data temporarily.
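The buffering and peak-clipping role that Kafka plays between Flume and the storage system can be pictured with the following simulation. It is a simplified model rather than Kafka itself: the function name `drain_at_fixed_rate` and the notion of discrete ticks are assumptions made for illustration.

```python
from collections import deque

def drain_at_fixed_rate(bursts, rate):
    """Buffer bursty arrivals and release at most `rate` items per tick.

    `bursts` holds the items arriving at each tick; `rate` must be positive.
    Returns the batch written downstream at each tick, in arrival order.
    """
    buffer = deque()
    batches = []
    ticks = iter(bursts)
    while True:
        burst = next(ticks, None)
        if burst is None and not buffer:
            break                        # no more arrivals and nothing buffered
        buffer.extend(burst or [])       # temporary storage in the buffer
        batch = [buffer.popleft() for _ in range(min(rate, len(buffer)))]
        batches.append(batch)            # downstream writes at its own pace
    return batches
```

For example, a burst of five items drained at a rate of two per tick reaches the storage system as three successive batches, with no item lost and the original order preserved.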
Step 3: constructing a data lake by using Hadoop, compressing the encrypted data obtained by encrypted transmission with the LZO compression technology, and storing the data in partitions.
In the embodiment of the invention, a data lake is constructed using Hadoop, with HDFS as the underlying storage. To ensure that the streaming data stored in the data lake can later be retrieved efficiently, a time-based partitioning scheme is used, in which the timestamp of data transmission serves as the partition name. In addition, the space occupied by the data is reduced through the LZO compression technology. In distributed storage, the most commonly used compression formats are LZO and Snappy. Splitting is one of the operations by which Hadoop improves its processing efficiency, but Snappy does not support splitting, whereas LZO supports splitting and compresses quickly. Hadoop, however, does not natively support LZO compression, so the LZO algorithm needs to be recompiled against the Hadoop source code, and the relevant Hadoop functions need to be redefined according to the partitioning and compression requirements, so that the data can be compressed more efficiently. The core configuration after recompilation is as follows:
# Sink description
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://Hadoop1:9000/flume/%Y%m%d/%H
# Prefix of uploaded files
a2.sinks.k2.hdfs.filePrefix = logs-
# Whether to roll over to a new folder by time
a2.sinks.k2.hdfs.round = true
# Number of time units after which a new folder is created
a2.sinks.k2.hdfs.roundValue = 1
# Time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# File type that supports compression
a2.sinks.k2.hdfs.fileType = CompressedStream
# Compression codec
a2.sinks.k2.hdfs.codeC = lzop
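To make the time-based partition naming concrete, the following Python sketch shows how a timestamp could be rounded down to the configured unit and expanded into the path template used in the configuration above. It only imitates the effect of the `round`, `roundValue` and `roundUnit` settings and is not Flume's own code; the helper names `round_down` and `partition_path` are hypothetical.

```python
from datetime import datetime

# Path template taken from the sink configuration above.
TEMPLATE = "hdfs://Hadoop1:9000/flume/%Y%m%d/%H"


def round_down(ts: datetime, value: int, unit: str) -> datetime:
    """Round ts down to the nearest multiple of `value` in the given unit."""
    if unit == "hour":
        return ts.replace(hour=ts.hour - ts.hour % value,
                          minute=0, second=0, microsecond=0)
    if unit == "minute":
        return ts.replace(minute=ts.minute - ts.minute % value,
                          second=0, microsecond=0)
    if unit == "second":
        return ts.replace(second=ts.second - ts.second % value, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")


def partition_path(template: str, ts: datetime,
                   value: int = 1, unit: str = "hour") -> str:
    """Expand the escape sequences in the template using the rounded timestamp."""
    return round_down(ts, value, unit).strftime(template)


# An event transmitted at 14:37:05 on 2022-11-16 lands in the 14 o'clock folder.
path = partition_path(TEMPLATE, datetime(2022, 11, 16, 14, 37, 5))
print(path)  # hdfs://Hadoop1:9000/flume/20221116/14
```

With one folder per hour, a later query for a given time window only needs to open the folders whose names fall inside that window.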
It can be understood that a data lake is a new type of big data storage scheme that can store massive amounts of raw data, supports any data format, and offers good data analysis and processing capabilities. Its main characteristics are low data storage cost, high data fidelity, good accessibility and flexible data analysis; raw data can be stored directly in the data lake without specifying in advance what the data will be used for.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in the sequence indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least a portion of the steps in the various embodiments may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times; the sub-steps or stages are likewise not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium and which, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments merely represent several embodiments of the present invention; although their description is specific and detailed, it should not be construed as limiting the scope of the present invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (10)
1. A streaming data safe transmission and storage method based on a data lake is characterized by specifically comprising the following steps:
acquiring Event data at a plurality of data source ends, filtering and cleaning the Event data to obtain key data, and encrypting the key data to obtain encrypted data;
constructing a multi-source streaming data transmission channel by using Flume and Kafka, and performing encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel;
and constructing a data lake by using Hadoop, and compressing and storing the encrypted data obtained by encryption transmission in a partitioning manner based on an LZO compression technology.
2. The streaming data safe transmission and storage method based on the data lake as claimed in claim 1, wherein the steps of acquiring Event data at a plurality of data source ends, filtering and cleaning the Event data to obtain key data, and encrypting the key data to obtain encrypted data specifically include:
acquiring Event data of a plurality of data source ends in real time;
filtering and cleaning Event data acquired in real time in a stream processing mode to obtain key data;
and based on an ECC (elliptic curve cryptography) lightweight encryption algorithm, carrying out data encryption processing on the key data to obtain encrypted data.
3. The streaming data safe transmission and storage method based on the data lake as claimed in claim 2, wherein the ECC lightweight encryption algorithm is used to encrypt the key data, and the obtaining of the encrypted data specifically includes the following steps:
generating an elliptic curve E based on an ECC lightweight encryption algorithm, acquiring an elliptic group, and finishing initialization of the encryption algorithm;
mapping byte information of the key data to the elliptic curve E to realize data encryption processing and obtain encrypted data.
4. A method for secure transmission and storage of streaming data based on a data lake according to claim 3, wherein the elliptic curve E is defined as:
[formula missing in source text]
wherein [formula missing in source text] and [formula missing in source text], and [symbol missing in source text] is the field of rational numbers,
6. The method for securely transmitting and storing streaming data based on a data lake according to claim 1, wherein constructing a multi-source streaming data transmission channel by using Flume and Kafka, and performing encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel, specifically comprises the following steps:
generating a private key by a Flume proxy server at the data receiving end;
generating, by the Flume proxy server at the data receiving end, a certificate signed by a trusted key;
establishing a trust repository by the Flume proxy server at the receiving end;
and creating a PKCS12 file by using the key and the certificate to encrypt the Flume channel, and performing encrypted transmission of the encrypted data of the plurality of data source ends.
7. The method for secure transmission and storage of streaming data based on a data lake of claim 6, wherein constructing a multi-source streaming data transmission channel by using Flume and Kafka, and performing encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel, further comprises:
collecting and aggregating, by a plurality of Flume processes, the Event data corresponding to the data source ends, and storing the Event data centrally in the big data storage system.
8. The method for secure transmission and storage of streaming data based on a data lake of claim 6, wherein constructing a multi-source streaming data transmission channel by using Flume and Kafka, and performing encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel, further comprises:
using Kafka to receive the encrypted data transmitted through the Flume channel, serializing the encrypted data, and caching the data when the data transmission volume is large.
9. The method for securely transmitting and storing streaming data based on a data lake of claim 6, wherein constructing a multi-source streaming data transmission channel by using Flume and Kafka, and performing encrypted transmission of the encrypted data of the plurality of data source ends through the multi-source streaming data transmission channel, specifically comprises:
in the intermediate stage of the transmission of the encrypted data, using Kafka to buffer the data, wherein the encrypted data is temporarily stored in Kafka and then stored in a downstream storage system at a set processing speed.
10. The streaming data safe transmission and storage method based on the data lake as claimed in claim 1, wherein the step of constructing the data lake by using Hadoop and compressing and storing the encrypted data obtained by encrypted transmission in a partitioned manner based on LZO compression technology specifically comprises the following steps:
constructing a data lake by using Hadoop, and adopting HDFS for bottom storage;
taking a timestamp of data transmission as a partition name, and performing partition storage on encrypted data;
and recompiling the LZO algorithm by using a Hadoop source code, and compressing the encrypted data stored in the partition according to the recompiled LZO algorithm.
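As background for the elliptic curve E and the elliptic group referred to in claims 3 and 4, the following Python sketch shows the group law (point addition and doubling) on a curve over the rational numbers. The defining formula of claim 4 is not recoverable from the text, so the short Weierstrass form y^2 = x^3 + a*x + b used here is an assumption, as are the class name `Curve`, the sample coefficients and the sample points. The sketch illustrates only the group operation, not the encryption procedure or the mapping of byte information onto the curve.

```python
from fractions import Fraction


class Curve:
    """Assumed form y^2 = x^3 + a*x + b with exact rational arithmetic."""

    def __init__(self, a, b):
        self.a, self.b = Fraction(a), Fraction(b)
        if 4 * self.a ** 3 + 27 * self.b ** 2 == 0:
            raise ValueError("singular curve: the group law is undefined")

    def contains(self, point) -> bool:
        if point is None:  # None represents the point at infinity
            return True
        x, y = Fraction(point[0]), Fraction(point[1])
        return y * y == x ** 3 + self.a * x + self.b

    def add(self, p, q):
        """Return p + q under the chord-and-tangent group law."""
        if p is None:
            return q
        if q is None:
            return p
        x1, y1 = Fraction(p[0]), Fraction(p[1])
        x2, y2 = Fraction(q[0]), Fraction(q[1])
        if x1 == x2 and y1 == -y2:
            return None  # p and q are inverses of each other
        if x1 == x2:
            slope = (3 * x1 * x1 + self.a) / (2 * y1)  # tangent (doubling)
        else:
            slope = (y2 - y1) / (x2 - x1)              # chord through p and q
        x3 = slope * slope - x1 - x2
        y3 = slope * (x1 - x3) - y1
        return (x3, y3)


curve = Curve(0, 17)                 # sample curve y^2 = x^3 + 17
P, Q = (-2, 3), (-1, 4)              # two rational points on the sample curve
total = curve.add(P, Q)              # (4, -9)
double = curve.add(P, P)             # (8, -23)
```

Deployed elliptic curve cryptography normally works over a finite field rather than the rationals; the rational field is used here only because the claim text names it.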
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211429921.0A CN115499244A (en) | 2022-11-16 | 2022-11-16 | Streaming data safe transmission and storage method based on data lake |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115499244A (en) | 2022-12-20 |
Family
ID=85115757
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211429921.0A Pending CN115499244A (en) | 2022-11-16 | 2022-11-16 | Streaming data safe transmission and storage method based on data lake |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115499244A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107317838A (en) * | 2017-05-24 | 2017-11-03 | 重庆邮电大学 | A kind of astronomical metadata archiving method and system based on stream data processing framework |
US20210397738A1 (en) * | 2020-06-22 | 2021-12-23 | Sophos Limited | Filtered data lake for enterprise security |
Non-Patent Citations (1)
Title |
---|
张聪辉: ""Hadoop架构下的大数据安全存储技术研究"", 《万方平台在线出版》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116737349A (en) * | 2023-08-16 | 2023-09-12 | 中国移动紫金(江苏)创新研究院有限公司 | Stream data processing method, system and storage medium |
CN116737349B (en) * | 2023-08-16 | 2023-11-03 | 中国移动紫金(江苏)创新研究院有限公司 | Stream data processing method, system and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111448781B (en) | Computer-implemented method for communicating shared blockchain data | |
US10789215B1 (en) | Log-structured storage systems | |
EP3673446B1 (en) | Managing blockchain-based centralized ledger systems | |
US11423015B2 (en) | Log-structured storage systems | |
EP3673376B1 (en) | Log-structured storage systems | |
US10896006B1 (en) | Log-structured storage systems | |
US10885022B1 (en) | Log-structured storage systems | |
KR102412024B1 (en) | Indexing and recovery of encoded blockchain data | |
EP3695303B1 (en) | Log-structured storage systems | |
CN111902817A (en) | Block chain data storage based on shared nodes and error correction coding | |
CN111095218B (en) | Method, system and device for storing shared block chain data based on error correction coding | |
CN111523133A (en) | Block chain and cloud data collaborative sharing method | |
US20200213331A1 (en) | Data service system | |
US11250428B2 (en) | Managing transaction requests in ledger systems | |
WO2019228574A2 (en) | Log-structured storage systems | |
EP3682340A2 (en) | Log-structured storage systems | |
CN111526197A (en) | Cloud data secure sharing method | |
CN111095210A (en) | Storing shared blockchain data based on error correction coding | |
CN112307501B (en) | Big data system based on block chain technology, storage method and using method | |
CN112732695B (en) | Cloud storage data security deduplication method based on block chain | |
US11455631B2 (en) | Managing transaction requests in ledger systems | |
CN111033491A (en) | Storing shared blockchain data based on error correction coding | |
CN115499244A (en) | Streaming data safe transmission and storage method based on data lake | |
CN116304265A (en) | Electronic file management method and system based on blockchain | |
US11455297B2 (en) | Managing transaction requests in ledger systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20221220 |