CN111694693A

CN111694693A - Data stream storage method and device and computer storage medium

Info

Publication number: CN111694693A
Application number: CN201910184336.0A
Authority: CN
Inventors: 唐英荣
Original assignee: Shanghai Jingzan Rongxuan Technology Co ltd
Current assignee: Shanghai Jingzan Rongxuan Technology Co ltd
Priority date: 2019-03-12
Filing date: 2019-03-12
Publication date: 2020-09-22

Abstract

A data stream storage method, apparatus and computer storage medium, the method comprising: acquiring a data stream; determining keywords of data in the data stream; distributing the data to partitions according to the keywords of the data; and storing the data in each partition. With this scheme, data loss caused by untimely processing can be avoided, and when data storage fails, the partition holding the data can be easily located according to the keyword of the data, so that the corresponding data can be recovered.

Description

Data stream storage method and device and computer storage medium

Technical Field

The present invention relates to the field of data processing, and in particular, to a data stream storage method and apparatus, and a computer storage medium.

Background

In data processing, a data stream consists of a plurality of pieces of data that are generated continuously and in large quantities, so the stream keeps delivering new data. If the data in the data stream is not processed or stored in a timely manner, the data will be lost.

In the prior art, a scheme for processing the data stream is to directly store data in the data stream.

However, with the above scheme, when a data storage failure occurs, the excessive data volume of the data stream makes it difficult to determine where the data is stored, which hinders data recovery.

Disclosure of Invention

The invention solves the technical problem of difficult data recovery.

To solve the foregoing technical problem, an embodiment of the present invention provides a data stream storage method, including: acquiring a data stream; determining keywords of data in the data stream; distributing the data to partitions according to the keywords of the data; and storing the data in each partition.

Optionally, a data stream consisting of a plurality of pieces of data is acquired by Kafka.

Optionally, a Hash algorithm is used to calculate the keywords of each piece of data.

Optionally, the partition sequence number corresponding to the data is calculated from the keyword of the data by using a Hash modulo algorithm, according to a preset number of partitions.

Optionally, the data in each partition is serialized, and the serialized data is stored.

Optionally, snapshot storage is performed on data in each partition.

The present invention also provides a data stream storage apparatus, comprising: an acquisition unit configured to acquire a data stream; a determining unit configured to determine keywords of data in the data stream; an allocation unit configured to distribute the data to partitions according to the keywords of the data; and a storage unit configured to store the data in each partition.

Optionally, the acquisition unit is further configured to acquire, through Kafka, a data stream composed of a plurality of pieces of data.

Optionally, the determining unit is further configured to calculate a keyword of each piece of data by using a Hash algorithm.

Optionally, the allocation unit is further configured to calculate, according to a preset number of partitions, the partition sequence number corresponding to the data from the keyword of the data by using a Hash modulo algorithm.

Optionally, the storage unit is further configured to serialize the data in each partition and to store the serialized data.

Optionally, the storage unit is further configured to perform snapshot storage on data in each partition.

The present invention also provides a computer-readable storage medium on which computer instructions are stored, the computer-readable storage medium being a non-volatile storage medium or a non-transitory storage medium, wherein the computer instructions, when executed, perform the steps of any one of the above data stream storage methods.

The invention also provides a data stream storage device, which comprises a memory and a processor, wherein the memory stores computer instructions, and the processor performs the steps of any one of the above data stream storage methods when executing the computer instructions.

Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:

A data stream is acquired; keywords of data in the data stream are determined; the data is distributed to partitions according to the keywords of the data; and the data in each partition is stored. With this scheme, data loss caused by untimely processing can be avoided, and when data storage fails, the partition holding the data can be easily located according to the keyword of the data, so that the corresponding data can be recovered.

Drawings

Fig. 1 is a schematic flow chart of a data stream storage method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a data stream storage device according to an embodiment of the present invention.

Detailed Description

In the prior art, a scheme for processing a data stream is to directly store data in the data stream. However, with the above scheme, when a data storage failure occurs, the excessive data volume of the data stream makes it difficult to determine where the data is stored, which hinders data recovery.

In the embodiment of the invention, a data stream is acquired; keywords of data in the data stream are determined; the data is distributed to partitions according to the keywords of the data; and the data in each partition is stored. With this scheme, data loss caused by untimely processing can be avoided, and when data storage fails, the partition holding the data can be easily located according to the keyword of the data, so that the corresponding data can be recovered.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

Referring to fig. 1, a flow chart of a data stream storage method according to an embodiment of the present invention is schematically shown, and the following detailed description is made with reference to specific steps.

Step S101, a data stream is acquired.

In a specific implementation, as data continues to be generated, the plurality of pieces of generated data constitute a data stream. Typically, the amount of data within a data stream is large. If the data in the data stream is classified, processed and analyzed in real time, the computational burden is heavy. However, if nothing is done with the data stream, the data is lost. Therefore, in the embodiment of the invention, the data stream can be acquired and stored, so that loss of the data in the data stream is avoided.

In the embodiment of the invention, a data stream consisting of a plurality of pieces of data can be acquired through Kafka.

In a specific implementation, Kafka is a high-throughput, distributed data platform that can be used to process data generated by various data sources. The advantage of using Kafka as the platform for acquiring the data stream is its high data throughput: it can handle a large amount of continuously generated data, which helps avoid data loss or system failure.
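As a non-limiting illustration, the acquisition step might be sketched in Python as follows. The sketch assumes a consumer object that can be iterated and that exposes each record's payload as `.value`, which is how the `KafkaConsumer` of the kafka-python client behaves. The `Record` type, the function name `acquire_stream`, the topic name and the broker address are hypothetical and are not taken from the embodiment; an in-memory list stands in for the broker so that the sketch runs on its own.

```python
from collections import namedtuple
from typing import Iterable, Iterator

# Hypothetical stand-in for a Kafka record; only the `.value` payload is used.
Record = namedtuple("Record", ["value"])


def acquire_stream(consumer: Iterable) -> Iterator[bytes]:
    """Yield the raw payload of every record delivered by the consumer."""
    for record in consumer:
        yield record.value


# With a real broker (assumed topic and address), the consumer could be built as:
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")
# Here an in-memory list plays that role.
fake_consumer = [Record(b"order-1"), Record(b"order-2"), Record(b"order-3")]
stream = list(acquire_stream(fake_consumer))
print(stream)
```

Because the function depends only on iteration and on `.value`, the same code path can be exercised in tests without a running Kafka cluster.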

Step S102, determining keywords of data in the data stream.

In a specific implementation, the keyword of a piece of data in the data stream may be used as an identifier that distinguishes the data from other data. Determining the keywords of the data in the data stream can improve the efficiency of data queries and avoid the situation in which the data storage location is difficult to determine.

In a specific implementation, the keyword of the data may be identification information that characterizes the corresponding data, or identification information that serves a distinguishing function by being associated with the data. In a specific application, the user can configure the keyword according to the actual application scenario.

In the embodiment of the invention, the keywords of each piece of data can be calculated by using a Hash algorithm.

In a specific implementation, the Hash algorithm can be used to compress data of any length into a data digest of a fixed length, and different data generally yield different data digests. Therefore, the data digest calculated by the Hash algorithm can be used to identify data quickly, which improves the efficiency of data queries and avoids the situation in which the data storage location is difficult to determine.
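By way of example only, the keyword calculation might look like the following Python sketch. The embodiment does not name a particular Hash algorithm; SHA-256 from the standard `hashlib` module and the function name `compute_keyword` are assumptions made for illustration.

```python
import hashlib


def compute_keyword(data: bytes) -> str:
    """Return a fixed-length hexadecimal digest that serves as the keyword."""
    # SHA-256 is an assumed choice; any Hash algorithm with a fixed-length
    # digest would play the same role.
    return hashlib.sha256(data).hexdigest()


# Inputs of very different lengths produce digests of the same length.
short_key = compute_keyword(b"a")
long_key = compute_keyword(b"a" * 10_000)
print(len(short_key), len(long_key))
```

The digest length is independent of the input length, and the same input always yields the same keyword, which is what allows the keyword to be recomputed later when a piece of data has to be located.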

Step S103, distributing the data to partitions according to the keywords of the data.

In a specific implementation, because the data volume of the data stream is usually huge, the data in the data stream can be allocated to different partitions, and the data of the different partitions is then processed separately, so that the data processing pressure on each partition node is reduced and system faults are avoided.

In a specific implementation, the keyword of the data may be used as the criterion for assigning the data to different partitions. In this way, during a data query, the partition holding the data can first be determined according to the keyword of the data, and the query is then performed within that partition, which improves the efficiency of data queries and avoids the situation in which the data storage location is difficult to determine.

In a specific implementation, a partition may be a data processing node at a software level or a data processor at a hardware level.

In the embodiment of the invention, according to the preset number of partitions, a Hash modulo algorithm can be used to calculate the partition sequence number corresponding to the data from the keyword of the data.

In a specific implementation, the Hash modulo algorithm may be used to establish a mapping relationship between the data and the preset partitions according to the Hash value of the data, that is, the data digest calculated by the Hash algorithm. Each partition is represented in the mapping relationship by its partition sequence number; in other words, the mapping relationship is established between the Hash value of the data and the partition sequence number.
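One possible reading of this mapping is partition sequence number = Hash value mod number of preset partitions. The Python sketch below illustrates that reading; the use of SHA-256, the partition count of 4 and the name `partition_number` are assumptions chosen for the example rather than details given by the embodiment.

```python
import hashlib

NUM_PARTITIONS = 4  # assumed preset number of partitions


def partition_number(data: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a piece of data to a partition sequence number by Hash modulo."""
    digest = hashlib.sha256(data).digest()      # Hash value of the data
    hash_value = int.from_bytes(digest, "big")  # interpret the digest as an integer
    return hash_value % num_partitions          # result lies in 0 .. num_partitions - 1


for item in (b"order-1", b"order-2", b"order-3"):
    print(item, "-> partition", partition_number(item))
```

Because the result depends only on the data and on the preset partition count, the same piece of data always maps to the same partition. Changing the partition count changes the mapping, so the count has to stay fixed for stored data to remain locatable.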

Step S104, storing the data in each partition.

In the embodiment of the present invention, before storing the data in each partition, the data in each partition may be serialized.

In a specific implementation, serialization is the process of converting the state information of an object into a form that can be stored or transmitted. Serializing the data therefore facilitates its storage.

In a specific implementation, after the data is stored in a serialized manner, the serialized data can be read by deserialization when the data is read.
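A minimal sketch of the serialize-then-deserialize round trip is given below, assuming JSON as the serialization format. The embodiment does not specify a format, and the record layout and function names are hypothetical.

```python
import json


def serialize(records: list) -> bytes:
    """Convert the in-memory records of one partition into storable bytes."""
    return json.dumps(records, sort_keys=True).encode("utf-8")


def deserialize(blob: bytes) -> list:
    """Rebuild the records from their serialized form when reading."""
    return json.loads(blob.decode("utf-8"))


partition_data = [{"id": 1, "payload": "a"}, {"id": 2, "payload": "b"}]
blob = serialize(partition_data)   # form that can be written to storage
restored = deserialize(blob)       # form recovered at read time
print(restored == partition_data)
```

JSON is used here only because it is human-readable and part of the standard library; a binary format could be substituted without changing the surrounding steps.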

In the embodiment of the present invention, when the data in each partition is stored, snapshot storage may be performed on the data in each partition.

In a specific implementation, a snapshot allows data to be stored quickly, so that a large amount of continuously generated data can be handled and data loss caused by untimely processing is avoided.
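The embodiment does not describe how a snapshot is taken. As one hedged interpretation, a snapshot can be treated as a point-in-time copy of a partition's contents written to its own file. In the Python sketch below, the file naming, the JSON format, the temporary-file-then-rename pattern and all function names are assumptions made for illustration.

```python
import json
import os
import tempfile


def snapshot_partition(directory: str, partition_no: int, records: list) -> str:
    """Write a point-in-time copy of one partition and return the file path."""
    final_path = os.path.join(directory, f"partition-{partition_no}.snapshot")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(list(records), handle)  # copy of the records as they are now
        handle.flush()
        os.fsync(handle.fileno())         # push the bytes to the storage device
    os.replace(tmp_path, final_path)      # swap the finished file into place
    return final_path


def load_snapshot(path: str) -> list:
    """Read a previously written snapshot back into memory."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


snapshot_dir = tempfile.mkdtemp()
partitions = {0: ["a", "b"], 1: ["c"]}
paths = {no: snapshot_partition(snapshot_dir, no, recs)
         for no, recs in partitions.items()}
print(load_snapshot(paths[0]))
```

Writing to a temporary file first means a reader never observes a half-written snapshot: either the previous snapshot or the complete new one is present under the final name.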

As can be seen from the above, a data stream is acquired; keywords of data in the data stream are determined; the data is distributed to partitions according to the keywords of the data; and the data in each partition is stored. With this scheme, data loss caused by untimely processing can be avoided, and when data storage fails, the partition holding the data can be easily located according to the keyword of the data, so that the corresponding data can be recovered.
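To illustrate how the keyword allows the partition to be located during recovery, steps S101 to S104 can be combined in a short Python sketch. SHA-256, JSON, the partition count of 3, the in-memory dictionary that stands in for real storage and all function names are assumptions made for this example only.

```python
import hashlib
import json
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed preset number of partitions


def keyword_of(data: bytes) -> str:
    """Step S102: keyword as a Hash digest of the data."""
    return hashlib.sha256(data).hexdigest()


def partition_of(keyword: str) -> int:
    """Step S103: partition sequence number by Hash modulo."""
    return int(keyword, 16) % NUM_PARTITIONS


def store_stream(stream) -> dict:
    """Steps S103 and S104: group data by partition, then serialize each one."""
    partitions = defaultdict(dict)
    for data in stream:
        key = keyword_of(data)
        partitions[partition_of(key)][key] = data.decode("utf-8")
    return {no: json.dumps(records) for no, records in partitions.items()}


def recover(stored: dict, keyword: str) -> str:
    """Locate the partition from the keyword alone and read the data back."""
    records = json.loads(stored[partition_of(keyword)])
    return records[keyword]


stored = store_stream([b"alpha", b"beta", b"gamma"])
recovered = recover(stored, keyword_of(b"beta"))
print(recovered)
```

Recovery inspects only the one partition that the keyword maps to, rather than scanning everything that was stored, which is the benefit the scheme claims over storing the stream directly.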

Referring to Fig. 2, a schematic structural diagram of a data stream storage device 20 according to an embodiment of the present invention is shown, which specifically includes: an acquisition unit 201 configured to acquire a data stream; a determining unit 202 configured to determine keywords of data in the data stream; an allocation unit 203 configured to distribute the data to partitions according to the keywords of the data; and a storage unit 204 configured to store the data in each partition.

In this embodiment of the present invention, the acquisition unit 201 may be further configured to acquire, through Kafka, a data stream composed of a plurality of pieces of data.

In this embodiment of the present invention, the determining unit 202 may be further configured to calculate a keyword of each piece of data by using a Hash algorithm.

In this embodiment of the present invention, the allocation unit 203 may be further configured to calculate, according to the preset number of partitions, the partition sequence number corresponding to the data from the keyword of the data by using a Hash modulo algorithm.

In this embodiment of the present invention, the storage unit 204 may be further configured to serialize the data in each partition and to store the serialized data.

In this embodiment of the present invention, the storage unit 204 may be further configured to perform snapshot storage on data in each partition.

The present invention also provides a computer-readable storage medium on which computer instructions are stored, the computer-readable storage medium being a non-volatile storage medium or a non-transitory storage medium, wherein the computer instructions, when executed, perform the steps of the data stream storage method provided by the embodiment of the present invention.

The invention also provides a data stream storage device, which comprises a memory and a processor, wherein the memory stores computer instructions, and the processor performs the steps of the data stream storage method provided by the embodiment of the invention when executing the computer instructions.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, and the like.

Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method for storing a data stream, comprising:

acquiring a data stream;

determining keywords of data in the data stream;

distributing the data to partitions according to the keywords of the data;

and storing the data in each partition.

2. The data stream storage method of claim 1, wherein acquiring the data stream comprises:

acquiring, through Kafka, a data stream consisting of a plurality of pieces of data.

3. The data stream storage method of claim 1, wherein determining the keywords of the data in the data stream comprises:

calculating the keyword of each piece of data by using a Hash algorithm.

4. The data stream storage method according to claim 3, wherein distributing the data to partitions according to the keywords of the data comprises:

calculating, according to a preset number of partitions, the partition sequence number corresponding to the data from the keyword of the data by using a Hash modulo algorithm.

5. The data stream storage method according to claim 1, wherein the storing the data in each partition includes:

serializing the data in each partition;

and storing the serialized data.

6. The data stream storage method according to claim 1, wherein the storing the data in each partition includes:

performing snapshot storage on the data in each partition.

7. A data stream storage device, comprising:

an acquisition unit configured to acquire a data stream;

a determining unit configured to determine keywords of data in the data stream;

an allocation unit configured to distribute the data to partitions according to the keywords of the data;

and a storage unit configured to store the data in each partition.

8. The data stream storage device according to claim 7, wherein the acquisition unit is further configured to acquire, through Kafka, a data stream composed of a plurality of pieces of data.

9. The data stream storage device of claim 7, wherein the determining unit is further configured to calculate the keyword of each piece of data by using a Hash algorithm.

10. The data stream storage device according to claim 9, wherein the allocation unit is further configured to calculate, according to a preset number of partitions, the partition sequence number corresponding to the data from the keyword of the data by using a Hash modulo algorithm.

11. The data stream storage device of claim 7, wherein the storage unit is further configured to serialize the data in each partition and to store the serialized data.

12. The data stream storage device according to claim 7, wherein the storage unit is further configured to perform snapshot storage on the data in each partition.

13. A computer readable storage medium having stored thereon computer instructions, the computer readable storage medium being a non-volatile storage medium or a non-transitory storage medium, wherein the computer instructions when executed perform the steps of the data stream storage method according to any one of claims 1 to 6.

14. A data stream storage device comprising a memory and a processor, the memory having stored thereon computer instructions, wherein the processor performs the steps of the data stream storage method of any one of claims 1 to 6 when the computer instructions are executed.