CN111352897A - Real-time data storage method, equipment and storage medium

Info

Publication number: CN111352897A
Application number: CN202010135560.3A
Authority: CN
Inventors: 沈汉标; 王妙玉; 童威云; 吴宁泉; 周小桥
Original assignee: Guangdong Ketyoo Intelligent Technology Co Ltd
Current assignee: Guangdong Ketyoo Intelligent Technology Co Ltd
Priority date: 2020-03-02
Filing date: 2020-03-02
Publication date: 2020-06-30

Abstract

The invention discloses a real-time data storage method, a storage medium, and a device. The method comprises the following steps: S1, reading data from an external streaming data source in real time, storing the data to a big data platform, and generating a plurality of small files; S2, starting a merging program at regular intervals to merge the small files; S3, reading the small files with multiple threads and combining them, with multiple threads, into data files of a preset size; and S4, automatically deleting the small files from the big data platform. The method, storage medium, and device achieve efficient real-time file storage, reduce the space occupied by the data files, and improve the calculation speed of the calculation engine.

Description

Real-time data storage method, equipment and storage medium

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to a real-time data storage method, device, and storage medium.

Background

With the development of big data technology, calculation engines have been continuously optimized and updated, and the engine bottleneck has largely been resolved. However, as data volumes grow, storing data efficiently and in real time has become an important problem. Many data extraction tools exist, such as Sqoop and DataX, and candidate file formats include Parquet, ORC, TXT, and CSV, among which Parquet is an industry-accepted big data file format with a high compression ratio and fast storage. If data is stored through streaming (real-time) computation, it can be written as Parquet files and loaded into a data warehouse quickly, but many small files are produced at the same time, essentially one small file per record. As a result, the data files occupy a large amount of space, the calculation speed of the calculation engine drops sharply, and data storage efficiency is reduced.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide a real-time data storage method, equipment and a storage medium thereof, which realize efficient real-time file storage, reduce the space occupied by data files and improve the calculation speed of a calculation engine.

One of the purposes of the invention is realized by adopting the following technical scheme:

a method of real-time data storage, comprising the steps of:

S1, reading data from an external streaming data source in real time, storing the data to a big data platform, and generating a plurality of small files;

S2, starting a merging program at regular intervals to merge the small files;

S3, reading the small files with multiple threads and combining them, with multiple threads, into data files of a preset size;

and S4, automatically deleting the small files from the big data platform.

Further, the external streaming data source is a Kafka cluster.

Further, the small files are in the Parquet file format.

Further, in S3, the small files are merged in batches into data files of 64 MB.

Further, in S2, the merging program is started every hour.

Further, in S1, the data source data is read concurrently in real time through Spark Streaming.

Further, the data source is an external streaming data source.

Further, the big data platform reads external streaming data source data every second.
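
By way of non-limiting illustration, the hourly start of the merging program in S2 could be timed as in the following Python sketch. Aligning each run to the top of the hour, and the function names `next_hourly_run` and `seconds_until_next_run`, are assumptions made for this example only; the method itself requires only that the merging program be started at regular intervals.

```python
from datetime import datetime, timedelta


def next_hourly_run(now):
    """Return the next top-of-the-hour instant strictly after `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)


def seconds_until_next_run(now):
    """Return how long a scheduler would sleep before the next merge run."""
    return (next_hourly_run(now) - now).total_seconds()


# Hypothetical current time, used only to show the calculation.
now = datetime(2020, 3, 2, 10, 15, 30)
print(next_hourly_run(now))         # 2020-03-02 11:00:00
print(seconds_until_next_run(now))  # 2670.0
```

In practice the same effect is often obtained with an external scheduler such as cron rather than a sleeping process; either choice is compatible with the steps described above.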

The second purpose of the invention is realized by adopting the following technical scheme:

an apparatus comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing a method of real-time data storage as described above when executing the computer program.

The third purpose of the invention is realized by adopting the following technical scheme:

a storage medium having stored thereon a computer program which, when executed, implements a method of real-time data storage as claimed above.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a real-time data storage method, device, and storage medium. In the course of analyzing and processing big data, data of an external data source is read in real time and written out as a plurality of small files, and the small files are merged and cleared at regular intervals. This improves the calculation rate, provides a better data environment for data analysis, shortens the data processing cycle, and reduces the space occupied by the data files.

Drawings

FIG. 1 is a schematic flow chart of a first embodiment of the present invention;

FIG. 2 is a schematic diagram of a first embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a second embodiment of the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.

As shown in FIG. 1, the present invention provides a method for real-time data storage, comprising the following steps:

S1, reading data from an external streaming data source in real time, storing the data to a big data platform, and generating a plurality of small files;

S2, starting a merging program at regular intervals to merge the small files;

S3, reading the small files with multiple threads and combining them at regular intervals into data files of a preset size;

and S4, automatically deleting the small files from the big data platform.
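
Steps S3 and S4 can be sketched as follows, purely as an illustration and not as the implementation of the invention. The sketch assumes the small files are opaque byte files on a local file system that can simply be concatenated; real Parquet files cannot be merged by byte concatenation and would need a Parquet-aware reader and writer. The names `read_file` and `merge_small_files`, the `merged-NNNNN.dat` naming, and the tiny target size are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read one small file and return its bytes."""
    with open(path, "rb") as f:
        return f.read()


def merge_small_files(small_paths, out_dir, target_size, workers=4):
    """Merge small files into output files of at most about target_size bytes.

    Returns the merged file paths. The small files are deleted only after
    every merged file has been written.
    """
    # S3: read the small files with multiple threads; map() keeps input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        contents = list(pool.map(read_file, small_paths))

    merged_paths = []
    buffer = bytearray()

    def flush():
        # Write the buffered bytes as one merged file, then reset the buffer.
        if buffer:
            name = f"merged-{len(merged_paths):05d}.dat"
            out_path = os.path.join(out_dir, name)
            with open(out_path, "wb") as out:
                out.write(buffer)
            merged_paths.append(out_path)
            buffer.clear()

    for data in contents:
        # Start a new merged file when this one would exceed the target size.
        if buffer and len(buffer) + len(data) > target_size:
            flush()
        buffer.extend(data)
    flush()

    # S4: delete the small files once the merged files exist.
    for path in small_paths:
        os.remove(path)
    return merged_paths


# Usage with throwaway files: ten 10-byte files and a 25-byte target.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    paths = []
    for i in range(10):
        path = os.path.join(src, f"part-{i:02d}.txt")
        with open(path, "wb") as f:
            f.write(b"record-%02d\n" % i)
        paths.append(path)
    merged = merge_small_files(paths, dst, target_size=25)
    print(len(merged), "merged files;", len(os.listdir(src)), "small files left")
    # prints: 5 merged files; 0 small files left
```

Deleting the small files only after all merged files are written means a failure during merging leaves the original data in place, which is one reasonable ordering for S3 and S4.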

The application provides a method for storing real-time data in which, in the course of analyzing and processing big data, data of an external data source is read in real time and written out as a plurality of small files, and the small files are merged and cleared at regular intervals. This improves the calculation rate, provides a better data environment for data analysis, shortens the data processing cycle, and reduces the space occupied by the data files.

Specifically, the method is based on the Kafka platform. Kafka is a high-throughput distributed publish-subscribe messaging system that can process all activity stream data generated by users of a website, unify online and offline message processing, and provide real-time messages through a cluster. Data of the external streaming data source on the Kafka cluster is read through Spark Streaming and stored in the big data platform. Spark Streaming is an important framework in the Spark ecosystem; it is built on Spark Core as an extension of it and offers scalability, high throughput, and fault tolerance for streaming data. It can monitor data from sources such as Kafka, analyze the data through complex algorithms and a series of computations, and store the results in the HDFS file system, databases, or front-end pages. In the present application, its streaming reads and low latency allow data to be read and processed at the second level. After the data is read, it is stored to the big data platform, generating a plurality of small files. In this embodiment, the small files are in the Parquet file format.
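
As a simplified illustration of why second-level streaming reads produce many small files, the following Python sketch groups timestamped records into one-second windows, with each non-empty window corresponding to one small file written in S1. It does not use Spark Streaming or Kafka; the function `micro_batches` and the sample records are hypothetical and stand in for the streaming framework only.

```python
from collections import defaultdict


def micro_batches(records, interval_s=1.0):
    """Group (timestamp, payload) records into fixed-length batch windows.

    Returns a dict mapping window index to the payloads in that window,
    ordered by window index.
    """
    batches = defaultdict(list)
    for timestamp, payload in records:
        batches[int(timestamp // interval_s)].append(payload)
    return dict(sorted(batches.items()))


# Hypothetical records: (seconds since start, payload).
records = [(0.10, "a"), (0.75, "b"), (1.20, "c"), (3.05, "d")]
batches = micro_batches(records)
print(batches)       # {0: ['a', 'b'], 1: ['c'], 3: ['d']}
print(len(batches))  # 3 small files for 4 records
```

With a one-second interval, a sparse stream yields nearly one file per record, which matches the small-file problem described in the Background.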

FIG. 2 is a schematic diagram of an application of the embodiment of the present application. The enterprise server pushes data to a broker of the Kafka platform (the Kafka broker), which caches the data for the message producer. ZooKeeper manages the Kafka brokers: when a broker is added to the Kafka platform or a broker fails, the ZooKeeper service notifies the message producers and message consumers, which then begin to coordinate with the other brokers. The message consumer program pulls the data generated by the enterprise server from the Kafka broker to the big data platform, which reads and stores it. A large number of small files accumulate in this process and occupy storage space, so a merging program must be started at regular intervals to merge the small files in batches into data files of a preset size. If the files are not merged regularly, many small files accumulate, and merging them later consumes a large amount of resources. Every hour, the small files are read with multiple threads and merged in batches into data files of 64 MB. The content of the merged data files is the same as that of the small files, the storage space required by the merged files is far smaller than that of the small files before merging, and the analysis and calculation of the big data as a whole are not affected. Compared with the prior art, the merged data files speed up overall big data processing, shorten the data processing cycle, and avoid occupying excessive resources later. After the merging is finished, the small files in the big data platform are automatically deleted and only the merged data files are kept, which reduces the required storage space.
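
The effect of batching toward 64 MB on the number of files can be worked through with a short Python sketch. The figures used here (3600 small files of 100 KiB each, standing for one hour of output) are hypothetical and chosen only to make the arithmetic concrete; `plan_batches` is an illustrative name, and 64 MB is taken as 64 × 1024 × 1024 bytes.

```python
MIB = 1024 * 1024


def plan_batches(sizes, target=64 * MIB):
    """Greedily group consecutive file sizes so each batch stays <= target.

    A single file larger than target is placed in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for size in sizes:
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical hour of output: 3600 small files of 100 KiB each.
sizes = [100 * 1024] * 3600
batches = plan_batches(sizes)
print(len(sizes), "small files ->", len(batches), "merged files")
# prints: 3600 small files -> 6 merged files
```

Each full batch holds 655 files (655 × 100 KiB is just under 64 MiB), so five full batches and one partial batch of 325 files cover the hour, and a calculation engine opens 6 files instead of 3600.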

In addition, the present invention also provides a storage medium, which stores a computer program, and the computer program realizes the steps of the method for storing real-time data when being executed by a processor.

The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, computer memory (diskette), Read-Only Memory (ROM), Random Access Memory (RAM), the distributed file system HDFS, etc. It should be noted that what the computer-readable medium may contain can be adjusted as required by legislation and patent practice in the jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals or telecommunications signals.

The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like, as in embodiment two.

Embodiment two

An apparatus, as shown in fig. 3, includes a memory, a processor, and a program stored in the memory, the program configured to be executed by the processor, the processor implementing the above-described method steps of real-time data storage when executing the program.

The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the device and connects the parts of the device through various interfaces and lines.

The memory may be used to store computer programs and/or modules, and the processor implements the method of real-time data storage by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the device (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one disk storage device, a flash memory device, or other volatile solid-state storage device.

The present invention can be used in numerous general purpose or special purpose cloud computing environments and big data environments. For example: the method is applied to scenes such as a big data platform, a server cluster environment, big data analysis and calculation, high user concurrency, cluster calculation and the like.

Claims

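1. A method of real-time data storage, comprising the steps of:

S1, reading data from an external streaming data source in real time, storing the data to a big data platform, and generating a plurality of small files;

S2, starting a merging program at regular intervals to merge the small files;

S3, reading the small files with multiple threads and combining them, with multiple threads, into data files of a preset size;

and S4, automatically deleting the small files from the big data platform.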

2. The method of claim 1, wherein the external streaming data source is a Kafka cluster.

3. The method of claim 1, wherein the small files are in the Parquet file format.

4. The method of claim 1, wherein in step S3, the small files are merged in batches into data files of 64 MB.

5. The method of claim 1, wherein the merging program is started every hour in step S2.

6. The method of claim 1, wherein in step S1 the data source data is read concurrently in real time through Spark Streaming.

7. The method of claim 6, wherein the data source is an external streaming data source.

8. The method of claim 7, wherein the big data platform reads the external streaming data source data every second.

9. A storage medium having stored thereon a computer program which, when executed, implements a method of real-time data storage as claimed in any one of claims 1 to 8.