CN111352897A - Real-time data storage method, equipment and storage medium - Google Patents

Real-time data storage method, equipment and storage medium

Info

Publication number
CN111352897A
Authority
CN
China
Prior art keywords
data
real
small files
time
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010135560.3A
Other languages
Chinese (zh)
Inventor
沈汉标
王妙玉
童威云
吴宁泉
周小桥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Ketyoo Intelligent Technology Co Ltd
Original Assignee
Guangdong Ketyoo Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-03-02
Filing date
2020-03-02
Publication date
2020-06-30
Application filed by Guangdong Ketyoo Intelligent Technology Co Ltd
Priority to CN202010135560.3A
Publication of CN111352897A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/116 Details of conversion of file system types or formats
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a storage medium and equipment for storing real-time data. The method comprises the following steps: S1, reading data from an external streaming data source in real time, storing the data to a big data platform, and generating a plurality of small files; S2, starting a merging program at regular intervals to merge the small files; S3, reading the small files through multiple threads, and merging the small files into data files of a preset size through multiple threads; and S4, automatically deleting the small files in the big data platform. The method, the storage medium and the equipment achieve efficient real-time file storage, reduce the space occupied by the data files, and improve the calculation speed of the calculation engine.

Description

Real-time data storage method, equipment and storage medium
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a real-time data storage method, device, and storage medium.
Background
With the development of big data technology, computing engines have been continuously optimized and updated, and the engine bottleneck has largely been resolved. However, as the amount of data increases, storing data efficiently and in real time has become an important topic. Many data extraction tools are available, such as Sqoop and DataX, and the available file formats include Parquet, ORC, TXT and CSV, among which Parquet is an industry-accepted big data file format with a high compression ratio and fast storage. If data is stored through streaming (real-time) computation, it can be written as Parquet files and loaded into a data warehouse quickly, but a large number of small files is produced at the same time, essentially one small file per record. As a result, the data files occupy a large amount of space, the calculation speed of the computing engine drops sharply, and data storage efficiency is reduced.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a real-time data storage method, equipment and a storage medium thereof, which realize efficient real-time file storage, reduce the space occupied by data files and improve the calculation speed of a calculation engine.
One of the purposes of the invention is realized by adopting the following technical scheme:
A method of real-time data storage, comprising the following steps:
S1, reading external streaming data source data in real time, storing the data to a big data platform, and generating a plurality of small files;
S2, starting a merging program at regular intervals to merge the small files;
S3, reading the small files through multiple threads, and merging the small files into data files of a preset size through multiple threads;
and S4, automatically deleting the small files in the big data platform.
Further, the external streaming data source is a Kafka cluster.
Further, the small file is of a Parquet file structure.
Further, in S3, the small files are merged in batches into data files of 64 MB.
Further, the merging program is started every hour in S2.
Further, in S1, the data source data is read concurrently in real time through Spark Streaming.
Further, the data source is an external streaming data source.
Further, the big data platform reads external streaming data source data every second.
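The parameters listed above can be gathered into one configuration object. The following is a minimal sketch, not the patented implementation: the class name `StorageConfig` and all field names are hypothetical, the file format is taken to be Parquet, and "64M" is assumed to mean 64 MiB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageConfig:
    """Hypothetical container for the parameters named in the text."""
    source: str = "kafka"                      # external streaming data source
    file_format: str = "parquet"               # assumed small-file format
    read_interval_s: int = 1                   # platform reads the source every second
    merge_interval_s: int = 3600               # merging program starts every hour
    target_file_bytes: int = 64 * 1024 * 1024  # merged file size (assumed 64 MiB)

    def reads_per_merge_window(self) -> int:
        """Number of read cycles that elapse between two merge runs."""
        return self.merge_interval_s // self.read_interval_s


config = StorageConfig()
print(config.reads_per_merge_window())  # 3600
```

With a one-second read interval and an hourly merge, each writer completes 3600 read cycles between merges, which indicates how many small files one writer could leave behind if every cycle produced a file.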
The second purpose of the invention is realized by adopting the following technical scheme:
an apparatus comprising a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing a method of real-time data storage as described above when executing the computer program.
The third purpose of the invention is realized by adopting the following technical scheme:
a storage medium having stored thereon a computer program which, when executed, implements a method of real-time data storage as described above.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a method, equipment and a storage medium for storing real-time data. During big data analysis and processing, data from an external data source is read in real time and written as a plurality of small files, and the small files are merged and cleared at regular intervals. This improves the calculation rate, provides a better data environment for data analysis, shortens the data processing period and reduces the space occupied by the data files.
Drawings
FIG. 1 is a schematic flow chart of a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a first embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a second embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.
As shown in FIG. 1, the present invention provides a method for real-time data storage, comprising the following steps:
S1, reading data from an external streaming data source in real time, storing the data to a big data platform, and generating a plurality of small files;
S2, starting a merging program periodically to merge the small files;
S3, reading the small files through multiple threads, and periodically merging the small files into data files of a preset size;
and S4, automatically deleting the small files in the big data platform.
The application provides a method for storing real-time data in which, during big data analysis and processing, data from an external data source is read in real time and written as a plurality of small files, and the small files are merged and cleared at regular intervals. This improves the calculation rate, provides a better data environment for data analysis, shortens the data processing period and reduces the space occupied by the data files.
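The timed start of the merging program can be illustrated with a small scheduling helper. This is only a sketch under stated assumptions: the function names `next_merge_time` and `files_due` are hypothetical, runs are aligned to interval boundaries counted from midnight, and only files written before the cutoff are selected so that files still being written are left alone. Neither the alignment nor the cutoff rule is stated in the text.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def next_merge_time(now: datetime,
                    interval: timedelta = timedelta(hours=1)) -> datetime:
    """Return the first interval boundary strictly after `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    completed = (now - midnight) // interval  # whole intervals elapsed today
    return midnight + (completed + 1) * interval


def files_due(mtimes: Dict[str, datetime], cutoff: datetime) -> List[str]:
    """Names of small files last modified before `cutoff`, in name order."""
    return sorted(name for name, mtime in mtimes.items() if mtime < cutoff)


now = datetime(2020, 3, 2, 10, 15, 30)
print(next_merge_time(now))  # 2020-03-02 11:00:00
```

A scheduler loop would sleep until `next_merge_time(...)`, call `files_due(...)` with that time as the cutoff, and hand the result to the merging step.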
Specifically, the method is based on the Kafka platform. Kafka is a high-throughput distributed publish-subscribe messaging system that can process all the action stream data of consumers on a website, unify online and offline message processing, and provide real-time messages through a cluster. Data from the external streaming data source on the Kafka cluster is read through Spark Streaming and stored in the big data platform. Spark Streaming is an important framework in the Spark ecosystem; it is built on Spark Core as an extension and offers scalability, high throughput and fault tolerance for streaming data. It can monitor data from sources such as Kafka, analyze the data through complex algorithms and a series of calculations, and store the results in the HDFS file system, databases and front-end pages. In the present application, Spark Streaming is used for its streaming reads and low latency: data can be processed on the order of seconds, with the source read as often as once per second. After the data has been read, it is stored to the big data platform, generating a plurality of small files. In this embodiment, the small files have a Parquet file structure.
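To see why second-level reads produce many small files, the micro-batching behaviour can be imitated in plain Python. The sketch below is a stand-in for a streaming engine, not Spark Streaming code: `micro_batches` and `small_file_names` are hypothetical helpers, records are assumed to be `(timestamp, payload)` pairs, and one output file per non-empty batch window is an assumption about how the writer behaves.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def micro_batches(records: Iterable[Tuple[float, str]],
                  interval_s: float = 1.0) -> Dict[int, List[str]]:
    """Group (timestamp, payload) records into windows of `interval_s` seconds."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in records:
        window = int(timestamp // interval_s)  # index of the batch window
        batches[window].append(payload)
    return dict(batches)


def small_file_names(batches: Dict[int, List[str]],
                     prefix: str = "part") -> List[str]:
    """One small file per non-empty window, named after the window index."""
    return [f"{prefix}-{window:05d}.parquet" for window in sorted(batches)]


events = [(0.1, "a"), (0.9, "b"), (1.2, "c"), (3.0, "d")]
print(small_file_names(micro_batches(events)))
# ['part-00000.parquet', 'part-00001.parquet', 'part-00003.parquet']
```

Four records arriving within four seconds already yield three files, which is the accumulation that the merging program is meant to undo.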
FIG. 2 is a schematic diagram of an application of the embodiment of the present application. The enterprise server pushes data to an agent of the Kafka platform, namely the Kafka broker, which caches the data for the message producer. ZooKeeper manages the Kafka brokers: when a broker is added to the Kafka platform or an existing broker fails, the ZooKeeper service notifies the message producer and the message consumer, which then begin to coordinate with the other brokers. The message consumer program pulls the data generated by the enterprise server from the Kafka broker to the big data platform, which reads and stores it. A large number of small files accumulates in this process and occupies storage space, so a merging program needs to be started at regular intervals to merge the small files in batches into data files of a preset size. If the files are not merged regularly, many small files accumulate and a large amount of resources is needed to merge them later. Every hour, the small files are read through multiple threads and merged in batches into data files of 64 MB. The content of the merged data files is the same as that of the small files, the storage space required by the merged files is far smaller than that of the small files before merging, and the analysis and calculation of the big data as a whole are not affected. Compared with the prior art, the merged data files speed up overall big data processing, shorten the data processing period, and avoid occupying excessive resources later.
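The batched, multi-threaded merge can be sketched on a local directory. This is an illustration under several assumptions rather than the method of the application: the names `plan_batches`, `merge_batch` and `merge_all` are hypothetical, a local folder stands in for the big data platform, batches are packed greedily in file-name order, and files are merged by plain byte concatenation. Byte concatenation is only valid for line-oriented formats such as CSV or TXT; Parquet files would have to be read and rewritten with a Parquet library instead.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List

TARGET_BYTES = 64 * 1024 * 1024  # preset merged size, assumed to be 64 MiB


def plan_batches(sizes: Dict[str, int], target: int = TARGET_BYTES) -> List[List[str]]:
    """Greedily pack files, in name order, into batches of at most `target` bytes.

    A single file larger than `target` gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name in sorted(sizes):
        if current and used + sizes[name] > target:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += sizes[name]
    if current:
        batches.append(current)
    return batches


def merge_batch(paths: List[Path], out_path: Path) -> Path:
    """Concatenate the bytes of the small files into one output file."""
    with out_path.open("wb") as out:
        for path in paths:
            out.write(path.read_bytes())
    return out_path


def merge_all(small_dir: Path, out_dir: Path,
              target: int = TARGET_BYTES, workers: int = 4) -> List[Path]:
    """Plan the batches, then merge each batch on a thread pool."""
    sizes = {p.name: p.stat().st_size for p in small_dir.iterdir() if p.is_file()}
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(merge_batch,
                        [small_dir / name for name in batch],
                        out_dir / f"merged-{index:05d}.dat")
            for index, batch in enumerate(plan_batches(sizes, target))
        ]
        return [future.result() for future in futures]


print(plan_batches({"a": 40, "b": 30, "c": 20}, target=64))  # [['a'], ['b', 'c']]
```

Each batch is independent of the others, so the thread pool can read and write several batches at once; since the work is mostly file I/O, threads are a reasonable fit here.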
After the merging is finished, the small files in the big data platform are automatically deleted, only the merged data files are saved, and the required storage space is reduced.
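The deletion step might be guarded so that small files are removed only once the merged file is known to be complete. The check below is an added safeguard that the text does not describe, and `delete_merged_sources` is a hypothetical name. It assumes the merged file is a byte concatenation of its sources, so the sizes must add up; for Parquet output a row-count comparison would be the analogous check.

```python
from pathlib import Path
from typing import Iterable, List


def delete_merged_sources(sources: Iterable[Path], merged: Path) -> List[Path]:
    """Delete the small files once the merged file passes a size check.

    Raises RuntimeError and keeps every small file if the merged file is
    missing or its size differs from the combined size of the sources.
    """
    source_list = list(sources)
    expected = sum(path.stat().st_size for path in source_list)
    if not merged.is_file() or merged.stat().st_size != expected:
        raise RuntimeError("merged file failed verification; small files kept")
    for path in source_list:
        path.unlink()  # remove the small file, the merged copy remains
    return source_list
```

Running the check before any `unlink` call means a failed or interrupted merge leaves the original small files in place for the next scheduled run.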
In addition, the present invention also provides a storage medium, which stores a computer program, and the computer program realizes the steps of the method for storing real-time data when being executed by a processor.
The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include any entity or device capable of carrying the computer program code, such as a recording medium, a USB disk, a removable hard disk, a magnetic disk (diskette), a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), the HDFS distributed file system, etc. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals or telecommunications signals.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like, as in embodiment two.
Embodiment two
An apparatus, as shown in fig. 3, includes a memory, a processor, and a program stored in the memory, the program configured to be executed by the processor, the processor implementing the above-described method steps of real-time data storage when executing the program.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the apparatus and connects the parts of the apparatus through various interfaces and lines.
The memory may be used to store computer programs and/or modules, and the processor implements the method of real-time data storage by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the apparatus (such as audio data, a phonebook, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one disk storage device, a flash memory device, or other volatile solid-state storage device.
The present invention can be used in numerous general purpose or special purpose cloud computing environments and big data environments. For example: the method is applied to scenes such as a big data platform, a server cluster environment, big data analysis and calculation, high user concurrency, cluster calculation and the like.

Claims (9)

1. A method of real-time data storage, comprising the steps of:
S1, reading external streaming data source data in real time, storing the data to a big data platform, and generating a plurality of small files;
S2, starting a merging program at regular intervals to merge the small files;
S3, reading the small files through multiple threads, and merging the small files into data files of a preset size through multiple threads;
and S4, automatically deleting the small files in the big data platform.
2. The method of claim 1, wherein the external streaming data source is a Kafka cluster.
3. The method of claim 1, wherein the small file has a Parquet file structure.
4. The method of claim 1, wherein in step S3, the small files are merged in batches into data files of 64 MB.
5. The method of claim 1, wherein the merging program is started every hour in step S2.
6. The method of claim 1, wherein in step S1 the data source data is read concurrently in real time through Spark Streaming.
7. The method of claim 6, wherein the data source is an external streaming data source.
8. The method of claim 7, wherein the big data platform reads the external streaming data source data every second.
9. A storage medium having stored thereon a computer program which, when executed, implements a method of real-time data storage as claimed in any one of claims 1 to 8.
CN202010135560.3A 2020-03-02 2020-03-02 Real-time data storage method, equipment and storage medium Pending CN111352897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010135560.3A CN111352897A (en) 2020-03-02 2020-03-02 Real-time data storage method, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010135560.3A CN111352897A (en) 2020-03-02 2020-03-02 Real-time data storage method, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111352897A true CN111352897A (en) 2020-06-30

Family

ID=71197199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010135560.3A Pending CN111352897A (en) 2020-03-02 2020-03-02 Real-time data storage method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111352897A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181920A (en) * 2020-09-24 2021-01-05 陕西天行健车联网信息技术有限公司 Internet of vehicles big data high-performance compression storage method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045422A (en) * 2016-02-06 2017-08-15 华为技术有限公司 Distributed storage method and equipment
CN107391280A (en) * 2017-07-31 2017-11-24 郑州云海信息技术有限公司 A kind of reception of small documents and storage method and device
CN107577809A (en) * 2017-09-27 2018-01-12 北京锐安科技有限公司 Offline small documents processing method and processing device
CN109446165A (en) * 2018-10-11 2019-03-08 中盈优创资讯科技有限公司 The file mergences method and device of big data platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination