CN112328569A - Construction method based on Flume distributed data collection architecture - Google Patents

Construction method based on Flume distributed data collection architecture

Info

Publication number: CN112328569A
Authority: CN (China)
Prior art keywords: data, source, agent, flume, channel
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202011221460.9A
Other languages: Chinese (zh)
Inventors: 李向佳, 陈付祥, 李鹏, 黄洋
Current Assignee: Shandong Yunman Intelligent Technology Co ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Shandong Yunman Intelligent Technology Co ltd
Priority date: 2020-07-31 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2020-11-05
Publication date: 2021-02-05
Application filed by Shandong Yunman Intelligent Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Abstract

The invention provides a method for constructing a distributed data collection architecture based on Flume, which can reduce the later maintenance cost of log collection configuration and reduce the impact of big data applications on the business system. The method comprises the following steps: S1, creating at least three servers, and building Flume on a Hadoop cluster; S2, determining the scale of the design according to actual production requirements, analyzing as a whole what the system environment requires, and building the framework accordingly; S3, configuring the Source of the Agent, the Agent component being the core unit of Flume, to define the type and location of the incoming data, wherein the Source of the Agent is mainly responsible for connecting to the data source, receiving the data and writing the collected data into a Channel; S4, caching the data from the Source in the Channel, where the Sink consumes and distributes it; S5, the Sink reads data from the Channel and sends it to the next Agent or to the final destination.

Description

Construction method based on Flume distributed data collection architecture
Technical Field
The invention relates to a method for constructing a distributed data collection architecture based on Flume, and belongs to the technical field of information technology.
Background
In the twenty-first century, with the rapid development of the Internet, Internet use has gradually entered the life, study and work of most people and has become an indispensable part of them. Not only can people communicate with one another in real time and without obstruction; with the rise of artificial intelligence (AI), connections have also formed between people and things and between things themselves. As early as 2013, the four characteristics of big data (volume, variety, velocity and value) were put forward in China. People generate large amounts of data while communicating over the Internet, and within this massive, complex stream of structured and unstructured data generated in real time, everything except a small amount of core business data is mostly log data related to that core data. Because real-time data better meets people's needs, the demand for real-time data in actual production keeps rising. Only part of the massive data can satisfy actual production requirements, so the data must be processed and acted on faster, more efficiently and more accurately in order to improve users' experience of using and interacting with the system.
Flume can be integrated with any data process, and for cases where data is read faster than it can be written, Flume provides an advanced buffering mechanism. The Flume framework comprises two transaction models, one covering data from Source to Channel and one covering data from Channel to Sink; data is committed successfully only under the guarantee of these two transactions, so that collected data is not lost. In a traditional distributed architecture, however, where a transaction is read or written as a whole, Flume is arranged linearly: although the system can be expanded, its stability is not high, the efficiency of data transmission cannot be guaranteed, and the architecture is not conducive to unified management.
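The two transactions described above can be illustrated with a toy model. The following is a minimal sketch in Python, not Flume's actual API: the `Channel` class, its `put_batch` and `take_batch` methods, and the `failing_sink` function are hypothetical names introduced only for illustration. It assumes a single-threaded, in-memory channel with a fixed capacity.

```python
from collections import deque


class Channel:
    """Toy, single-threaded stand-in for a Flume Channel (hypothetical, not the Flume API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put_batch(self, batch):
        """Source -> Channel transaction: commit the whole batch or nothing."""
        if len(self.events) + len(batch) > self.capacity:
            return False  # rollback: the Source keeps the batch and may retry
        self.events.extend(batch)  # commit
        return True

    def take_batch(self, size, deliver):
        """Channel -> Sink transaction: events leave the Channel only if delivery succeeds."""
        batch = [self.events.popleft() for _ in range(min(size, len(self.events)))]
        try:
            deliver(batch)
        except Exception:
            # rollback: put the events back at the head, in their original order
            self.events.extendleft(reversed(batch))
            return False
        return True  # commit


def failing_sink(batch):
    """Simulates a destination that is temporarily unavailable."""
    raise IOError("destination unavailable")


delivered = []
channel = Channel(capacity=4)

assert channel.put_batch(["e1", "e2", "e3"])        # committed
assert not channel.put_batch(["e4", "e5"])          # would overflow, rolled back
assert not channel.take_batch(2, failing_sink)      # delivery failed, rolled back
assert list(channel.events) == ["e1", "e2", "e3"]   # nothing was lost
assert channel.take_batch(2, delivered.extend)      # committed
print(delivered, list(channel.events))
```

In this sketch an event is removed from the channel only after the sink has accepted it, which is the property the two-transaction model relies on to avoid losing collected data.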
Typically, many servers together form one collection end, and the number of servers grows as services expand and time passes.
Disclosure of Invention
The invention aims to provide a method for constructing a distributed data collection architecture based on Flume, which can reduce the later maintenance cost of log collection configuration and reduce the impact of big data applications on the business system.
In order to achieve the purpose, the invention is realized by the following technical scheme:
A construction method based on a Flume distributed data collection architecture comprises the following steps:
S1, creating at least three servers, and building Flume on a Hadoop cluster;
S2, determining the scale of the design according to actual production requirements, analyzing as a whole what the system environment requires, and building the framework accordingly;
S3, carrying out personalized configuration of the Flume collection end according to the format of the data source: configuring the Source of the Agent, the Agent component being the core unit of Flume, to define the type and location of the incoming data, wherein the Source of the Agent is mainly responsible for connecting to the data source, receiving the data and writing the collected data into a Channel;
S4, classifying and aggregating the incoming data sources according to different requirements, and configuring a plurality of Sinks on the aggregator to deliver the data, wherein the Channel caches the data from the Source and the Sinks consume and distribute the data in the Channel;
S5, the Sink reads data from the Channel and sends it to the next Agent or to the final destination.
In a preferred scheme of the construction method based on the Flume distributed data collection architecture, three servers are used: two serve as collection ends and the third serves as the summary layer. Memory Channels are configured for the two Agents of the collection layer, whose Sinks uniformly use Avro, and a Channel is configured for the Agent of the summary layer.
In a preferred scheme of the construction method based on the Flume distributed data collection architecture, the server of the summary layer distributes the data to HDFS and Kafka; the data in HDFS is processed offline in batches using Hadoop MapReduce, and the data in Kafka is used for real-time computation.
The invention has the following advantages: for distributed data collection, the Flume service is layered, and data collection and data distribution are each maintained separately; maintaining a separate collection layer and distribution layer makes the system more stable and greatly improves its scalability.
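The collection-layer part of the preferred scheme can be sketched as a small Python helper that renders a Flume properties file for each collection Agent. The Memory Channel and Avro Sink follow the text; everything else is an assumption made for illustration: the `exec` Source tailing a log file, the agent names `collector1` and `collector2`, the log path, the host name `summary-host`, port 4545 and the capacity values.

```python
def render_properties(agent, sources, channels, sinks):
    """Render component settings in Flume's properties-file layout."""
    lines = [
        f"{agent}.sources = {' '.join(sources)}",
        f"{agent}.channels = {' '.join(channels)}",
        f"{agent}.sinks = {' '.join(sinks)}",
    ]
    groups = (("sources", sources), ("channels", channels), ("sinks", sinks))
    for kind, components in groups:
        for name, settings in components.items():
            for key, value in settings.items():
                lines.append(f"{agent}.{kind}.{name}.{key} = {value}")
    return "\n".join(lines)


def collection_agent(agent, log_path, summary_host, summary_port):
    """Collection-layer Agent: Memory Channel plus Avro Sink, as in the preferred scheme."""
    # Assumption: an exec Source tailing a log file; the patent does not name a Source type.
    sources = {"r1": {"type": "exec", "command": f"tail -F {log_path}", "channels": "c1"}}
    # Assumption: capacity values are illustrative only.
    channels = {"c1": {"type": "memory", "capacity": 10000, "transactionCapacity": 1000}}
    # The Avro Sink forwards events to the Avro Source of the summary-layer Agent.
    sinks = {"k1": {"type": "avro", "hostname": summary_host, "port": summary_port, "channel": "c1"}}
    return render_properties(agent, sources, channels, sinks)


for agent in ("collector1", "collector2"):
    print(collection_agent(agent, "/var/log/app/app.log", "summary-host", 4545))
    print()
```

Both collection Agents share the same shape and differ only in name, which is one way the layered design keeps per-server configuration uniform.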
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
FIG. 1 is a schematic flow chart of an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Abbreviations and key terms referred to in this disclosure are defined as follows:
Data collection: accessing a data source by means of a computer, or of special sensors combined with hardware products, to acquire data, processing the data according to a given or user-defined format, and finally storing the data in some service;
Flume: a distributed, highly reliable and highly available service framework. As a distributed log collection framework, it is mainly used to aggregate and transmit massive log data and supports the transmission of streaming data;
HDFS (Hadoop Distributed File System): the Hadoop distributed file system, designed for streaming access to large files. It is suited to files of hundreds of MB, GB or TB and to write-once, read-many access. It is not well suited to low-latency data access, large numbers of small files, concurrent writes or arbitrary file modification;
Kafka: a message buffer queue commonly used in distributed frameworks, which can be used for data transfer and scheduling;
Producer: Kafka's producer, which produces data, i.e. acquires data;
Consumer: Kafka's consumer, which consumes the data produced in Kafka;
Channel: a thread-safe buffer queue in Flume;
Agent: the core element of Flume;
Source: the component of the Agent used to define the data source;
Avro: a high-performance middleware based on binary data transmission, and a subproject of Hadoop.
Examples
A construction method based on a Flume distributed data collection architecture comprises the following steps:
S1, three servers are created and, for convenience of business operation, Flume is built on a Hadoop cluster; two servers serve as collection ends and the third serves as the summary layer; Memory Channels are configured for the two Agents of the collection layer, whose Sinks uniformly use Avro, and a Channel is configured for the Agent of the summary layer;
S2, determining the scale of the design according to actual production requirements, analyzing as a whole what the system environment requires, and building the framework accordingly;
S3, carrying out personalized configuration of the Flume collection end according to the format of the data source: configuring the Source of the Agent, the Agent component being the core unit of Flume, to define the type and location of the incoming data, wherein the Source of the Agent is mainly responsible for connecting to the data source, receiving the data and writing the collected data into a Channel; the data is formatted and a user-defined Filter is added, so that the data is processed before being transferred to the Channel;
S4, classifying and aggregating the incoming data sources according to different requirements, and configuring a plurality of Sinks on the aggregator to deliver the data, wherein the Channel caches the data from the Source and the Sinks consume and distribute the data in the Channel;
S5, the Sink reads data from the Channel and sends it to the next Agent or to the final destination; depending on the actual production environment, the data in most cases serves two purposes: offline batch processing and real-time stream computing.
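The formatting and user-defined Filter mentioned in step S3 can be sketched as follows. This is a plain Python illustration of the logic only, not a Flume interceptor (those are written in Java). The `timestamp level message` line format, the rule that drops DEBUG records, and the function names `format_event`, `keep_event` and `source_to_channel` are all assumptions introduced for the example.

```python
import json


def format_event(raw_line):
    """Parse a 'timestamp level message' line into a dict, or return None if malformed."""
    parts = raw_line.strip().split(" ", 2)
    if len(parts) != 3:
        return None
    timestamp, level, message = parts
    return {"ts": timestamp, "level": level.upper(), "msg": message}


def keep_event(event):
    """Hypothetical user-defined Filter: drop DEBUG records."""
    return event["level"] != "DEBUG"


def source_to_channel(raw_lines, channel):
    """Format and filter each line, then hand the surviving events to the channel."""
    for line in raw_lines:
        event = format_event(line)
        if event is not None and keep_event(event):
            channel.append(json.dumps(event))


channel = []  # a plain list stands in for the Channel in this sketch
source_to_channel(
    [
        "2020-11-05T10:00:00 info user login",
        "2020-11-05T10:00:01 debug cache hit",
        "malformed",
    ],
    channel,
)
print(channel)
```

Formatting at the collection end means every downstream consumer sees records in one shape, which is the saving in later data work that the description refers to.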
In this embodiment, the server of the summary layer distributes the data to HDFS and Kafka; the data in HDFS is processed offline in batches using Hadoop MapReduce, and the data in Kafka is used for real-time computation.
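A summary-layer Agent that feeds both HDFS and Kafka might be configured along the following lines. This Python sketch only renders a Flume properties file. It assumes a replicating channel selector with one Memory Channel per destination so that each destination receives every event; the embodiment itself mentions only a single Channel, so this layout is an assumption. The agent name, port, HDFS path, broker address and topic are likewise placeholders.

```python
def summary_agent_config(agent, port, hdfs_path, kafka_servers, kafka_topic):
    """Render a properties file for a summary-layer Agent fanning out to HDFS and Kafka."""
    settings = {
        "sources": "r1",
        "channels": "c_hdfs c_kafka",
        "sinks": "k_hdfs k_kafka",
        # The Avro Source receives events from the Avro Sinks of the collection layer.
        "sources.r1.type": "avro",
        "sources.r1.bind": "0.0.0.0",
        "sources.r1.port": port,
        # Assumption: replicate every event into both channels.
        "sources.r1.selector.type": "replicating",
        "sources.r1.channels": "c_hdfs c_kafka",
        "channels.c_hdfs.type": "memory",
        "channels.c_kafka.type": "memory",
        # HDFS Sink: data for offline batch processing.
        "sinks.k_hdfs.type": "hdfs",
        "sinks.k_hdfs.hdfs.path": hdfs_path,
        "sinks.k_hdfs.hdfs.fileType": "DataStream",
        "sinks.k_hdfs.channel": "c_hdfs",
        # Kafka Sink: data for real-time computation.
        "sinks.k_kafka.type": "org.apache.flume.sink.kafka.KafkaSink",
        "sinks.k_kafka.kafka.bootstrap.servers": kafka_servers,
        "sinks.k_kafka.kafka.topic": kafka_topic,
        "sinks.k_kafka.channel": "c_kafka",
    }
    return "\n".join(f"{agent}.{key} = {value}" for key, value in settings.items())


config = summary_agent_config(
    agent="summary",
    port=4545,
    hdfs_path="hdfs://namenode:8020/flume/logs",  # placeholder path
    kafka_servers="kafka-host:9092",              # placeholder broker
    kafka_topic="app-logs",                       # placeholder topic
)
print(config)
```

Giving each destination its own channel lets a slow or unavailable HDFS or Kafka back up independently of the other.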
The working principle of the invention is as follows: the whole Flume framework follows a flat design and is fundamentally divided into three layers: the collection layer, the summary layer and the storage layer. Each layer can be designed separately and expanded continuously, and, to optimize the system, each layer is given its own configuration to meet the various requirements of customers and production environments. Formatting the data within this framework saves a great deal of subsequent data work, and the data is stored in the back end for processing after being reloaded and flushed. In this embodiment, the collected data is not sent directly to the storage end; instead, the data of the two servers is gathered on one server and then distributed by that server. As a result, if the HDFS or Kafka servers need upgrading or maintenance, Flume does not hang or report errors and the collection ends deployed on the application servers are not affected; only the layer that gathers the data needs to buffer the data stream, and writing resumes once the storage end returns to normal. The two servers of the collection layer are responsible only for collecting from the data sources, and the summary layer is responsible only for uniformly distributing the data to HDFS and Kafka. Many servers commonly form a collection end, and their number grows as services expand and time passes; once this distributed mode is successfully deployed, the later maintenance cost of log collection configuration is reduced and the impact of big data applications on the business system is lessened, so the effort pays off once and for all.
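The buffering behaviour during a storage outage can be illustrated with a short simulation. This is a minimal sketch under stated assumptions: it takes the summary layer to be the place where events are buffered, uses an unbounded in-memory queue (a real Channel has a finite capacity), and the class `SummaryLayer` and its attributes are hypothetical names.

```python
from collections import deque


class SummaryLayer:
    """Toy model of the summary layer buffering events while the storage end is down."""

    def __init__(self):
        self.buffer = deque()   # events waiting to be written
        self.storage_up = True  # whether HDFS/Kafka currently accept writes
        self.stored = []        # events that reached the storage end

    def receive(self, event):
        """Accept an event from the collection layer regardless of storage state."""
        self.buffer.append(event)
        self.flush()

    def flush(self):
        """Write buffered events, in arrival order, while the storage end is available."""
        while self.storage_up and self.buffer:
            self.stored.append(self.buffer.popleft())


summary = SummaryLayer()
summary.receive("a")
summary.storage_up = False  # storage end taken down for maintenance
summary.receive("b")        # collection ends keep sending and are not affected
summary.receive("c")
assert summary.stored == ["a"] and list(summary.buffer) == ["b", "c"]
summary.storage_up = True   # storage end back to normal
summary.flush()
print(summary.stored)
```

The collection ends call `receive` in exactly the same way before, during and after the outage, which mirrors the claim that the application servers are unaffected by maintenance on the storage end.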
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. A construction method based on a Flume distributed data collection architecture, characterized by comprising the following steps:
S1, creating at least three servers, and building Flume on a Hadoop cluster;
S2, determining the scale of the design according to actual production requirements, analyzing as a whole what the system environment requires, and building the framework accordingly;
S3, carrying out personalized configuration of the Flume collection end according to the format of the data source: configuring the Source of the Agent, the Agent component being the core unit of Flume, to define the type and location of the incoming data, wherein the Source of the Agent is mainly responsible for connecting to the data source, receiving the data and writing the collected data into a Channel;
S4, classifying and aggregating the incoming data sources according to different requirements, and configuring a plurality of Sinks on the aggregator to deliver the data, wherein the Channel caches the data from the Source and the Sinks consume and distribute the data in the Channel;
S5, the Sink reads data from the Channel and sends it to the next Agent or to the final destination.
2. The construction method based on a Flume distributed data collection architecture according to claim 1, wherein: three servers are provided, of which two serve as collection ends and the third serves as the summary layer; Memory Channels are configured for the two Agents of the collection layer, whose Sinks uniformly use Avro; and a Channel is configured for the Agent of the summary layer.
3. The construction method based on a Flume distributed data collection architecture according to claim 2, wherein: the server of the summary layer distributes the data to HDFS and Kafka; the data in HDFS is processed offline in batches using Hadoop MapReduce; and the data in Kafka is used for real-time computation.
CN202011221460.9A 2020-07-31 2020-11-05 Construction method based on Flume distributed data collection architecture Pending CN112328569A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2020107478664 2020-07-31
CN202010747866 2020-07-31

Publications (1)

Publication Number Publication Date
CN112328569A (en) 2021-02-05

Family

ID=74316064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011221460.9A Pending CN112328569A (en) 2020-07-31 2020-11-05 Construction method based on Flume distributed data collection architecture

Country Status (1)

Country Link
CN (1) CN112328569A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160204998A1 (en) * 2015-01-09 2016-07-14 Lg Cns Co., Ltd. Method of constructing data collector, server performing the same and storage medium for the same
CN106790572A (en) * 2016-12-27 2017-05-31 广州华多网络科技有限公司 The system and method that a kind of distributed information log is collected
CN107908690A (en) * 2017-11-01 2018-04-13 南京欣网互联网络科技有限公司 A kind of data processing method based on big data OA operation analysis
WO2018216828A1 (en) * 2017-05-24 2018-11-29 재단법인차세대융합기술연구원 Energy big data management system and method therefor
CN111327681A (en) * 2020-01-21 2020-06-23 北京工业大学 Cloud computing data platform construction method based on Kubernetes

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113067883A (en) * 2021-03-31 2021-07-02 建信金融科技有限责任公司 Data transmission method and device, computer equipment and storage medium
CN115086303A (en) * 2022-06-29 2022-09-20 徐工汉云技术股份有限公司 Multi-data-source data repeater and design method thereof
CN117198474A (en) * 2023-11-06 2023-12-08 天河超级计算淮海分中心 Medical image data real-time acquisition method, system, electronic equipment and storage medium
CN117198474B (en) * 2023-11-06 2024-03-01 天河超级计算淮海分中心 Medical image data real-time acquisition method, system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112328569A (en) Construction method based on Flume distributed data collection architecture
Shree et al. KAFKA: The modern platform for data management and analysis in big data domain
US10122788B2 (en) Managed function execution for processing data streams in real time
CN103390038B (en) A kind of method of structure based on HBase and retrieval increment index
CN111400326B (en) Smart city data management system and method thereof
US10853306B2 (en) Cloud-based distributed persistence and cache data model
US9230002B2 (en) High performant information sharing and replication for single-publisher and multiple-subscriber configuration
CN104036025A (en) Distribution-base mass log collection system
CN106339509A (en) Power grid operation data sharing system based on large data technology
CN109189835A (en) The method and apparatus of the wide table of data are generated in real time
CN106708993A (en) Spatial data storage processing middleware framework realization method based on big data technology
CN109063196A (en) Data processing method, device, electronic equipment and computer readable storage medium
CN109299056B (en) A kind of method of data synchronization and device based on distributed file system
CN107343021A (en) A kind of Log Administration System based on big data applied in state's net cloud
CN103699660A (en) Large-scale network streaming data cache-write method
Arputhamary et al. Data integration in Big Data environment
CN109669975B (en) Industrial big data processing system and method
CN108595605A (en) A kind of construction method of car networking platform database
CN115292414A (en) Method for synchronizing service data to data bins
US11243777B2 (en) Process stream replication for content management system synchronization
CN108763562A (en) A kind of construction method based on big data skill upgrading data exchange efficiency
CN107357919A (en) User behaviors log inquiry system and method
CN111597157A (en) Method for improving log processing system architecture
CN111049898A (en) Method and system for realizing cross-domain architecture of computing cluster resources
Suguna et al. Improvement of Hadoop ecosystem and their pros and cons in Big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210205
