CN112328569A - Construction method based on Flume distributed data collection architecture

Info

Publication number: CN112328569A
Application number: CN202011221460.9A
Authority: CN
Inventors: 李向佳; 陈付祥; 李鹏; 黄洋
Original assignee: Shandong Yunman Intelligent Technology Co ltd
Current assignee: Shandong Yunman Intelligent Technology Co ltd
Priority date: 2020-07-31
Filing date: 2020-11-05
Publication date: 2021-02-05

Abstract

The invention provides a method for constructing a distributed data collection architecture based on Flume, which can reduce the later maintenance cost of log collection configuration and reduce the impact of big data applications on the service system. The method comprises the following steps: S1. creating at least three servers, and building Flume on a Hadoop cluster; S2. determining the scale of the design according to actual production requirements, and building the framework according to what the environment of the overall analysis system needs and how it needs it; S3. configuring the Source of the Agent, the core component of Flume, to define the type and location of the incoming data, wherein the Source is mainly responsible for connecting to a data source, receiving the data, and writing the acquired data into a Channel; S4. the Channel caching the data from the Source so that the Sink can consume and distribute it; S5. the Sink reading data from the Channel and sending it to the next Agent or to the final destination.

Description

Construction method based on Flume distributed data collection architecture

Technical Field

The invention relates to a method for constructing a distributed data collection architecture based on Flume, and belongs to the technical field of information technology.

Background

In the twenty-first century, with the rapid development of the Internet, Internet use has gradually entered the lives, study, and work of most people and has become indispensable. People can communicate with one another in real time without obstruction, and the rise of artificial intelligence (AI) has also created new relationships between people and objects and between objects themselves. As early as 2013, the four characteristics of big data (high volume, variety, high velocity, and value) were put forward in China. People generate large amounts of data as they use the Internet, and within this massive, complex body of structured and unstructured data generated in real time, most of the data, apart from a small amount of core business data, is log data related to that core data. Because real-time data streams better meet users' needs, the demand for real-time data in actual production keeps rising. Only part of the real-time data within the mass of data is actually needed in production, so that data must be processed and acted on more quickly, efficiently, and accurately, which in turn improves the user's experience of using and interacting with the system.

The Flume framework can be integrated with any data process. When the rate at which data is read exceeds the rate at which it can be written, Flume's buffering mechanism absorbs the difference. The framework comprises two transaction models, one covering data moving from Source to Channel and one covering data moving from Channel to Sink. With these two transactions in place, data is committed successfully and collected data is not lost. In the traditional distributed architecture, however, where each transaction is read or written in full, Flume is arranged linearly: the system can be expanded, but its stability is low, the efficiency of data transmission cannot be guaranteed, and the architecture does not lend itself to unified management.
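The two transactions described above can be illustrated with a toy model. The following Python sketch is not Flume's API; the class `MemoryChannel` and its methods `put_batch` and `take_batch` are hypothetical names chosen for illustration. It assumes that a put either commits a whole batch or rolls back, and that a take removes events only after the sink has delivered them.

```python
from collections import deque


class MemoryChannel:
    """Toy bounded buffer standing in for a Flume Channel (hypothetical, not Flume's API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def put_batch(self, events):
        """Source -> Channel transaction: commit the whole batch or nothing."""
        if len(self.queue) + len(events) > self.capacity:
            return False  # rolled back; the source keeps the batch and may retry
        self.queue.extend(events)
        return True

    def take_batch(self, max_events, deliver):
        """Channel -> Sink transaction: remove events only if delivery succeeds."""
        batch = list(self.queue)[:max_events]
        if not batch:
            return 0
        try:
            deliver(batch)
        except Exception:
            return 0  # rolled back; the events stay in the channel
        for _ in batch:
            self.queue.popleft()
        return len(batch)
```

In this model a failed delivery leaves the channel unchanged, which is the property the passage relies on when it says that collected data is not lost.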

Commonly, many servers together form a single collection end, and the number of servers grows as services expand and time passes.

Disclosure of Invention

The invention aims to provide a method for constructing a Flume-based distributed data collection architecture, which can reduce the later maintenance cost of log collection configuration and reduce the impact of big data applications on the service system.

In order to achieve the purpose, the invention is realized by the following technical scheme:

A construction method based on a Flume distributed data collection architecture comprises the following steps:

S1. creating at least three servers, and building Flume on a Hadoop cluster;

S2. determining the scale of the design according to actual production requirements, and building the framework according to what the environment of the overall analysis system needs and how it needs it;

S3. performing personalized configuration of the Flume collection end according to the format of the data source: configuring the Source of the Agent, the core component of Flume, to define the type and location of the incoming data, wherein the Source is mainly responsible for connecting to a data source, receiving data, and writing the acquired data into a Channel;

S4. classifying and aggregating the incoming data sources according to different requirements, and configuring a plurality of Sinks through an aggregator to distribute the data, wherein the Channel caches the data from the Source so that the Sinks can consume and distribute it;

S5. the Sink reading data from the Channel and sending it to the next Agent or to the final destination.

In a preferred scheme of the construction method based on the Flume distributed data collection architecture, three servers are used: two serve as collection ends and the third serves as the summary layer. Memory Channels are configured for the two Agents of the collection layer, and their Sinks uniformly use Avro; a Channel is configured for the Agent of the summary layer.
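As a rough illustration of a collection-layer Agent with a Memory Channel and an Avro Sink, the Python sketch below renders a Flume properties file as text. The function name `collection_agent_config`, the component names `r1`, `c1`, and `k1`, the choice of an exec Source that tails a log file, and the capacity values are all assumptions made for this example; the source text fixes only the Memory Channel and the Avro Sink.

```python
def collection_agent_config(agent, log_path, collector_host, collector_port):
    """Render a Flume properties file: exec source -> memory channel -> avro sink."""
    lines = [
        f"{agent}.sources = r1",
        f"{agent}.channels = c1",
        f"{agent}.sinks = k1",
        # Source: tailing a log file is an assumption; the text does not fix the type.
        f"{agent}.sources.r1.type = exec",
        f"{agent}.sources.r1.command = tail -F {log_path}",
        f"{agent}.sources.r1.channels = c1",
        # Channel: in-memory buffer, as in the preferred scheme; sizes are illustrative.
        f"{agent}.channels.c1.type = memory",
        f"{agent}.channels.c1.capacity = 10000",
        f"{agent}.channels.c1.transactionCapacity = 1000",
        # Sink: Avro RPC to the summary-layer Agent.
        f"{agent}.sinks.k1.type = avro",
        f"{agent}.sinks.k1.hostname = {collector_host}",
        f"{agent}.sinks.k1.port = {collector_port}",
        f"{agent}.sinks.k1.channel = c1",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Host name, port, and log path are placeholders.
    print(collection_agent_config("a1", "/var/log/app.log", "summary-host", 4545))
```

Both collection servers could use the same template with different log paths, which keeps the per-server configuration small.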

In a preferred scheme of the construction method based on the Flume distributed data collection architecture, the server of the summary layer distributes data to HDFS and Kafka; data in HDFS undergoes offline batch processing with Hadoop MapReduce, and data in Kafka undergoes real-time computation.
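One possible shape for the summary-layer Agent is sketched below, again as Python that renders a Flume properties file. It assumes an Avro Source that receives from the collection Agents and a replicating selector feeding one channel per destination, so that every event reaches both HDFS and Kafka. The source text mentions only "a Channel" for the summary layer, so the two-channel layout, the memory channel type, and all names (`summary_agent_config`, `c_hdfs`, `c_kafka`, `k_hdfs`, `k_kafka`) are assumptions of this example.

```python
def summary_agent_config(agent, listen_port, hdfs_path, kafka_servers, kafka_topic):
    """Render a Flume properties file: avro source fanned out to HDFS and Kafka sinks."""
    lines = [
        f"{agent}.sources = r1",
        f"{agent}.channels = c_hdfs c_kafka",
        f"{agent}.sinks = k_hdfs k_kafka",
        # Avro source: receives events sent by the collection-layer Avro sinks.
        f"{agent}.sources.r1.type = avro",
        f"{agent}.sources.r1.bind = 0.0.0.0",
        f"{agent}.sources.r1.port = {listen_port}",
        # Replicating selector: copy each event into every listed channel.
        f"{agent}.sources.r1.selector.type = replicating",
        f"{agent}.sources.r1.channels = c_hdfs c_kafka",
        # Channel type is an assumption; the text does not specify it for this layer.
        f"{agent}.channels.c_hdfs.type = memory",
        f"{agent}.channels.c_kafka.type = memory",
        # HDFS sink for offline batch processing.
        f"{agent}.sinks.k_hdfs.type = hdfs",
        f"{agent}.sinks.k_hdfs.hdfs.path = {hdfs_path}",
        f"{agent}.sinks.k_hdfs.channel = c_hdfs",
        # Kafka sink for real-time computation.
        f"{agent}.sinks.k_kafka.type = org.apache.flume.sink.kafka.KafkaSink",
        f"{agent}.sinks.k_kafka.kafka.bootstrap.servers = {kafka_servers}",
        f"{agent}.sinks.k_kafka.kafka.topic = {kafka_topic}",
        f"{agent}.sinks.k_kafka.channel = c_kafka",
    ]
    return "\n".join(lines) + "\n"
```

Giving each destination its own channel means a slow or unavailable destination fills only its own buffer, which matches the separation of collection and distribution that the scheme aims for.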

The invention has the following advantages: for distributed data collection, the Flume service is layered, and data collection and data distribution are maintained separately; maintaining the collection layer and the distribution layer independently makes the system more stable and greatly improves its scalability.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.

FIG. 1 is a schematic flow chart of an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Abbreviations and key terms referred to in this disclosure are defined as follows:

Data collection: accessing a data source by means of a computer, or of dedicated sensors combined with hardware products, to acquire data, processing the data according to a fixed or user-defined format, and finally storing the data in some service;

Flume: a distributed, highly reliable, and highly available service framework. As a distributed log collection framework, it is mainly used for aggregating and transmitting massive log data and supports the transmission of streaming data;

HDFS (Hadoop Distributed File System): the Hadoop distributed file system, designed for streaming access to large files. It suits files of hundreds of MB, GB, or TB in size and write-once, read-many access. It is not well suited to low-latency data access, large numbers of small files, concurrent writes, or arbitrary file modification;

Kafka: a message buffer queue commonly used in distributed frameworks, which can be used for data transfer and scheduling;

Producer: Kafka's producer, which produces data, i.e., acquires data;

Consumer: Kafka's consumer, which consumes the data produced into Kafka;

Channel: a thread-safe buffer queue in Flume;

Agent: the core element of Flume;

Source: the component of the Agent used for defining a data source;

Avro: high-performance middleware based on binary data transmission, and a sub-project of Hadoop.

Examples

S1. Three servers are created. For convenience of business operation, Flume is built on a Hadoop cluster. Two servers serve as acquisition ends and the third serves as the summary layer. Memory Channels are configured for the two Agents of the acquisition layer, and their Sinks uniformly use Avro; a Channel is configured for the Agent of the summary layer;

S3. The Flume acquisition end is given a personalized configuration according to the format of the data source: the Source of the Agent, the core component of Flume, is configured to define the type and location of the incoming data. The Source is mainly responsible for connecting to a data source, receiving the data, and writing the acquired data into a Channel. The data is formatted and a user-defined Filter is added, so that the data is processed before it is passed to the Channel;
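The formatting and filtering performed before data reaches the Channel can be modeled as follows. Within Flume itself this role is typically played by an interceptor written in Java; the Python below only sketches the logic. The log line layout ("timestamp level message"), the level ordering, and the function names `format_event`, `keep_event`, and `process` are hypothetical and not taken from the source text.

```python
import json

# Hypothetical severity order used by the example filter.
LEVELS = ["DEBUG", "INFO", "WARN", "ERROR"]


def format_event(raw_line):
    """Parse 'timestamp level message' into a dict; return None if malformed."""
    parts = raw_line.strip().split(" ", 2)
    if len(parts) != 3:
        return None
    timestamp, level, message = parts
    return {"ts": timestamp, "level": level.upper(), "msg": message}


def keep_event(event, min_level="INFO"):
    """User-defined filter: keep events at or above min_level."""
    if event["level"] not in LEVELS:
        return False
    return LEVELS.index(event["level"]) >= LEVELS.index(min_level)


def process(lines, channel):
    """Format each line, apply the filter, and append survivors to the channel list."""
    for line in lines:
        event = format_event(line)
        if event is not None and keep_event(event):
            channel.append(json.dumps(event))
    return channel
```

Formatting at the acquisition end means downstream consumers receive a uniform record shape and do not each need to parse raw log lines.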

S5. The Sink reads data from the Channel and sends it to the next Agent or to the final destination. In an actual production environment the data serves two purposes in most cases: offline batch processing and real-time stream computing.

In this embodiment, the server of the summary layer distributes data to HDFS and Kafka; data in HDFS undergoes offline batch processing with Hadoop MapReduce, and data in Kafka undergoes real-time computation.

The working principle of the invention is as follows. The whole Flume framework is designed as a flat layout divided into three basic layers: the collection layer, the summary layer, and the storage layer. Each layer can be designed separately and expanded continuously, and, to optimize the system, each layer can be given its own configuration to meet the various requirements of customers and production environments. Because data is formatted within this framework, much subsequent data work is saved, and the data is saved to the back end for processing after being reloaded and flushed.

In this embodiment, collected data is not sent directly to the storage end. Instead, the data of the two servers is gathered on one server and then distributed by that server. If the HDFS or Kafka servers need upgrading or maintenance, Flume therefore does not hang or report errors, and the collection end deployed on the application server is not affected: the collection layer only needs to buffer the data stream, and writing resumes once the storage end returns to normal. The two servers of the acquisition layer are responsible only for acquiring data from the data sources, and the summary layer is responsible only for distributing the data uniformly to HDFS and Kafka. Commonly, many servers together form the acquisition end, and their number grows as services expand and time passes. Once this distributed mode is deployed successfully, it reduces the later maintenance cost of log collection configuration and the impact of big data applications on the service system, making it a once-and-for-all solution.
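The buffering behavior described above can be illustrated with a small simulation. This is a toy model rather than Flume code: it assumes the buffer lives in the agent that faces HDFS and Kafka and that each destination has its own queue. The classes `Destination` and `SummaryLayer` and their methods are hypothetical names.

```python
from collections import deque


class Destination:
    """Stand-in for a storage-end service such as HDFS or Kafka."""

    def __init__(self, name):
        self.name = name
        self.available = True
        self.received = []

    def write(self, event):
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self.received.append(event)


class SummaryLayer:
    """Keeps one buffer per destination and drains each buffer when it can."""

    def __init__(self, destinations):
        self.destinations = {d.name: d for d in destinations}
        self.buffers = {d.name: deque() for d in destinations}

    def accept(self, event):
        """Called by the collection side; never fails because of a destination outage."""
        for buf in self.buffers.values():
            buf.append(event)
        self.flush()

    def flush(self):
        """Deliver buffered events in order; stop at the first failure per destination."""
        for name, buf in self.buffers.items():
            dest = self.destinations[name]
            while buf:
                try:
                    dest.write(buf[0])
                except ConnectionError:
                    break  # keep the event buffered and retry on a later flush
                buf.popleft()
```

While one destination is down, `accept` keeps succeeding and the other destination keeps receiving events; once the failed destination recovers, a call to `flush` delivers the backlog in the original order.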

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A construction method based on a Flume distributed data collection architecture is characterized by comprising the following steps:

S1. creating at least three servers, and building Flume on a Hadoop cluster;

2. The Flume distributed data collection architecture-based construction method according to claim 1, wherein: three servers are provided, of which two serve as acquisition ends and the third serves as a summary layer; Memory Channels are configured for the two Agents of the acquisition layer, and their Sinks uniformly use Avro; and a Channel is configured for the Agent of the summary layer.

3. The Flume distributed data collection architecture-based construction method according to claim 2, wherein: the server of the summary layer distributes the data to HDFS and Kafka; data in HDFS undergoes offline batch processing with Hadoop MapReduce; and data in Kafka undergoes real-time computation.