CN110647407A

CN110647407A - Data configuration method and system

Info

Publication number: CN110647407A
Application number: CN201910816319.4A
Authority: CN
Inventors: 刘洋
Original assignee: Beijing Inspur Data Technology Co Ltd
Current assignee: Beijing Inspur Data Technology Co Ltd
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2020-01-03

Abstract

The invention discloses a data configuration method and system. A Source is configured in the Flume configuration file and its thread count, workerThreads, is set to be multiple; the type of the Channel is set as a multithreading file channel, MultithreadingFileChannel, which comprises a plurality of file channels (FileChannels); and a plurality of channel consumer (ChannelConsumer) instances corresponding to the sink are created for the sink, so that the sink is multithreaded. By configuring the data Source, the Channel and the data pool sink of a distributed Flume system in this way, the distributed Flume system can process data in multiple threads, which improves data processing efficiency.

Description

Data configuration method and system

Technical Field

The invention relates to the technical field of data transmission, in particular to a data configuration method and a data configuration system.

Background

Apache Flume is a distributed, highly available and highly reliable system for aggregating massive volumes of logs, with which large amounts of log data can be efficiently collected, aggregated and moved from many different sources to a centralized data store. An Apache Flume distributed system mainly comprises three components, namely a Source, a Channel and a sink: the Source acquires data from a data source and produces it to the Channel, the Channel serves as a message queue, and the sink finally consumes the data in the Channel.

However, in an Apache Flume distributed system only the Source supports multithreading. Data acquired by the Source from a data source can therefore only be produced to the Channel one item at a time, the Channel passes the data to the sink, and the sink consumes the data in the Channel sequentially, one item at a time, so the data processing efficiency of the Apache Flume distributed system is low.

Disclosure of Invention

In view of this, embodiments of the present invention provide a data configuration method and system, so as to solve the problem of low data processing efficiency in an Apache Flume distributed system.

In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:

A first aspect of the invention discloses a data configuration method, which comprises the following steps:

configuring the Source in the Flume configuration file, and setting the thread count workerThreads of the Source to be multiple;

setting the type of the Channel as a multithreading file channel MultithreadingFileChannel, wherein the MultithreadingFileChannel comprises a plurality of file channels (FileChannels);

and creating, for the sink, a plurality of channel consumer (ChannelConsumer) instances corresponding to the sink, so that the sink is multithreaded.

Preferably, configuring the Source in the Flume configuration file includes:

configuring the type of the Source as Scribe Source in the Flume configuration file, wherein the Scribe Source is used for receiving data from a Scribe data source;

and configuring the port of the Source as a target port in the Flume configuration file.

Preferably, setting the type of the Channel as a multithreading file channel MultithreadingFileChannel includes:

defining a MultithreadingFileChannel based on the custom Channel mechanism;

setting the type of the Channel to the defined MultithreadingFileChannel;

setting the name of the Channel as a preset name;

and setting the number of channels as a preset number.

Preferably, defining a MultithreadingFileChannel based on the custom Channel mechanism includes:

based on the custom Channel mechanism, implementing the custom MultithreadingFileChannel by inheriting from the basic channel semantics class BasicChannelSemantics;

creating a list of FileChannels, and storing a user-defined number of FileChannels in the list;

and creating a transaction and obtaining a FileChannel from the list.

Preferably, creating, for the sink, a plurality of channel consumer (ChannelConsumer) instances corresponding to the sink includes:

defining and implementing a multithreaded sink, MultithreadingKafkaSink;

and, in the configuration file corresponding to the sink, setting the name of the sink as a preset name, setting the number of ChannelConsumers of the sink as a preset number, and setting the type of the sink as MultithreadingKafkaSink.

Preferably, defining and implementing the MultithreadingKafkaSink includes:

setting the sink as a multithreaded data pool of the Kafka type;

and initializing a preset number of KafkaSink instances, storing the instances in a thread pool, and obtaining the KafkaSink instances from the thread pool using multithreading, thereby implementing the MultithreadingKafkaSink.

A second aspect of the present invention discloses a data configuration system suitable for a distributed Flume system, the distributed Flume system comprising at least three modules: a data Source, a Channel and a data pool sink, wherein the data Source is used for receiving data and the Channel is used for transmitting the data received by the data Source to the data pool sink for consumption. The data configuration system comprises:

a first configuration module, configured to configure the Source in the Flume configuration file and set the thread count workerThreads of the Source to be multiple;

a second configuration module, configured to set the type of the Channel as a multithreading file channel MultithreadingFileChannel, where the MultithreadingFileChannel includes a plurality of file channels (FileChannels);

and a third configuration module, configured to create, for the sink, a plurality of channel consumer (ChannelConsumer) instances corresponding to the sink, so that the sink is multithreaded.

Preferably, the first configuration module includes:

a first configuration unit, configured to configure the type of the Source as Scribe Source in the Flume configuration file, wherein the Scribe Source is used for receiving data from a Scribe data source;

and a second configuration unit, configured to configure the port of the Source as a target port in the Flume configuration file.

Preferably, the second configuration module includes:

a first defining unit, configured to define a MultithreadingFileChannel based on the custom Channel mechanism;

a second defining unit, configured to set the type of the Channel to the defined MultithreadingFileChannel;

a first setting unit, configured to set the name of the Channel as a preset name;

and a second setting unit, configured to set the number of channels to a preset number.

Preferably, the third configuration module includes:

a third defining unit, configured to define and implement a multithreaded sink;

and a third setting unit, configured to set, in the configuration file corresponding to the sink, the name of the sink as a preset name, the number of ChannelConsumers of the sink as a preset number, and the type of the sink as MultithreadingKafkaSink.

From the above, the invention discloses a data configuration method and system, wherein the Source is configured in the Flume configuration file and its thread count workerThreads is set to be multiple; the type of the Channel is set as a multithreading file channel MultithreadingFileChannel, which comprises a plurality of file channels (FileChannels); and a plurality of channel consumer (ChannelConsumer) instances corresponding to the sink are created for the sink, so that the sink is multithreaded. By configuring the data Source, the Channel and the data pool sink of the distributed Flume system in this way, the distributed Flume system can process data in multiple threads, which improves data processing efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a flowchart of a data configuration method according to an embodiment of the present invention;

Fig. 2 is a flowchart of a Source configuration method according to an embodiment of the present invention;

Fig. 3 is a flowchart of a Channel configuration method according to an embodiment of the present invention;

Fig. 4 is a flowchart of a sink configuration method according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of prior art Source, Channel, and sink configuration connections according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of Source, Channel, and sink configuration connections according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a data configuration system according to an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a first configuration module 701 according to an embodiment of the present invention;

Fig. 9 is a schematic structural diagram of a second configuration module 702 according to an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of a third configuration module 703 according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

An embodiment of the present invention provides a data configuration method, and referring to fig. 1, the method at least includes step S101 to step S103.

Step S101: configuring the Source in the Flume configuration file, and setting the thread count workerThreads of the Source to be multiple.

In step S101, the Flume configuration file is a configuration file of the Apache Flume distributed system, and the Source, the Channel and the sink may all be configured in it.

It should be noted that the Source already supports multithreading, so a plurality of threads can be set for the Source directly, which makes the Source multithreaded.

In the process of executing step S101, as shown in fig. 2, the specific execution process includes step S201 to step S202.

Step S201: configuring the type of the Source as Scribe Source in the Flume configuration file.

In step S201, data from a Scribe data source can be received through the Scribe Source.

It should be noted that the Scribe Source type of the Source can receive data from Scribe, where Scribe is an open-source log collection system from Facebook that can collect logs from various log sources and store them on a central storage system for centralized statistical analysis and processing.

Step S202: the port of Source is configured as the target port in the configuration file of Flume.

In step S202, the port of the Source is configured as a target port, so that the Source can receive data, where the target port is a preset port.

To facilitate understanding of how the Source is configured in step S101, the specific configuration statements for the Source are shown below.

a1.sources=scribe_source

// set the Source name to scribe_source

a1.sources.scribe_source.type=org.apache.flume.source.scribe.ScribeSource

// set the type of scribe_source to ScribeSource

a1.sources.scribe_source.port=1466

// set the port of scribe_source to 1466

a1.sources.scribe_source.workerThreads=10

// set the number of threads of scribe_source to 10
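As a quick sanity check on such a configuration, the following minimal Java sketch loads Source settings of this shape and reads back the thread count. It uses only the JDK's `java.util.Properties`; the class `SourceConfigCheck` and its method are hypothetical helpers for illustration and are not part of Flume.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper (not a Flume class): parses Flume-style properties text
// with the JDK only and extracts the Source thread count.
final class SourceConfigCheck {

    // Same keys and values as the Source statements in the description.
    static final String CONFIG = String.join("\n",
            "a1.sources = scribe_source",
            "a1.sources.scribe_source.type = org.apache.flume.source.scribe.ScribeSource",
            "a1.sources.scribe_source.port = 1466",
            "a1.sources.scribe_source.workerThreads = 10");

    // Returns the workerThreads value configured for the given agent and source.
    static int workerThreads(String text, String agent, String source) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String key = agent + ".sources." + source + ".workerThreads";
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing key: " + key);
        }
        return Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        System.out.println("workerThreads = " + workerThreads(CONFIG, "a1", "scribe_source"));
    }
}
```

A check of this kind only confirms that the key is present and numeric; whether the value is appropriate depends on the load and hardware of the deployment.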

Step S102: setting the type of the Channel as a multithreading file channel MultithreadingFileChannel.

In step S102, the MultithreadingFileChannel comprises a plurality of file channels (FileChannels).

It should be noted that the Channel itself does not support multithreading; its type is therefore set to the multithreading file channel MultithreadingFileChannel so that the Channel becomes multithreaded.

In the process of executing step S102, as shown in fig. 3, steps S301 to S304 are specifically included.

Step S301: defining a MultithreadingFileChannel based on the custom Channel mechanism.

In step S301, Flume provides a customization mechanism for Channels, so a MultithreadingFileChannel can be defined through this mechanism.

It should be noted that, because the Channel does not have a multithreading function, a MultithreadingFileChannel, that is, a multithreading file channel, must first be defined through the Channel customization mechanism.

It should be further noted that the MultithreadingFileChannel is defined in three stages: first, based on the custom Channel mechanism, the custom MultithreadingFileChannel is implemented by inheriting from the basic channel semantics class BasicChannelSemantics; then a list of FileChannels is created and a user-defined number of FileChannels is stored in the list; finally, a transaction is created and a FileChannel is obtained from the list.
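The idea of one logical channel backed by a list of sub-channels can be sketched in plain Java as follows. This is only an illustration under stated assumptions: in-memory queues stand in for FileChannel instances, round-robin selection is an assumed policy (the description does not say how a FileChannel is picked from the list), the transaction handling that BasicChannelSemantics would provide is omitted, and the class name `MultiQueueChannelSketch` is hypothetical. No Flume classes are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-JDK stand-in for a channel that wraps a list of sub-channels.
final class MultiQueueChannelSketch {
    private final List<BlockingQueue<String>> subChannels = new ArrayList<>();
    private final AtomicInteger putCursor = new AtomicInteger();
    private final AtomicInteger takeCursor = new AtomicInteger();

    // Create the list and store the user-defined number of sub-channels in it.
    MultiQueueChannelSketch(int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        for (int i = 0; i < count; i++) {
            subChannels.add(new LinkedBlockingQueue<>());
        }
    }

    // Pick the next sub-channel in round-robin order (assumed selection policy).
    private BlockingQueue<String> next(AtomicInteger cursor) {
        int index = Math.floorMod(cursor.getAndIncrement(), subChannels.size());
        return subChannels.get(index);
    }

    // Producer side: append the event to the next sub-channel.
    void put(String event) {
        next(putCursor).add(event);
    }

    // Consumer side: try each sub-channel at most once; null if all are empty.
    String take() {
        for (int i = 0; i < subChannels.size(); i++) {
            String event = next(takeCursor).poll();
            if (event != null) {
                return event;
            }
        }
        return null;
    }

    // Number of events currently held by one sub-channel.
    int subChannelSize(int index) {
        return subChannels.get(index).size();
    }
}
```

Because each sub-channel is an independent queue, several producer and consumer threads can work on different sub-channels at the same time, which is the effect the description attributes to holding a plurality of FileChannels.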

Step S302: setting the type of the Channel to the defined MultithreadingFileChannel.

In step S302, since the MultithreadingFileChannel contains a plurality of FileChannels, the type of the Channel is set to the defined MultithreadingFileChannel, so that the Channel becomes a single channel backed by multiple file channels and thereby has a multithreading function.

Step S303: setting the name of the Channel as a preset name.

In step S303, once the type of the Channel has been set, the Channel needs to be named.

Step S304: setting the number of channels as a preset number.

In step S304, since the type of the Channel is MultithreadingFileChannel, the Channel has a multithreading function; the number of channels is therefore set in order to specify the number of threads of the Channel.

To facilitate understanding of how step S102 configures the Channel, the specific configuration statements for the Channel are shown below.

a1.channels=file_channel

// name the channel file_channel

a1.channels.file_channel.type=org.apache.flume.extension.channel.MultithreadingFileChannel

// set the type of file_channel to MultithreadingFileChannel

a1.channels.file_channel.channels=10

// set the number of threads of file_channel to 10

a1.channels.file_channel.checkpointDir=/data0/flume/checkpoint

// set the checkpoint directory of file_channel to /data0/flume/checkpoint

a1.channels.file_channel.dataDir=/data0/flume/data

// set the data storage directory of file_channel to /data0/flume/data

Step S103: creating, for the sink, a plurality of channel consumer (ChannelConsumer) instances corresponding to the sink, so that the sink is multithreaded.

In step S103, since the sink does not have a multithreading function, a plurality of ChannelConsumer instances corresponding to the sink need to be created for the sink, so that the sink is multithreaded.

When step S103 is executed, as shown in fig. 4, steps S401 to S402 are specifically included.

Step S401: defining and implementing a multithreaded sink.

In step S401, to define and implement the multithreaded sink, the sink is first set as a multithreaded data pool of the Kafka type; a preset number of KafkaSink instances are then initialized and stored in a thread pool, and the KafkaSink instances are obtained from the thread pool using multithreading, thereby implementing the multithreaded sink.

It should be noted that, when the sink is set to the Kafka type, the multithreaded sink has to be implemented as a custom sink.
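The thread-pool arrangement described for the sink can be illustrated with the following plain-Java sketch. It is an assumption-laden stand-in, not the patented implementation: a `BlockingQueue` plays the role of the Channel, each pooled worker plays the role of one ChannelConsumer, and appending to a shared collection replaces the call that would send an event to Kafka. The class name `MultiConsumerSinkSketch` is hypothetical, and no Flume or Kafka classes are used.

```java
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Plain-JDK stand-in for a sink that drains one channel with several consumers.
final class MultiConsumerSinkSketch {

    // Starts `consumers` workers, lets them drain the channel, and returns
    // everything they delivered (the stand-in for events sent to Kafka).
    static Queue<String> drain(BlockingQueue<String> channel, int consumers) {
        Queue<String> delivered = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            final String name = "consumer-" + i;
            pool.execute(() -> {
                String event;
                // Each worker keeps taking events until the channel is empty.
                while ((event = channel.poll()) != null) {
                    delivered.add(name + ":" + event);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("consumers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return delivered;
    }

    public static void main(String[] args) {
        BlockingQueue<String> channel = new LinkedBlockingQueue<>();
        for (int i = 0; i < 1000; i++) {
            channel.add("event-" + i);
        }
        System.out.println("delivered = " + drain(channel, 10).size());
    }
}
```

Which worker handles which event varies from run to run, so ordering across consumers is not preserved in this sketch; only the total number of delivered events is deterministic.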

Step S402: in the configuration file corresponding to the sink, setting the name of the sink as a preset name, setting the number of ChannelConsumers of the sink as a preset number, and setting the type of the sink as MultithreadingKafkaSink.

In step S402, the number of ChannelConsumers is the number of threads of the sink.

It should be noted that setting the type of the sink to MultithreadingKafkaSink is what makes the sink multithreaded.

It should be noted that, when configuring the sink, in addition to the name of the sink and the number and type of ChannelConsumers, the topicHeaderName, brokerList and batchSize of the sink may also be set.

To facilitate understanding of how step S103 configures the sink, the specific configuration statements for the sink are shown below.

a1.sinks=kafka_sink

// set the name of the sink to kafka_sink

a1.sinks.kafka_sink.type=org.apache.flume.extension.sink.MultithreadingKafkaSink

// set the type of the sink to the predefined MultithreadingKafkaSink

a1.sinks.kafka_sink.topicHeaderName=category

// set the topic header name of the sink to category

a1.sinks.kafka_sink.consumers=10

// set the number of threads of the sink to 10

a1.sinks.kafka_sink.brokerList=kafkaHost:9092

// set the Kafka broker list of the sink to kafkaHost:9092

a1.sinks.kafka_sink.batchSize=1000

// set the batch size to 1000

It should be noted that, in step S101, step S102, and step S103 of the present application, the data Source, the Channel, and the data pool sink in the distributed Flume system are mainly configured, and the data Source, the Channel, and the data pool sink all have a multithreading function, so that the distributed Flume system can perform multithreading processing on data.

It should be noted that step S101, step S102, and step S103 are not limited to a sequential order, and may be executed simultaneously.

In summary, the Source is configured in the Flume configuration file and its thread count workerThreads is set to be multiple; the type of the Channel is set as a multithreading file channel MultithreadingFileChannel, which comprises a plurality of file channels (FileChannels); and a plurality of channel consumer (ChannelConsumer) instances corresponding to the sink are created for the sink, so that the sink is multithreaded. By configuring the data Source, the Channel and the data pool sink of the distributed Flume system in this way, the distributed Flume system can process data in multiple threads, which improves data processing efficiency.

It should be noted that, as shown in fig. 5, in the prior art the thread count of the multithreaded data Source is set to N, and N Channels and N sinks are then set up in the distributed Flume system, so that the distributed Flume system ultimately processes data in multiple threads.

Compared with the prior art, when the data Source, the Channel and the data pool sink of the distributed Flume system are configured according to the present application, the configuration can be completed with only simple program code, and during maintenance only the thread counts need to be modified, so that the configuration is simplified while the data processing efficiency of the distributed Flume system is improved.
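To make the maintenance difference concrete, the following Java sketch generates a bare-minimum channel and sink section for the N-channel, N-sink layout. Several things here are assumptions for illustration: the names c1..cN and k1..kN, the one-to-one pairing of sinks and channels, and the use of the standard `file` channel type. A real configuration would also need per-channel directories and sink types, so the generated text understates the prior-art size.

```java
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical generator showing how the N-channel/N-sink layout grows with N.
final class PriorArtConfigSketch {

    // Space-separated names such as "c1 c2 c3" for the given prefix.
    static String names(String prefix, int n) {
        return IntStream.rangeClosed(1, n)
                .mapToObj(i -> prefix + i)
                .collect(Collectors.joining(" "));
    }

    // Three header lines plus two lines per channel/sink pair.
    static String priorArt(int n) {
        StringBuilder sb = new StringBuilder();
        sb.append("a1.channels = ").append(names("c", n)).append('\n');
        sb.append("a1.sinks = ").append(names("k", n)).append('\n');
        sb.append("a1.sources.scribe_source.channels = ").append(names("c", n)).append('\n');
        for (int i = 1; i <= n; i++) {
            sb.append("a1.channels.c").append(i).append(".type = file\n");
            sb.append("a1.sinks.k").append(i).append(".channel = c").append(i).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(priorArt(3));
    }
}
```

Changing the degree of parallelism in this layout means adding or removing whole channel and sink blocks, whereas in the configuration described in this application it means editing the workerThreads, channels and consumers values.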

Corresponding to the data configuration method provided in the embodiment of the present application, an embodiment of the present application also provides a data configuration system. As shown in fig. 7, the data configuration system includes:

a first configuration module 701, configured to configure the Source in the Flume configuration file and set the thread count workerThreads of the Source to be multiple;

a second configuration module 702, configured to set the type of the Channel as a multithreading file channel MultithreadingFileChannel, where the MultithreadingFileChannel includes a plurality of file channels (FileChannels);

a third configuration module 703, configured to create, for the sink, a plurality of channel consumer (ChannelConsumer) instances corresponding to the sink, so that the sink is multithreaded.

Preferably, as shown in fig. 8, the first configuration module 701 includes:

a first configuration unit 801, configured to configure the type of the Source as Scribe Source in the Flume configuration file, wherein the Scribe Source is used for receiving data from a Scribe data source;

a second configuration unit 802, configured to configure the port of Source as a target port in the configuration file of Flume.

Preferably, as shown in fig. 9, the second configuration module 702 includes:

a first defining unit 901, configured to define a MultithreadingFileChannel based on the custom Channel mechanism;

a second defining unit 902, configured to set the type of the Channel to the defined MultithreadingFileChannel;

a first setting unit 903, configured to set a name of the channel as a preset name;

a second setting unit 904, configured to set the number of channels to a preset number.

Preferably, the first defining unit 901 includes:

a first acquisition subunit, configured to implement, based on the custom Channel mechanism, the custom MultithreadingFileChannel by inheriting from the basic channel semantics class BasicChannelSemantics;

a first creating subunit, configured to create a list of FileChannels and store a user-defined number of FileChannels in the list;

and a second creating subunit, configured to create a transaction and obtain a FileChannel from the list.

Preferably, as shown in fig. 10, the third configuration module 703 includes:

a third defining unit 1001, configured to define and implement a multithreaded sink;

a third setting unit 1002, configured to set, in the configuration file corresponding to the sink, the name of the sink as a preset name, the number of ChannelConsumers of the sink as a preset number, and the type of the sink as MultithreadingKafkaSink.

Preferably, the third defining unit 1001 includes:

a setting subunit, configured to set the sink as a multithreaded data pool of the Kafka type;

and an initialization subunit, configured to initialize a preset number of KafkaSink instances, store them in a thread pool, and obtain the KafkaSink instances from the thread pool using multithreading, thereby implementing the multithreaded sink.

It should be noted that, for the specific implementation process and implementation principle of each module and unit in the data configuration system disclosed in the foregoing embodiment of the present application, reference may be made to corresponding parts related to data configuration in the data configuration method disclosed in the foregoing embodiment of the present application, and details are not described here again.

In summary, the first configuration module configures the Source in the Flume configuration file and sets the thread count workerThreads of the Source to be multiple; the second configuration module sets the type of the Channel as a multithreading file channel MultithreadingFileChannel, which comprises a plurality of file channels (FileChannels); and the third configuration module creates, for the sink, a plurality of channel consumer (ChannelConsumer) instances corresponding to the sink, so that the sink is multithreaded. Through the data configuration system, the data Source, the Channel and the data pool sink of the distributed Flume system are configured so that the distributed Flume system can process data in multiple threads, which improves data processing efficiency.

The embodiments in the present specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, reference may be made to the corresponding descriptions of the method embodiments. The system embodiments described above are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data configuration method, applicable to a distributed Flume system comprising at least three modules: a data Source, a Channel and a data pool sink, wherein the data Source is used for receiving data and the Channel is used for transmitting the data received by the data Source to the data pool sink for consumption, the method comprising the following steps:

2. The method according to claim 1, wherein the configuring the data Source in the Flume configuration file comprises:

3. The method of claim 1, wherein setting the type of the channel as a multithreading file channel MultithreadingFileChannel comprises:

defining a MultithreadingFileChannel based on a custom Channel mechanism;

setting the type of the channel to the defined MultithreadingFileChannel;

setting the name of the channel as a preset name;

and setting the number of the channels as a preset number.

4. The method of claim 3, wherein defining a MultithreadingFileChannel based on a custom Channel mechanism comprises:

creating a transaction and obtaining the FileChannel in the list.

5. The method of claim 1, wherein creating a plurality of channel consumer ChannelConsumer instances corresponding to the sink for the sink comprises:

defining and implementing a multithreaded sink;

6. The method of claim 5, wherein defining and implementing a multithreaded sink comprises:

setting the sink as a multithreaded data pool of the Kafka type;

7. A data configuration system, adapted for a distributed Flume system, the distributed Flume system comprising at least three modules: a data Source, a Channel and a data pool sink, wherein the data Source is used for receiving data and the Channel is used for transmitting the data received by the data Source to the data pool sink for consumption, and the data configuration system comprises:

8. The system of claim 7, wherein the first configuration module comprises:

9. The system of claim 7, wherein the second configuration module comprises:

10. The system of claim 7, wherein the third configuration module comprises:

a third defining unit, configured to define and implement a multithreaded sink;