CN103401934A - Method and system for acquiring log data

Method and system for acquiring log data

Info

Publication number
CN103401934A
Authority
CN
China
Prior art keywords
kafka
log
log data
data
flume
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013103404125A
Other languages
Chinese (zh)
Inventor
姚仁捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Vipshop Information And Technology Co Ltd
Original Assignee
Guangzhou Vipshop Information And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Vipshop Information And Technology Co Ltd filed Critical Guangzhou Vipshop Information And Technology Co Ltd
Priority to CN2013103404125A priority Critical patent/CN103401934A/en
Publication of CN103401934A publication Critical patent/CN103401934A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for acquiring log data. The method comprises the following steps: a first Flume acquires log data from an application server; the first Flume transmits the acquired log data to Kafka, and Kafka converts the received log data into a Kafka message queue. According to the method and the system for acquiring log data, the log data in the application server are transmitted to Kafka via the first Flume and converted into a Kafka message queue by Kafka. When acquiring log data from Kafka, a user only needs to connect to Kafka, so that no complex restart and plug-in operations are needed, and the flexibility of log-data acquisition is enhanced.

Description

Method and system for acquiring log data
Technical field
The present invention relates to the field of data communication technology, and in particular to a method and system for acquiring log data.
Background
With the development of e-commerce technology, the load carried by the network's back-end servers keeps increasing, and the volume of data to be processed is growing geometrically; accurately collecting, transmitting, and computing massive logs in real time has accordingly become an urgent demand in e-commerce. The prior art mainly uses the flume-ng technology involved when Twitter collects logs. Flume is a distributed, reliable, high-performance tool for collecting, aggregating, and transmitting large amounts of log data from different data sources to a central data store. Flume has three important concepts, namely source, channel, and sink; these three logical components make up a Flume agent. The source defines where the data comes from (a file, for example), the sink defines where the data goes, and the channel is the conduit between the source and the sink. Source, channel, and sink all scale horizontally and can be adjusted according to performance requirements.
However, Flume is a Java process that loads the various platform-independent JAR (Java Archive) files in its lib directory only at startup. If a program needs to read information from Flume, a Flume plug-in must be written (in Java) and Flume must be restarted before the plug-in takes effect; the operation is cumbersome and inflexible. Flume also places too much emphasis on message reliability, so its throughput is low, which makes it inconvenient for users to quickly acquire log data from Flume.
Summary of the invention
Based on this, it is necessary to provide a method and system for acquiring log data that address the above problems of Flume's inflexible log-data collection and low throughput.
A method for acquiring log data comprises the following steps:
a first Flume acquires log data from an application server;
the first Flume sends the acquired log data to Kafka, and Kafka converts the received log data into a Kafka message queue.
A system for acquiring log data comprises an application server, a first Flume, and Kafka, wherein:
the first Flume is configured to acquire log data from the application server and send the acquired log data to Kafka;
Kafka is configured to convert the received log data into a Kafka message queue.
In the above method and system for acquiring log data, the log data in the application server are sent to Kafka through the first Flume, and Kafka converts the log data into a Kafka message queue. When a user acquires log data from Kafka, the user only needs to connect to Kafka; no cumbersome restart and plug-in operations are required, which improves the flexibility of log-data acquisition.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the first embodiment of the method for acquiring log data according to the present invention;
Fig. 2 is a schematic flowchart of the second embodiment of the method for acquiring log data according to the present invention;
Fig. 3 is a schematic flowchart of the third embodiment of the method for acquiring log data according to the present invention;
Fig. 4 is a schematic structural diagram of the first embodiment of the system for acquiring log data according to the present invention;
Fig. 5 is a schematic structural diagram of the second embodiment of the system for acquiring log data according to the present invention;
Fig. 6 is a schematic structural diagram of the third embodiment of the system for acquiring log data according to the present invention.
Detailed description
Please refer to Fig. 1, which is a schematic flowchart of the first embodiment of the method for acquiring log data according to the present invention.
The method for acquiring log data described in this embodiment comprises the following steps:
Step 101: a first Flume acquires log data from an application server.
Step 102: the first Flume sends the acquired log data to Kafka, and Kafka converts the received log data into a Kafka message queue.
With the method for acquiring log data described in this embodiment, the log data in the application server are sent to Kafka through the first Flume, and Kafka converts the log data into a Kafka message queue. When a user acquires log data from Kafka, the user only needs to connect to Kafka; no cumbersome restart and plug-in operations are required, which improves the flexibility of log-data acquisition.
For step 101, the first Flume preferably captures log data from the application server through its three logical components: source, channel, and sink. The number and types of application servers can be set in advance according to the user's needs and user types.
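As an illustration of the source, channel, and sink arrangement described above, a minimal Flume agent configuration might look like the following sketch; the agent name, component names, file path, and sink class are hypothetical and are not taken from the patent:

```properties
# Hypothetical agent: tail an application log file (source),
# buffer events in memory (channel), and hand them to a Kafka sink.
agent1.sources = appLogSource
agent1.channels = memChannel
agent1.sinks = kafkaSink

agent1.sources.appLogSource.type = exec
agent1.sources.appLogSource.command = tail -F /var/log/app/app.log
agent1.sources.appLogSource.channels = memChannel

agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 10000

agent1.sinks.kafkaSink.type = org.example.flume.KafkaSink
agent1.sinks.kafkaSink.channel = memChannel
```

Each of the three components can be tuned or replaced independently, which is the horizontal extensibility the background section mentions.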
For step 102, Kafka is a high-throughput distributed publish-subscribe messaging system. First, the operating-system file cache on which Kafka relies is sufficiently advanced and powerful; as long as writes are not random, sequential read and write performance is very efficient. Kafka's data can only be appended sequentially, and its deletion policy is to delete data only after it has accumulated to a certain size or has exceeded a certain age. Another unique characteristic of Kafka is that consumer state is kept on the client rather than on the MQ server, so the server does not need to record the delivery progress of messages; each client knows for itself where it should read the next message from, and message delivery adopts a client-pull model, which greatly lightens the server's burden. Kafka also emphasizes reducing the serialization and copying overhead of data: it can group messages into message sets and store and send them in batches.
Preferably, Kafka has the following characteristics:
1. It provides message persistence through an O(1) disk data structure, which remains stable over long periods even with TB-scale storage of real-time message data.
2. High throughput: even on very ordinary hardware, Kafka can support hundreds of thousands of messages per second.
3. It supports partitioning messages across the Kafka servers and the consumer-machine cluster.
4. It supports parallel data loading into Hadoop.
In one embodiment, the step in which the first Flume sends the acquired log data to Kafka comprises the following steps:
Step 1021: the first Flume establishes a data transmission channel to Kafka through a first Github system, and performs parameter configuration with Kafka as the data receiving end.
Step 1022: the first Flume pre-processes each piece of acquired log data and sends it to Kafka through the established data transmission channel.
Step 1023: when the log data received by Kafka reaches a preset amount, Kafka packs that amount of log data, stores it into Kafka's storage area, and converts it into the Kafka message queue.
In this embodiment, Github can host all kinds of Git repositories and provides a web interface; unlike other services such as SourceForge or Google Code, Github's unique feature is how simple it makes forking a branch from another project. Git is a distributed version control system, originally written by Linus Torvalds for managing the Linux kernel code.
Preferably, the first Flume and Kafka transmit log data through a Flume plug-in, flume-kafka (the data transmission channel), which is hosted in the first Github system. flume-kafka supports both pulling log data from Kafka and pushing log data to Kafka.
Further, before transmitting log data, the first Flume first defines Kafka as the data source in a code snippet, and presets the pre-processing, configuration, and shutdown of each piece of log data through the process, configure, and stop routines of the plug-in. The specific operation code is as follows:
(The code listing appears only as an image, Figure BDA00003629304900041, in the original publication.)
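Since that listing is not reproduced in the source, the following self-contained Java sketch only models the process, configure, and stop lifecycle the paragraph describes; the class and method signatures are simplified stand-ins invented for illustration, not the actual Flume plug-in API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a Flume sink: configure() reads parameters,
// process() handles one piece of log data, stop() shuts the sink down.
class SimpleLogSink {
    private String topic;
    private boolean running;
    private final List<String> delivered = new ArrayList<>();

    void configure(Map<String, String> params) {
        // Parameter configuration with Kafka as the data receiving end.
        this.topic = params.getOrDefault("topic", "logs");
        this.running = true;
    }

    void process(String logLine) {
        if (!running) throw new IllegalStateException("sink stopped");
        // Pre-process (here: trim) and "deliver" the log line.
        delivered.add(topic + ":" + logLine.trim());
    }

    void stop() {
        running = false;
    }

    List<String> delivered() {
        return delivered;
    }
}
```

Calling process() after stop() fails, mirroring the point that a reconfigured component must be brought up again before use.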
Preferably, with Kafka as the data source, the code by which the first Flume connects to Kafka is as follows:
(The connection code appears only as an image, Figure BDA00003629304900042, in the original publication.)
In the above code, props.put sets the attributes of the connection to Kafka. Each attribute is described below:
groupid: the name of the connection (the consumer group);
autocommit.enable: automatically informs Kafka which log message has currently been consumed;
autooffset.reset: automatically acquires the latest log messages;
socket.buffersize: the buffer size used for socket communication.
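Based on the four attributes just listed, the missing props.put listing presumably resembled the sketch below; the concrete values are illustrative guesses, and the property keys simply follow the names given in the text:

```java
import java.util.Properties;

class KafkaConsumerProps {
    // Build the connection attributes described in the text:
    // groupid, autocommit.enable, autooffset.reset, socket.buffersize.
    static Properties build() {
        Properties props = new Properties();
        props.put("groupid", "flume-log-consumer");   // name of the connection
        props.put("autocommit.enable", "true");       // report the consumed position automatically
        props.put("autooffset.reset", "largest");     // start from the latest log messages
        props.put("socket.buffersize", "1048576");    // socket buffer size in bytes
        return props;
    }
}
```

With these four properties defined, a consumer would hand the Properties object to the Kafka client library to open the connection.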
With these attributes defined, the first Flume and Kafka finally establish a connection. Next, we explain how Kafka receives data after the first Flume has connected to it. We define the number of messages in one batch. A so-called batch means that Kafka packs a certain amount of data and sends it to the data destination in one go, rather than forwarding data to Kafka's storage area once for every piece of data received from the first Flume. Sending in batches saves network transmission overhead.
Similarly, there is corresponding code for connecting to Kafka with Kafka as the data destination (that listing is likewise omitted from this text).
Connecting and sending log data in this case differs from the case with Kafka as the data source: with Kafka as the data destination, the batching of log data is controlled by Kafka itself rather than by the first Flume. According to batch.size in that code, Kafka takes a certain number of log messages and sends them as one batch.
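As a self-contained illustration of the batch.size behaviour described above (the class and its methods are invented toy names, not Kafka's API), messages can be accumulated and emitted as one batch once the configured size is reached:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of batch.size: accumulate log messages and emit them
// as one batch once the configured size is reached.
class BatchingSender {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    private final List<List<String>> sentBatches = new ArrayList<>();

    BatchingSender(int batchSize) {
        this.batchSize = batchSize;
    }

    void send(String message) {
        pending.add(message);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    void flush() {
        if (!pending.isEmpty()) {
            // One network transmission for the whole batch, saving overhead.
            sentBatches.add(new ArrayList<>(pending));
            pending.clear();
        }
    }

    List<List<String>> sentBatches() {
        return sentBatches;
    }
}
```

With batchSize = 3, seven messages produce two full batches plus one pending message that a final flush() sends, so the network is touched three times instead of seven.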
In another embodiment, after the step of Kafka converting the received log data into a Kafka message queue, the method further comprises the following step:
acquiring Kafka's operating data through the JMX monitoring interface provided by Kafka.
Compared with the first Flume, Kafka has more monitoring data available, which makes it convenient to monitor the health of the whole system.
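The patent does not name Kafka's JMX MBeans, but the general mechanism of reading operating data over JMX can be sketched with the JDK's platform MBean server; here the standard java.lang:type=Runtime bean stands in for a Kafka metric bean:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class JmxProbe {
    // Read one attribute from an MBean, the same mechanism a monitor
    // would use against Kafka's JMX endpoint.
    static Object readAttribute(String beanName, String attribute) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            return server.getAttribute(new ObjectName(beanName), attribute);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

A monitoring process would connect to Kafka's remote JMX port in the same way and poll such attributes periodically to judge the health of the system.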
Please refer to Fig. 2, which is a schematic flowchart of the second embodiment of the method for acquiring log data according to the present invention.
The method for acquiring log data described in this embodiment differs from the first embodiment in that, after the step of Kafka converting the received log data into a Kafka message queue, it further comprises the following steps:
Step 201: a Storm real-time computation cluster establishes a data transmission channel to Kafka through a Storm system.
Step 202: the Storm real-time computation cluster acquires the needed log data from the Kafka message queue through the established data channel.
For step 201, Storm provides a set of general primitives for distributed real-time computation and can be used in "stream processing" to process messages and update databases in real time. Storm is a way of managing queues and worker clusters. Storm can also be used for "continuous computation", running continuous queries over data streams and outputting results to users as streams while the computation runs, and for "distributed RPC", running expensive computations in parallel.
For step 202, the Storm real-time computation cluster can connect to Kafka and acquire log messages using the Java methods of storm-contrib within Storm. Systems other than the Storm real-time computation cluster can also acquire log messages (one log entry per line) from Kafka acting as message-oriented middleware.
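The storm-contrib connection code is not shown in the patent; the pull pattern it implements, in which a consumer repeatedly takes the next log lines from the message queue, can be modelled with a plain in-memory queue (all names here are invented for illustration):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy stand-in for a Kafka message queue and a consumer that pulls
// log lines from it, as a Storm spout (or any other client) would.
class LogQueue {
    private final Deque<String> queue = new ArrayDeque<>();

    void publish(String logLine) {
        queue.addLast(logLine);
    }

    // Client-pull model: the consumer decides when to read next.
    List<String> drain(int max) {
        List<String> out = new ArrayList<>();
        while (out.size() < max && !queue.isEmpty()) {
            out.add(queue.pollFirst());
        }
        return out;
    }
}
```

Because the client drives every read, no restart or plug-in work on the broker side is needed when a new consumer joins, which is the flexibility the embodiment emphasizes.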
At present, Kafka has affinity with many languages: popular languages such as Java, Python, and Ruby all have libraries that support Kafka.
With the method for acquiring log data described in this embodiment, the distinctive connection between Storm and Kafka lets users quickly acquire log data from Kafka without repeatedly restarting Kafka.
Please refer to Fig. 3, which is a schematic flowchart of the third embodiment of the method for acquiring log data according to the present invention.
The method for acquiring log data described in this embodiment differs from the first embodiment in that, after the step of Kafka converting the received log data into a Kafka message queue, it further comprises the following step:
a second Flume acquires the log data the user needs from the Kafka message queue.
For systems other than the Storm real-time computation cluster, such as an HDFS cluster or a full-text search cluster, which have no intrinsic way of connecting to Kafka, a second Flume can be used to connect to Kafka and acquire log messages from Kafka acting as message-oriented middleware.
In one embodiment, the step in which the second Flume acquires the log data the user needs from the Kafka message queue comprises the following steps:
Step 301: the second Flume establishes a data transmission channel to Kafka through a second Github system, and performs parameter configuration with Kafka as the data transmitting end.
Step 302: the second Flume sends a log request to Kafka through the established data transmission channel.
Step 303: according to the log request, Kafka acquires the corresponding log data from the Kafka message queue and sends it to the second Flume in batches through the data transmission channel.
The way the second Flume connects to Kafka in this embodiment can use the same connection code as connecting through flume-kafka with Kafka as the data destination in the first embodiment.
With the method for acquiring log data described in this embodiment, the second Flume connects to Kafka and, on behalf of other systems, quickly acquires log messages from Kafka acting as message-oriented middleware.
Please refer to Fig. 4, which is a schematic structural diagram of the first embodiment of the system for acquiring log data according to the present invention.
The system for acquiring log data described in this embodiment comprises an application server 100, a first Flume 200, and Kafka 300, wherein:
the first Flume 200 is configured to acquire log data from the application server 100 and send the acquired log data to Kafka 300;
Kafka 300 is configured to convert the received log data into a Kafka message queue.
With the system for acquiring log data described in this embodiment, the log data in the application server are sent to Kafka through the first Flume, and Kafka converts the log data into a Kafka message queue. When a user acquires log data from Kafka, the user only needs to connect to Kafka; no cumbersome restart and plug-in operations are required, which improves the flexibility of log-data acquisition.
For the application server 100, its number and type can be set in advance according to the user's needs and user types.
For the first Flume 200, it preferably captures log data from the application server through its three logical components: source, channel, and sink.
For Kafka 300, it is a high-throughput distributed publish-subscribe messaging system. First, the operating-system file cache on which it relies is sufficiently advanced and powerful; as long as writes are not random, sequential read and write performance is very efficient. Kafka 300's data can only be appended sequentially, and its deletion policy is to delete data only after it has accumulated to a certain size or has exceeded a certain age. Another unique characteristic of Kafka 300 is that consumer state is kept on the client rather than on the MQ server, so the server does not need to record the delivery progress of messages; each client knows for itself where it should read the next message from, and message delivery adopts a client-pull model, which greatly lightens the server's burden. Kafka 300 also emphasizes reducing the serialization and copying overhead of data: it can group messages into message sets and store and send them in batches.
Preferably, Kafka 300 has the following characteristics:
1. It provides message persistence through an O(1) disk data structure, which remains stable over long periods even with TB-scale storage of real-time message data.
2. High throughput: even on very ordinary hardware, Kafka can support hundreds of thousands of messages per second.
3. It supports partitioning messages across the Kafka servers and the consumer-machine cluster.
4. It supports parallel data loading into Hadoop.
In one embodiment, the system for acquiring log data described in this embodiment further comprises a first Github system, wherein:
the first Flume 200 is further configured to establish a data transmission channel to Kafka 300 through the first Github system, perform parameter configuration with Kafka 300 as the data receiving end, and pre-process each piece of acquired log data before sending it to Kafka 300 through the established data transmission channel;
Kafka 300 is further configured to, when the received log data reaches a preset amount, pack that amount of log data, store it into Kafka 300's storage area, and convert it into the Kafka message queue.
In this embodiment, Github can host all kinds of Git repositories and provides a web interface; unlike other services such as SourceForge or Google Code, Github's unique feature is how simple it makes forking a branch from another project. Git is a distributed version control system, originally written by Linus Torvalds for managing the Linux kernel code.
Preferably, the first Flume 200 and Kafka 300 transmit log data through a Flume plug-in, flume-kafka (the data transmission channel), which is hosted in the first Github system. flume-kafka supports both pulling log data from Kafka 300 and pushing log data to Kafka 300.
Further, before transmitting log data, the first Flume 200 first defines Kafka 300 as the data source in a code snippet, and presets the pre-processing, configuration, and shutdown of each piece of log data through the process, configure, and stop routines of the plug-in. The specific operation code is as follows:
(The code listing appears only as an image, Figure BDA00003629304900091, in the original publication.)
Preferably, with Kafka 300 as the data source, the code by which the first Flume 200 connects to Kafka 300 is as follows:
(The connection code appears only as images, Figures BDA00003629304900092 and BDA00003629304900101, in the original publication.)
In the above code, props.put sets the attributes of the connection to Kafka 300. Each attribute is described below:
groupid: the name of the connection (the consumer group);
autocommit.enable: automatically informs Kafka 300 which log message has currently been consumed;
autooffset.reset: automatically acquires the latest log messages;
socket.buffersize: the buffer size used for socket communication.
With these attributes defined, the first Flume 200 and Kafka 300 finally establish a connection. Next, we explain how Kafka 300 receives data after the first Flume 200 has connected to it. We define the number of messages in one batch. A so-called batch means that Kafka 300 packs a certain amount of data and sends it to the data destination in one go, rather than forwarding data to Kafka 300's storage area once for every piece of data received from the first Flume 200. Sending in batches saves network transmission overhead.
Similarly, below is the code for connecting to Kafka 300 with Kafka 300 as the data destination:
(The code listing appears only as images, Figures BDA00003629304900102 and BDA00003629304900111, in the original publication.)
Connecting and sending log data in this case differs from the case with Kafka 300 as the data source: with Kafka 300 as the data destination, the batching of log data is controlled by Kafka 300 itself rather than by the first Flume 200. According to batch.size in the above code, Kafka 300 takes a certain number of log messages and sends them as one batch.
In another embodiment, the system for acquiring log data described in this embodiment may further comprise a monitoring unit configured to acquire Kafka 300's operating data through the JMX monitoring interface provided by Kafka 300 after the received log data has been converted into the Kafka message queue.
Compared with the first Flume 200, Kafka 300 has more monitoring data available, which makes it convenient to monitor the health of the whole system.
Please refer to Fig. 5, which is a schematic structural diagram of the second embodiment of the system for acquiring log data according to the present invention.
The system for acquiring log data described in this embodiment differs from the first embodiment in that it further comprises a Storm real-time computation cluster 400 and a Storm system 500; the Storm real-time computation cluster 400 is configured to acquire the needed log data from the Kafka message queue through a data transmission channel established to Kafka 300 via the Storm system 500.
For the Storm system 500, Storm provides a set of general primitives for distributed real-time computation and can be used in "stream processing" to process messages and update databases in real time. Storm is a way of managing queues and worker clusters. Storm can also be used for "continuous computation", running continuous queries over data streams and outputting results to users as streams while the computation runs, and for "distributed RPC", running expensive computations in parallel.
For the Storm real-time computation cluster 400, it can connect to Kafka 300 and acquire log messages using the Java methods of storm-contrib within Storm. Systems other than the Storm real-time computation cluster 400 can also acquire log messages (one log entry per line) from Kafka 300 acting as message-oriented middleware.
At present, Kafka 300 has affinity with many languages: popular languages such as Java, Python, and Ruby all have libraries that support Kafka 300.
With the system for acquiring log data described in this embodiment, the distinctive connection between Storm and Kafka lets users quickly acquire log data from Kafka without repeatedly restarting Kafka.
Please refer to Fig. 6, which is a schematic structural diagram of the third embodiment of the system for acquiring log data according to the present invention.
The system for acquiring log data described in this embodiment differs from the first embodiment in that it further comprises a second Flume 600 configured to acquire the log data the user needs from the Kafka message queue.
For systems other than the Storm real-time computation cluster 400, such as an HDFS cluster or a full-text search cluster, which have no intrinsic way of connecting to Kafka 300, the second Flume 600 can be used to connect to Kafka 300 and acquire log messages from Kafka 300 acting as message-oriented middleware.
In one embodiment, the system for acquiring log data described in this embodiment further comprises a second Github system, wherein:
the second Flume 600 is further configured to establish a data transmission channel to Kafka 300 through the second Github system, perform parameter configuration with Kafka 300 as the data transmitting end, and send a log request to Kafka 300;
Kafka 300 is further configured to, according to the log request, acquire the corresponding log data from the Kafka message queue and send it to the second Flume 600 in batches through the data transmission channel.
The way the second Flume 600 connects to Kafka 300 in this embodiment can use the same connection code as connecting through flume-kafka with Kafka 300 as the data destination in the first embodiment.
With the system for acquiring log data described in this embodiment, the second Flume connects to Kafka and, on behalf of other systems, quickly acquires log messages from Kafka acting as message-oriented middleware.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the claims of the present invention. It should be pointed out that persons of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (10)

1. A method for acquiring log data, characterized by comprising the following steps:
a first Flume acquires log data from an application server;
the first Flume sends the acquired log data to Kafka, and Kafka converts the received log data into a Kafka message queue.
2. The method for acquiring log data according to claim 1, characterized in that the step of the first Flume sending the acquired log data to Kafka comprises the following steps:
the first Flume establishes a data transmission channel to Kafka through a first Github system, and performs parameter configuration with Kafka as the data receiving end;
the first Flume pre-processes each piece of acquired log data and sends it to Kafka through the established data transmission channel;
when the log data received by Kafka reaches a preset amount, the log data is packed, stored into Kafka's storage area, and converted into the Kafka message queue.
3. The method for acquiring log data according to claim 1, characterized in that, after the step of Kafka converting the received log data into a Kafka message queue, the method further comprises the following steps:
a Storm real-time computation cluster establishes a data transmission channel to Kafka through a Storm system;
the Storm real-time computation cluster acquires the needed log data from the Kafka message queue through the established data channel.
4. The method for acquiring log data according to any one of claims 1 to 3, characterized in that, after the step of Kafka converting the received log data into a Kafka message queue, the method further comprises the following step:
a second Flume acquires the log data the user needs from the Kafka message queue.
5. The method for acquiring log data according to claim 4, characterized in that the step of the second Flume acquiring the log data the user needs from the Kafka message queue comprises the following steps:
the second Flume establishes a data transmission channel to Kafka through a second Github system, and performs parameter configuration with Kafka as the data transmitting end;
the second Flume sends a log request to Kafka through the established data transmission channel;
according to the log request, Kafka acquires the corresponding log data from the Kafka message queue and sends it to the second Flume in batches through the data transmission channel.
6. A system for acquiring log data, characterized by comprising an application server, a first Flume, and Kafka, wherein:
the first Flume is configured to acquire log data from the application server and send the acquired log data to Kafka;
Kafka is configured to convert the received log data into a Kafka message queue.
7. The system for acquiring log data according to claim 6, characterized in that it further comprises a first Github system, wherein:
the first Flume is further configured to establish a data transmission channel with the Kafka via the first Github system, perform parameter configuration with the Kafka as the data receiving end, and preprocess each piece of acquired log data before sending it to the Kafka through the established data transmission channel;
the Kafka is further configured to, when the received log data reaches a preset data amount, pack the log data, store it in the storage area of the Kafka, and convert it into the Kafka message queue.
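Claim 7's pack-on-threshold behavior can be sketched as follows. The threshold value, the whitespace-trimming "preliminary treatment", and packing a batch as a tuple are all illustrative assumptions; the patent leaves the preset data amount and the preprocessing unspecified.

```python
PRESET_AMOUNT = 3  # hypothetical threshold; the patent leaves the value open

storage = []   # stand-in for Kafka's storage area
buffer = []    # records received but not yet packed

def preprocess(line):
    """Toy preprocessing step (trim whitespace), as one possible example
    of the 'preliminary treatment' mentioned in the claim."""
    return line.strip()

def receive(line):
    buffer.append(preprocess(line))
    if len(buffer) >= PRESET_AMOUNT:   # preset data amount reached
        storage.append(tuple(buffer))  # pack the batch and store it
        buffer.clear()

for raw in ["  a ", "b", " c", "d "]:
    receive(raw)
```

After the loop, the first three records have been packed and stored as one unit, while the fourth waits in the buffer for the next batch to fill.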
8. The system for acquiring log data according to claim 6, characterized in that it further comprises a Storm real-time computation cluster and a Storm system, wherein the Storm real-time computation cluster is configured to establish a data transmission channel with the Kafka via the Storm system and acquire the required log data from the Kafka message queue through the established data channel.
9. The system for acquiring log data according to any one of claims 6 to 8, characterized in that it further comprises a second Flume configured to acquire the log data required by the user from the Kafka message queue.
10. The system for acquiring log data according to claim 9, characterized in that it further comprises a second Github system, wherein:
the second Flume is further configured to establish a data transmission channel with the Kafka via the second Github system, perform parameter configuration with the Kafka as the data sending end, and send a log request to the Kafka;
the Kafka is further configured to, according to the log request, acquire the corresponding log data from the Kafka message queue and send the corresponding log data to the second Flume in batches through the data transmission channel.
CN2013103404125A 2013-08-06 2013-08-06 Method and system for acquiring log data Pending CN103401934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103404125A CN103401934A (en) 2013-08-06 2013-08-06 Method and system for acquiring log data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013103404125A CN103401934A (en) 2013-08-06 2013-08-06 Method and system for acquiring log data

Publications (1)

Publication Number Publication Date
CN103401934A true CN103401934A (en) 2013-11-20

Family

ID=49565457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103404125A Pending CN103401934A (en) 2013-08-06 2013-08-06 Method and system for acquiring log data

Country Status (1)

Country Link
CN (1) CN103401934A (en)

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036025A (en) * 2014-06-27 2014-09-10 蓝盾信息安全技术有限公司 Distribution-base mass log collection system
CN104657502A (en) * 2015-03-12 2015-05-27 浪潮集团有限公司 System and method for carrying out real-time statistics on mass data based on Hadoop
CN105335406A (en) * 2014-07-30 2016-02-17 阿里巴巴集团控股有限公司 Log data processing method and device
CN105450618A (en) * 2014-09-26 2016-03-30 Tcl集团股份有限公司 Operation method and operation system of big data process through API (Application Programming Interface) server
CN105490854A (en) * 2015-12-11 2016-04-13 传线网络科技(上海)有限公司 Real-time log collection method and system, and application server cluster
CN105589856A (en) * 2014-10-21 2016-05-18 阿里巴巴集团控股有限公司 Log data processing method and log data processing system
CN105630869A (en) * 2015-12-15 2016-06-01 北京奇虎科技有限公司 Voice data storage method and device
CN105653662A (en) * 2015-12-29 2016-06-08 中国建设银行股份有限公司 Flume based data processing method and apparatus
CN105786683A (en) * 2016-03-03 2016-07-20 四川长虹电器股份有限公司 Customized log collecting system and method
CN105868075A (en) * 2016-03-31 2016-08-17 浪潮通信信息系统有限公司 System and method for monitoring and analyzing large amount of logs in real time
CN105933169A (en) * 2016-07-04 2016-09-07 江苏飞搏软件股份有限公司 Efficient, robust and safe large data polymerization system and method
CN105933736A (en) * 2016-04-18 2016-09-07 天脉聚源(北京)传媒科技有限公司 Log processing method and device
WO2017008658A1 (en) * 2015-07-14 2017-01-19 阿里巴巴集团控股有限公司 Storage checking method and system for text data
CN106569936A (en) * 2016-09-26 2017-04-19 深圳盒子支付信息技术有限公司 Method and system for acquiring scrolling log in real time
CN106682071A (en) * 2016-11-17 2017-05-17 安徽华博胜讯信息科技股份有限公司 University library digital resource sharing method based on big data
CN106682119A (en) * 2016-12-08 2017-05-17 杭州销冠网络科技有限公司 System and method for asynchronous data synchronization on basis of http service aspect and log system
CN106709069A (en) * 2017-01-25 2017-05-24 焦点科技股份有限公司 High-reliability big data logging collection and transmission method
CN106775989A (en) * 2016-12-31 2017-05-31 北京神州绿盟信息安全科技股份有限公司 A kind of job control method and device
CN106776231A (en) * 2017-01-09 2017-05-31 武汉斗鱼网络科技有限公司 Android crash logs optimization method and system based on Git
CN107181612A (en) * 2017-05-08 2017-09-19 深圳市众泰兄弟科技发展有限公司 A kind of visual network method for safety monitoring based on big data
CN107508888A (en) * 2017-08-25 2017-12-22 同方(深圳)云计算技术股份有限公司 A kind of car networking service platform
CN107704545A (en) * 2017-11-08 2018-02-16 华东交通大学 Railway distribution net magnanimity information method for stream processing based on Storm Yu Kafka message communicatings
CN107704478A (en) * 2017-01-16 2018-02-16 贵州白山云科技有限公司 A kind of method and system for writing daily record
CN107748756A (en) * 2017-09-20 2018-03-02 努比亚技术有限公司 Collecting method, mobile terminal and readable storage medium storing program for executing
CN107948234A (en) * 2016-10-13 2018-04-20 北京国双科技有限公司 The processing method and processing device of data
CN107979477A (en) * 2016-10-21 2018-05-01 苏宁云商集团股份有限公司 A kind of method and system of business monitoring
CN108092849A (en) * 2017-12-06 2018-05-29 链家网(北京)科技有限公司 Business datum monitoring method, apparatus and system
CN108388478A (en) * 2018-02-07 2018-08-10 平安普惠企业管理有限公司 Daily record data processing method and system
CN108989314A (en) * 2018-07-20 2018-12-11 北京木瓜移动科技股份有限公司 A kind of Transmitting Data Stream, processing method and processing device
CN109446215A (en) * 2018-10-31 2019-03-08 北京百分点信息科技有限公司 A kind of logical engine method of ID drawing real-time priority-based
CN109684370A (en) * 2018-09-07 2019-04-26 平安普惠企业管理有限公司 Daily record data processing method, system, equipment and storage medium
CN109800128A (en) * 2019-01-15 2019-05-24 苏州工品汇软件技术有限公司 Operation log recording collection method based on micro services
CN110262807A (en) * 2019-06-20 2019-09-20 北京百度网讯科技有限公司 Cluster creates Progress Log acquisition system, method and apparatus
CN110460876A (en) * 2019-08-15 2019-11-15 网易(杭州)网络有限公司 Processing method, device and the electronic equipment of log is broadcast live
CN110502491A (en) * 2019-07-25 2019-11-26 北京神州泰岳智能数据技术有限公司 A kind of Log Collect System and its data transmission method, device
CN111371586A (en) * 2018-12-26 2020-07-03 顺丰科技有限公司 Log data transmission method, device and equipment
CN111382022A (en) * 2018-12-27 2020-07-07 北京神州泰岳软件股份有限公司 Method and device for monitoring real-time streaming computing platform, electronic equipment and storage medium
CN113190528A (en) * 2021-04-21 2021-07-30 中国海洋大学 Parallel distributed big data architecture construction method and system
CN113468259A (en) * 2021-09-01 2021-10-01 北京华品博睿网络技术有限公司 Real-time data acquisition and storage method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
EWHAUSER: "flume-kafka-plugin", GitHub *
Flume official website: "Welcome to Apache Flume", Flume official website - Apache Flume 1.4.0 *
TOMNOTCAT: "flume-kafka-sink", GitHub *
Zhang Xin: "Kafka+FlumeNG+Storm+HBase", Baidu Wenku *

Cited By (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104036025A (en) * 2014-06-27 2014-09-10 蓝盾信息安全技术有限公司 Distribution-base mass log collection system
CN105335406A (en) * 2014-07-30 2016-02-17 阿里巴巴集团控股有限公司 Log data processing method and device
CN105335406B (en) * 2014-07-30 2018-10-02 阿里巴巴集团控股有限公司 Daily record data processing method and processing device
CN105450618A (en) * 2014-09-26 2016-03-30 Tcl集团股份有限公司 Operation method and operation system of big data process through API (Application Programming Interface) server
CN105589856A (en) * 2014-10-21 2016-05-18 阿里巴巴集团控股有限公司 Log data processing method and log data processing system
CN105589856B (en) * 2014-10-21 2019-04-26 阿里巴巴集团控股有限公司 Daily record data processing method and system
CN104657502A (en) * 2015-03-12 2015-05-27 浪潮集团有限公司 System and method for carrying out real-time statistics on mass data based on Hadoop
WO2017008658A1 (en) * 2015-07-14 2017-01-19 阿里巴巴集团控股有限公司 Storage checking method and system for text data
CN105490854A (en) * 2015-12-11 2016-04-13 传线网络科技(上海)有限公司 Real-time log collection method and system, and application server cluster
CN105490854B (en) * 2015-12-11 2019-03-12 传线网络科技(上海)有限公司 Real-time logs collection method, system and application server cluster
CN105630869A (en) * 2015-12-15 2016-06-01 北京奇虎科技有限公司 Voice data storage method and device
CN105630869B (en) * 2015-12-15 2019-02-05 北京奇虎科技有限公司 A kind of storage method and device of voice data
CN105653662A (en) * 2015-12-29 2016-06-08 中国建设银行股份有限公司 Flume based data processing method and apparatus
CN105786683A (en) * 2016-03-03 2016-07-20 四川长虹电器股份有限公司 Customized log collecting system and method
CN105786683B (en) * 2016-03-03 2019-02-12 四川长虹电器股份有限公司 Customed result collection system and method
CN105868075A (en) * 2016-03-31 2016-08-17 浪潮通信信息系统有限公司 System and method for monitoring and analyzing large amount of logs in real time
CN105933736A (en) * 2016-04-18 2016-09-07 天脉聚源(北京)传媒科技有限公司 Log processing method and device
CN105933169A (en) * 2016-07-04 2016-09-07 江苏飞搏软件股份有限公司 Efficient, robust and safe large data polymerization system and method
CN106569936B (en) * 2016-09-26 2019-05-03 深圳盒子信息科技有限公司 A kind of real-time acquisition rolls the method and system of log
CN106569936A (en) * 2016-09-26 2017-04-19 深圳盒子支付信息技术有限公司 Method and system for acquiring scrolling log in real time
CN107948234A (en) * 2016-10-13 2018-04-20 北京国双科技有限公司 The processing method and processing device of data
CN107979477A (en) * 2016-10-21 2018-05-01 苏宁云商集团股份有限公司 A kind of method and system of business monitoring
CN106682071A (en) * 2016-11-17 2017-05-17 安徽华博胜讯信息科技股份有限公司 University library digital resource sharing method based on big data
CN106682119A (en) * 2016-12-08 2017-05-17 杭州销冠网络科技有限公司 System and method for asynchronous data synchronization on basis of http service aspect and log system
CN106775989A (en) * 2016-12-31 2017-05-31 北京神州绿盟信息安全科技股份有限公司 A kind of job control method and device
CN106776231A (en) * 2017-01-09 2017-05-31 武汉斗鱼网络科技有限公司 Android crash logs optimization method and system based on Git
CN106776231B (en) * 2017-01-09 2019-11-15 武汉斗鱼网络科技有限公司 Android crash log optimization method and system based on Git
CN107704478B (en) * 2017-01-16 2019-03-15 贵州白山云科技股份有限公司 A kind of method and system that log is written
CN107704478A (en) * 2017-01-16 2018-02-16 贵州白山云科技有限公司 A kind of method and system for writing daily record
WO2018130222A1 (en) * 2017-01-16 2018-07-19 贵州白山云科技有限公司 Method for writing to log, system, medium, and device
CN106709069B (en) * 2017-01-25 2018-06-15 焦点科技股份有限公司 The big data log collection and transmission method of high reliability
CN106709069A (en) * 2017-01-25 2017-05-24 焦点科技股份有限公司 High-reliability big data logging collection and transmission method
CN107181612A (en) * 2017-05-08 2017-09-19 深圳市众泰兄弟科技发展有限公司 A kind of visual network method for safety monitoring based on big data
CN107508888A (en) * 2017-08-25 2017-12-22 同方(深圳)云计算技术股份有限公司 A kind of car networking service platform
CN107748756A (en) * 2017-09-20 2018-03-02 努比亚技术有限公司 Collecting method, mobile terminal and readable storage medium storing program for executing
CN107704545A (en) * 2017-11-08 2018-02-16 华东交通大学 Railway distribution net magnanimity information method for stream processing based on Storm Yu Kafka message communicatings
CN108092849A (en) * 2017-12-06 2018-05-29 链家网(北京)科技有限公司 Business datum monitoring method, apparatus and system
CN108388478A (en) * 2018-02-07 2018-08-10 平安普惠企业管理有限公司 Daily record data processing method and system
CN108388478B (en) * 2018-02-07 2020-10-27 平安普惠企业管理有限公司 Log data processing method and system
CN108989314A (en) * 2018-07-20 2018-12-11 北京木瓜移动科技股份有限公司 A kind of Transmitting Data Stream, processing method and processing device
CN109684370A (en) * 2018-09-07 2019-04-26 平安普惠企业管理有限公司 Daily record data processing method, system, equipment and storage medium
CN109446215B (en) * 2018-10-31 2022-04-12 北京百分点科技集团股份有限公司 Real-time ID pull-through engine method based on priority
CN109446215A (en) * 2018-10-31 2019-03-08 北京百分点信息科技有限公司 A kind of logical engine method of ID drawing real-time priority-based
CN111371586A (en) * 2018-12-26 2020-07-03 顺丰科技有限公司 Log data transmission method, device and equipment
CN111371586B (en) * 2018-12-26 2023-01-10 顺丰科技有限公司 Log data transmission method, device and equipment
CN111382022A (en) * 2018-12-27 2020-07-07 北京神州泰岳软件股份有限公司 Method and device for monitoring real-time streaming computing platform, electronic equipment and storage medium
CN111382022B (en) * 2018-12-27 2024-02-20 北京神州泰岳软件股份有限公司 Method, device, electronic equipment and storage medium for monitoring real-time stream computing platform
CN109800128A (en) * 2019-01-15 2019-05-24 苏州工品汇软件技术有限公司 Operation log recording collection method based on micro services
CN110262807A (en) * 2019-06-20 2019-09-20 北京百度网讯科技有限公司 Cluster creates Progress Log acquisition system, method and apparatus
CN110262807B (en) * 2019-06-20 2023-12-26 北京百度网讯科技有限公司 Cluster creation progress log acquisition system, method and device
CN110502491A (en) * 2019-07-25 2019-11-26 北京神州泰岳智能数据技术有限公司 A kind of Log Collect System and its data transmission method, device
CN110460876A (en) * 2019-08-15 2019-11-15 网易(杭州)网络有限公司 Processing method, device and the electronic equipment of log is broadcast live
CN113190528A (en) * 2021-04-21 2021-07-30 中国海洋大学 Parallel distributed big data architecture construction method and system
CN113190528B (en) * 2021-04-21 2022-12-06 中国海洋大学 Parallel distributed big data architecture construction method and system
CN113468259A (en) * 2021-09-01 2021-10-01 北京华品博睿网络技术有限公司 Real-time data acquisition and storage method and system

Similar Documents

Publication Publication Date Title
CN103401934A (en) Method and system for acquiring log data
CN110262807B (en) Cluster creation progress log acquisition system, method and device
CN110147398A (en) A kind of data processing method, device, medium and electronic equipment
CN111475575B (en) Data synchronization method and device based on block chain and computer readable storage medium
CN106940677A (en) One kind application daily record data alarm method and device
CN103064731A (en) Device and method for improving message queue system performance
CN105577772B (en) Material receiving method, material uploading method and device
CN104899274B (en) A kind of memory database Efficient Remote access method
CN108270860A (en) The acquisition system and method for environmental quality online monitoring data
US11188443B2 (en) Method, apparatus and system for processing log data
CN108062368B (en) Full data translation method, device, server and storage medium
CN108228363A (en) A kind of message method and device
CN113779094B (en) Batch-flow-integration-based data processing method and device, computer equipment and medium
CN106383764A (en) Data acquisition method and device
CN111813573A (en) Communication method of management platform and robot software and related equipment thereof
CN113329139B (en) Video stream processing method, device and computer readable storage medium
TW201733312A (en) Automatic fusing-based message sending method, device and system
CN101977361A (en) Method for preprocessing messages in batch
CN112527530A (en) Message processing method, device, equipment, storage medium and computer program product
CN110737655B (en) Method and device for reporting data
CN112702430A (en) Data transmission system and method based on cloud edge mode and Web technology and application thereof
CN109670952B (en) Collecting and paying transaction platform
CN107196818A (en) A kind of system and method for Linux cluster monitorings
CN110647575B (en) Distributed heterogeneous processing framework construction method and system
CN111401819B (en) Intersystem data pushing method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20131120

RJ01 Rejection of invention patent application after publication