CN106021580B - Method and system for analyzing cluster logs of Impala based on Hadoop - Google Patents
- Publication number
- CN106021580B (application CN201610385810.2A)
- Authority
- CN
- China
- Prior art keywords
- log
- impala
- query
- setting
- hadoop
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/1824—Distributed file systems implemented using Network-attached Storage [NAS] architecture
- G06F16/183—Provision of network file services by network file servers, e.g. by using NFS, CIFS
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1805—Append-only file systems, e.g. using logs or journals to store data
- G06F16/1815—Journaling file systems
Abstract
The invention discloses a Hadoop-based Impala cluster log analysis method and system. The method comprises the steps of: configuring the web server to create a new directory every day, under which the log files produced by the application service system are generated; setting the system timer CRON to import the previous day's log files into HDFS in Hadoop at a scheduled time and to load the log file data into Hive; after the Hive data is loaded, setting the system timer CRON again to refresh the Hive metadata on schedule, start the Impala query program, extract the Hive metadata, and compute statistical indicators; and after the computation is complete, setting the system timer CRON again to export the statistical indicator data from HDFS to a database on schedule, which facilitates later queries. The method achieves the advantage of improved data-processing efficiency.
Description
Technical Field
The invention relates to the field of internet, in particular to an Impala cluster log analysis method and system based on Hadoop.
Background
With the popularity of the internet, the web has become the largest information system in today's highly informatized society. Web logs record a large amount of user-access information and contain the most important information a website holds; log analysis reveals the site's traffic volume, which pages attract the most visitors, which pages are the most valuable, and so on. Generally, a medium-sized website (more than 100,000 page views per day) generates over 1 GB of web log files every day, and large or very large websites may generate 10 GB of data per hour.
However, Hadoop's Map/Reduce programming model is relatively low-level: developers must write client programs that tend to be difficult to maintain and reuse, and running Map/Reduce jobs is inefficient.
Disclosure of Invention
The invention aims to provide a Hadoop-based Impala cluster log analysis method and system that achieve the advantage of improved data-processing efficiency.
To achieve this purpose, the invention adopts the following technical scheme:
a Hadoop-based Impala cluster log analysis method comprises the following steps:
configuring a web server to create a new directory every day, the log files produced by the application service system being generated under that directory;
setting a system timer CRON to import the previous day's log files into HDFS in Hadoop at a scheduled time and to load the log file data into Hive;
after the Hive data is loaded, setting the system timer CRON again to refresh the Hive metadata on schedule, start the Impala query program, extract the Hive metadata, and compute statistical indicators;
and after the computation is complete, setting the system timer CRON again to export the statistical indicator data from HDFS to a database on schedule, facilitating later queries.
Preferably, in the step of configuring the web server to create a new directory every day and generating the log files of the application service system under that directory, each log file has a size of 64 MB.
Preferably, the system timer CRON is set to fire at 0:00 at night, importing the previous day's log files into HDFS in Hadoop on schedule.
Meanwhile, the technical scheme of the invention also discloses a Hadoop-based Impala cluster log analysis system, which comprises a log collection module, a storage processing module, a query analysis module and a result display module;
the log collection module: transmits the logs from the front-end web servers to a log-receiving node, the receiving node importing the logs transmitted by the web servers into Hive through a background script;
the storage processing module: stores the log file data, and loads and maps the log files into Hive data tables;
the query analysis module: receives Impala query requests sent by users, thereby providing a query analysis function, and returns query results to the result display module;
the result display module: is responsible for submitting the user's query request to Impala and presenting the query result returned by Impala for the user to view.
Preferably, log transmission in the log collection module uses scheduled rsync transfers.
Preferably, the result display module presents query results in forms including charts or tables.
Preferably, in the storage processing module, Hadoop's HDFS is used to store the data.
The technical scheme of the invention has the following beneficial effects:
according to the technical scheme, the mass log data stored on Hdfs are analyzed in real time on line, the PV value (pageView, page access amount) and the independent IP number of the website are obtained, the keyword ranking list searched by the user, the page with the highest user retention time and the like can be calculated, and the corresponding analysis result is obtained through query in an Impala SQL (structured query language) statement mode. The purpose of improving the data processing efficiency is achieved.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a schematic block diagram of an Impala cluster log analysis method based on Hadoop according to an embodiment of the present invention;
fig. 2 is a schematic block diagram of an Impala cluster log analysis system based on Hadoop according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention are described below in conjunction with the accompanying drawings; it should be understood that they are presented for illustration and explanation only, not limitation.
A Hadoop-based Impala cluster log analysis method comprises the following steps:
configuring a web server to create a new directory every day, the log files produced by the application service system being generated under that directory;
setting a system timer CRON to import the previous day's log files into HDFS in Hadoop at a scheduled time and to load the log file data into Hive;
after the Hive data is loaded, setting the system timer CRON again to refresh the Hive metadata on schedule, start the Impala query program, extract the Hive metadata, and compute statistical indicators;
and after the computation is complete, setting the system timer CRON again to export the statistical indicator data from HDFS to a database on schedule, facilitating later queries.
As shown in FIG. 1, the left side is the Application service system, and the right side is HDFS, YARN, Hive and Impala of Hadoop.
1. Logs are generated by the business system. The web server is configured to create a new directory every day, under which several log files are generated, each 64 MB in size.
2. A system timer CRON is set; at 0:00 at night, yesterday's log files are imported into HDFS and the data is loaded into Hive.
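The nightly import step can be sketched as follows. This is a minimal illustration only: the local and HDFS directory layouts, the table name `weblogs`, and the partition column `dt` are assumptions, since the patent does not specify them.

```python
from datetime import date, timedelta

# Hypothetical layout: the web server writes each day's logs under
# /data/weblogs/YYYY-MM-DD/ (directory names are assumptions).
LOCAL_ROOT = "/data/weblogs"
HDFS_ROOT = "/user/hive/warehouse/weblogs"

def yesterday_dir(today: date) -> str:
    """Name of the directory the previous day's logs were written to."""
    return (today - timedelta(days=1)).isoformat()

def import_commands(today: date) -> list[list[str]]:
    """Commands a nightly CRON job could run: copy yesterday's logs to HDFS,
    then load them into a Hive partition."""
    day = yesterday_dir(today)
    put = ["hdfs", "dfs", "-put", f"{LOCAL_ROOT}/{day}", f"{HDFS_ROOT}/{day}"]
    load = ["hive", "-e",
            f"LOAD DATA INPATH '{HDFS_ROOT}/{day}' "
            f"INTO TABLE weblogs PARTITION (dt='{day}')"]
    return [put, load]

if __name__ == "__main__":
    for cmd in import_commands(date(2016, 6, 3)):
        print(" ".join(cmd))
```

A crontab entry such as `0 0 * * *` would invoke a script running these commands, matching the 0:00 schedule described above.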
3. After loading completes, a system timer is set to refresh the Hive metadata, start the Impala query program, and extract and compute the statistical indicators.
4. After computation completes, a system timer is set to export the statistical indicator data from HDFS to the database, facilitating later queries.
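The patent does not name the export tool for step 4; Sqoop is a common choice for moving data from HDFS to a relational database, so the sketch below builds a hypothetical nightly Sqoop export command (the host, database, and table names are assumptions):

```python
def export_command(day: str) -> list[str]:
    """Command a CRON job could run to push one day's statistics to MySQL."""
    return [
        "sqoop", "export",
        "--connect", "jdbc:mysql://dbhost/logstats",   # hypothetical target DB
        "--table", "daily_indicators",                 # hypothetical table
        "--export-dir", f"/user/hive/warehouse/stats/dt={day}",
        "--input-fields-terminated-by", r"\001",       # Hive's default delimiter
    ]

if __name__ == "__main__":
    print(" ".join(export_command("2016-06-02")))
```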
As shown in fig. 2, a Hadoop-based Impala cluster log analysis system includes a log collection module, a storage processing module, a query analysis module and a result display module;
the log collection module: transmits the logs from the front-end web servers to a log-receiving node, the receiving node importing the logs transmitted by the web servers into Hive through a background script;
the storage processing module: stores the log file data, and loads and maps the log files into Hive data tables;
the query analysis module: receives Impala query requests sent by users, thereby providing a query analysis function, and returns query results to the result display module;
the result display module: is responsible for submitting the user's query request to Impala and presenting the query result returned by Impala for the user to view.
Specifically, the log collection module:
is responsible for transmitting the logs from each front-end web server to a log-receiving node. Log transmission uses scheduled rsync transfers: the logs on each server are sent to the receiving node at a fixed time every day. The receiving node then imports the logs transmitted by each server into Hive through a background script.
The storage processing module:
Hadoop's HDFS is used to store the actual data; specifically, the module executes the map-reduce tasks submitted by Hive, loading and mapping the log files on HDFS into Hive data tables.
The query analysis module:
and the query analysis module and the storage processing module are deployed and completed in a cluster system. In an actual architecture, Hive is deployed on a NameNode, that is, a master node in a Hadoop cluster, and functionally divides the NameNode, that is, the master node into two modules for description.
The query module mainly accomplishes two functions: firstly, structuring the log data in a log acquisition module into a database, and mapping the log data of each website into a Hive database table; secondly, receiving an Impala query request sent by a user to provide a large-scale query analysis function, and returning a query result to the result output module.
The query processing process of Impala: receiving Impalad connected to the client, namely, serving as the Coordinator of the query, the Coordinator analyzes the query SQL of the user to generate an execution plan tree, and different operations correspond to different plannodes, for example: the method comprises the steps of selecting a node, ScanNode, SortNode and the like, wherein each atomic operation of an execution Plan tree is represented by a Plan Fragment, usually, one query statement is composed of a plurality of Plan fragments, Plan Fragment 0 represents the root of the execution tree, an aggregation result is returned to a user, and leaf nodes of the execution tree are generally scan operations and are executed in a distributed mode.
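The coordinator pattern described above can be modeled in miniature — this is a toy sketch, not Impala's actual implementation: leaf fragments scan partitions in parallel (as scan nodes do on data nodes), and a root fragment playing the role of Plan Fragment 0 merges their partial results:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_fragment(partition: list[dict]) -> int:
    """Leaf fragment: scan one partition and count successful requests
    (a partial PV count)."""
    return sum(1 for row in partition if row.get("status") == 200)

def root_fragment(partials: list[int]) -> int:
    """Plan Fragment 0: aggregate the partial results from the leaves."""
    return sum(partials)

def run_query(partitions: list[list[dict]]) -> int:
    """Coordinator: fan the scan out over partitions, then aggregate."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(scan_fragment, partitions))
    return root_fragment(partials)

if __name__ == "__main__":
    parts = [
        [{"status": 200}, {"status": 404}],  # partition on data node 1
        [{"status": 200}],                   # partition on data node 2
    ]
    print(run_query(parts))  # 2
```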
The result display module:
is responsible for submitting the user's query request to Impala and presenting the query result returned by Impala in some form for the user to view. The user enters or selects the content to query in the browser client provided by the system; the back end passes the query request to the query analysis module through an interface provided by Impala. The task is ultimately handled by Impala's query processing module, after which the query analysis module returns the result to the result display module. The returned results are presented in various forms, such as charts and tables, for the user to view.
In a web log, each record usually represents one access by a user; for example, the following is an nginx log entry:
22.68.172.10 - - [18/Sep/2014:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
It breaks down as follows:
the client IP address: 22.68.172.10;
the client user name: - (not recorded);
the access time and time zone: [18/Sep/2014:06:49:57 +0000];
the requested URL and HTTP protocol: "GET /images/my.jpg HTTP/1.1";
the request status, where 200 means success: 200;
the size of the response body sent to the client: 19939;
the referrer, i.e. the page link the request came from: http://www.angularjs.cn/A00n;
the client browser information: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36".
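The field breakdown above can be captured with a small parser for this log format — a sketch whose field names follow the breakdown, tested against the sample line:

```python
import re

# One named group per field described above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line: str) -> dict:
    """Split one nginx log line into its named fields."""
    m = LOG_PATTERN.match(line)
    if m is None:
        raise ValueError(f"unparseable log line: {line!r}")
    return m.groupdict()

if __name__ == "__main__":
    sample = ('22.68.172.10 - - [18/Sep/2014:06:49:57 +0000] '
              '"GET /images/my.jpg HTTP/1.1" 200 19939 '
              '"http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT6.1) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 '
              'Safari/537.36"')
    rec = parse_line(sample)
    print(rec["ip"], rec["status"], rec["size"])  # 22.68.172.10 200 19939
```

Records parsed this way supply exactly the columns a Hive log table would expose for Impala queries.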
In summary, the technical scheme of the invention also has the following characteristics:
1. Yesterday's log files are imported from the web server to HDFS on a daily schedule.
2. The log files on HDFS are loaded into Hive.
3. An Impala query statement is dispatched by Oozie or a system timer, and the statistical indicators are extracted and computed.
Finally, it should be noted that although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that modifications may be made to the embodiments, or equivalents substituted for some of their features, without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (2)
1. A Hadoop-based Impala cluster log analysis method, characterized by comprising the following steps:
configuring a web server to create a new directory every day, the log files produced by the application service system being generated under that directory, wherein each log file has a size of 64 MB;
setting a system timer CRON to import the previous day's log files into HDFS in Hadoop at a scheduled time, wherein the system timer CRON is set to fire at 0:00 at night, and loading the log file data into Hive;
after the Hive data is loaded, setting the system timer CRON again to refresh the Hive metadata on schedule, start the Impala query program, extract the Hive metadata, and compute statistical indicators;
and after the computation is complete, setting the system timer CRON again to export the statistical indicator data from HDFS to a database on schedule, facilitating later queries.
2. A Hadoop-based Impala cluster log analysis system, characterized by comprising a log collection module, a storage processing module, a query analysis module and a result display module;
the log collection module: transmits the logs from the front-end web servers to a log-receiving node, the receiving node importing the logs transmitted by the web servers into Hive through a background script, wherein log transmission in the log collection module uses scheduled rsync transfers;
the storage processing module: stores the log file data, and loads and maps the log files into Hive data tables, wherein in the storage processing module, Hadoop's HDFS is used to store the data;
the query analysis module: receives Impala query requests sent by users, thereby providing a query analysis function, and returns query results to the result display module;
the result display module: is responsible for submitting the user's query request to Impala and presenting the query result returned by Impala for the user to view, the form of presentation including charts or tables.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610385810.2A CN106021580B (en) | 2016-06-03 | 2016-06-03 | Method and system for analyzing cluster logs of Impala based on Hadoop |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610385810.2A CN106021580B (en) | 2016-06-03 | 2016-06-03 | Method and system for analyzing cluster logs of Impala based on Hadoop |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106021580A CN106021580A (en) | 2016-10-12 |
CN106021580B true CN106021580B (en) | 2019-12-20 |
Family
ID=57090484
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610385810.2A Active CN106021580B (en) | 2016-06-03 | 2016-06-03 | Method and system for analyzing cluster logs of Impala based on Hadoop |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021580B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106547883B (en) * | 2016-11-03 | 2021-02-19 | 北京集奥聚合科技有限公司 | Method and system for processing User Defined Function (UDF) running condition |
CN108536810A (en) * | 2018-03-30 | 2018-09-14 | 四川斐讯信息技术有限公司 | Data visualization methods of exhibiting and system |
CN110362456A (en) * | 2018-04-10 | 2019-10-22 | 挖财网络技术有限公司 | A kind of method and device obtaining server-side performance data |
CN111352963A (en) * | 2018-12-24 | 2020-06-30 | 北京奇虎科技有限公司 | Data statistical method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103955502A (en) * | 2014-04-24 | 2014-07-30 | 科技谷(厦门)信息技术有限公司 | Visualized on-line analytical processing (OLAP) application realizing method and system |
CN105528367A (en) * | 2014-09-30 | 2016-04-27 | 华东师范大学 | A method for storage and near-real time query of time-sensitive data based on open source big data |
Non-Patent Citations (1)
Title |
---|
Collision comparison analysis based on HDFS and IMPALA; Wang Yan et al.; Video Engineering; 2015-12-31; pp. 94-98 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |