CN106021580B - Method and system for analyzing cluster logs of Impala based on Hadoop - Google Patents


Info

Publication number
CN106021580B
CN106021580B
Authority
CN
China
Prior art keywords
log
impala
query
setting
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610385810.2A
Other languages
Chinese (zh)
Other versions
CN106021580A (en)
Inventor
肖松林 (Xiao Songlin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yonyou Network Technology Co Ltd
Original Assignee
Yonyou Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yonyou Network Technology Co Ltd filed Critical Yonyou Network Technology Co Ltd
Priority to CN201610385810.2A priority Critical patent/CN106021580B/en
Publication of CN106021580A publication Critical patent/CN106021580A/en
Application granted granted Critical
Publication of CN106021580B publication Critical patent/CN106021580B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • G06F16/1824Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/183Provision of network file services by network file servers, e.g. by using NFS, CIFS
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1805Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815Journaling file systems

Abstract

The invention discloses a Hadoop-based Impala cluster log analysis method and system. The method comprises the steps of: configuring the web server to create a new directory every day, under which the application service system writes a number of log files; setting a CRON system timer that periodically imports the previous day's log files into HDFS in Hadoop and loads the log file data into Hive; after the Hive data is loaded, setting a CRON system timer again to periodically refresh the Hive metadata, start the Impala query program, read the Hive metadata, and compute statistical indicators; and after the computation is complete, setting a CRON system timer once more to periodically export the statistical indicator data from HDFS to the database, facilitating later queries. The scheme achieves the advantage of improved data processing efficiency.

Description

Method and system for analyzing cluster logs of Impala based on Hadoop
Technical Field
The invention relates to the field of internet, in particular to an Impala cluster log analysis method and system based on Hadoop.
Background
The popularity of the internet has made the Web the largest information system in today's highly informationized society. Web logs record a large amount of information about user visits; they contain the most important information a website produces, and analyzing them reveals the site's traffic volume, which pages are visited most, which pages are most valuable, and so on. Generally, a medium-sized website (above 100,000 PV per day) generates more than 1 GB of Web log files every day; a large or very large website may generate 10 GB of data per hour.
However, Hadoop's Map/Reduce programming model is relatively low-level: developers must write their own client programs, which tend to be difficult to maintain and reuse, and running Map/Reduce jobs is inefficient.
Disclosure of Invention
The invention aims to provide a Hadoop-based Impala cluster log analysis method and system that achieve the advantage of improved data processing efficiency.
In order to achieve this purpose, the invention adopts the following technical scheme:
A Hadoop-based Impala cluster log analysis method comprises the following steps:
configuring the web server to create a new directory every day, under which the application service system writes a number of log files;
setting a CRON system timer that periodically imports the previous day's log files into HDFS in Hadoop and loads the log file data into Hive;
after the Hive data is loaded, setting a CRON system timer again to periodically refresh the Hive metadata, start the Impala query program, read the Hive metadata, and compute statistical indicators;
after the computation is complete, setting a CRON system timer once more to periodically export the statistical indicator data from HDFS to the database, facilitating later queries.
Preferably, in the step of configuring the web server to create a new directory every day under which the application service system writes a number of log files, each log file has a size of 64M.
Preferably, the CRON system timer is set to fire at midnight (0:00), importing the previous day's log files into HDFS in Hadoop.
Meanwhile, the technical scheme of the invention also discloses a Hadoop-based Impala cluster log analysis system, which comprises a log collection module, a storage processing module, a query analysis module and a result display module.
The log collection module transmits the logs from each front-end web server to a log receiving node; the receiving node imports the logs into Hive through a background script.
The storage processing module stores the log file data, loading each log file and mapping it into a Hive data table.
The query analysis module receives Impala query requests sent by users, provides the query analysis function, and returns query results to the result display module.
The result display module is responsible for submitting the user's query request to Impala and presenting the query result returned by Impala for the user to view.
Preferably, logs are transmitted in the log collection module by scheduled rsync transfers.
Preferably, the result display module presents query results as charts or tables.
Preferably, the storage processing module uses Hadoop's HDFS to store the data.
The technical scheme of the invention has the following beneficial effects:
according to the technical scheme, the mass log data stored on Hdfs are analyzed in real time on line, the PV value (pageView, page access amount) and the independent IP number of the website are obtained, the keyword ranking list searched by the user, the page with the highest user retention time and the like can be calculated, and the corresponding analysis result is obtained through query in an Impala SQL (structured query language) statement mode. The purpose of improving the data processing efficiency is achieved.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a schematic block diagram of an Impala cluster log analysis method based on Hadoop according to an embodiment of the present invention;
fig. 2 is a schematic block diagram of an Impala cluster log analysis system based on Hadoop according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention are described below in conjunction with the accompanying drawings; it should be understood that they are presented here for illustration and explanation only, and do not limit the invention.
A Hadoop-based Impala cluster log analysis method comprises the following steps:
configuring the web server to create a new directory every day, under which the application service system writes a number of log files;
setting a CRON system timer that periodically imports the previous day's log files into HDFS in Hadoop and loads the log file data into Hive;
after the Hive data is loaded, setting a CRON system timer again to periodically refresh the Hive metadata, start the Impala query program, read the Hive metadata, and compute statistical indicators;
after the computation is complete, setting a CRON system timer once more to periodically export the statistical indicator data from HDFS to the database, facilitating later queries.
As shown in FIG. 1, the left side is the application service system and the right side is Hadoop's HDFS, YARN, Hive and Impala.
1. The business system generates the logs: the web server is configured to create a new directory every day, under which a number of log files are written, each 64M in size.
2. A CRON system timer is set; at midnight (0:00) it imports yesterday's log files into HDFS and loads the data into Hive.
3. After loading completes, a system timer is set to refresh the Hive metadata and start the Impala query program, which reads the data and computes the statistical indicators.
4. After the computation finishes, a system timer is set to export the statistical indicator data from HDFS to the database for convenient later querying.
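The four timed steps above map naturally onto CRON entries. A hypothetical crontab sketch follows — the script paths, directory layout, and table names are assumptions for illustration, not from the patent; `hdfs dfs -put`, `hive -e`, `impala-shell -q`, and `sqoop export` are the standard command-line entry points of the respective tools (note that `%` must be escaped as `\%` inside a crontab):

```
# Hypothetical daily pipeline (all paths and names assumed, not from the patent)
# 00:10 - import yesterday's log directory from the web server into HDFS
10 0 * * * hdfs dfs -put /data/weblogs/$(date -d yesterday +\%Y\%m\%d) /logs/raw/

# 00:30 - load the imported files into the Hive log table
30 0 * * * hive -e "LOAD DATA INPATH '/logs/raw/...' INTO TABLE weblogs"

# 01:00 - refresh Impala's view of the Hive metadata, then compute the indicators
0 1 * * * impala-shell -q "REFRESH weblogs; INSERT INTO stats SELECT ..."

# 02:00 - export the computed indicators from HDFS to the relational database
0 2 * * * sqoop export --table stats ...
```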
As shown in fig. 2, a Hadoop-based Impala cluster log analysis system includes a log collection module, a storage processing module, a query analysis module and a result display module.
Log collection module: transmits the logs from each front-end web server to a log receiving node; the receiving node imports the logs into Hive through a background script.
Storage processing module: stores the log file data, loading each log file and mapping it into a Hive data table.
Query analysis module: receives Impala query requests sent by users, provides the query analysis function, and returns query results to the result display module.
Result display module: responsible for submitting the user's query request to Impala and presenting the query result returned by Impala for the user to view.
In more detail:
Log collection module: responsible for transmitting the logs from each front-end web server to a log receiving node. Transmission uses scheduled rsync transfers, pushing the logs from the servers to the receiving node once a day; the receiving node then imports the logs from each server into Hive through a background script.
Storage processing module:
Hadoop's HDFS stores the actual data. The module executes the map-reduce tasks submitted by Hive, loading the log files on HDFS and mapping them into Hive data tables.
Query analysis module:
The query analysis module and the storage processing module are deployed in the same cluster system. In the actual architecture, Hive is deployed on the NameNode, i.e. the master node of the Hadoop cluster; splitting it into two modules here is purely a functional description.
The query module accomplishes two functions: first, structuring the log data from the log collection module into a database, mapping each website's log data into a Hive database table; second, receiving Impala query requests sent by users to provide large-scale query analysis, and returning the query results to the result display module.
Impala's query processing works as follows: the Impalad instance the client connects to acts as the Coordinator for the query. The Coordinator parses the user's SQL and generates an execution plan tree in which different operations correspond to different PlanNodes, for example SelectNode, ScanNode and SortNode. Each atomic operation of the execution plan tree is represented by a Plan Fragment, and a query statement is usually composed of multiple Plan Fragments: Plan Fragment 0 is the root of the execution tree and returns the aggregated result to the user, while the leaf nodes are generally scan operations, executed in a distributed manner.
Result display module:
Responsible for submitting the user's query request to Impala and presenting the query result returned by Impala in some form for the user to view. The user enters or selects the content to query in the browser client; the backend passes the query request to the query analysis module through the interface Impala provides; the task is ultimately handled by Impala's query processing, and the result is returned by the query analysis module to the result display module. Returned results are rendered in various forms such as charts and tables, which the user can view.
In a Web log, each line usually represents one user access. For example, the following is an nginx log line:
22.68.172.10 - - [18/Sep/2014:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
It breaks down as follows:
the client IP address: 22.68.172.10;
the client user name: -;
the access time and time zone: [18/Sep/2014:06:49:57 +0000];
the requested URL and HTTP protocol: "GET /images/my.jpg HTTP/1.1";
the request status, where 200 means success: 200;
the size of the body content sent to the client: 19939;
the referrer, i.e. the page from which the request was linked: http://www.angularjs.cn/A00n;
the client browser information: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36".
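A minimal sketch of turning such a line into structured fields, as the Hive/Impala layer must do before any statistics can be computed — the regex and field names here are illustrative choices, not part of the patent:

```python
import re

# Regex for the nginx "combined" log format shown above. Field names are
# descriptive choices for this sketch only.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Split one access-log line into its fields; return None if malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

sample = ('22.68.172.10 - - [18/Sep/2014:06:49:57 +0000] '
          '"GET /images/my.jpg HTTP/1.1" 200 19939 '
          '"http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) '
          'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 '
          'Safari/537.36"')

fields = parse_line(sample)
print(fields["ip"], fields["status"], fields["referrer"])
```

Once parsed this way, each field maps onto a column of the Hive log table that Impala queries.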
In summary, the technical scheme of the invention has the following characteristics:
1. Yesterday's log files are imported from the web server into HDFS on a daily schedule.
2. The log files on HDFS are loaded into Hive.
3. Impala query statements are dispatched, either with Oozie or with a system timer, to read the data and compute the statistical indicators.
Finally, it should be noted that although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that modifications may be made to the embodiments, or equivalents substituted for some of their features, without departing from the spirit and scope of the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (2)

1. A Hadoop-based Impala cluster log analysis method, characterized by comprising the following steps:
configuring the web server to create a new directory every day, under which the application service system writes a number of log files, wherein each log file has a size of 64M;
setting a CRON system timer that imports the previous day's log files into HDFS in Hadoop and loads the log file data into Hive, wherein the CRON system timer is set to fire at midnight (0:00);
after the Hive data is loaded, setting a CRON system timer again to periodically refresh the Hive metadata, start the Impala query program, read the Hive metadata, and compute statistical indicators;
after the computation is complete, setting a CRON system timer once more to periodically export the statistical indicator data from HDFS to the database, facilitating later queries.
2. A Hadoop-based Impala cluster log analysis system, characterized by comprising a log collection module, a storage processing module, a query analysis module and a result display module, wherein:
the log collection module transmits the logs from each front-end web server to a log receiving node, and the receiving node imports the logs into Hive through a background script, the logs being transmitted by scheduled rsync transfers;
the storage processing module stores the log file data, loading each log file and mapping it into a Hive data table, the data being stored on Hadoop's HDFS;
the query analysis module receives Impala query requests sent by users, provides the query analysis function, and returns query results to the result display module;
the result display module is responsible for submitting the user's query request to Impala and presenting the query result returned by Impala, as a chart or table, for the user to view.
CN201610385810.2A 2016-06-03 2016-06-03 Method and system for analyzing cluster logs of Impala based on Hadoop Active CN106021580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610385810.2A CN106021580B (en) 2016-06-03 2016-06-03 Method and system for analyzing cluster logs of Impala based on Hadoop

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610385810.2A CN106021580B (en) 2016-06-03 2016-06-03 Method and system for analyzing cluster logs of Impala based on Hadoop

Publications (2)

Publication Number Publication Date
CN106021580A CN106021580A (en) 2016-10-12
CN106021580B true CN106021580B (en) 2019-12-20

Family

ID=57090484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610385810.2A Active CN106021580B (en) 2016-06-03 2016-06-03 Method and system for analyzing cluster logs of Impala based on Hadoop

Country Status (1)

Country Link
CN (1) CN106021580B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106547883B (en) * 2016-11-03 2021-02-19 北京集奥聚合科技有限公司 Method and system for processing User Defined Function (UDF) running condition
CN108536810A (en) * 2018-03-30 2018-09-14 四川斐讯信息技术有限公司 Data visualization methods of exhibiting and system
CN110362456A (en) * 2018-04-10 2019-10-22 挖财网络技术有限公司 A kind of method and device obtaining server-side performance data
CN111352963A (en) * 2018-12-24 2020-06-30 北京奇虎科技有限公司 Data statistical method and device

Citations (2)

Publication number Priority date Publication date Assignee Title
CN103955502A (en) * 2014-04-24 2014-07-30 科技谷(厦门)信息技术有限公司 Visualized on-line analytical processing (OLAP) application realizing method and system
CN105528367A (en) * 2014-09-30 2016-04-27 华东师范大学 A method for storage and near-real time query of time-sensitive data based on open source big data

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN103955502A (en) * 2014-04-24 2014-07-30 科技谷(厦门)信息技术有限公司 Visualized on-line analytical processing (OLAP) application realizing method and system
CN105528367A (en) * 2014-09-30 2016-04-27 华东师范大学 A method for storage and near-real time query of time-sensitive data based on open source big data

Non-Patent Citations (1)

Title
Collision comparison analysis based on HDFS and IMPALA (基于HDFS和IMPALA的碰撞比对分析); Wang Yan et al.; 《视频应用于工程》; 2015-12-31; pp. 94-98 *

Also Published As

Publication number Publication date
CN106021580A (en) 2016-10-12

Similar Documents

Publication Publication Date Title
CN109582660B (en) Data blood margin analysis method, device, equipment, system and readable storage medium
US8856183B2 (en) Database access using partitioned data areas
CN106021580B (en) Method and system for analyzing cluster logs of Impala based on Hadoop
CN102880685B (en) Method for interval and paging query of time-intensive B/S (Browser/Server) with large data size
US8713424B1 (en) Asynchronous loading of scripts in web pages
CN106126648B (en) It is a kind of based on the distributed merchandise news crawler method redo log
CN106682147A (en) Mass data based query method and device
AU2014299245B1 (en) Improvements in website traffic optimization
US20130232157A1 (en) Systems and methods for processing unstructured numerical data
CN110704411A (en) Knowledge graph building method and device suitable for art field and electronic equipment
JP2016519810A (en) Scalable analysis platform for semi-structured data
WO2017170459A1 (en) Method, program, and system for automatic discovery of relationship between fields in environment where different types of data sources coexist
CN111046041B (en) Data processing method and device, storage medium and processor
CN111522905A (en) Document searching method and device based on database
US20200226130A1 (en) Vertical union of feature-based datasets
JPWO2017170459A6 (en) Method, program, and system for automatic discovery of relationships between fields in a heterogeneous data source mixed environment
US9390131B1 (en) Executing queries subject to different consistency requirements
CN106959995A (en) Compatible two-way automatic web page contents acquisition method
CN113779349A (en) Data retrieval system, apparatus, electronic device, and readable storage medium
CN113515564A (en) Data access method, device, equipment and storage medium based on J2EE
CN110955855A (en) Information interception method, device and terminal
CN114443599A (en) Data synchronization method and device, electronic equipment and storage medium
CN104216901B (en) The method and system of information search
CN117076742A (en) Data blood edge tracking method and device and electronic equipment
US20210089527A1 (en) Incremental addition of data to partitions in database tables

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant