CN106021580B - Method and system for analyzing cluster logs of Impala based on Hadoop - Google Patents
- Publication number
- CN106021580B (application CN201610385810.2A)
- Authority
- CN
- China
- Prior art keywords
- log
- impala
- query
- setting
- hadoop
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/1824—Distributed file systems implemented using Network-attached Storage [NAS] architecture
- G06F16/183—Provision of network file services by network file servers, e.g. by using NFS, CIFS
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1805—Append-only file systems, e.g. using logs or journals to store data
- G06F16/1815—Journaling file systems
Abstract
The invention discloses a Hadoop-based Impala cluster log analysis method and system. The method comprises the steps of: configuring the web server to create a new directory every day, under which the log files produced by the application service system are generated; setting the system timer CRON to import the previous day's log files into HDFS in Hadoop at a scheduled time and to load the log file data into Hive; after the Hive data is loaded, setting the system timer CRON again to refresh the Hive metadata on schedule, start the Impala query program, extract the Hive metadata, and compute statistical indicators; and after the computation is complete, setting the system timer CRON again to export the statistical indicator data from HDFS to a database on schedule, which facilitates later queries. The method achieves the advantage of improved data-processing efficiency.
Description
Technical Field
The invention relates to the field of internet, in particular to an Impala cluster log analysis method and system based on Hadoop.
Background
With the popularity of the internet, the web has become the largest information system in today's highly informatized society. Web logs record a large amount of user-access information and contain the most important information a website holds; log analysis reveals the site's traffic volume, which pages attract the most visitors, which pages are the most valuable, and so on. Generally, a medium-sized website (more than 100,000 page views per day) generates over 1 GB of web log files every day, and large or very large websites may generate 10 GB of data per hour.
However, Hadoop's Map/Reduce programming model is relatively low-level: developers must write client programs that tend to be difficult to maintain and reuse, and running Map/Reduce jobs is inefficient.
Disclosure of Invention
The invention aims to provide a Hadoop-based Impala cluster log analysis method and system that achieve the advantage of improved data-processing efficiency.
To achieve this purpose, the invention adopts the following technical scheme:
a Hadoop-based Impala cluster log analysis method comprises the following steps:
configuring a web server to create a new directory every day, the log files produced by the application service system being generated under that directory;
setting a system timer CRON to import the previous day's log files into HDFS in Hadoop at a scheduled time and to load the log file data into Hive;
after the Hive data is loaded, setting the system timer CRON again to refresh the Hive metadata on schedule, start the Impala query program, extract the Hive metadata, and compute statistical indicators;
and after the computation is complete, setting the system timer CRON again to export the statistical indicator data from HDFS to a database on schedule, facilitating later queries.
Preferably, in the step of configuring the web server to create a new directory every day and generating the log files of the application service system under that directory, each log file has a size of 64 MB.
Preferably, the system timer CRON is set to fire at 0:00 at night, importing the previous day's log files into HDFS in Hadoop on schedule.
Meanwhile, the technical scheme of the invention also discloses a Hadoop-based Impala cluster log analysis system, which comprises a log collection module, a storage processing module, a query analysis module and a result display module;
the log collection module: transmits the logs from the front-end web servers to a log-receiving node, the receiving node importing the logs transmitted by the web servers into Hive through a background script;
the storage processing module: stores the log file data, and loads and maps the log files into Hive data tables;
the query analysis module: receives Impala query requests sent by users, thereby providing a query analysis function, and returns query results to the result display module;
the result display module: is responsible for submitting the user's query request to Impala and presenting the query result returned by Impala for the user to view.
Preferably, log transmission in the log collection module uses scheduled rsync transfers.
Preferably, the result display module presents query results in forms including charts or tables.
Preferably, in the storage processing module, Hadoop's HDFS is used to store the data.
The technical scheme of the invention has the following beneficial effects:
according to the technical scheme, the mass log data stored on Hdfs are analyzed in real time on line, the PV value (pageView, page access amount) and the independent IP number of the website are obtained, the keyword ranking list searched by the user, the page with the highest user retention time and the like can be calculated, and the corresponding analysis result is obtained through query in an Impala SQL (structured query language) statement mode. The purpose of improving the data processing efficiency is achieved.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a schematic block diagram of an Impala cluster log analysis method based on Hadoop according to an embodiment of the present invention;
fig. 2 is a schematic block diagram of an Impala cluster log analysis system based on Hadoop according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention are described below in conjunction with the accompanying drawings; it should be understood that they are presented for illustration and explanation only, not limitation.
A Hadoop-based Impala cluster log analysis method comprises the following steps:
configuring a web server to create a new directory every day, the log files produced by the application service system being generated under that directory;
setting a system timer CRON to import the previous day's log files into HDFS in Hadoop at a scheduled time and to load the log file data into Hive;
after the Hive data is loaded, setting the system timer CRON again to refresh the Hive metadata on schedule, start the Impala query program, extract the Hive metadata, and compute statistical indicators;
and after the computation is complete, setting the system timer CRON again to export the statistical indicator data from HDFS to a database on schedule, facilitating later queries.
As shown in FIG. 1, the left side is the Application service system, and the right side is HDFS, YARN, Hive and Impala of Hadoop.
1. Logs are generated by the business system. The web server is configured to create a new directory every day, under which several log files are generated, each 64 MB in size.
2. A system timer CRON is set; at 0:00 at night, yesterday's log files are imported into HDFS and the data is loaded into Hive.
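The nightly import step can be sketched as follows. This is a minimal illustration only: the local and HDFS directory layouts, the table name `weblogs`, and the partition column `dt` are assumptions, since the patent does not specify them.

```python
from datetime import date, timedelta

# Hypothetical layout: the web server writes each day's logs under
# /data/weblogs/YYYY-MM-DD/ (directory names are assumptions).
LOCAL_ROOT = "/data/weblogs"
HDFS_ROOT = "/user/hive/warehouse/weblogs"

def yesterday_dir(today: date) -> str:
    """Name of the directory the previous day's logs were written to."""
    return (today - timedelta(days=1)).isoformat()

def import_commands(today: date) -> list[list[str]]:
    """Commands a nightly CRON job could run: copy yesterday's logs to HDFS,
    then load them into a Hive partition."""
    day = yesterday_dir(today)
    put = ["hdfs", "dfs", "-put", f"{LOCAL_ROOT}/{day}", f"{HDFS_ROOT}/{day}"]
    load = ["hive", "-e",
            f"LOAD DATA INPATH '{HDFS_ROOT}/{day}' "
            f"INTO TABLE weblogs PARTITION (dt='{day}')"]
    return [put, load]

if __name__ == "__main__":
    for cmd in import_commands(date(2016, 6, 3)):
        print(" ".join(cmd))
```

A crontab entry such as `0 0 * * *` would invoke a script running these commands, matching the 0:00 schedule described above.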
3. After loading completes, a system timer is set to refresh the Hive metadata, start the Impala query program, and extract and compute the statistical indicators.
4. After computation completes, a system timer is set to export the statistical indicator data from HDFS to the database, facilitating later queries.
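The patent does not name the export tool for step 4; Sqoop is a common choice for moving data from HDFS to a relational database, so the sketch below builds a hypothetical nightly Sqoop export command (the host, database, and table names are assumptions):

```python
def export_command(day: str) -> list[str]:
    """Command a CRON job could run to push one day's statistics to MySQL."""
    return [
        "sqoop", "export",
        "--connect", "jdbc:mysql://dbhost/logstats",   # hypothetical target DB
        "--table", "daily_indicators",                 # hypothetical table
        "--export-dir", f"/user/hive/warehouse/stats/dt={day}",
        "--input-fields-terminated-by", r"\001",       # Hive's default delimiter
    ]

if __name__ == "__main__":
    print(" ".join(export_command("2016-06-02")))
```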
As shown in fig. 2, a Hadoop-based Impala cluster log analysis system includes a log collection module, a storage processing module, a query analysis module and a result display module;
the log collection module: transmits the logs from the front-end web servers to a log-receiving node, the receiving node importing the logs transmitted by the web servers into Hive through a background script;
the storage processing module: stores the log file data, and loads and maps the log files into Hive data tables;
the query analysis module: receives Impala query requests sent by users, thereby providing a query analysis function, and returns query results to the result display module;
the result display module: is responsible for submitting the user's query request to Impala and presenting the query result returned by Impala for the user to view.
Specifically, the log collection module:
is responsible for transmitting the logs from each front-end web server to a log-receiving node. Log transmission uses scheduled rsync transfers: the logs on each server are sent to the receiving node at a fixed time every day. The receiving node then imports the logs transmitted by each server into Hive through a background script.
The storage processing module:
Hadoop's HDFS is used to store the actual data; specifically, the module executes the map-reduce tasks submitted by Hive, loading and mapping the log files on HDFS into Hive data tables.
The query analysis module:
and the query analysis module and the storage processing module are deployed and completed in a cluster system. In an actual architecture, Hive is deployed on a NameNode, that is, a master node in a Hadoop cluster, and functionally divides the NameNode, that is, the master node into two modules for description.
The query module mainly accomplishes two functions: firstly, structuring the log data in a log acquisition module into a database, and mapping the log data of each website into a Hive database table; secondly, receiving an Impala query request sent by a user to provide a large-scale query analysis function, and returning a query result to the result output module.
The query processing process of Impala: receiving Impalad connected to the client, namely, serving as the Coordinator of the query, the Coordinator analyzes the query SQL of the user to generate an execution plan tree, and different operations correspond to different plannodes, for example: the method comprises the steps of selecting a node, ScanNode, SortNode and the like, wherein each atomic operation of an execution Plan tree is represented by a Plan Fragment, usually, one query statement is composed of a plurality of Plan fragments, Plan Fragment 0 represents the root of the execution tree, an aggregation result is returned to a user, and leaf nodes of the execution tree are generally scan operations and are executed in a distributed mode.
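The coordinator pattern described above can be modeled in miniature — this is a toy sketch, not Impala's actual implementation: leaf fragments scan partitions in parallel (as scan nodes do on data nodes), and a root fragment playing the role of Plan Fragment 0 merges their partial results:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_fragment(partition: list[dict]) -> int:
    """Leaf fragment: scan one partition and count successful requests
    (a partial PV count)."""
    return sum(1 for row in partition if row.get("status") == 200)

def root_fragment(partials: list[int]) -> int:
    """Plan Fragment 0: aggregate the partial results from the leaves."""
    return sum(partials)

def run_query(partitions: list[list[dict]]) -> int:
    """Coordinator: fan the scan out over partitions, then aggregate."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(scan_fragment, partitions))
    return root_fragment(partials)

if __name__ == "__main__":
    parts = [
        [{"status": 200}, {"status": 404}],  # partition on data node 1
        [{"status": 200}],                   # partition on data node 2
    ]
    print(run_query(parts))  # 2
```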
The result display module:
is responsible for submitting the user's query request to Impala and presenting the query result returned by Impala in some form for the user to view. The user enters or selects the content to query in the browser client provided by the system; the back end passes the query request to the query analysis module through an interface provided by Impala. The task is ultimately handled by Impala's query processing module, after which the query analysis module returns the result to the result display module. The returned results are presented in various forms, such as charts and tables, for the user to view.
In a web log, each record usually represents one access by a user; for example, the following is an nginx log entry:
22.68.172.10 - - [18/Sep/2014:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
It breaks down as follows:
the client IP address: 22.68.172.10;
the client user name: - (not recorded);
the access time and time zone: [18/Sep/2014:06:49:57 +0000];
the requested URL and HTTP protocol: "GET /images/my.jpg HTTP/1.1";
the request status, where 200 means success: 200;
the size of the response body sent to the client: 19939;
the referrer, i.e. the page link the request came from: http://www.angularjs.cn/A00n;
the client browser information: "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36".
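The field breakdown above can be captured with a small parser for this log format — a sketch whose field names follow the breakdown, tested against the sample line:

```python
import re

# One named group per field described above.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line: str) -> dict:
    """Split one nginx log line into its named fields."""
    m = LOG_PATTERN.match(line)
    if m is None:
        raise ValueError(f"unparseable log line: {line!r}")
    return m.groupdict()

if __name__ == "__main__":
    sample = ('22.68.172.10 - - [18/Sep/2014:06:49:57 +0000] '
              '"GET /images/my.jpg HTTP/1.1" 200 19939 '
              '"http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT6.1) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 '
              'Safari/537.36"')
    rec = parse_line(sample)
    print(rec["ip"], rec["status"], rec["size"])  # 22.68.172.10 200 19939
```

Records parsed this way supply exactly the columns a Hive log table would expose for Impala queries.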
In summary, the technical scheme of the invention also has the following characteristics:
1. Yesterday's log files are imported from the web server to HDFS on a daily schedule.
2. The log files on HDFS are loaded into Hive.
3. An Impala query statement is dispatched by Oozie or a system timer, and the statistical indicators are extracted and computed.
Finally, it should be noted that although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will appreciate that modifications may be made to the embodiments, or equivalents substituted for some of their features, without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (2)
1. A Hadoop-based Impala cluster log analysis method, characterized by comprising the following steps:
configuring a web server to create a new directory every day, the log files produced by the application service system being generated under that directory, wherein each log file has a size of 64 MB;
setting a system timer CRON to import the previous day's log files into HDFS in Hadoop at a scheduled time, wherein the system timer CRON is set to fire at 0:00 at night, and loading the log file data into Hive;
after the Hive data is loaded, setting the system timer CRON again to refresh the Hive metadata on schedule, start the Impala query program, extract the Hive metadata, and compute statistical indicators;
and after the computation is complete, setting the system timer CRON again to export the statistical indicator data from HDFS to a database on schedule, facilitating later queries.
2. A Hadoop-based Impala cluster log analysis system, characterized by comprising a log collection module, a storage processing module, a query analysis module and a result display module;
the log collection module: transmits the logs from the front-end web servers to a log-receiving node, the receiving node importing the logs transmitted by the web servers into Hive through a background script, wherein log transmission in the log collection module uses scheduled rsync transfers;
the storage processing module: stores the log file data, and loads and maps the log files into Hive data tables, wherein in the storage processing module, Hadoop's HDFS is used to store the data;
the query analysis module: receives Impala query requests sent by users, thereby providing a query analysis function, and returns query results to the result display module;
the result display module: is responsible for submitting the user's query request to Impala and presenting the query result returned by Impala for the user to view, the form of presentation including charts or tables.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610385810.2A CN106021580B (en) | 2016-06-03 | 2016-06-03 | Method and system for analyzing cluster logs of Impala based on Hadoop |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610385810.2A CN106021580B (en) | 2016-06-03 | 2016-06-03 | Method and system for analyzing cluster logs of Impala based on Hadoop |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106021580A CN106021580A (en) | 2016-10-12 |
CN106021580B true CN106021580B (en) | 2019-12-20 |
Family
ID=57090484
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610385810.2A Active CN106021580B (en) | 2016-06-03 | 2016-06-03 | Method and system for analyzing cluster logs of Impala based on Hadoop |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021580B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106547883B (en) * | 2016-11-03 | 2021-02-19 | 北京集奥聚合科技有限公司 | Method and system for processing User Defined Function (UDF) running condition |
CN108536810A (en) * | 2018-03-30 | 2018-09-14 | 四川斐讯信息技术有限公司 | Data visualization methods of exhibiting and system |
CN110362456A (en) * | 2018-04-10 | 2019-10-22 | 挖财网络技术有限公司 | A kind of method and device obtaining server-side performance data |
CN111352963A (en) * | 2018-12-24 | 2020-06-30 | 北京奇虎科技有限公司 | Data statistical method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103955502A (en) * | 2014-04-24 | 2014-07-30 | 科技谷(厦门)信息技术有限公司 | Visualized on-line analytical processing (OLAP) application realizing method and system |
CN105528367A (en) * | 2014-09-30 | 2016-04-27 | 华东师范大学 | A method for storage and near-real time query of time-sensitive data based on open source big data |
Non-Patent Citations (1)
Title |
---|
Collision comparison analysis based on HDFS and IMPALA; Wang Yan et al.; Video Engineering; 2015-12-31; pp. 94-98 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |