WO2017092444A1

WO2017092444A1 - Log data mining method and system based on Hadoop

Info

Publication number: WO2017092444A1
Application number: PCT/CN2016/097363
Authority: WO
Inventors: 惠羿; 熊伟; 哈景楠
Original assignee: 中兴通讯股份有限公司
Priority date: 2015-12-02
Filing date: 2016-08-30
Publication date: 2017-06-08
Also published as: CN106815274B; CN106815274A

Abstract

A Hadoop-based log data mining method, comprising: saving an acquired first log data set for the current time period into a Hadoop database (101); if the number of first log data sets saved in the Hadoop database satisfies a preset value, performing parallel aggregation processing on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set (102); and performing dimension division on the log data in the second log data set according to the dimensions of that log data, and saving the resulting third log data sets corresponding to different dimensions into the Hadoop database (103). The method can mine massive data quickly and effectively and satisfies the storage and computing requirements of massive data mining.

Description

Hadoop-based log data mining method and system

Technical field

The present invention relates to the field of computer data processing, and in particular, to a log data mining method and system based on Hadoop.

Background art

Since entering the Internet era, how to quickly find more appropriate, quantifiable and predictable precision marketing strategies in the ever-increasing mass of user information has become the core demand of many enterprises including operators.

However, traditional databases have limited computing power and high storage costs, and therefore cannot meet the needs of massive data mining.

The above content is only used to assist in understanding the technical solutions of the present invention, and does not constitute an admission that the above is related art.

Summary of the invention

The main purpose of the embodiments of the present invention is to provide a log data mining method and system based on Hadoop, which aims to solve the technical problem that the traditional database has limited computing power, high storage cost, and cannot provide massive data mining.

To achieve the above objective, a Hadoop-based log data mining method provided by an embodiment of the present invention includes:

Saving the acquired first log data set for the current time period to the Hadoop database;

If the number of first log data sets saved in the Hadoop database satisfies a preset value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;

Performing dimension division on the log data in the second log data set according to the dimensions of that log data, and saving the resulting third log data sets corresponding to different dimensions to the Hadoop database.

Optionally, the method further includes:

Obtaining log data in the current time period from the network side;

Performing aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.

Optionally, after the step of acquiring the log data in the current time period from the network side, the method further includes:

Performing data cleaning on the log data in the current time period to obtain log data after cleaning in the current time period;

And the step of performing the aggregation processing on the log data in the current time period to obtain the first log data set in the current time period includes:

Performing aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.

Optionally, the method further includes:

If the data query instruction is received, the third log data set corresponding to the query dimension is read from the Hadoop database according to the query dimension included in the data query instruction;

Data analysis is performed on the third log data set, and the result of the data analysis is displayed on the display interface.

Optionally, the performing data analysis on the third log data set includes:

Performing user grouping on users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;

And obtaining, according to the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, where the user dimensions are preset, and the level configuration table contains the level assigned to each user in the user grouping list under each user dimension.

To achieve the above objective, an embodiment of the present invention further provides a log data mining system based on Hadoop, including:

The first saving module is configured to save the first log data set in the obtained current time period to the Hadoop database;

a parallel aggregation module, configured to: if the number of first log data sets saved in the Hadoop database satisfies a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;

a division and saving module, configured to perform dimension division on the log data in the second log data set according to the dimensions of that log data, and to save the resulting third log data sets corresponding to different dimensions to the Hadoop database.

Optionally, the system further includes:

An obtaining module, configured to obtain log data in the current time period from the network side;

The first aggregation module is configured to perform aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.

Optionally, the system further includes a cleaning module;

The cleaning module is configured to perform data cleaning on the log data in the current time period after the obtaining module obtains the log data in the current time period, and obtain the cleaned log data in the current time period;

The first aggregation module is specifically configured to perform aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.

Optionally, the system further includes:

The reading module is configured to: if the data query instruction is received, read the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction;

The analysis module is configured to perform data analysis on the third log data set and display the result of the data analysis on the display interface.

Optionally, the analyzing module includes:

a clustering module, configured to group users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;

an obtaining and display module, configured to obtain a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list, where the user dimensions are preset, and the level configuration table contains the level assigned to each user in the user grouping list under each user dimension.

In the embodiment of the present invention, a computer storage medium is further provided, and the computer storage medium may store an execution instruction for executing the Hadoop-based log data mining method in the foregoing embodiment.

The embodiment of the present invention provides a Hadoop-based log data mining method. The acquired first log data set for the current time period is saved to the Hadoop database. If the number of first log data sets saved in the Hadoop database satisfies a preset value, parallel aggregation processing is performed on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set. Dimension division is then performed on the log data in the second log data set according to the dimensions of that log data, and the resulting third log data sets corresponding to different dimensions are saved to the Hadoop database, which completes the mining of the log data. Because Hadoop offers good distributed storage and parallel computing capabilities, the Hadoop database can be used for distributed storage of the log data and the parallel computing model for parallel computation, so that massive data can be mined quickly and effectively and the storage and computing requirements of massive data mining are met.

Description of the drawings

FIG. 1 is a schematic flowchart of a Hadoop-based log data mining method according to a first embodiment of the present invention;

FIG. 2 is a schematic flowchart of an additional step before step 101 of the first embodiment shown in FIG. 1;

FIG. 3 is a schematic flowchart of an additional step after step 103 of the first embodiment shown in FIG. 1;

FIG. 4 is a schematic diagram of functional modules of a Hadoop-based log data mining system according to a second embodiment of the present invention;

FIG. 5 is a schematic diagram of additional functional modules in the second embodiment shown in FIG. 4;

FIG. 6 is a schematic diagram of additional functional modules in the second embodiment shown in FIG. 4.

The implementation, functional features, and advantages of the present invention will be further described in conjunction with the embodiments.

Detailed description

It is understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention provides a Hadoop-based log data mining method. The acquired first log data set for the current time period is saved to the Hadoop database. If the number of first log data sets saved in the Hadoop database satisfies a preset value, parallel aggregation processing is performed on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set. Dimension division is performed on the log data in the second log data set according to the dimensions of that log data, and the resulting third log data sets corresponding to different dimensions are saved to the Hadoop database, which completes the mining of the log data. Because Hadoop offers good distributed storage and parallel computing capabilities, the Hadoop database can be used for distributed storage of the log data and the preset parallel computing model in Hadoop for parallel computation, so that massive data can be mined quickly and effectively and the storage and computing needs of massive data mining are met.

FIG. 1 is a schematic flowchart of a Hadoop-based log data mining method according to a first embodiment of the present invention, including:

Step 101: Save the first log data set in the obtained current time period to the Hadoop database.

In the embodiment of the present invention, the Hadoop-based log data mining method can be applied to a Hadoop-based log data mining system (hereinafter referred to as the mining system), and the mining system saves the acquired first log data set for the current time period to the Hadoop database.

The mining system acquires the first log data set according to the time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set for the current 15-minute time period or for the current 30-minute time period, respectively.

The time period is a period for acquiring data, and the duration of the time period may be determined according to the size of the data amount.
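As an illustration of how log timestamps could be assigned to such fixed-length time periods, the following is a minimal Python sketch. The function name `period_start` and the choice to align periods to midnight are assumptions made for this example only; the source does not specify how period boundaries are chosen.

```python
from datetime import datetime, timedelta


def period_start(ts: datetime, period_minutes: int) -> datetime:
    """Return the start of the collection period that contains ts.

    Periods are assumed to be aligned to midnight (an illustrative choice).
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    minutes_since_midnight = (ts - midnight) // timedelta(minutes=1)
    bucket = minutes_since_midnight // period_minutes * period_minutes
    return midnight + timedelta(minutes=bucket)
```

With a 15-minute period, a log entry stamped 10:44:59 falls in the period that starts at 10:30, and all entries of that period would go into the same first log data set.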

Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). The core of the Hadoop framework is the Hadoop database and the parallel computing model: the Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation over massive data.

Preferably, the parallel computing model is the MapReduce computing model.

Step 102: If the number of first log data sets saved in the Hadoop database satisfies a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set.

In the embodiment of the present invention, the mining system saves the acquired first log data set to the Hadoop database in each time period. If the number of first log data sets saved in the Hadoop database satisfies the preset value, the mining system performs parallel aggregation processing on the first log data sets in the Hadoop database by using the preset parallel computing model in the Hadoop framework to obtain a second log data set.

In the actual application, the value may be preset according to specific needs. For example, if the time period is 15 minutes and the first log data set needs to be aggregated within one hour, the preset value is 4; If the above time period is 30 minutes, and the first log data set in one day needs to be aggregated, the preset value is 48.
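The arithmetic behind the two examples above can be written as a short Python sketch. The helper name `preset_count` is hypothetical, and the check that the aggregation window is a whole multiple of the time period is an assumption added for the example.

```python
def preset_count(period_minutes: int, window_minutes: int) -> int:
    """Number of first log data sets needed to cover one aggregation window."""
    if period_minutes <= 0 or window_minutes % period_minutes != 0:
        raise ValueError("window must be a whole multiple of the time period")
    return window_minutes // period_minutes


hourly = preset_count(15, 60)        # 15-minute periods aggregated per hour
daily = preset_count(30, 24 * 60)    # 30-minute periods aggregated per day
```

Here `hourly` evaluates to 4 and `daily` to 48, matching the preset values given in the text.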

It can be understood that, building on the above aggregation processing, the mining system can obtain log data sets for different time spans in a similar manner. For example, the first log data sets of four 15-minute time periods yield the log data set for one hour, the log data sets of 24 hours yield the log data set for one day, the log data sets of 30 days yield the log data set for one month, and so on, so that log data sets for different time spans are available to meet different needs.
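The layered roll-up from shorter to longer time spans can be sketched in Python as follows. This is only an in-memory illustration: the function `roll_up` and the use of `collections.Counter` to hold a log data set are assumptions, and incomplete trailing groups are simply left out.

```python
from collections import Counter
from typing import List


def roll_up(sets: List[Counter], group_size: int) -> List[Counter]:
    """Merge every group_size consecutive log data sets into one larger set."""
    rolled = []
    for start in range(0, len(sets) - group_size + 1, group_size):
        total = Counter()
        for log_set in sets[start:start + group_size]:
            total.update(log_set)  # adds counts of identical keys
        rolled.append(total)
    return rolled


# Eight 15-minute sets become two hourly sets.
quarter_hours = [Counter({"music": 1, "news": 2}) for _ in range(8)]
hours = roll_up(quarter_hours, 4)
```

Applying the same function again with a group size of 24 would turn hourly sets into daily sets, and a group size of 30 would turn daily sets into a monthly set.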

In the embodiment of the present invention, when the mining system performs parallel aggregation processing by using the preset parallel computing model, the count values of identical log data are accumulated.
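A single-process Python sketch of this aggregation, written in the map, group, and reduce style of MapReduce, might look like the following. It is not Hadoop code: on a real cluster the map and reduce phases would run in parallel across nodes. The record key `(user, content)` and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Optional, Tuple

Key = Tuple[str, str]  # hypothetical record key: (user, content)


def map_phase(first_set: Dict[Key, int]) -> Iterator[Tuple[Key, int]]:
    """Emit one (key, count) pair per record of a first log data set."""
    for key, count in first_set.items():
        yield key, count


def reduce_phase(counts: List[int]) -> int:
    """Accumulate the count values of identical log data."""
    return sum(counts)


def aggregate_first_sets(first_sets: List[Dict[Key, int]],
                         preset_value: int) -> Optional[Dict[Key, int]]:
    """Build the second log data set once enough first sets are saved."""
    if len(first_sets) < preset_value:
        return None  # threshold not reached yet, nothing to aggregate
    grouped = defaultdict(list)
    for first_set in first_sets:
        for key, count in map_phase(first_set):
            grouped[key].append(count)  # stands in for the shuffle step
    return {key: reduce_phase(counts) for key, counts in grouped.items()}
```

Returning `None` below the threshold mirrors the condition in step 102 that aggregation starts only when the number of saved first log data sets satisfies the preset value.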

Step 103: Divide the log data in the second log data set according to the dimension of the log data in the second log data set, and save the obtained third log data set corresponding to different dimensions to the Hadoop database.

In the embodiment of the present invention, after obtaining the second log data set, the mining system performs dimension division on the log data in the second log data set according to the dimensions of that log data, and saves the resulting third log data sets corresponding to different dimensions to the Hadoop database, thereby implementing massive log data mining. The saved third log data sets can serve as the data source for user data queries and support chart, graphical, and multi-dimensional queries on the display interface, so that the data can be presented from multiple angles to achieve data mining.

Log data has many dimensions, including but not limited to Internet access content, Internet access location, and Internet access time. Internet access content refers to what the user browses, which may be a specific site, for example Baidu, Sohu, or Sina Weibo, or a type of website, such as music or movies. Internet access location refers to the geographical location of the IP address used by the user. Internet access time refers to the time at which the log data is generated. The division of dimensions is based on the requirements of the system, and the data in each dimension further characterizes the overall behavior of the user. It should be noted that different types of log data have different dimensions. For example, when the user data in the log data is mined by using the technical solution in the embodiment of the present invention, the dimensions may include, in addition to the above Internet access content, Internet access location, and Internet access time, the Internet access frequency, user age, monthly consumption, and the like. Therefore, in practical applications, dimension division may be performed according to specific needs, which is not limited herein.

Preferably, in the embodiment of the present invention, after the third log data sets corresponding to the different dimensions are saved to the Hadoop database, they may also be saved to a column storage array. This enables the Hadoop database and the column storage array to work together to meet the data requirements of different application scenarios.

Preferably, since the mining system performs the parallel aggregation processing and the dimension division only when the number of first log data sets saved in the Hadoop database satisfies the preset value, each resulting third log data set actually corresponds to a time period. When saving, the mining system can therefore save the correspondence among the dimension, the time period, and the third log data set.
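The dimension division of step 103, together with the saved correspondence among dimension, time period, and third log data set, can be sketched in Python as below. The record layout, the dimension names in `DIMENSIONS`, and the use of a plain dictionary in place of the Hadoop database are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

# Hypothetical dimension names; the source lists content, location and time.
DIMENSIONS = ("content", "location", "hour")


def divide_by_dimension(second_set: List[dict],
                        period_label: str) -> Dict[Tuple[str, str], Dict[Hashable, int]]:
    """Split a second log data set into one third log data set per dimension.

    Each result is keyed by (dimension, period_label) so the correspondence
    between dimension, time period and third log data set is kept.
    """
    third_sets = {}
    for dimension in DIMENSIONS:
        totals = defaultdict(int)
        for record in second_set:
            totals[record[dimension]] += record["count"]
        third_sets[(dimension, period_label)] = dict(totals)
    return third_sets
```

Keying each third log data set by both dimension and time period is one simple way to let a later query select by either or both.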

In the embodiment of the present invention, the mining system saves the acquired first log data set for the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database satisfies a preset value, the mining system performs parallel aggregation processing on the first log data sets in the Hadoop database by using the preset parallel computing model to obtain a second log data set. Dimension division is then performed on the log data in the second log data set according to the dimensions of that log data, and the resulting third log data sets corresponding to different dimensions are saved to the Hadoop database, which completes the mining of the log data. Because Hadoop offers good distributed storage and parallel computing capabilities, the Hadoop database can be used for distributed storage of the log data and the Hadoop parallel computing model for parallel computation, so that massive data can be mined quickly and effectively and the storage and computing requirements of massive data mining are met.

FIG. 2 is a schematic flowchart of an additional step before step 101 in the first embodiment shown in FIG. 1 of the present invention, including:

Step 201: Obtain log data in a current time period from a network side.

In the embodiment of the present invention, the mining system obtains the log data in the current time period from the network side. Specifically, the mining system may extract the log data in the current time period from the network side by means of log data extraction, obtain it from the network side by using web crawler technology, obtain it from the BOSS camp database on the network side, or accept log data in the current time period provided by a third-party vendor on the network side. It may also combine at least two of the above methods.

Step 202: Perform aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.

In the embodiment of the present invention, after acquiring the log data in the current time period, the mining system performs aggregation processing on the log data in the current time period to obtain a first log data set of the current time period.

The aggregation in step 202 may be performed according to the content of the log data: log data with the same content, or with content belonging to the same class, is accumulated into a single record. The first log data set obtained after the aggregation is orders of magnitude smaller than the acquired log data in the current time period, while the meaning of the data is fully preserved.
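A minimal Python sketch of this per-period aggregation is shown below. The host-to-category table `CATEGORY_BY_HOST`, the record fields, and the `example.com` host names are hypothetical, chosen only to show how entries with the same content or the same class collapse into one counted record.

```python
from collections import Counter
from typing import Iterable

# Hypothetical host-to-category table; unknown hosts keep their own name.
CATEGORY_BY_HOST = {
    "music.example.com": "music",
    "radio.example.com": "music",
    "films.example.com": "movies",
}


def first_aggregation(raw_logs: Iterable[dict]) -> Counter:
    """Collapse raw log entries into (user, content class) -> count records."""
    counts = Counter()
    for log in raw_logs:
        content_class = CATEGORY_BY_HOST.get(log["host"], log["host"])
        counts[(log["user"], content_class)] += 1
    return counts
```

The size of the result depends on the number of distinct (user, content class) pairs rather than on the number of raw entries, which is where the reduction in data volume comes from.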

In the embodiment of the present invention, by obtaining the first log data set through the additional steps shown in FIG. 2, the mining system effectively reduces the order of magnitude of the log data acquired from the network side in the current time period, which reduces the storage space required in the Hadoop database.

Preferably, in the embodiment of the present invention, the mining system may further perform the following steps before performing step 202:

Performing data cleaning on the log data in the current time period to obtain the cleaned log data in the current time period;

In the embodiment of the present invention, before aggregating the log data in the current time period, the mining system may perform data cleaning on that log data to obtain the cleaned log data in the current time period.

If the mining system performs the above step, step 202 needs to be adjusted accordingly, and the adjusted step 202 is:

Performing aggregation processing on the cleaned log data in the current time period to obtain the first log data set in the current time period.

The cleaning of the log data may consist of removing log data whose data type does not meet the preset requirements, and/or identifying identifiable errors in the log data and correcting or deleting the affected log data.

In the embodiment of the present invention, the mining system performs data cleaning on the log data in the current time period, so that useless or erroneous log data is removed, the amount of log data to be processed is reduced, and data mining is facilitated.
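The cleaning rules described above can be illustrated with a short Python sketch. The record fields (`user`, `host`, `bytes`) and the specific rules are assumptions for this example: entries with the wrong data type are removed, recognizable errors such as stray whitespace or a numeric string are corrected, and entries that cannot be corrected are deleted.

```python
from typing import Iterable, List


def clean_logs(raw_logs: Iterable[dict]) -> List[dict]:
    """Return only the log entries that pass or can be repaired by cleaning."""
    cleaned = []
    for log in raw_logs:
        user, host, size = log.get("user"), log.get("host"), log.get("bytes")
        # Remove entries whose fields have the wrong data type.
        if not isinstance(user, str) or not isinstance(host, str):
            continue
        # Correct identifiable errors: stray whitespace and letter case.
        host = host.strip().lower()
        # Correct identifiable errors: a byte count stored as a digit string.
        if isinstance(size, str) and size.strip().isdecimal():
            size = int(size)
        # Delete entries that still cannot be interpreted.
        if not host or isinstance(size, bool) or not isinstance(size, int) or size < 0:
            continue
        cleaned.append({"user": user, "host": host, "bytes": size})
    return cleaned
```

Running the cleaning before the aggregation of step 202 means that malformed entries never contribute to the counts in the first log data set.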

Please refer to FIG. 3 , which is a schematic flowchart of an additional step after step 103 in the first embodiment shown in FIG. 1 , which includes:

Step 301: If a data query instruction is received, the third log data set corresponding to the query dimension is read from the Hadoop database according to the query dimension included in the data query instruction.

In the embodiment of the present invention, after the mining system saves the obtained third log data sets to the Hadoop database, the user may request data by entering a data query instruction. If the mining system receives the data query instruction, it reads the third log data set corresponding to the query dimension contained in the instruction from the Hadoop database.

Preferably, the data query instruction may further include a certain time period, and the mining system will read the third log data set corresponding to the query dimension in the time period.
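A query by dimension with an optional time range could be sketched in Python as follows. The dictionary `store` stands in for the Hadoop database, and its key layout `(dimension, period)` with ISO-style period labels is an assumption; the labels are compared as strings, which works only because that format sorts chronologically.

```python
from typing import Dict, Hashable, Optional, Tuple

Store = Dict[Tuple[str, str], Dict[Hashable, int]]


def query_third_sets(store: Store, dimension: str,
                     start: Optional[str] = None,
                     end: Optional[str] = None) -> Dict[Hashable, int]:
    """Read and merge the third log data sets for one query dimension."""
    result: Dict[Hashable, int] = {}
    for (stored_dimension, period), third_set in store.items():
        if stored_dimension != dimension:
            continue
        if start is not None and period < start:
            continue
        if end is not None and period > end:
            continue
        for value, count in third_set.items():
            result[value] = result.get(value, 0) + count
    return result
```

Omitting `start` and `end` returns the data for the query dimension over every saved time period.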

Step 302: Perform data analysis on the third log data set, and display the result of the data analysis on the display interface.

In the embodiment of the present invention, the mining system further performs data analysis on the third log data set and displays the result of the data analysis on the display interface. Specifically, the mining system groups the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list. It then obtains, according to the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, and displays the level configuration table on the display interface. The user dimensions are preset, and the level configuration table contains the level assigned to each user in the user grouping list under each user dimension.

User dimensions can be divided into horizontal dimensions and vertical dimensions, and users are rated in each. For example, suppose the user groups obtained by the mining system include an all-users group and a Weibo user group. In the all-users group, all users are ranked by the amount of traffic they use: the top 20% are five-star users, those between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of every user in the all-users group. This is the horizontal dimension rating. In the Weibo user group, users are ranked by the amount of traffic generated after they start Weibo: the top 20% are five-star users, those between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of every user in the Weibo user group. This is the vertical dimension rating. Together, the horizontal and vertical dimension ratings provide a portrait of each user group, so that business experts can devise targeted solutions for a specific group portrait.
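The 20% banding used for the star ratings can be expressed as a small Python sketch. The function name `star_ratings` and the tie-breaking by input order are assumptions; the source only states that users are ranked by traffic and split into five bands of 20% each.

```python
from typing import Dict


def star_ratings(traffic_by_user: Dict[str, float]) -> Dict[str, int]:
    """Rank users by traffic and assign 5 stars to the top 20%, down to 1 star."""
    ranked = sorted(traffic_by_user, key=traffic_by_user.get, reverse=True)
    total = len(ranked)
    ratings = {}
    for position, user in enumerate(ranked):
        band = position * 5 // total  # 0 for the top 20%, 4 for the bottom 20%
        ratings[user] = 5 - band
    return ratings


# Ten hypothetical users with traffic 100, 90, ..., 10.
example = star_ratings({f"u{i}": 100 - 10 * i for i in range(10)})
```

The same function could be applied once to the all-users group with total traffic (horizontal rating) and once to the Weibo user group with Weibo traffic (vertical rating), giving one column of the level configuration table per user dimension.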

Preferably, the preset clustering algorithm may be a K-means algorithm.
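For readers unfamiliar with K-means, the following is a compact pure-Python sketch of the standard assign-and-update iteration. It is not taken from the source: the feature vectors, the function names, and in particular the simplistic initialization from the first k points are assumptions made to keep the example deterministic.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: Sequence[Point]) -> Point:
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def k_means(points: Sequence[Point], k: int,
            max_iterations: int = 100) -> Tuple[List[int], List[Point]]:
    """Group points into k clusters; returns (labels, centroids)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]  # simplistic initialization
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids
```

In the setting described here, each point could be a per-user feature vector derived from the third log data set, for example traffic per content type, and the resulting labels would form the user grouping list.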

The query dimension is set according to the dimensions of the third log data sets saved in the Hadoop database. For example, the query dimension may be any one or any combination of Internet access content, Internet access time, and Internet access location.

In the embodiment of the present invention, the mining system reads the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction, and performs data analysis on the third log data set, and The results of the data analysis are displayed on the display interface, so that the results of the data mining can be effectively displayed to the user.

It should be noted that, in the embodiment of the present invention, the Hadoop-based log data mining method can be applied to a precision marketing system for traffic data. For example, the technical solutions described in the embodiments shown in FIG. 1 and the subsequent figures can be used to mine target users, marketing locations, and the like, providing operators with a data basis for targeted, fine-grained marketing to target users or target base station cells.

If the target user needs to be determined, in the step 301 in the embodiment shown in FIG. 3, the query dimension may be the Internet content or the Internet traffic. If the target base station cell needs to be determined, the query dimension may be the Internet access location.

In practical applications, the user can select the query dimension according to specific needs, which is not limited here.

Please refer to FIG. 4 , which is a schematic diagram of functional modules of a Hadoop-based log data mining system according to a second embodiment of the present invention, including:

The first saving module 401 is configured to save the first log data set in the acquired current time period to the Hadoop database;

Preferably, the parallel computing model is the MapReduce computing model.

The parallel aggregation module 402 is configured to: if the number of first log data sets saved in the Hadoop database satisfies a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;

It can be understood that, building on the foregoing aggregation processing, the parallel aggregation module 402 can obtain log data sets for different time spans in a similar manner. For example, the first log data sets of 15-minute time periods yield the log data set for one hour, the log data sets of 24 hours yield the log data set for one day, the log data sets of 30 days yield the log data set for one month, and so on, so that log data sets for different time spans are available to meet different needs.

The division and saving module 403 is configured to perform dimension division on the log data in the second log data set according to the dimensions of that log data, and to save the resulting third log data sets corresponding to different dimensions to the Hadoop database.

Log data has many dimensions, including but not limited to Internet access content, Internet access location, and Internet access time. Internet access content refers to what the user browses, which may be a specific site, for example Baidu, Sohu, or Sina Weibo, or a type of website, such as music or movies. Internet access location refers to the geographical location of the IP address used by the user. Internet access time refers to the time at which the log data is generated. The division of dimensions is based on the requirements of the system, and the data in each dimension further characterizes the overall behavior of the user. It should be noted that different types of log data have different dimensions. For example, when the traffic data of users in the log data is mined according to the embodiment of the present invention, the dimensions may include, in addition to the above Internet access content, Internet access location, and Internet access time, the Internet access frequency, user age, monthly consumption, and the like. Therefore, in practical applications, dimension division may be performed according to specific needs, which is not limited herein.

In the embodiment of the present invention, the first saving module 401 saves the acquired first log data set for the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database satisfies the preset value, the parallel aggregation module 402 performs parallel aggregation processing on the first log data sets in the Hadoop database by using the preset parallel computing model to obtain a second log data set. Finally, the division and saving module 403 performs dimension division on the log data in the second log data set according to the dimensions of that log data, and saves the resulting third log data sets corresponding to different dimensions to the Hadoop database.

In the embodiment of the present invention, the mining system saves the acquired first log data set for the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database satisfies a preset value, the mining system performs parallel aggregation processing on the first log data sets in the Hadoop database by using the parallel computing model in Hadoop to obtain a second log data set. Dimension division is then performed on the log data in the second log data set according to the dimensions of that log data, and the resulting third log data sets corresponding to different dimensions are saved to the Hadoop database, which completes the mining of the log data. Because Hadoop offers good distributed storage and parallel computing capabilities, the Hadoop database can be used for distributed storage of the log data and the Hadoop parallel computing model for parallel computation, so that massive data can be mined quickly and effectively and the storage and computing requirements of massive data mining are met.

Referring to FIG. 5, it is a schematic diagram of a function module added in the second embodiment shown in FIG. 4, including:

The obtaining module 501 is configured to acquire log data in the current time period from the network side;

In the embodiment of the present invention, the obtaining module 501 is configured to obtain the log data in the current time period from the network side. Specifically, the obtaining module 501 may extract the log data in the current time period from the network side by means of log data extraction, obtain it from the network side by using web crawler technology, obtain it from the BOSS camp database on the network side, or accept log data in the current time period provided by a third-party vendor on the network side. It may also combine at least two of the above methods.

The first aggregation module 502 is configured to perform aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.

The first aggregation module 502 may classify the log data according to its content and accumulate log data that has the same content, or content belonging to the same class, into a single piece of data. The order of magnitude of the first log data set obtained by this aggregation is far below that of the log data acquired in the current time period, while the meaning of the data is completely preserved.
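A minimal sketch of this kind of aggregation is given below for illustration only. The record layout `(user, app)` and the `classify` parameter are assumptions introduced here; the embodiment does not specify how content is compared or classified.

```python
from collections import Counter


def aggregate(log_records, classify=lambda record: record):
    """Accumulate records with the same content (or the same class) into one counted entry."""
    counts = Counter()
    for record in log_records:
        counts[classify(record)] += 1
    return counts


# Hypothetical raw log records for one time period: (user, app) pairs.
raw_logs = [("u1", "weibo"), ("u2", "news"), ("u1", "weibo"), ("u3", "weibo")]

# Same content: identical records collapse into one entry with a count.
first_log_data_set = aggregate(raw_logs)

# Same class: records are grouped by the app they refer to.
by_class = aggregate(raw_logs, classify=lambda record: record[1])
```

Four raw records become three entries when grouped by identical content and two entries when grouped by class, which shows how the aggregated set shrinks while the counts retain the information.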

In the embodiment of the present invention, the mining system starts to execute the first saving module 401 in the embodiment shown in FIG. 4 only after executing the first aggregation module 502.

In the embodiment of the present invention, the system further includes a cleaning module 503;

The cleaning module 503 is configured to perform data cleaning on the log data in the current time period after the obtaining module 501 obtains the log data in the current time period, and obtain the cleaned log data in the current time period;

And the first aggregation module 502 is configured to perform aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.

In the embodiment of the present invention, the mining system obtains the first log data set by using the additional steps shown in FIG. 2. This effectively reduces the order of magnitude of the log data in the current time period acquired from the network side, and therefore reduces the storage space required in the Hadoop database. The mining system can also perform data cleaning on the log data in the current time period, so that useless or erroneous log data is removed, the amount of log data to be processed is reduced, and data mining is facilitated.
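The cleaning step can be illustrated with the following sketch. The cleaning rules are assumptions made for the example: the embodiment only states that useless or erroneous log data is removed, so the `user` and `traffic` fields and the validity checks below are hypothetical.

```python
def clean(log_records):
    """Keep only records with a user and a non-negative numeric traffic value."""
    cleaned = []
    for record in log_records:
        user = record.get("user")
        traffic = record.get("traffic")
        if not user:
            continue  # treated as useless: no user to attribute the record to
        if not isinstance(traffic, (int, float)) or traffic < 0:
            continue  # treated as erroneous: missing, non-numeric, or negative traffic
        cleaned.append(record)
    return cleaned


# Hypothetical log data in the current time period.
raw = [
    {"user": "u1", "traffic": 120},
    {"user": "", "traffic": 50},
    {"user": "u2", "traffic": -3},
    {"user": "u3", "traffic": "n/a"},
    {"user": "u4", "traffic": 0.5},
]
cleaned_logs = clean(raw)
```

Only the first and last records survive in this example; the cleaned records would then be passed to the first aggregation module.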

Referring to FIG. 6, which is a schematic diagram of the function modules added to the second embodiment shown in FIG. 4, the system further includes:

The reading module 601 is configured to: if a data query instruction is received, read the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction;

The analysis module 602 is configured to perform data analysis on the third log data set and display the result of the data analysis on the display interface.

The analysis module 602 includes:

The clustering module 603 is configured to perform user grouping on the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;

The obtaining display module 604 is configured to obtain, according to the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, and to display the level configuration table on the display interface. The user dimensions are preset, and the level configuration table includes the level determined for each user in the user grouping list according to each user dimension.

User dimensions can be divided into horizontal dimensions and vertical dimensions, and users are rated separately in each dimension. For example, suppose the user groups obtained by the mining system include an all-user group and a Weibo user group. For the all-user group, all users in the group are ranked by the amount of traffic they use: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of each user in the all-user group. This is the horizontal dimension rating. For the Weibo user group, users are ranked by the traffic generated after they start Weibo: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of each user in the Weibo user group. This is the vertical dimension rating. Together, the horizontal dimension rating and the vertical dimension rating make it possible to present a profile of each user group, so that a business expert can devise a targeted solution for the profile of a specific group.
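The star rating described above can be sketched as follows. This is an illustration under assumptions: the traffic figures are invented, "and so on" is read as continuing in 20% bands down to one star, and ties are broken by the order in which users appear.

```python
def star_ratings(traffic_by_user):
    """Rank users by traffic, highest first, and assign 5 to 1 stars per 20% band."""
    ranked = sorted(traffic_by_user, key=traffic_by_user.get, reverse=True)
    total = len(ranked)
    ratings = {}
    for position, user in enumerate(ranked):
        band = position * 5 // total  # 0 for the top 20%, 4 for the bottom 20%
        ratings[user] = 5 - band
    return ratings


# Hypothetical traffic figures for the same five users.
total_traffic = {"a": 900, "b": 700, "c": 500, "d": 300, "e": 100}  # all traffic used
weibo_traffic = {"a": 10, "b": 5, "c": 80, "d": 60, "e": 40}        # traffic after starting Weibo

horizontal = star_ratings(total_traffic)  # horizontal dimension rating
vertical = star_ratings(weibo_traffic)    # vertical dimension rating

# A simple level configuration table: user -> (horizontal stars, vertical stars).
level_table = {user: (horizontal[user], vertical[user]) for user in horizontal}
```

In this example user "a" is a five-star user horizontally but only a two-star user in the Weibo dimension, which is the kind of contrast the level configuration table is meant to expose.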

Preferably, the preset clustering algorithm may be a K-means algorithm.
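For illustration, a minimal one-dimensional K-means grouping is sketched below. The embodiment does not specify which features are clustered, so clustering users by a single traffic value is an assumption, as are the function name `k_means_1d`, the deterministic initialization, and the sample data.

```python
def k_means_1d(values, k, iterations=20):
    """Plain K-means on one-dimensional values; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: pick k values spread evenly across the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: attach every value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


# Hypothetical per-user traffic taken from a third log data set.
users = ["u1", "u2", "u3", "u4", "u5", "u6"]
traffic = [1, 2, 3, 100, 110, 120]

centroids, labels = k_means_1d(traffic, k=2)

# User grouping list: cluster index -> users in that cluster.
user_groups = {}
for user, label in zip(users, labels):
    user_groups.setdefault(label, []).append(user)
```

A production system would more likely cluster multi-dimensional feature vectors with an existing library; the sketch only shows the assignment and update steps that define the algorithm.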

Embodiments of the present invention also provide a storage medium. Optionally, in the embodiment, the foregoing storage medium may be configured to store program code for performing the following steps:

S1: Save the acquired first log data set in the current time period to the Hadoop database;

S2: If the number of first log data sets saved in the Hadoop database satisfies a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;

S3: Perform dimension division on the log data in the second log data set according to the dimension of the log data in the second log data set, and save the obtained third log data set corresponding to different dimensions to the Hadoop database.
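The parallel aggregation in step S2 can be illustrated in a map-and-merge style. The sketch below is an assumption-laden stand-in: it uses a local thread pool from the Python standard library instead of a Hadoop parallel computing model, and it assumes each first log data set is a list of hypothetical `(content, count)` records.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partial(first_set):
    """Map step: sum the counts inside one first log data set."""
    counts = Counter()
    for content, count in first_set:
        counts[content] += count
    return counts


def merge(left, right):
    """Reduce step: add two partial results together."""
    return left + right


def parallel_aggregate(first_sets, workers=4):
    """Process each first log data set in a worker, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_partial, first_sets))
    return reduce(merge, partials, Counter())


# Hypothetical first log data sets from three time periods.
first_sets = [
    [("weibo", 2), ("news", 1)],
    [("weibo", 3)],
    [("news", 4), ("mail", 1)],
]
second_set = parallel_aggregate(first_sets)
```

Because the merge is associative and commutative, the partial results can be combined in any order, which is the property that allows this kind of aggregation to be distributed across nodes.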

Optionally, the storage medium is further arranged to store program code for performing the following steps:

S1: Obtain log data in a current time period from a network side;

S2: Perform aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.

Optionally, in this embodiment, the foregoing storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the related art, can be embodied in the form of a software product. The software product is stored in a storage medium (such as a ROM/RAM, a disk, or a CD-ROM) and includes a number of instructions for causing a terminal device (which may be a cell phone, computer, server, air conditioner, network device, or the like) to perform the methods of the various embodiments of the present invention.

The above are only preferred embodiments of the present invention and are not intended to limit its patent scope. Any equivalent structure or equivalent process transformation made using the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Industrial applicability

In the embodiment of the present invention, the acquired first log data set in the current time period is saved to the Hadoop database. If the number of first log data sets saved in the Hadoop database satisfies a preset value, a preset parallel computing model is used to perform parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set. Dimension division is then performed on the log data in the second log data set according to the dimension of that log data, and the obtained third log data set corresponding to different dimensions is saved to the Hadoop database, which completes the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, it can store the log data in a distributed manner and process it in parallel with the parallel computing model. Massive data can therefore be mined quickly and efficiently, and the storage and computing requirements of massive data mining are met.

Claims

A Hadoop-based log data mining method, including:

Saving the acquired first log data set in the current time period to the Hadoop database;

If the number of first log data sets saved in the Hadoop database satisfies a preset value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;

Performing dimension division on the log data in the second log data set according to the dimension of the log data in the second log data set, and saving the obtained third log data set corresponding to different dimensions to the Hadoop database.
The method of claim 1 wherein the method further comprises:

Obtaining log data in the current time period from the network side;

Performing aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
The method of claim 2, wherein, after the step of obtaining log data in the current time period from the network side, the method further comprises:

Performing data cleaning on the log data in the current time period to obtain log data after cleaning in the current time period;

And the step of performing the aggregation processing on the log data in the current time period to obtain the first log data set in the current time period includes:

Performing aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
The method of any one of claims 1 to 3, wherein the method further comprises:

If a data query instruction is received, reading the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction;

Performing data analysis on the third log data set and displaying the result of the data analysis on the display interface.
The method of claim 4, wherein the performing data analysis on the third log data set and displaying the result of the data analysis on the display interface comprises:

Performing user grouping on users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;

Obtaining a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list, and displaying the level configuration table on the display interface; the user dimensions are preset, and the level configuration table includes the level determined for the users in the user grouping list according to the user dimensions.
A log data mining system based on Hadoop, including:

A first saving module, configured to save the acquired first log data set in the current time period to the Hadoop database;

A parallel aggregation module, configured to: if the number of first log data sets saved in the Hadoop database satisfies a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;

A partitioning save module, configured to perform dimension division on the log data in the second log data set according to the dimension of the log data in the second log data set, and save the obtained third log data set corresponding to different dimensions to the Hadoop database.
The system of claim 6 wherein said system further comprises:

An obtaining module, configured to obtain log data in a current time period from a network side;

The first aggregation module is configured to perform aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
The system of claim 7 wherein said system further comprises a cleaning module;

The cleaning module is configured to perform data cleaning on the log data in the current time period after the obtaining module obtains the log data in the current time period, and obtain the cleaned log data in the current time period;

The first aggregation module is specifically configured to perform aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
The system of any of claims 6 to 8, wherein the system further comprises:

The reading module is configured to: if the data query instruction is received, read the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction;

The analysis module is configured to perform data analysis on the third log data set and display the result of the data analysis on the display interface.
The system of claim 9 wherein said analyzing module comprises:

a clustering module, configured to group users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;

An obtaining display module, configured to obtain a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list, and display the level configuration table on the display interface; the user dimensions are preset, and the level configuration table includes the level determined for the users in the user grouping list according to the user dimensions.