CN106815274B - Hadoop-based log data mining method and system - Google Patents

Hadoop-based log data mining method and system

Info

Publication number
CN106815274B
Authority
CN
China
Prior art keywords
log data
time period
current time
data set
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510875453.3A
Other languages
Chinese (zh)
Other versions
CN106815274A (en
Inventor
惠羿
熊伟
哈景楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN201510875453.3A priority Critical patent/CN106815274B/en
Priority to PCT/CN2016/097363 priority patent/WO2017092444A1/en
Publication of CN106815274A publication Critical patent/CN106815274A/en
Application granted granted Critical
Publication of CN106815274B publication Critical patent/CN106815274B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Hadoop-based log data mining method, which comprises: storing a first log data set for the current time period into a Hadoop database; if the number of first log data sets stored in the Hadoop database meets a preset value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel operation model to obtain a second log data set; and performing dimension division on the log data in the second log data set according to the dimensions of that log data, and storing the resulting third log data sets, each corresponding to a different dimension, into the Hadoop database. The invention also discloses a Hadoop-based log data mining system. The invention can mine massive data quickly and effectively and meets the storage and computation requirements of mining massive data.

Description

Hadoop-based log data mining method and system
Technical Field
The invention relates to the field of computer data processing, in particular to a log data mining method and system based on Hadoop.
Background
In the internet era, quickly finding appropriate, quantifiable, and predictable precision-marketing strategies within an ever-growing mass of user information has become a core need of many enterprises, including operators.
However, traditional databases have limited data computation capability and high storage costs, and cannot meet the requirements of mining massive data.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The main object of the invention is to provide a Hadoop-based log data mining method and system, so as to solve the technical problems that traditional databases have limited data computation capability and high storage costs and cannot support the mining of massive data.
In order to achieve the above object, the invention provides a log data mining method based on Hadoop, comprising:
storing the acquired first log data set in the current time period into a Hadoop database;
if the number of the first log data sets stored in the Hadoop database meets a preset numerical value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel operation model to obtain a second log data set;
and performing dimension division on the log data in the second log data set according to the dimensions of the log data in the second log data set, and storing the obtained third log data sets corresponding to different dimensions into the Hadoop database.
Preferably, the method further comprises:
acquiring log data in the current time period from a network side;
and carrying out aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
Preferably, the step of obtaining the log data in the current time period from the network side further includes:
performing data cleaning on the log data in the current time period to obtain cleaned log data in the current time period;
the step of performing aggregation processing on the log data in the current time period to obtain a first log data set in the current time period includes:
and carrying out aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
Preferably, the method further comprises:
if a data query instruction is received, reading a third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction;
and performing data analysis on the third log data set, and displaying the result of the data analysis on a display interface.
Preferably, the performing data analysis on the third log data set includes:
performing user grouping on the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
obtaining a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list, wherein the user dimensions are preset, and the level configuration table comprises the levels assigned to the users in the user grouping list by rating them along each user dimension.
In order to achieve the above object, the present invention further provides a log data mining system based on Hadoop, including:
the first storage module is used for storing the acquired first log data set in the current time period into a Hadoop database;
the parallel aggregation module is used for performing parallel aggregation processing on the first log data set in the Hadoop database by using a preset parallel operation model to obtain a second log data set if the number of the first log data sets stored in the Hadoop database meets a preset numerical value;
and the division and storage module is used for performing dimension division on the log data in the second log data set according to the dimension of the log data in the second log data set, and storing the obtained third log data sets corresponding to different dimensions into the Hadoop database.
Preferably, the system further comprises:
the acquisition module is used for acquiring the log data in the current time period from a network side;
and the first aggregation module is used for performing aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
Preferably, the system further comprises a cleaning module;
the cleaning module is used for cleaning the log data in the current time period after the acquisition module acquires the log data in the current time period to obtain the cleaned log data in the current time period;
and the first aggregation module is specifically configured to perform aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
Preferably, the system further comprises:
the reading module is used for reading a third log data set corresponding to a query dimension from the Hadoop database according to the query dimension contained in the data query instruction if the data query instruction is received;
and the analysis module is used for carrying out data analysis on the third log data set and displaying the result of the data analysis on a display interface.
Preferably, the analysis module comprises:
the clustering module is used for carrying out user grouping on the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
an obtaining and displaying module, configured to obtain a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list, where the user dimensions are preset, and the level configuration table includes the levels assigned to the users in the user grouping list by rating them along each user dimension.
The invention provides a Hadoop-based log data mining method, which comprises: storing a first log data set for the current time period into a Hadoop database; if the number of first log data sets stored in the Hadoop database meets a preset value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel operation model to obtain a second log data set; and performing dimension division on the log data in the second log data set according to the dimensions of that log data, and storing the third log data sets corresponding to the different dimensions into the Hadoop database, thereby completing the mining of the log data. Because Hadoop offers good distributed storage and parallel computation capabilities, storing the log data in a distributed manner in the Hadoop database and processing it in parallel with the parallel operation model allows massive data to be mined quickly and effectively and meets the storage and computation requirements of mining massive data.
Drawings
FIG. 1 is a schematic flow chart of a Hadoop-based log data mining method according to a first embodiment of the present invention;
FIG. 2 is a schematic flow chart showing additional steps before step 101 of the first embodiment of FIG. 1;
FIG. 3 is a flow chart illustrating additional steps after step 103 of the first embodiment of FIG. 1;
FIG. 4 is a diagram illustrating functional modules of a Hadoop-based log data mining system according to a second embodiment of the present invention;
FIG. 5 is a diagram of additional functional modules in the second embodiment of FIG. 4;
FIG. 6 is a schematic diagram of additional functional modules in the second embodiment of FIG. 4.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a Hadoop-based log data mining method, which comprises: storing a first log data set for the current time period into a Hadoop database; if the number of first log data sets stored in the Hadoop database meets a preset value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel operation model to obtain a second log data set; and performing dimension division on the log data in the second log data set according to the dimensions of that log data, and storing the third log data sets corresponding to the different dimensions into the Hadoop database, thereby completing the mining of the log data. Because Hadoop offers good distributed storage and parallel computation capabilities, storing the log data in a distributed manner in the Hadoop database and processing it in parallel with the preset parallel operation model in Hadoop allows massive data to be mined quickly and effectively and meets the storage and computation requirements of mining massive data.
Referring to fig. 1, a schematic flow chart of a Hadoop-based log data mining method according to a first embodiment of the present invention includes:
step 101, storing the acquired first log data set in the current time period into a Hadoop database;
in the embodiment of the invention, the log data mining method based on Hadoop can be applied to a log data mining system based on Hadoop (hereinafter referred to as mining system), and the mining system stores the acquired first log data set in the current time period into a Hadoop database.
The mining system acquires the first log data set according to a time period, for example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set in the current 15-minute time period or acquires the first log data set in the current 30-minute time period.
The time period is a period for acquiring data, and the duration of the time period can be determined according to the size of the data volume.
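By way of illustration only, the assignment of a log timestamp to its acquisition time period may be sketched as follows. The function name `period_start` and the assumption that the period divides a day evenly are hypothetical and are not taken from the embodiment.

```python
from datetime import datetime, timedelta


def period_start(ts: datetime, period: timedelta = timedelta(minutes=15)) -> datetime:
    """Floor a timestamp to the start of the acquisition period containing it.

    Assumes (hypothetically) that periods are aligned to midnight and that
    the period length divides a day evenly, e.g. 15 or 30 minutes.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    # timedelta // timedelta yields the number of whole periods elapsed today.
    return midnight + (elapsed // period) * period


# A log entry produced at 10:37:12 belongs to the 15-minute period starting at 10:30.
print(period_start(datetime(2015, 12, 3, 10, 37, 12)))
```

Log data sharing the same period start would then belong to the same first log data set.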
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework consists of the Hadoop database and a parallel operation model: the Hadoop database provides distributed storage for massive data, and the parallel operation model provides parallel computation over that data.
Preferably, the parallel operation model is the MapReduce operation model.
Step 102, if the number of the first log data sets stored in the Hadoop database meets a preset numerical value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel operation model to obtain a second log data set;
in the embodiment of the invention, the mining system stores the acquired first log data set into the Hadoop database in each time period, and if the number of the first log data sets stored in the Hadoop database meets the preset numerical value, the first log data sets in the Hadoop database are aggregated by using the preset parallel operation model in the Hadoop framework to obtain a second log data set.
In practical applications, the value may be preset according to specific needs, for example, if the time period is 15 minutes and aggregation processing needs to be performed on the first log data set within one hour, the preset value is 4; if the time period is 30 minutes and the aggregation process needs to be performed on the first log data set within 1 day, the preset value is 48.
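The relationship between the time period, the aggregation window, and the preset value in the two examples above is a simple division. A minimal sketch follows; the helper name `preset_value` is hypothetical.

```python
from datetime import timedelta


def preset_value(period: timedelta, window: timedelta) -> int:
    """Number of per-period first log data sets that make up one aggregation window."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    count, remainder = divmod(window, period)
    if remainder:
        raise ValueError("window must be a whole multiple of period")
    return count


print(preset_value(timedelta(minutes=15), timedelta(hours=1)))  # 4
print(preset_value(timedelta(minutes=30), timedelta(days=1)))   # 48
```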
It will be appreciated that, based on the aggregation process described above, the mining system may derive log data sets for different time spans in a similar manner. For example, a one-hour log data set can be obtained from four first log data sets of 15 minutes each, a one-day log data set from 24 one-hour log data sets, a one-month log data set from 30 one-day log data sets, and so on, so that log data sets covering different time spans are available to meet different requirements.
In the embodiment of the invention, when the mining system performs parallel aggregation processing by using the preset parallel operation model, the count values of identical log data are accumulated.
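The accumulation of count values for identical log data can be illustrated with a single-process simulation of the map, shuffle, and reduce phases. This is only a sketch under stated assumptions: it does not use the Hadoop API, the record shape (a mapping from a log key to its count) is hypothetical, and "meets a preset numerical value" is interpreted here as "at least that many sets are available".

```python
from collections import defaultdict
from itertools import chain


def map_phase(data_set):
    """Emit (key, count) pairs for one first log data set."""
    return list(data_set.items())


def shuffle(pairs):
    """Group the emitted counts by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, count in pairs:
        grouped[key].append(count)
    return grouped


def reduce_phase(grouped):
    """Accumulate the count values of identical log data."""
    return {key: sum(counts) for key, counts in grouped.items()}


def aggregate(first_sets, preset_value):
    """Return the second log data set, or None while too few first sets are stored."""
    if len(first_sets) < preset_value:
        return None
    mapped = chain.from_iterable(map_phase(s) for s in first_sets[:preset_value])
    return reduce_phase(shuffle(mapped))


# Four 15-minute sets aggregated into one hourly set (hypothetical keys: user, content).
quarter_hours = [
    {("u1", "music"): 2, ("u2", "video"): 1},
    {("u1", "music"): 3},
    {("u2", "video"): 4},
    {("u3", "news"): 1},
]
print(aggregate(quarter_hours, 4))
```

In an actual MapReduce job the map and reduce functions would run on many nodes in parallel; the grouping by key is what allows the reduce step to be distributed.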
Step 103, performing dimension division on the log data in the second log data set according to the dimensions of the log data in the second log data set, and storing the obtained third log data sets corresponding to different dimensions into the Hadoop database.
In the embodiment of the invention, after the mining system obtains the second log data set, it performs dimension division on the log data in the second log data set according to the dimensions of that log data, and stores the obtained third log data sets corresponding to different dimensions into the Hadoop database, thereby mining the massive log data. The stored third log data sets can serve as data sources for user data queries and support charts, graphical queries, and multi-dimensional queries on the display interface, so that the data can be presented from multiple angles and the results of the data mining can be displayed.
Log data has many dimensions, including but not limited to internet surfing content, internet surfing position, and internet surfing time. The internet surfing content refers to what the user browses, which may be a specific site, such as Baidu, Sohu, or Sina Weibo, or a category of website, such as music or movies. The internet surfing position refers to the geographical area of the IP address used by the user, and the internet surfing time refers to the time at which the log data was generated. Dimension division further describes the overall behavior of users through the data along each dimension, according to the requirements of the system. It should be noted that different types of log data have different dimensions. For example, when the technical solution of the embodiment of the invention is used to mine users' traffic data in the log data, the dimensions may also include internet surfing frequency, user age, monthly consumption, and the like, in addition to internet surfing content, internet surfing position, and internet surfing time. In practical applications, dimension division can therefore be performed according to specific needs and is not limited here.
Preferably, in the embodiment of the present invention, after the mining system stores the third log data sets corresponding to different dimensions into the Hadoop database, the mining system may also store the third log data sets corresponding to different dimensions into the column storage array, so that cooperative work of the Hadoop database and the column storage array can be realized, and data requirements of different application scenarios can be met.
Preferably, the mining system executes the operations of parallel aggregation processing and dimension division only when the number of the first log data sets stored in the Hadoop database satisfies a preset value, so that the obtained third log data set actually corresponds to a time period, and the mining system can store the corresponding relationship among the dimension, the time period and the third log data set when storing the third log data set.
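As an illustrative sketch of dimension division and of storing the correspondence among dimension, time period, and third log data set, the following uses an in-memory dictionary as a stand-in for the Hadoop database. The field names (`content`, `location`, `hour`, `count`) and the function names are assumptions made for the example only.

```python
from collections import defaultdict

# Hypothetical dimension field names; the embodiment does not fix a record layout.
DIMENSIONS = ("content", "location", "hour")


def divide_by_dimension(second_set, dimensions=DIMENSIONS):
    """Build one third log data set per dimension: dimension value -> accumulated count."""
    third_sets = {dim: defaultdict(int) for dim in dimensions}
    for record in second_set:
        for dim in dimensions:
            third_sets[dim][record[dim]] += record["count"]
    return {dim: dict(values) for dim, values in third_sets.items()}


def store(database, period, third_sets):
    """Save each third log data set under its (dimension, time period) key."""
    for dim, data in third_sets.items():
        database[(dim, period)] = data


second_set = [
    {"content": "music", "location": "cell-1", "hour": 9, "count": 5},
    {"content": "video", "location": "cell-1", "hour": 10, "count": 2},
    {"content": "music", "location": "cell-2", "hour": 9, "count": 1},
]
database = {}  # stand-in for the Hadoop database
store(database, "2015-12-03", divide_by_dimension(second_set))
print(database[("content", "2015-12-03")])
```

Keying by both dimension and time period is one simple way to preserve the stated correspondence so that a later query can select on either.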
In the embodiment of the invention, the mining system stores the acquired first log data set for the current time period into the Hadoop database; if the number of first log data sets stored in the Hadoop database meets a preset numerical value, the first log data sets in the Hadoop database are subjected to parallel aggregation processing by using a preset parallel operation model to obtain a second log data set; the log data in the second log data set is then divided by dimension according to the dimensions of that log data, and the obtained third log data sets corresponding to different dimensions are stored into the Hadoop database, thereby completing the mining of the log data. Because Hadoop offers good distributed storage and parallel computation capabilities, storing the log data in a distributed manner in the Hadoop database and processing it in parallel with the parallel operation model in Hadoop allows massive data to be mined quickly and effectively and meets the storage and computation requirements of mining massive data.
Referring to fig. 2, a flow chart illustrating an additional step before step 101 in the first embodiment of fig. 1 according to the present invention includes:
step 201, obtaining log data in the current time period from a network side;
in the embodiment of the present invention, the mining system obtains the log data in the current time period from the network side. Specifically, the mining system may obtain the log data in the current time period by log data extraction, by web crawler technology, from a BOSS accounting database on the network side, or from a third-party vendor on the network side, or by combining at least two of these manners.
Step 202, performing aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
In the embodiment of the invention, after acquiring the log data in the current time period, the mining system performs aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
In step 202, the aggregation may be performed by classifying the log data according to its content and merging log data with the same content, or the same class of content, into a single record whose count is accumulated. The first log data set obtained after aggregation is far smaller in magnitude than the log data obtained in the current time period, while the meaning of the data is fully preserved.
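A minimal sketch of this per-period aggregation is given below, assuming each raw log entry carries hypothetical `user` and `content` fields and that entries with the same pair count as "the same content".

```python
from collections import Counter


def aggregate_period(raw_logs):
    """Collapse identical (user, content) entries into one record with a count."""
    counts = Counter((log["user"], log["content"]) for log in raw_logs)
    return [{"user": u, "content": c, "count": n} for (u, c), n in counts.items()]


raw_logs = [
    {"user": "u1", "content": "music"},
    {"user": "u1", "content": "music"},
    {"user": "u1", "content": "music"},
    {"user": "u2", "content": "video"},
]
# Four raw entries become two records; the counts preserve what was observed.
print(aggregate_period(raw_logs))
```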
In the embodiment of the present invention, the mining system implements acquisition of the first log data set through the additional steps shown in fig. 2, and by aggregating the log data acquired from the network side in the current time period, the magnitude of the log data can be effectively reduced, so that the storage space required in the Hadoop database is reduced, and the storage space is saved.
Preferably, in the embodiment of the present invention, before performing step 202, the mining system may further perform the following steps:
performing data cleaning on the log data in the current time period to obtain cleaned log data in the current time period;
in the embodiment of the invention, before the mining system aggregates the acquired log data in the current time period, the mining system can also perform data cleaning on the log data in the current time period to obtain the cleaned log data in the current time period.
If the mining system performs the above step, step 202 is adapted accordingly as follows:
and carrying out aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
The log data can be cleaned by removing log data that does not conform to a preset data type, and/or by finding recognizable errors in the log data and correcting or deleting the erroneous log data.
In the embodiment of the invention, by performing data cleaning on the log data in the current time period, the mining system can remove useless or erroneous log data, reduce the amount of log data to be processed, and facilitate better data mining.
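The cleaning step might be sketched as follows. The required fields and the particular corrections (trimming whitespace, lower-casing the content) are assumptions chosen for illustration; the embodiment does not specify which errors are treated as recognizable.

```python
def clean(raw_logs, required_fields=("user", "content", "timestamp")):
    """Drop entries that do not match the expected shape and fix simple, recognizable errors."""
    cleaned = []
    for log in raw_logs:
        # Entries that are not records, or lack a required field, do not
        # conform to the preset data type and are removed.
        if not isinstance(log, dict) or any(f not in log for f in required_fields):
            continue
        # Hypothetical corrections: stray whitespace and inconsistent letter case.
        user = str(log["user"]).strip()
        content = str(log["content"]).strip().lower()
        if not user or not content:
            continue  # cannot be corrected, so the entry is deleted
        cleaned.append({**log, "user": user, "content": content})
    return cleaned


raw_logs = [
    {"user": " u1 ", "content": "Music", "timestamp": 1},
    {"user": "u2", "content": "video"},               # missing timestamp
    "garbage",                                        # not a record at all
    {"user": "", "content": "news", "timestamp": 2},  # empty user
]
print(clean(raw_logs))
```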
Referring to fig. 3, a flow chart illustrating additional steps after step 103 in the first embodiment of fig. 1 according to the present invention includes:
step 301, if a data query instruction is received, reading a third log data set corresponding to a query dimension from a Hadoop database according to the query dimension contained in the data query instruction;
in the embodiment of the invention, after the mining system stores the obtained third log data sets in the Hadoop database, a user can request a data query by inputting a data query instruction; if the mining system receives the data query instruction, it reads the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction.
Preferably, the data query instruction may further include a certain time period, and the mining system reads a third log data set corresponding to the query dimension in the time period.
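Reading by query dimension, optionally restricted to a time period, can be sketched as a lookup against a store keyed by (dimension, time period). The store layout, the query fields, and the choice to merge all periods when none is given are assumptions for this example.

```python
def read_third_set(database, query):
    """Return the third log data set matching the query's dimension and optional period."""
    dimension = query["dimension"]
    period = query.get("period")
    if period is not None:
        return database.get((dimension, period), {})
    # No period given: merge the counts of every stored period for that dimension
    # (an assumed behavior, not stated in the embodiment).
    merged = {}
    for (dim, _period), data in database.items():
        if dim == dimension:
            for value, count in data.items():
                merged[value] = merged.get(value, 0) + count
    return merged


database = {  # stand-in for the Hadoop database
    ("content", "10:00"): {"music": 6},
    ("content", "11:00"): {"music": 1, "video": 2},
    ("location", "10:00"): {"cell-1": 7},
}
print(read_third_set(database, {"dimension": "content", "period": "10:00"}))
print(read_third_set(database, {"dimension": "content"}))
```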
Step 302, performing data analysis on the third log data set, and displaying the result of the data analysis on a display interface.
In the embodiment of the present invention, the mining system further performs data analysis on the third log data set and displays the result of the data analysis on a display interface. Specifically, the mining system groups the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list, obtains a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list, and displays the level configuration table on the display interface. The user dimensions are preset, and the level configuration table comprises the levels assigned to the users in the user grouping list by rating them along each user dimension.
The user dimensions can be divided into a horizontal dimension and a vertical dimension, and users are rated along each. For example, suppose the user groups obtained by the mining system include an all-users group and a microblog user group. For the all-users group, all users in the group are ranked by the traffic they have used: the top 20% are five-star users, those ranked between 20% and 40% are four-star users, and so on, until the star level of every user in the all-users group is determined. This is the horizontal dimension rating. For the microblog user group, the users are ranked by the traffic they generate after opening the microblog: the top 20% are five-star users, those ranked between 20% and 40% are four-star users, and so on, until the star level of every user in the microblog user group is determined. This is the vertical dimension rating. Together, the horizontal and vertical dimension ratings provide a portrait of each user group, from which a service expert can devise a targeted scheme for a specific group portrait.
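The star rating in this example amounts to ranking users by traffic and assigning one star band per 20% of the ranking. A minimal sketch follows; the function name and the handling of ties (by sort order) are assumptions.

```python
def star_ratings(traffic_by_user, stars=5):
    """Rank users by traffic, descending, and give each 1/stars slice one star level."""
    ranked = sorted(traffic_by_user, key=traffic_by_user.get, reverse=True)
    n = len(ranked)
    ratings = {}
    for position, user in enumerate(ranked):
        band = position * stars // n  # 0 for the top 20%, 1 for the next 20%, ...
        ratings[user] = stars - band
    return ratings


# Ten hypothetical users with traffic 100, 90, ..., 10.
traffic = {f"u{i}": 100 - 10 * i for i in range(10)}
ratings = star_ratings(traffic)
print(ratings["u0"], ratings["u2"], ratings["u9"])  # 5 4 1
```

The same function can be applied once per user group and per user dimension, for instance with total traffic for the all-users group and with microblog traffic for the microblog user group.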
Preferably, the preset clustering algorithm may be a K-means algorithm.
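For reference, a bare-bones K-means (Lloyd's algorithm) producing a user grouping list is sketched below. The feature vectors, the deterministic initialization from the first k points, and the fixed iteration count are assumptions made to keep the example short; a practical system would more likely rely on an existing library implementation.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points (an assumption)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


def user_groups(users, assignment):
    """Turn per-user cluster indices into a user grouping list."""
    groups = {}
    for user, cluster in zip(users, assignment):
        groups.setdefault(cluster, []).append(user)
    return groups


# Hypothetical features per user, e.g. (daily traffic, sessions per day).
users = ["u1", "u2", "u3", "u4", "u5"]
features = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1)]
assignment, _ = kmeans(features, k=2)
print(user_groups(users, assignment))
```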
The query dimension is set based on a dimension corresponding to a third log data set stored in the Hadoop database, for example: the query dimension can be any one or more of internet surfing content, internet surfing time, internet surfing position and the like.
In the embodiment of the invention, the mining system reads the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction, performs data analysis on the third log data set, and displays the result of the data analysis on the display interface, so that the result of the data mining can be effectively displayed to a user.
It should be noted that, in the embodiment of the present invention, the method for mining log data based on the Hadoop database may be applied to a precise marketing system of traffic data, for example, mining of a target user, mining of a marketing site, and the like may be implemented by using the technical solutions described in the embodiments shown in fig. 1 to fig. 3, so as to provide a data basis for targeted and refined marketing of the target user or a target base station cell by an operator.
If target users need to be determined, the query dimension in step 301 of the embodiment shown in FIG. 3 may be internet surfing content or internet traffic; if target base station cells need to be determined, the query dimension may be internet surfing position.
In practical applications, the user may select the query dimension according to specific needs, which is not limited herein.
Referring to fig. 4, a schematic diagram of functional modules of a Hadoop-based log data mining system according to a second embodiment of the present invention includes:
the first saving module 401 is configured to save the acquired first log data set in the current time period to a Hadoop database;
the mining system acquires the first log data set according to a time period, for example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set in the current 15-minute time period or acquires the first log data set in the current 30-minute time period.
The time period is a period for acquiring data, and the duration of the time period can be determined according to the size of the data volume.
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework consists of the Hadoop database and a parallel operation model: the Hadoop database provides distributed storage for massive data, and the parallel operation model provides parallel computation over that data.
Preferably, the parallel operation model is the MapReduce operation model.
A parallel aggregation module 402, configured to perform parallel aggregation processing on a first log data set in the Hadoop database by using a preset parallel operation model if the number of the first log data sets stored in the Hadoop database meets a preset numerical value, so as to obtain a second log data set;
in practical applications, the value may be preset according to specific needs, for example, if the time period is 15 minutes and aggregation processing needs to be performed on the first log data set within one hour, the preset value is 4; if the time period is 30 minutes and the aggregation process needs to be performed on the first log data set within 1 day, the preset value is 48.
It is understood that, based on the above aggregation process, the parallel aggregation module 402 can also derive log data sets for different time spans in a similar manner. For example, a one-hour log data set can be obtained from four first log data sets of 15 minutes each, a one-day log data set from 24 one-hour log data sets, a one-month log data set from 30 one-day log data sets, and so on, so that log data sets covering different time spans are available to meet different requirements.
The division and storage module 403 is configured to perform dimension division on the log data in the second log data set according to the dimensions of the log data in the second log data set, and store the obtained third log data sets corresponding to different dimensions into the Hadoop database.
The log data has many dimensions, including but not limited to internet surfing content, internet surfing position, and internet surfing time. The internet surfing content refers to what the user browses, which may be a specific website, such as Baidu, Sohu, or Sina Weibo, or a category of website, such as music or movies. The internet surfing position refers to the geographic area corresponding to the IP address used by the user, and the internet surfing time refers to the time at which the log data was generated. Dimension division describes the user's overall behavior in more detail through the data on each dimension, according to the requirements of the system. It should be noted that different types of log data have different dimensions. For example, when the technical solution of the embodiment of the present invention is used to mine user traffic data in the log data, the dimensions may also include internet surfing frequency, user age, monthly consumption, and the like, in addition to internet surfing content, internet surfing position, and internet surfing time. In practical applications, dimension division can therefore be performed according to specific needs and is not limited here.
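One way to picture dimension division is as a regrouping of the same aggregated records once per dimension. The following minimal sketch assumes each record is a dictionary with the hypothetical fields `content`, `location`, `time`, and `count`; none of these names come from the patent.

```python
from collections import defaultdict

DIMENSIONS = ("content", "location", "time")  # hypothetical field names


def divide_by_dimension(records, dimensions=DIMENSIONS):
    """Return {dimension: {dimension value: total count}}."""
    divided = {dim: defaultdict(int) for dim in dimensions}
    for record in records:
        for dim in dimensions:
            divided[dim][record[dim]] += record["count"]
    return {dim: dict(values) for dim, values in divided.items()}


second_set = [
    {"content": "music", "location": "Beijing", "time": "10:00", "count": 3},
    {"content": "music", "location": "Shanghai", "time": "10:00", "count": 2},
]
third_sets = divide_by_dimension(second_set)
print(third_sets["content"])   # {'music': 5}
print(third_sets["location"])  # {'Beijing': 3, 'Shanghai': 2}
```

Each entry of `third_sets` plays the role of one third log data set and could be stored separately.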
Preferably, in the embodiment of the present invention, after the mining system stores the third log data sets corresponding to different dimensions into the Hadoop database, the mining system may also store the third log data sets corresponding to different dimensions into the column storage array, so that cooperative work of the Hadoop database and the column storage array can be realized, and data requirements of different application scenarios can be met.
In this embodiment of the present invention, the first saving module 401 saves the acquired first log data set of the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database meets the preset value, the parallel aggregation module 402 performs parallel aggregation processing on the first log data sets in the Hadoop database by using the preset parallel operation model to obtain a second log data set. Finally, the division and storage module 403 performs dimension division on the log data in the second log data set according to the dimensions of that log data, and saves the resulting third log data sets, each corresponding to a different dimension, to the Hadoop database.
In the embodiment of the invention, the mining system stores the acquired first log data set of the current time period into the Hadoop database. If the number of first log data sets stored in the Hadoop database meets the preset value, the mining system uses the parallel operation model to perform parallel aggregation processing on the first log data sets in the Hadoop database and obtains a second log data set. It then performs dimension division on the log data in the second log data set according to the dimensions of that log data, and stores the third log data sets corresponding to the different dimensions into the Hadoop database, which completes the mining of the log data. Because Hadoop offers strong distributed storage and parallel computation capabilities, storing the log data in a distributed manner in the Hadoop database and processing it with the parallel operation model allows massive data to be mined quickly and effectively, and meets the storage and computation requirements of mining massive data.
Please refer to fig. 5, which is a schematic diagram of additional functional modules in the second embodiment shown in fig. 4, including:
an obtaining module 501, configured to obtain log data in a current time period from a network side;
In this embodiment of the present invention, the obtaining module 501 may obtain the log data in the current time period from the network side in any of the following ways: by extracting the log data directly; by using web crawler technology; by reading it from a BOSS accounting database on the network side; by receiving it from a third-party vendor on the network side; or by combining at least two of these approaches.
A first aggregation module 502, configured to perform aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
The first aggregation module 502 may classify the log data by content and merge log data with the same content, or the same class of content, into a single record that carries a count. The first log data set obtained after aggregation is far smaller in order of magnitude than the log data acquired in the current time period, while the meaning of the original data is fully preserved.
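A minimal sketch of this kind of first-level aggregation is shown below. It assumes raw log entries are dictionaries with the hypothetical fields `user` and `category`, and it collapses entries with the same pair into one record with a count; the actual classification rule used by the system is not specified in the text.

```python
from collections import Counter


def aggregate(raw_records):
    """Collapse raw entries that share (user, category) into counted records."""
    counts = Counter((r["user"], r["category"]) for r in raw_records)
    return [
        {"user": user, "category": category, "count": n}
        for (user, category), n in counts.items()
    ]


raw = [
    {"user": "u1", "category": "music"},
    {"user": "u1", "category": "music"},
    {"user": "u2", "category": "video"},
]
first_set = aggregate(raw)
print(len(raw), "->", len(first_set))  # 3 -> 2
```

The counts keep the information about how often each kind of entry occurred, which is how the data volume can shrink without losing the meaning of the data.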
In the embodiment of the present invention, the mining system will not start executing the first saving module 401 in the embodiment shown in fig. 4 until the first aggregation module 502 is executed.
In an embodiment of the present invention, the system further comprises a cleaning module 503;
the cleaning module 503 is configured to perform data cleaning on the log data in the current time period after the obtaining module 501 obtains the log data in the current time period, so as to obtain the cleaned log data in the current time period;
and if the mining system executes the cleaning module 503, the first aggregation module 502 is specifically configured to perform aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
In the embodiment of the present invention, the mining system acquires the first log data set through the additional steps shown in fig. 2. Aggregating the log data acquired from the network side in the current time period effectively reduces the magnitude of the log data, which reduces the storage space required in the Hadoop database. In addition, by cleaning the log data in the current time period, the mining system can remove useless or erroneous log data, which reduces the amount of log data to be processed and helps the data mining proceed more effectively.
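The text does not define what counts as useless or erroneous log data, so the following is only a sketch under assumed rules: a record is dropped if a required field is missing or empty, or if its byte count is not a non-negative integer. The field names `user`, `url`, and `bytes` are hypothetical.

```python
REQUIRED_FIELDS = ("user", "url", "bytes")  # hypothetical field names


def is_valid(record):
    """Assumed validity rule: all fields present, bytes a non-negative int."""
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return False
    return isinstance(record["bytes"], int) and record["bytes"] >= 0


def clean(records):
    """Keep only the records that pass the validity rule."""
    return [record for record in records if is_valid(record)]


raw = [
    {"user": "u1", "url": "http://example.com/a", "bytes": 512},
    {"user": "", "url": "http://example.com/b", "bytes": 100},   # empty user
    {"user": "u2", "url": "http://example.com/c", "bytes": -1},  # bad size
    {"user": "u3", "url": "http://example.com/d"},               # missing field
]
print(len(clean(raw)))  # 1
```

Cleaning before aggregation means the aggregation step only counts records that passed these checks.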
Please refer to fig. 6, which is a schematic diagram of additional functional modules of the second embodiment shown in fig. 4, including:
a reading module 601, configured to, if a data query instruction is received, read a third log data set corresponding to a query dimension from the Hadoop database according to the query dimension included in the data query instruction;
an analysis module 602, configured to perform data analysis on the third log data set, and display a result of the data analysis on a display interface.
Wherein the analysis module 602 comprises:
a clustering module 603, configured to perform user grouping on users in the third log data set according to a preset clustering algorithm, so as to obtain a user grouping list;
an obtaining and displaying module 604, configured to obtain a level configuration table corresponding to at least two user dimensions according to log data of users in the user grouping list, and display the level configuration table on a display interface; the user dimension is preset, and the level configuration table comprises the level determined by the user in the user grouping list according to the user dimension in a grading way.
The user dimensions can be divided into a horizontal dimension and a vertical dimension, and users are rated separately in each. For example, suppose the user groups obtained by the mining system include an all-users group and a microblog user group. For the all-users group, all users in the group are ranked by the traffic they have used: the top 20% are five-star users, those ranked between 20% and 40% are four-star users, and so on, which determines the star level of each user in the all-users group. This is the horizontal dimension rating. For the microblog user group, the users are ranked by the traffic generated after they open the microblog: the top 20% are five-star users, those ranked between 20% and 40% are four-star users, and so on, which determines the star level of each user in the microblog user group. This is the vertical dimension rating. With the horizontal and vertical dimension ratings, a portrait of each user group can be displayed, and a service expert can derive a targeted scheme for a specific group portrait.
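The star rating described here amounts to ranking users by a traffic figure and splitting the ranking into five equal bands. A minimal sketch follows; the function name and the tie-breaking behavior (ties keep their input order) are assumptions not stated in the text.

```python
def star_ratings(traffic_by_user):
    """Rank users by traffic and assign 5 stars to the top 20%, 4 to the next 20%, and so on."""
    ranked = sorted(traffic_by_user, key=traffic_by_user.get, reverse=True)
    n = len(ranked)
    ratings = {}
    for position, user in enumerate(ranked):
        band = (position * 5) // n  # 0 for the top 20%, ..., 4 for the bottom 20%
        ratings[user] = 5 - band
    return ratings


all_traffic = {"a": 500, "b": 400, "c": 300, "d": 200, "e": 100}
print(star_ratings(all_traffic))
# {'a': 5, 'b': 4, 'c': 3, 'd': 2, 'e': 1}
```

Calling the function with total traffic for the all-users group gives the horizontal rating, and calling it with microblog traffic for the microblog user group gives the vertical rating.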
Preferably, the preset clustering algorithm may be a K-means algorithm.
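For reference, a compact pure-Python sketch of K-means is given below. The patent names only the algorithm; the choice of features (total traffic and microblog traffic per user), the explicit initial centroids, and the function names are all illustrative assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iterations=50):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: assignment matches the returned centroids
        centroids = new_centroids
    return assignment, centroids


# Hypothetical features per user: (total traffic, microblog traffic).
users = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
groups, centers = kmeans(users, initial_centroids=[users[0], users[3]])
print(groups)  # [0, 0, 0, 1, 1, 1]
```

The resulting group labels correspond to the user grouping list produced by the clustering module; in practice the initial centroids would usually be chosen randomly or by a seeding heuristic rather than supplied by hand.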
The query dimension is set based on a dimension corresponding to a third log data set stored in the Hadoop database, for example: the query dimension can be any one or more of internet surfing content, internet surfing time, internet surfing position and the like.
In the embodiment of the invention, the mining system reads the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction, performs data analysis on the third log data set, and displays the result of the data analysis on the display interface, so that the result of the data mining can be effectively displayed to a user.
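The read step can be pictured as a lookup keyed by dimension. In the sketch below an in-memory dictionary stands in for the Hadoop database, and the function name, the dimension keys, and the shape of the stored data are all hypothetical.

```python
def read_third_sets(store, query_dimensions):
    """Return the stored third log data set for each requested dimension."""
    unknown = [dim for dim in query_dimensions if dim not in store]
    if unknown:
        raise KeyError(f"no stored data set for dimensions: {unknown}")
    return {dim: store[dim] for dim in query_dimensions}


# Stand-in for third log data sets saved per dimension.
store = {
    "content": {"music": 5, "video": 1},
    "location": {"Beijing": 3, "Shanghai": 3},
    "time": {"10:00": 6},
}
result = read_third_sets(store, ["content", "time"])
print(result)  # {'content': {'music': 5, 'video': 1}, 'time': {'10:00': 6}}
```

A query naming several dimensions simply returns several data sets, which the analysis module can then process together.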
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (6)

1. A log data mining method based on Hadoop is characterized by comprising the following steps:
storing the acquired first log data set in the current time period into a Hadoop database;
if the number of the first log data sets stored in the Hadoop database meets a preset numerical value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel operation model to obtain a second log data set;
performing dimensionality division on the log data in the second log data set according to the dimensionality of the log data in the second log data set, and storing the obtained third log data sets corresponding to different dimensionalities into the Hadoop database;
if a data query instruction is received, reading a third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction;
performing user grouping on the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
obtaining a level configuration table corresponding to at least two user dimensions according to log data of users in a user grouping list, and displaying the level configuration table on a display interface; the user dimension is preset, and the level configuration table comprises the level determined by the user in the user grouping list according to the user dimension in a grading way.
2. The method of claim 1, further comprising:
acquiring log data in the current time period from a network side;
and carrying out aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
3. The method according to claim 2, wherein the step of obtaining the log data in the current time period from the network side further comprises:
performing data cleaning on the log data in the current time period to obtain cleaned log data in the current time period;
the step of performing aggregation processing on the log data in the current time period to obtain a first log data set in the current time period includes:
and carrying out aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
4. A Hadoop-based log data mining system, comprising:
the first storage module is used for storing the acquired first log data set in the current time period into a Hadoop database;
the parallel aggregation module is used for performing parallel aggregation processing on the first log data set in the Hadoop database by using a preset parallel operation model to obtain a second log data set if the number of the first log data sets stored in the Hadoop database meets a preset numerical value;
the division and storage module is used for carrying out dimension division on the log data in the second log data set according to the dimensions of the log data in the second log data set and storing the obtained third log data sets corresponding to different dimensions into the Hadoop database;
the reading module is used for reading a third log data set corresponding to a query dimension from the Hadoop database according to the query dimension contained in the data query instruction if the data query instruction is received;
the clustering module is used for carrying out user grouping on the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
the acquisition and display module is used for acquiring a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list and displaying the level configuration table on a display interface; the user dimension is preset, and the level configuration table comprises the level determined by the user in the user grouping list according to the user dimension in a grading way.
5. The system of claim 4, further comprising:
the acquisition module is used for acquiring the log data in the current time period from a network side;
and the first aggregation module is used for performing aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
6. The system of claim 5, further comprising a cleaning module;
the cleaning module is used for cleaning the log data in the current time period after the acquisition module acquires the log data in the current time period to obtain the cleaned log data in the current time period;
and the first aggregation module is specifically configured to perform aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
CN201510875453.3A 2015-12-02 2015-12-02 Hadoop-based log data mining method and system Active CN106815274B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510875453.3A CN106815274B (en) 2015-12-02 2015-12-02 Hadoop-based log data mining method and system
PCT/CN2016/097363 WO2017092444A1 (en) 2015-12-02 2016-08-30 Log data mining method and system based on hadoop

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510875453.3A CN106815274B (en) 2015-12-02 2015-12-02 Hadoop-based log data mining method and system

Publications (2)

Publication Number Publication Date
CN106815274A CN106815274A (en) 2017-06-09
CN106815274B true CN106815274B (en) 2022-02-18

Family

ID=58796202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510875453.3A Active CN106815274B (en) 2015-12-02 2015-12-02 Hadoop-based log data mining method and system

Country Status (2)

Country Link
CN (1) CN106815274B (en)
WO (1) WO2017092444A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391645B (en) * 2017-07-12 2018-04-10 广州市昊链信息科技股份有限公司 A kind of logistics information automatic push and practical operation specification form system and method
CN107241231B (en) * 2017-07-26 2020-04-03 成都科来软件有限公司 Rapid and accurate positioning method for original network data packet
CN112287208B (en) * 2019-09-30 2024-03-01 北京沃东天骏信息技术有限公司 User portrait generation method, device, electronic equipment and storage medium
WO2021102888A1 (en) * 2019-11-29 2021-06-03 京东方科技集团股份有限公司 Data processing device and method, and computer-readable storage medium
CN111597179B (en) * 2020-05-18 2023-12-05 北京思特奇信息技术股份有限公司 Method and device for automatically cleaning data, electronic equipment and storage medium
CN112632020B (en) * 2020-12-25 2022-03-18 中国电子科技集团公司第三十研究所 Log information type extraction method and mining method based on spark big data platform

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102685221A (en) * 2012-04-29 2012-09-19 华北电力大学(保定) Distributed storage and parallel mining method for state monitoring data
CN104182506A (en) * 2014-08-19 2014-12-03 浪潮(北京)电子信息产业有限公司 Log management method
CN104616092A (en) * 2014-12-16 2015-05-13 国家电网公司 Distributed log analysis based distributed mode handling method

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6732123B1 (en) * 1998-02-23 2004-05-04 International Business Machines Corporation Database recovery to any point in time in an online environment utilizing disaster recovery technology
US7552147B2 (en) * 2005-09-02 2009-06-23 International Business Machines Corporation System and method for minimizing data outage time and data loss while handling errors detected during recovery
CN100481077C (en) * 2006-01-12 2009-04-22 国际商业机器公司 Visual method and device for strengthening search result guide
KR20090050405A (en) * 2007-11-15 2009-05-20 한국전자통신연구원 Method and apparatus for classifying user behaviors based on the event log generated from the context aware system environment
CN101483557B (en) * 2009-03-03 2011-07-13 中兴通讯股份有限公司 Log statistic, storing method and system used for deep packet detection apparatus
US9178935B2 (en) * 2009-03-05 2015-11-03 Paypal, Inc. Distributed steam processing
CN103036921B (en) * 2011-09-29 2015-09-23 北京新媒传信科技有限公司 A kind of user behavior analysis system and method
US10223431B2 (en) * 2013-01-31 2019-03-05 Facebook, Inc. Data stream splitting for low-latency data access
US10069677B2 (en) * 2013-04-06 2018-09-04 Citrix Systems, Inc. Systems and methods to collect logs from multiple nodes in a cluster of load balancers
CN104301360B (en) * 2013-07-19 2019-03-12 阿里巴巴集团控股有限公司 A kind of method of logdata record, log server and system
US9569491B2 (en) * 2013-09-13 2017-02-14 Nec Corporation MISO (multistore-online-tuning) system
CN103955502B (en) * 2014-04-24 2017-07-28 科技谷(厦门)信息技术有限公司 A kind of visualization OLAP application realization method and system
CN104317958B (en) * 2014-11-12 2018-01-16 北京国双科技有限公司 A kind of real-time data processing method and system

Also Published As

Publication number Publication date
WO2017092444A1 (en) 2017-06-08
CN106815274A (en) 2017-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant