WO2017092444A1 - Hadoop-based log data mining method and system - Google Patents

Publication number
WO2017092444A1
WO2017092444A1 PCT/CN2016/097363 CN2016097363W WO2017092444A1 WO 2017092444 A1 WO2017092444 A1 WO 2017092444A1 CN 2016097363 W CN2016097363 W CN 2016097363W WO 2017092444 A1 WO2017092444 A1 WO 2017092444A1
Authority
WO
WIPO (PCT)
Prior art keywords
log data
data set
time period
current time
user
Application number
PCT/CN2016/097363
Other languages
English (en)
Chinese (zh)
Inventor
惠羿
熊伟
哈景楠
Original Assignee
中兴通讯股份有限公司
Application filed by 中兴通讯股份有限公司
Publication of WO2017092444A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the present invention relates to the field of computer data processing, and in particular, to a log data mining method and system based on Hadoop.
  • the main purpose of the embodiments of the present invention is to provide a log data mining method and system based on Hadoop, which aims to solve the technical problem that the traditional database has limited computing power, high storage cost, and cannot provide massive data mining.
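  • a Hadoop-based log data mining method provided by an embodiment of the present invention includes:
  • saving the first log data set obtained in the current time period to the Hadoop database;
  • if the number of first log data sets saved in the Hadoop database reaches a preset value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;
  • dividing the log data in the second log data set by dimension, according to the dimensions of the log data in the second log data set, and saving the resulting third log data sets corresponding to the different dimensions to the Hadoop database.
  • the method further includes, before the saving step: obtaining log data in the current time period from the network side, and performing aggregation processing on that log data to obtain the first log data set in the current time period.
  • the method further includes performing data cleaning on the log data in the current time period to obtain cleaned log data, in which case the step of performing the aggregation processing on the log data in the current time period includes aggregating the cleaned log data in the current time period to obtain the first log data set.
  • the method further includes:
  • if a data query instruction is received, reading the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction;
  • performing data analysis on the third log data set, and displaying the result of the data analysis on the display interface.
  • performing data analysis on the third log data set includes:
  • grouping the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
  • obtaining, according to the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, and displaying the level configuration table on the display interface, where the user dimensions are preset and the level configuration table includes the level assigned to each user in the user grouping list under each user dimension.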
  • an embodiment of the present invention further provides a log data mining system based on Hadoop, including:
  • a first saving module configured to save the first log data set obtained in the current time period to the Hadoop database;
  • a parallel aggregation module configured to: if the number of first log data sets saved in the Hadoop database reaches a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;
  • a division and saving module configured to divide the log data in the second log data set by dimension, according to the dimensions of the log data in the second log data set, and save the resulting third log data sets corresponding to the different dimensions to the Hadoop database.
  • the system further includes:
  • an obtaining module configured to obtain log data in the current time period from the network side;
  • a first aggregation module configured to perform aggregation processing on the log data in the current time period to obtain the first log data set in the current time period.
  • the system further includes a cleaning module
  • the cleaning module is configured to perform data cleaning on the log data in the current time period after the obtaining module obtains the log data in the current time period, and obtain the cleaned log data in the current time period;
  • the first aggregation module is specifically configured to perform aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
  • the system further includes:
  • the reading module is configured to: if the data query instruction is received, read the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction;
  • the analysis module is configured to perform data analysis on the third log data set and display the result of the data analysis on the display interface.
  • the analysis module includes:
  • a clustering module configured to group the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
  • an obtaining and display module configured to obtain, according to the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, and to display the level configuration table on the display interface, where the user dimensions are preset and the level configuration table includes the level assigned to each user in the user grouping list under each user dimension.
  • a computer storage medium is further provided, and the computer storage medium may store an execution instruction for executing the Hadoop-based log data mining method in the foregoing embodiment.
  • the embodiment of the present invention provides a Hadoop-based log data mining method that saves the first log data set obtained in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database reaches a preset value, the first log data sets in the Hadoop database are aggregated in parallel by using a preset parallel computing model to obtain a second log data set. The log data in the second log data set is then divided by dimension, according to the dimensions of that log data, and the resulting third log data sets corresponding to the different dimensions are saved to the Hadoop database, completing the mining of the log data.
  • because the Hadoop database has good distributed storage and parallel computing capabilities, log data can be stored in a distributed manner in the Hadoop database and processed in parallel with the parallel computing model, so that massive data can be mined quickly and efficiently and the storage and computing requirements of mining massive data are met.
  • FIG. 1 is a schematic flowchart of a Hadoop-based log data mining method according to a first embodiment of the present invention
  • FIG. 2 is a schematic flow chart of an additional step before step 101 of the first embodiment of FIG. 1;
  • FIG. 3 is a schematic flow chart of an additional step after step 103 of the first embodiment in FIG. 1;
  • FIG. 4 is a schematic diagram of functional modules of a Hadoop-based log data mining system according to a second embodiment of the present invention.
  • FIG. 5 is a schematic diagram of additional functional modules in the second embodiment of FIG. 4;
  • FIG. 6 is a schematic diagram of additional functional modules in the second embodiment of FIG. 4.
  • the invention provides a Hadoop-based log data mining method that saves the first log data set obtained in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database reaches a preset value, a preset parallel computing model is used to perform parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is divided by dimension, according to the dimensions of that log data, and the third log data sets corresponding to the different dimensions are saved to the Hadoop database, completing the mining of the log data.
  • because the Hadoop database has good distributed storage and parallel computing capabilities, log data can be stored in a distributed manner in the Hadoop database and processed in parallel with the preset parallel computing model in Hadoop, so that massive data can be mined quickly and efficiently and the storage and computing needs of mining massive data are met.
  • FIG. 1 is a schematic flowchart of a Hadoop-based log data mining method according to a first embodiment of the present invention, including:
  • Step 101: Save the first log data set obtained in the current time period to the Hadoop database.
  • the Hadoop-based log data mining method can be applied to a Hadoop-based log data mining system (hereinafter referred to as the mining system), and the mining system saves the first log data set obtained in the current time period to the Hadoop database.
  • the mining system acquires the first log data set per time period. For example, if the time period is 15 minutes, the mining system acquires the first log data set of the current 15-minute period; if the time period is 30 minutes, it acquires the first log data set of the current 30-minute period.
  • the time period is a period for acquiring data, and the duration of the time period may be determined according to the size of the data amount.
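  • As an illustration only, the assignment of a log record to its acquisition period can be sketched as follows. The helper name `period_start` and the choice to align periods to midnight are assumptions made for this sketch; the source only states that data is acquired per fixed-length time period.

```python
from datetime import datetime, timedelta


def period_start(ts: datetime, period_minutes: int) -> datetime:
    """Return the start of the fixed-length time period that contains ts.

    Periods are aligned to midnight of the same day (an assumption).
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    minutes_since_midnight = ts.hour * 60 + ts.minute
    offset = (minutes_since_midnight // period_minutes) * period_minutes
    return midnight + timedelta(minutes=offset)


ts = datetime(2016, 8, 30, 10, 50, 12)
print(period_start(ts, 15))  # start of the 15-minute period containing ts
print(period_start(ts, 30))  # start of the 30-minute period containing ts
```

  • Records that map to the same period start would end up in the same first log data set.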
  • Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS).
  • the core of the Hadoop framework is the Hadoop database and the parallel computing model: the Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation over massive data.
  • the parallel computing model is the MapReduce computing model.
  • Step 102: If the number of first log data sets saved in the Hadoop database reaches a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set.
  • in the embodiment of the present invention, the mining system saves the acquired first log data set to the Hadoop database in each time period. If the number of first log data sets saved in the Hadoop database reaches the preset value, the first log data sets in the Hadoop database are aggregated in parallel by using the preset parallel computing model in the Hadoop framework to obtain a second log data set.
  • the value may be preset according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour need to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day need to be aggregated, the preset value is 48.
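  • The arithmetic behind these two preset values can be checked with a short sketch. The function name `preset_count` is hypothetical, and requiring the aggregation window to be a whole multiple of the time period is an assumption of this sketch.

```python
def preset_count(period_minutes: int, window_minutes: int) -> int:
    """Number of first log data sets that make up one aggregation window."""
    if window_minutes % period_minutes != 0:
        raise ValueError("window must be a whole multiple of the time period")
    return window_minutes // period_minutes


print(preset_count(15, 60))       # 4: one hour of 15-minute periods
print(preset_count(30, 24 * 60))  # 48: one day of 30-minute periods
```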
  • the mining system can also obtain log data sets over different time spans in a similar manner. For example, the first log data sets of 15-minute time periods can be used to obtain the log data set for one hour, the log data sets of 24 hours can be used to obtain the log data set for one day, the log data sets of 30 days can be used to obtain the log data set for one month, and so on, so that log data sets over different time spans are available to meet different needs.
  • when the mining system performs parallel aggregation processing by using the preset parallel computing model, the count values of identical log data are accumulated.
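  • The accumulation of count values during parallel aggregation can be sketched in plain Python in the style of a map phase and a reduce phase. This is not Hadoop API code: the functions `map_phase`, `reduce_phase`, and `parallel_aggregate`, and the `(user, content)` key layout, are assumptions chosen for illustration, and the sketch runs sequentially rather than on a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(first_set):
    """Emit (key, count) pairs from one first log data set."""
    for key, count in first_set.items():
        yield key, count


def reduce_phase(pairs):
    """Accumulate the counts of identical keys."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def parallel_aggregate(first_sets):
    """Merge several first log data sets into one second log data set."""
    return reduce_phase(chain.from_iterable(map_phase(s) for s in first_sets))


# Four 15-minute first log data sets that together cover one hour.
first_sets = [
    {("user1", "weibo"): 3, ("user2", "music"): 1},
    {("user1", "weibo"): 2},
    {("user2", "music"): 4, ("user3", "movies"): 1},
    {("user1", "weibo"): 1},
]
second_set = parallel_aggregate(first_sets)
print(second_set)
```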
  • Step 103: Divide the log data in the second log data set by dimension, according to the dimensions of the log data in the second log data set, and save the resulting third log data sets corresponding to the different dimensions to the Hadoop database.
  • in the embodiment of the present invention, the mining system divides the log data in the second log data set according to the dimensions of that log data and saves the resulting third log data sets corresponding to the different dimensions in the Hadoop database, thereby mining the massive log data. The saved third log data sets can serve as the data source for user data queries and support chart and graphical queries and multi-dimensional queries on the display interface, so that the data can be presented from multiple angles.
  • the dimensions may include Internet content, Internet access location, and Internet access time. Internet content refers to the site the user browses; the site may be a specific website, for example Baidu, Sohu, or Sina Weibo, or a type of website, such as music or movies.
  • the Internet access location refers to the geographical location of the IP address used by the user.
  • the Internet access time refers to the time when the log data is generated. Dimensions are divided according to the requirements of the system, and the data in each dimension is used to further characterize the overall behavior of users. It should be noted that the dimensions of the log data differ for different types of log data.
  • in addition to the above Internet content, Internet access location, and Internet access time, the dimensions may include Internet access frequency, user age, monthly consumption, and the like. In practical applications, the dimensions may therefore be divided according to specific needs, which is not limited herein.
  • the third log data set corresponding to the different dimensions may be saved in the column storage array. It enables the collaborative work of the Hadoop database and the column storage array to meet the data requirements of different application scenarios.
  • since the mining system performs the parallel aggregation processing and the dimension division only when the number of first log data sets saved in the Hadoop database reaches the preset value, each resulting third log data set corresponds to a time span.
  • when saving, the mining system can therefore save the correspondence among the dimension, the time span, and the third log data set.
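  • A minimal sketch of the dimension division and of the saved correspondence is given below. The record fields, the dimension names in `DIMENSIONS`, the period label, and the decision to store per-value totals in each third log data set are all assumptions; the source does not specify the storage layout.

```python
from collections import defaultdict

# Hypothetical dimension names; the source names Internet content,
# Internet access location, and Internet access time as examples.
DIMENSIONS = ("content", "location", "access_hour")


def divide_by_dimension(second_set, period, dimensions=DIMENSIONS):
    """Build one third log data set per dimension, keyed by (dimension, period)."""
    store = {}
    for dim in dimensions:
        totals = defaultdict(int)
        for record in second_set:
            totals[record[dim]] += record["count"]
        store[(dim, period)] = dict(totals)
    return store


second_set = [
    {"user": "u1", "content": "weibo", "location": "Nanjing", "access_hour": 10, "count": 6},
    {"user": "u2", "content": "music", "location": "Nanjing", "access_hour": 10, "count": 5},
    {"user": "u3", "content": "weibo", "location": "Shenzhen", "access_hour": 11, "count": 1},
]
third_sets = divide_by_dimension(second_set, period="2016-08-30T10")
for key, data in third_sets.items():
    print(key, data)
```

  • Keying each third log data set by both dimension and time span keeps the correspondence that a later query needs.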
  • in the embodiment of the present invention, the mining system saves the first log data set obtained in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database reaches the preset value, the mining system performs parallel aggregation processing on them by using the preset parallel computing model to obtain a second log data set, divides the log data in the second log data set by dimension according to the dimensions of that log data, and saves the resulting third log data sets corresponding to the different dimensions to the Hadoop database, completing the mining of the log data.
  • because the Hadoop database has good distributed storage and parallel computing capabilities, log data can be stored in a distributed manner in the Hadoop database and processed in parallel with the Hadoop parallel computing model, so that massive data can be mined quickly and efficiently and the storage and computing requirements of mining massive data are met.
  • FIG. 2 is a schematic flowchart of an additional step before step 101 in the first embodiment shown in FIG. 1 of the present invention, including:
  • Step 201: Obtain log data in the current time period from the network side.
  • in the embodiment of the present invention, the mining system obtains the log data in the current time period from the network side, and may do so in several ways.
  • the log data in the current time period may be obtained from the network side by using web crawler technology, or obtained from the BOSS billing database on the network side, or received from a third-party vendor on the network side, or obtained by combining at least two of these methods.
  • Step 202: Perform aggregation processing on the log data in the current time period to obtain the first log data set in the current time period.
  • the mining system after acquiring the log data in the current time period, performs aggregation processing on the log data in the current time period to obtain a first log data set of the current time period.
  • the aggregation in step 202 may be performed according to the content of the log data: log data with the same content, or with content belonging to the same class, is accumulated into a single piece of data. The first log data set obtained after the aggregation is orders of magnitude smaller than the log data obtained in the current time period, while the meaning of the data is fully preserved.
  • in the embodiment of the present invention, by obtaining the first log data set through the additional steps shown in FIG. 2, the mining system effectively reduces the order of magnitude of the log data obtained from the network side in the current time period, which reduces the storage space required in the Hadoop database.
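  • The per-period aggregation can be illustrated with a small sketch that counts identical records. Treating the pair `(user, content)` as the record content, and the function name `aggregate_period`, are assumptions for this example.

```python
from collections import Counter


def aggregate_period(raw_records):
    """Collapse raw log records with identical content into one counted entry."""
    return Counter((r["user"], r["content"]) for r in raw_records)


raw = [
    {"user": "u1", "content": "weibo"},
    {"user": "u1", "content": "weibo"},
    {"user": "u2", "content": "music"},
    {"user": "u1", "content": "weibo"},
]
first_set = aggregate_period(raw)
print(len(raw), "raw records ->", len(first_set), "aggregated entries")
print(first_set)
```

  • The counts preserve how often each piece of content occurred, which is why the meaning of the data survives the reduction in volume.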
  • the mining system may further perform the following steps before performing step 202:
  • the mining system may perform data cleaning on the log data in the current time period to obtain the cleaned log data in the current time period.
  • in this case step 202 is adapted accordingly, as follows:
  • the cleaned log data in the current time period is aggregated to obtain the first log data set in the current time period.
  • the cleaning of the log data may consist of removing data whose type does not meet preset requirements.
  • in the embodiment of the present invention, the mining system performs data cleaning on the log data in the current time period, so that useless or erroneous log data is removed, the amount of log data to be processed is reduced, and data mining is facilitated.
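  • A hedged sketch of such a cleaning pass follows. The concrete rules (the required fields in `REQUIRED_FIELDS` and the accepted record types in `ALLOWED_TYPES`) are assumptions; the source only says that data not meeting preset requirements is removed.

```python
# Hypothetical cleaning rules, chosen only for illustration.
REQUIRED_FIELDS = ("user", "content", "timestamp")
ALLOWED_TYPES = {"http", "https"}


def clean(records):
    """Drop records with missing required fields or a disallowed type."""
    cleaned = []
    for record in records:
        if any(not record.get(field) for field in REQUIRED_FIELDS):
            continue  # incomplete record
        if record.get("type") not in ALLOWED_TYPES:
            continue  # record type outside the preset set
        cleaned.append(record)
    return cleaned


records = [
    {"user": "u1", "content": "weibo", "timestamp": "2016-08-30T10:01", "type": "http"},
    {"user": "", "content": "music", "timestamp": "2016-08-30T10:02", "type": "http"},
    {"user": "u3", "content": "movies", "timestamp": "2016-08-30T10:03", "type": "ftp"},
]
cleaned = clean(records)
print(len(records), "records ->", len(cleaned), "after cleaning")
```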
  • FIG. 3 is a schematic flowchart of an additional step after step 103 in the first embodiment shown in FIG. 1 , which includes:
  • Step 301: If a data query instruction is received, read the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction.
  • in the embodiment of the present invention, the user may request data by inputting a data query instruction. If the mining system receives the data query instruction, it reads the third log data set corresponding to the query dimension contained in the instruction from the Hadoop database.
  • the data query instruction may further include a certain time period, and the mining system will read the third log data set corresponding to the query dimension in the time period.
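  • A query by dimension with an optional time span can be sketched against an in-memory stand-in for the saved correspondence. The `(dimension, period)` key layout, the period labels, and the function name `query` are assumptions of this sketch, not the system's actual storage interface.

```python
def query(store, dimension, period=None):
    """Return the third log data sets for a dimension, optionally for one period."""
    return {
        p: data
        for (dim, p), data in store.items()
        if dim == dimension and (period is None or p == period)
    }


# In-memory stand-in for the third log data sets saved in the Hadoop database.
store = {
    ("content", "2016-08-30T10"): {"weibo": 7, "music": 5},
    ("content", "2016-08-30T11"): {"weibo": 2},
    ("location", "2016-08-30T10"): {"Nanjing": 11, "Shenzhen": 1},
}
print(query(store, "content"))
print(query(store, "content", period="2016-08-30T11"))
```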
  • Step 302: Perform data analysis on the third log data set, and display the result of the data analysis on the display interface.
  • in the embodiment of the present invention, the mining system further performs data analysis on the third log data set and displays the result of the data analysis on the display interface.
  • specifically, the mining system groups the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list, obtains a level configuration table corresponding to at least two user dimensions according to the log data of the users in the user grouping list, and displays the level configuration table on the display interface. The user dimensions are preset, and the level configuration table includes the level assigned to each user in the user grouping list under each user dimension.
  • user dimensions can be divided into horizontal dimensions and vertical dimensions, and users are rated in different dimensions.
  • for example, the user groups obtained by the mining system include an all-users group and a microblog (Weibo) user group.
  • in the all-users group, all users are ranked according to the traffic volume they use: the top 20% are five-star users, the top 20% to 40% are four-star users, and so on, which determines the star rating of each user in the all-users group.
  • in the Weibo user group, users are ranked according to the traffic generated after the user opens Weibo: the top 20% are five-star users, the top 20% to 40% are four-star users, and so on, which determines the star rating of each user in the Weibo user group.
  • through the horizontal-dimension rating and the vertical-dimension rating, a profile of each user group can be presented, so that business experts can devise a targeted solution for a specific group profile.
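  • The star rating described above can be sketched as a ranking into five equal bands. Extending "and so on" down to one star for the lowest 20%, the function name `star_ratings`, and the shape of the resulting level table are assumptions of this sketch.

```python
def star_ratings(traffic_by_user):
    """Rank users by traffic (descending) and map each 20% band to 5..1 stars."""
    ranked = sorted(traffic_by_user, key=traffic_by_user.get, reverse=True)
    n = len(ranked)
    return {user: 5 - (i * 5) // n for i, user in enumerate(ranked)}


all_traffic = {"u1": 900, "u2": 700, "u3": 500, "u4": 300, "u5": 100}
weibo_traffic = {"u1": 10, "u2": 80, "u3": 40, "u4": 60, "u5": 20}

all_stars = star_ratings(all_traffic)
weibo_stars = star_ratings(weibo_traffic)

# Level table with one column per user dimension (hypothetical column names).
level_table = {
    user: {"all_users": all_stars[user], "weibo": weibo_stars[user]}
    for user in all_traffic
}
for user, levels in level_table.items():
    print(user, levels)
```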
  • the preset clustering algorithm may be a K-means algorithm.
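  • As a rough illustration of grouping users with K-means, the sketch below clusters users on a single feature. The source names only the algorithm; the feature (traffic volume), the value of `k`, the deterministic initialisation, and the function name `kmeans_1d` are assumptions made here.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on one-dimensional data with a deterministic start."""
    ordered = sorted(values)
    # Spread the initial centroids over the sorted values.
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Keep the old centroid when a cluster ends up empty.
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
    return labels, centroids


traffic = {"u1": 1, "u2": 2, "u3": 3, "u4": 100, "u5": 101, "u6": 102}
labels, centroids = kmeans_1d(list(traffic.values()), k=2)
grouping = dict(zip(traffic, labels))
print(grouping)
print(centroids)
```

  • In practice more than one feature would likely be used; a single feature keeps the sketch short.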
  • the query dimension is set according to a dimension corresponding to the third log data set saved in the Hadoop database.
  • the query dimension may be any one or any combination of Internet content, Internet access time, and Internet access location.
  • the mining system reads the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction, and performs data analysis on the third log data set, and The results of the data analysis are displayed on the display interface, so that the results of the data mining can be effectively displayed to the user.
  • the log data mining method based on the Hadoop database can be applied to a precision marketing system for traffic data. For example, the technical solutions described in the embodiments shown in FIG. 1 to FIG. 3 can be used to mine target users, marketing sites, and the like, providing operators with a data basis for targeted and refined marketing to target users or target base station cells.
  • for example, if target users need to be determined, the query dimension may be the Internet content or the Internet traffic; if the target base station cell needs to be determined, the query dimension may be the Internet access location.
  • the user can select the query dimension according to specific needs, which is not limited here.
  • FIG. 4 is a schematic diagram of functional modules of a Hadoop-based log data mining system according to a second embodiment of the present invention, including:
  • the first saving module 401 is configured to save the first log data set in the acquired current time period to the Hadoop database;
  • the mining system acquires the first log data set per time period. For example, if the time period is 15 minutes, the mining system acquires the first log data set of the current 15-minute period; if the time period is 30 minutes, it acquires the first log data set of the current 30-minute period.
  • the time period is a period for acquiring data, and the duration of the time period may be determined according to the size of the data amount.
  • Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS).
  • the core of the Hadoop framework is the Hadoop database and the parallel computing model: the Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation over massive data.
  • the parallel computing model is the MapReduce computing model.
  • the parallel aggregation module 402 is configured to: if the number of first log data sets saved in the Hadoop database reaches a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;
  • the value may be preset according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour need to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day need to be aggregated, the preset value is 48.
  • the parallel aggregation module 402 can also obtain log data sets over different time spans in a similar manner. For example, the first log data sets of 15-minute time periods can be used to obtain the log data set for one hour, the log data sets of 24 hours can be used to obtain the log data set for one day, the log data sets of 30 days can be used to obtain the log data set for one month, and so on, so that log data sets over different time spans are available to meet different needs.
  • the division and saving module 403 is configured to divide the log data in the second log data set by dimension, according to the dimensions of the log data in the second log data set, and save the resulting third log data sets corresponding to the different dimensions to the Hadoop database.
  • the dimensions may include Internet content, Internet access location, and Internet access time. Internet content refers to the site the user browses; the site may be a specific website, for example Baidu, Sohu, or Sina Weibo, or a type of website, such as music or movies.
  • the Internet access location refers to the geographical location of the IP address used by the user.
  • the Internet access time refers to the time when the log data is generated. Dimensions are divided according to the requirements of the system, and the data in each dimension is used to further characterize the overall behavior of users.
  • the dimensions of the log data differ for different types of log data. In addition to the above Internet content, Internet access location, and Internet access time, the dimensions may include Internet access frequency, user age, monthly consumption, and the like. In practical applications, the dimensions may therefore be divided according to specific needs, which is not limited herein.
  • the third log data set corresponding to the different dimensions may be saved in the column storage array. It enables the collaborative work of the Hadoop database and the column storage array to meet the data requirements of different application scenarios.
  • in the embodiment of the present invention, the first saving module 401 saves the first log data set obtained in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database reaches the preset value, the parallel aggregation module 402 performs parallel aggregation processing on the first log data sets in the Hadoop database by using the preset parallel computing model to obtain a second log data set. The division and saving module 403 then divides the log data in the second log data set by dimension, according to the dimensions of that log data, and saves the resulting third log data sets corresponding to the different dimensions in the Hadoop database.
  • in the embodiment of the present invention, the mining system saves the first log data set obtained in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database reaches the preset value, the mining system performs parallel aggregation processing on them by using the parallel computing model in Hadoop to obtain a second log data set, divides the log data in the second log data set by dimension according to the dimensions of that log data, and saves the resulting third log data sets corresponding to the different dimensions to the Hadoop database, completing the mining of the log data.
  • because the Hadoop database has good distributed storage and parallel computing capabilities, log data can be stored in a distributed manner in the Hadoop database and processed in parallel with the Hadoop parallel computing model, so that massive data can be mined quickly and efficiently and the storage and computing requirements of mining massive data are met.
  • FIG. 5 is a schematic diagram of the functional modules added in the second embodiment shown in FIG. 4, including:
  • the obtaining module 501 is configured to acquire log data in the current time period from the network side;
  • in the embodiment of the present invention, the obtaining module 501 obtains the log data in the current time period from the network side, and may do so in several ways.
  • the log data in the current time period may be obtained from the network side by using web crawler technology, or obtained from the BOSS billing database on the network side, or received from a third-party vendor on the network side, or obtained by combining at least two of these methods.
  • the first aggregation module 502 is configured to perform aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
  • the first aggregation module 502 may aggregate according to the content of the log data: log data with the same content, or with content belonging to the same class, is accumulated into a single piece of data. The first log data set obtained by the aggregation is orders of magnitude smaller than the log data obtained in the current time period, while the meaning of the data is fully preserved.
  • the mining system starts to execute the first saving module 401 in the embodiment shown in FIG. 4 only after executing the first aggregation module 502.
  • the system further includes a cleaning module 503;
  • the cleaning module 503 is configured to perform data cleaning on the log data in the current time period after the obtaining module 501 obtains the log data in the current time period, and obtain the cleaned log data in the current time period;
  • the first aggregation module 502 is configured to perform aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
  • in the embodiment of the present invention, by obtaining the first log data set through the additional modules described above, the mining system effectively reduces the order of magnitude of the log data obtained from the network side in the current time period, which reduces the storage space required in the Hadoop database.
  • the mining system can also perform data cleaning on the log data in the current time period, so that useless or erroneous log data is removed, the amount of log data to be processed is reduced, and data mining is facilitated.
  • FIG. 6 is a schematic diagram of the functional modules added in the second embodiment shown in FIG. 4, including:
  • the reading module 601 is configured to: if a data query instruction is received, read the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction;
  • the analyzing module 602 is configured to perform data analysis on the third log data set and display the result of the data analysis on the display interface.
  • the analysis module 602 includes:
  • the clustering module 603 is configured to group the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
  • the obtaining display module 604 is configured to obtain, according to the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, and to display the level configuration table on the display interface; the user dimensions are preset.
  • the level configuration table includes the level determined for each user in the user grouping list according to the user dimension.
  • User dimensions can be divided into horizontal dimensions and vertical dimensions, and users are rated separately in each dimension.
  • The user groups obtained by the mining system include an all-users group and a Weibo (microblog) users group.
  • In the all-users group, all users in the group are ranked by the traffic volume they have used: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of each user in the all-users group.
  • In the Weibo users group, users are ranked by the traffic generated after they start Weibo: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of each user in the Weibo users group.
  • Through the horizontal dimension rating and the vertical dimension rating, a portrait of each user group can be presented, so that a business expert can devise a targeted solution for a specific group portrait.
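  • The 20% banding described above can be sketched as follows. The function name `star_ratings`, the input format (a mapping from user to traffic volume), and the tie-breaking behaviour are assumptions for illustration; the source only states the percentile bands.

```python
def star_ratings(traffic_by_user, bands=5):
    """Rank users by traffic (descending) and assign 5 stars to the top 20%,
    4 stars to the next 20%, and so on down to 1 star.
    Ties are broken by input order, which is an arbitrary choice."""
    ranked = sorted(traffic_by_user, key=traffic_by_user.get, reverse=True)
    n = len(ranked)
    ratings = {}
    for rank, user in enumerate(ranked):  # rank 0 is the heaviest user
        band = min(bands - 1, rank * bands // n)  # 0 for the top 20%, 1 for the next 20%, ...
        ratings[user] = bands - band
    return ratings

# Hypothetical traffic volumes for ten users in one group; u0 is the heaviest user.
traffic = {f"u{i}": 100 - i for i in range(10)}
ratings = star_ratings(traffic)
```

  • Applying the same function once to total traffic in the all-users group and once to Weibo traffic in the Weibo users group yields the two ratings that make up the level configuration table.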
  • the preset clustering algorithm may be a K-means algorithm.
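  • As an illustration of how a K-means algorithm could produce the user grouping list, the following is a minimal pure-Python sketch. The two features per user (for example, total traffic and Weibo traffic), the choice of the first k points as initial centroids, and the fixed iteration count are all assumptions; the source does not describe the features or the initialization.

```python
def kmeans(points, k, iterations=20):
    """Plain K-means: alternate an assignment step and a centroid update step."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic initialization (assumption)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Hypothetical per-user feature vectors.
users = ["u1", "u2", "u3", "u4", "u5", "u6"]
features = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]

assignment, centroids = kmeans(features, k=2)
user_grouping_list = {}
for user, group in zip(users, assignment):
    user_grouping_list.setdefault(group, []).append(user)
```

  • A production system would more likely run a distributed K-means implementation over the data held in Hadoop; this sketch only shows the grouping logic itself.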
  • the query dimension is set according to a dimension corresponding to the third log data set saved in the Hadoop database.
  • The query dimension may be any one or any combination of the Internet access content, the Internet access time, and the Internet access location.
  • The mining system reads the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction, performs data analysis on the third log data set, and displays the results of the data analysis on the display interface, so that the results of the data mining can be effectively presented to the user.
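  • The query flow handled by the reading module 601 and the analysis module 602 can be sketched as below. The in-memory dictionary stands in for the Hadoop database, and the instruction format, the dimension names, and the toy analysis (heaviest entry per dimension) are hypothetical; none of them come from the source.

```python
# Hypothetical stand-in for the third log data sets saved per dimension in the Hadoop database.
THIRD_LOG_DATA_SETS = {
    "content": {"weibo": 150, "video": 500},
    "time": {9: 150, 10: 500},
    "location": {"cellA": 150, "cellB": 500},
}

def read_third_log_data_sets(query_instruction, store=THIRD_LOG_DATA_SETS):
    """Return the saved data set for each query dimension named in the instruction."""
    dims = query_instruction["dimensions"]
    unknown = [d for d in dims if d not in store]
    if unknown:
        raise KeyError(f"no third log data set saved for dimension(s): {unknown}")
    return {d: store[d] for d in dims}

def analyze(data_sets):
    """Toy analysis: report the entry with the most traffic in each dimension."""
    return {dim: max(values, key=values.get) for dim, values in data_sets.items()}

result = analyze(read_third_log_data_sets({"dimensions": ["content", "location"]}))
```

  • Because the data is already divided by dimension when it is saved, a query only has to read the sets that match its dimensions rather than scan all log data.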
  • Embodiments of the present invention also provide a storage medium.
  • the foregoing storage medium may be configured to store program code for performing the following steps:
  • the first log data sets in the Hadoop database are aggregated in parallel by the preset parallel computing model to obtain the second log data set;
  • the storage medium is further arranged to store program code for performing the following steps:
  • S2 Perform aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
  • The foregoing storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, and a magnetic memory.
  • The acquired first log data set for the current time period is saved in the Hadoop database. If the number of first log data sets saved in the Hadoop database satisfies a preset value, the preset parallel computing model performs parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is then divided by dimension according to the dimensions of that log data, and the resulting third log data sets corresponding to the different dimensions are saved to the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, using it for distributed storage of the log data and for parallel computation with the parallel computing model makes it possible to mine massive data quickly and efficiently and to meet the storage and computing requirements of massive data mining.
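  • The overall flow summarized above (threshold check, parallel aggregation, dimension division) can be sketched in Python. This is not Hadoop code: a thread pool stands in for the parallel computing model, and the threshold `PRESET_COUNT`, the key layout `(user, content, hour, location)`, and the dimension names are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

PRESET_COUNT = 3  # hypothetical preset value for the number of saved first log data sets
# Position of each dimension inside the assumed key tuple (user, content, hour, location).
DIMENSIONS = {"content": 1, "time": 2, "location": 3}

def merge(log_sets):
    """Sum the traffic of identical keys across several log data sets."""
    total = Counter()
    for s in log_sets:
        total.update(s)  # Counter.update adds counts for matching keys
    return total

def parallel_aggregate(first_sets, workers=2):
    """Merge chunks of first log data sets in parallel, then merge the partial results."""
    chunks = [first_sets[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(merge, chunks))
    return merge(partials)

def divide_by_dimension(second_set):
    """Build one third log data set per dimension from the second log data set."""
    third = {name: Counter() for name in DIMENSIONS}
    for key, nbytes in second_set.items():
        for name, idx in DIMENSIONS.items():
            third[name][key[idx]] += nbytes
    return third

def mine(first_sets):
    """Run the mining flow only once enough first log data sets have accumulated."""
    if len(first_sets) < PRESET_COUNT:
        return None
    return divide_by_dimension(parallel_aggregate(first_sets))

# Hypothetical first log data sets from three time periods.
first_sets = [
    {("u1", "weibo", 9, "cellA"): 100},
    {("u1", "weibo", 9, "cellA"): 50, ("u2", "video", 10, "cellB"): 400},
    {("u2", "video", 10, "cellB"): 100},
]
third_sets = mine(first_sets)
```

  • On a real cluster the merge step would correspond to a reduce phase run across nodes; the sketch keeps the same shape (partial merges followed by a final merge) so that the logic can be followed on a single machine.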

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a Hadoop-based log data mining method, comprising: saving an acquired first log data set for a current time period to a Hadoop database (101); if the number of first log data sets saved in the Hadoop database satisfies a preset value, performing parallel aggregation processing on the first log data sets in the Hadoop database by means of a preset parallel computing model, so as to obtain a second log data set (102); and, according to the dimensions of the log data in the second log data set, performing dimension division on the log data in the second log data set, and saving the resulting third log data sets corresponding to the different dimensions to the Hadoop database (103). The method makes it possible to mine massive data quickly and efficiently and to meet the storage and computing requirements of massive data mining.
PCT/CN2016/097363 2015-12-02 2016-08-30 Hadoop-based log data mining method and system WO2017092444A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510875453.3 2015-12-02
CN201510875453.3A CN106815274B (zh) 2015-12-02 2015-12-02 Hadoop-based log data mining method and system

Publications (1)

Publication Number Publication Date
WO2017092444A1 (fr) 2017-06-08

Family

ID=58796202

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/097363 WO2017092444A1 (fr) 2015-12-02 2016-08-30 Hadoop-based log data mining method and system

Country Status (2)

Country Link
CN (1) CN106815274B (fr)
WO (1) WO2017092444A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
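CN107241231A (zh) * 2017-07-26 2017-10-10 成都科来软件有限公司 Method for fast and accurate locating of original network data packets
CN111597179A (zh) * 2020-05-18 2020-08-28 北京思特奇信息技术股份有限公司 Method and apparatus for automatically cleaning data, electronic device, and storage medium
CN112287208A (zh) * 2019-09-30 2021-01-29 北京沃东天骏信息技术有限公司 User portrait generation method and apparatus, electronic device, and storage medium
CN112632020A (zh) * 2020-12-25 2021-04-09 中国电子科技集团公司第三十研究所 Log information type extraction method and mining method based on the Spark big data platform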

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391645B (zh) * 2017-07-12 2018-04-10 广州市昊链信息科技股份有限公司 System and method for automatic pushing of logistics information and formation of practical operation specifications

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
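CN101000607A (zh) * 2006-01-12 2007-07-18 国际商业机器公司 Visualization method and apparatus for enhancing search result navigation
CN102685221A (zh) * 2012-04-29 2012-09-19 华北电力大学(保定) Distributed storage and parallel mining method for condition monitoring data
CN103036921A (zh) * 2011-09-29 2013-04-10 北京新媒传信科技有限公司 User behavior analysis system and method
CN103955502A (zh) * 2014-04-24 2014-07-30 科技谷(厦门)信息技术有限公司 Application implementation method and system for visualized OLAP
CN104317958A (zh) * 2014-11-12 2015-01-28 北京国双科技有限公司 Real-time data processing method and system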

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6732123B1 (en) * 1998-02-23 2004-05-04 International Business Machines Corporation Database recovery to any point in time in an online environment utilizing disaster recovery technology
US7552147B2 (en) * 2005-09-02 2009-06-23 International Business Machines Corporation System and method for minimizing data outage time and data loss while handling errors detected during recovery
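KR20090050405A (ko) * 2007-11-15 2009-05-20 한국전자통신연구원 Method and apparatus for classifying user behavior based on event logs generated in a context-aware system environment
CN101483557B (zh) * 2009-03-03 2011-07-13 中兴通讯股份有限公司 Log statistics and storage method and system for deep packet inspection equipment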
US9178935B2 (en) * 2009-03-05 2015-11-03 Paypal, Inc. Distributed steam processing
US10223431B2 (en) * 2013-01-31 2019-03-05 Facebook, Inc. Data stream splitting for low-latency data access
US10069677B2 (en) * 2013-04-06 2018-09-04 Citrix Systems, Inc. Systems and methods to collect logs from multiple nodes in a cluster of load balancers
CN104301360B (zh) * 2013-07-19 2019-03-12 阿里巴巴集团控股有限公司 Log data recording method, log server, and system
US20150081668A1 (en) * 2013-09-13 2015-03-19 Nec Laboratories America, Inc. Systems and methods for tuning multi-store systems to speed up big data query workload
CN104182506A (zh) * 2014-08-19 2014-12-03 浪潮(北京)电子信息产业有限公司 Log management method
CN104616092B (zh) * 2014-12-16 2019-10-25 国家电网公司 Behavior pattern processing method based on distributed log analysis

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
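CN107241231A (zh) * 2017-07-26 2017-10-10 成都科来软件有限公司 Method for fast and accurate locating of original network data packets
CN107241231B (zh) * 2017-07-26 2020-04-03 成都科来软件有限公司 Method for fast and accurate locating of original network data packets
CN112287208A (zh) * 2019-09-30 2021-01-29 北京沃东天骏信息技术有限公司 User portrait generation method and apparatus, electronic device, and storage medium
CN112287208B (zh) * 2019-09-30 2024-03-01 北京沃东天骏信息技术有限公司 User portrait generation method and apparatus, electronic device, and storage medium
CN111597179A (zh) * 2020-05-18 2020-08-28 北京思特奇信息技术股份有限公司 Method and apparatus for automatically cleaning data, electronic device, and storage medium
CN111597179B (zh) * 2020-05-18 2023-12-05 北京思特奇信息技术股份有限公司 Method and apparatus for automatically cleaning data, electronic device, and storage medium
CN112632020A (zh) * 2020-12-25 2021-04-09 中国电子科技集团公司第三十研究所 Log information type extraction method and mining method based on the Spark big data platform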

Also Published As

Publication number Publication date
CN106815274B (zh) 2022-02-18
CN106815274A (zh) 2017-06-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 16869755; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 16869755; Country of ref document: EP; Kind code of ref document: A1