WO2017092444A1 - Log data mining method and system based on hadoop - Google Patents

Log data mining method and system based on Hadoop

Publication number
WO2017092444A1
Authority
WO
WIPO (PCT)
Prior art keywords
log data
data set
time period
current time
user
Prior art date
Application number
PCT/CN2016/097363
Other languages
French (fr)
Chinese (zh)
Inventor
惠羿
熊伟
哈景楠
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by ZTE Corporation (中兴通讯股份有限公司)
Publication of WO2017092444A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • The present invention relates to the field of computer data processing, and in particular to a Hadoop-based log data mining method and system.
  • The main purpose of the embodiments of the present invention is to provide a Hadoop-based log data mining method and system that solve the technical problem that a traditional database has limited computing power and high storage cost and cannot support the mining of massive data.
  • A Hadoop-based log data mining method provided by an embodiment of the present invention includes:
  • performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;
  • The method further includes:
  • The method further includes:
  • The step of performing aggregation processing on the log data in the current time period to obtain the first log data set in the current time period includes:
  • The method further includes:
  • reading the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction;
  • performing data analysis on the third log data set, and displaying the result of the data analysis on the display interface.
  • Performing data analysis on the third log data set includes:
  • obtaining, according to the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, where the user dimensions are preset and the level configuration table contains the level determined for each user in the user grouping list according to each user dimension.
  • An embodiment of the present invention further provides a Hadoop-based log data mining system, including:
  • a first saving module configured to save the first log data set obtained in the current time period to the Hadoop database;
  • a parallel aggregation module configured to, if the number of first log data sets saved in the Hadoop database satisfies a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;
  • a dividing and saving module configured to divide the log data in the second log data set by dimension, according to the dimensions of the log data in the second log data set, and to save the resulting third log data sets, each corresponding to a different dimension, to the Hadoop database.
  • The system further includes:
  • a first aggregation module configured to perform aggregation processing on the log data in the current time period to obtain the first log data set in the current time period.
  • The system further includes a cleaning module;
  • the cleaning module is configured to perform data cleaning on the log data in the current time period after the obtaining module obtains that log data, to obtain the cleaned log data in the current time period;
  • the first aggregation module is specifically configured to perform aggregation processing on the cleaned log data in the current time period to obtain the first log data set in the current time period.
  • The system further includes:
  • a reading module configured to, if a data query instruction is received, read the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction;
  • an analysis module configured to perform data analysis on the third log data set and display the result of the data analysis on the display interface.
  • The analysis module includes:
  • a clustering module configured to group the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
  • an obtaining and display module configured to obtain, according to the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, where the user dimensions are preset and the level configuration table contains the level determined for each user in the user grouping list according to each user dimension.
  • A computer storage medium is further provided; the computer storage medium may store execution instructions for executing the Hadoop-based log data mining method of the foregoing embodiment.
  • The embodiment of the present invention provides a Hadoop-based log data mining method that saves the first log data set obtained in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database satisfies a preset value, a preset parallel computing model performs parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is then divided by dimension, according to the dimensions of that log data, and the resulting third log data sets, each corresponding to a different dimension, are saved to the Hadoop database to complete the mining of the log data.
  • Because the Hadoop database has good distributed storage and parallel computing capabilities, it can be used for distributed storage of the log data and, together with the parallel computing model, for parallel computation. Massive data can therefore be mined quickly and efficiently, which meets the storage and computing requirements of massive data mining.
  • FIG. 1 is a schematic flowchart of a Hadoop-based log data mining method according to a first embodiment of the present invention;
  • FIG. 2 is a schematic flowchart of the additional steps before step 101 of the first embodiment shown in FIG. 1;
  • FIG. 3 is a schematic flowchart of the additional steps after step 103 of the first embodiment shown in FIG. 1;
  • FIG. 4 is a schematic diagram of the functional modules of a Hadoop-based log data mining system according to a second embodiment of the present invention;
  • FIG. 5 is a schematic diagram of additional functional modules in the second embodiment shown in FIG. 4;
  • FIG. 6 is a schematic diagram of additional functional modules in the second embodiment shown in FIG. 4.
  • The invention provides a Hadoop-based log data mining method that saves the first log data set obtained in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database satisfies a preset value, a preset parallel computing model performs parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is then divided by dimension, according to the dimensions of that log data, and the resulting third log data sets, each corresponding to a different dimension, are saved to the Hadoop database to complete the mining of the log data.
  • Because the Hadoop database has good distributed storage and parallel computing capabilities, it can be used for distributed storage of the log data and, using the preset parallel computing model in Hadoop, for parallel computation. Massive data can therefore be mined quickly and efficiently, which meets the storage and computing requirements of massive data mining.
  • FIG. 1 is a schematic flowchart of a Hadoop-based log data mining method according to a first embodiment of the present invention, including:
  • Step 101: Save the first log data set obtained in the current time period to the Hadoop database.
  • The Hadoop-based log data mining method can be applied to a Hadoop-based log data mining system (hereinafter referred to as the mining system), and the mining system saves the first log data set acquired in the current time period to the Hadoop database.
  • The mining system acquires the first log data set per time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set of the current 15-minute time period or of the current 30-minute time period.
  • The time period is the period at which data is acquired, and its duration may be determined according to the amount of data.
  • Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS).
  • The core of the Hadoop framework is the Hadoop database and the parallel computing model. The Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation over massive data.
  • The parallel computing model is the MapReduce computing model.
  • Step 102: If the number of first log data sets saved in the Hadoop database satisfies a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set.
  • The mining system saves the acquired first log data set to the Hadoop database in each time period. If the number of first log data sets saved in the Hadoop database satisfies the preset value, the first log data sets in the Hadoop database may be aggregated by using the preset parallel computing model of the Hadoop framework to obtain a second log data set.
  • The value may be preset according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour need to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day need to be aggregated, the preset value is 48.
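  • The arithmetic behind the two examples above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and reading "satisfies a preset value" as "at least that many sets have been saved" is an assumption.

```python
def preset_value(period_minutes: int, window_minutes: int) -> int:
    """Number of first log data sets that make up one aggregation window."""
    if period_minutes <= 0 or window_minutes <= 0:
        raise ValueError("period and window must be positive")
    if window_minutes % period_minutes != 0:
        raise ValueError("window must be a whole multiple of the period")
    return window_minutes // period_minutes


def ready_to_aggregate(saved_sets: int, period_minutes: int, window_minutes: int) -> bool:
    """Assumed trigger: enough first log data sets have been saved."""
    return saved_sets >= preset_value(period_minutes, window_minutes)
```

  • With a 15-minute period and a one-hour window, `preset_value(15, 60)` gives 4; with a 30-minute period and a one-day window, `preset_value(30, 24 * 60)` gives 48, matching the values in the text.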
  • In a similar manner, the mining system can also obtain log data sets covering different lengths of time. For example, the first log data sets of four 15-minute time periods yield the log data set for one hour; the log data sets for 24 hours yield the log data set for one day; the log data sets for 30 days yield the log data set for one month; and so on. Log data sets covering different lengths of time can thus be obtained to meet different needs.
  • When the mining system performs parallel aggregation processing using the preset parallel computing model, the count values of identical log data are accumulated.
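  • The accumulation of count values for identical log data follows the usual map, shuffle and reduce pattern. The sketch below imitates that pattern in plain Python on a single machine; it does not use the Hadoop API, and the record key `(user, content)` is a hypothetical choice made only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, str]      # hypothetical key: (user, content)
LogSet = Dict[Key, int]    # one log data set: key -> count value


def map_phase(first_sets: Iterable[LogSet]) -> Iterator[Tuple[Key, int]]:
    """Emit one (key, count) pair per entry of every first log data set."""
    for first_set in first_sets:
        for key, count in first_set.items():
            yield key, count


def shuffle(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, List[int]]:
    """Group emitted counts by key, as a framework does between map and reduce."""
    grouped: Dict[Key, List[int]] = defaultdict(list)
    for key, count in pairs:
        grouped[key].append(count)
    return grouped


def reduce_phase(grouped: Dict[Key, List[int]]) -> LogSet:
    """Accumulate the count values of identical log data."""
    return {key: sum(counts) for key, counts in grouped.items()}


def aggregate(first_sets: Iterable[LogSet]) -> LogSet:
    """Merge several first log data sets into one second log data set."""
    return reduce_phase(shuffle(map_phase(first_sets)))
```

  • In a real cluster the map and reduce functions would run in parallel on different nodes; the data flow is the same.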
  • Step 103: Divide the log data in the second log data set by dimension, according to the dimensions of the log data in the second log data set, and save the resulting third log data sets, each corresponding to a different dimension, to the Hadoop database.
  • The mining system divides the log data in the second log data set according to the dimensions of that log data and saves the resulting third log data sets to the Hadoop database, which realizes the mining of massive log data. The saved third log data sets can serve as the data source for user data queries and support chart queries, graphical queries and multi-dimensional queries on the display interface, so that the data can be presented from multiple angles.
  • The Internet access content refers to the site browsed by the user. The browsed site may be a specific website, for example Baidu, Sohu or Sina Weibo, or a category of website, such as music or movies.
  • The Internet access location refers to the geographical location of the IP address used by the user.
  • The Internet access time refers to the time at which the log data is generated. The division into dimensions is based on the requirements of the system, and the data in each dimension further characterizes the overall behavior of the user. It should be noted that the dimensions of the log data differ for different types of log data.
  • In addition to the above-mentioned Internet access content, Internet access location and Internet access time, the dimensions may include Internet access frequency, user age, monthly consumption, and the like. In practical applications the dimensions may therefore be chosen according to specific needs, which is not limited herein.
  • The third log data sets corresponding to the different dimensions may also be saved in a column storage array. This lets the Hadoop database and the column storage array work together to meet the data requirements of different application scenarios.
  • Since the mining system performs the parallel aggregation processing and the dimension division only when the number of first log data sets saved in the Hadoop database satisfies the preset value, each third log data set obtained actually corresponds to a time segment.
  • When saving, the mining system can therefore save the correspondence between the dimension, the time segment and the third log data set.
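  • As a rough sketch of step 103, the dimension division and the saved correspondence can be modelled with an in-memory dictionary standing in for the Hadoop database. The record layout (one field per dimension plus a `count` field), the function names and the time-segment label are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, object]                    # hypothetical record layout
ThirdSet = Dict[str, int]                     # dimension value -> accumulated count
Store = Dict[Tuple[str, str], ThirdSet]       # (dimension, time segment) -> third set


def divide_by_dimension(records: List[Record], dimensions: List[str]) -> Dict[str, ThirdSet]:
    """Build one third log data set per dimension from the second log data set."""
    third_sets: Dict[str, Dict[str, int]] = {d: defaultdict(int) for d in dimensions}
    for record in records:
        for dimension in dimensions:
            third_sets[dimension][str(record[dimension])] += int(record["count"])
    return {d: dict(values) for d, values in third_sets.items()}


def save_third_sets(store: Store, time_segment: str, third_sets: Dict[str, ThirdSet]) -> None:
    """Record the correspondence between dimension, time segment and third set."""
    for dimension, values in third_sets.items():
        store[(dimension, time_segment)] = values
```

  • Keying the store by `(dimension, time segment)` keeps each third log data set addressable by exactly the two attributes that a later query supplies.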
  • In this embodiment, the mining system saves the first log data set acquired in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database satisfies a preset value, the mining system uses the preset parallel computing model to perform parallel aggregation processing on the first log data sets in the Hadoop database and obtain a second log data set. The log data in the second log data set is then divided by dimension, according to the dimensions of that log data, and the resulting third log data sets, each corresponding to a different dimension, are saved to the Hadoop database to complete the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, it can be used for distributed storage of the log data and, using the Hadoop parallel computing model, for parallel computation. Massive data can therefore be mined quickly and efficiently, which meets the storage and computing requirements of massive data mining.
  • FIG. 2 is a schematic flowchart of the additional steps before step 101 in the first embodiment shown in FIG. 1 of the present invention, including:
  • Step 201: Obtain the log data in the current time period from the network side.
  • The mining system obtains the log data in the current time period from the network side.
  • The mining system may obtain this log data in several ways. The log data in the current time period may be obtained from the network side by using web crawler technology, or from the BOSS camp database on the network side, or the log data in the current time period provided by a third-party vendor on the network side may be accepted, or at least two of these methods may be combined.
  • Step 202: Perform aggregation processing on the log data in the current time period to obtain the first log data set in the current time period.
  • After acquiring the log data in the current time period, the mining system performs aggregation processing on it to obtain the first log data set of the current time period.
  • The aggregation in step 202 may be performed according to the content of the log data: log data with the same content, or with content belonging to the same class, is accumulated into a single piece of data. The first log data set obtained after aggregation is orders of magnitude smaller than the log data acquired in the current time period, while the meaning of the data is fully preserved.
  • By obtaining the first log data set through the additional steps shown in FIG. 2, the mining system effectively reduces the order of magnitude of the log data acquired from the network side in the current time period, which reduces the storage space required in the Hadoop database.
  • The mining system may further perform the following step before performing step 202:
  • The mining system may perform data cleaning on the log data in the current time period to obtain the cleaned log data in the current time period.
  • In this case step 202 is adjusted accordingly:
  • The cleaned log data in the current time period is aggregated to obtain the first log data set in the current time period.
  • Cleaning the log data may consist of removing data whose type does not meet preset requirements.
  • By performing data cleaning on the log data in the current time period, the mining system removes useless or erroneous log data, which reduces the amount of log data to be processed and facilitates data mining.
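  • Cleaning followed by aggregation might look like the sketch below. The cleaning rule (a record must have a non-empty `user` and `content` field) and the field names are hypothetical; the text only says that data not meeting preset requirements is removed.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

RawRecord = Dict[str, Optional[str]]
REQUIRED_FIELDS = ("user", "content")  # hypothetical preset cleaning rule


def is_valid(record: RawRecord) -> bool:
    """A record passes cleaning if every required field is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def clean(records: List[RawRecord]) -> List[RawRecord]:
    """Drop useless or erroneous log records."""
    return [record for record in records if is_valid(record)]


def aggregate_period(records: List[RawRecord]) -> Dict[Tuple[str, str], int]:
    """Collapse records with the same user and content into one entry with a count."""
    counts = Counter((str(r["user"]), str(r["content"])) for r in records)
    return dict(counts)
```

  • Calling `aggregate_period(clean(raw_records))` yields the first log data set of the period: one entry per distinct piece of log data, with the number of occurrences as its count value.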
  • FIG. 3 is a schematic flowchart of the additional steps after step 103 in the first embodiment shown in FIG. 1, including:
  • Step 301: If a data query instruction is received, read the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction.
  • The user may request data by entering a data query instruction. If the mining system receives the data query instruction, it reads the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the instruction.
  • The data query instruction may further contain a time segment, in which case the mining system reads the third log data set corresponding to the query dimension within that time segment.
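  • A query by dimension with an optional time segment can be sketched as a filter over stored third log data sets. The dictionary keyed by `(dimension, time segment)` is a stand-in for the Hadoop database, and the function name and key layout are assumptions for illustration only.

```python
from typing import Dict, Optional, Tuple

ThirdSet = Dict[str, int]
Store = Dict[Tuple[str, str], ThirdSet]  # (dimension, time segment) -> third set


def read_third_sets(
    store: Store, dimension: str, time_segment: Optional[str] = None
) -> Dict[str, ThirdSet]:
    """Return {time segment: third log data set} for the queried dimension.

    If time_segment is None, every stored segment for the dimension is returned.
    """
    return {
        segment: values
        for (dim, segment), values in store.items()
        if dim == dimension and (time_segment is None or segment == time_segment)
    }
```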
  • Step 302: Perform data analysis on the third log data set, and display the result of the data analysis on the display interface.
  • The mining system performs data analysis on the third log data set and displays the result of the data analysis on the display interface.
  • Specifically, the mining system groups the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list. It then obtains, according to the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, and displays the level configuration table on the display interface. The user dimensions are preset, and the level configuration table contains the level determined for each user in the user grouping list according to each user dimension.
  • User dimensions can be divided into horizontal dimensions and vertical dimensions, and users are rated in each dimension.
  • For example, the user groups obtained by the mining system include an all-users group and a Weibo (microblog) users group.
  • In the all-users group, all users are ranked by the traffic they use: the top 20% are five-star users, the top 20% to 40% are four-star users, and so on, which determines the star rating of each user in the all-users group.
  • In the Weibo users group, users are ranked by the traffic generated after the user starts Weibo: the top 20% are five-star users, the top 20% to 40% are four-star users, and so on, which determines the star rating of each user in the Weibo users group.
  • Through the horizontal and vertical dimension ratings, a portrait of each user group can be presented, so that business experts can devise targeted solutions for a specific group portrait.
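  • The star rating described above amounts to ranking users by traffic and cutting the ranking into five equal bands. A minimal sketch follows; the function name is hypothetical, and how ties and group sizes not divisible by five are handled is an assumption, since the text does not specify it.

```python
from typing import Dict


def star_ratings(traffic_by_user: Dict[str, float]) -> Dict[str, int]:
    """Rate users from 5 stars (top 20% by traffic) down to 1 star (bottom 20%)."""
    ranked = sorted(traffic_by_user, key=lambda user: traffic_by_user[user], reverse=True)
    total = len(ranked)
    # position is the 0-based rank; each fifth of the ranking loses one star.
    return {user: 5 - (position * 5) // total for position, user in enumerate(ranked)}
```

  • The same function can be applied once to the traffic of the all-users group and once to the Weibo traffic of the Weibo users group, which gives each user one rating per dimension.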
  • The preset clustering algorithm may be the K-means algorithm.
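  • For readers unfamiliar with K-means, the sketch below shows the standard assign-and-update loop in plain Python. It is a textbook version rather than the patent's implementation; the caller supplies the initial centroids, and the choice of features per user (for example total traffic and Weibo traffic) is a hypothetical one.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def mean(points: List[Point]) -> Point:
    return tuple(sum(values) / len(points) for values in zip(*points))


def k_means(points: List[Point], centroids: List[Point], max_iterations: int = 100) -> List[int]:
    """Return the cluster index assigned to each point."""
    centroids = list(centroids)
    assignment: List[int] = []
    for _ in range(max_iterations):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for index in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == index]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[index] = mean(members)
    return assignment
```

  • Users that receive the same cluster index form one entry of the user grouping list.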
  • The query dimension is set according to the dimensions of the third log data sets saved in the Hadoop database.
  • The query dimension may be any one or more of the Internet access content, the Internet access time and the Internet access location.
  • In this embodiment, the mining system reads the third log data set corresponding to the query dimension from the Hadoop database according to the query dimension contained in the data query instruction, performs data analysis on the third log data set, and displays the results of the data analysis on the display interface, so that the results of the data mining can be effectively presented to the user.
  • The Hadoop-based log data mining method can be applied to a precision marketing system for traffic data. For example, the technical solutions described in the embodiments shown in FIG. 1 and the following figures can be used to mine target users and marketing sites, giving operators a data basis for targeted and refined marketing to target users or target base station cells.
  • If target users need to be determined, the query dimension may be the Internet access content or the Internet traffic. If the target base station cell needs to be determined, the query dimension may be the Internet access location.
  • The user can select the query dimension according to specific needs, which is not limited herein.
  • FIG. 4 is a schematic diagram of the functional modules of a Hadoop-based log data mining system according to a second embodiment of the present invention, including:
  • The first saving module 401 is configured to save the first log data set acquired in the current time period to the Hadoop database;
  • The mining system acquires the first log data set per time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set of the current 15-minute time period or of the current 30-minute time period.
  • The time period is the period at which data is acquired, and its duration may be determined according to the amount of data.
  • Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS).
  • The core of the Hadoop framework is the Hadoop database and the parallel computing model. The Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation over massive data.
  • The parallel computing model is the MapReduce computing model.
  • The parallel aggregation module 402 is configured to, if the number of first log data sets saved in the Hadoop database satisfies a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;
  • The value may be preset according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour need to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day need to be aggregated, the preset value is 48.
  • In a similar manner, the parallel aggregation module 402 can also obtain log data sets covering different lengths of time. For example, the first log data sets of four 15-minute time periods yield the log data set for one hour; the log data sets for 24 hours yield the log data set for one day; the log data sets for 30 days yield the log data set for one month; and so on. Log data sets covering different lengths of time can thus be obtained to meet different needs.
  • The dividing and saving module 403 is configured to divide the log data in the second log data set by dimension, according to the dimensions of the log data in the second log data set, and to save the resulting third log data sets, each corresponding to a different dimension, to the Hadoop database.
  • The Internet access content refers to the site browsed by the user. The browsed site may be a specific website, for example Baidu, Sohu or Sina Weibo, or a category of website, such as music or movies.
  • The Internet access location refers to the geographical location of the IP address used by the user.
  • The Internet access time refers to the time at which the log data is generated. The division into dimensions is based on the requirements of the system, and the data in each dimension further characterizes the overall behavior of the user.
  • The dimensions of the log data differ for different types of log data.
  • In addition to the above-mentioned Internet access content, Internet access location and Internet access time, the dimensions may include Internet access frequency, user age, monthly consumption, and the like. In practical applications the dimensions may therefore be chosen according to specific needs, which is not limited herein.
  • The third log data sets corresponding to the different dimensions may also be saved in a column storage array. This lets the Hadoop database and the column storage array work together to meet the data requirements of different application scenarios.
  • In this embodiment, the first saving module 401 saves the first log data set acquired in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database satisfies the preset value, the parallel aggregation module 402 performs parallel aggregation processing on the first log data sets in the Hadoop database by using the preset parallel computing model to obtain a second log data set. The dividing and saving module 403 divides the log data in the second log data set by dimension, according to the dimensions of that log data, and saves the resulting third log data sets, each corresponding to a different dimension, to the Hadoop database.
  • The mining system thus saves the first log data set acquired in the current time period to the Hadoop database, aggregates the saved first log data sets in parallel once their number satisfies the preset value, divides the resulting second log data set by dimension, and saves the third log data sets to the Hadoop database to complete the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, it can be used for distributed storage of the log data and, using the Hadoop parallel computing model, for parallel computation. Massive data can therefore be mined quickly and efficiently, which meets the storage and computing requirements of massive data mining.
  • FIG. 5 is a schematic diagram of the functional modules added to the second embodiment shown in FIG. 4, including:
  • The obtaining module 501 is configured to acquire the log data in the current time period from the network side;
  • The obtaining module 501 can obtain this log data in several ways. The log data in the current time period may be obtained from the network side by using web crawler technology, or from the BOSS camp database on the network side, or the log data in the current time period provided by a third-party vendor on the network side may be accepted, or at least two of these methods may be combined.
  • The first aggregation module 502 is configured to perform aggregation processing on the log data in the current time period to obtain the first log data set in the current time period.
  • The first aggregation module 502 may aggregate according to the content of the log data: log data with the same content, or with content belonging to the same class, is accumulated into a single piece of data. The first log data set obtained after aggregation is orders of magnitude smaller than the log data acquired in the current time period, while the meaning of the data is fully preserved.
  • The mining system starts to execute the first saving module 401 of the embodiment shown in FIG. 4 only after the first aggregation module 502 has executed.
  • The system further includes a cleaning module 503;
  • The cleaning module 503 is configured to perform data cleaning on the log data in the current time period after the obtaining module 501 obtains that log data, to obtain the cleaned log data in the current time period;
  • The first aggregation module 502 is configured to perform aggregation processing on the cleaned log data in the current time period to obtain the first log data set in the current time period.
  • By obtaining the first log data set through the additional steps shown in FIG. 2, the mining system effectively reduces the order of magnitude of the log data acquired from the network side in the current time period, which reduces the storage space required in the Hadoop database.
  • The mining system can also perform data cleaning on the log data in the current time period, which removes useless or erroneous log data, reduces the amount of log data to be processed, and facilitates data mining.
  • FIG. 6 a schematic diagram of a function module added to the second embodiment shown in FIG. 4 includes:
  • the reading module 601 is configured to: if the data query instruction is received, read the third corresponding to the query dimension from the Hadoop database according to the query dimension included in the data query instruction Log data collection;
  • the analyzing module 602 is configured to perform data analysis on the third log data set and display the result of the data analysis on the display interface.
  • the analysis module 602 includes:
  • the clustering module 603 is configured to perform grouping of users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
  • the obtaining display module 604 is configured to obtain a level configuration table corresponding to at least two user dimensions according to the log data of the user in the user grouping list, and display the level configuration table on the display interface; the user dimension is preset.
  • the level configuration table includes the level determined for each user in the user grouping list by grading according to the user dimension.
  • user dimensions can be divided into horizontal dimensions and vertical dimensions, and users are rated in different dimensions.
  • the user groups obtained by the mining system include an all-users group and a microblog (Weibo) user group.
  • in the all-users group, all users are ranked by the amount of traffic they used: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of each user in the all-users group.
  • in the Weibo user group, users are ranked by the traffic generated after they start Weibo: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of each user in the Weibo user group.
  • Through the horizontal dimension rating and the vertical dimension rating it is possible to implement a portrait presentation of the user group so that the business expert can get a targeted solution for the specific grouping portrait.
  • the preset clustering algorithm may be a K-means algorithm.
  • the query dimension is set according to a dimension corresponding to the third log data set saved in the Hadoop database.
  • the query dimension may be any one or more of Internet access content, Internet access time, and Internet access location.
  • the mining system reads, from the Hadoop database, the third log data set corresponding to the query dimension included in the data query instruction, performs data analysis on it, and displays the results on the display interface, so that the results of the data mining can be effectively presented to the user.
  • Embodiments of the present invention also provide a storage medium.
  • the foregoing storage medium may be configured to store program code for performing the following steps:
  • the first log data set in the Hadoop database is aggregated in parallel by the preset parallel computing model to obtain the second log data set;
  • the storage medium is further arranged to store program code for performing the following steps:
  • S2 Perform aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
  • the foregoing storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a mobile hard disk, and a magnetic memory.
  • the acquired first log data set for the current time period is saved to the Hadoop database; if the number of first log data sets saved in the Hadoop database satisfies a preset value, the preset parallel computing model performs parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set; the log data in the second log data set is divided by dimension according to its dimensions, and the resulting third log data sets corresponding to different dimensions are saved to the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, using it for distributed storage of the log data and using the parallel computing model for parallel computation makes it possible to mine massive data quickly and effectively and meets the storage and computing requirements of mining massive data.

Abstract

A Hadoop-based log data mining method, comprising: saving an acquired first log data set for the current time period into a Hadoop database (101); if the number of first log data sets saved in the Hadoop database satisfies a preset value, performing parallel aggregation processing on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set (102); and, according to the dimensions of the log data in the second log data set, dividing that log data by dimension and saving the resulting third log data sets corresponding to different dimensions into the Hadoop database (103). The method can mine massive data quickly and effectively and satisfies the storage and computing requirements of massive data mining.

Description

Hadoop-based log data mining method and system
Technical field
The present invention relates to the field of computer data processing, and in particular to a Hadoop-based log data mining method and system.
Background
Since the start of the Internet era, how to quickly find more suitable, quantifiable, and predictable precision-marketing strategies within the ever-growing mass of user information has become a core need of many enterprises, including operators.
However, traditional databases have limited data computing power and high storage costs, and cannot meet the needs of mining massive data.
The above content is only intended to assist in understanding the technical solutions of the present invention, and does not constitute an admission that it is related art.
Summary of the invention
The main purpose of the embodiments of the present invention is to provide a Hadoop-based log data mining method and system, aiming to solve the technical problem that traditional databases have limited data computing power and high storage costs and cannot support the mining of massive data.
To achieve the above purpose, an embodiment of the present invention provides a Hadoop-based log data mining method, including:
saving the acquired first log data set for the current time period to a Hadoop database;
if the number of first log data sets saved in the Hadoop database satisfies a preset value, performing parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;
dividing the log data in the second log data set by dimension according to the dimensions of the log data in the second log data set, and saving the resulting third log data sets corresponding to different dimensions to the Hadoop database.
Optionally, the method further includes:
acquiring log data for the current time period from the network side;
performing aggregation processing on the log data for the current time period to obtain the first log data set for the current time period.
Optionally, after the step of acquiring the log data for the current time period from the network side, the method further includes:
performing data cleaning on the log data for the current time period to obtain cleaned log data for the current time period;
in which case the step of performing aggregation processing on the log data for the current time period to obtain the first log data set for the current time period includes:
performing aggregation processing on the cleaned log data for the current time period to obtain the first log data set for the current time period.
Optionally, the method further includes:
if a data query instruction is received, reading, from the Hadoop database, the third log data set corresponding to the query dimension contained in the data query instruction;
performing data analysis on the third log data set, and displaying the result of the data analysis on a display interface.
Optionally, performing data analysis on the third log data set includes:
grouping the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
obtaining, according to the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, where the user dimensions are preset and the level configuration table contains the levels determined by grading the users in the user grouping list according to the user dimensions.
To achieve the above purpose, an embodiment of the present invention further provides a Hadoop-based log data mining system, including:
a first saving module, configured to save the acquired first log data set for the current time period to a Hadoop database;
a parallel aggregation module, configured to, if the number of first log data sets saved in the Hadoop database satisfies a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using a preset parallel computing model to obtain a second log data set;
a division and saving module, configured to divide the log data in the second log data set by dimension according to the dimensions of the log data in the second log data set, and save the resulting third log data sets corresponding to different dimensions to the Hadoop database.
Optionally, the system further includes:
an obtaining module, configured to acquire log data for the current time period from the network side;
a first aggregation module, configured to perform aggregation processing on the log data for the current time period to obtain the first log data set for the current time period.
Optionally, the system further includes a cleaning module;
the cleaning module is configured to, after the obtaining module acquires the log data for the current time period, perform data cleaning on that log data to obtain cleaned log data for the current time period;
and the first aggregation module is specifically configured to perform aggregation processing on the cleaned log data for the current time period to obtain the first log data set for the current time period.
Optionally, the system further includes:
a reading module, configured to, if a data query instruction is received, read, from the Hadoop database, the third log data set corresponding to the query dimension contained in the data query instruction;
an analysis module, configured to perform data analysis on the third log data set and display the result of the data analysis on a display interface.
Optionally, the analysis module includes:
a clustering module, configured to group the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
an obtaining display module, configured to obtain, according to the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, where the user dimensions are preset and the level configuration table contains the levels determined by grading the users in the user grouping list according to the user dimensions.
An embodiment of the present invention further provides a computer storage medium, which may store execution instructions for performing the Hadoop-based log data mining method of the foregoing embodiments.
An embodiment of the present invention provides a Hadoop-based log data mining method: the acquired first log data set for the current time period is saved to a Hadoop database; if the number of first log data sets saved in the Hadoop database satisfies a preset value, a preset parallel computing model is used to perform parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set; the log data in the second log data set is divided by dimension according to its dimensions, and the resulting third log data sets corresponding to different dimensions are saved to the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, using it to store the log data in a distributed manner and using the parallel computing model to compute in parallel makes it possible to mine massive data quickly and effectively and meets the storage and computing requirements of mining massive data.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a Hadoop-based log data mining method according to a first embodiment of the present invention;
FIG. 2 is a schematic flowchart of the steps added before step 101 of the first embodiment in FIG. 1;
FIG. 3 is a schematic flowchart of the steps added after step 103 of the first embodiment in FIG. 1;
FIG. 4 is a schematic diagram of the functional modules of a Hadoop-based log data mining system according to a second embodiment of the present invention;
FIG. 5 is a schematic diagram of functional modules added to the second embodiment of FIG. 4;
FIG. 6 is a schematic diagram of functional modules added to the second embodiment of FIG. 4.
The realization of the purpose, the functional features, and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.
The present invention provides a Hadoop-based log data mining method: the acquired first log data set for the current time period is saved to a Hadoop database; if the number of first log data sets saved in the Hadoop database satisfies a preset value, a preset parallel computing model is used to perform parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set; the log data in the second log data set is divided by dimension according to its dimensions, and the resulting third log data sets corresponding to different dimensions are saved to the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, using it to store the log data in a distributed manner and using the preset parallel computing model in Hadoop to compute in parallel makes it possible to mine massive data quickly and effectively and meets the storage and computing requirements of mining massive data.
Refer to FIG. 1, which is a schematic flowchart of the Hadoop-based log data mining method in the first embodiment of the present invention, including:
Step 101: save the acquired first log data set for the current time period to the Hadoop database;
In this embodiment of the present invention, the Hadoop-based log data mining method may be applied in a Hadoop-based log data mining system (hereinafter, the mining system); the mining system saves the acquired first log data set for the current time period to the Hadoop database.
The mining system acquires first log data sets by time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set for the current 15-minute period or for the current 30-minute period.
The time period is the data acquisition cycle, and its length may be determined according to the data volume.
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework is the Hadoop database and the parallel computing model: the Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation over massive data.
Preferably, the parallel computing model is the MapReduce computing model.
Step 102: if the number of first log data sets saved in the Hadoop database satisfies a preset value, perform parallel aggregation processing on the first log data sets in the Hadoop database by using the preset parallel computing model to obtain a second log data set;
In this embodiment of the present invention, the mining system saves the first log data set acquired in each time period to the Hadoop database; if the number of first log data sets saved in the Hadoop database satisfies the preset value, the preset parallel computing model in the Hadoop framework may be used to aggregate the first log data sets in the Hadoop database to obtain the second log data set.
In practice, the value may be preset according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour need to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day need to be aggregated, the preset value is 48.
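The arithmetic behind the two preset values above can be sketched as follows. This is a minimal illustration, not part of the patent; the function name `preset_count` and the rule that the aggregation window must be a whole multiple of the time period are assumptions.

```python
from datetime import timedelta


def preset_count(period: timedelta, window: timedelta) -> int:
    """Number of per-period first log data sets needed to cover one window.

    Hypothetical helper: the patent only gives the two worked examples.
    """
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    count, remainder = divmod(window, period)
    if remainder:
        # Assumption: a window must be covered by whole periods.
        raise ValueError("window must be a whole multiple of period")
    return count


# The two examples from the text: 15-minute periods over one hour,
# and 30-minute periods over one day.
hourly = preset_count(timedelta(minutes=15), timedelta(hours=1))
daily = preset_count(timedelta(minutes=30), timedelta(days=1))
print(hourly, daily)
```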
It can be understood that, based on the aggregation processing described above, the mining system can obtain log data sets for different time spans in a similar manner. For example, four first log data sets with a 15-minute time period yield the log data set for one hour, 24 one-hour log data sets yield the log data set for one day, 30 one-day log data sets yield the log data set for one month, and so on, so that log data sets for different time spans are available to meet different needs.
In this embodiment of the present invention, when the mining system performs parallel aggregation processing with the preset parallel computing model, it accumulates the count values of identical log data.
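The count accumulation described above maps naturally onto a map/shuffle/reduce pattern. The sketch below imitates that pattern in a single Python process; it does not use the Hadoop MapReduce API. The record key `(user_id, content)`, the function names, and the reading of "satisfies the preset value" as "at least that many sets" are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]    # hypothetical key: (user_id, content)
LogSet = Dict[Record, int]  # one log data set: record -> count value


def map_phase(log_set: LogSet) -> Iterator[Tuple[Record, int]]:
    """Emit one (record, count) pair per entry of a first log data set."""
    for record, count in log_set.items():
        yield record, count


def shuffle(pairs: Iterable[Tuple[Record, int]]) -> Dict[Record, List[int]]:
    """Group the emitted counts by record, as a MapReduce shuffle would."""
    grouped: Dict[Record, List[int]] = defaultdict(list)
    for record, count in pairs:
        grouped[record].append(count)
    return grouped


def reduce_phase(grouped: Dict[Record, List[int]]) -> LogSet:
    """Accumulate the count values of identical log data."""
    return {record: sum(counts) for record, counts in grouped.items()}


def parallel_aggregate(first_sets: List[LogSet], preset: int) -> Optional[LogSet]:
    """Return the second log data set, or None if too few sets are saved."""
    if len(first_sets) < preset:  # assumption: "satisfies" means "at least"
        return None
    pairs = (pair for log_set in first_sets for pair in map_phase(log_set))
    return reduce_phase(shuffle(pairs))


quarter_hours = [
    {("u1", "weibo"): 3, ("u2", "music"): 1},
    {("u1", "weibo"): 2},
    {("u2", "music"): 4},
    {("u1", "news"): 1},
]
hour_set = parallel_aggregate(quarter_hours, preset=4)
print(hour_set)
```

Because the output has the same shape as the input sets, the same function could in principle be reapplied to 24 hourly sets to obtain a daily set, which mirrors the roll-up described in the text.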
Step 103: divide the log data in the second log data set by dimension according to the dimensions of the log data in the second log data set, and save the resulting third log data sets corresponding to different dimensions to the Hadoop database.
In this embodiment of the present invention, after obtaining the second log data set, the mining system divides the log data in the second log data set by dimension according to the dimensions of that log data, and saves the resulting third log data sets corresponding to different dimensions to the Hadoop database, thereby mining the massive log data. The saved third log data sets can serve as the data source for user data queries, supporting icon and graphical queries as well as multi-dimensional queries on the display interface, so that the data can be presented from multiple angles and the results of the data mining can be displayed.
Log data has many dimensions, including but not limited to Internet access content, Internet access location, and Internet access time. Internet access content refers to where the user browses; it may be a specific site, such as Baidu, Sohu, or Sina Weibo, or a category of sites, such as music or movies. Internet access location refers to the geographical area in which the IP address used by the user is located, and Internet access time refers to the time at which the log data was generated. The dimensions are divided according to the requirements of the system, so that the data in each dimension further characterizes the user's overall behavior. It should be noted that different types of log data have different dimensions. For example, when the technical solution of this embodiment is used to mine users' traffic data in the log data, the dimensions may include, in addition to Internet access content, location, and time, access frequency, user age, monthly spending, and so on. In practice, the dimensions may therefore be divided according to specific needs, which is not limited here.
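The patent does not spell out how dimension division is computed. One plausible reading, sketched below, is that each dimension gets its own third log data set in which counts are summed per dimension value. The field names (`content`, `location`, `hour`, `count`) and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical dimension names standing in for Internet access content,
# Internet access location, and Internet access time.
DIMENSIONS = ("content", "location", "hour")


def divide_by_dimension(second_set: List[dict]) -> Dict[str, Counter]:
    """Build one third log data set per dimension by summing counts."""
    third_sets: Dict[str, Counter] = {dim: Counter() for dim in DIMENSIONS}
    for row in second_set:
        for dim in DIMENSIONS:
            third_sets[dim][row[dim]] += row["count"]
    return third_sets


second_set = [
    {"content": "weibo", "location": "Shenzhen", "hour": 9, "count": 5},
    {"content": "music", "location": "Shenzhen", "hour": 10, "count": 2},
    {"content": "weibo", "location": "Nanjing", "hour": 9, "count": 1},
]
third_sets = divide_by_dimension(second_set)
print(dict(third_sets["content"]))
```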
Preferably, in this embodiment of the present invention, after saving the third log data sets corresponding to different dimensions to the Hadoop database, the mining system may also save them to a column storage array, so that the Hadoop database and the column storage array work together and meet the data requirements of different application scenarios.
Preferably, because the mining system performs the parallel aggregation and dimension division only when the number of first log data sets saved in the Hadoop database satisfies the preset value, each resulting third log data set also corresponds to a time span; when saving, the mining system may store the correspondence among the dimension, the time span, and the third log data set.
In this embodiment of the present invention, the mining system saves the acquired first log data set for the current time period to the Hadoop database; if the number of first log data sets saved in the Hadoop database satisfies the preset value, the preset parallel computing model is used to perform parallel aggregation processing on the first log data sets in the Hadoop database to obtain a second log data set; the log data in the second log data set is divided by dimension according to its dimensions, and the resulting third log data sets corresponding to different dimensions are saved to the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, using it to store the log data in a distributed manner and using the parallel computing model in Hadoop to compute in parallel makes it possible to mine massive data quickly and effectively and meets the storage and computing requirements of mining massive data.
Refer to FIG. 2, which is a schematic flowchart of the steps added before step 101 in the first embodiment shown in FIG. 1, including:
Step 201: acquire log data for the current time period from the network side;
In this embodiment of the present invention, the mining system acquires the log data for the current time period from the network side. Specifically, the mining system may acquire it by log data extraction, by web crawler technology, from the BOSS business and accounting database on the network side, by accepting log data for the current time period provided by a third-party vendor on the network side, or by combining at least two of these approaches.
Step 202: perform aggregation processing on the log data for the current time period to obtain the first log data set for the current time period.
In this embodiment of the present invention, after acquiring the log data for the current time period, the mining system performs aggregation processing on it to obtain the first log data set for the current time period.
The aggregation in step 202 may classify the log data according to its content and accumulate the log data with the same content, or with content of the same class, into a single record with a count. The resulting first log data set is orders of magnitude smaller than the log data acquired for the current time period, while the meaning of the data is fully preserved.
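As a rough illustration of the aggregation in step 202, the sketch below collapses identical raw records into one record with a count, optionally mapping each content value to a class first. The record layout `(user, content)`, the `classify` callback, and the example class mapping are assumptions, not details from the patent.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple

RawRecord = Tuple[str, str]  # hypothetical raw log record: (user, content)


def aggregate(
    records: Iterable[RawRecord],
    classify: Optional[Callable[[str], str]] = None,
) -> Counter:
    """Collapse records with the same content (or class) into counts."""
    counts: Counter = Counter()
    for user, content in records:
        key = (user, classify(content) if classify else content)
        counts[key] += 1
    return counts


raw = [
    ("u1", "weibo.com"),
    ("u1", "weibo.com"),
    ("u1", "music-a.example"),
    ("u1", "music-b.example"),
    ("u2", "weibo.com"),
]

# Aggregation by identical content.
by_content = aggregate(raw)

# Aggregation by content class, using a made-up classifier.
by_class = aggregate(raw, classify=lambda c: "music" if c.startswith("music") else c)
print(dict(by_class))
```

Five raw records become four records when grouped by identical content and three when grouped by class, which is the size reduction the text describes, on a toy scale.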
In this embodiment of the present invention, the mining system obtains the first log data set through the additional steps shown in FIG. 2; by aggregating the log data acquired from the network side for the current time period, it effectively reduces the order of magnitude of the log data, which reduces the storage space required in the Hadoop database.
Preferably, in this embodiment of the present invention, the mining system may also perform the following step before performing step 202:
performing data cleaning on the log data for the current time period to obtain cleaned log data for the current time period;
In this embodiment of the present invention, before aggregating the acquired log data for the current time period, the mining system may also perform data cleaning on that log data to obtain cleaned log data for the current time period.
If the mining system performs the above step, step 202 needs to be adjusted accordingly, and the adjusted step 202 is:
performing aggregation processing on the cleaned log data for the current time period to obtain the first log data set for the current time period.
Cleaning the log data may consist of removing log data that does not match the preset data types, and/or detecting identifiable errors in the log data and correcting or deleting the log data in which they occur.
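A minimal sketch of such a cleaning pass is shown below. The allowed types, the field names, and the specific repairs (trimming whitespace, lowercasing the content, dropping records with missing fields) are illustrative assumptions; the patent only states the two general rules.

```python
from typing import Iterable, List, Optional

ALLOWED_TYPES = {"http", "dns"}  # hypothetical preset data types


def clean_record(rec: dict) -> Optional[dict]:
    """Return a corrected copy of the record, or None if it must be dropped."""
    # Rule 1: remove log data that does not match the preset data types.
    if rec.get("type") not in ALLOWED_TYPES:
        return None
    # Rule 2: correct identifiable errors (stray whitespace, letter case).
    user = (rec.get("user") or "").strip()
    content = (rec.get("content") or "").strip().lower()
    # Records whose errors cannot be corrected (missing fields) are deleted.
    if not user or not content:
        return None
    return {"type": rec["type"], "user": user, "content": content}


def clean(records: Iterable[dict]) -> List[dict]:
    """Apply clean_record to every record and keep the survivors."""
    return [c for c in map(clean_record, records) if c is not None]


raw_logs = [
    {"type": "http", "user": " u1 ", "content": "Weibo.com"},
    {"type": "ftp", "user": "u2", "content": "files.example"},
    {"type": "dns", "user": "", "content": "weibo.com"},
    {"type": "dns", "user": "u3", "content": "music.example"},
]
cleaned = clean(raw_logs)
print(cleaned)
```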
In this embodiment of the present invention, by performing data cleaning on the log data for the current time period, the mining system removes useless or erroneous log data, reduces the amount of log data to process, and makes data mining easier.
Refer to FIG. 3, which is a schematic flowchart of the steps added after step 103 in the first embodiment shown in FIG. 1, including:
Step 301: if a data query instruction is received, read, from the Hadoop database, the third log data set corresponding to the query dimension contained in the data query instruction;
In this embodiment of the present invention, after the mining system saves the resulting third log data sets to the Hadoop database, a user can request data by entering a data query instruction; if the mining system receives a data query instruction, it reads, from the Hadoop database, the third log data set corresponding to the query dimension contained in the instruction.
Preferably, the data query instruction may also contain a time span, in which case the mining system reads the third log data set corresponding to the query dimension within that time span.
步骤302、对第三日志数据集合进行数据分析,并在显示界面上显示数据分析的结果。Step 302: Perform data analysis on the third log data set, and display the result of the data analysis on the display interface.
在本发明实施例中,挖掘系统还将对第三日志数据集合进行数据分析,并在显示界面上显示数据分析的结果,具体的:挖掘系统按照预先设置的聚类算法对第三日志数据集合中的用户进行用户分组,得到用户分组列表;根据用户分组列表中的用户的日志数据得到至少两个用户维度对应的级别配置表,并在显示界面上显示级别配置表;用户维度是预先设置的,级别配置表中包含用户分组列表中的用户按照用户维度进行分级确定的级别。In the embodiment of the present invention, the mining system further performs data analysis on the third log data set, and displays the result of the data analysis on the display interface. Specifically, the mining system performs the third log data set according to the preset clustering algorithm. The user is grouped by the user to obtain a user grouping list; the level configuration table corresponding to at least two user dimensions is obtained according to the log data of the user in the user grouping list, and the level configuration table is displayed on the display interface; the user dimension is preset The level configuration table includes the levels determined by the users in the user group list according to the user dimension.
其中,用户维度可以分为横向维度和纵向维度,并且在不同的维度下对用户进行评级。例如:挖掘系统得到的用户分组,包括:所有用户组及微博用户组,对于所有用户组,对该组内的所有用户按照使用的流量大小进行名次排行,排名前20%的为五星级用户,排名前20%至40%的为四星级用户,并以此类推,确定该所有用户组中的每一个用户的星级。此即为横向维度评级。对于微博用户组合中的用户,按照用户启动微博之后产生的流量大小进 行名次排行,排名前20%的为五星级用户,排名前20%至40%的为四星级用户,并以此类推,确定该微博用户组中的每一个用户的星级。此即为纵向维度评级。通过横向维度评级和纵向维度评级,使得能够对实现对用户群体的画像展示,以便业务专家针对具体的分组画像得到有针对性的方案。Among them, user dimensions can be divided into horizontal dimensions and vertical dimensions, and users are rated in different dimensions. For example, the user group obtained by the mining system includes: all user groups and microblog user groups. For all user groups, all users in the group are ranked according to the traffic volume used, and the top 20% is five-star. Users, the top 20% to 40% of the four-star users, and so on, determine the star rating of each user in all user groups. This is the horizontal dimension rating. For the users in the Weibo user combination, according to the amount of traffic generated after the user starts the Weibo Ranks are ranked, the top 20% are five-star users, the top 20% to 40% are four-star users, and so on, determine the star rating of each user in the Weibo user group. This is a vertical dimension rating. Through the horizontal dimension rating and the vertical dimension rating, it is possible to implement a portrait presentation of the user group so that the business expert can get a targeted solution for the specific grouping portrait.
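The 20% banding in the example above can be sketched as follows. This is only an illustration under stated assumptions: the function name `star_ratings`, the dictionaries of per-user traffic, and the tie-breaking behaviour (ties keep their input order) are not taken from the patent.

```python
def star_ratings(traffic_by_user: dict) -> dict:
    """Rank users by traffic (descending) and map each 20% band to a star level."""
    ranked = sorted(traffic_by_user, key=traffic_by_user.get, reverse=True)
    n = len(ranked)
    stars = {}
    for pos, user in enumerate(ranked):   # pos 0 is the heaviest user
        band = (pos * 5) // n             # 0..4, i.e. which 20% band the rank falls in
        stars[user] = 5 - band            # band 0 -> five stars, band 4 -> one star
    return stars


# Horizontal dimension: every user, ranked by total traffic (made-up numbers).
total_traffic = {"u1": 900, "u2": 700, "u3": 500, "u4": 300, "u5": 100}
# Vertical dimension: the same users, ranked by Weibo traffic only (made-up numbers).
weibo_traffic = {"u1": 10, "u2": 5, "u3": 80, "u4": 60, "u5": 40}

horizontal = star_ratings(total_traffic)
vertical = star_ratings(weibo_traffic)
```

In this made-up data, `u1` is a five-star user horizontally but only a two-star user vertically, which is the kind of contrast a group profile is meant to expose.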
Preferably, the preset clustering algorithm may be the K-means algorithm.
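For readers unfamiliar with K-means, the following is a minimal one-dimensional version of Lloyd's algorithm in Python. It is a sketch, not the patent's implementation: clustering on a single traffic value, the deterministic choice of initial centres, and the fixed iteration count are all assumptions.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on scalars; returns (labels, centres)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    # Deterministic initialisation: spread the initial centres over the sorted values.
    centres = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels, centres


# Hypothetical per-user traffic values forming two obvious groups.
traffic = [1, 2, 3, 100, 101, 102]
labels, centres = kmeans_1d(traffic, k=2)
```

Users with the same label would form one entry of the user grouping list. A production system would more likely cluster on several features at once and use an existing library.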
The query dimension is set based on the dimensions of the third log data sets saved in the Hadoop database. For example, the query dimension may be any one or more of Internet content, Internet access time, Internet access location, and so on.
In this embodiment, the mining system reads from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction, performs data analysis on that third log data set, and displays the result of the data analysis on the display interface, so that the results of the data mining can be presented to the user effectively.
It should be noted that, in the embodiments of the present invention, the Hadoop-based log data mining method can be applied in a precision marketing system for traffic data. For example, the technical solutions described in the embodiments shown in FIG. 1 to FIG. 3 can be used to mine target users, to mine marketing site selections, and so on, giving operators a data basis for targeted, fine-grained marketing aimed at target users or target base station cells.
If target users need to be identified, the query dimension in step 301 of the embodiment shown in FIG. 3 may be Internet content or Internet traffic; if target base station cells need to be identified, the query dimension may be Internet access location.
In practice, the user may choose the query dimension according to specific needs, which is not limited here.
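A query by dimension, with the optional time period mentioned for step 301, might look like the sketch below. The in-memory dictionary `THIRD_SETS` stands in for the Hadoop database, and the row layout `(timestamp, user, value)` and dimension names are assumptions; the patent specifies neither a storage schema nor a query API.

```python
from datetime import datetime

# Hypothetical stand-in for the per-dimension third log data sets.
# Each row is (timestamp, user, value); this layout is an assumption.
THIRD_SETS = {
    "content": [
        (datetime(2016, 8, 30, 9, 0), "u1", "weibo"),
        (datetime(2016, 8, 30, 21, 0), "u2", "music"),
    ],
    "location": [
        (datetime(2016, 8, 30, 9, 5), "u1", "cell-A"),
    ],
}


def query(dimension, start=None, end=None, store=THIRD_SETS):
    """Return the rows for one query dimension, optionally limited to [start, end)."""
    rows = store.get(dimension, [])
    return [
        row for row in rows
        if (start is None or row[0] >= start) and (end is None or row[0] < end)
    ]


# Target users: query by content within a morning time period.
morning = query("content", datetime(2016, 8, 30, 8, 0), datetime(2016, 8, 30, 12, 0))
# Target base station cells: query by location with no time limit.
cells = query("location")
```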
Referring to FIG. 4, which is a schematic diagram of the functional modules of a Hadoop-based log data mining system in the second embodiment of the present invention, the system includes:
a first saving module 401, configured to save the acquired first log data set for the current time period to the Hadoop database;
The mining system acquires first log data sets by time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set for the current 15-minute period or for the current 30-minute period.
The time period is the cycle at which data is acquired, and its length may be determined according to the data volume.
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework is the Hadoop database and a parallel computing model: the Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation over massive data.
Preferably, the parallel computing model is the MapReduce computing model.
a parallel aggregation module 402, configured to, if the number of first log data sets saved in the Hadoop database meets a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set;
In practice, the value may be preset according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour need to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day need to be aggregated, the preset value is 48.
It can be understood that, building on the aggregation described above, the parallel aggregation module 402 can obtain log data sets for different time cycles in a similar way. For example, four 15-minute first log data sets yield an hourly log data set, 24 hourly log data sets yield a daily log data set, 30 daily log data sets yield a monthly log data set, and so on, so that log data sets covering different time spans can be obtained to meet different needs.
a partition saving module 403, configured to partition the log data in the second log data set by dimension according to the dimensions of the log data in the second log data set, and to save the resulting third log data sets corresponding to the different dimensions to the Hadoop database.
Log data has many dimensions, including but not limited to Internet content, Internet access location, and Internet access time. Internet content refers to what the user browses; it may be a specific site, such as Baidu, Sohu, or Sina Weibo, or a category of websites, such as music or movies. Internet access location refers to the geographic area in which the IP address used by the user is located, and Internet access time refers to the time at which the log data was generated. The dimensions are partitioned according to the requirements of the system, so that the data along each dimension further characterizes the overall behavior of the user. It should be noted that different types of log data have different dimensions. For example, when the technical solution of the embodiments of the present invention is used to mine users' traffic data in the log data, the dimensions may include, in addition to the Internet content, Internet access location, and Internet access time mentioned above, Internet access frequency, user age, monthly spending, and so on. In practice, therefore, the dimensions may be partitioned according to specific needs, which is not limited here.
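Dimension partitioning can be pictured as regrouping the aggregated records once per dimension. The sketch below assumes that the second log data set is a list of `(record, count)` pairs and that the dimensions are named `content`, `location`, and `hour`; both are illustrative choices, not details from the patent.

```python
from collections import Counter, defaultdict

# Assumed dimension names: Internet content, access location, access hour.
DIMENSIONS = ("content", "location", "hour")


def partition_by_dimension(second_set):
    """Split aggregated (record, count) pairs into one counter per dimension."""
    third_sets = defaultdict(Counter)
    for record, count in second_set:
        for dim in DIMENSIONS:
            third_sets[dim][record[dim]] += count
    return dict(third_sets)


# Hypothetical second log data set: each record carries a value per dimension.
second_set = [
    ({"content": "weibo", "location": "cell-A", "hour": 9}, 3),
    ({"content": "music", "location": "cell-A", "hour": 21}, 2),
    ({"content": "weibo", "location": "cell-B", "hour": 9}, 1),
]
third_sets = partition_by_dimension(second_set)
```

Each entry of `third_sets` plays the role of one third log data set, so a later query on a single dimension only has to read the matching entry.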
Preferably, in this embodiment, after saving the third log data sets corresponding to the different dimensions to the Hadoop database, the mining system may also save those third log data sets to a column storage array, so that the Hadoop database and the column storage array work together to meet the data requirements of different application scenarios.
In this embodiment, the first saving module 401 saves the acquired first log data set for the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database meets the preset value, the parallel aggregation module 402 performs parallel aggregation on the first log data sets in the Hadoop database using the preset parallel computing model to obtain a second log data set. Finally, the partition saving module 403 partitions the log data in the second log data set by dimension and saves the resulting third log data sets corresponding to the different dimensions to the Hadoop database.
In this embodiment, the mining system saves the acquired first log data set for the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database meets the preset value, the system performs parallel aggregation on those first log data sets using the parallel computing model in Hadoop to obtain a second log data set, partitions the log data in the second log data set by dimension, and saves the resulting third log data sets corresponding to the different dimensions to the Hadoop database, which completes the mining of the log data. Because Hadoop offers good distributed storage and parallel computing capabilities, storing the log data in the Hadoop database in a distributed manner and processing it with Hadoop's parallel computing model allows massive data to be mined quickly and effectively, and satisfies the storage and computation requirements of mining massive data.
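The relationship between the time period, the aggregation window, and the preset value, together with the roll-up that the preset value triggers, can be sketched in Python. This is a single-process stand-in: a real deployment would run the merge as a MapReduce job, and the function names and the reading of "meets the preset value" as "at least that many sets are stored" are assumptions.

```python
from collections import Counter


def required_count(window_minutes, period_minutes):
    """Number of first log data sets needed to cover one aggregation window."""
    if period_minutes <= 0 or window_minutes % period_minutes != 0:
        raise ValueError("window must be a whole multiple of a positive period")
    return window_minutes // period_minutes


def roll_up(first_sets):
    """Merge per-period counters into one counter for the whole window."""
    total = Counter()
    for s in first_sets:
        total.update(s)  # Counter.update adds counts key by key
    return total


def maybe_aggregate(stored_sets, preset_value):
    """Aggregate only once enough first log data sets have been stored."""
    if len(stored_sets) < preset_value:
        return None
    return roll_up(stored_sets[:preset_value])


hourly_threshold = required_count(60, 15)        # four 15-minute periods per hour
daily_threshold = required_count(24 * 60, 30)    # 48 half-hour periods per day

# Hypothetical 15-minute first log data sets keyed by content.
quarters = [Counter(weibo=3), Counter(weibo=1, music=2), Counter(music=1), Counter(weibo=2)]
too_early = maybe_aggregate(quarters[:3], hourly_threshold)
hourly = maybe_aggregate(quarters, hourly_threshold)
```

The same `roll_up` could be applied again to 24 hourly results to obtain a daily set, mirroring the hour, day, and month cascade described for module 402.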
Referring to FIG. 5, which is a schematic diagram of functional modules added to the second embodiment shown in FIG. 4, the added modules include:
an acquisition module 501, configured to acquire log data in the current time period from the network side;
In this embodiment, the acquisition module 501 acquires the log data in the current time period from the network side. Specifically, the acquisition module 501 may acquire the log data by log data extraction, by web crawler technology, by reading it from the BOSS business and accounting database on the network side, by accepting it from a third-party vendor on the network side, or by combining at least two of these approaches.
a first aggregation module 502, configured to aggregate the log data in the current time period to obtain the first log data set for the current time period.
The first aggregation module 502 may classify the log data by content and treat log data with the same content, or with content of the same class, as a single record whose count is accumulated. The first log data set obtained after aggregation is orders of magnitude smaller than the log data acquired in the current time period, yet the meaning of the data is fully preserved.
In this embodiment, the mining system executes the first saving module 401 of the embodiment shown in FIG. 4 only after it has executed the first aggregation module 502.
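The first aggregation performed by module 502 amounts to counting records that share the same content. A minimal sketch follows; the choice of `(user, content)` as the aggregation key and the record fields are assumptions made for illustration.

```python
from collections import Counter


def first_aggregation(records):
    """Collapse records that share (user, content) into one entry with a count."""
    return Counter((r["user"], r["content"]) for r in records)


# Hypothetical raw log records for one time period.
raw = [
    {"user": "u1", "content": "weibo"},
    {"user": "u1", "content": "weibo"},
    {"user": "u2", "content": "music"},
    {"user": "u1", "content": "weibo"},
]
first_set = first_aggregation(raw)
```

Four raw records become two entries, while the counts still sum to the original number of records, which is the sense in which the meaning of the data is preserved.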
In this embodiment, the system further includes a cleaning module 503;
the cleaning module 503 is configured to, after the acquisition module 501 acquires the log data in the current time period, perform data cleaning on that log data to obtain cleaned log data for the current time period;
and if the mining system executes the cleaning module 503, the first aggregation module 502 is specifically configured to aggregate the cleaned log data in the current time period to obtain the first log data set for the current time period.
In this embodiment, the mining system obtains the first log data set through the additional steps shown in FIG. 2. Aggregating the log data acquired from the network side in the current time period effectively reduces the order of magnitude of the log data, so that less storage space is required in the Hadoop database. The mining system may also clean the log data in the current time period to remove useless or erroneous log data, which reduces the amount of log data to be processed and facilitates better data mining.
Referring to FIG. 6, which is a schematic diagram of functional modules added to the second embodiment shown in FIG. 4, the added modules include:
a reading module 601, configured to, if a data query instruction is received, read from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
an analysis module 602, configured to perform data analysis on the third log data set and display the result of the data analysis on a display interface.
The analysis module 602 includes:
a clustering module 603, configured to group the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
an acquisition and display module 604, configured to derive, from the log data of the users in the user grouping list, level configuration tables corresponding to at least two user dimensions, and to display the level configuration tables on the display interface; the user dimensions are preset, and each level configuration table contains the levels assigned to the users in the user grouping list when they are graded along the corresponding user dimension.
The user dimensions may be divided into a horizontal dimension and a vertical dimension, and users are rated under each dimension. For example, suppose the user groups obtained by the mining system include an all-users group and a Weibo user group. For the all-users group, all users in the group are ranked by the amount of traffic they use: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of every user in the all-users group. This is the horizontal-dimension rating. For the Weibo user group, the users are ranked by the amount of traffic generated after they launch Weibo: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of every user in the Weibo user group. This is the vertical-dimension rating. Together, the horizontal and vertical ratings make it possible to present a profile of each user group, so that business experts can devise targeted plans for specific group profiles.
Preferably, the preset clustering algorithm may be the K-means algorithm.
The query dimension is set based on the dimensions of the third log data sets saved in the Hadoop database. For example, the query dimension may be any one or more of Internet content, Internet access time, Internet access location, and so on.
In this embodiment, the mining system reads from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction, performs data analysis on that third log data set, and displays the result of the data analysis on the display interface, so that the results of the data mining can be presented to the user effectively.
An embodiment of the present invention further provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store program code for performing the following steps:
S1: save the acquired first log data set for the current time period to the Hadoop database;
S2: if the number of first log data sets saved in the Hadoop database meets a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set;
S3: partition the log data in the second log data set by dimension according to the dimensions of the log data in the second log data set, and save the resulting third log data sets corresponding to the different dimensions to the Hadoop database.
Optionally, the storage medium is further configured to store program code for performing the following steps:
S1: acquire log data in the current time period from the network side;
S2: aggregate the log data in the current time period to obtain the first log data set for the current time period.
Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software together with a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes over the related art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.
Industrial Applicability
In the embodiments of the present invention, the acquired first log data set for the current time period is saved to the Hadoop database. If the number of first log data sets saved in the Hadoop database meets a preset value, parallel aggregation is performed on those first log data sets using a preset parallel computing model to obtain a second log data set. The log data in the second log data set is partitioned by dimension, and the resulting third log data sets corresponding to the different dimensions are saved to the Hadoop database, which completes the mining of the log data. Because Hadoop offers good distributed storage and parallel computing capabilities, storing the log data in the Hadoop database in a distributed manner and processing it with the parallel computing model allows massive data to be mined quickly and effectively, and satisfies the storage and computation requirements of mining massive data.

Claims (10)

  1. A Hadoop-based log data mining method, comprising:
    saving an acquired first log data set for a current time period to a Hadoop database;
    if the number of first log data sets saved in the Hadoop database meets a preset value, performing parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set;
    partitioning the log data in the second log data set by dimension according to the dimensions of the log data in the second log data set, and saving the resulting third log data sets corresponding to the different dimensions to the Hadoop database.
  2. The method according to claim 1, wherein the method further comprises:
    acquiring log data in the current time period from a network side;
    aggregating the log data in the current time period to obtain the first log data set for the current time period.
  3. The method according to claim 2, wherein, after the step of acquiring the log data in the current time period from the network side, the method further comprises:
    performing data cleaning on the log data in the current time period to obtain cleaned log data for the current time period;
    and the step of aggregating the log data in the current time period to obtain the first log data set for the current time period comprises:
    aggregating the cleaned log data in the current time period to obtain the first log data set for the current time period.
  4. The method according to any one of claims 1 to 3, wherein the method further comprises:
    if a data query instruction is received, reading from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
    performing data analysis on the third log data set, and displaying the result of the data analysis on a display interface.
  5. The method according to claim 4, wherein performing data analysis on the third log data set and displaying the result of the data analysis on the display interface comprises:
    grouping the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
    deriving, from the log data of the users in the user grouping list, level configuration tables corresponding to at least two user dimensions, and displaying the level configuration tables on the display interface; wherein the user dimensions are preset, and each level configuration table contains the levels assigned to the users in the user grouping list when they are graded along the corresponding user dimension.
  6. 一种基于Hadoop的日志数据挖掘系统,包括:A log data mining system based on Hadoop, including:
    第一保存模块,设置为将获取的当前时间段内的第一日志数据集合保存至Hadoop数据库中;The first saving module is configured to save the first log data set in the obtained current time period to the Hadoop database;
    并行聚集模块,设置为若所述Hadoop数据库已保存的第一日志数据集合的个数满足预先设置的数值,则利用预置的并行运算模型对所述Hadoop数据库中的第一日志数据集合进行并行聚集处理,得到第二日志数据集合;a parallel aggregation module, configured to: if the number of the first log data set saved by the Hadoop database satisfies a preset value, parallelize the first log data set in the Hadoop database by using a preset parallel computing model Aggregating processing to obtain a second log data set;
    划分保存模块,设置为根据所述第二日志数据集合中的日志数据的维度对所述第二日志数据集合中的日志数据进行维度划分,将得到的不同维度对应的第三日志数据集合保存至所述Hadoop数据库中。The partitioning save module is configured to perform dimension partitioning on the log data in the second log data set according to the dimension of the log data in the second log data set, and save the obtained third log data set corresponding to different dimensions to the In the Hadoop database.
  7. 根据权利要求6所述的系统,其中,所述系统还包括:The system of claim 6 wherein said system further comprises:
    获取模块,设置为从网络侧获取当前时间段内的日志数据;Obtaining a module, configured to obtain log data in a current time period from a network side;
    第一聚集模块,设置为对所述当前时间段内的日志数据进行聚集处理,得到所述当前时间段内的第一日志数据集合。The first aggregation module is configured to perform aggregation processing on the log data in the current time period to obtain a first log data set in the current time period.
  8. 根据权利要求7所述的系统,其中,所述系统还包括清洗模块;The system of claim 7 wherein said system further comprises a cleaning module;
    所述清洗模块设置为在所述获取模块获取所述当前时间段内的日志数据之后,对所述当前时间段内的日志数据进行数据清洗,得到当前时间段内清洗后的日志数据; The cleaning module is configured to perform data cleaning on the log data in the current time period after the obtaining module obtains the log data in the current time period, and obtain the cleaned log data in the current time period;
    且所述第一聚集模块具体设置为对所述当前时间段内清洗后的日志数据进行聚集处理,得到所述当前时间段内的第一日志数据集合。The first aggregation module is specifically configured to perform aggregation processing on the cleaned log data in the current time period to obtain a first log data set in the current time period.
  9. The system according to any one of claims 6 to 8, wherein the system further comprises:
    a reading module, configured to, if a data query instruction is received, read from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
    an analysis module, configured to perform data analysis on the third log data set and to display the result of the data analysis on a display interface.
  10. The system according to claim 9, wherein the analysis module comprises:
    a clustering module, configured to group the users in the third log data set according to a preset clustering algorithm, to obtain a user group list;
    an acquisition-and-display module, configured to obtain, from the log data of the users in the user group list, a level configuration table corresponding to at least two user dimensions, and to display the level configuration table on the display interface; the user dimensions are preset, and the level configuration table contains the levels determined by grading the users in the user group list according to the user dimensions.
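
The data flow recited in claims 6 to 9 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the applicant's implementation: the `LogStore` class is an in-memory stand-in for the Hadoop database, a thread pool stands in for the preset parallel computing model, and the record fields (`user`, `dimension`), the threshold `BATCH_THRESHOLD`, and all function names are hypothetical. The clustering and level configuration table of claim 10 are not covered.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

BATCH_THRESHOLD = 3  # hypothetical "preset value" for the number of first sets


def clean(records):
    """Data cleaning: drop records that lack a user or a dimension."""
    return [r for r in records if r.get("user") and r.get("dimension")]


def aggregate_period(records):
    """First aggregation: count events per (user, dimension) in one period."""
    counts = Counter()
    for r in records:
        counts[(r["user"], r["dimension"])] += 1
    return counts


def merge(counters):
    """Sum a sequence of Counters into one Counter."""
    total = Counter()
    for c in counters:
        total.update(c)  # Counter.update adds counts rather than replacing
    return total


def parallel_aggregate(first_sets, workers=4):
    """Parallel aggregation: merge chunks concurrently, then combine them."""
    chunks = [first_sets[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(merge, chunks))
    return merge(partials)


def partition_by_dimension(second_set):
    """Split the second set into one third set per dimension."""
    third = defaultdict(dict)
    for (user, dim), n in second_set.items():
        third[dim][user] = n
    return dict(third)


class LogStore:
    """In-memory stand-in for the Hadoop database (assumption)."""

    def __init__(self):
        self.first_sets = []   # one first log data set per time period
        self.third_sets = {}   # dimension -> {user: count}

    def ingest(self, raw_records):
        """Clean and aggregate one period; re-aggregate once enough are saved."""
        self.first_sets.append(aggregate_period(clean(raw_records)))
        if len(self.first_sets) >= BATCH_THRESHOLD:
            second = parallel_aggregate(self.first_sets)
            self.third_sets = partition_by_dimension(second)

    def query(self, dimension):
        """Read the third log data set for the requested query dimension."""
        return self.third_sets.get(dimension, {})
```

In this sketch a query only touches the pre-partitioned set for the requested dimension, which reflects the apparent intent of saving third log data sets per dimension. Recomputing the second set on every ingest once the threshold is reached is a simplification; the claims do not specify how often the parallel aggregation is repeated.
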
PCT/CN2016/097363 2015-12-02 2016-08-30 Log data mining method and system based on hadoop WO2017092444A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510875453.3 2015-12-02
CN201510875453.3A CN106815274B (en) 2015-12-02 2015-12-02 Hadoop-based log data mining method and system

Publications (1)

Publication Number Publication Date
WO2017092444A1 (en) 2017-06-08

Family

ID=58796202

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/097363 WO2017092444A1 (en) 2015-12-02 2016-08-30 Log data mining method and system based on hadoop

Country Status (2)

Country Link
CN (1) CN106815274B (en)
WO (1) WO2017092444A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107241231A (en) * 2017-07-26 2017-10-10 成都科来软件有限公司 A fast and accurate method for locating raw network data packets
CN111597179A (en) * 2020-05-18 2020-08-28 北京思特奇信息技术股份有限公司 Method and device for automatically cleaning data, electronic equipment and storage medium
CN112287208A (en) * 2019-09-30 2021-01-29 北京沃东天骏信息技术有限公司 User portrait generation method and device, electronic equipment and storage medium
CN112632020A (en) * 2020-12-25 2021-04-09 中国电子科技集团公司第三十研究所 Log information type extraction method and mining method based on spark big data platform

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391645B (en) * 2017-07-12 2018-04-10 广州市昊链信息科技股份有限公司 A kind of logistics information automatic push and practical operation specification form system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000607A (en) * 2006-01-12 2007-07-18 国际商业机器公司 Visual method and device for strengthening search result guide
CN102685221A (en) * 2012-04-29 2012-09-19 华北电力大学(保定) Distributed storage and parallel mining method for state monitoring data
CN103036921A (en) * 2011-09-29 2013-04-10 北京新媒传信科技有限公司 User behavior analysis system and method
CN103955502A (en) * 2014-04-24 2014-07-30 科技谷(厦门)信息技术有限公司 Visualized on-line analytical processing (OLAP) application realizing method and system
CN104317958A (en) * 2014-11-12 2015-01-28 北京国双科技有限公司 Method and system for processing data in real time

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6732123B1 (en) * 1998-02-23 2004-05-04 International Business Machines Corporation Database recovery to any point in time in an online environment utilizing disaster recovery technology
US7552147B2 (en) * 2005-09-02 2009-06-23 International Business Machines Corporation System and method for minimizing data outage time and data loss while handling errors detected during recovery
KR20090050405A (en) * 2007-11-15 2009-05-20 한국전자통신연구원 Method and apparatus for classifying user behaviors based on the event log generated from the context aware system environment
CN101483557B (en) * 2009-03-03 2011-07-13 中兴通讯股份有限公司 Log statistic, storing method and system used for deep packet detection apparatus
US9178935B2 (en) * 2009-03-05 2015-11-03 Paypal, Inc. Distributed steam processing
US10223431B2 (en) * 2013-01-31 2019-03-05 Facebook, Inc. Data stream splitting for low-latency data access
US10069677B2 (en) * 2013-04-06 2018-09-04 Citrix Systems, Inc. Systems and methods to collect logs from multiple nodes in a cluster of load balancers
CN104301360B (en) * 2013-07-19 2019-03-12 阿里巴巴集团控股有限公司 A kind of method of logdata record, log server and system
US9569491B2 (en) * 2013-09-13 2017-02-14 Nec Corporation MISO (multistore-online-tuning) system
CN104182506A (en) * 2014-08-19 2014-12-03 浪潮(北京)电子信息产业有限公司 Log management method
CN104616092B (en) * 2014-12-16 2019-10-25 国家电网公司 A kind of behavior pattern processing method based on distributed information log analysis

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107241231A (en) * 2017-07-26 2017-10-10 成都科来软件有限公司 A fast and accurate method for locating raw network data packets
CN107241231B (en) * 2017-07-26 2020-04-03 成都科来软件有限公司 Rapid and accurate positioning method for original network data packet
CN112287208A (en) * 2019-09-30 2021-01-29 北京沃东天骏信息技术有限公司 User portrait generation method and device, electronic equipment and storage medium
CN112287208B (en) * 2019-09-30 2024-03-01 北京沃东天骏信息技术有限公司 User portrait generation method, device, electronic equipment and storage medium
CN111597179A (en) * 2020-05-18 2020-08-28 北京思特奇信息技术股份有限公司 Method and device for automatically cleaning data, electronic equipment and storage medium
CN111597179B (en) * 2020-05-18 2023-12-05 北京思特奇信息技术股份有限公司 Method and device for automatically cleaning data, electronic equipment and storage medium
CN112632020A (en) * 2020-12-25 2021-04-09 中国电子科技集团公司第三十研究所 Log information type extraction method and mining method based on spark big data platform

Also Published As

Publication number Publication date
CN106815274B (en) 2022-02-18
CN106815274A (en) 2017-06-09

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 16869755

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 16869755

Country of ref document: EP

Kind code of ref document: A1