CN106815274A - Hadoop-based log data mining method and system - Google Patents

Publication number
CN106815274A
Authority
CN
China
Prior art keywords
log
log data
user
data set
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510875453.3A
Other languages
Chinese (zh)
Other versions
CN106815274B (en
Inventor
惠羿
熊伟
哈景楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN201510875453.3A priority Critical patent/CN106815274B/en
Priority to PCT/CN2016/097363 priority patent/WO2017092444A1/en
Publication of CN106815274A publication Critical patent/CN106815274A/en
Application granted granted Critical
Publication of CN106815274B publication Critical patent/CN106815274B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a Hadoop-based log data mining method. A first log data set acquired in the current time period is saved into a Hadoop database. If the number of first log data sets saved in the Hadoop database reaches a preset value, a preset parallel computing model performs parallel aggregation on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is then divided by dimension, according to the dimensions of that log data, and the resulting third log data sets corresponding to the different dimensions are saved into the Hadoop database. The invention also discloses a Hadoop-based log data mining system. The invention can mine massive data quickly and effectively, and meets the storage and computing demands of mining massive data.

Description

Hadoop-based log data mining method and system
Technical field
The present invention relates to the field of computer data processing, and in particular to a Hadoop-based log data mining method and system.
Background
Since the arrival of the Internet era, quickly finding more suitable, quantifiable and predictable precision marketing strategies within an ever-expanding mass of user information has become a core demand of many enterprises, including operators.
However, traditional databases are limited in computing capability and expensive in storage cost, and so cannot meet the demands of mining massive data.
The above content is only intended to aid understanding of the technical solution of the present invention, and does not constitute an admission that it is prior art.
Summary of the invention
The main object of the present invention is to provide a Hadoop-based log data mining method and system, aiming to solve the technical problem that traditional databases are limited in computing capability and expensive in storage cost, and therefore cannot support the mining of massive data.
To achieve the above object, the present invention provides a Hadoop-based log data mining method, including:
Saving a first log data set acquired in the current time period into a Hadoop database;
If the number of first log data sets saved in the Hadoop database reaches a preset value, performing parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model, to obtain a second log data set;
Dividing the log data in the second log data set by dimension, according to the dimensions of the log data in the second log data set, and saving the resulting third log data sets corresponding to the different dimensions into the Hadoop database.
Preferably, the method further includes:
Acquiring log data in the current time period from the network side;
Aggregating the log data in the current time period to obtain the first log data set in the current time period.
Preferably, after the step of acquiring log data in the current time period from the network side, the method further includes:
Performing data cleaning on the log data in the current time period to obtain cleaned log data in the current time period;
The step of aggregating the log data in the current time period to obtain the first log data set in the current time period then includes:
Aggregating the cleaned log data in the current time period to obtain the first log data set in the current time period.
Preferably, the method further includes:
If a data query instruction is received, reading, from the Hadoop database, the third log data set corresponding to the query dimension contained in the data query instruction;
Performing data analysis on the third log data set, and displaying the result of the data analysis on a display interface.
Preferably, performing data analysis on the third log data set includes:
Grouping the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
Obtaining, from the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, where the user dimensions are preset and the level configuration table contains the levels determined by grading the users in the user grouping list according to the user dimensions.
To achieve the above object, the present invention also provides a Hadoop-based log data mining system, including:
A first saving module, configured to save a first log data set acquired in the current time period into a Hadoop database;
A parallel aggregation module, configured to, if the number of first log data sets saved in the Hadoop database reaches a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model, to obtain a second log data set;
A division and saving module, configured to divide the log data in the second log data set by dimension, according to the dimensions of the log data in the second log data set, and to save the resulting third log data sets corresponding to the different dimensions into the Hadoop database.
Preferably, the system further includes:
An acquisition module, configured to acquire log data in the current time period from the network side;
A first aggregation module, configured to aggregate the log data in the current time period to obtain the first log data set in the current time period.
Preferably, the system further includes a cleaning module;
The cleaning module is configured to, after the acquisition module acquires the log data in the current time period, perform data cleaning on the log data in the current time period to obtain cleaned log data in the current time period;
The first aggregation module is then specifically configured to aggregate the cleaned log data in the current time period to obtain the first log data set in the current time period.
Preferably, the system further includes:
A reading module, configured to, if a data query instruction is received, read from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
An analysis module, configured to perform data analysis on the third log data set and to display the result of the data analysis on a display interface.
Preferably, the analysis module includes:
A clustering module, configured to group the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
An obtaining and display module, configured to obtain, from the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, where the user dimensions are preset and the level configuration table contains the levels determined by grading the users in the user grouping list according to the user dimensions.
The present invention provides a Hadoop-based log data mining method. A first log data set acquired in the current time period is saved into a Hadoop database. If the number of first log data sets saved in the Hadoop database reaches a preset value, a preset parallel computing model performs parallel aggregation on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is divided by dimension according to the dimensions of that log data, and the resulting third log data sets corresponding to the different dimensions are saved into the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in the Hadoop database in a distributed manner and processing it with the parallel computing model makes it possible to mine massive data quickly and effectively, and meets the storage and computing demands of mining massive data.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the Hadoop-based log data mining method in the first embodiment of the present invention;
Fig. 2 is a schematic flowchart of the additional steps before step 101 of the first embodiment shown in Fig. 1;
Fig. 3 is a schematic flowchart of the additional steps after step 103 of the first embodiment shown in Fig. 1;
Fig. 4 is a schematic diagram of the functional modules of the Hadoop-based log data mining system in the second embodiment of the present invention;
Fig. 5 is a schematic diagram of additional functional modules in the second embodiment shown in Fig. 4;
Fig. 6 is a schematic diagram of additional functional modules in the second embodiment shown in Fig. 4.
The realization of the object, the functional features and the advantages of the present invention are further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The present invention provides a Hadoop-based log data mining method. A first log data set acquired in the current time period is saved into a Hadoop database. If the number of first log data sets saved in the Hadoop database reaches a preset value, a preset parallel computing model performs parallel aggregation on the first log data sets in the Hadoop database to obtain a second log data set. The log data in the second log data set is divided by dimension according to the dimensions of that log data, and the resulting third log data sets corresponding to the different dimensions are saved into the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in the Hadoop database in a distributed manner and processing it with the preset parallel computing model in Hadoop makes it possible to mine massive data quickly and effectively, and meets the storage and computing demands of mining massive data.
Referring to Fig. 1, a schematic flowchart of the Hadoop-based log data mining method in the first embodiment of the present invention, the method includes:
Step 101: save the first log data set acquired in the current time period into the Hadoop database;
In this embodiment of the present invention, the Hadoop-based log data mining method may be applied in a Hadoop-based log data mining system (hereinafter: the mining system). The mining system saves the first log data set acquired in the current time period into the Hadoop database.
The mining system acquires first log data sets per time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set for the current 15-minute period, or the first log data set for the current 30-minute period.
The time period is the cycle at which data is acquired, and its duration can be determined according to the volume of data.
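As an illustration of acquiring data per fixed time period, the following minimal Python sketch maps a log timestamp to the start of the period that contains it. The function name `window_start` and the choice to align periods to midnight are assumptions made for this example; the embodiment only states that the period may be, for instance, 15 or 30 minutes.

```python
from datetime import datetime, timedelta


def window_start(ts: datetime, minutes: int = 15) -> datetime:
    """Return the start of the fixed-length period that contains ts.

    Periods are aligned to midnight (an assumption of this sketch).
    """
    day_start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - day_start).total_seconds() // 60)  # whole minutes since midnight
    return day_start + timedelta(minutes=(elapsed // minutes) * minutes)
```

Log records whose timestamps map to the same period start would then belong to the same first log data set.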
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework is the Hadoop database and the parallel computing model: the Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation over massive data.
Preferably, the parallel computing model is the MapReduce computing model.
Step 102: if the number of first log data sets saved in the Hadoop database reaches a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model, to obtain a second log data set;
In this embodiment of the present invention, the mining system saves the first log data set acquired in each time period into the Hadoop database. If the number of first log data sets saved in the Hadoop database reaches the preset value, the mining system uses the preset parallel computing model of the Hadoop framework to aggregate the first log data sets in the Hadoop database, obtaining the second log data set.
In practice, the preset value can be chosen according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour need to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day need to be aggregated, the preset value is 48.
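The two preset values above follow from dividing the aggregation span by the time period (60 / 15 = 4 and 1440 / 30 = 48). The following Python sketch makes that arithmetic explicit; the helper name `preset_count` is hypothetical and does not appear in the original text.

```python
def preset_count(period_minutes: int, span_minutes: int) -> int:
    """Number of first log data sets that must accumulate before aggregation."""
    if period_minutes <= 0 or span_minutes % period_minutes != 0:
        # This sketch only handles spans that are whole multiples of the period.
        raise ValueError("span must be a positive multiple of the period")
    return span_minutes // period_minutes
```

For example, `preset_count(15, 60)` gives 4 and `preset_count(30, 24 * 60)` gives 48, matching the two cases in the text.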
It can be understood that, building on the above aggregation, the mining system can obtain log data sets for different time cycles in a similar way. For example, four first log data sets of 15 minutes each yield the log data set for one hour, 24 hourly log data sets yield the log data set for one day, 30 daily log data sets yield the log data set for one month, and so on, so that log data sets for different time spans can be obtained to meet different demands.
In this embodiment of the present invention, when the mining system performs parallel aggregation using the preset parallel computing model, it accumulates the count values of identical log data.
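The accumulation of count values for identical log data can be sketched as follows. This is a single-process Python stand-in for the reduce phase, not an actual Hadoop MapReduce job; representing each first log data set as a mapping from a log key to its count is an assumption of the example.

```python
from collections import Counter


def aggregate_sets(first_sets):
    """Merge per-period count tables by summing the counts of identical keys."""
    total = Counter()
    for period_counts in first_sets:
        total.update(period_counts)  # adds counts key by key
    return dict(total)
```

Applying the same function to hourly results would give a daily set, and to daily results a monthly set, which mirrors the roll-up described above.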
Step 103: divide the log data in the second log data set by dimension, according to the dimensions of the log data in the second log data set, and save the resulting third log data sets corresponding to the different dimensions into the Hadoop database.
In this embodiment of the present invention, after obtaining the second log data set, the mining system divides the log data in the second log data set by dimension, according to the dimensions of that log data, and saves the resulting third log data sets corresponding to the different dimensions into the Hadoop database, thereby mining the massive log data. The saved third log data sets can serve as the data source for user data queries, supporting chart queries, graphical queries and multi-dimensional queries on the display interface, so that the data is displayed from multiple angles and the results of the data mining are presented.
Log data has many dimensions, including but not limited to Internet content, Internet access location and Internet access time. Internet content refers to what the user browses, which may be a specific site, for example Baidu, Sohu or Sina Weibo, or a category of sites, for example music or film. Internet access location refers to the geographical range of the IP address the user uses, and Internet access time refers to the time at which the log data was generated. Dimensions are divided according to the requirements of the system, and the data within the dimensions is used to further characterize the overall behavior of users. It should be noted that different types of log data have different dimensions. For example, when the technical solution of this embodiment is used to mine the traffic data of users in the log data, the dimensions may include, in addition to Internet content, Internet access location and Internet access time, the frequency of Internet access, user age, monthly consumption and so on. In practice, therefore, dimensions can be divided according to specific needs, and no limitation is imposed here.
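A minimal Python sketch of the dimension division in step 103 follows. The record layout (fields `content`, `location`, `hour` and `count`) and the three dimensions chosen are assumptions for illustration; the text leaves the actual dimensions to the requirements of the system.

```python
from collections import defaultdict

# Hypothetical dimensions for this sketch: Internet content, access location, access hour.
DIMENSIONS = ("content", "location", "hour")


def divide_by_dimension(second_set):
    """Build one count table (a third log data set) per dimension."""
    third_sets = {dim: defaultdict(int) for dim in DIMENSIONS}
    for record in second_set:
        for dim in DIMENSIONS:
            third_sets[dim][record[dim]] += record["count"]
    return {dim: dict(table) for dim, table in third_sets.items()}
```

Each returned table corresponds to one dimension and could be stored separately so that a later query reads only the dimension it needs.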
Preferably, in this embodiment of the present invention, after saving the third log data sets corresponding to the different dimensions into the Hadoop database, the mining system may also save them into a columnar storage array, so that the Hadoop database and the columnar storage array work together and can meet the data demands of different application scenarios.
Preferably, because the mining system performs the above parallel aggregation and dimension division only when the number of first log data sets saved in the Hadoop database reaches the preset value, each third log data set obtained also corresponds to a time period. When storing, the mining system can therefore save the correspondence among the dimension, the time period and the third log data set.
In this embodiment of the present invention, the mining system saves the first log data set acquired in the current time period into the Hadoop database. If the number of first log data sets saved in the Hadoop database reaches the preset value, the preset parallel computing model performs parallel aggregation on the first log data sets in the Hadoop database to obtain the second log data set. The log data in the second log data set is divided by dimension according to the dimensions of that log data, and the resulting third log data sets corresponding to the different dimensions are saved into the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in the Hadoop database in a distributed manner and processing it with the parallel computing model in Hadoop makes it possible to mine massive data quickly and effectively, and meets the storage and computing demands of mining massive data.
Referring to Fig. 2, a schematic flowchart of the additional steps before step 101 in the first embodiment shown in Fig. 1 of the present invention, the steps include:
Step 201: acquire log data in the current time period from the network side;
In this embodiment of the present invention, the mining system acquires the log data in the current time period from the network side. Specifically, the mining system may acquire it by extracting log data from the network side; or by using web crawler technology; or from the BOSS billing database on the network side; or by receiving it from a third-party vendor on the network side; or by combining at least two of the above ways.
Step 202: aggregate the log data in the current time period to obtain the first log data set in the current time period.
In this embodiment of the present invention, after acquiring the log data in the current time period, the mining system aggregates it to obtain the first log data set of the current time period.
The aggregation in step 202 may classify log data according to its content: log data with identical content, or with content belonging to the same category, is treated as one record whose count is accumulated. After aggregation, the order of magnitude of the first log data set is far below that of the log data acquired in the current time period, while the meaning of the data is fully preserved.
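The aggregation in step 202 can be sketched in Python as counting identical records. Treating the pair of user and browsed content as the identity of a record is an assumption made for this example; the text only requires that identical or same-category content be accumulated as one record.

```python
from collections import Counter


def aggregate_period(raw_records):
    """Collapse raw log records into (user, content) -> occurrence count."""
    return Counter((rec["user"], rec["content"]) for rec in raw_records)
```

With five raw records covering two distinct (user, content) pairs, the result holds only two entries, which illustrates how aggregation reduces the volume to be stored while keeping the counts.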
In this embodiment of the present invention, the mining system obtains the first log data set through the additional steps shown in Fig. 2. By aggregating the log data acquired from the network side in the current time period, it effectively reduces the order of magnitude of the log data, so that less storage space is required in the Hadoop database.
Preferably, in this embodiment of the present invention, the mining system may also perform the following step before performing step 202:
Performing data cleaning on the log data in the current time period to obtain cleaned log data in the current time period;
In this embodiment of the present invention, before aggregating the log data acquired in the current time period, the mining system may also clean that log data, obtaining cleaned log data in the current time period.
If the mining system performs the above step, step 202 needs to be adapted accordingly, as follows:
Aggregating the cleaned log data in the current time period to obtain the first log data set in the current time period.
Cleaning the log data may consist of removing log data that does not match the preset data types, and/or finding recognizable errors in the log data and correcting or deleting the log data in which such errors exist.
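The cleaning rules can be sketched as below. The field names (`user`, `content`, `bytes`) and the specific rules, such as converting a numeric string to an integer and dropping negative values, are assumptions chosen to illustrate "remove records of the wrong type" and "correct or delete records with recognizable errors"; they are not taken from the original text.

```python
def clean(records):
    """Drop records with wrong types; repair or drop records with recognizable errors."""
    cleaned = []
    for rec in records:
        user, content, nbytes = rec.get("user"), rec.get("content"), rec.get("bytes")
        # Remove records whose fields do not have the expected data type.
        if not isinstance(user, str) or not isinstance(content, str):
            continue
        # Correct a recognizable error: a byte count stored as a numeric string.
        if isinstance(nbytes, str) and nbytes.strip().isdigit():
            nbytes = int(nbytes.strip())
        # Delete records whose error cannot be corrected.
        if not isinstance(nbytes, int) or isinstance(nbytes, bool) or nbytes < 0:
            continue
        cleaned.append({"user": user.strip(), "content": content.strip().lower(), "bytes": nbytes})
    return cleaned
```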
In this embodiment of the present invention, by cleaning the log data in the current time period, the mining system removes useless or erroneous log data, reduces the amount of log data to be processed, and makes the subsequent data mining more effective.
Referring to Fig. 3, a schematic flowchart of the additional steps after step 103 in the first embodiment shown in Fig. 1 of the present invention, the steps include:
Step 301: if a data query instruction is received, read, from the Hadoop database, the third log data set corresponding to the query dimension contained in the data query instruction;
In this embodiment of the present invention, after the mining system has saved the third log data sets into the Hadoop database, a user can request data by entering a data query instruction. If the mining system receives a data query instruction, it reads from the Hadoop database the third log data set corresponding to the query dimension contained in the instruction.
Preferably, the data query instruction may also contain a time period, in which case the mining system reads the third log data set corresponding to the query dimension within that time period.
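The lookup in step 301 relies on the stored correspondence among dimension, time period and third log data set. The following in-memory Python sketch stands in for the Hadoop database; the class `ThirdSetStore` and its methods are hypothetical names introduced only for this example.

```python
class ThirdSetStore:
    """In-memory stand-in for third log data sets keyed by (dimension, period)."""

    def __init__(self):
        self._sets = {}

    def save(self, dimension, period, data):
        self._sets[(dimension, period)] = data

    def query(self, dimension, period=None):
        """Return {period: data} for a dimension, optionally limited to one period."""
        return {
            p: data
            for (d, p), data in self._sets.items()
            if d == dimension and (period is None or p == period)
        }
```

A query that carries only a dimension returns every stored period for it, while a query that also carries a time period returns at most one entry.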
Step 302: perform data analysis on the third log data set, and display the result of the data analysis on a display interface.
In this embodiment of the present invention, the mining system also performs data analysis on the third log data set and displays the result on a display interface. Specifically, the mining system groups the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list; it then obtains, from the log data of the users in the user grouping list, a level configuration table corresponding to at least two user dimensions, and displays the level configuration table on the display interface. The user dimensions are preset, and the level configuration table contains the levels determined by grading the users in the user grouping list according to the user dimensions.
The user dimensions can be divided into a transverse dimension and a longitudinal dimension, and users are rated under each. For example, suppose the user groups obtained by the mining system include an all-users group and a microblog users group. All users in the all-users group are ranked by the amount of traffic they use: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of each user in the all-users group. This is the transverse dimension rating. The users in the microblog users group are ranked by the amount of traffic generated after they open the microblog: the top 20% are five-star users, those ranked between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of each user in the microblog users group. This is the longitudinal dimension rating. Together, the transverse and longitudinal ratings produce a portrait of each user group, so that business experts can devise targeted plans for the portrait of a specific group.
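The 20% banding described above can be sketched in Python as follows. Extending the pattern to three-, two- and one-star bands for the remaining 60% of users is an assumption read from "and so on", and the function name `star_ratings` is hypothetical.

```python
def star_ratings(traffic):
    """Map each user to 1-5 stars by rank of traffic, in bands of 20%."""
    ranked = sorted(traffic, key=traffic.get, reverse=True)  # heaviest users first
    n = len(ranked)
    stars = {}
    for i, user in enumerate(ranked):
        # i / n is the fraction of users ranked above this one; each 20% band costs one star.
        stars[user] = 5 - (i * 5) // n
    return stars
```

Running the same function on the traffic of the all-users group and on the microblog traffic of the microblog users group would give the transverse and longitudinal ratings respectively.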
Preferably, the preset clustering algorithm may be the K-means algorithm.
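For reference, a minimal one-dimensional K-means (Lloyd's algorithm) in plain Python is sketched below. Clustering users on a single scalar feature such as traffic, the deterministic initialisation, and the fixed iteration count are all assumptions of this sketch; the text names only the algorithm, not its features or parameters.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on scalar values; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic initialisation: k evenly spaced order statistics.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels
```

Users that share a label would form one entry of the user grouping list. A production system would more likely cluster on several features with a library implementation.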
The query dimension is set on the basis of the dimensions of the third log data sets saved in the Hadoop database. For example, the query dimension may be any one or more of Internet content, Internet access time, Internet access location and so on.
In this embodiment of the present invention, the mining system reads from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction, performs data analysis on that third log data set, and displays the result on the display interface, so that the results of the data mining are presented effectively to the user.
It should be noted that, in this embodiment of the present invention, the Hadoop-based log data mining method can be applied in a precision marketing system for traffic data. For example, the technical solutions described in the embodiments shown in Fig. 1 to Fig. 3 can be used to mine target users, to mine marketing site selection and so on, providing operators with a data basis for targeted, fine-grained marketing to target users or target base station cells.
If target users need to be determined, the query dimension in step 301 of the embodiment shown in Fig. 3 may be Internet content or Internet traffic; if target base station cells need to be determined, the query dimension may be Internet access location.
In practice, the user can select the query dimension according to specific needs, and no limitation is imposed here.
Referring to Fig. 4, a schematic diagram of the functional modules of the Hadoop-based log data mining system in the second embodiment of the present invention, the system includes:
A first saving module 401, configured to save the first log data set acquired in the current time period into the Hadoop database;
The mining system acquires first log data sets per time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set for the current 15-minute period, or the first log data set for the current 30-minute period.
The time period is the cycle at which data is acquired, and its duration can be determined according to the volume of data.
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework is the Hadoop database and the parallel computing model: the Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation over massive data.
Preferably, the parallel computing model is the MapReduce computing model.
A parallel aggregation module 402, configured to, if the number of first log data sets saved in the Hadoop database reaches a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model, to obtain a second log data set;
In practical applications, the preset value can be set according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour need to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day need to be aggregated, the preset value is 48.
It can be understood that, based on the above aggregation, the parallel aggregation module 402 can obtain log data sets for other time spans in a similar way. For example, four first log data sets with a 15-minute time period yield the log data set for one hour; 24 one-hour log data sets yield the log data set for one day; 30 one-day log data sets yield the log data set for one month; and so on. Log data sets for different time spans can thus be obtained to meet different needs.
A division and saving module 403, configured to divide the log data in the second log data set by dimension according to the dimensions of the log data in the second log data set, and to save the resulting third log data sets, one for each dimension, into the Hadoop database.
Log data has many dimensions, including but not limited to browsing content, browsing location, and browsing time. Browsing content refers to where the user browses: it may be a specific site, such as Baidu, Sohu, or Sina Weibo, or a category of sites, such as music or film. Browsing location refers to the geographic area of the IP address the user uses, and browsing time refers to the time at which the log data was generated. Dimensions are divided according to the requirements of the system, and the data in each dimension is used to further characterize overall user behavior. It should be noted that different types of log data have different dimensions. For example, when the technical solution of the embodiments of the present invention is used to mine the users' traffic data in the log data, the dimensions may include, in addition to browsing content, browsing location, and browsing time, browsing frequency, user age, monthly spending, and so on. In practical applications, dimensions can therefore be divided according to specific needs, which is not limited here.
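One way to picture the dimension division is to build a separate aggregated view of the second log data set for each dimension. The sketch below assumes records are dictionaries with hypothetical field names (`user`, `content`, `location`, `hour`, `count`); the patent does not specify a record layout, so these are illustrative only.

```python
from collections import defaultdict

DIMENSIONS = ("content", "location", "hour")  # hypothetical field names


def divide_by_dimension(records, dimensions=DIMENSIONS):
    """Build one aggregated view per dimension from the second log data set."""
    views = {dim: defaultdict(int) for dim in dimensions}
    for record in records:
        for dim in dimensions:
            # Key each view by (user, value of this dimension).
            views[dim][(record["user"], record[dim])] += record["count"]
    return {dim: dict(view) for dim, view in views.items()}


second_set = [
    {"user": "u1", "content": "music", "location": "cell-A", "hour": 9, "count": 3},
    {"user": "u1", "content": "music", "location": "cell-B", "hour": 9, "count": 2},
    {"user": "u2", "content": "film", "location": "cell-A", "hour": 21, "count": 5},
]
third_sets = divide_by_dimension(second_set)
```

Each entry of `third_sets` plays the role of one third log data set; in the described system each would be written to the Hadoop database rather than kept in memory.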
Preferably, in embodiments of the present invention, after saving the third log data sets corresponding to the different dimensions into the Hadoop database, the mining system can also save them into a row storage array, so that the Hadoop database and the row storage array work together to meet the data needs of different application scenarios.
In this embodiment of the present invention, the first saving module 401 saves the obtained first log data set for the current time period into the Hadoop database. If the number of first log data sets saved in the Hadoop database reaches the preset value, the parallel aggregation module 402 performs parallel aggregation on the first log data sets in the Hadoop database using the preset parallel computing model, to obtain the second log data set. Finally, the division and saving module 403 divides the log data in the second log data set by dimension, according to the dimensions of that log data, and saves the resulting third log data sets, one for each dimension, into the Hadoop database.
In this embodiment of the present invention, the mining system saves the obtained first log data set for the current time period into the Hadoop database. If the number of first log data sets saved in the Hadoop database reaches the preset value, the system performs parallel aggregation on those first log data sets using the parallel computing model in Hadoop, to obtain the second log data set. It then divides the log data in the second log data set by dimension and saves the resulting third log data sets, one for each dimension, into the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in it in a distributed manner and processing the data with Hadoop's parallel computing model allows massive data to be mined quickly and effectively, and meets the storage and computing demands of mining massive data.
Referring to Fig. 5, which is a schematic diagram of the functional modules added to the second embodiment shown in Fig. 4, the added modules include:
An acquisition module 501, configured to obtain the log data for the current time period from the network side;
In this embodiment of the present invention, the acquisition module 501 obtains the log data for the current time period from the network side. Specifically, the acquisition module 501 may obtain it by extracting log data from the network side; by using web crawler technology; from the BOSS billing database on the network side; by receiving log data for the current time period provided by a third-party vendor on the network side; or by combining at least two of these approaches.
A first aggregation module 502, configured to aggregate the log data for the current time period, to obtain the first log data set for the current time period.
The first aggregation module 502 can classify the log data by content: entries with identical content, or with content of the same category, are accumulated into a single record together with a count. The first log data set obtained after aggregation is far smaller, by orders of magnitude, than the log data obtained for the current time period, while the meaning of the data is fully preserved.
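The per-period aggregation, in which entries with the same content or the same content category collapse into one record with a count, might look like the following. The site names, the `CATEGORY_BY_SITE` mapping, and the fallback of using the site itself as the category are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mapping from site to content category.
CATEGORY_BY_SITE = {
    "music.example": "music",
    "songs.example": "music",
    "film.example": "film",
}


def aggregate_period(raw_records, category_by_site=CATEGORY_BY_SITE):
    """Collapse raw (user, site) entries into counts per (user, category)."""
    counts = Counter()
    for user, site in raw_records:
        # Unknown sites are kept under their own name (an assumption).
        category = category_by_site.get(site, site)
        counts[(user, category)] += 1
    return counts


raw = [
    ("u1", "music.example"),
    ("u1", "songs.example"),
    ("u1", "music.example"),
    ("u2", "film.example"),
    ("u2", "news.example"),
]
first_set = aggregate_period(raw)
```

Five raw entries become three aggregated records here; on real traffic logs, where the same user repeatedly hits the same category within a period, the reduction would typically be much larger.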
In embodiments of the present invention, the mining system may begin executing the first saving module 401 of the embodiment shown in Fig. 4 once the first aggregation module 502 has run.
In embodiments of the present invention, the system further includes a cleaning module 503;
The cleaning module 503 is configured to, after the acquisition module 501 obtains the log data for the current time period, perform data cleaning on that log data, to obtain cleaned log data for the current time period;
If the mining system executes the cleaning module 503, the first aggregation module 502 is specifically configured to aggregate the cleaned log data for the current time period, to obtain the first log data set for the current time period.
In embodiments of the present invention, the mining system obtains the first log data set through the additional steps shown in Fig. 2. Aggregating the log data obtained from the network side for the current time period effectively reduces the order of magnitude of the log data, so that less storage space is needed in the Hadoop database. The mining system can also clean the log data for the current time period, removing useless or erroneous log data, which reduces the amount of log data to process and makes subsequent data mining easier.
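The patent does not say which records count as useless or erroneous, so the cleaning rules in this sketch are assumptions: it treats each raw line as a hypothetical `user,site,bytes` triple and drops lines with the wrong number of fields, an empty user or site, or a byte count that is not a non-negative integer.

```python
def clean_records(raw_lines):
    """Keep only well-formed 'user,site,bytes' lines (assumed format)."""
    cleaned = []
    for line in raw_lines:
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue  # blank line or wrong number of fields
        user, site, size = parts
        if not user or not site:
            continue  # missing identifier
        if not size.isdigit():
            continue  # non-numeric or negative byte count
        cleaned.append((user, site, int(size)))
    return cleaned


raw_lines = [
    "u1,music.example,120",
    "",
    "u2,film.example,abc",
    ",news.example,10",
    "u3,film.example,300,extra",
    "u2,film.example,80",
]
cleaned = clean_records(raw_lines)
```

Running the cleaning step before aggregation means malformed lines never reach the counting stage, which matches the ordering described for the cleaning module and the first aggregation module.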
Referring to Fig. 6, which is a schematic diagram of the functional modules added to the second embodiment shown in Fig. 4, the added modules include:
A reading module 601, configured to, if a data query instruction is received, read from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
An analysis module 602, configured to perform data analysis on the third log data set and to show the result of the data analysis on a display interface.
The analysis module 602 includes:
A clustering module 603, configured to group the users in the third log data set according to a preset clustering algorithm, to obtain a user group list;
An obtaining and display module 604, configured to obtain, from the log data of the users in the user group list, level configuration tables corresponding to at least two user dimensions, and to show the level configuration tables on the display interface. The user dimensions are preset, and each level configuration table contains the levels determined by grading the users in the user group list according to the corresponding user dimension.
The user dimensions can be divided into a horizontal dimension and a vertical dimension, and users are rated separately under each. For example, suppose the user groups obtained by the mining system include an all-users group and a microblog users group. For the all-users group, all users in the group are ranked by the amount of traffic they use: the top 20% are five-star users, those ranked between 20% and 40% are four-star users, and so on, until every user in the all-users group has a star rating. This is the horizontal-dimension rating. For the microblog users group, users are ranked by the amount of traffic they generate after launching the microblog service: the top 20% are five-star users, those ranked between 20% and 40% are four-star users, and so on, until every user in the microblog users group has a star rating. This is the vertical-dimension rating. Together, the horizontal and vertical ratings make it possible to present a profile of each user group, so that business experts can devise targeted plans for a specific group profile.
Preferably, the preset clustering algorithm may be the K-means algorithm.
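As an illustration of how K-means could group users, the sketch below runs plain Lloyd's algorithm on two hypothetical per-user features (daily traffic and sessions per day). The feature choice, the data, and the deterministic seeding with the first k points are assumptions made so the example is reproducible; a production system would more likely use a library implementation with a better initialization.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids


# Hypothetical features per user: (daily traffic in MB, sessions per day).
features = {
    "u1": (10.0, 1.0),
    "u2": (200.0, 9.0),
    "u3": (12.0, 2.0),
    "u4": (220.0, 10.0),
    "u5": (11.0, 1.0),
}
users = list(features)
assignment, _ = kmeans([features[u] for u in users], k=2)

groups = {}
for user, cluster in zip(users, assignment):
    groups.setdefault(cluster, []).append(user)
```

The resulting `groups` mapping corresponds to the user group list: light users and heavy users end up in separate groups, and each group can then be rated along the preset user dimensions.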
The query dimension is set based on the dimensions of the third log data sets stored in the Hadoop database. For example, the query dimension may be any one or more of browsing content, browsing time, browsing location, and so on.
In this embodiment of the present invention, the mining system reads from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction, performs data analysis on that third log data set, and shows the analysis result on a display interface, so that the results of data mining can be presented to the user effectively.
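The read-then-analyze flow can be sketched with an in-memory dictionary standing in for the stored per-dimension sets. Everything here is hypothetical: `STORE`, `read_sets`, `totals_per_value`, and the sample data are illustrative, and a real deployment would read from the Hadoop database instead.

```python
from collections import Counter

# Hypothetical stand-in for the per-dimension sets kept in the Hadoop database.
STORE = {
    "content": {("u1", "music"): 5, ("u2", "film"): 5, ("u3", "music"): 2},
    "location": {("u1", "cell-A"): 3, ("u2", "cell-A"): 5, ("u3", "cell-B"): 2},
}


def read_sets(query_dimensions, store=STORE):
    """Return the stored set for each dimension named in the query."""
    missing = [dim for dim in query_dimensions if dim not in store]
    if missing:
        raise KeyError(f"no stored set for dimensions: {missing}")
    return {dim: store[dim] for dim in query_dimensions}


def totals_per_value(dimension_set):
    """Simple analysis: total count per dimension value across all users."""
    totals = Counter()
    for (_user, value), count in dimension_set.items():
        totals[value] += count
    return totals


# A query on the location dimension, e.g. when looking for a target cell.
selected = read_sets(["location"])
location_totals = totals_per_value(selected["location"])
busiest = location_totals.most_common(1)
```

Because each dimension is stored separately, a query only touches the sets it names, which is the practical benefit of dividing the data by dimension before saving it.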
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A Hadoop-based log data mining method, characterized by comprising:
saving an obtained first log data set for a current time period into a Hadoop database;
if the number of first log data sets saved in the Hadoop database reaches a preset value, performing parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model, to obtain a second log data set;
dividing the log data in the second log data set by dimension according to the dimensions of the log data in the second log data set, and saving the resulting third log data sets corresponding to the different dimensions into the Hadoop database.
2. The method according to claim 1, characterized in that the method further comprises:
obtaining log data for the current time period from a network side;
aggregating the log data for the current time period, to obtain the first log data set for the current time period.
3. The method according to claim 2, characterized in that, after the step of obtaining log data for the current time period from the network side, the method further comprises:
performing data cleaning on the log data for the current time period, to obtain cleaned log data for the current time period;
wherein the step of aggregating the log data for the current time period to obtain the first log data set for the current time period comprises:
aggregating the cleaned log data for the current time period, to obtain the first log data set for the current time period.
4. The method according to any one of claims 1 to 3, characterized in that the method further comprises:
if a data query instruction is received, reading from the Hadoop database the third log data set corresponding to a query dimension contained in the data query instruction;
performing data analysis on the third log data set, and showing the result of the data analysis on a display interface.
5. The method according to claim 4, characterized in that performing data analysis on the third log data set and showing the result of the data analysis on a display interface comprises:
grouping the users in the third log data set according to a preset clustering algorithm, to obtain a user group list;
obtaining, from the log data of the users in the user group list, level configuration tables corresponding to at least two user dimensions, and showing the level configuration tables on the display interface; wherein the user dimensions are preset, and each level configuration table contains the levels determined by grading the users in the user group list according to the corresponding user dimension.
6. A Hadoop-based log data mining system, characterized by comprising:
a first saving module, configured to save an obtained first log data set for a current time period into a Hadoop database;
a parallel aggregation module, configured to, if the number of first log data sets saved in the Hadoop database reaches a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model, to obtain a second log data set;
a division and saving module, configured to divide the log data in the second log data set by dimension according to the dimensions of the log data in the second log data set, and to save the resulting third log data sets corresponding to the different dimensions into the Hadoop database.
7. The system according to claim 6, characterized in that the system further comprises:
an acquisition module, configured to obtain log data for the current time period from a network side;
a first aggregation module, configured to aggregate the log data for the current time period, to obtain the first log data set for the current time period.
8. The system according to claim 7, characterized in that the system further comprises a cleaning module;
the cleaning module is configured to, after the acquisition module obtains the log data for the current time period, perform data cleaning on the log data for the current time period, to obtain cleaned log data for the current time period;
and the first aggregation module is specifically configured to aggregate the cleaned log data for the current time period, to obtain the first log data set for the current time period.
9. The system according to any one of claims 6 to 8, characterized in that the system further comprises:
a reading module, configured to, if a data query instruction is received, read from the Hadoop database the third log data set corresponding to a query dimension contained in the data query instruction;
an analysis module, configured to perform data analysis on the third log data set and to show the result of the data analysis on a display interface.
10. The system according to claim 9, characterized in that the analysis module comprises:
a clustering module, configured to group the users in the third log data set according to a preset clustering algorithm, to obtain a user group list;
an obtaining and display module, configured to obtain, from the log data of the users in the user group list, level configuration tables corresponding to at least two user dimensions, and to show the level configuration tables on the display interface; wherein the user dimensions are preset, and each level configuration table contains the levels determined by grading the users in the user group list according to the corresponding user dimension.
CN201510875453.3A 2015-12-02 2015-12-02 Hadoop-based log data mining method and system Active CN106815274B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510875453.3A CN106815274B (en) 2015-12-02 2015-12-02 Hadoop-based log data mining method and system
PCT/CN2016/097363 WO2017092444A1 (en) 2015-12-02 2016-08-30 Log data mining method and system based on hadoop

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510875453.3A CN106815274B (en) 2015-12-02 2015-12-02 Hadoop-based log data mining method and system

Publications (2)

Publication Number Publication Date
CN106815274A true CN106815274A (en) 2017-06-09
CN106815274B CN106815274B (en) 2022-02-18

Family

ID=58796202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510875453.3A Active CN106815274B (en) 2015-12-02 2015-12-02 Hadoop-based log data mining method and system

Country Status (2)

Country Link
CN (1) CN106815274B (en)
WO (1) WO2017092444A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391645A (en) * 2017-07-12 2017-11-24 广州市昊链信息科技股份有限公司 A kind of logistics information automatic push and practical operation specification form system and method

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107241231B (en) * 2017-07-26 2020-04-03 成都科来软件有限公司 Rapid and accurate positioning method for original network data packet
CN112287208B (en) * 2019-09-30 2024-03-01 北京沃东天骏信息技术有限公司 User portrait generation method, device, electronic equipment and storage medium
CN111597179B (en) * 2020-05-18 2023-12-05 北京思特奇信息技术股份有限公司 Method and device for automatically cleaning data, electronic equipment and storage medium
CN112632020B (en) * 2020-12-25 2022-03-18 中国电子科技集团公司第三十研究所 Log information type extraction method and mining method based on spark big data platform

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6732123B1 (en) * 1998-02-23 2004-05-04 International Business Machines Corporation Database recovery to any point in time in an online environment utilizing disaster recovery technology
US20070055687A1 (en) * 2005-09-02 2007-03-08 International Business Machines Corporation System and method for minimizing data outage time and data loss while handling errors detected during recovery
KR20090050405A (en) * 2007-11-15 2009-05-20 한국전자통신연구원 Method and apparatus for classifying user behaviors based on the event log generated from the context aware system environment
CN101483557A (en) * 2009-03-03 2009-07-15 中兴通讯股份有限公司 Log statistic, storing method and system used for deep packet detection apparatus
US20100306286A1 (en) * 2009-03-05 2010-12-02 Chi-Hsien Chiu Distributed steam processing
CN102685221A (en) * 2012-04-29 2012-09-19 华北电力大学(保定) Distributed storage and parallel mining method for state monitoring data
US20140304401A1 (en) * 2013-04-06 2014-10-09 Citrix Systems, Inc. Systems and methods to collect logs from multiple nodes in a cluster of load balancers
CN104182506A (en) * 2014-08-19 2014-12-03 浪潮(北京)电子信息产业有限公司 Log management method
CN104301360A (en) * 2013-07-19 2015-01-21 阿里巴巴集团控股有限公司 Method, log server and system for recording log data
US20150081668A1 (en) * 2013-09-13 2015-03-19 Nec Laboratories America, Inc. Systems and methods for tuning multi-store systems to speed up big data query workload
CN104616092A (en) * 2014-12-16 2015-05-13 国家电网公司 Distributed log analysis based distributed mode handling method
CN104969213A (en) * 2013-01-31 2015-10-07 脸谱公司 Data stream splitting for low-latency data access

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100481077C (en) * 2006-01-12 2009-04-22 国际商业机器公司 Visual method and device for strengthening search result guide
CN103036921B (en) * 2011-09-29 2015-09-23 北京新媒传信科技有限公司 A kind of user behavior analysis system and method
CN103955502B (en) * 2014-04-24 2017-07-28 科技谷(厦门)信息技术有限公司 A kind of visualization OLAP application realization method and system
CN104317958B (en) * 2014-11-12 2018-01-16 北京国双科技有限公司 A kind of real-time data processing method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391645A (en) * 2017-07-12 2017-11-24 广州市昊链信息科技股份有限公司 A kind of logistics information automatic push and practical operation specification form system and method
CN107391645B (en) * 2017-07-12 2018-04-10 广州市昊链信息科技股份有限公司 A kind of logistics information automatic push and practical operation specification form system and method

Also Published As

Publication number Publication date
CN106815274B (en) 2022-02-18
WO2017092444A1 (en) 2017-06-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant