CN106815274A - Hadoop-based log data mining method and system - Google Patents
- Publication number
- CN106815274A, CN201510875453.3A
- Authority
- CN
- China
- Prior art keywords
- daily record
- record data
- user
- data set
- hadoop
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a Hadoop-based log data mining method. A first log data set acquired in the current time period is saved to a Hadoop database. If the number of first log data sets saved in the Hadoop database meets a preset value, a preset parallel computing model is used to aggregate the first log data sets in the Hadoop database in parallel, yielding a second log data set. The log data in the second log data set are then divided by dimension according to their dimensions, and the resulting third log data sets, one for each dimension, are saved to the Hadoop database. The invention also discloses a Hadoop-based log data mining system. The invention can mine massive data quickly and effectively, and meets the storage and computing demands of mining massive data.
Description
Technical field
The present invention relates to the field of computer data processing, and in particular to a Hadoop-based log data mining method and system.
Background
In the Internet era, how to quickly find more suitable, quantifiable, and predictable precision marketing strategies within ever-growing masses of user information has become a core demand of many enterprises, including operators.
However, traditional databases are limited in computing capability and expensive in storage, and cannot meet the demands of mining massive data.
The above is provided only to aid understanding of the technical solution of the present invention and is not an admission that it constitutes prior art.
Summary of the invention
The main object of the present invention is to provide a Hadoop-based log data mining method and system, intended to solve the technical problem that traditional databases are limited in computing capability, expensive in storage, and unable to support the mining of massive data.
To achieve the above object, the present invention provides a Hadoop-based log data mining method, comprising:
saving a first log data set acquired in the current time period to a Hadoop database;
if the number of first log data sets saved in the Hadoop database meets a preset value, performing parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set;
dividing the log data in the second log data set by dimension according to the dimensions of that log data, and saving the resulting third log data sets, one for each dimension, to the Hadoop database.
Preferably, the method further comprises:
acquiring the log data of the current time period from the network side;
aggregating the log data of the current time period to obtain the first log data set of the current time period.
Preferably, after the step of acquiring the log data of the current time period from the network side, the method further comprises:
cleaning the log data of the current time period to obtain cleaned log data for the current time period;
in which case the step of aggregating the log data of the current time period to obtain the first log data set of the current time period comprises:
aggregating the cleaned log data of the current time period to obtain the first log data set of the current time period.
Preferably, the method further comprises:
if a data query instruction is received, reading from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
performing data analysis on the third log data set and displaying the result of the data analysis on a display interface.
Preferably, performing data analysis on the third log data set comprises:
grouping the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
obtaining, from the log data of the users in the user grouping list, level configuration tables corresponding to at least two user dimensions, where the user dimensions are preset and each level configuration table contains the levels assigned to the users in the user grouping list by grading them along the user dimension.
To achieve the above object, the present invention also provides a Hadoop-based log data mining system, comprising:
a first saving module, configured to save a first log data set acquired in the current time period to a Hadoop database;
a parallel aggregation module, configured to, if the number of first log data sets saved in the Hadoop database meets a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set;
a division and saving module, configured to divide the log data in the second log data set by dimension according to the dimensions of that log data, and to save the resulting third log data sets, one for each dimension, to the Hadoop database.
Preferably, the system further comprises:
an acquisition module, configured to acquire the log data of the current time period from the network side;
a first aggregation module, configured to aggregate the log data of the current time period to obtain the first log data set of the current time period.
Preferably, the system further comprises a cleaning module;
the cleaning module is configured to clean the log data of the current time period after the acquisition module has acquired it, obtaining cleaned log data for the current time period;
and the first aggregation module is specifically configured to aggregate the cleaned log data of the current time period to obtain the first log data set of the current time period.
Preferably, the system further comprises:
a reading module, configured to, if a data query instruction is received, read from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
an analysis module, configured to perform data analysis on the third log data set and display the result of the data analysis on a display interface.
Preferably, the analysis module comprises:
a clustering module, configured to group the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
an obtaining and display module, configured to obtain, from the log data of the users in the user grouping list, level configuration tables corresponding to at least two user dimensions, where the user dimensions are preset and each level configuration table contains the levels assigned to the users in the user grouping list by grading them along the user dimension.
The present invention provides a Hadoop-based log data mining method. The first log data set acquired in the current time period is saved to a Hadoop database. If the number of first log data sets saved in the Hadoop database meets a preset value, a preset parallel computing model is used to aggregate the first log data sets in the Hadoop database in parallel, yielding a second log data set. The log data in the second log data set are divided by dimension according to their dimensions, and the resulting third log data sets, one for each dimension, are saved to the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in a distributed manner in the Hadoop database and processing it with the parallel computing model makes it possible to mine massive data quickly and effectively, and meets the storage and computing demands of mining massive data.
Brief description of the drawings
Fig. 1 is a flowchart of the Hadoop-based log data mining method in the first embodiment of the present invention;
Fig. 2 is a flowchart of the additional steps before step 101 of the first embodiment shown in Fig. 1;
Fig. 3 is a flowchart of the additional steps after step 103 of the first embodiment shown in Fig. 1;
Fig. 4 is a schematic diagram of the functional modules of the Hadoop-based log data mining system in the second embodiment of the present invention;
Fig. 5 is a schematic diagram of additional functional modules in the second embodiment shown in Fig. 4;
Fig. 6 is a schematic diagram of additional functional modules in the second embodiment shown in Fig. 4.
The realization of the object, the functional characteristics, and the advantages of the present invention are further described below with reference to the drawings and in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The present invention provides a Hadoop-based log data mining method. The first log data set acquired in the current time period is saved to a Hadoop database. If the number of first log data sets saved in the Hadoop database meets a preset value, a preset parallel computing model is used to aggregate the first log data sets in the Hadoop database in parallel, yielding a second log data set. The log data in the second log data set are divided by dimension according to their dimensions, and the resulting third log data sets, one for each dimension, are saved to the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in a distributed manner in the Hadoop database and processing it with the parallel computing model preset in Hadoop makes it possible to mine massive data quickly and effectively, and meets the storage and computing demands of mining massive data.
Referring to Fig. 1, a flowchart of the Hadoop-based log data mining method in the first embodiment of the present invention, the method comprises:
Step 101: save the first log data set acquired in the current time period to a Hadoop database.
In this embodiment, the Hadoop-based log data mining method can be applied in a Hadoop-based log data mining system (hereinafter: the mining system). The mining system saves the first log data set acquired in the current time period to the Hadoop database.
The mining system acquires first log data sets by time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set of the current 15-minute period or of the current 30-minute period.
The time period is the cycle at which data is acquired, and its duration can be chosen according to the data volume.
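As an illustration of acquiring data by time period, the following minimal sketch floors a log timestamp to the start of the period that contains it. The function name `period_start` and the sample timestamp are assumptions made for this example; the patent does not prescribe how periods are delimited.

```python
from datetime import datetime, timedelta


def period_start(ts: datetime, period: timedelta) -> datetime:
    """Floor a timestamp to the start of its time period within the day.

    Hypothetical helper: periods are assumed to be aligned to midnight.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = ts - midnight
    # Integer number of whole periods since midnight, scaled back to a duration.
    return midnight + (elapsed // period) * period


stamp = datetime(2015, 12, 1, 9, 47, 12)  # sample log timestamp (made up)
quarter = period_start(stamp, timedelta(minutes=15))
half = period_start(stamp, timedelta(minutes=30))
```

All log records whose timestamps map to the same period start would then belong to the same first log data set.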
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework is the Hadoop database and the parallel computing model: the Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation for massive data.
Preferably, the parallel computing model is the MapReduce computing model.
Step 102: if the number of first log data sets saved in the Hadoop database meets a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using the preset parallel computing model to obtain a second log data set.
In this embodiment, the mining system saves the first log data set acquired in each time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database meets the preset value, the mining system uses the parallel computing model preset in the Hadoop framework to aggregate the first log data sets in the Hadoop database, obtaining the second log data set.
In practice, the preset value can be chosen as needed. For example, if the time period is 15 minutes and the first log data sets of one hour are to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets of one day are to be aggregated, the preset value is 48.
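The two preset values in the example follow from dividing the aggregation span by the time period. A minimal sketch of that arithmetic, with the function name `required_set_count` chosen only for illustration:

```python
from datetime import timedelta


def required_set_count(period: timedelta, span: timedelta) -> int:
    """Number of per-period first log data sets needed to cover `span`."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    count, remainder = divmod(span, period)
    if remainder:
        # Assumption of this sketch: the span is a whole number of periods.
        raise ValueError("span must be a whole multiple of period")
    return count


hourly_threshold = required_set_count(timedelta(minutes=15), timedelta(hours=1))
daily_threshold = required_set_count(timedelta(minutes=30), timedelta(days=1))
```

Here `hourly_threshold` and `daily_threshold` reproduce the values 4 and 48 given in the text.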
It will be understood that, building on the above aggregation, the mining system can obtain log data sets for different time cycles in a similar way. For example, four first log data sets with a 15-minute time period yield the log data set of one hour; 24 one-hour log data sets yield the log data set of one day; 30 one-day log data sets yield the log data set of one month; and so on. Log data sets for different time spans can thus be obtained to meet different demands.
In this embodiment, when the mining system performs parallel aggregation using the preset parallel computing model, it sums the count values of identical log data.
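The summing of count values can be pictured as a map, shuffle, and reduce sequence. The sketch below runs those phases sequentially in plain Python as a local stand-in; it does not use Hadoop, and the data layout (a dictionary from log key to count) and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

LogSet = Dict[str, int]  # assumed layout: log key -> count within one time period


def map_phase(log_set: LogSet) -> Iterable[Tuple[str, int]]:
    """Emit one (key, count) pair per entry of a first log data set."""
    return log_set.items()


def reduce_phase(key: str, counts: List[int]) -> Tuple[str, int]:
    """Sum the count values of identical log data."""
    return key, sum(counts)


def aggregate(first_sets: List[LogSet]) -> LogSet:
    """Run map, shuffle, and reduce sequentially over the saved sets."""
    shuffled: Dict[str, List[int]] = defaultdict(list)
    for log_set in first_sets:
        for key, count in map_phase(log_set):
            shuffled[key].append(count)  # shuffle: group values by key
    return dict(reduce_phase(k, v) for k, v in shuffled.items())


# Four 15-minute first log data sets (made-up values) rolled up into one hour.
first_sets = [
    {"weibo.com": 3, "baidu.com": 1},
    {"weibo.com": 2},
    {"baidu.com": 4, "sohu.com": 1},
    {"weibo.com": 1},
]
second_set = aggregate(first_sets)
```

In a real deployment the map and reduce functions would be distributed across the cluster by the framework; the same function shape can be reused to roll hourly sets into daily sets and daily sets into monthly sets.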
Step 103: divide the log data in the second log data set by dimension according to the dimensions of that log data, and save the resulting third log data sets, one for each dimension, to the Hadoop database.
In this embodiment, after obtaining the second log data set, the mining system divides the log data in it by dimension according to the dimensions of that log data, and saves the resulting third log data sets, one for each dimension, to the Hadoop database, thereby mining the massive log data. The saved third log data sets can serve as the data source for user data queries, supporting chart queries, graphical queries, and multi-dimensional queries on the display interface, so that the data can be presented from multiple angles and the results of the data mining can be shown effectively.
Log data has many dimensions, including but not limited to browsing content, access location, and access time. Browsing content refers to where the user browses; this may be a specific site, such as Baidu, Sohu, or Sina Weibo, or a category of sites, such as music or film. Access location refers to the geographic range of the IP address the user uses, and access time refers to the time at which the log data was generated. Dimensions are divided according to the requirements of the system, and the data within each dimension is used to further characterize the user's overall behavior. Note that different types of log data have different dimensions. For example, when the technical solution of this embodiment is used to mine the traffic data of users in the log data, the dimensions may include, in addition to browsing content, access location, and access time, access frequency, user age, monthly spending, and so on. In practice, dimensions can therefore be divided as needed, and no limitation is imposed here.
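One possible reading of dimension division is sketched below: each record of the second log data set carries a value for every dimension plus a count, and one third log data set is built per dimension by summing counts per value. The field names `content`, `location`, and `hour`, and the record layout itself, are assumptions for this example only.

```python
from collections import Counter
from typing import Dict, List

DIMENSIONS = ("content", "location", "hour")  # assumed field names


def divide_by_dimension(records: List[dict]) -> Dict[str, Counter]:
    """Build one third log data set per dimension: value -> summed count."""
    third_sets: Dict[str, Counter] = {dim: Counter() for dim in DIMENSIONS}
    for record in records:
        for dim in DIMENSIONS:
            third_sets[dim][record[dim]] += record["count"]
    return third_sets


# Made-up second log data set.
second_set = [
    {"content": "music", "location": "cell-A", "hour": 9, "count": 5},
    {"content": "film", "location": "cell-A", "hour": 9, "count": 2},
    {"content": "music", "location": "cell-B", "hour": 10, "count": 1},
]
third_sets = divide_by_dimension(second_set)
```

Each entry of `third_sets` could then be saved separately, so that a later query on a single dimension reads only the set for that dimension.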
Preferably, in this embodiment, after saving the third log data sets for the different dimensions to the Hadoop database, the mining system can also save them to a row storage array, so that the Hadoop database and the row storage array work together and the data demands of different application scenarios can be met.
Preferably, because the mining system performs the above parallel aggregation and dimension division only when the number of first log data sets saved in the Hadoop database meets the preset value, each third log data set obtained also corresponds to a time period. When storing, the mining system can therefore save the correspondence among dimension, time period, and third log data set.
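The correspondence among dimension, time period, and third log data set can be thought of as a lookup keyed by the pair (dimension, time period). The class below is an in-memory stand-in written for illustration; the class name, the period labels, and the data are hypothetical, and an actual system would persist the mapping in its storage layer.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (dimension, time period label)


class ThirdSetIndex:
    """In-memory stand-in for the dimension/period/set correspondence."""

    def __init__(self) -> None:
        self._sets: Dict[Key, Dict[str, int]] = {}

    def save(self, dimension: str, period: str, data: Dict[str, int]) -> None:
        # Store a copy so later changes by the caller do not leak in.
        self._sets[(dimension, period)] = dict(data)

    def read(self, dimension: str, period: str) -> Dict[str, int]:
        # Return an empty set when nothing was stored for this key.
        return dict(self._sets.get((dimension, period), {}))


index = ThirdSetIndex()
index.save("content", "2015-12-01T09", {"music": 6, "film": 2})
index.save("location", "2015-12-01T09", {"cell-A": 7, "cell-B": 1})
hits = index.read("content", "2015-12-01T09")
misses = index.read("content", "2015-12-01T10")
```

With such a mapping, a query that names both a dimension and a time period resolves to exactly one stored set.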
In this embodiment, the mining system saves the first log data set acquired in the current time period to the Hadoop database. If the number of first log data sets saved in the Hadoop database meets the preset value, the mining system uses the preset parallel computing model to aggregate the first log data sets in the Hadoop database in parallel, obtaining the second log data set. It then divides the log data in the second log data set by dimension according to the dimensions of that log data, and saves the resulting third log data sets, one for each dimension, to the Hadoop database, completing the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in a distributed manner in the Hadoop database and processing it with the parallel computing model in Hadoop makes it possible to mine massive data quickly and effectively, and meets the storage and computing demands of mining massive data.
Referring to Fig. 2, a flowchart of the additional steps before step 101 in the first embodiment shown in Fig. 1, the steps comprise:
Step 201: acquire the log data of the current time period from the network side.
In this embodiment, the mining system acquires the log data of the current time period from the network side. Specifically, the mining system can acquire it by extracting log data from the network side, by using web crawler technology, by reading it from the BOSS billing database on the network side, by receiving it from a third-party vendor on the network side, or by combining at least two of these approaches.
Step 202: aggregate the log data of the current time period to obtain the first log data set of the current time period.
In this embodiment, after acquiring the log data of the current time period, the mining system aggregates it to obtain the first log data set of the current time period.
The aggregation in step 202 can classify log data by content: log data with identical content, or with content of the same category, is accumulated into a single record. The first log data set obtained after aggregation is orders of magnitude smaller than the log data acquired in the current time period, yet the meaning of the data is fully preserved.
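The accumulation of identical content into a single record can be sketched with a counter. The raw entries below are made-up host names; treating a host name as the content key is an assumption of this example.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_period(raw_logs: Iterable[str]) -> Dict[str, int]:
    """Collapse raw log entries of one time period into content -> count."""
    return dict(Counter(raw_logs))


raw = ["weibo.com", "baidu.com", "weibo.com", "weibo.com", "baidu.com"]
first_set = aggregate_period(raw)
```

Five raw entries become two records, while the number of occurrences of each content value is retained.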
In this embodiment, the mining system obtains the first log data set through the additional steps shown in Fig. 2. Aggregating the log data of the current time period acquired from the network side effectively reduces the order of magnitude of the log data, so that less storage space is required in the Hadoop database.
Preferably, in this embodiment, the mining system can also perform the following step before step 202:
clean the log data of the current time period to obtain cleaned log data for the current time period.
In this embodiment, before aggregating the acquired log data of the current time period, the mining system can also clean that log data, obtaining cleaned log data for the current time period.
If the mining system performs this step, step 202 must be adapted accordingly, and becomes:
aggregate the cleaned log data of the current time period to obtain the first log data set of the current time period.
Cleaning the log data can consist of removing log data that does not match the preset data types, and/or finding the recognizable errors in the log data and correcting or deleting the log data in which they occur.
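A minimal sketch of such a cleaning pass follows. The record fields `host` and `count`, the type rules, and the specific errors treated as recognizable (stray whitespace, upper case, negative counts) are all assumptions chosen for illustration; the patent leaves the cleaning rules open.

```python
from typing import Iterable, List, Optional


def clean_record(record: dict) -> Optional[dict]:
    """Return a cleaned copy, or None when the record must be dropped."""
    host, count = record.get("host"), record.get("count")
    # Remove records whose fields do not match the preset data types.
    if not isinstance(host, str):
        return None
    if not isinstance(count, int) or isinstance(count, bool):
        return None
    # Correct a recognizable error: stray whitespace or upper case in the host.
    host = host.strip().lower()
    # Delete records whose error cannot be corrected.
    if not host or count < 0:
        return None
    return {"host": host, "count": count}


def clean(records: Iterable[dict]) -> List[dict]:
    """Apply clean_record to every record and keep the survivors."""
    cleaned = (clean_record(r) for r in records)
    return [r for r in cleaned if r is not None]


raw = [
    {"host": " Weibo.com ", "count": 3},    # correctable
    {"host": "baidu.com", "count": "many"},  # wrong type, removed
    {"host": "sohu.com", "count": -1},       # uncorrectable, deleted
    {"host": "baidu.com", "count": 2},       # already clean
]
cleaned = clean(raw)
```

Only the cleaned records would then be passed to the aggregation of step 202.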
In this embodiment, by cleaning the log data of the current time period, the mining system can remove useless or erroneous log data, which reduces the amount of log data to be processed and makes the subsequent data mining more effective.
Referring to Fig. 3, a flowchart of the additional steps after step 103 in the first embodiment shown in Fig. 1, the steps comprise:
Step 301: if a data query instruction is received, read from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction.
In this embodiment, after the mining system has saved the obtained third log data sets to the Hadoop database, a user can request data by entering a data query instruction. If the mining system receives a data query instruction, it reads from the Hadoop database the third log data set corresponding to the query dimension contained in the instruction.
Preferably, the data query instruction can also contain a time period, in which case the mining system reads the third log data set that corresponds to the query dimension within that time period.
Step 302: perform data analysis on the third log data set and display the result of the data analysis on the display interface.
In this embodiment, the mining system also performs data analysis on the third log data set and displays the result on the display interface. Specifically, the mining system groups the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list; it then obtains, from the log data of the users in the user grouping list, level configuration tables corresponding to at least two user dimensions, and displays the level configuration tables on the display interface. The user dimensions are preset, and each level configuration table contains the levels assigned to the users in the user grouping list by grading them along the user dimension.
User dimensions can be divided into horizontal dimensions and vertical dimensions, and users are rated under each of them. For example, suppose the user groups obtained by the mining system include an all-users group and a microblog user group. For the all-users group, all users in the group are ranked by the amount of traffic they use: the top 20% are five-star users, those between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of each user in the all-users group. This is the horizontal-dimension rating. For the users in the microblog user group, the ranking is based on the amount of traffic generated after the user starts the microblog: the top 20% are five-star users, those between the top 20% and the top 40% are four-star users, and so on, which determines the star rating of each user in the microblog user group. This is the vertical-dimension rating. Together, the horizontal and vertical ratings make it possible to present a portrait of the user group, so that business experts can devise targeted plans for a specific group portrait.
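The 20% banding described above can be sketched as follows. The function name `star_ratings`, the user identifiers, and the traffic figures are made up for this example, and ties are broken by sort order, which the patent does not specify.

```python
from typing import Dict


def star_ratings(traffic: Dict[str, int]) -> Dict[str, int]:
    """Rank users by traffic and assign 5 stars to the top 20%, 4 to the next 20%, and so on."""
    ranked = sorted(traffic, key=lambda user: traffic[user], reverse=True)
    total = len(ranked)
    # Position 0 is the heaviest user; each fifth of the ranking loses one star.
    return {user: 5 - (pos * 5) // total for pos, user in enumerate(ranked)}


# Horizontal rating: total traffic of every user in the all-users group.
all_users = {f"u{i}": 100 - 10 * i for i in range(10)}
horizontal = star_ratings(all_users)

# Vertical rating: traffic generated inside the microblog, microblog users only.
weibo_traffic = {"u3": 40, "u0": 5, "u7": 12, "u9": 30, "u1": 1}
vertical = star_ratings(weibo_traffic)
```

In this toy data, user `u0` is a five-star user overall but only a two-star user within the microblog group, which is the kind of contrast the two ratings are meant to expose.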
Preferably, the preset clustering algorithm can be the K-means algorithm.
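For orientation, a minimal one-dimensional K-means (Lloyd's algorithm) is sketched below, grouping users by a single made-up feature, monthly traffic in megabytes. The choice of feature, the fixed starting centroids, and the function name `kmeans_1d` are assumptions of this example; the patent does not state which features are clustered.

```python
from typing import Dict, List


def kmeans_1d(values: List[float], centroids: List[float], rounds: int = 10) -> List[int]:
    """Lloyd's algorithm on scalars; returns the cluster index of each value."""
    centroids = list(centroids)
    labels = [0] * len(values)
    for _ in range(rounds):
        # Assignment step: nearest centroid for every value.
        labels = [
            min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


monthly_mb = {"u0": 20.0, "u1": 35.0, "u2": 900.0, "u3": 950.0, "u4": 30.0}
labels = kmeans_1d(list(monthly_mb.values()), centroids=[0.0, 1000.0])

# Build the user grouping list: cluster index -> users in that cluster.
groups: Dict[int, List[str]] = {}
for user, label in zip(monthly_mb, labels):
    groups.setdefault(label, []).append(user)
```

The resulting `groups` mapping plays the role of the user grouping list; with real data the features would normally be multi-dimensional and the starting centroids chosen by a seeding strategy.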
The query dimension is set on the basis of the dimensions of the third log data sets saved in the Hadoop database. For example, the query dimension can be any one or several of browsing content, access time, access location, and so on.
In this embodiment, the mining system reads from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction, performs data analysis on that third log data set, and displays the result on the display interface, so that the results of the data mining can be shown to the user effectively.
Note that, in this embodiment, the Hadoop-based log data mining method can be applied in a precision marketing system for traffic data. For example, the technical solutions described in the embodiments shown in Fig. 1 to Fig. 3 can be used to mine target users, to mine locations for marketing, and so on, providing operators with a data basis for targeted precision marketing to target users or target base station cells.
If target users are to be determined, the query dimension in step 301 of the embodiment shown in Fig. 3 can be browsing content or access traffic; if target base station cells are to be determined, the query dimension can be access location.
In practice, the user can select the query dimension as needed, and no limitation is imposed here.
Referring to Fig. 4, a schematic diagram of the functional modules of the Hadoop-based log data mining system in the second embodiment of the present invention, the system comprises:
a first saving module 401, configured to save the first log data set acquired in the current time period to a Hadoop database.
The mining system acquires first log data sets by time period. For example, if the time period is 15 minutes or 30 minutes, the mining system acquires the first log data set of the current 15-minute period or of the current 30-minute period.
The time period is the cycle at which data is acquired, and its duration can be chosen according to the data volume.
Hadoop implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework is the Hadoop database and the parallel computing model: the Hadoop database provides distributed storage for massive data, and the parallel computing model provides parallel computation for massive data.
Preferably, the parallel computing model is the MapReduce computing model.
Parallel aggregation module 402, configured to, if the number of first log data sets saved in the Hadoop database meets a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set;
In practical applications, the preset value can be set according to specific needs. For example, if the time period is 15 minutes and the first log data sets within one hour need to be aggregated, the preset value is 4; if the time period is 30 minutes and the first log data sets within one day need to be aggregated, the preset value is 48.
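The arithmetic behind the preset value can be sketched as follows. This is only an illustration: the function names `required_set_count` and `should_aggregate` are hypothetical, and reading "meets the preset value" as "greater than or equal to" is an assumption, not something the patent specifies.

```python
def required_set_count(period_minutes: int, window_minutes: int) -> int:
    """Number of per-period log data sets that make up one aggregation window."""
    if period_minutes <= 0 or window_minutes % period_minutes != 0:
        raise ValueError("window must be a whole multiple of the period")
    return window_minutes // period_minutes


def should_aggregate(saved_sets: int, period_minutes: int, window_minutes: int) -> bool:
    """Assumed trigger: aggregate once enough first log data sets are saved."""
    return saved_sets >= required_set_count(period_minutes, window_minutes)


print(required_set_count(15, 60))       # 15-minute periods in one hour -> 4
print(required_set_count(30, 24 * 60))  # 30-minute periods in one day -> 48
```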
It can be understood that, building on the above aggregation, the parallel aggregation module 402 can obtain log data sets for different time cycles in a similar way. For example, four first log data sets that each cover 15 minutes yield the log data set for one hour; 24 one-hour log data sets yield the log data set for one day; 30 one-day log data sets yield the log data set for one month; and so on. Log data sets covering different time spans can thus be obtained to meet different needs.
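The roll-up from shorter to longer time cycles can be illustrated with a small in-process sketch. It is a stand-in for the parallel aggregation step, not Hadoop or MapReduce code: the representation of a log data set as a table of counts keyed by `(user, content)` and the name `merge_sets` are assumptions made for the example.

```python
from collections import Counter
from functools import reduce


def merge_sets(log_sets):
    """Merge several per-period count tables into one table for the longer period."""
    def combine(total, part):
        total.update(part)  # Counter.update adds counts key by key
        return total
    return reduce(combine, log_sets, Counter())


# Four hypothetical 15-minute sets, keyed by (user, content).
quarter_hours = [
    Counter({("user_a", "weibo"): 3}),
    Counter({("user_a", "weibo"): 2, ("user_b", "music"): 1}),
    Counter({("user_b", "music"): 4}),
    Counter({("user_a", "weibo"): 1}),
]
hourly = merge_sets(quarter_hours)
print(dict(hourly))
```

The same function could be applied again to 24 hourly tables to obtain a daily table, which mirrors the hour, day, month chain described above.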
Partition-and-save module 403, configured to partition the log data in the second log data set by dimension according to the dimensions of that log data, and to save the resulting third log data sets corresponding to the different dimensions into the Hadoop database.
Log data has many dimensions, including but not limited to internet content, internet access location, and internet access time. Internet content refers to what the user browses; this can be a specific site, for example Baidu, Sohu, or Sina Weibo, or a category of websites, for example music or film. Internet access location refers to the geographic range in which the IP address used by the user is located, and internet access time refers to the time at which the log data was generated. Dimensions are partitioned according to the requirements of the system, and the data within each dimension is used to further characterize the user's overall behavior. It should be noted that different types of log data have different dimensions. For example, when the technical solution in the embodiment of the present invention is used to mine the users' traffic data in the log data, the dimensions may include, in addition to internet content, internet access location, and internet access time, the internet access frequency, user age, monthly spending, and so on. In practical applications, the dimensions can therefore be partitioned according to specific needs, which is not limited here.
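One possible reading of dimension partitioning is to group the records of the second log data set by the value they take in each dimension. The sketch below assumes this reading; the field names `content`, `location`, and `time` and the function `partition_by_dimension` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

DIMENSIONS = ("content", "location", "time")  # hypothetical field names


def partition_by_dimension(records, dimensions=DIMENSIONS):
    """Return {dimension: {dimension value: [records with that value]}}."""
    partitions = {dim: defaultdict(list) for dim in dimensions}
    for record in records:
        for dim in dimensions:
            partitions[dim][record[dim]].append(record)
    return {dim: dict(groups) for dim, groups in partitions.items()}


second_set = [
    {"user": "a", "content": "weibo", "location": "cell-1", "time": "09:00"},
    {"user": "b", "content": "music", "location": "cell-1", "time": "09:00"},
    {"user": "a", "content": "music", "location": "cell-2", "time": "10:00"},
]
third_sets = partition_by_dimension(second_set)
print(sorted(third_sets["content"]))  # distinct values in the content dimension
```

Each entry of `third_sets` plays the role of one third log data set, so a later query for a given dimension only needs to read that entry.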
Preferably, in the embodiment of the present invention, after saving the third log data sets corresponding to the different dimensions into the Hadoop database, the mining system can also save them into a row storage array. The Hadoop database and the row storage array can then work together to meet the data requirements of different application scenarios.
In the embodiment of the present invention, the first saving module 401 saves the acquired first log data set for the current time period into the Hadoop database. If the number of first log data sets saved in the Hadoop database meets the preset value, the parallel aggregation module 402 performs parallel aggregation on the first log data sets in the Hadoop database using the preset parallel computing model to obtain the second log data set. Finally, the partition-and-save module 403 partitions the log data in the second log data set by dimension according to the dimensions of that log data, and saves the resulting third log data sets corresponding to the different dimensions into the Hadoop database.
In the embodiment of the present invention, the mining system saves the acquired first log data set for the current time period into the Hadoop database. If the number of first log data sets saved in the Hadoop database meets the preset value, the parallel computing model in Hadoop is used to perform parallel aggregation on the first log data sets in the Hadoop database to obtain the second log data set. The log data in the second log data set is then partitioned by dimension according to the dimensions of that log data, and the resulting third log data sets corresponding to the different dimensions are saved into the Hadoop database, which completes the mining of the log data. Because the Hadoop database has good distributed storage and parallel computing capabilities, storing the log data in the Hadoop database in a distributed manner and processing it with the parallel computing model in Hadoop makes it possible to mine massive data quickly and effectively, and satisfies the storage and computing demands of mining massive data.
Referring to Fig. 5, which is a schematic diagram of functional modules added to the second embodiment shown in Fig. 4, the added modules include:
Acquisition module 501, configured to acquire the log data in the current time period from the network side;
In the embodiment of the present invention, the acquisition module 501 acquires the log data in the current time period from the network side. Specifically, the acquisition module 501 can acquire this log data from the network side by log data extraction, by web crawler technology, from the BOSS billing database on the network side, by receiving it from a third-party vendor on the network side, or by combining at least two of these approaches.
First aggregation module 502, configured to aggregate the log data in the current time period to obtain the first log data set for the current time period.
The first aggregation module 502 can classify the log data by content: log data with identical content, or with content belonging to the same category, is treated as a single record whose row count is accumulated. The order of magnitude of the first log data set obtained after aggregation is far lower than that of the log data acquired in the current time period, while the meaning of the data is fully preserved.
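The counting behavior described above can be sketched with a frequency table. The record layout (`user`, `content`) and the function name `aggregate_period` are assumptions for illustration; the patent does not fix a record format.

```python
from collections import Counter


def aggregate_period(raw_records):
    """Collapse raw records into one row per (user, content) with an occurrence count."""
    return Counter((r["user"], r["content"]) for r in raw_records)


raw = [
    {"user": "a", "content": "weibo"},
    {"user": "a", "content": "weibo"},
    {"user": "b", "content": "music"},
    {"user": "a", "content": "weibo"},
]
first_set = aggregate_period(raw)
print(len(raw), "raw records ->", len(first_set), "aggregated rows")
```

The aggregated table has fewer rows than the raw input, yet the counts still record how often each kind of record occurred.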
In the embodiment of the present invention, the mining system starts executing the first saving module 401 of the embodiment shown in Fig. 4 only after the first aggregation module 502 has been executed.
In the embodiment of the present invention, the system also includes a cleaning module 503;
The cleaning module 503 is configured to clean the log data in the current time period after the acquisition module 501 has acquired it, obtaining the cleaned log data for the current time period;
If the mining system executes the cleaning module 503, the first aggregation module 502 is specifically configured to aggregate the cleaned log data in the current time period to obtain the first log data set for the current time period.
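A data cleaning step of this kind might look like the following sketch. The cleaning rules shown (required fields present, byte count parseable and non-negative) and the field names are assumptions; the patent only states that useless or erroneous log data is removed.

```python
REQUIRED_FIELDS = ("user", "content", "bytes")  # hypothetical record fields


def clean(records):
    """Drop records with missing fields or an invalid byte count."""
    cleaned = []
    for record in records:
        if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue  # incomplete record
        try:
            size = int(record["bytes"])
        except (TypeError, ValueError):
            continue  # byte count is not a number
        if size < 0:
            continue  # negative traffic is treated as an error
        cleaned.append({**record, "bytes": size})
    return cleaned


records = [
    {"user": "a", "content": "weibo", "bytes": "120"},
    {"user": "", "content": "music", "bytes": "50"},
    {"user": "b", "content": "music", "bytes": "abc"},
    {"user": "c", "content": "film", "bytes": -5},
    {"user": "d", "content": "film", "bytes": 300},
]
cleaned = clean(records)
print(len(records), "records ->", len(cleaned), "after cleaning")
```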
In the embodiment of the present invention, the mining system obtains the first log data set through the additional steps shown in Fig. 2. Aggregating the log data acquired from the network side in the current time period effectively reduces the order of magnitude of the log data, so that less storage space is required in the Hadoop database. The mining system can also clean the log data in the current time period to remove useless or erroneous log data, which reduces the amount of log data to be processed and facilitates better data mining.
Referring to Fig. 6, which is a schematic diagram of functional modules added to the second embodiment shown in Fig. 4, the added modules include:
Reading module 601, configured to, if a data query instruction is received, read from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
Analysis module 602, configured to perform data analysis on the third log data set and to display the result of the data analysis on a display interface.
The analysis module 602 includes:
Clustering module 603, configured to group the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
Obtain-and-display module 604, configured to obtain, from the log data of the users in the user grouping list, level configuration tables corresponding to at least two user dimensions, and to display the level configuration tables on the display interface; the user dimensions are preset, and each level configuration table contains the levels determined by grading the users in the user grouping list according to the corresponding user dimension.
User dimensions can be divided into a horizontal dimension and a vertical dimension, and users are rated under each. For example, suppose the user groups obtained by the mining system include an all-users group and a microblog user group. For the all-users group, all users in the group are ranked by the amount of traffic they use: the top 20% are five-star users, those ranked between 20% and 40% are four-star users, and so on, which determines the star rating of every user in the all-users group. This is the horizontal-dimension rating. For the microblog user group, users are ranked by the amount of traffic generated after they start the microblog service: the top 20% are five-star users, those ranked between 20% and 40% are four-star users, and so on, which determines the star rating of every user in the microblog user group. This is the vertical-dimension rating. Together, the horizontal-dimension and vertical-dimension ratings make it possible to display a portrait of the user group, so that business experts can devise targeted plans for a specific group portrait.
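The percentile-based star rating can be sketched as below. The function name `star_ratings`, the sample traffic figures, and the way ties and small groups are handled (rank position mapped to five equal bands) are assumptions; the patent only gives the 20% band boundaries.

```python
def star_ratings(traffic_by_user):
    """Map each user to 1-5 stars by rank: top 20% get 5 stars, the next 20% get 4, ..."""
    ranked = sorted(traffic_by_user, key=traffic_by_user.get, reverse=True)
    n = len(ranked)
    ratings = {}
    for position, user in enumerate(ranked):
        band = position * 5 // n  # 0 for the top fifth, 4 for the bottom fifth
        ratings[user] = 5 - band
    return ratings


# Horizontal rating: total traffic across the all-users group (made-up values).
all_traffic = {"u1": 900, "u2": 500, "u3": 300, "u4": 200, "u5": 100}
# Vertical rating: traffic generated inside the microblog group only.
weibo_traffic = {"u2": 80, "u4": 150, "u5": 10}

horizontal = star_ratings(all_traffic)
vertical = star_ratings(weibo_traffic)
print(horizontal)
print(vertical)
```

In this made-up data, user `u4` is only a two-star user overall but the top-rated user within the microblog group, which is the kind of contrast the two rating dimensions are meant to expose.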
Preferably, the preset clustering algorithm can be the K-means algorithm.
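For readers unfamiliar with K-means, the following is a minimal pure-Python sketch of Lloyd's algorithm applied to user grouping. The feature choice (two numeric values per user), the naive initialization from the first k points, and the fixed iteration count are all simplifying assumptions; a production system would more likely use a distributed or library implementation.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical per-user features: (monthly traffic in GB, sessions per day).
users = ["u1", "u2", "u3", "u4", "u5", "u6"]
points = [(1.0, 2.0), (9.0, 8.0), (1.5, 1.8), (8.5, 9.0), (1.2, 2.2), (9.2, 8.7)]

centroids, labels = kmeans(points, k=2)
grouping = {}
for user, label in zip(users, labels):
    grouping.setdefault(label, []).append(user)
print(grouping)  # user grouping list, keyed by cluster label
```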
The query dimension is set based on the dimensions of the third log data sets saved in the Hadoop database. For example, the query dimension can be any one or several of internet content, internet access time, internet access location, and so on.
In the embodiment of the present invention, the mining system reads from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction, performs data analysis on that third log data set, and displays the result of the data analysis on the display interface, so that the results of data mining can be presented to the user effectively.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present invention that is essential, or that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, etc.) to execute the methods of the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present invention.
Claims (10)
1. A Hadoop-based log data mining method, characterized by comprising:
saving an acquired first log data set for the current time period into a Hadoop database;
if the number of first log data sets saved in the Hadoop database meets a preset value, performing parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set;
partitioning the log data in the second log data set by dimension according to the dimensions of that log data, and saving the resulting third log data sets corresponding to the different dimensions into the Hadoop database.
2. The method according to claim 1, characterized in that the method further comprises:
acquiring the log data in the current time period from the network side;
aggregating the log data in the current time period to obtain the first log data set for the current time period.
3. The method according to claim 2, characterized in that, after the step of acquiring the log data in the current time period from the network side, the method further comprises:
cleaning the log data in the current time period to obtain the cleaned log data for the current time period;
and the step of aggregating the log data in the current time period to obtain the first log data set for the current time period comprises:
aggregating the cleaned log data in the current time period to obtain the first log data set for the current time period.
4. The method according to any one of claims 1 to 3, characterized in that the method further comprises:
if a data query instruction is received, reading from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
performing data analysis on the third log data set, and displaying the result of the data analysis on a display interface.
5. The method according to claim 4, characterized in that performing data analysis on the third log data set and displaying the result of the data analysis on the display interface comprises:
grouping the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
obtaining, from the log data of the users in the user grouping list, level configuration tables corresponding to at least two user dimensions, and displaying the level configuration tables on the display interface; the user dimensions are preset, and each level configuration table contains the levels determined by grading the users in the user grouping list according to the corresponding user dimension.
6. A Hadoop-based log data mining system, characterized by comprising:
a first saving module, configured to save an acquired first log data set for the current time period into a Hadoop database;
a parallel aggregation module, configured to, if the number of first log data sets saved in the Hadoop database meets a preset value, perform parallel aggregation on the first log data sets in the Hadoop database using a preset parallel computing model to obtain a second log data set;
a partition-and-save module, configured to partition the log data in the second log data set by dimension according to the dimensions of that log data, and to save the resulting third log data sets corresponding to the different dimensions into the Hadoop database.
7. The system according to claim 6, characterized in that the system further comprises:
an acquisition module, configured to acquire the log data in the current time period from the network side;
a first aggregation module, configured to aggregate the log data in the current time period to obtain the first log data set for the current time period.
8. The system according to claim 7, characterized in that the system further comprises a cleaning module;
the cleaning module is configured to clean the log data in the current time period after the acquisition module has acquired it, obtaining the cleaned log data for the current time period;
and the first aggregation module is specifically configured to aggregate the cleaned log data in the current time period to obtain the first log data set for the current time period.
9. The system according to any one of claims 6 to 8, characterized in that the system further comprises:
a reading module, configured to, if a data query instruction is received, read from the Hadoop database the third log data set corresponding to the query dimension contained in the data query instruction;
an analysis module, configured to perform data analysis on the third log data set and to display the result of the data analysis on a display interface.
10. The system according to claim 9, characterized in that the analysis module comprises:
a clustering module, configured to group the users in the third log data set according to a preset clustering algorithm to obtain a user grouping list;
an obtain-and-display module, configured to obtain, from the log data of the users in the user grouping list, level configuration tables corresponding to at least two user dimensions, and to display the level configuration tables on the display interface; the user dimensions are preset, and each level configuration table contains the levels determined by grading the users in the user grouping list according to the corresponding user dimension.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510875453.3A CN106815274B (en) | 2015-12-02 | 2015-12-02 | Hadoop-based log data mining method and system |
PCT/CN2016/097363 WO2017092444A1 (en) | 2015-12-02 | 2016-08-30 | Log data mining method and system based on hadoop |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510875453.3A CN106815274B (en) | 2015-12-02 | 2015-12-02 | Hadoop-based log data mining method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106815274A true CN106815274A (en) | 2017-06-09 |
CN106815274B CN106815274B (en) | 2022-02-18 |
Family
ID=58796202
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510875453.3A Active CN106815274B (en) | 2015-12-02 | 2015-12-02 | Hadoop-based log data mining method and system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106815274B (en) |
WO (1) | WO2017092444A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107391645A (en) * | 2017-07-12 | 2017-11-24 | 广州市昊链信息科技股份有限公司 | A kind of logistics information automatic push and practical operation specification form system and method |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107241231B (en) * | 2017-07-26 | 2020-04-03 | 成都科来软件有限公司 | Rapid and accurate positioning method for original network data packet |
CN112287208B (en) * | 2019-09-30 | 2024-03-01 | 北京沃东天骏信息技术有限公司 | User portrait generation method, device, electronic equipment and storage medium |
CN111597179B (en) * | 2020-05-18 | 2023-12-05 | 北京思特奇信息技术股份有限公司 | Method and device for automatically cleaning data, electronic equipment and storage medium |
CN112632020B (en) * | 2020-12-25 | 2022-03-18 | 中国电子科技集团公司第三十研究所 | Log information type extraction method and mining method based on spark big data platform |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6732123B1 (en) * | 1998-02-23 | 2004-05-04 | International Business Machines Corporation | Database recovery to any point in time in an online environment utilizing disaster recovery technology |
US20070055687A1 (en) * | 2005-09-02 | 2007-03-08 | International Business Machines Corporation | System and method for minimizing data outage time and data loss while handling errors detected during recovery |
KR20090050405A (en) * | 2007-11-15 | 2009-05-20 | 한국전자통신연구원 | Method and apparatus for classifying user behaviors based on the event log generated from the context aware system environment |
CN101483557A (en) * | 2009-03-03 | 2009-07-15 | 中兴通讯股份有限公司 | Log statistic, storing method and system used for deep packet detection apparatus |
US20100306286A1 (en) * | 2009-03-05 | 2010-12-02 | Chi-Hsien Chiu | Distributed steam processing |
CN102685221A (en) * | 2012-04-29 | 2012-09-19 | 华北电力大学(保定) | Distributed storage and parallel mining method for state monitoring data |
US20140304401A1 (en) * | 2013-04-06 | 2014-10-09 | Citrix Systems, Inc. | Systems and methods to collect logs from multiple nodes in a cluster of load balancers |
CN104182506A (en) * | 2014-08-19 | 2014-12-03 | 浪潮(北京)电子信息产业有限公司 | Log management method |
CN104301360A (en) * | 2013-07-19 | 2015-01-21 | 阿里巴巴集团控股有限公司 | Method, log server and system for recording log data |
US20150081668A1 (en) * | 2013-09-13 | 2015-03-19 | Nec Laboratories America, Inc. | Systems and methods for tuning multi-store systems to speed up big data query workload |
CN104616092A (en) * | 2014-12-16 | 2015-05-13 | 国家电网公司 | Distributed log analysis based distributed mode handling method |
CN104969213A (en) * | 2013-01-31 | 2015-10-07 | 脸谱公司 | Data stream splitting for low-latency data access |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100481077C (en) * | 2006-01-12 | 2009-04-22 | 国际商业机器公司 | Visual method and device for strengthening search result guide |
CN103036921B (en) * | 2011-09-29 | 2015-09-23 | 北京新媒传信科技有限公司 | A kind of user behavior analysis system and method |
CN103955502B (en) * | 2014-04-24 | 2017-07-28 | 科技谷(厦门)信息技术有限公司 | A kind of visualization OLAP application realization method and system |
CN104317958B (en) * | 2014-11-12 | 2018-01-16 | 北京国双科技有限公司 | A kind of real-time data processing method and system |
-
2015
- 2015-12-02 CN CN201510875453.3A patent/CN106815274B/en active Active
-
2016
- 2016-08-30 WO PCT/CN2016/097363 patent/WO2017092444A1/en active Application Filing
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107391645A (en) * | 2017-07-12 | 2017-11-24 | 广州市昊链信息科技股份有限公司 | A kind of logistics information automatic push and practical operation specification form system and method |
CN107391645B (en) * | 2017-07-12 | 2018-04-10 | 广州市昊链信息科技股份有限公司 | A kind of logistics information automatic push and practical operation specification form system and method |
Also Published As
Publication number | Publication date |
---|---|
CN106815274B (en) | 2022-02-18 |
WO2017092444A1 (en) | 2017-06-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||