A mass data analysis and statistics method based on an HDFS cluster
Technical field
The present invention relates generally to mass data processing methods, and more specifically to a mass data analysis and statistics method based on an HDFS cluster.
Background art
With the explosive growth of the mobile Internet, people have rapidly moved from traditional PCs to mobile phones, and the mobile phone has firmly established itself as the user's "first screen". Under this trend, the large population of application developers needs to monetize and naturally chooses to connect to mobile advertising platforms, while demand-side platforms (Demand-Side Platform, DSP) need to place advertisements precisely. Driven by this market demand, an advertisement transaction platform completes a very large number of advertisement transactions every day, and as a consequence its servers produce more than a hundred TB of advertisement transaction data daily. On the basis of mass data processing, statistical analyses of different kinds or dimensions must be carried out according to terminal device, region, GPS and the like, such as the analysis of advertisement transaction data, the analysis of user behavior, the analysis of user cheating (malicious traffic inflation), and the daily spending of each demand-side platform.
At present, statistical analysis of massive data is usually implemented with MapReduce in traditional Hadoop. Hadoop is a distributed system architecture that implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework design is HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over it. In practice, the HDFS server cluster of Hadoop receives transaction logs or other data produced by different transaction servers. The management node of HDFS (NameNode), also called the master server, then allocates the work, and MapReduce jobs run on multiple working nodes (DataNode) to carry out the statistical analysis of a single dimension. The specific algorithm is as follows. First, all of the data is scanned according to the algorithm of the statistical dimension so that it can be grouped and the corresponding tasks distributed to the working nodes; this is the mapping (map) process. The working nodes then sort and merge the data in parallel. Finally, the merged data of each node is collected, the reduction (reduce) operation is performed, and the result of the statistical analysis is stored. MapReduce is then executed again, in turn, for the statistics of each remaining dimension. A schematic flow chart of MapReduce is shown in the accompanying drawings.
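The per-dimension flow described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code: the record fields, `RECORDS`, `run_single_dimension_job` and `run_all_dimensions` are hypothetical names chosen for illustration, and the "jobs" run in one process. The point is only that every additional dimension triggers another full scan of the data.

```python
from collections import defaultdict

# Hypothetical sample of transaction log records; the field names are assumptions.
RECORDS = [
    {"device": "phone_a", "region": "north", "cost": 3},
    {"device": "phone_b", "region": "north", "cost": 5},
    {"device": "phone_a", "region": "south", "cost": 2},
]

def run_single_dimension_job(records, dimension):
    """Simulate one MapReduce job that totals `cost` for a single dimension."""
    # Map: scan every record and emit (key, value) pairs for this dimension.
    mapped = [(rec[dimension], rec["cost"]) for rec in records]
    # Sort and merge: group the values that share a key.
    grouped = defaultdict(list)
    for key, value in sorted(mapped):
        grouped[key].append(value)
    # Reduce: collapse each group to a single result.
    return {key: sum(values) for key, values in grouped.items()}

def run_all_dimensions(records, dimensions):
    """Run the jobs one after another and count the full scans they needed."""
    results, full_scans = {}, 0
    for dimension in dimensions:
        results[dimension] = run_single_dimension_job(records, dimension)
        full_scans += 1  # each job rescans all of the data
    return results, full_scans

results, full_scans = run_all_dimensions(RECORDS, ["device", "region"])
```

With two dimensions the sketch performs two full scans; with n dimensions it performs n.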
Each time MapReduce performs the statistical analysis of one dimension, it must scan all of the data to complete that analysis. The total time is estimated as T(n) = ∑(Td(n) + Tm(n) + Tr(n)), where T(n) is the total running time of n kinds of statistical analysis, Td(n) is the total time of n full table scans, Tm(n) is the total map time of the n kinds of statistical analysis, and Tr(n) is the total reduce time of the n kinds of statistical analysis. When a relatively large number of statistical analyses is required, this method occupies the limited resources of the HDFS server cluster for a long time. Moreover, because the computation takes too long, this statistical analysis method can no longer meet the requirement of daily settlement with developers and demand-side parties.
Summary of the invention
In view of the above problems, the present invention provides a mass data analysis and statistics method based on an HDFS cluster. It solves the problems of the prior art, namely the MapReduce method, which occupies limited HDFS server cluster resources for too long when statistically analyzing mass data and whose statistical analysis itself takes too long.
To achieve the above object, the present invention adopts the following technical solution.
A mass data analysis and statistics method based on an HDFS cluster, characterized by the following steps:
Establishing a branch line scheduler, which is implemented by a separate management node;
Creating branch lines from a configuration file, a branch line being the smallest statistical unit, with one branch line completing the statistics of one data dimension;
Initializing the branch lines, the branch line scheduler assigning each branch line a branch line ID that is globally unique within this statistics task;
Data grouping: all of the data is scanned, the raw data is divided according to the data dimensions that the different branch lines need to count, and a corresponding grouping key is assigned, forming [grouping key, data value] key-value pairs, the grouping key containing the ID information of the corresponding branch line;
Data merging: on each working node, the branch line ID is parsed from the grouping key of each key-value pair, the corresponding branch line is looked up in the branch line scheduler by that ID, key-value pairs with the same grouping key are assigned to the corresponding branch line, and within each branch line the data values of the key-value pairs are counted and merged into one pair of grouped data;
Data statistics output: the grouped data of every node in the cluster is gathered together, grouped data with the same grouping key is reduced, and the reduction result is finally passed to the corresponding branch line and output by that branch line;
Releasing resources: after the statistics of a data dimension is completed, the branch line scheduler releases the system resources occupied by the branch line responsible for that data dimension.
The system resources include database connection resources, file resources, and data set resources.
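The grouping, merging and output steps above can be sketched in Python as follows. This is a single-process illustration under stated assumptions, not the patented implementation: `BRANCH_LINES`, the record fields, and the functions `group`, `merge_local` and `reduce_and_route` are hypothetical, and two in-memory lists stand in for two working nodes. What the sketch shows is that one scan of the data emits key-value pairs for every branch line at once.

```python
from collections import defaultdict

# Hypothetical branch lines: branch line ID -> the record field it counts.
BRANCH_LINES = {"subline_1": "device", "subline_2": "region"}

def group(records):
    """Data grouping: one pass over the data emits pairs for every branch line."""
    for rec in records:
        for branch_id, field in BRANCH_LINES.items():
            yield (f"{branch_id}_{rec[field]}", rec["cost"])

def merge_local(pairs):
    """Data merging on one working node: combine values sharing a grouping key."""
    merged = defaultdict(int)
    for key, value in pairs:
        merged[key] += value
    return dict(merged)

def reduce_and_route(per_node_results):
    """Pool node results, reduce by grouping key, route output to branch lines."""
    totals = defaultdict(int)
    for node_result in per_node_results:
        for key, value in node_result.items():
            totals[key] += value
    output = {branch_id: {} for branch_id in BRANCH_LINES}
    for key, value in totals.items():
        # Match against the known IDs, because the IDs themselves contain "_".
        branch_id = next(b for b in BRANCH_LINES if key.startswith(b + "_"))
        output[branch_id][key[len(branch_id) + 1:]] = value
    return output

# Two hypothetical working nodes, each holding part of the data.
node_a = [{"device": "phone_a", "region": "north", "cost": 3}]
node_b = [{"device": "phone_a", "region": "south", "cost": 2},
          {"device": "phone_b", "region": "north", "cost": 5}]
output = reduce_and_route([merge_local(group(n)) for n in (node_a, node_b)])
```

Each record is read once, yet both dimensions end up with their own result under their own branch line ID.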
The management node that implements the branch line scheduler in the present invention differs from the management node NameNode of the prior art: the management node of the present invention allocates branch lines, that is, it allocates the working nodes occupied by each branch line.
Further, the grouping key is composed as: branch line ID_statistics key.
Further, after the branch lines are initialized, the branch line IDs are also verified to prevent a branch line ID from coinciding with other parameters.
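A minimal Python sketch of the key composition and the ID check might look like the following. The function names and the `RESERVED_NAMES` set are assumptions made for illustration; the source states only that the key has the form "branch line ID_statistics key" and that IDs are verified against other parameters.

```python
# Hypothetical names of "other parameters" that an ID must not coincide with.
RESERVED_NAMES = {"input_path", "output_path"}

def validate_branch_ids(branch_ids, reserved=RESERVED_NAMES):
    """Reject IDs that are duplicated or that coincide with other parameters."""
    seen = set()
    for branch_id in branch_ids:
        if branch_id in seen:
            raise ValueError(f"duplicate branch line ID: {branch_id}")
        if branch_id in reserved:
            raise ValueError(f"ID coincides with a parameter: {branch_id}")
        seen.add(branch_id)
    return True

def make_grouping_key(branch_id, stat_key):
    """Compose the grouping key as <branch line ID>_<statistics key>."""
    return f"{branch_id}_{stat_key}"

def parse_grouping_key(grouping_key, known_ids):
    """Recover (branch line ID, statistics key) from a grouping key."""
    # Try longer IDs first so an ID that is a prefix of another is not mistaken.
    for branch_id in sorted(known_ids, key=len, reverse=True):
        prefix = branch_id + "_"
        if grouping_key.startswith(prefix):
            return branch_id, grouping_key[len(prefix):]
    raise KeyError(grouping_key)

ids = ["subline_1", "subline_2"]
validate_branch_ids(ids)
key = make_grouping_key("subline_1", "region_north")
parsed = parse_grouping_key(key, ids)
```

Because IDs such as subline_1 already contain an underscore, the sketch parses against the list of known IDs instead of splitting on the first "_".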
Further, the ratio of the number of branch lines to the number of working nodes is greater than 0 and not greater than 2, and preferably between 0.67 and 1. When this ratio is too large, data transmission between the nodes of the cluster increases significantly, scheduling by the branch line scheduler becomes more difficult, and the risk of downtime of the management node that implements the branch line scheduler rises. When the ratio is between 0.67 and 1, the system operates more efficiently.
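As a small illustration, the ratio constraint could be checked before a task is scheduled. The function below is a hypothetical helper, not part of the source. It rounds the ratio to two decimals before comparing, on the assumption that the lower bound 0.67 in the text is itself a two-decimal rounding.

```python
def classify_ratio(branch_count, worker_count):
    """Classify the branch-line-to-working-node ratio against the stated ranges."""
    if worker_count <= 0:
        raise ValueError("worker_count must be positive")
    # Assumption: compare at two-decimal precision, matching the bounds in the text.
    ratio = round(branch_count / worker_count, 2)
    if not 0 < ratio <= 2:
        return "out_of_range"
    if 0.67 <= ratio <= 1:
        return "preferred"
    return "allowed"

# 12 working nodes, as in the clusters used for the embodiments.
labels = {n: classify_ratio(n, 12) for n in (4, 8, 12, 30)}
```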
It follows from the above scheme that the total time of the mass data analysis and statistics method based on an HDFS cluster according to the present invention is estimated as T(n) = ∑(Td(1) + Tm(n) + Tr(n)) + ΔT, where Td(1) is the total time of one full table scan, Tm(n) is the total map time of the n kinds of statistics, Tr(n) is the total reduce time of the n kinds of statistics, and ΔT is the computation time required by the branch line scheduler.
In practice, when the ratio of the number of branch lines to the number of working nodes does not exceed 2, the computation time required by the branch line scheduler is very short and can be ignored. The total time of the data analysis and statistics method disclosed herein is then estimated as T(n) = ∑(Td(1) + Tm(n) + Tr(n)).
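To make the comparison with the prior-art estimate explicit, the two formulas can be written out term by term and subtracted. The derivation below is only a sketch: the subscripts "prior" and "branch", the primed symbols for the map and reduce times of the branch line method, and the approximations in the last step are assumptions introduced here, not statements from the source.

```latex
\begin{align*}
T_{\text{prior}}(n)  &= T_d(n) + T_m(n) + T_r(n) \\
T_{\text{branch}}(n) &= T_d(1) + T_m'(n) + T_r'(n) + \Delta T \\
T_{\text{prior}}(n) - T_{\text{branch}}(n)
  &= \bigl[T_d(n) - T_d(1)\bigr] + \bigl[T_m(n) - T_m'(n)\bigr]
   + \bigl[T_r(n) - T_r'(n)\bigr] - \Delta T \\
  &\approx (n - 1)\,T_d(1) - \Delta T
\end{align*}
```

The last line assumes that the map and reduce times of the two methods are roughly equal and that n separate full scans cost about n times one scan, so the saving grows with the number of dimensions as long as ΔT stays small.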
Therefore, compared with the prior art, the present invention, by providing branch lines, reduces the number of full data scans from the number of data dimensions to 1, which greatly improves the efficiency of statistical data analysis. In addition, because the method uses branch lines, once the statistics of a data dimension is completed the system resources occupied by the branch line responsible for that dimension are released and can be reused for other data statistics and analysis. The high demands that the advertising industry currently places on business data statistics are thus met without adding hardware.
Furthermore, by merging data locally on each working node, the present invention effectively reduces data transmission between the nodes of the cluster, which helps shorten the computation time of mass data analysis and statistics.
Brief description of the drawings
Fig. 1 is a schematic flow chart of MapReduce in prior-art Hadoop;
Fig. 2 is a schematic flow chart of a preferred embodiment of the present disclosure.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings. The embodiments described with reference to the drawings are exemplary; they serve only to explain the present invention and shall not be construed as limiting it. To avoid unnecessarily obscuring the embodiments, this section does not describe in detail certain well-known techniques, that is, techniques that are obvious to those skilled in the art.
Embodiment 1
A mass data analysis and statistics method based on an HDFS cluster, whose flow is illustrated in Fig. 2, characterized by the following steps:
Establishing a branch line scheduler, which is implemented by a separate management node;
Creating branch line 1, branch line 2 and branch line 3 from a configuration file, a branch line being the smallest statistical unit, with one branch line completing the statistics of one data dimension;
Initializing the branch lines, the branch line scheduler assigning each branch line a branch line ID that is globally unique within this statistics task: the ID of branch line 1 is subline_1, the ID of branch line 2 is subline_2, and the ID of branch line 3 is subline_3; the 3 IDs are verified to prevent a branch line ID from coinciding with other parameters;
Data grouping: all of the data is scanned, the raw data is divided according to the data dimensions that the different branch lines need to count, and a corresponding grouping key is assigned, forming [grouping key, data value] key-value pairs, the grouping key containing the ID information of the corresponding branch line;
Data merging: on each working node, the branch line ID is parsed from the grouping key of each key-value pair, the corresponding branch line is looked up in the branch line scheduler by that ID, key-value pairs with the same grouping key are assigned to the corresponding branch line, and within each branch line the data values of the key-value pairs are counted and merged into one pair of grouped data;
Data statistics output: the grouped data of every node in the cluster is gathered together, grouped data with the same grouping key is reduced, and the reduction result is finally passed to the corresponding branch line and output by that branch line;
Releasing resources: after the statistics of a data dimension is completed, the branch line scheduler releases the system resources occupied by the branch line responsible for that data dimension.
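The creation and initialization steps of Embodiment 1 can be sketched in Python as follows. The source does not specify the configuration file format, so the JSON layout, the dimension names, and the `BranchLine` and `BranchLineScheduler` classes are all assumptions made for illustration; only the IDs subline_1 to subline_3 come from the text.

```python
import json
from dataclasses import dataclass

# Hypothetical configuration text; the source does not specify a file format.
CONFIG_TEXT = """
{"branch_lines": [
  {"name": "branch line 1", "dimension": "device"},
  {"name": "branch line 2", "dimension": "region"},
  {"name": "branch line 3", "dimension": "gps"}
]}
"""

@dataclass
class BranchLine:
    name: str
    dimension: str
    branch_id: str = ""  # assigned by the scheduler during initialization

class BranchLineScheduler:
    """Hypothetical scheduler that creates, initializes and looks up branch lines."""

    def __init__(self):
        self._by_id = {}

    def create_from_config(self, config_text):
        specs = json.loads(config_text)["branch_lines"]
        return [BranchLine(spec["name"], spec["dimension"]) for spec in specs]

    def initialize(self, branch_lines):
        for index, line in enumerate(branch_lines, start=1):
            branch_id = f"subline_{index}"
            if branch_id in self._by_id:  # verification: IDs must stay unique
                raise ValueError(f"branch line ID already in use: {branch_id}")
            line.branch_id = branch_id
            self._by_id[branch_id] = line

    def find(self, branch_id):
        return self._by_id[branch_id]

scheduler = BranchLineScheduler()
lines = scheduler.create_from_config(CONFIG_TEXT)
scheduler.initialize(lines)
```

Keeping the branch line definitions in configuration means a new statistical dimension can be added without changing the scanning code, which is one plausible reading of why the method creates branch lines this way.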
Embodiment 2
A mass data analysis and statistics method based on an HDFS cluster is applied to advertising platform data of the 16 TB order to analyze advertisement transaction data, user behavior, user cheating, and the daily spending of demand-side platforms, characterized by the following steps:
Establishing a branch line scheduler, which is implemented by a separate management node;
Creating branch line 1, branch line 2, branch line 3 and branch line 4 from a configuration file, corresponding respectively to the analysis of these 4 dimensions: advertisement transaction data, user behavior, user cheating, and the daily spending of demand-side platforms; a branch line is the smallest statistical unit, and one branch line completes the statistics of one data dimension;
Initializing the branch lines, the branch line scheduler assigning each branch line a branch line ID that is globally unique within this statistics task: the ID of branch line 1 is subline_1, the ID of branch line 2 is subline_2, the ID of branch line 3 is subline_3, and the ID of branch line 4 is subline_4; the 4 IDs are verified to prevent a branch line ID from coinciding with other parameters;
Data grouping: all of the data is scanned, the raw data is divided according to the data dimensions that the different branch lines need to count, and a corresponding grouping key is assigned, forming [grouping key, data value] key-value pairs, the grouping key containing the ID information of the corresponding branch line;
Data merging: on each working node, the branch line ID is parsed from the grouping key of each key-value pair, the corresponding branch line is looked up in the branch line scheduler by that ID, key-value pairs with the same grouping key are assigned to the corresponding branch line, and within each branch line the data values of the key-value pairs are counted and merged into one pair of grouped data;
Data statistics output: the grouped data of every node in the cluster is gathered together, grouped data with the same grouping key is reduced, and the reduction result is finally passed to the corresponding branch line and output by that branch line;
Releasing resources: after the statistics of a data dimension is completed, the branch line scheduler releases the system resources occupied by the branch line responsible for that data dimension.
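The output and resource release steps for the four branch lines of Embodiment 2 can be sketched in Python as below. The classes are hypothetical, an in-memory `io.StringIO` buffer stands in for a file or database connection resource, and the counts loaded into subline_1 are made-up sample values, not results from the source.

```python
import io

class BranchLine:
    """Hypothetical branch line that owns a data set and an output resource."""

    def __init__(self, branch_id, dimension):
        self.branch_id = branch_id
        self.dimension = dimension
        self.counts = {}             # data set resource
        self.output = io.StringIO()  # stands in for a file or DB connection
        self.released = False

    def export(self):
        for key in sorted(self.counts):
            self.output.write(f"{key}\t{self.counts[key]}\n")
        return self.output.getvalue()

    def release(self):
        self.output.close()
        self.counts.clear()
        self.released = True

class BranchLineScheduler:
    def __init__(self, branch_lines):
        self.active = {line.branch_id: line for line in branch_lines}

    def finish(self, branch_id):
        """Output one dimension's result, then free that branch line's resources."""
        line = self.active.pop(branch_id)
        report = line.export()
        line.release()
        return report

lines = [
    BranchLine("subline_1", "advertisement transaction data"),
    BranchLine("subline_2", "user behavior"),
    BranchLine("subline_3", "user cheating"),
    BranchLine("subline_4", "demand-side platform daily spending"),
]
scheduler = BranchLineScheduler(lines)
lines[0].counts.update({"north": 8, "south": 2})  # made-up sample values
report = scheduler.finish("subline_1")
```

After `finish` returns, subline_1 no longer holds its buffer or data set, while the other three branch lines remain active.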
Embodiments 3-8 are the same as Embodiment 2 except that the magnitude of the advertising platform transaction data, the number of analysis dimensions, the number of branch lines, and the corresponding branch line IDs vary as shown in Table 1.
Comparative examples 1-7 use MapReduce in traditional Hadoop to analyze the advertising platform transaction data described in Embodiments 2-8.
The HDFS clusters used in Embodiments 2-8 and Comparative examples 1-7 have the same number of working nodes (12), so their actual data processing capacity is identical.
T(n) is the total running time of n kinds of statistical analysis, Td(n) is the total time of n full table scans, Tm(n) is the total map time of the n kinds of statistical analysis, Tr(n) is the total reduce time of the n kinds of statistical analysis, and ΔT is the computation time required by the branch line scheduler. The values of T(n), Td(n), Tm(n), Tr(n) and ΔT in Table 1 are the means of the corresponding parameters obtained after repeating Embodiments 2-8 and Comparative examples 1-7 three times.
Table 1
| | Data magnitude | Analysis dimensions | Branch lines | Td(n) | Tm(n) | Tr(n) | ΔT | T(n) |
|---|---|---|---|---|---|---|---|---|
| Embodiment 2 | 16TB | 4 | 4 | 6h | 3h | 1.5h | 2min | 10.5h |
| Comparative example 1 | 16TB | 4 | - | 23h | 3.2h | 2h | 0 | 28.2h |
| Embodiment 3 | 10TB | 5 | 5 | 4.25h | 2h | 1.25h | 3min | 7.5h |
| Comparative example 2 | 10TB | 5 | - | 17.5h | 2.1h | 1.5h | 0 | 21.1h |
| Embodiment 4 | 10TB | 15 | 15 | 5h | 2.3h | 1.4h | 12min | 8.9h |
| Comparative example 3 | 10TB | 15 | - | 70h | 2.5h | 1.75h | 0 | 74.25h |
| Embodiment 5 | 5TB | 8 | 8 | 2h | 1.5h | 1.2h | 5min | 4.8h |
| Comparative example 4 | 5TB | 8 | - | 15.5h | 1.75h | 1.5h | 0 | 18.75h |
| Embodiment 6 | 5TB | 30 | 30 | 8.5h | 2.5h | 2.1h | 35min | 13.7h |
| Comparative example 5 | 5TB | 30 | - | 56h | 2h | 1.75h | 0 | 59.75h |
| Embodiment 7 | 1.5TB | 12 | 12 | 0.7h | 1h | 0.75h | 8min | 2.6h |
| Comparative example 6 | 1.5TB | 12 | - | 7.8h | 1.2h | 1h | 0 | 10h |
| Embodiment 8 | 1.5TB | 36 | 12 | 2h | 2.8h | 2.2h | 26min | 7.4h |
| Comparative example 7 | 1.5TB | 36 | - | 23h | 3.3h | 2.5h | 0 | 28.8h |
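The T(n) column of Table 1 can be turned into speed-up factors with a few lines of Python. The T(n) values below are copied from Table 1; the `speedup` function and the variable names are illustrative, and the factors are simple ratios of the reported totals, not additional measurements.

```python
# (pair label, branch line method T(n), traditional MapReduce T(n)), in hours,
# copied from the T(n) column of Table 1.
TABLE_1 = [
    ("Embodiment 2 vs Comparative example 1", 10.5, 28.2),
    ("Embodiment 3 vs Comparative example 2", 7.5, 21.1),
    ("Embodiment 4 vs Comparative example 3", 8.9, 74.25),
    ("Embodiment 5 vs Comparative example 4", 4.8, 18.75),
    ("Embodiment 6 vs Comparative example 5", 13.7, 59.75),
    ("Embodiment 7 vs Comparative example 6", 2.6, 10.0),
    ("Embodiment 8 vs Comparative example 7", 7.4, 28.8),
]

def speedup(branch_hours, mapreduce_hours):
    """Ratio of the traditional total time to the branch line total time."""
    return mapreduce_hours / branch_hours

speedups = {label: speedup(b, m) for label, b, m in TABLE_1}
largest = max(speedups, key=speedups.get)

for label, factor in speedups.items():
    print(f"{label}: {factor:.1f}x")
```

Every pair shows a factor above 1, and the largest gain in this table belongs to the 15-dimension case on 10 TB, where the traditional method spends most of its time on repeated full scans.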