CN107025140B

CN107025140B - Method for statistical analysis of massive data based on an HDFS cluster

Info

Publication number: CN107025140B
Application number: CN201710206439.3A
Authority: CN
Inventors: 林森; 唐宁; 马娜
Original assignee: Beijing Friends Of Century Polytron Technologies Inc
Current assignee: Tianjin kuaiyou Century Technology Co., Ltd
Priority date: 2017-03-31
Filing date: 2017-03-31
Publication date: 2018-03-09
Anticipated expiration: 2037-03-31
Also published as: CN107025140A

Abstract

The present invention relates to a method for statistical analysis of massive data based on an HDFS cluster, characterized in that it comprises: establishing a branch line scheduler, creating branch lines from a configuration file, initializing the branch lines, grouping the data, outputting the statistics, and releasing resources. By setting up branch lines, the invention reduces the number of full data scans from the number of data dimensions to one, which greatly improves the efficiency of statistical analysis. Moreover, because branch lines are used, once the statistics of a data dimension are complete, the system resources occupied by the branch line responsible for that dimension are released and can be reused for other statistics and analysis. This meets the advertising industry's current high demand for business data statistics without adding hardware.

Description

Method for statistical analysis of massive data based on an HDFS cluster

Technical field

The present invention relates generally to methods for processing massive data and, more specifically, to a method for statistical analysis of massive data based on an HDFS cluster.

Background art

With the explosive growth of the mobile Internet, people have rapidly shifted from the traditional PC to the mobile phone, which has firmly established itself as the user's "first screen". Under this trend, large numbers of application developers need to monetize their applications and naturally choose to connect to mobile advertising platforms, while demand-side platforms (Demand-Side Platform, DSP) need to deliver advertisements precisely. Driven by this market demand, advertisement exchange platforms complete a very large number of advertisement transactions every day, and as a result their servers generate on the order of a hundred TB of advertisement transaction data daily. Based on this massive data, covering attributes such as terminal device, region and GPS, statistical analyses of different kinds or dimensions must be carried out, for example analysis of advertisement transaction data, analysis of user behavior, analysis of user cheating (malicious traffic inflation), and the daily spending of demand-side platforms.

At present, statistical analysis of massive data is usually implemented with MapReduce in traditional Hadoop. Hadoop is a distributed system infrastructure that implements a distributed file system (Hadoop Distributed File System, HDFS). The core of the Hadoop framework design is HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over it. In practice, a Hadoop HDFS server cluster receives transaction logs or other data produced by different transaction servers. The HDFS management node (NameNode), also called the master server, then distributes the data and runs MapReduce jobs on multiple working nodes (DataNode) to perform the statistical analysis of a single dimension. The algorithm is as follows. First, a full scan of all the data is performed according to the algorithm of the statistical dimension, the data is grouped, and tasks are distributed to the working nodes accordingly; this is the map process. The working nodes then sort and merge the data in parallel. Finally, the merged data of each node is collected, the reduce operation is performed, and the result of the statistical analysis is stored. MapReduce is then executed again, in turn, for each remaining dimension. A flow diagram of MapReduce is shown in the accompanying drawings.
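The per-dimension procedure described above can be illustrated with a minimal, single-process Python sketch. It is not Hadoop code: the record fields (`device`, `region`) and the function name `run_per_dimension` are assumptions made for illustration, and the map, merge and reduce phases are collapsed into a simple counter. It only shows that every additional dimension triggers another full pass over the data.

```python
from collections import Counter

def run_per_dimension(records, dimensions):
    """Count values for each dimension the way the prior-art flow does:
    one complete scan of `records` per dimension (hypothetical helper)."""
    results = {}
    full_scans = 0
    for dim in dimensions:
        counts = Counter()
        for record in records:          # full scan for this dimension only
            counts[record[dim]] += 1    # map + merge + reduce, collapsed
        results[dim] = dict(counts)
        full_scans += 1
    return results, full_scans

# Illustrative records; the field names are assumptions, not from the patent.
records = [
    {"device": "phone", "region": "north"},
    {"device": "tablet", "region": "north"},
    {"device": "phone", "region": "south"},
]
results, full_scans = run_per_dimension(records, ["device", "region"])
print(full_scans)  # 2
```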

MapReduce has to scan the full data set every time it performs the statistical analysis of one dimension. Its total time can be estimated as T(n) = ∑(Td(n) + Tm(n) + Tr(n)), where T(n) is the total running time of n kinds of statistical analysis, Td(n) is the total time of n full table scans, Tm(n) is the total map time of the n kinds of statistical analysis, and Tr(n) is the total reduce time of the n kinds of statistical analysis. When many statistical analyses are required, this method occupies the already limited HDFS server cluster resources for a long time. Moreover, because the computation takes so long, it can no longer meet the requirement of daily settlement with developers and demand-side parties.

Summary of the invention

In view of the above problems, the present invention provides a method for statistical analysis of massive data based on an HDFS cluster. It solves the problems of the prior art, i.e. the MapReduce method, which occupies limited HDFS server cluster resources for too long when analyzing massive data and whose statistical analysis itself takes too long.

To achieve this goal, the present invention adopts the following technical solution.

A method for statistical analysis of massive data based on an HDFS cluster, characterized by the following steps:

Establishing a branch line scheduler, implemented by a separate management node;

Creating branch lines from a configuration file, where a branch line is the smallest statistical unit and one branch line completes the statistics of one data dimension;

Initializing the branch lines, where the branch line scheduler assigns each branch line a branch line ID that is globally unique within the current statistics task;

Data grouping: a full data scan is performed, the raw data is divided according to the data dimensions that the different branch lines need to count, and each piece is assigned a corresponding grouping key, forming [grouping key, data value] key-value pairs, where the grouping key contains the ID of the corresponding branch line;

Data merging: on each working node, the branch line ID is parsed from the grouping key of each key-value pair, the corresponding branch line is looked up in the branch line scheduler by that ID, key-value pairs with the same grouping key are assigned to that branch line, and within each branch line the data values of those pairs are counted and merged into one piece of grouped data;

Statistics output: the grouped data of all nodes in the cluster is gathered, grouped data with the same grouping key is reduced, and the reduction result is passed to the corresponding branch line, which outputs it;

Resource release: after the statistics of a data dimension are complete, the branch line scheduler releases the system resources occupied by the branch line responsible for that dimension.

The system resources include database connection resources, file resources and data collection resources.
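The steps above can be sketched end to end in a few dozen lines of Python. This is a single-process illustration under stated assumptions, not the patented implementation: the class names `BranchLine` and `BranchScheduler`, the record fields, the use of simple counting as the statistic, and the simulation of two working nodes as two lists are all hypothetical.

```python
from collections import defaultdict

class BranchLine:
    """Smallest statistical unit: one branch line handles one data dimension."""
    def __init__(self, branch_id, dimension):
        self.branch_id = branch_id
        self.dimension = dimension
        self.output = {}          # final statistics for this dimension
        self.released = False

class BranchScheduler:
    """Registry of branch lines; stands in for the separate management node."""
    def __init__(self, dimensions):
        # IDs are unique within this statistics task.
        self.lines = {f"subline_{i}": BranchLine(f"subline_{i}", d)
                      for i, d in enumerate(dimensions, start=1)}

    def find(self, grouping_key):
        # The longest matching "<branch ID>_" prefix identifies the branch line.
        ids = [b for b in self.lines if grouping_key.startswith(b + "_")]
        return self.lines[max(ids, key=len)]

    def release(self, line):
        line.released = True      # placeholder for closing files, connections

def scan_and_group(records, scheduler):
    """Single full scan: emit one [grouping key, value] pair per branch line."""
    for record in records:
        for line in scheduler.lines.values():
            yield f"{line.branch_id}_{record[line.dimension]}", 1

def local_merge(pairs):
    """Merge pairs with the same grouping key on one working node."""
    merged = defaultdict(int)
    for key, value in pairs:
        merged[key] += value
    return dict(merged)

def reduce_and_output(node_results, scheduler):
    """Gather grouped data from all nodes, reduce by key, output via branch lines."""
    totals = defaultdict(int)
    for merged in node_results:
        for key, value in merged.items():
            totals[key] += value
    for key, value in totals.items():
        line = scheduler.find(key)
        line.output[key[len(line.branch_id) + 1:]] = value
    for line in scheduler.lines.values():
        scheduler.release(line)   # resources freed once the output is written

# Two simulated working nodes, each holding part of the data (illustrative).
node_a = [{"device": "phone", "region": "north"},
          {"device": "phone", "region": "south"}]
node_b = [{"device": "tablet", "region": "north"}]

scheduler = BranchScheduler(["device", "region"])
node_results = [local_merge(scan_and_group(part, scheduler))
                for part in (node_a, node_b)]
reduce_and_output(node_results, scheduler)
for line in scheduler.lines.values():
    print(line.branch_id, line.dimension, line.output)
```

Each record is read once per node regardless of how many branch lines exist; adding a dimension adds emitted pairs, not another scan.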

The management node that implements the branch line scheduler in the present invention is different from the prior-art management node (NameNode): it is responsible for allocating branch lines, i.e. allocating the working nodes occupied by each branch line.

Further, the grouping key is composed as: branch line ID_statistics key.
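A minimal sketch of composing and parsing such a key follows. The function names are hypothetical, and the parsing strategy is an assumption: because branch line IDs such as `subline_1` themselves contain an underscore, the sketch matches the key against the registered IDs rather than splitting on the separator.

```python
def make_grouping_key(branch_id, stat_key):
    """Compose a grouping key as '<branch line ID>_<statistics key>'."""
    return f"{branch_id}_{stat_key}"

def parse_grouping_key(grouping_key, known_branch_ids):
    """Split a grouping key back into (branch line ID, statistics key).

    Branch IDs such as 'subline_1' contain the separator themselves, so the
    key is matched against the registered IDs instead of being split on '_'.
    """
    matches = [b for b in known_branch_ids if grouping_key.startswith(b + "_")]
    if not matches:
        raise KeyError(f"no branch line registered for {grouping_key!r}")
    branch_id = max(matches, key=len)   # prefer the most specific ID
    return branch_id, grouping_key[len(branch_id) + 1:]

ids = ["subline_1", "subline_2", "subline_10"]
key = make_grouping_key("subline_10", "region_north")
print(key)                            # subline_10_region_north
print(parse_grouping_key(key, ids))   # ('subline_10', 'region_north')
```

Including the trailing separator in the prefix test keeps `subline_1` from matching keys that belong to `subline_10`.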

Further, after the branch lines are initialized, the branch line IDs are also verified, so that no branch line ID coincides with any other parameter.
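One way such a verification might look is sketched below. The source does not say which "other parameters" are checked, so the set of reserved names passed in here (for example `input_path`) is purely an assumption, as is the function name `verify_branch_ids`.

```python
def verify_branch_ids(branch_ids, other_parameters):
    """Return the problems found: duplicate IDs and IDs that collide with
    other parameter names used by the job (all names are hypothetical)."""
    problems = []
    seen = set()
    for branch_id in branch_ids:
        if branch_id in seen:
            problems.append(f"duplicate branch line ID: {branch_id}")
        seen.add(branch_id)
        if branch_id in other_parameters:
            problems.append(f"ID collides with parameter: {branch_id}")
    return problems

ok = verify_branch_ids(["subline_1", "subline_2", "subline_3"], {"input_path"})
bad = verify_branch_ids(["subline_1", "subline_1", "input_path"], {"input_path"})
print(ok)   # []
print(bad)
```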

Further, the ratio of the number of branch lines to the number of working nodes is greater than 0 and not more than 2, and preferably between 0.67 and 1. When this ratio is too large, data transfer between cluster nodes increases significantly, scheduling by the branch line scheduler becomes harder, and the risk that the management node implementing the branch line scheduler goes down increases. When the ratio is between 0.67 and 1, the system runs more efficiently.
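The two ranges can be expressed as a small check, sketched here under assumptions: the function name and the labels are hypothetical, and the source does not state whether the bounds 0.67 and 1 are inclusive, so the sketch treats them as inclusive. The node count of 12 matches the cluster used in the embodiments; the branch line counts are illustrative.

```python
def classify_ratio(branch_lines, working_nodes):
    """Classify the branch-line / working-node ratio against the stated ranges."""
    if branch_lines <= 0 or working_nodes <= 0:
        raise ValueError("both counts must be positive")
    ratio = branch_lines / working_nodes
    if 0.67 <= ratio <= 1:          # preferred range (bounds assumed inclusive)
        return ratio, "preferred"
    if ratio <= 2:                  # allowed range: greater than 0, at most 2
        return ratio, "allowed"
    return ratio, "too high"

for n in (4, 9, 12, 30):
    ratio, verdict = classify_ratio(n, 12)
    print(n, round(ratio, 2), verdict)
```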

It follows from the above scheme that the total time of the method of the present invention can be estimated as T(n) = ∑(Td(1) + Tm(n) + Tr(n)) + ΔT, where Td(1) is the total time of a single full table scan, Tm(n) is the total map time of n kinds of statistics, Tr(n) is the total reduce time of n kinds of statistics, and ΔT is the computation time required by the branch line scheduler.

In practice, when the ratio of the number of branch lines to the number of working nodes does not exceed 2, the computation time required by the branch line scheduler is very short and can be ignored. The total time of the method disclosed herein can then be estimated as T(n) = ∑(Td(1) + Tm(n) + Tr(n)).
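The two estimates can be compared with a short calculation. The sketch below treats each term as an already aggregated total in hours; the input figures are hypothetical, and the assumption that n full scans cost n times one scan is a simplification introduced here, not a statement from the source.

```python
def total_time_mapreduce(td_all_scans, tm, tr):
    """Prior-art estimate: T(n) = Td(n) + Tm(n) + Tr(n), all in hours."""
    return td_all_scans + tm + tr

def total_time_branch_lines(td_one_scan, tm, tr, delta_t=0.0):
    """Branch-line estimate: T(n) = Td(1) + Tm(n) + Tr(n) + ΔT, all in hours."""
    return td_one_scan + tm + tr + delta_t

# Hypothetical inputs: 4 dimensions, 5 h per full scan, 3 h map, 2 h reduce,
# and 0.25 h of scheduler overhead.
n, scan_hours, tm, tr = 4, 5.0, 3.0, 2.0
baseline = total_time_mapreduce(n * scan_hours, tm, tr)
branch = total_time_branch_lines(scan_hours, tm, tr, delta_t=0.25)
print(baseline, branch)   # 25.0 10.25
```

Under these assumptions the saving comes almost entirely from the scan term, which is why it grows with the number of dimensions.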

Therefore, compared with the prior art, the present invention uses branch lines to reduce the number of full data scans from the number of data dimensions to one, which greatly improves the efficiency of statistical analysis. Moreover, because the method uses branch lines, once the statistics of a data dimension are complete, the system resources occupied by the branch line responsible for that dimension are released and can be reused for other statistics and analysis. This meets the advertising industry's current high demand for business data statistics without adding hardware.

In addition, by merging data locally on each working node, the present invention effectively reduces data transfer between cluster nodes, which helps to shorten the computation time of massive data statistics.
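The effect of local merging on transfer volume can be illustrated by counting how many key-value pairs a single working node would have to send with and without the merge. The function name and the sample grouping keys are assumptions made for this sketch.

```python
from collections import Counter

def pairs_to_ship(pairs, merge_locally):
    """Return the key-value pairs one working node would send to the reducer."""
    if not merge_locally:
        return list(pairs)
    merged = Counter()
    for key, value in pairs:
        merged[key] += value
    return list(merged.items())

# Pairs emitted by one node's scan (illustrative grouping keys).
pairs = [("subline_1_phone", 1)] * 1000 + [("subline_1_tablet", 1)] * 500
print(len(pairs_to_ship(pairs, merge_locally=False)))  # 1500
print(len(pairs_to_ship(pairs, merge_locally=True)))   # 2
```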

Brief description of the drawings

Fig. 1 is a flow diagram of MapReduce in prior-art Hadoop;

Fig. 2 is a flow diagram of a preferred embodiment of the present invention.

Detailed description

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings. The embodiments described with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it. To avoid unnecessarily obscuring the embodiments, well-known techniques, i.e. techniques that are obvious to those skilled in the art, are not described in detail in this section.

Embodiment 1

A method for statistical analysis of massive data based on an HDFS cluster, whose flow is shown in Fig. 2, characterized by the following steps:

Creating branch line 1, branch line 2 and branch line 3 from a configuration file, where a branch line is the smallest statistical unit and one branch line completes the statistics of one data dimension;

Initializing the branch lines, where the branch line scheduler assigns each branch line a branch line ID that is globally unique within the current statistics task: the ID of branch line 1 is subline_1, the ID of branch line 2 is subline_2 and the ID of branch line 3 is subline_3, and the three IDs are verified so that no branch line ID coincides with any other parameter;

Data grouping: a full data scan is performed, the raw data is divided according to the data dimensions that the different branch lines need to count, and each piece is assigned a corresponding grouping key, forming [grouping key, data value] key-value pairs, where the grouping key contains the ID of the corresponding branch line;

Embodiment 2

A method for statistical analysis of massive data based on an HDFS cluster is used to analyze advertising platform data on the order of 16 TB along four dimensions: advertisement transaction data, user behavior, user cheating, and the daily spending of demand-side platforms. It is characterized by the following steps:

Creating branch line 1, branch line 2, branch line 3 and branch line 4 from a configuration file, corresponding respectively to the analysis of these four dimensions (advertisement transaction data, user behavior, user cheating, and the daily spending of demand-side platforms), where a branch line is the smallest statistical unit and one branch line completes the statistics of one data dimension;

Initializing the branch lines, where the branch line scheduler assigns each branch line a branch line ID that is globally unique within the current statistics task: the ID of branch line 1 is subline_1, the ID of branch line 2 is subline_2, the ID of branch line 3 is subline_3 and the ID of branch line 4 is subline_4, and the four IDs are verified so that no branch line ID coincides with any other parameter;
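As an illustration of creating the four branch lines of this embodiment from a configuration file, the sketch below parses a small JSON document. The patent does not specify the configuration format, so the JSON layout, the dimension names and the function name `create_branch_lines` are all assumptions; only the IDs subline_1 to subline_4 come from the text.

```python
import json

# Hypothetical configuration content; the patent does not specify a format.
CONFIG_TEXT = """
{
  "branch_lines": [
    {"dimension": "ad_transaction"},
    {"dimension": "user_behavior"},
    {"dimension": "user_cheating"},
    {"dimension": "dsp_daily_spend"}
  ]
}
"""

def create_branch_lines(config_text):
    """Create one branch line per configured dimension and assign its ID."""
    config = json.loads(config_text)
    lines = []
    for index, entry in enumerate(config["branch_lines"], start=1):
        lines.append({"id": f"subline_{index}", "dimension": entry["dimension"]})
    ids = [line["id"] for line in lines]
    if len(ids) != len(set(ids)):
        raise ValueError("branch line IDs must be unique within the task")
    return lines

for line in create_branch_lines(CONFIG_TEXT):
    print(line["id"], line["dimension"])
```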

Embodiments 3-8 are the same as Embodiment 2, except that the magnitude of the advertising platform transaction data, the number of analysis dimensions, the number of branch lines and the corresponding branch line IDs change accordingly, as shown in Table 1.

Comparative Examples 1-7 perform the data analysis of the advertising platform transaction data described in Embodiments 2-8 using MapReduce in traditional Hadoop.

The HDFS clusters used in Embodiments 2-8 and Comparative Examples 1-7 have the same number of working nodes (12), i.e. the same actual data processing capacity.

T(n) is the total running time of n kinds of statistical analysis, Td(n) is the total time of n full table scans, Tm(n) is the total map time of the n kinds of statistical analysis, Tr(n) is the total reduce time of the n kinds of statistical analysis, and ΔT is the computation time required by the branch line scheduler. The values of T(n), Td(n), Tm(n), Tr(n) and ΔT in Table 1 are the averages of the corresponding parameters over three repetitions of Embodiments 2-8 and Comparative Examples 1-7.

Table 1

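| Example | Data magnitude | Analysis dimensions | Branch lines | Td(n) | Tm(n) | Tr(n) | ΔT | T(n) |
|---|---|---|---|---|---|---|---|---|
| Embodiment 2 | 16 TB | 4 | | 6 h | 3 h | 1.5 h | 2 min | 10.5 h |
| Comparative example 1 | 16 TB | 4 | N/A | 23 h | 3.2 h | 2 h | 0 | 28.2 h |
| Embodiment 3 | 10 TB | 5 | | 4.25 h | 2 h | 1.25 h | 3 min | 7.5 h |
| Comparative example 2 | 10 TB | 5 | N/A | 17.5 h | 2.1 h | 1.5 h | 0 | 21.1 h |
| Embodiment 4 | 10 TB | 15 | | 5 h | 2.3 h | 1.4 h | 12 min | 8.9 h |
| Comparative example 3 | 10 TB | 15 | N/A | 70 h | 2.5 h | 1.75 h | 0 | 74.25 h |
| Embodiment 5 | 5 TB | 8 | | 2 h | 1.5 h | 1.2 h | 5 min | 4.8 h |
| Comparative example 4 | 5 TB | 8 | N/A | 15.5 h | 1.75 h | 1.5 h | 0 | 18.75 h |
| Embodiment 6 | 5 TB | 30 | | 8.5 h | 2.5 h | 2.1 h | 35 min | 13.7 h |
| Comparative example 5 | 5 TB | 30 | N/A | 56 h | 2 h | 1.75 h | 0 | 59.75 h |
| Embodiment 7 | 1.5 TB | 12 | | 0.7 h | 1 h | 0.75 h | 8 min | 2.6 h |
| Comparative example 6 | 1.5 TB | 12 | N/A | 7.8 h | 1.2 h | 1 h | 0 | 10 h |
| Embodiment 8 | 1.5 TB | 36 | 12 | 2 h | 2.8 h | 2.2 h | 26 min | 7.4 h |
| Comparative example 7 | 1.5 TB | 36 | N/A | 23 h | 3.3 h | 2.5 h | 0 | 28.8 h |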

Claims

1. A method for statistical analysis of massive data based on an HDFS cluster, characterized by:

Establishing a branch line scheduler, implemented by a separate management node;

Creating branch lines from a configuration file, where a branch line is the smallest statistical unit and one branch line completes the statistics of one data dimension;

Initializing the branch lines, where the branch line scheduler assigns each branch line a branch line ID that is globally unique within the current statistics task;

Data grouping: a full data scan is performed, the raw data is divided according to the data dimensions that the different branch lines need to count, and each piece is assigned a corresponding grouping key, forming [grouping key, data value] key-value pairs, where the grouping key contains the ID of the corresponding branch line;

Data merging: on each working node, the branch line ID is parsed from the grouping key of each key-value pair, the corresponding branch line is looked up in the branch line scheduler by that ID, key-value pairs with the same grouping key are assigned to that branch line, and within each branch line the data values of those pairs are counted and merged into one piece of grouped data;

Statistics output: the grouped data of all nodes in the cluster is gathered, grouped data with the same grouping key is reduced, and the reduction result is passed to the corresponding branch line, which outputs it;

Resource release: after the statistics of a data dimension are complete, the branch line scheduler releases the system resources occupied by the branch line responsible for that dimension;

The ratio of the number of branch lines to the number of working nodes is between 0.67 and 1.

2. The method for statistical analysis of massive data based on an HDFS cluster according to claim 1, characterized in that the grouping key is composed as: branch line ID_statistics key.

3. The method for statistical analysis of massive data based on an HDFS cluster according to claim 1, characterized in that, after the branch lines are initialized, the branch line IDs are also verified, so that no branch line ID coincides with any other parameter.