CN107025140B - Method for statistical analysis of massive data based on an HDFS cluster - Google Patents

Info

Publication number
CN107025140B
Authority
CN
China
Prior art keywords
branch line
data
statistics
key
dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710206439.3A
Other languages
Chinese (zh)
Other versions
CN107025140A (en)
Inventor
林森
唐宁
马娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin kuaiyou Century Technology Co., Ltd
Original Assignee
Beijing Friends Of Century Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Friends Of Century Polytron Technologies Inc
Priority to CN201710206439.3A
Publication of CN107025140A
Application granted
Publication of CN107025140B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022 Mechanisms to release resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention relates to a method for statistical analysis of massive data based on an HDFS cluster, characterized in that it comprises: establishing a branch line scheduler, creating branch lines from a configuration file, initializing the branch lines, grouping the data, outputting the statistics, and releasing resources. By setting up branch lines, the invention reduces the number of full data scans from the number of data dimensions to one, which greatly improves the efficiency of statistical data analysis. Moreover, because "branch lines" are used, once the statistics for one data dimension are complete, the system resources occupied by the branch line responsible for that dimension are released and can be reused for other statistics and analysis. The advertising industry's current high demand for business data statistics is thus met without adding hardware.

Description

Method for statistical analysis of massive data based on an HDFS cluster
Technical field
The present invention relates generally to methods for processing massive data and, more specifically, to a method for statistical analysis of massive data based on an HDFS cluster.
Background
With the explosive growth of the mobile Internet, users have moved quickly from traditional PCs to mobile phones, and the phone has become firmly established as the user's "first screen". Under this trend, the many application developers who need to monetize naturally choose to connect to mobile advertising platforms, while demand-side platforms (DSPs) need to deliver advertisements precisely. Driven by this market demand, an advertising exchange completes a very large volume of advertisement transactions every day, and as a result its servers produce upwards of a hundred terabytes of advertisement transaction data daily. Massive data processing based on terminal device, region, GPS and similar attributes is needed to carry out statistical analyses of different kinds or dimensions, such as analysis of advertisement transaction data, analysis of user behavior, analysis of user cheating (malicious traffic inflation), and analysis of each demand-side platform's daily spending.
At present, statistical analysis of massive data is usually implemented with MapReduce in conventional Hadoop. Hadoop is a distributed system architecture that implements a distributed file system, the Hadoop Distributed File System (HDFS). The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over it. In practice, a Hadoop HDFS server cluster receives transaction logs or other data produced by different transaction servers. The HDFS management node (NameNode), also called the master server, then distributes MapReduce jobs to run on multiple working nodes (DataNodes), producing the statistics for a single dimension. The procedure is as follows. First, all of the data is scanned according to the algorithm for the dimension being counted, so that the data can be grouped and tasks assigned to the working nodes accordingly; this is the map process. The working nodes then sort and merge the data in parallel. Finally, the merged data from every node is collected, the reduce operation is performed, and the resulting statistics are stored. MapReduce is then run again for each remaining dimension in turn. A schematic flow chart of MapReduce is shown in the accompanying drawings.
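The repeated-scan behaviour described above can be illustrated with a small, single-process sketch. It is not Hadoop code: the record fields (`device`, `region`, `cost`) and the function names are hypothetical, and each call to `run_single_dimension_job` merely stands in for one complete MapReduce job.

```python
from collections import Counter

# Hypothetical ad-transaction records; field names are illustrative only.
RECORDS = [
    {"device": "android", "region": "north", "cost": 3},
    {"device": "ios", "region": "north", "cost": 5},
    {"device": "android", "region": "south", "cost": 2},
]


def run_single_dimension_job(records, dimension):
    """Stand-in for one MapReduce job: scan everything, total cost per value."""
    totals = Counter()
    scanned = 0
    for record in records:  # full scan of the data set
        scanned += 1
        totals[record[dimension]] += record["cost"]
    return dict(totals), scanned


def run_all_dimensions(records, dimensions):
    """Repeat the job once per dimension, so the data is scanned once per dimension."""
    results, total_scanned = {}, 0
    for dimension in dimensions:
        results[dimension], scanned = run_single_dimension_job(records, dimension)
        total_scanned += scanned
    return results, total_scanned


results, total_scanned = run_all_dimensions(RECORDS, ["device", "region"])
print(results)
print(total_scanned)  # records read across both jobs
```

With two dimensions, every record is read twice; the number of reads grows linearly with the number of dimensions.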
Each time MapReduce computes the statistics for one dimension, it must scan all of the data. The total time can be estimated as T(n) = ∑(Td(n) + Tm(n) + Tr(n)), where T(n) is the total running time of n kinds of statistical analysis, Td(n) is the total time of n full table scans, Tm(n) is the total map time of the n analyses, and Tr(n) is the total reduce time of the n analyses. When many statistical analyses are required, this method occupies the relatively limited HDFS server cluster resources for a long time. Moreover, because the computation takes so long, it can no longer meet the requirement of daily settlement with developers and demand-side parties.
Summary of the invention
In view of the above problems, the present invention provides a method for statistical analysis of massive data based on an HDFS cluster. It addresses two problems of the prior art, namely the MapReduce method: limited HDFS server cluster resources are occupied for too long when massive data is analyzed, and the statistical analysis itself takes too long.
To achieve this, the present invention adopts the following technical solution.
A method for statistical analysis of massive data based on an HDFS cluster, characterized by the following steps:
Establishing a branch line scheduler, implemented by a separate management node;
Creating branch lines from a configuration file, a branch line being the smallest statistical unit, with each branch line completing the statistics for one data dimension;
Initializing the branch lines, in which the branch line scheduler assigns each branch line a branch line ID that is globally unique within the current statistics task;
Grouping the data, in which a full scan of the data is carried out, the raw data is divided according to the data dimensions that the different branch lines need to count and is given corresponding grouping keys, forming [grouping key, data value] key-value pairs, each grouping key containing the ID of its branch line;
Merging the data, in which each working node parses the branch line ID out of the grouping key of each key-value pair, looks up the corresponding branch line in the branch line scheduler by that ID, and assigns key-value pairs with the same grouping key to that branch line; within each branch line the data values of the key-value pairs are aggregated and merged into a single grouped-data pair;
Outputting the statistics, in which the grouped data from every node in the cluster is gathered, grouped data with the same grouping key is reduced, and the reduction result is passed to the corresponding branch line, which outputs it;
Releasing resources, in which, once the statistics for a data dimension are complete, the branch line scheduler releases the system resources occupied by the branch line responsible for that dimension.
The system resources include database connection resources, file resources, and data collection resources.
The management node that implements the branch line scheduler is different from the prior-art management node, the NameNode. The management node of the present invention allocates branch lines, that is, it allocates the working nodes occupied by the branch lines.
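The grouping, merging and output steps can be sketched in a few lines of single-process Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `BRANCH_LINES` mapping, the record fields and all function names are hypothetical, and the two lists `worker_a` and `worker_b` stand in for data held by two working nodes.

```python
from collections import defaultdict

# Hypothetical branch lines: ID -> record field that the line aggregates.
BRANCH_LINES = {"subline_1": "device", "subline_2": "region"}


def map_record(record):
    """Emit one [grouping key, data value] pair per branch line for a record."""
    for branch_id, field in BRANCH_LINES.items():
        yield f"{branch_id}_{record[field]}", record["cost"]


def local_merge(records):
    """Scan a worker's records once and sum the values per grouping key."""
    merged = defaultdict(int)
    for record in records:
        for key, value in map_record(record):
            merged[key] += value
    return merged


def reduce_and_route(partials):
    """Combine per-worker partial sums, then hand results to each branch line."""
    totals = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            totals[key] += value
    outputs = {branch_id: {} for branch_id in BRANCH_LINES}
    for key, value in totals.items():
        for branch_id in BRANCH_LINES:
            prefix = branch_id + "_"
            if key.startswith(prefix):
                outputs[branch_id][key[len(prefix):]] = value
                break
    return outputs


worker_a = [{"device": "android", "region": "north", "cost": 3},
            {"device": "ios", "region": "north", "cost": 5}]
worker_b = [{"device": "android", "region": "south", "cost": 2}]
outputs = reduce_and_route([local_merge(worker_a), local_merge(worker_b)])
print(outputs)
```

Each record is read once, yet both dimensions receive their totals, because the branch line ID embedded in the grouping key keeps the dimensions apart.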
Further, the grouping key has the form: branch line ID_statistic key.
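A possible way to build and parse such keys is sketched below. The function names are hypothetical. One detail is an assumption of this sketch rather than something the text specifies: because IDs such as `subline_1` already contain an underscore, the parser matches the key against the registered IDs instead of splitting on the first underscore.

```python
def make_group_key(branch_id, stat_key):
    """Build a grouping key in the 'branch line ID_statistic key' form."""
    return f"{branch_id}_{stat_key}"


def parse_group_key(group_key, known_branch_ids):
    """Split a grouping key back into (branch_id, stat_key).

    The key is matched against the registered IDs, longest first, so that
    underscores inside the ID or the statistic key cause no ambiguity.
    """
    for branch_id in sorted(known_branch_ids, key=len, reverse=True):
        prefix = branch_id + "_"
        if group_key.startswith(prefix):
            return branch_id, group_key[len(prefix):]
    raise KeyError(f"no registered branch line for key {group_key!r}")


known_ids = ["subline_1", "subline_10"]
key = make_group_key("subline_10", "north_east")
print(key)
print(parse_group_key(key, known_ids))
```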
Further, after the branch lines are initialized, the branch line IDs are also verified to prevent a branch line ID from coinciding with other parameters.
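The text does not say how the verification is performed. One simple reading, shown here as an assumption, is a check that no ID is duplicated and that none clashes with a set of reserved parameter names; `verify_branch_ids` and `reserved_names` are hypothetical.

```python
def verify_branch_ids(branch_ids, reserved_names=()):
    """Return the IDs that are duplicated or clash with reserved names."""
    seen, conflicts = set(), []
    reserved = set(reserved_names)
    for branch_id in branch_ids:
        if branch_id in seen or branch_id in reserved:
            conflicts.append(branch_id)
        seen.add(branch_id)
    return conflicts


print(verify_branch_ids(["subline_1", "subline_2", "subline_3"]))
print(verify_branch_ids(["subline_1", "subline_1", "job_id"], reserved_names=["job_id"]))
```

An empty result means that the IDs can be used as they are.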
Further, the ratio of the number of branch lines to the number of working nodes is greater than 0 and no greater than 2, and preferably between 0.67 and 1. When this ratio is too large, data transmission between the nodes of the cluster increases markedly, scheduling by the branch line scheduler becomes harder, and the risk that the management node implementing the scheduler goes down increases. When the ratio is between 0.67 and 1, the system runs more efficiently.
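The two ranges can be turned into a small configuration check. The thresholds come from the text; the function name and the returned labels are hypothetical.

```python
def classify_ratio(branch_count, worker_count):
    """Classify the branch-line/worker ratio against the stated ranges."""
    if branch_count <= 0 or worker_count <= 0:
        return "invalid"
    ratio = branch_count / worker_count
    if ratio > 2:
        return "out of range"
    if 0.67 <= ratio <= 1:
        return "preferred"
    return "allowed"


# 12 working nodes, as in the embodiments described later in the document.
for branches in (4, 9, 12, 15, 30):
    print(branches, classify_ratio(branches, 12))
```

With 12 working nodes, 9 to 12 branch lines fall in the preferred range, while 30 branch lines give a ratio of 2.5 and fall outside the stated upper bound.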
It follows from the above scheme that the total time of the method can be estimated as T(n) = ∑(Td(1) + Tm(n) + Tr(n)) + ΔT, where Td(1) is the total time of one full table scan, Tm(n) is the total map time of the n kinds of statistics, Tr(n) is the total reduce time of the n kinds of statistics, and ΔT is the computing time required for branch line scheduler processing.
In practice, when the ratio of the number of branch lines to the number of working nodes does not exceed 2, the computing time required by the branch line scheduler is very short and can be ignored. The total time of the disclosed method can then be estimated as T(n) = ∑(Td(1) + Tm(n) + Tr(n)).
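The two estimates can be compared numerically. The sketch below assumes, for simplicity, that a full scan costs the same every time and that map and reduce times are given per statistic; the function names and the timings are made up for illustration and are not measurements from the document.

```python
def baseline_total(scan_time, map_times, reduce_times):
    """One full scan per statistic, plus every map and reduce phase."""
    return scan_time * len(map_times) + sum(map_times) + sum(reduce_times)


def branch_line_total(scan_time, map_times, reduce_times, scheduler_time=0.0):
    """A single shared scan, the same map and reduce work, plus delta T."""
    return scan_time + sum(map_times) + sum(reduce_times) + scheduler_time


# Made-up per-statistic timings in hours, for illustration only.
scan = 4.0
maps = [0.5, 0.5, 0.5, 0.5]
reduces = [0.25, 0.25, 0.25, 0.25]
print(baseline_total(scan, maps, reduces))     # 19.0
print(branch_line_total(scan, maps, reduces))  # 7.0
```

The difference between the two totals is (n - 1) scans minus ΔT, so the saving grows with the number of statistics as long as ΔT stays small.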
Therefore, compared with the prior art, the present invention, by setting up branch lines, reduces the number of full data scans from the number of data dimensions to one, greatly improving the efficiency of statistical data analysis. In addition, because the method uses "branch lines", once the statistics for a data dimension are complete, the system resources occupied by the branch line responsible for that dimension are released and can be reused for other statistics and analysis. The advertising industry's current high demand for business data statistics is thus met without adding hardware.
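The release step can be pictured as the scheduler closing whatever a finished branch line still holds. The classes below are a hypothetical sketch: `BranchLine`, `BranchLineScheduler` and their methods are not named in the document, and an in-memory `io.StringIO` buffer stands in for a file resource.

```python
import io


class BranchLine:
    """Hypothetical branch line that owns some releasable resources."""

    def __init__(self, branch_id):
        self.branch_id = branch_id
        self.resources = []  # e.g. open files, connections, buffers
        self.released = False

    def acquire(self, resource):
        self.resources.append(resource)

    def release(self):
        for resource in self.resources:
            close = getattr(resource, "close", None)
            if callable(close):
                close()
        self.resources.clear()
        self.released = True


class BranchLineScheduler:
    """Hypothetical scheduler that tracks branch lines by ID."""

    def __init__(self):
        self.lines = {}

    def register(self, line):
        self.lines[line.branch_id] = line

    def complete(self, branch_id):
        """Called when a dimension's statistics are done: free its resources."""
        line = self.lines.pop(branch_id)
        line.release()
        return line


scheduler = BranchLineScheduler()
line = BranchLine("subline_1")
buffer = io.StringIO()
line.acquire(buffer)
scheduler.register(line)
scheduler.complete("subline_1")
print(buffer.closed, line.released, scheduler.lines)
```

Releasing per branch line, rather than at the end of the whole task, is what lets the freed resources be reused while other dimensions are still being processed.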
In addition, by merging data locally on each working node, the present invention effectively reduces data transmission between the nodes of the cluster, which helps shorten the computing time of the statistical analysis.
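The effect of local merging on transfer volume can be shown by counting the key-value pairs that leave each worker. This is an illustrative sketch with hypothetical function names and made-up pairs; it counts pairs only and ignores their size on the wire.

```python
from collections import Counter


def pairs_without_local_merge(worker_pairs):
    """Every emitted pair is shipped to the reduce step."""
    return sum(len(pairs) for pairs in worker_pairs)


def pairs_with_local_merge(worker_pairs):
    """Each worker ships one pair per distinct grouping key."""
    shipped = 0
    for pairs in worker_pairs:
        merged = Counter()
        for key, value in pairs:
            merged[key] += value
        shipped += len(merged)
    return shipped


worker_pairs = [
    [("subline_1_android", 1), ("subline_1_android", 1), ("subline_1_ios", 1)],
    [("subline_1_android", 1), ("subline_1_android", 1)],
]
print(pairs_without_local_merge(worker_pairs))
print(pairs_with_local_merge(worker_pairs))
```

The saving depends on how often grouping keys repeat within a worker; when every key is unique, local merging ships just as many pairs.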
Brief description of the drawings
Fig. 1 is a schematic flow chart of MapReduce in prior-art Hadoop;
Fig. 2 is a schematic flow chart of a preferred embodiment of the present disclosure.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings. The embodiments described with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it. To avoid unnecessarily obscuring the embodiments, this section does not describe in detail techniques that are well known, that is, obvious to a person skilled in the art.
Embodiment 1
A method for statistical analysis of massive data based on an HDFS cluster, whose flow is shown in Fig. 2, characterized by the following steps:
Establishing a branch line scheduler, implemented by a separate management node;
Creating branch line 1, branch line 2 and branch line 3 from a configuration file, a branch line being the smallest statistical unit, with each branch line completing the statistics for one data dimension;
Initializing the branch lines, in which the branch line scheduler assigns each branch line a branch line ID that is globally unique within the current statistics task: the ID of branch line 1 is subline_1, the ID of branch line 2 is subline_2, and the ID of branch line 3 is subline_3; the three IDs are verified to prevent a branch line ID from coinciding with other parameters;
Grouping the data, in which a full scan of the data is carried out, the raw data is divided according to the data dimensions that the different branch lines need to count and is given corresponding grouping keys, forming [grouping key, data value] key-value pairs, each grouping key containing the ID of its branch line;
Merging the data, in which each working node parses the branch line ID out of the grouping key of each key-value pair, looks up the corresponding branch line in the branch line scheduler by that ID, and assigns key-value pairs with the same grouping key to that branch line; within each branch line the data values of the key-value pairs are aggregated and merged into a single grouped-data pair;
Outputting the statistics, in which the grouped data from every node in the cluster is gathered, grouped data with the same grouping key is reduced, and the reduction result is passed to the corresponding branch line, which outputs it;
Releasing resources, in which, once the statistics for a data dimension are complete, the branch line scheduler releases the system resources occupied by the branch line responsible for that dimension.
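The document does not describe the format of the configuration file. Purely as an assumption, the sketch below uses a JSON document with one entry per branch line and derives the IDs subline_1 to subline_3 from the entry order; the `dimension` values and the function name are hypothetical.

```python
import json

# Hypothetical configuration; the source does not specify a file format.
CONFIG_TEXT = """
{
  "branch_lines": [
    {"name": "branch line 1", "dimension": "device"},
    {"name": "branch line 2", "dimension": "region"},
    {"name": "branch line 3", "dimension": "hour"}
  ]
}
"""


def create_branch_lines(config_text):
    """Create one branch line per configured dimension and assign its ID."""
    config = json.loads(config_text)
    lines = {}
    for index, entry in enumerate(config["branch_lines"], start=1):
        lines[f"subline_{index}"] = entry["dimension"]
    return lines


branch_lines = create_branch_lines(CONFIG_TEXT)
print(branch_lines)
```

Keeping the dimensions in a configuration file means that a statistic can be added or removed without changing the processing code.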
Embodiment 2
Using a method for statistical analysis of massive data based on an HDFS cluster, advertising platform data on the order of 16TB is analyzed for advertisement transaction data, user behavior, user cheating, and daily demand-side platform spending, characterized by the following steps:
Establishing a branch line scheduler, implemented by a separate management node;
Creating branch line 1, branch line 2, branch line 3 and branch line 4 from a configuration file, corresponding respectively to the analysis of four dimensions: advertisement transaction data, user behavior, user cheating, and daily demand-side platform spending; a branch line is the smallest statistical unit, and each branch line completes the statistics for one data dimension;
Initializing the branch lines, in which the branch line scheduler assigns each branch line a branch line ID that is globally unique within the current statistics task: the ID of branch line 1 is subline_1, the ID of branch line 2 is subline_2, the ID of branch line 3 is subline_3, and the ID of branch line 4 is subline_4; the four IDs are verified to prevent a branch line ID from coinciding with other parameters;
Grouping the data, in which a full scan of the data is carried out, the raw data is divided according to the data dimensions that the different branch lines need to count and is given corresponding grouping keys, forming [grouping key, data value] key-value pairs, each grouping key containing the ID of its branch line;
Merging the data, in which each working node parses the branch line ID out of the grouping key of each key-value pair, looks up the corresponding branch line in the branch line scheduler by that ID, and assigns key-value pairs with the same grouping key to that branch line; within each branch line the data values of the key-value pairs are aggregated and merged into a single grouped-data pair;
Outputting the statistics, in which the grouped data from every node in the cluster is gathered, grouped data with the same grouping key is reduced, and the reduction result is passed to the corresponding branch line, which outputs it;
Releasing resources, in which, once the statistics for a data dimension are complete, the branch line scheduler releases the system resources occupied by the branch line responsible for that dimension.
Embodiments 3-8 are the same as Embodiment 2 except that, as shown in Table 1, the magnitude of the advertising platform transaction data, the number of analysis dimensions, the number of branch lines, and the corresponding branch line IDs change accordingly.
Comparative examples 1-7 analyze the advertising platform transaction data of Embodiments 2-8 using MapReduce in conventional Hadoop.
The HDFS clusters used in Embodiments 2-8 and Comparative examples 1-7 have the same number of working nodes (12), that is, the same actual data processing capacity.
T(n) is the total running time of n kinds of statistical analysis, Td(n) is the total time of n full table scans, Tm(n) is the total map time of the n analyses, Tr(n) is the total reduce time of the n analyses, and ΔT is the computing time required for branch line scheduler processing. The values of T(n), Td(n), Tm(n), Tr(n) and ΔT in Table 1 are the means of the corresponding parameters obtained by repeating Embodiments 2-8 and Comparative examples 1-7 three times.
Table 1
Example                Data magnitude  Analysis dimensions  Branch lines  Td(n)   Tm(n)   Tr(n)   ΔT      T(n)
Embodiment 2           16TB            4                    4             6h      3h      1.5h    2min    10.5h
Comparative example 1  16TB            4                    -             23h     3.2h    2h      0       28.2h
Embodiment 3           10TB            5                    5             4.25h   2h      1.25h   3min    7.5h
Comparative example 2  10TB            5                    -             17.5h   2.1h    1.5h    0       21.1h
Embodiment 4           10TB            15                   15            5h      2.3h    1.4h    12min   8.9h
Comparative example 3  10TB            15                   -             70h     2.5h    1.75h   0       74.25h
Embodiment 5           5TB             8                    8             2h      1.5h    1.2h    5min    4.8h
Comparative example 4  5TB             8                    -             15.5h   1.75h   1.5h    0       18.75h
Embodiment 6           5TB             30                   30            8.5h    2.5h    2.1h    35min   13.7h
Comparative example 5  5TB             30                   -             56h     2h      1.75h   0       59.75h
Embodiment 7           1.5TB           12                   12            0.7h    1h      0.75h   8min    2.6h
Comparative example 6  1.5TB           12                   -             7.8h    1.2h    1h      0       10h
Embodiment 8           1.5TB           36                   12            2h      2.8h    2.2h    26min   7.4h
Comparative example 7  1.5TB           36                   -             23h     3.3h    2.5h    0       28.8h
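The T(n) totals in Table 1 can be turned into speedup factors with a few lines of Python. The numbers are copied from the table; the pairing of each embodiment with its comparative example follows the statement that Comparative examples 1-7 correspond to Embodiments 2-8. The variable and function names are hypothetical.

```python
# (embodiment, comparative example, T(n) with branch lines, T(n) with MapReduce), in hours
TABLE_1_TOTALS = [
    ("Embodiment 2", "Comparative example 1", 10.5, 28.2),
    ("Embodiment 3", "Comparative example 2", 7.5, 21.1),
    ("Embodiment 4", "Comparative example 3", 8.9, 74.25),
    ("Embodiment 5", "Comparative example 4", 4.8, 18.75),
    ("Embodiment 6", "Comparative example 5", 13.7, 59.75),
    ("Embodiment 7", "Comparative example 6", 2.6, 10.0),
    ("Embodiment 8", "Comparative example 7", 7.4, 28.8),
]


def speedups(rows):
    """Ratio of the MapReduce total to the branch-line total, per embodiment."""
    return {emb: round(baseline / branch, 2) for emb, _, branch, baseline in rows}


for name, factor in speedups(TABLE_1_TOTALS).items():
    print(f"{name}: {factor}x")
```

On these figures the reported totals are between roughly 2.7 and 8.3 times shorter with branch lines, with the largest factor for Embodiment 4 (15 dimensions on 10TB).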

Claims (3)

  1. A method for statistical analysis of massive data based on an HDFS cluster, characterized by:
    Establishing a branch line scheduler, implemented by a separate management node;
    Creating branch lines from a configuration file, a branch line being the smallest statistical unit, with each branch line completing the statistics for one data dimension;
    Initializing the branch lines, in which the branch line scheduler assigns each branch line a branch line ID that is globally unique within the current statistics task;
    Grouping the data, in which a full scan of the data is carried out, the raw data is divided according to the data dimensions that the different branch lines need to count and is given corresponding grouping keys, forming [grouping key, data value] key-value pairs, each grouping key containing the ID of its branch line;
    Merging the data, in which each working node parses the branch line ID out of the grouping key of each key-value pair, looks up the corresponding branch line in the branch line scheduler by that ID, and assigns key-value pairs with the same grouping key to that branch line; within each branch line the data values of the key-value pairs are aggregated and merged into a single grouped-data pair;
    Outputting the statistics, in which the grouped data from every node in the cluster is gathered, grouped data with the same grouping key is reduced, and the reduction result is passed to the corresponding branch line, which outputs it;
    Releasing resources, in which, once the statistics for a data dimension are complete, the branch line scheduler releases the system resources occupied by the branch line responsible for that dimension;
    The ratio of the number of branch lines to the number of working nodes being between 0.67 and 1.
  2. The method for statistical analysis of massive data based on an HDFS cluster according to claim 1, characterized in that the grouping key has the form: branch line ID_statistic key.
  3. The method for statistical analysis of massive data based on an HDFS cluster according to claim 1, characterized in that, after the branch lines are initialized, the branch line IDs are also verified to prevent a branch line ID from coinciding with other parameters.
CN201710206439.3A 2017-03-31 2017-03-31 Method for statistical analysis of massive data based on an HDFS cluster Active CN107025140B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710206439.3A CN107025140B (en) 2017-03-31 2017-03-31 Method for statistical analysis of massive data based on an HDFS cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710206439.3A CN107025140B (en) 2017-03-31 2017-03-31 Method for statistical analysis of massive data based on an HDFS cluster

Publications (2)

Publication Number Publication Date
CN107025140A CN107025140A (en) 2017-08-08
CN107025140B true CN107025140B (en) 2018-03-09

Family

ID=59527458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710206439.3A Active CN107025140B (en) 2017-03-31 2017-03-31 Method for statistical analysis of massive data based on an HDFS cluster

Country Status (1)

Country Link
CN (1) CN107025140B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120174110A1 (en) * 2011-01-05 2012-07-05 International Business Machines Corporation Amortizing costs of shared scans
US8549518B1 (en) * 2011-08-10 2013-10-01 Nutanix, Inc. Method and system for implementing a maintenanece service for managing I/O and storage for virtualization environment
CN105005570A (en) * 2014-04-23 2015-10-28 国家电网公司 Method and apparatus for mining massive intelligent power consumption data based on cloud computing
CN106354813A (en) * 2016-08-29 2017-01-25 北京首信科技股份有限公司 Mass data dimension user positioning method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MRShare: Sharing Across Multiple Queries in MapReduce; Tomasz Nykiel et al; Proceedings of the VLDB Endowment; 2010-09-17; vol. 3, no. 1; pp. 494-505 *
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce; Biswanath Panda et al; VLDB '09; ACM; 2009-12-31; entire document *

Also Published As

Publication number Publication date
CN107025140A (en) 2017-08-08

Similar Documents

Publication Publication Date Title
US10565022B2 (en) Systems for parallel processing of datasets with dynamic skew compensation
CN104902001B (en) Web request load-balancing method based on operating system virtualization
CN106155817A (en) Business information processing method, server and system
Ai et al. Resource allocation and scheduling of multiple composite web services in cloud computing using cooperative coevolution genetic algorithm
CN104615765A (en) Data processing method and data processing device for browsing internet records of mobile subscribers
CN102006174B (en) Data processing method and device based on online behavior of mobile phone user
Gupta et al. Enhanced Virtualization‐Based Dynamic Bin‐Packing Optimized Energy Management Solution for Heterogeneous Clouds
CN102932271A (en) Method and device for realizing load balancing
CN106326339A (en) Task allocating method and device
CN103744918A (en) Vertical domain based micro blog searching ranking method and system
CN108256182A (en) A kind of layout method of dynamic reconfigurable FPGA
Adrian et al. Analysis of K-means algorithm for VM allocation in cloud computing
CN105872082A (en) Fine-grained resource response system based on load balancing algorithm of container cluster
CN107025140B (en) Method for statistical analysis of massive data based on an HDFS cluster
CN106790368A (en) Resource regulating method and device in a kind of distributed system
US11374869B2 (en) Managing bandwidth based on user behavior
EP2622499B1 (en) Techniques to support large numbers of subscribers to a real-time event
KR101219816B1 (en) Cloud server to stably migrate data of member service system without being interrupted
CN111061697A (en) Log data processing method and device, electronic equipment and storage medium
CN109660623A (en) A kind of distribution method, device and the computer readable storage medium of cloud service resource
CN113297436B (en) User policy distribution method and device based on relational graph network and electronic equipment
CN103647712A (en) Distributed route processing business method and distributed route processing business system
CN106385385B (en) Resource allocation method and device
CN110276647A (en) Online service method based on user integral
CN106250205B (en) A kind of virtual machine method for customizing and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: Room 324502, unit 2, building 5, yard 1, Futong East Street, Chaoyang District, Beijing 100020

Patentee after: Beijing friends of the century Polytron Technologies Inc

Address before: 100020 room 315, building 7, 2 North Road, Chaoyang District, Beijing.

Patentee before: Beijing friends of the century Polytron Technologies Inc

TR01 Transfer of patent right

Effective date of registration: 20200701

Address after: 2601, 26F, Baozheng building, 637 Jinchang Road, Binhai New Area, Tianjin

Patentee after: Tianjin kuaiyou Century Technology Co., Ltd

Address before: Room 324502, unit 2, building 5, yard 1, Futong East Street, Chaoyang District, Beijing 100020

Patentee before: BEIJING ADVIEW TECHNOLOGY Co.,Ltd.