CN106951552A

CN106951552A - A user behavior data processing method based on Hadoop

Info

Publication number: CN106951552A
Application number: CN201710191813.7A
Authority: CN
Inventors: 陈粤龙; 陈敏俊; 温亮生; 张治中; 赵瑞莉
Original assignee: Chongqing University of Posts and Telecommunications; China Mobile Hangzhou Information Technology Co Ltd
Current assignee: Chongqing University of Posts and Telecommunications; China Mobile Hangzhou Information Technology Co Ltd
Priority date: 2017-03-27
Filing date: 2017-03-27
Publication date: 2017-07-14

Abstract

The present invention relates to a user behavior data processing method based on Hadoop. The method includes: importing a user history data source into the distributed file system HDFS; generating a historical behavior data table of the user based on the user history data source; collecting the real-time behavior data stream of the user through Flume; recording in real time, with Kafka, the data collected by Flume; processing in real time, with the real-time computing framework Spark and according to the business type of the real-time behavior data stream, the real-time data generated by user behavior, so as to generate a real-time data table of the user; associating the real-time data table and the historical behavior data table of the user by the IMSI number in the IMSI library to obtain a wide behavior data table of the user; exporting the wide behavior data table of the user and saving it to an HBase database according to a preset configuration file; and integrating the query system Impala with the HBase database to provide an external query entry for user behavior data. The technical solution provided by the invention makes it possible to build an efficient, fine-grained user behavior data business system.

Description

A user behavior data processing method based on Hadoop

Technical field

The invention belongs to the field of communication technology and relates to a user behavior data processing method based on Hadoop.

Background art

With the commercialization and widespread deployment of 4G networks, mobile communication services have fully entered the mobile Internet era. Rapidly growing mobile network bandwidth has brought a wide variety of applications and user behaviors, and the complexity and volume of data in communication networks have grown rapidly with it. This places higher demands on the complexity and computational load of data processing, and the data processing capability of traditional database systems faces a great challenge. Faced with massive data processing demands and lower latency limits, traditional data systems are severely tested in CPU computing power, memory response and throughput, and network bandwidth, and they meet many bottlenecks under the trend toward high security and multi-center deployment. In the big data era, the single-node computing model can no longer meet data processing demands; distributed data processing and storage systems are gradually becoming the preferred architecture for big data platforms, and big data technology has become a focus of research. The Hadoop big data platform, however, is mainly based on parallel processing of static data files. Although it is efficient at throughput, computation, and storage of massive data, its real-time performance is poor: it is a high-throughput, high-concurrency, high-latency architecture, and its performance on small files has always been an unavoidable problem. It is therefore ill-suited to data processing and usage scenarios with high real-time requirements.

At present there is no method that integrates and processes the real-time data and historical (offline) data of Internet users, and in particular no lean operation method suited to the development of operators' big data.

Summary of the invention

In view of this, the object of the invention is to provide a user behavior data processing method based on Hadoop that makes it possible to build an efficient, fine-grained user behavior data business system.

To achieve the above object, the invention provides the following technical solution:

A user behavior data processing method based on Hadoop, the method including:

importing a user history data source into the distributed file system HDFS, so that a data access interface is provided through HDFS, wherein the user history data source includes at least one of an international mobile subscriber identity (IMSI) library, an international mobile equipment identity (IMEI) library, and a crawler library;

generating a historical behavior data table of the user based on the user history data source;

collecting the real-time behavior data stream of the user through the data collection tool Flume, the real-time behavior data stream including the user's real-time Internet access logs and real-time parsed data of the user's Internet behavior;

recording in real time, with the distributed subscription system Kafka, the data collected by Flume, Kafka acting as a message buffering component that provides data to the real-time computing framework;

processing in real time, with the real-time computing framework Spark and according to the business type of the real-time behavior data stream, the real-time data generated by user behavior, so as to generate a real-time data table of the user;

associating the real-time data table and the historical behavior data table of the user by the IMSI number in the IMSI library to obtain a wide behavior data table of the user;

exporting the wide behavior data table of the user and saving it to an HBase database according to a preset configuration file;

integrating the query system Impala with the HBase database to provide an external query entry for user behavior data.

Further, generating the historical behavior data table of the user based on the user history data source includes:

associating all historical behavior data of the user by the IMSI number in the IMSI library, and mapping all historical behavior data of the user into the data warehouse tool Hive to form the historical behavior data table of the user.
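The association by IMSI number can be pictured with a minimal plain-Python sketch. It is not the Hive mapping itself: the function name `build_history_table`, the field names `region` and `package`, and the sample IMSI values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_history_table(imsi_library, history_records):
    """Group historical records under each IMSI found in the IMSI library.

    Hypothetical stand-in for mapping the associated data into a Hive table.
    """
    known = set(imsi_library)
    table = defaultdict(dict)
    for record in history_records:
        imsi = record["imsi"]
        if imsi not in known:
            continue  # a record with no matching IMSI cannot be associated
        # Merge every non-key attribute into the single row of this user.
        table[imsi].update({k: v for k, v in record.items() if k != "imsi"})
    return dict(table)


# Illustrative data only; none of these values come from the source.
imsi_library = ["460000000000001", "460000000000002"]
history_records = [
    {"imsi": "460000000000001", "region": "Chongqing"},
    {"imsi": "460000000000001", "package": "4G-basic"},
    {"imsi": "460000000000009", "region": "unknown"},
]
history_table = build_history_table(imsi_library, history_records)
```

Each IMSI ends up with one row that collects all of its historical attributes, which is the shape a per-user historical behavior table would take.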

Further, after the distributed subscription system Kafka records in real time the data collected by Flume, the method further includes:

determining whether the data to be processed has been buffered in the Kafka configuration file; if so, sending the data to be processed to the real-time computing framework Spark; if not, feeding the data to be processed back to the distributed subscription system Kafka.
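The buffering check can be read as a simple routing decision. The sketch below is only one possible interpretation in plain Python: it does not call any Kafka or Spark API, and `route_pending`, `buffered_offsets`, and the sample payloads are hypothetical names introduced here.

```python
def route_pending(pending, buffered_offsets):
    """Split pending records into those ready for Spark and those fed back.

    `pending` is a list of (offset, payload) pairs; `buffered_offsets` is the
    set of offsets assumed to be already buffered. Both are illustrative.
    """
    to_spark, back_to_kafka = [], []
    for offset, payload in pending:
        if offset in buffered_offsets:
            to_spark.append(payload)       # buffered: hand over for processing
        else:
            back_to_kafka.append(payload)  # not buffered: feed back to the buffer
    return to_spark, back_to_kafka


to_spark, back_to_kafka = route_pending(
    [(0, "log-a"), (1, "log-b"), (2, "log-c")],
    buffered_offsets={0, 2},
)
```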

Further, the IMSI library, the IMEI library, and the crawler library are imported into HDFS from a relational database by Sqoop.

Further, the real-time behavior data stream of the user includes real-time data corresponding to the user's access characteristics, search information, and traffic consumption on the mobile terminal.

Further, obtaining the wide behavior data table of the user includes:

based on different business logic, obtaining with the Map/Reduce framework the output values of the real-time data tables and historical behavior data tables of all input users, so as to form the wide behavior data table, wherein one IMSI number identifies one user.

Further, the structure of a table in the HBase database includes the combination of the IMSI number and a business number, and a column for storing the user's specific business information.

The beneficial effects of the invention are as follows:

(1) The invention stores the user's massive raw historical data on HDFS, which gives the raw data highly fault-tolerant, high-throughput, low-cost storage and supports streaming access to the data in the file system. The data collection tool Flume collects real-time user behavior data, including the user's real-time Internet access logs and real-time parsed data of the user's Internet behavior. Kafka records in real time the data collected by Flume and, as a message buffering component, provides reliable data support for the upper-layer real-time computing framework. The real-time data generated by user behavior is then processed in real time with the Spark in-memory computing framework. Associating the user's real-time data and historical data by IMSI number yields a unified wide table of user behavior data, which is stored in the distributed database HBase. This offers a feasible solution for storing massive user behavior data and relieves the pressure of storing customer data on a single machine in conventional methods.

(2) Based on the Hadoop platform, the invention distributes the task of building a fine-grained user behavior system across a cluster of low-specification computers, and integrates Impala with HBase to provide an efficient query engine for user behavior data, reducing query latency and executing much faster than native MapReduce and Hive.

(3) Compared with traditional single-dimension user data, the user behavior data generation method of the invention builds an efficient, fine-grained user behavior data business system with high scalability, effectively improving the operator's capacity for fine-grained operation.

Brief description of the drawings

To make the purpose, technical solution, and beneficial effects of the invention clearer, the following drawings are provided for illustration:

Fig. 1 is a flow chart of the Hadoop-based user behavior data generation method provided by the invention;

Fig. 2 is a schematic diagram of the design of the user historical behavior data table in the invention;

Fig. 3 is a schematic diagram of the model design of the user's real-time behavior data in the invention;

Fig. 4 is a schematic diagram of the design of the wide table of user behavior data in the invention;

Fig. 5 is a diagram of the HBase storage structure in the invention.

Detailed description

The preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.

Referring to Fig. 1, an embodiment of the application provides a user behavior data processing method based on Hadoop, the method including:

In this embodiment, generating the historical behavior data table of the user based on the user history data source includes:

In this embodiment, after the distributed subscription system Kafka records in real time the data collected by Flume, the method further includes:

In this embodiment, the IMSI library, the IMEI library, and the crawler library are imported into HDFS from a relational database by Sqoop.

In this embodiment, the real-time behavior data stream of the user includes real-time data corresponding to the user's access characteristics, search information, and traffic consumption on the mobile terminal.

In this embodiment, obtaining the wide behavior data table of the user includes:

In this embodiment, the structure of a table in the HBase database includes the combination of the IMSI number and a business number, and a column for storing the user's specific business information.

In this embodiment, Hadoop is an open-source project managed by the Apache organization that is now widely used. Hadoop has grown to include 10 sub-projects: Hadoop Common, HDFS, MapReduce, ZooKeeper, Avro, Chukwa, HBase, Hive, Mahout, and Pig. The Hadoop core consists of three subsystems: Hadoop Common, HDFS (Hadoop Distributed File System), and MapReduce. Hadoop Common provides basic supporting functions for the overall Hadoop architecture, mainly the file system, the remote procedure call protocol, and the data serialization library. HDFS is a distributed file system characterized by high fault tolerance and relatively low cost of use. MapReduce is a programming model and software framework, mainly used to write parallel programs that quickly process massive data on large computer clusters.

Spark is a distributed in-memory computing framework characterized by the ability to process large-scale data at high speed. When integrated with Hadoop, Spark needs a distributed file system to operate. It continues Hadoop's MapReduce computing model, but by contrast Spark keeps its computation in memory, reducing disk reads and writes, and can merge multiple operations before computing, which improves computing speed. Here Spark is deployed on the Hadoop cluster with HDFS as its data source; in essence it is a computing framework running on YARN, like MapReduce. Spark is built around RDDs, and core components such as Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR solve many big data problems; its well-developed framework is increasingly popular. Its ecosystem, including Zeppelin for visualization, is also growing. Unlike Hadoop, which spills to disk, Spark's reads and writes are memory-based and therefore fast. In addition, the wide and narrow dependencies of its DAG job scheduling further improve Spark's speed.

Sqoop is a tool for efficiently transferring data between relational databases and HDFS. It can import data from a relational database into Hadoop's HDFS and can also export HDFS data into a relational database.

Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transmitting massive logs. Flume supports customizing various data senders in the log system to collect data; it also provides the ability to perform simple processing on the data and write it to various (customizable) data receivers.

Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of a consumer-scale website. Kafka records in real time the data collected by the data collection tool Flume and, as a message buffering component, provides reliable data support for the upper-layer real-time computing framework.

HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system; with HBase, a large-scale structured storage cluster can be built on inexpensive PC servers. Unlike a general relational database, HBase is a database suited to storing unstructured data.

Impala is a real-time big data query and analysis tool whose development is led by Cloudera. Its queries run 3 to 90 times faster than the original MapReduce-based Hive SQL, and it is more flexible and easier to use. It provides SQL-like query statements and can query PB-scale big data stored in Hadoop's HDFS and HBase. Fast querying is its greatest advantage. As a real-time big data query and analysis tool, Impala features fast queries, high flexibility, easy integration, and strong scalability.

Taking the user's historical behavior data (regional attribute, package subscribed by the user) and the duration for which the user has used each App on the current day as an example, the technical solution provided by the invention comprises the following steps:

Step 1: Import the user history data sources into the distributed file system HDFS, which provides high-throughput data access. The data sources include the IMSI library, the IMEI library, and the crawler library, which are imported into HDFS from a relational database by Sqoop. The table format of the user history data sources is shown in Fig. 2.
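As a rough illustration of Step 1, the sketch below assembles `sqoop import` command lines in Python, one per source table. The options shown (`--connect`, `--username`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop import arguments, but the JDBC URL, user name, table names, and HDFS paths are hypothetical placeholders, and credential handling is omitted.

```python
def sqoop_import_args(jdbc_url, username, table, target_dir, mappers=1):
    """Build the argument list for one `sqoop import` invocation."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,            # JDBC URL of the relational database
        "--username", username,
        "--table", table,                 # source table to import
        "--target-dir", target_dir,       # destination directory on HDFS
        "--num-mappers", str(mappers),    # number of parallel map tasks
    ]


# Hypothetical table names standing in for the IMSI, IMEI, and crawler libraries.
source_tables = ("imsi_library", "imei_library", "crawler_library")
commands = [
    sqoop_import_args(
        "jdbc:mysql://db.example.com/userdata",  # placeholder URL
        "etl",                                   # placeholder user
        table,
        f"/data/history/{table}",                # placeholder HDFS path
    )
    for table in source_tables
]
```

Each list in `commands` could be passed to `subprocess.run` on a host where Sqoop is installed; building the arguments as a list avoids shell quoting problems.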

Step 2: Collect the real-time data of user behavior through the data collection tool Flume, taking as the example the duration for which the user has used each App on the current day. Kafka records in real time the data collected by Flume and, as a message buffering component, provides reliable data support for the upper-layer real-time computing framework.

Step 3: Following the example of Step 2, the Spark real-time data processing tool processes the user's App usage durations, so that the usage duration of each App the user has used on the current day is calculated and output in real time. Similarly, once a full day, week, or month has been accumulated, the corresponding data can be output. The real-time data structure is shown in Fig. 3.
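The aggregation described in Step 3 amounts to summing usage time per user, day, and App. The following plain-Python sketch stands in for the Spark job; the event fields (`imsi`, `day`, `app`, `seconds`) and the sample values are assumptions made for illustration, not the structure shown in Fig. 3.

```python
from collections import defaultdict


def app_usage_by_day(events):
    """Sum usage seconds per (IMSI, day, App) key."""
    totals = defaultdict(int)
    for event in events:
        key = (event["imsi"], event["day"], event["app"])
        totals[key] += event["seconds"]
    return dict(totals)


# Illustrative events only.
events = [
    {"imsi": "460000000000001", "day": "2017-03-27", "app": "video", "seconds": 120},
    {"imsi": "460000000000001", "day": "2017-03-27", "app": "video", "seconds": 60},
    {"imsi": "460000000000001", "day": "2017-03-27", "app": "chat", "seconds": 30},
]
usage = app_usage_by_day(events)
```

Weekly or monthly totals follow the same pattern with a coarser time key in place of `day`.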

Step 4: According to the business logic, which in this example covers the package subscribed by the user and the App usage duration, the Map/Reduce framework is used to obtain the output values of all input users (one IMSI represents one user) and form the user behavior table. The table format is shown in Fig. 4.
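Step 4 can be sketched as a map phase that emits partial rows keyed by IMSI and a reduce phase that merges them into one wide row per user. This is a single-process illustration of the Map/Reduce pattern, not a Hadoop job; the column names `package` and `app_seconds:<app>` and the sample tables are hypothetical.

```python
from collections import defaultdict


def map_phase(history_table, realtime_table):
    """Emit (IMSI, partial row) pairs from both input tables."""
    for imsi, row in history_table.items():
        yield imsi, {"package": row["package"]}
    for (imsi, app), seconds in realtime_table.items():
        yield imsi, {f"app_seconds:{app}": seconds}


def reduce_phase(pairs):
    """Merge all partial rows that share an IMSI into one wide row."""
    wide = defaultdict(dict)
    for imsi, partial in pairs:
        wide[imsi].update(partial)
    return dict(wide)


# Illustrative inputs only.
history_table = {"460000000000001": {"region": "Chongqing", "package": "4G-basic"}}
realtime_table = {
    ("460000000000001", "video"): 180,
    ("460000000000001", "chat"): 30,
}
wide_table = reduce_phase(map_phase(history_table, realtime_table))
```

Because every emitted pair is keyed by IMSI, the reduce step naturally produces one wide row per user.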

Step 5: According to the configuration file, the user behavior data is saved to HBase, and Impala is integrated with HBase to provide the query entry for user behavior data. Compared with the execution speed of native MapReduce and Hive, this greatly increases the speed of statistical analysis on the wide table of user behavior data. The storage structure in HBase is shown in Fig. 5: the RowKey is the IMSI plus the business number, and the column family contains a column Data:Label that stores the user's specific business information.
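The storage layout of Step 5 can be illustrated by turning wide-table rows into (row key, column, value) cells. The sketch assumes the row key is the IMSI directly concatenated with a two-digit business number; that encoding, the `business_codes` mapping, and the function name are hypothetical, and no HBase client is used.

```python
def to_hbase_cells(wide_table, business_codes):
    """Turn wide-table rows into (row_key, column, value) cells.

    Row key = IMSI + business number (direct concatenation is an assumption).
    Every value is stored in the single column Data:Label.
    """
    cells = []
    for imsi, row in wide_table.items():
        for field, value in row.items():
            row_key = f"{imsi}{business_codes[field]}"
            cells.append((row_key, "Data:Label", str(value)))
    return cells


# Illustrative inputs only.
wide_table = {"460000000000001": {"package": "4G-basic", "app_seconds:video": 180}}
business_codes = {"package": "01", "app_seconds:video": "02"}
cells = to_hbase_cells(wide_table, business_codes)
```

Prefixing the key with the IMSI keeps all rows of one user adjacent, so a prefix scan on the IMSI would return every business record for that user.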

Finally, it should be noted that the above preferred embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail through the above preferred embodiments, those skilled in the art should understand that various changes in form and detail may be made without departing from the scope defined by the claims of the invention.

Claims

1. A user behavior data processing method based on Hadoop, characterized in that the method includes:

importing a user history data source into the distributed file system HDFS, so that a data access interface is provided through HDFS, wherein the user history data source includes at least one of an international mobile subscriber identity (IMSI) library, an international mobile equipment identity (IMEI) library, and a crawler library;

collecting the real-time behavior data stream of the user through the data collection tool Flume, the real-time behavior data stream including the user's real-time Internet access logs and real-time parsed data of the user's Internet behavior;

recording in real time, with the distributed subscription system Kafka, the data collected by Flume, Kafka acting as a message buffering component that provides data to the real-time computing framework;

processing in real time, with the real-time computing framework Spark and according to the business type of the real-time behavior data stream, the real-time data generated by user behavior, so as to generate a real-time data table of the user;

associating the real-time data table and the historical behavior data table of the user by the IMSI number in the IMSI library to obtain a wide behavior data table of the user;

2. The method according to claim 1, characterized in that generating the historical behavior data table of the user based on the user history data source includes:

associating all historical behavior data of the user by the IMSI number in the IMSI library, and mapping all historical behavior data of the user into the data warehouse tool Hive to form the historical behavior data table of the user.

3. The method according to claim 1, characterized in that, after the distributed subscription system Kafka records in real time the data collected by Flume, the method further includes:

determining whether the data to be processed has been buffered in the Kafka configuration file; if so, sending the data to be processed to the real-time computing framework Spark; if not, feeding the data to be processed back to the distributed subscription system Kafka.

4. The method according to claim 1, characterized in that the IMSI library, the IMEI library, and the crawler library are imported into HDFS from a relational database by Sqoop.

5. The method according to claim 1, characterized in that the real-time behavior data stream of the user includes real-time data corresponding to the user's access characteristics, search information, and traffic consumption on the mobile terminal.

6. The method according to claim 1, characterized in that obtaining the wide behavior data table of the user includes:

based on different business logic, obtaining with the Map/Reduce framework the output values of the real-time data tables and historical behavior data tables of all input users, so as to form the wide behavior data table, wherein one IMSI number identifies one user.

7. The method according to claim 1, characterized in that the structure of a table in the HBase database includes the combination of the IMSI number and a business number, and a column for storing the user's specific business information.