CN106570153A

CN106570153A - Data extraction method and system for mass URLs

Info

Publication number: CN106570153A
Application number: CN201610970427.3A
Authority: CN
Inventors: 欧阳涛
Original assignee: Shanghai Feixun Data Communication Technology Co Ltd
Current assignee: Shanghai Feixun Data Communication Technology Co Ltd
Priority date: 2016-10-28
Filing date: 2016-10-28
Publication date: 2017-04-19

Abstract

The invention discloses a data extraction method for mass URLs. The method comprises the following steps: S10, collecting each piece of text data into a local file pool by using a distributed web server framework; S20, uploading the total text data accumulated in the local file pool to the cloud distributed file system hdfs1 of hadoop; and S40, extracting the keywords of the URLs in a distributed manner from the total text data in the cloud distributed file system by using hive, the data warehouse tool of hadoop. In a big data application scenario, each piece of text data is first collected into the local file pool, the total text data is then uploaded to the cloud distributed file system, and hive performs distributed computation to carry out the distributed extraction; the method has the advantages of high efficiency and low resource consumption.

Description

Data extraction method and system for mass URLs

Technical field

The invention belongs to the technical field of data extraction, and in particular relates to a data extraction method and system for mass URLs.

Background art

With the rapid development of the Internet, analyzing the patterns and personalized habits that users show when using Internet resources (that is, user behavior analysis) makes it possible to extract and identify users' interests. On the one hand, content can be personalized and pushed to users, providing website visitors with more proactive, intelligent services. On the other hand, the interests and preferences found from users' different behaviors can be used to optimize the relationships between pages and improve the website architecture, which reduces the burden of finding information, simplifies operation, and saves time and effort.

At present, large websites typically have a huge number of online users, who produce a huge amount of real-time behavior and context information. To analyze user behavior and feed the results back to users in time, the system therefore needs high storage capacity and computing speed.

In the prior art, most user behavior analysis systems directly use local relational database technology and traditional data extraction methods. However, as data volume grows massively, these methods consume large amounts of resources and memory and are inefficient, so they cannot adequately support efficient analysis of mass data.

Summary of the invention

The technical solution provided by the present invention is as follows:

The present invention provides a data extraction method for mass URLs, comprising the following steps: S10, collecting each piece of text data into a local file pool by using a distributed web server framework; S20, uploading the total text data accumulated in the local file pool to the cloud distributed file system hdfs1 of hadoop; S40, extracting the keywords of the URLs in a distributed manner from the total text data in the cloud distributed file system hdfs1 by using hive, the data warehouse tool of hadoop.

Further, the method also comprises the following steps: S30, the data warehouse tool hive sends a computation request to the open-source computing framework TEZ; S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data, which is stored in the database of the cloud distributed file system hdfs1.

Further, step S40 comprises: S41, extracting the keywords of the URLs from the compressed text data by using a UDF (user-defined function) of the data warehouse tool hive, and outputting the types and frequencies of the data accessed by users.

Further, step S20 comprises: S21, extracting the text data in the local file pool; S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1; S23, uploading the total text data to the cloud distributed file system hdfs1 by using the local distributed file system hdfs2.

Further, step S21 comprises: S211, extracting the router MAC and the timestamp from the file name of the text data; S212, identifying whether the router MAC and the timestamp contain garbled characters; S213, when the router MAC and the timestamp contain garbled characters, cleaning the garbled characters and then jumping to step S22; otherwise, jumping directly to step S22.

Further, before step S10 the method also comprises: S01, building a Hadoop cluster environment, and configuring the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; S02, building a distributed web server cluster on each node of the cluster environment, and adding load balancing; S03, associating the tables created across the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2, and reworking the UDF functions of the data warehouse tool hive.

Further, step S01 comprises: S011, building on Hadoop a first preset number of master nodes and a second preset number of slave nodes; the master nodes are connected to one another, and each master node is connected to every slave node.

Further, step S01 also comprises: S012, deploying on each master node the metadata service component metastore and the relational database mysql.

The present invention also provides a data extraction system for mass URLs, comprising: a distributed web server framework, which collects each piece of text data into a local file pool; a local distributed file system hdfs2, which uploads the total text data accumulated in the local file pool to the cloud distributed file system hdfs1 of hadoop; and the data warehouse tool hive of hadoop, which extracts the keywords of the URLs in a distributed manner from the total text data in the cloud distributed file system hdfs1.

Further, the system also comprises: an open-source computing framework TEZ, to which the data warehouse tool hive sends a computation request; the open-source computing framework TEZ compresses and encodes the total text data into compressed text data, which is stored in the database of the cloud distributed file system hdfs1.

Compared with the prior art, the data extraction method and system for mass URLs provided by the present invention have the following beneficial effects:

1) In a big data application scenario, each piece of text data is first collected into the local file pool, the total text data is then uploaded to the cloud distributed file system, and hive performs distributed computation to carry out the distributed extraction; this gives high efficiency and low resource consumption.

2) The total text data is compressed and encoded. The compressed total text data occupies less space, which effectively addresses the resource consumption and memory problems of a local relational database; the encoded total text data can be extracted in an orderly manner, so that extraction runs smoothly.

3) The text data is accumulated and merged into total text data according to the block size of the cloud distributed file system before the total text data is uploaded; this prevents the total text data from being too large and congesting the distributed file system.

4) The router MAC and the timestamp are extracted from the file name of the text data, and any garbled characters encountered are cleaned; this safeguards smooth data extraction.

Description of the drawings

The above characteristics, technical features, advantages, and implementations of the data extraction method and system for mass URLs are further described below in a clear and easily understood manner, through preferred embodiments and with reference to the drawings.

Fig. 1 is a schematic flow chart of a data extraction method for mass URLs according to the present invention;

Fig. 2 is a schematic flow chart of another data extraction method for mass URLs according to the present invention;

Fig. 3 is a schematic flow chart of step S20 in the present invention;

Fig. 4 is a partial schematic flow chart of the data extraction method for mass URLs in the present invention;

Fig. 5 is a partial schematic flow chart of step S01 in the present invention;

Fig. 6 is a schematic structural diagram of a data extraction system for mass URLs according to the present invention;

Fig. 7 is a schematic diagram of yet another data extraction method for mass URLs according to the present invention;

Fig. 8 is a schematic diagram of the component structure of yet another data extraction method for mass URLs according to the present invention.

Detailed description of the embodiments

In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, specific embodiments of the present invention are described below with reference to the drawings. Obviously, the drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings, and other embodiments, from these drawings without creative effort.

For simplicity, each figure only schematically shows the parts related to the present invention; they do not represent the actual structure of a product. In addition, to keep the figures simple and easy to understand, where several components in a figure have the same structure or function, only one of them is drawn or labeled. Herein, "one" does not only mean "only one"; it can also mean "more than one".

As shown in Fig. 1, according to one embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S10, collecting text sub-data with Flume, aggregating it into text data, and transmitting the text data; then collecting each piece of text data into a local file pool by using a distributed web server framework.

S20, uploading the total text data accumulated in the local file pool to the cloud distributed file system hdfs1 of hadoop; Hadoop is a distributed system infrastructure.

S40, extracting the keywords of the URLs in a distributed manner from the total text data in the cloud distributed file system hdfs1 by using hive, the data warehouse tool of hadoop.

Specifically, Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transmitting massive logs. Flume supports customizing various data senders in a logging system to collect data; it also provides the ability to perform simple processing on the data and write it to various (customizable) data receivers.

Hive is a data warehouse tool based on Hadoop. It can map structured data files to database tables and provides SQL query functionality, converting SQL statements into MapReduce jobs for execution. Hive stores its data in Hadoop's HDFS (the Hadoop distributed file system), and its computation model is MapReduce.

As shown in Fig. 2 according to another embodiment of the invention, a kind of data extraction method of magnanimity URL, including it is following Step：S10, using Flume collection text subdata after, aggregate into text data, and text data be transmitted；Recycle Each text data is collected local file pond by distribution Web server framework respectively.

S30, the data warehouse tool hive sends a computation request to the open-source computing framework TEZ;

S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data, which is stored in the database of the cloud distributed file system hdfs1; the compression is ORC compression, and the file storage format is ORC.

S41, extracting the keywords of the URLs from the compressed text data by using a UDF of the data warehouse tool hive, and outputting the types and frequencies of the data accessed by users. For example: 8CAB8EC6CC40 186222539** {"mobile phone": 3, "health": 27}; 8CAB8EB144A8 138350092** {"automobile": 11, "picture": 127, "music": 26, "recruitment": 1, "mobile phone": 13, "health": 2907, "life": 8, "video": 4, "shopping": 7, "social": 84, "live": 1}; 8CAB8EC00880 136605272** {"music": 1, "mobile phone": 4, "health": 54, "video": 2, "shopping": 1, "social": 4}.
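The per-user category counts in the example output can be illustrated with a small Python sketch. This is not the patent's Hive UDF: the keyword table `CATEGORY_BY_KEYWORD` and the functions `categorize` and `access_profile` are hypothetical names introduced here, and matching a keyword against the URL host is only one assumed way of deriving a category.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical keyword-to-category table; the patent does not list one.
CATEGORY_BY_KEYWORD = {
    "music": "music",
    "video": "video",
    "shop": "shopping",
    "health": "health",
}


def categorize(url):
    """Return the category of the first known keyword found in the host, else None."""
    host = urlparse(url).netloc.lower()
    for keyword, category in CATEGORY_BY_KEYWORD.items():
        if keyword in host:
            return category
    return None


def access_profile(urls):
    """Count how often each category appears across one user's URLs."""
    counts = Counter()
    for url in urls:
        category = categorize(url)
        if category is not None:
            counts[category] += 1
    return dict(counts)
```

In the patented method the equivalent logic would run inside hive as a UDF over the ORC data, so that the counting is distributed across the cluster rather than done in a single process.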

Specifically, Tez is Apache's open-source computing framework that supports DAG jobs. Tez is not aimed directly at end users; rather, it lets developers build faster, more scalable applications for end users. The goal of the Tez project is to be highly customizable, so that it can meet the needs of a wide range of use cases and let people complete their work without resorting to other external tools. If projects such as Hive and Pig use Tez rather than MapReduce as the backbone of their data processing, their response times improve significantly. Tez is built on YARN, the resource management framework used by Hadoop.

Hive file storage formats:

1. textfile: the default format. Storage mode: row storage. Disk overhead and data parsing overhead are both high; hive cannot merge or split compressed text files.

2. sequencefile: a binary file in which <key, value> pairs are serialized. Storage mode: row storage; splittable and compressible. Block compression is usually chosen. Its advantage is compatibility with mapfile in the Hadoop API.

3. rcfile: storage mode: data is partitioned into blocks by row, and each block is stored by column. Compression is fast and column access is fast. Reading a record touches as few blocks as possible; reading the required columns only requires reading the header definition of each row group. For reading the full data set, its performance may have no clear advantage over sequencefile.

4. orc: storage mode: data is partitioned into blocks by row, and each block is stored by column. Compression is fast and column access is fast. It is more efficient than rcfile and is an improved version of rcfile.

5. Custom format: users can define their own input and output formats by implementing inputformat and outputformat.

Among these, textfile consumes a relatively large amount of storage space, and compressed text files cannot be split or merged; its query efficiency is the lowest, but data can be stored directly, so loading is the fastest. sequencefile consumes the most storage space; compressed files can be split and merged, query efficiency is high, and data must be loaded by converting from text files. rcfile uses the least storage space and has the highest query efficiency; data must be loaded by converting from text files, and loading is the slowest.

As shown in Fig. 2 and Fig. 3, according to yet another embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S10, collecting text sub-data with Flume, aggregating it into text data, and transmitting the text data; then collecting each piece of text data into a local file pool by using a distributed web server framework.

S21, extracting the text data in the local file pool;

Step S21 comprises: S211, extracting the router MAC and the timestamp from the file name of the text data;

S212, identifying whether the router MAC and the timestamp contain garbled characters;

S213, when the router MAC and the timestamp contain garbled characters, cleaning the garbled characters using Python and then jumping to step S22; otherwise, jumping directly to step S22. Python is an object-oriented, interpreted programming language.
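The file-name check in S211 to S213 can be sketched in Python as follows. The patent does not give the file-name layout, so the `<MAC>_<timestamp>.txt` pattern, the 12-hex-digit MAC form, the 10-digit epoch timestamp, the treatment of non-printable or non-ASCII characters as "garbled", and the function name `clean_filename` are all assumptions made for illustration.

```python
import re

# Assumed layout: 12 hex digits (router MAC), underscore, 10-digit epoch seconds.
MAC_RE = re.compile(r"[0-9A-Fa-f]{12}")
TS_RE = re.compile(r"\d{10}")


def clean_filename(name):
    """Return (mac, timestamp) from a file name, or None if unrecoverable.

    Garbled characters are assumed to be anything outside printable ASCII;
    they are stripped before the MAC and timestamp are matched.
    """
    stem = name.rsplit(".", 1)[0]
    # Drop every character outside printable ASCII (the assumed garbled bytes).
    stem = "".join(ch for ch in stem if 32 < ord(ch) < 127)
    parts = stem.split("_")
    if len(parts) != 2:
        return None
    mac, ts = parts
    if MAC_RE.fullmatch(mac) and TS_RE.fullmatch(ts):
        return mac.upper(), int(ts)
    return None
```

A file whose name still fails validation after stripping returns `None`, so the caller can decide whether to skip it or log it before moving on to the merge in S22.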

S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1;
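One way to picture the cumulative merge in S22 is to group files into batches whose combined size stays within one block. The sketch below assumes a 128 MB block (the default `dfs.blocksize` in Hadoop 2.x, which a cluster may override) and a greedy grouping rule; the function name `batch_by_block_size` and the rule itself are illustrative assumptions, not the patent's stated procedure.

```python
def batch_by_block_size(files, block_size=128 * 1024 * 1024):
    """Group (name, size_in_bytes) pairs into batches of at most one block.

    A batch is closed as soon as adding the next file would exceed
    block_size; a single file larger than the block gets its own batch.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > block_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch would then be concatenated into one file of total text data before upload, so that many small files do not each occupy their own block entry on the cloud file system.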

S23, uploading the total text data to the cloud distributed file system hdfs1 of hadoop by using the local distributed file system hdfs2; Hadoop is a distributed system infrastructure.

Specifically, hadoop and mapreduce are the foundation of the hive architecture. The hive architecture includes the following components: CLI (command line interface), JDBC/ODBC, Thrift Server, WEB GUI, metastore, and Driver (Compiler, Optimizer, and Executor). These components fall into two categories: server-side components and client components.

Server-side components: 1. Driver component: this component includes the Compiler, Optimizer, and Executor. It parses, compiles, and optimizes the HiveQL (SQL-like) statements that are written, generates an execution plan, and then calls the underlying mapreduce computing framework.

2. Metastore component: the metadata service component, which stores the metadata of hive. The metadata is stored in a relational database; the relational databases hive supports include derby and mysql. Because metadata is so important to hive, hive allows the metastore service to be separated out and installed on a remote server cluster, which decouples the hive service from the metastore service and ensures the robustness of hive.

3. Thrift service: thrift is a software framework developed by facebook for building scalable, cross-language services. hive integrates this service so that different programming languages can call hive's interfaces.

Client components: 1. CLI: command line interface.

2. Thrift client: the thrift client is not shown in the architecture diagram above, but many client interfaces of the hive architecture are built on the thrift client, including the JDBC and ODBC interfaces.

3. WEB GUI: the hive client provides a way to access the services provided by hive through a web page. This interface corresponds to hive's hwi component (hive web interface); the hwi service must be started before use.

As shown in Fig. 2, Fig. 3, and Fig. 4, according to yet another embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S01, building a Hadoop cluster environment, and configuring the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; Namenode HA and ResourceManager HA are configured.

S02, building a distributed web server cluster on each node of the cluster environment, and adding load balancing. Load balancing is built on the existing network infrastructure; it provides a cheap, effective, and transparent way to extend the bandwidth of network devices and servers, increase throughput, strengthen network data processing capability, and improve the flexibility and availability of the network.

S03, associating the tables created across the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2, and reworking the UDF functions of the data warehouse tool hive.

S10, collecting text sub-data with Flume, aggregating it into text data, and transmitting the text data; then collecting each piece of text data into a local file pool by using a distributed web server framework.

S21, extracting the text data in the local file pool;

S212, identifying whether the router MAC and the timestamp contain garbled characters;

Specifically, load balancing is abbreviated SLB in English, and its main algorithms are as follows. Weighted round robin (WRR) algorithm: each server is assigned a weight that represents its capacity to handle connections relative to the other servers. A weight of n means that the SLB assigns n new connections to this server before assigning traffic to the next server.
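The WRR rule described above (a server with weight n receives n new connections before the next server takes over) can be sketched in Python. The function name `weighted_round_robin` and the server names in the usage note are hypothetical; real load balancers often interleave servers more smoothly than this literal reading does.

```python
def weighted_round_robin(weights, total):
    """Return the server chosen for each of `total` new connections.

    weights maps server name -> weight n; each server receives n
    consecutive connections before the next server takes over.
    """
    order = [name for name, n in weights.items() for _ in range(n)]
    if not order:
        raise ValueError("at least one server needs a positive weight")
    return [order[i % len(order)] for i in range(total)]
```

With weights `{"web1": 2, "web2": 1}`, the first three connections go to web1, web1, web2, and the cycle then repeats.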

Weighted least connections (WLC) algorithm: the SLB assigns a new connection to the real server with the fewest active connections. Each real server is assigned a weight m, and a server's capacity for handling active connections equals m divided by the sum of all server weights. The SLB assigns a new connection to the real server whose number of active connections is furthest below its capacity.
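A minimal Python sketch of the WLC choice follows. It assumes that "furthest below its capacity" can be approximated by the lowest ratio of active connections to weight, which is a common formulation but not one the patent spells out; the function name `pick_least_loaded` and the dictionaries it takes are hypothetical.

```python
def pick_least_loaded(active, weights):
    """Return the server with the lowest active-connections-to-weight ratio.

    active maps server name -> current active connections (missing = 0);
    weights maps server name -> weight m; servers with weight 0 are skipped.
    """
    candidates = [s for s in weights if weights[s] > 0]
    if not candidates:
        raise ValueError("no server with a positive weight")
    return min(candidates, key=lambda s: active.get(s, 0) / weights[s])
```

Because every weight is divided by the same total, comparing `active / weight` gives the same ranking as comparing each server's load against its share `m / sum(weights)`.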

When the weighted least connections (WLC) algorithm is used, the SLB uses a slow-start mechanism to control access to newly added real servers. "Slow start" limits the rate at which new connections are established and allows it to increase gradually, which prevents the server from being overloaded.

Namenode HA is configured as follows:

1.1 Extract hadoop-2.3.0-cdh5.0.0.tar.gz to /opt/boh/, rename it to hadoop, and modify etc/hadoop/core-site.xml.

1.2 Modify hdfs-site.xml.

1.3 Edit /etc/hadoop/slaves; add hadoop3 and hadoop4.

1.4 Edit /etc/profile; add HADOOP_HOME=/opt/boh/hadoop and PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH; copy the above configuration to all nodes.

1.5 Start the services:

1.5.1 Start journalnode: on hadoop0, hadoop1, and hadoop2, run sbin/hadoop-daemon.sh start journalnode;

1.5.2 Format zookeeper: on hadoop1, run bin/hdfs zkfc -formatZK;

1.5.3 Format and start the hadoop1 node: bin/hdfs namenode -format; sbin/hadoop-daemon.sh start namenode;

1.5.4 Format and start the hadoop2 node: bin/hdfs namenode -bootstrapStandby; sbin/hadoop-daemon.sh start namenode;

1.5.5 Start the zkfc service on hadoop1 and hadoop2: sbin/hadoop-daemon.sh start zkfc; at this point one of hadoop1 and hadoop2 becomes active;

1.5.6 Start datanode: on hadoop1, run sbin/hadoop-daemons.sh start datanode;

1.5.7 Verify success: open a browser and access hadoop1:50070 and hadoop2:50070; one of the two namenodes is active and the other is standby. Then kill the active namenode process; the standby namenode automatically switches to the active state.

ResourceManager HA is configured as follows:

2.1 Modify mapred-site.xml.

2.2 Modify yarn-site.xml.

2.3 Distribute the configuration files to every node.

2.4 Modify yarn-site.xml on hadoop2.

2.5 Create directories and grant permissions:

2.5.1 Create the local directories;

2.5.2 After starting hdfs, run the following commands: create the log directory; create /tmp on hdfs. If /tmp is not created with the specified permissions, other CDH components will have problems; in particular, if it is not created, other processes will automatically create this directory with strict permissions, which affects whether other programs can use it. hadoop fs -mkdir /tmp; hadoop fs -chmod -R 777 /tmp.

2.6 Start yarn and the jobhistory server:

2.6.1 Start on hadoop1: sbin/start-yarn.sh; this script starts the resourcemanager on hadoop1 and all nodemanagers.

2.6.2 Start the resourcemanager on hadoop2: yarn-daemon.sh start resourcemanager;

2.6.3 Start the jobhistory server on hadoop2: sbin/mr-jobhistory-daemon.sh start historyserver.

2.7 Verify that the configuration succeeded: open a browser and access hadoop1:23188 or hadoop2:23188.

As shown in Fig. 2 to Fig. 5, according to yet another embodiment of the present invention, a data extraction method for mass URLs comprises the following steps: S01, building a Hadoop cluster environment, and configuring the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; Namenode HA and ResourceManager HA are configured.

Step S01 comprises: S011, building on Hadoop a first preset number of master nodes (for example, 4) and a second preset number of slave nodes (for example, 7); the master nodes are connected to one another, and each master node is connected to every slave node.

S012, deploying on each master node the metadata service component metastore, the relational database mysql, and HiveServer2. Through HiveServer2, clients can operate on data in Hive without starting the CLI; it allows remote clients to submit requests to hive and retrieve results using various programming languages such as java and python. HiveServer2 is based on Thrift; it supports concurrency and authentication for multiple clients and provides better support for open API clients such as JDBC and ODBC.

S21, extracting the text data in the local file pool;

S212, identifying whether the router MAC and the timestamp contain garbled characters;

Specifically, the Master/Slave concept is equivalent to that of server and agent. The master provides a web interface through which users manage jobs and slaves; a job can run on the master itself or be assigned to a slave. One master can be associated with multiple slaves to serve different jobs, or different configurations of the same job.

The metastore component of Hive is where hive's metadata is stored centrally. It has two parts: the metastore service and the backend data store. The backend store is a relational database, such as derby, the embedded disk database hive uses by default, or mysql. The metastore service is built on top of the backend store and interacts with the hive service. By default, the metastore service and the hive service are installed together and run in the same process. The metastore service can also be split out from the hive service and installed independently in a cluster, with hive calling it remotely; the metadata layer can then be placed behind a firewall, and a client that accesses the hive service can still reach the metadata layer, which provides better manageability and security. With a remote metastore service, the metastore service and the hive service run in different processes, which also ensures the stability of hive and improves the efficiency of the hive service.

As shown in Fig. 6, according to one embodiment of the present invention, a data extraction system for mass URLs comprises: Hadoop, on which the Hadoop cluster environment is built and the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2 are configured; Namenode HA and ResourceManager HA are configured.

Preferably, a first preset number of master nodes (for example, 4) and a second preset number of slave nodes (for example, 7) are built on Hadoop; the master nodes are connected to one another, and each master node is connected to every slave node.

The metadata service component metastore, the relational database mysql, and HiveServer2 are deployed on each master node. Through HiveServer2, clients can operate on data in Hive without starting the CLI; it allows remote clients to submit requests to hive and retrieve results using various programming languages such as java and python. HiveServer2 is based on Thrift; it supports concurrency and authentication for multiple clients and provides better support for open API clients such as JDBC and ODBC.

A distributed web server cluster is built on each node of the cluster environment, and load balancing is added. Load balancing is built on the existing network infrastructure; it provides a cheap, effective, and transparent way to extend the bandwidth of network devices and servers, increase throughput, strengthen network data processing capability, and improve the flexibility and availability of the network.

The tables created across the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2 are associated, and the UDF functions of the data warehouse tool hive are reworked.

The distributed web server framework: text sub-data is collected with Flume, aggregated into text data, and transmitted; the distributed web server framework then collects each piece of text data into the local file pool.

The distributed web server framework extracts the text data in the local file pool; extracts the router MAC and the timestamp from the file name of the text data; identifies whether the router MAC and the timestamp contain garbled characters; and, when they do, cleans the garbled characters using Python. Python is an object-oriented, interpreted programming language.

The distributed web server framework accumulates and merges the text data into total text data according to the block size of the cloud distributed file system hdfs1.

The local distributed file system hdfs2 uploads the total text data accumulated in the local file pool to the cloud distributed file system hdfs1 of hadoop; Hadoop is a distributed system infrastructure.

The data warehouse tool hive sends a computation request to the open-source computing framework TEZ. The open-source computing framework TEZ compresses and encodes the total text data into compressed text data, which is stored in the database of the cloud distributed file system hdfs1; the compression is ORC compression, and the file storage format is ORC.

The data warehouse tool hive extracts the keywords of the URLs from the compressed text data by using its UDF functions, and outputs the types and frequencies of the data accessed by users. For example: 8CAB8EC6CC40 186222539** {"mobile phone": 3, "health": 27}; 8CAB8EB144A8 138350092** {"automobile": 11, "picture": 127, "music": 26, "recruitment": 1, "mobile phone": 13, "health": 2907, "life": 8, "video": 4, "shopping": 7, "social": 84, "live": 1}; 8CAB8EC00880 136605272** {"music": 1, "mobile phone": 4, "health": 54, "video": 2, "shopping": 1, "social": 4}.

As shown in Fig. 7 and Fig. 8, according to yet another embodiment of the present invention, a data extraction method for mass URLs comprises: building a Hadoop 2.7.1 cluster environment (4 masters and 7 slaves deployed), and configuring the environment and settings for HIVE, HDFS, and so on (Hive Metastore, mysql, hiveserver2, etc. are deployed on one master); setting up Namenode HA and ResourceManager HA so that the distributed system meets high availability requirements; building a tomcat distributed cluster on each node and adding load balancing; associating the tables created in hive and hdfs, developing the corresponding hive UDF functions, and testing that the extraction function works normally.

Text data is collected into the local file pool through the distributed web server framework; the files in the file pool are cleaned, extracted and merged, with cumulative merging performed according to the HDFS block size; the merged data is uploaded efficiently using the local HDFS; the self-developed Hive UDF getNUM is then used mainly to extract the URL keywords from the data, completing efficient extraction from mass URL data.
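The cumulative merging according to the HDFS block size can be pictured with the following minimal Python sketch. It is not the implementation of the invention: the function name `batch_by_block_size`, the greedy grouping strategy and the 128 MB figure (the default `dfs.blocksize` in Hadoop 2.x) are assumptions made for illustration.

```python
# Assumed block size: 128 MB, the default dfs.blocksize in Hadoop 2.x.
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def batch_by_block_size(files, block_size=DEFAULT_BLOCK_SIZE):
    """Group (name, size_in_bytes) pairs into batches of at most block_size bytes.

    Files are taken in order; a file larger than block_size ends up
    in a batch of its own.
    """
    batches = []
    current, current_size = [], 0
    for name, size in files:
        # Close the current batch if adding this file would exceed the limit.
        if current and current_size + size > block_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Small sizes and a small limit keep the example easy to follow.
pool = [("a.txt", 60), ("b.txt", 50), ("c.txt", 30), ("d.txt", 200), ("e.txt", 10)]
batches = batch_by_block_size(pool, block_size=128)
print(batches)
```

Each batch would then be concatenated into one file before upload, so that many small files do not each occupy a separate HDFS block.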

The programs for cleaning, merging, uploading, high-compression encoding and distributed extraction are run; the results are output and analyzed further in depth. Through the hive UDF and the distributed computation of the Hadoop cluster, the extraction computation over mass data is completed and the types and frequencies of the data accessed by users are obtained, so that users' online characteristics are obtained quickly, providing a basis for product recommendations and services for users.
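The kind of computation that yields access types and frequencies can be sketched on a single machine as follows. This Python sketch does not reproduce the hive UDF of the invention, whose internals are not given here; the keyword table `CATEGORY_KEYWORDS`, the host-name matching rule and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical keyword-to-category table; the real table is not disclosed here.
CATEGORY_KEYWORDS = {
    "music": "music",
    "video": "video",
    "shop": "shopping",
    "health": "health",
}


def categorize(url):
    """Return the category of the first table keyword found in the URL host, or None."""
    host = urlparse(url).netloc.lower()
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in host:
            return category
    return None


def count_categories(urls):
    """Count how often each category appears in one user's list of URLs."""
    counts = Counter()
    for url in urls:
        category = categorize(url)
        if category is not None:
            counts[category] += 1
    return dict(counts)


urls = [
    "http://music.example.com/track/1",
    "http://video.example.com/watch?v=2",
    "http://music.example.com/track/3",
    "http://news.example.org/",
]
print(count_categories(urls))
```

In the method described here, the equivalent logic runs inside hive across the Hadoop cluster rather than in a single Python process.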

hive: an open-source Apache technology; data warehouse software that provides querying and management of large data sets held in distributed storage, and is itself built on Apache Hadoop. Hive SQL denotes a SQL-like language with traditional MapReduce at its core.

The present invention mainly combines adaptively developed hive UDFs, Python-based cleaning, merging and uploading, and hive's ORC compression, thereby defining a high-performance data extraction method.
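The Python-based cleaning step checks the router MAC and timestamp carried in each file name for garbled characters. The following is a minimal sketch under stated assumptions: the file-name layout `<MAC>_<timestamp>.txt`, the regular expression and the function names are all hypothetical, since the actual naming scheme is not specified in this section.

```python
import re

# Hypothetical file-name layout: <12 hex digit MAC>_<14 digit timestamp>.txt
NAME_PATTERN = re.compile(r"^([0-9A-F]{12})_(\d{14})\.txt$")


def clean_name(filename):
    """Drop characters outside [0-9A-Za-z_.]; upper-case the stem, lower-case the extension."""
    stripped = re.sub(r"[^0-9A-Za-z_.]", "", filename)
    stem, dot, ext = stripped.rpartition(".")
    return stem.upper() + dot + ext.lower()


def extract_mac_and_timestamp(filename):
    """Return (mac, timestamp), cleaning the name first if it does not match.

    Returns None when the name cannot be repaired.
    """
    match = NAME_PATTERN.match(filename)
    if match is None:
        # Garbled characters detected: clean the name, then try again.
        match = NAME_PATTERN.match(clean_name(filename))
    if match is None:
        return None
    return match.group(1), match.group(2)


good = extract_mac_and_timestamp("8CAB8EC6CC40_20161028120000.txt")
repaired = extract_mac_and_timestamp("8CAB8EC6CC40\ufffd_2016102\ufffd8120000.txt")
print(good, repaired)
```

Names that still fail to match after cleaning are reported as `None`, so the caller can decide whether to skip or quarantine the file.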

It should be noted that the above embodiments may be freely combined as needed. The above are only preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A data extraction method for mass URLs, characterized by comprising the following steps:

S10, collecting each piece of text data into a local file pool using a distributed web server framework;

S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1;

S40, extracting URL keywords in a distributed manner from the total text data in the cloud distributed file system hdfs1 using the Hadoop data warehouse tool hive.

2. The data extraction method for mass URLs as claimed in claim 1, characterized by further comprising the following step:

S31, the open-source computing framework TEZ compresses and encodes the total text data into compressed text data, and stores it in the database of the cloud distributed file system hdfs1.

3. The data extraction method for mass URLs as claimed in claim 2, characterized in that step S40 further comprises:

S41, extracting URL keywords from the compressed text data using a UDF of the data warehouse tool hive, and outputting the types and frequencies of the data accessed by users.

4. The data extraction method for mass URLs as claimed in claim 1, characterized in that step S20 further comprises:

S21, extracting the text data in the local file pool;

S22, cumulatively merging the text data into the total text data according to the block size of the cloud distributed file system hdfs1;

S23, uploading the total text data to the cloud distributed file system hdfs1 using the local distributed file system hdfs2.

5. The data extraction method for mass URLs as claimed in claim 4, characterized in that step S21 further comprises:

S211, extracting the router MAC and timestamp from the file name of the text data;

S212, identifying whether the router MAC and timestamp contain garbled characters;

S213, when the router MAC and timestamp contain garbled characters, cleaning the garbled characters and then jumping to step S22; otherwise, jumping directly to step S22.

6. The data extraction method for mass URLs as claimed in any one of claims 1 to 5, characterized by further comprising, before step S10:

S01, building a Hadoop cluster environment, and configuring the data warehouse tool hive, the cloud distributed file system hdfs1 and the local distributed file system hdfs2;

S02, building a web server distributed cluster on each node in the cluster environment, and adding load balancing;

S03, implementing the table-creation association among the data warehouse tool hive, the cloud distributed file system hdfs1 and the local distributed file system hdfs2; and reconstructing the UDF of the data warehouse tool hive.

7. The data extraction method for mass URLs as claimed in claim 6, characterized in that step S01 further comprises:

S011, building on Hadoop a first predetermined number of master nodes and a second predetermined number of slave nodes; the master nodes are interconnected, and each master node is connected to each slave node.

8. The data extraction method for mass URLs as claimed in claim 7, characterized in that step S01 further comprises:

S012, setting up a metadata service component metastore and a relational database mysql on each master node.

9. A system applying the data extraction method for mass URLs as claimed in any one of claims 1 to 8, characterized by comprising:

a web server framework, which is a distributed web server framework that collects each piece of text data into a local file pool;

a local distributed file system hdfs2, which uploads the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1;

a data warehouse tool hive, being the Hadoop data warehouse tool hive, which extracts URL keywords in a distributed manner from the total text data in the cloud distributed file system hdfs1.

10. The data extraction system for mass URLs as claimed in claim 9, characterized by further comprising:

an open-source computing framework TEZ, wherein the data warehouse tool hive sends computation requests to the open-source computing framework TEZ;

the open-source computing framework TEZ compresses and encodes the total text data into compressed text data, and stores it in the database of the cloud distributed file system hdfs1.