CN106570153A - Data extraction method and system for mass URLs - Google Patents
- Publication number: CN106570153A (application CN201610970427.3A)
- Authority: CN (China)
- Prior art keywords: data, text data, file system, hive, hdfs1
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/182 — File systems; Distributed file systems
- G06F16/1744 — Redundancy elimination performed by the file system using compression, e.g. sparse files
- G06F16/215 — Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
- G06F16/955 — Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
Abstract
The invention discloses a data extraction method for massive URLs. The method comprises the following steps: S10, collecting each piece of text data into a local file pool using a distributed web server framework; S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; and S40, extracting URL keywords in a distributed manner from the total text data in the cloud distributed file system using the Hadoop data warehouse tool Hive. According to the method provided by the invention, in a big-data application scenario, each piece of text data is first gathered into a local file pool, the total text data is then uploaded to the cloud distributed file system, and Hive performs distributed computation for distributed extraction; the method has the advantages of high efficiency and low resource consumption.
Description
Technical field
The invention belongs to the field of data extraction techniques, and more particularly relates to a data extraction method and system for massive URLs.
Background technology
Today, with the rapid development of the Internet, analyzing the patterns and personal habits that users exhibit when using Internet resources (that is, user behavior analysis) makes it possible to extract and identify user interests. On the one hand, content can be personalized and pushed to users, providing more proactive, intelligent service for website visitors. On the other hand, from the different manifestations of user behavior, their interests and preferences can be discovered, the membership relations between pages can be optimized, and the website architecture can be improved, thereby reducing the burden on users of finding information, making operation simpler, and saving time and effort.
At present, when user behavior is analyzed, a large website typically has a huge number of online users, and the amount of real-time behavior and context information produced is enormous. The system therefore needs high storage capacity and computing speed in order to feed analysis results back to users in time.
In the prior art, most user behavior analysis systems directly use local relational database technology and traditional data extraction methods. However, as data volume grows to massive scale, prior-art data extraction methods consume large amounts of resources and memory and are inefficient; they cannot adequately support efficient analysis of massive data.
Summary of the invention
The technical scheme that the present invention is provided is as follows:
The present invention provides a data extraction method for massive URLs, comprising the following steps: S10, collecting each piece of text data into a local file pool using a distributed web server framework; S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; S40, extracting URL keywords in a distributed manner from the total text data in the cloud distributed file system hdfs1 using the Hadoop data warehouse tool hive.
Further, the method also comprises the following steps: S30, the data warehouse tool hive sends a computation request to the open-source computation framework Tez; S31, the open-source computation framework Tez compresses and encodes the total text data into compressed text data, which is stored in the database of the cloud distributed file system hdfs1.
Further, step S40 further comprises: S41, extracting URL keywords from the compressed text data using a UDF function of the data warehouse tool hive, and outputting the type and frequency of the data accessed by each user.
Further, step S20 further comprises: S21, extracting the text data from the local file pool; S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1; S23, uploading the total text data to the cloud distributed file system hdfs1 via the local distributed file system hdfs2.
Further, step S21 further comprises: S211, extracting the router MAC address and timestamp from the file name of the text data; S212, checking whether the router MAC address or timestamp contains garbled characters; S213, when garbled characters are found in the router MAC address or timestamp, cleaning them and then jumping to step S22; otherwise, jumping directly to step S22.
Further, the method also comprises, before step S10: S01, building a Hadoop cluster environment, and configuring the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; S02, building a distributed web server cluster at each node in the cluster environment, and adding load balancing; S03, establishing the table-building associations among the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2, and rewriting the UDF functions of the data warehouse tool hive.
Further, step S01 further comprises: S011, building a first predetermined number of master nodes and a second predetermined number of slave nodes on Hadoop; the master nodes are interconnected with one another, and each master node is connected to every slave node.
Further, step S01 further comprises: S012, installing the metadata service component metastore and the relational database mysql on each master node.
The present invention also provides a system for the data extraction method for massive URLs, comprising: a web server framework, which collects each piece of text data into a local file pool using a distributed web server framework; a local distributed file system hdfs2, which uploads the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; and a data warehouse tool hive, which extracts URL keywords in a distributed manner from the total text data in the cloud distributed file system hdfs1.
Further, the system also comprises the open-source computation framework Tez; the data warehouse tool hive sends computation requests to the open-source computation framework Tez, and Tez compresses and encodes the total text data into compressed text data, which is stored in the database of the cloud distributed file system hdfs1.
Compared with the prior art, the data extraction method and system for massive URLs provided by the present invention have the following beneficial effects:
1) In a big-data application scenario, each piece of text data is first gathered into a local file pool, the total text data is then uploaded to the cloud distributed file system, and hive performs distributed computation for distributed extraction; this offers high efficiency and low resource consumption.
2) The total text data is compressed and encoded; the compressed total text data occupies less space, which effectively solves the resource consumption and memory problems of a local relational database, and the encoded total text data can be extracted in order, ensuring that extraction runs smoothly.
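As a small, hedged illustration of effect 2) — repetitive log text shrinks dramatically under compression — the following Python sketch uses the standard library's zlib standing in for the ORC compression actually used in the invention; the sample log line is invented for illustration:

```python
import zlib

# Repetitive, log-like text, loosely modeled on the URL records described
# in the patent; zlib here stands in for ORC compression.
total_text_data = (
    "8CAB8EC6CC40 2016-10-28 http://example.com/health\n" * 1000
).encode("utf-8")

compressed = zlib.compress(total_text_data)

print(len(total_text_data))  # size before compression
print(len(compressed))       # size after compression: a small fraction of the original
assert len(compressed) < len(total_text_data) // 10
```

The more repetitive the access logs, the larger the saving — which is why compressing before storage in hdfs1 relieves both disk and memory pressure.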
3) The text data is accumulated and merged into total text data according to the block size of the cloud distributed file system before being uploaded; this prevents the total text data from becoming too large and blocking the distributed file system.
4) The router MAC address and timestamp are extracted from the file name of the text data, and any garbled characters encountered are cleaned; this safeguards smooth data extraction.
Description of the drawings
The above characteristics, technical features, advantages, and implementations of a data extraction method and system for massive URLs are further described below in a clear and understandable way, with reference to the drawings and preferred embodiments.
Fig. 1 is a schematic flowchart of a data extraction method for massive URLs according to the invention;
Fig. 2 is a schematic flowchart of another data extraction method for massive URLs according to the invention;
Fig. 3 is a schematic flowchart of step S20 in the invention;
Fig. 4 is a partial schematic flowchart of a data extraction method for massive URLs in the invention;
Fig. 5 is a partial schematic flowchart of step S01 in the invention;
Fig. 6 is a schematic structural diagram of a data extraction system for massive URLs according to the invention;
Fig. 7 is a schematic diagram of another data extraction method for massive URLs according to the invention;
Fig. 8 is a schematic structural diagram of the composition of another data extraction method for massive URLs according to the invention.
Specific embodiment
In order to illustrate the embodiments of the present invention, or the technical solutions of the prior art, more clearly, specific embodiments of the present invention are described below with reference to the drawings. Evidently, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art could obtain other drawings, and other embodiments, from them without creative effort.
For simplicity, only the parts related to the present invention are shown schematically in the figures; they do not represent the actual structure of the product. In addition, for simplicity and ease of understanding, where several figures contain parts with the same structure or function, only one of them is drawn or labeled. Herein, "one" does not only mean "only this one"; it can also mean "more than one".
As shown in Fig. 1, according to one embodiment of the present invention, a data extraction method for massive URLs comprises the following steps: S10, collecting text sub-data with Flume, aggregating it into text data, and transmitting the text data; then collecting each piece of text data into a local file pool using a distributed web server framework.
S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; Hadoop is a distributed system infrastructure.
S40, extracting URL keywords in a distributed manner from the total text data in the cloud distributed file system hdfs1 using the Hadoop data warehouse tool hive.
Specifically, Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transmitting massive logs, provided by Cloudera. Flume supports customized data senders in the log system for collecting data; at the same time, Flume provides the ability to process data simply and write it to various (customizable) data receivers.
Hive is a data warehouse tool based on Hadoop that can map a structured data file to a database table and provides complete SQL query capability; SQL statements can be converted into MapReduce tasks to run. Hive uses Hadoop's HDFS (the Hadoop distributed file system), and the computation model hive uses is MapReduce.
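Hive's core idea — mapping a structured text file to a table and querying it with complete SQL — can be sketched in miniature with Python's built-in sqlite3 standing in for Hive (a rough analogy only: the record layout, table name, and column names below are invented for illustration, and sqlite3 of course runs no MapReduce):

```python
import sqlite3

# A "structured data file": one record per line, tab-separated,
# analogous to the text data Hive maps onto a table.
lines = [
    "8CAB8EC6CC40\thttp://example.com/health",
    "8CAB8EC6CC40\thttp://example.com/music",
    "8CAB8EB144A8\thttp://example.com/health",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url_log (router_mac TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO url_log VALUES (?, ?)",
    (line.split("\t") for line in lines),
)

# An SQL query over the mapped data, as Hive would run via MapReduce.
rows = conn.execute(
    "SELECT router_mac, COUNT(*) FROM url_log "
    "GROUP BY router_mac ORDER BY router_mac"
).fetchall()
print(rows)  # [('8CAB8EB144A8', 1), ('8CAB8EC6CC40', 2)]
```

In Hive the same pattern scales out: the table is backed by files in HDFS, and the GROUP BY is executed as a distributed MapReduce (or Tez) job.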
As shown in Fig. 2 according to another embodiment of the invention, a kind of data extraction method of magnanimity URL, including it is following
Step:S10, using Flume collection text subdata after, aggregate into text data, and text data be transmitted;Recycle
Each text data is collected local file pond by distribution Web server framework respectively.
S20, uploading the total text data accumulated in the local file pool to the Hadoop cloud distributed file system hdfs1; Hadoop is a distributed system infrastructure.
S30, the data warehouse tool hive sends a computation request to the open-source computation framework Tez;
S31, the open-source computation framework Tez compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; here, the compression is ORC compression and the file storage format is ORC.
S41, extracting URL keywords from the compressed text data using a UDF function of the data warehouse tool hive, and outputting the type and frequency of the data accessed by each user. For example: 8CAB8EC6CC40 186222539** {"mobile phone": 3, "health": 27}; 8CAB8EB144A8 138350092** {"automobile": 11, "picture": 127, "music": 26, "recruitment": 1, "mobile phone": 13, "health": 2907, "life": 8, "video": 4, "shopping": 7, "social": 84, "live": 1}; 8CAB8EC00880 136605272** {"music": 1, "mobile phone": 4, "health": 54, "video": 2, "shopping": 1, "social": 4}.
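The shape of the S41 output — per router MAC, a count of accessed categories — can be sketched as follows. This is an illustrative Python sketch, not the actual Hive UDF: the keyword-to-category table and the record format are assumptions, since the patent does not give the real dictionary.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical mapping from URL path fragments to user-access categories;
# the real UDF's keyword dictionary is not disclosed in the patent.
CATEGORY_KEYWORDS = {"health": "health", "music": "music", "shop": "shopping"}

def classify(url):
    """Return the category for a URL, or None if no keyword matches."""
    path = urlparse(url).path.lower()
    for fragment, category in CATEGORY_KEYWORDS.items():
        if fragment in path:
            return category
    return None

def extract_keywords(records):
    """records: (router_mac, url) pairs -> {mac: Counter of category frequencies}."""
    result = {}
    for mac, url in records:
        category = classify(url)
        if category is not None:
            result.setdefault(mac, Counter())[category] += 1
    return result

records = [
    ("8CAB8EC6CC40", "http://a.example/health/tips"),
    ("8CAB8EC6CC40", "http://a.example/health/news"),
    ("8CAB8EC6CC40", "http://b.example/music/top"),
]
print(extract_keywords(records))
# {'8CAB8EC6CC40': Counter({'health': 2, 'music': 1})}
```

In the patented method this aggregation runs inside Hive as a distributed UDF over the ORC-compressed data rather than on a single machine.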
Specifically, Tez is Apache's latest open-source computation framework supporting DAG jobs. Tez is not aimed directly at end users — in fact it lets developers build applications for end users that are faster and more scalable. The goal of the Tez project is to be highly customizable, so that it can satisfy the needs of a wide variety of use cases and let people complete their work without resorting to other external tools; if projects such as Hive and Pig use Tez instead of MapReduce as the backbone of their data processing, their response times will improve markedly. Tez is built on YARN, the new resource management framework used by Hadoop.
Hive file storage formats: 1. textfile, the default format; storage mode: row storage; disk overhead is large and data-parsing overhead is large; compressed text files cannot be merged or split by hive. 2. sequencefile, a binary file serialized into the file as <key,value> pairs; storage mode: row storage, splittable, compressible; block compression is usually chosen, and its advantage is that files are compatible with mapfile in the Hadoop API. 3. rcfile; storage mode: data is partitioned into blocks by row, and each block is stored by column; compression is fast and column access is fast; it minimizes the number of blocks a record read touches, and reading only the needed columns requires reading only the header definition of each row group. For reading full data, its performance may have no obvious advantage over sequencefile. 4. orc; storage mode: data is partitioned into blocks by row, and each block is stored by column; compression is fast and column access is fast; it is more efficient than rcfile and is an improved version of rcfile. 5. User-defined formats: a user can customize input and output formats by implementing inputformat and outputformat.
Among these, textfile consumes relatively more storage space, and its compressed files cannot be split or merged; its query efficiency is the lowest, but it can be stored directly and its data loading speed is the highest. Sequencefile consumes the most storage space; its compressed files can be split and merged, and its query efficiency is high, but loading requires converting from text files. Rcfile consumes the least storage space and has the highest query efficiency; loading requires converting from text files, and its loading speed is the lowest.
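The row-versus-column trade-off behind rcfile and ORC can be illustrated with a small, hedged Python sketch: the same records held row-wise and column-wise, where a single-field scan in the columnar layout touches only the one array it needs (the records are invented for illustration):

```python
# The same three records in a row layout (like textfile/sequencefile)
# and a columnar layout (like the per-block column storage of rcfile/ORC).
rows = [
    ("8CAB8EC6CC40", "health", 27),
    ("8CAB8EB144A8", "music", 26),
    ("8CAB8EC00880", "shopping", 1),
]

columns = {
    "router_mac": [r[0] for r in rows],
    "category":   [r[1] for r in rows],
    "frequency":  [r[2] for r in rows],
}

# Row storage: summing one field still walks every full record.
total_row = sum(record[2] for record in rows)

# Column storage: the scan reads only the one column it needs,
# which is why column access is fast in rcfile/ORC.
total_col = sum(columns["frequency"])

assert total_row == total_col == 54
```

On disk the columnar layout also compresses better, since each column holds values of one type — one reason ORC is chosen for the compressed text data in step S31.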
As shown in Fig. 2 and Fig. 3, according to yet another embodiment of the present invention, a data extraction method for massive URLs comprises the following steps: S10, collecting text sub-data with Flume, aggregating it into text data, and transmitting the text data; then collecting each piece of text data into a local file pool using a distributed web server framework.
S21, extracting the text data from the local file pool;
Step S21 further comprises: S211, extracting the router MAC address and timestamp from the file name of the text data;
S212, checking whether the router MAC address or timestamp contains garbled characters;
S213, when garbled characters are found in the router MAC address or timestamp, cleaning them using Python and then jumping to step S22; otherwise, jumping directly to step S22. Python is an object-oriented, interpreted computer programming language.
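Steps S211–S213 can be sketched in Python as below. The file-name layout (MAC_timestamp.log) and the definition of "garbled" (any character outside the expected hex/digit alphabets) are assumptions for illustration; the patent does not specify either.

```python
import re

MAC_RE = re.compile(r"^[0-9A-F]{12}$")
TS_RE = re.compile(r"^\d{10,13}$")  # assumed Unix timestamp, seconds or ms

def clean_filename(filename):
    """S211: split the file name into router MAC and timestamp;
    S212/S213: strip any garbled (unexpected) characters before continuing."""
    stem = filename.rsplit(".", 1)[0]
    mac, _, timestamp = stem.partition("_")
    if not MAC_RE.match(mac):                 # S212: garbled MAC detected
        mac = re.sub(r"[^0-9A-F]", "", mac.upper())   # S213: clean it
    if not TS_RE.match(timestamp):            # S212: garbled timestamp detected
        timestamp = re.sub(r"[^0-9]", "", timestamp)  # S213: clean it
    return mac, timestamp

print(clean_filename("8CAB8EC6CC40_1477548000.log"))
# ('8CAB8EC6CC40', '1477548000')
print(clean_filename("8cab8ec6cc4@0_14775?48000.log"))
# ('8CAB8EC6CC40', '1477548000')
```

Either way the function falls through to the same place, mirroring how both branches of S213 continue at step S22.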
S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1;
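Step S22 — packing small text files into upload units sized to the HDFS block — can be sketched as follows; the 128 MB block size is an assumption (it depends on how hdfs1 is configured), and the greedy packing is one simple way to realize the step:

```python
HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # assumed hdfs1 block size, in bytes

def merge_by_block_size(files, block_size=HDFS_BLOCK_SIZE):
    """Greedily pack (name, size) entries into batches no larger than one
    HDFS block, so each uploaded 'total text data' file fills a block
    without overflowing it."""
    batches, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > block_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches

mb = 1024 * 1024
files = [("a.log", 60 * mb), ("b.log", 60 * mb), ("c.log", 60 * mb), ("d.log", 10 * mb)]
print(merge_by_block_size(files))
# [['a.log', 'b.log'], ['c.log', 'd.log']]
```

Capping each merged file at one block size is what prevents the total text data from growing so large that the upload blocks the distributed file system, as effect 3) notes.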
S23, uploading the total text data to the Hadoop cloud distributed file system hdfs1 using the local distributed file system hdfs2; Hadoop is a distributed system infrastructure.
S30, the data warehouse tool hive sends a computation request to the open-source computation framework Tez;
S31, the open-source computation framework Tez compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; here, the compression is ORC compression and the file storage format is ORC.
S41, extracting URL keywords from the compressed text data using a UDF function of the data warehouse tool hive, and outputting the type and frequency of the data accessed by each user. For example: 8CAB8EC6CC40 186222539** {"mobile phone": 3, "health": 27}; 8CAB8EB144A8 138350092** {"automobile": 11, "picture": 127, "music": 26, "recruitment": 1, "mobile phone": 13, "health": 2907, "life": 8, "video": 4, "shopping": 7, "social": 84, "live": 1}; 8CAB8EC00880 136605272** {"music": 1, "mobile phone": 4, "health": 54, "video": 2, "shopping": 1, "social": 4}.
Specifically, hadoop and MapReduce are the foundation of the hive architecture. The hive architecture includes the following components: CLI (command line interface), JDBC/ODBC, Thrift Server, web GUI, metastore, and Driver (Compiler, Optimizer and Executor). These components fall into two broad classes: server-side components and client components.
Server-side components: 1. The Driver component: this component includes the Compiler, Optimizer and Executor. Its role is to parse the HiveQL (SQL-like) statements we write, perform compilation and optimization, generate an execution plan, and then invoke the underlying MapReduce computation framework.
2. The Metastore component: the metadata service component, which stores hive's metadata. Hive's metadata is stored in a relational database; the relational databases hive supports include derby and mysql. Metadata is extremely important to hive, so hive allows the metastore service to run independently and be installed in a remote server cluster, decoupling the hive service from the metastore service and ensuring the robustness of hive.
3. The Thrift service: Thrift is a software framework developed by Facebook for developing scalable, cross-language services. Hive integrates this service and allows different programming languages to call hive's interface.
Client components: 1. CLI: the command line interface.
2. Thrift clients: the Thrift client is not drawn in the architecture diagram above, but many of hive's client interfaces are built on the Thrift client, including the JDBC and ODBC interfaces.
3. Web GUI: hive provides a way for clients to access the services hive offers through a web page. This interface corresponds to hive's hwi component (the hive web interface); the hwi service must be started before use.
As shown in Fig. 2, Fig. 3 and Fig. 4, according to yet another embodiment of the present invention, a data extraction method for massive URLs comprises the following steps: S01, building a Hadoop cluster environment, and configuring the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; NameNode HA and ResourceManager HA are configured.
S02, building a distributed web server cluster at each node in the cluster environment and adding load balancing. Load balancing is built on top of the existing network structure; it provides a cheap, effective, and transparent way to extend the bandwidth of network devices and servers, increase throughput, strengthen network data-processing capability, and improve the flexibility and availability of the network.
S03, establishing the table-building associations among the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; and rewriting the UDF functions of the data warehouse tool hive.
S10, collecting text sub-data with Flume, aggregating it into text data, and transmitting the text data; then collecting each piece of text data into a local file pool using a distributed web server framework.
S21, extracting the text data from the local file pool;
Step S21 further comprises: S211, extracting the router MAC address and timestamp from the file name of the text data;
S212, checking whether the router MAC address or timestamp contains garbled characters;
S213, when garbled characters are found in the router MAC address or timestamp, cleaning them using Python and then jumping to step S22; otherwise, jumping directly to step S22. Python is an object-oriented, interpreted computer programming language.
S22, accumulating and merging the text data into total text data according to the block size of the cloud distributed file system hdfs1;
S23, uploading the total text data to the Hadoop cloud distributed file system hdfs1 using the local distributed file system hdfs2; Hadoop is a distributed system infrastructure.
S30, the data warehouse tool hive sends a computation request to the open-source computation framework Tez;
S31, the open-source computation framework Tez compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; here, the compression is ORC compression and the file storage format is ORC.
S41, extracting URL keywords from the compressed text data using a UDF function of the data warehouse tool hive, and outputting the type and frequency of the data accessed by each user. For example: 8CAB8EC6CC40 186222539** {"mobile phone": 3, "health": 27}; 8CAB8EB144A8 138350092** {"automobile": 11, "picture": 127, "music": 26, "recruitment": 1, "mobile phone": 13, "health": 2907, "life": 8, "video": 4, "shopping": 7, "social": 84, "live": 1}; 8CAB8EC00880 136605272** {"music": 1, "mobile phone": 4, "health": 54, "video": 2, "shopping": 1, "social": 4}.
Specifically, load balancing (SLB, for Server Load Balancing) mainly uses the following algorithms. Weighted round robin (WRR): each server is assigned a weight, which represents its ability to handle connections relative to the other servers. For a weight of n, SLB assigns n new connections to this server before moving on to the next server.
Weighted least connections (WLC): SLB distributes new connections to the real server with the fewest active connections. Each real server is assigned a weight m; the server's share of active-connection capacity equals m divided by the sum of the weights of all servers. SLB distributes new connections to real servers whose number of active connections is far below their capacity.
When the weighted least connections (WLC) algorithm is used, SLB uses a slow-start mechanism to control access to newly added real servers. "Slow start" limits the rate at which new connections are established and allows it to increase gradually, thereby preventing server overload.
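The WRR algorithm described above can be sketched in Python as follows (a simple consecutive-assignment variant, matching the description that a server of weight n receives n new connections before SLB moves on; real SLB implementations often interleave the weights instead):

```python
from itertools import islice

def weighted_round_robin(servers):
    """servers: {name: weight}. Yields server names for successive new
    connections; a server with weight n receives n consecutive connections
    before SLB moves to the next server."""
    while True:
        for name, weight in servers.items():
            for _ in range(weight):
                yield name

servers = {"web1": 3, "web2": 1}
assignments = list(islice(weighted_round_robin(servers), 8))
print(assignments)
# ['web1', 'web1', 'web1', 'web2', 'web1', 'web1', 'web1', 'web2']
```

Over any full cycle, web1 receives three connections for each one web2 receives, in proportion to the configured weights.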
The configuration of NameNode HA is as follows:
1.1 Unpack hadoop-2.3.0-cdh5.0.0.tar.gz to /opt/boh/, rename it to hadoop, and modify etc/hadoop/core-site.xml.
1.2 Modify hdfs-site.xml.
1.3 Edit /etc/hadoop/slaves; add hadoop3 and hadoop4.
1.4 Edit /etc/profile; add HADOOP_HOME=/opt/boh/hadoop and PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH; copy the above configuration to all nodes.
1.5 Start the respective services:
1.5.1 Start journalnode: execute sbin/hadoop-daemon.sh start journalnode on hadoop0, hadoop1 and hadoop2.
1.5.2 Format ZooKeeper: execute bin/hdfs zkfc -formatZK on hadoop1.
1.5.3 Format and start the hadoop1 node: bin/hdfs namenode -format; sbin/hadoop-daemon.sh start namenode.
1.5.4 Format and start the hadoop2 node: bin/hdfs namenode -bootstrapStandby; sbin/hadoop-daemon.sh start namenode.
1.5.5 Start the zkfc service on hadoop1 and hadoop2: sbin/hadoop-daemon.sh start zkfc; one of the hadoop1 and hadoop2 nodes now becomes active.
1.5.6 Start datanode: execute sbin/hadoop-daemons.sh start datanode on hadoop1.
1.5.7 Verify success: open a browser and access hadoop1:50070 and hadoop2:50070; of the two NameNodes, one is active and the other is standby. Then kill the active namenode process; the standby namenode will automatically switch to the active state.
The configuration of ResourceManager HA is as follows:
2.1 Modify mapred-site.xml.
2.2 Modify yarn-site.xml.
2.3 Distribute the configuration files to each node.
2.4 Modify yarn-site.xml on hadoop2.
2.5 Create directories and grant permissions:
2.5.1 Create the local directories.
2.5.2 After starting hdfs, execute the following commands to create the log directory and create /tmp under hdfs. If /tmp is not created with the specified permissions, the other components of CDH will have problems; in particular, if it is not created, other processes will automatically create this directory with strict permissions, which will affect other programs: hadoop fs -mkdir /tmp; hadoop fs -chmod -R 777 /tmp.
2.6 Start yarn and the jobhistory server:
2.6.1 Start on hadoop1: sbin/start-yarn.sh; this script starts the resourcemanager on hadoop1 and all of the nodemanagers.
2.6.2 Start the resourcemanager on hadoop2: yarn-daemon.sh start resourcemanager.
2.6.3 Start the jobhistory server on hadoop2: sbin/mr-jobhistory-daemon.sh start historyserver.
2.7 Verify that the configuration succeeded: open a browser and access hadoop1:23188 or hadoop2:23188.
As shown in Fig. 2 to Fig. 5, according to yet another embodiment of the present invention, a data extraction method for massive URLs comprises the following steps: S01, building a Hadoop cluster environment, and configuring the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; NameNode HA and ResourceManager HA are configured.
Step S01 further comprises: S011, building a first predetermined number (for example, 4) of master nodes and a second predetermined number (for example, 7) of slave nodes on Hadoop; the master nodes are interconnected with one another, and each master node is connected to every slave node.
S012, installing the metadata service component metastore, the relational database mysql, and HiveServer2 on each master node. Through HiveServer2, a client can operate on the data in hive without starting the CLI; it also allows remote clients to submit requests to hive and fetch results using various programming languages such as Java and Python. HiveServer2 is based on Thrift; it supports multi-client concurrency and authentication and provides better support for open-API clients such as JDBC and ODBC.
S02, building a distributed web server cluster at each node in the cluster environment and adding load balancing. Load balancing is built on top of the existing network structure; it provides a cheap, effective, and transparent way to extend the bandwidth of network devices and servers, increase throughput, strengthen network data-processing capability, and improve the flexibility and availability of the network.
S03, establishing the table-building associations among the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; and rewriting the UDF functions of the data warehouse tool hive.
S10, collecting text sub-data with Flume, aggregating it into text data, and transmitting the text data; then collecting each piece of text data into a local file pool using a distributed web server framework.
S21, extracting the text data from the local file pool;
Step S21 is further included:S211, extract router mac in the filename of the text data and when
Between stab;
S212, identify whether the router mac and timestamp run into mess code;
S213, when the router mac and timestamp run into mess code, the mess code is cleaned using Python
Afterwards, jump to step S22;Otherwise, jump directly to step S22.Python is a kind of object-oriented, literal translation formula computer program
Design language.
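Steps S211 to S213 can be sketched as follows. The patent does not give the file-name convention, so the `<MAC>_<timestamp>.log` pattern and the cleaning rule are illustrative assumptions:

```python
import re

# Hypothetical file-name convention assumed for illustration:
#   <router-MAC>_<unix-timestamp>.log, e.g. "8CAB8EC6CC40_1477612800.log"
FNAME_RE = re.compile(r"^([0-9A-F]{12})_(\d{10})\.log$")

def parse_filename(name):
    """Extract (router MAC, timestamp) from a file name; None if garbled."""
    m = FNAME_RE.match(name)
    return (m.group(1), int(m.group(2))) if m else None

def clean_garbled(name):
    """Drop characters outside printable ASCII, standing in for the
    Python garbled-character cleaning of step S213."""
    return "".join(ch for ch in name if 32 < ord(ch) < 127)

raw = "8CAB8EC6CC40_1477612800.log\ufffd"   # trailing mojibake byte
parsed = parse_filename(clean_garbled(raw))
```

If `parse_filename` still returns `None` after cleaning, the file would be skipped rather than merged in S22.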
S22: according to the block size of the cloud distributed file system hdfs1, cumulatively merge the text data into total text data.
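One way to realise the cumulative merge of S22 is to group files greedily so each merged file approaches, but does not exceed, the HDFS block size. This grouping strategy is a sketch, not the patent's stated algorithm; 128 MB is a common default block size, and the actual cluster value may differ:

```python
# Sketch: plan merge groups whose total size stays within one HDFS block.
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size in bytes

def plan_merges(file_sizes, block_size=BLOCK_SIZE):
    """Greedily group file sizes so each group's total stays <= block_size."""
    groups, current, total = [], [], 0
    for size in file_sizes:
        if current and total + size > block_size:
            groups.append(current)       # close the full group
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups
```

Merging small text files up to the block size keeps each uploaded file in a single HDFS block, which avoids the many-small-files overhead on the NameNode.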
S23: upload the total text data to the cloud distributed file system hdfs1 of hadoop by means of the local distributed file system hdfs2. Hadoop is a distributed system infrastructure.
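The upload of S23 can be done with the standard Hadoop shell. The sketch below only builds the command (paths are hypothetical); running it requires a live cluster node:

```python
# Sketch of the S23 upload step. `hdfs dfs -put -f` is the standard Hadoop
# shell command for copying a local file into HDFS; paths are placeholders.
def hdfs_put_cmd(local_path, hdfs_dir):
    """Build the `hdfs dfs -put` command that copies a merged file
    from the local file system into HDFS, overwriting if present."""
    return ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir]

cmd = hdfs_put_cmd("/data/pool/merged_0001.log", "/warehouse/urls/")
# On a cluster node one could then run: subprocess.run(cmd, check=True)
```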
S30: the data warehouse tool hive sends a computation request to the open-source computing framework TEZ.
S31: the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; the compression is ORC compression, and the file storage format is ORC.
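In HiveQL terms, S30/S31 amount to switching the execution engine to TEZ and storing the data in an ORC table. The table and column names below are assumptions for illustration; `hive.execution.engine`, `STORED AS ORC`, and the `orc.compress` table property are standard Hive settings:

```python
# Sketch: the HiveQL statements that realise the TEZ + ORC storage step.
def orc_ddl(table="url_logs_orc"):
    """DDL for an ORC-format table with ZLIB compression (assumed schema)."""
    return (
        f"CREATE TABLE {table} (mac STRING, ts BIGINT, url STRING) "
        "STORED AS ORC TBLPROPERTIES ('orc.compress'='ZLIB')"
    )

statements = [
    "SET hive.execution.engine=tez",   # route queries through TEZ
    orc_ddl(),
]
```

An `INSERT ... SELECT` from the raw text table into this ORC table would then perform the compressed encoding described above.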
S41: use the UDF functions of the data warehouse tool hive to extract URL keywords from the compressed text data, and output the types and frequencies of the data accessed by users. For example:
8CAB8EC6CC40 186222539** {"mobile phone": 3, "health": 27};
8CAB8EB144A8 138350092** {"car": 11, "pictures": 127, "music": 26, "jobs": 1, "mobile phone": 13, "health": 2907, "life": 8, "video": 4, "shopping": 7, "social": 84, "live streaming": 1};
8CAB8EC00880 136605272** {"music": 1, "mobile phone": 4, "health": 54, "video": 2, "shopping": 1, "social": 4}.
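The per-MAC category counts shown above can be illustrated with plain Python. This is a stand-in for the hive UDF's logic, not its actual code; the URL-to-category table is a hypothetical example:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical host-to-category keyword table (the patent does not give one).
CATEGORIES = {"music.example.com": "music", "shop.example.com": "shopping"}

def count_keywords(records):
    """records: iterable of (mac, url) pairs -> {mac: {category: count}}."""
    out = {}
    for mac, url in records:
        category = CATEGORIES.get(urlparse(url).netloc)
        if category:
            out.setdefault(mac, Counter())[category] += 1
    return out

stats = count_keywords([
    ("8CAB8EC6CC40", "http://music.example.com/a"),
    ("8CAB8EC6CC40", "http://music.example.com/b"),
    ("8CAB8EC6CC40", "http://shop.example.com/x"),
])
```

In the actual system this counting would run distributed inside Hive, grouped by router MAC.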
Specifically, master/slave corresponds to the server/agent concept. The master provides a web interface that lets users manage jobs and slaves; a job may run on the master machine itself or be dispatched to run on a slave. One master can be associated with multiple slaves in order to serve different jobs, or the same job with different configurations.
Hive's metastore component is where Hive's metadata is stored. The metastore component consists of two parts: the metastore service and the back-end data store. The storage medium of the back-end data store is a relational database, such as Derby, Hive's default embedded disk database, or a MySQL database. The metastore service is built on top of the back-end storage medium and is the service component through which the hive service interacts with the metadata; by default, the metastore service and the hive service are installed together and run in the same process. The metastore service can also be detached from the hive service and installed independently in the cluster, with hive calling the metastore service remotely. The metadata layer can then be placed behind a firewall, and a client accessing the hive service can still connect to the metadata layer, which provides better manageability and security. Using a remote metastore service allows the metastore service and the hive service to run in different processes, which also ensures the stability of hive and improves the efficiency of the hive service.
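The remote-metastore deployment just described reduces to a couple of standard hive-site.xml properties; `hive.metastore.uris` (Thrift, default port 9083) and the JDBC connection URL for the MySQL back end are real Hive configuration keys, while the host names are assumptions:

```python
# Sketch: the hive-site.xml properties for a remote metastore deployment.
def remote_metastore_props(metastore_host, mysql_host, port=9083):
    return {
        # Where the hive service finds the detached metastore service.
        "hive.metastore.uris": f"thrift://{metastore_host}:{port}",
        # Where the metastore service keeps its back-end data (MySQL).
        "javax.jdo.option.ConnectionURL": f"jdbc:mysql://{mysql_host}:3306/metastore",
    }

props = remote_metastore_props("master1", "master1")
```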
As shown in Fig. 6, according to one embodiment of the present invention, a data extraction system for massive URLs includes:
Hadoop, which builds the Hadoop cluster environment and configures the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2; NameNode HA and ResourceManager HA are configured.
Preferably, a first predetermined number (for example, 4) of master nodes and a second predetermined number (for example, 7) of slave nodes are built on Hadoop; the master nodes are interconnected with one another, and each master node is connected to each slave node.
A metadata service component metastore, a relational database MySQL, and HiveServer2 are set up on each master node. Through HiveServer2, clients can operate on the data in Hive without starting the CLI, and it also allows remote clients to submit requests to Hive and retrieve results using various programming languages such as Java and Python. HiveServer2 is based on Thrift; it supports concurrency and authentication for multiple clients and provides better support for open-API clients such as JDBC and ODBC.
A distributed web server cluster is built at each node in the cluster environment, and load balancing is added. Load balancing is built on the existing network infrastructure; it provides a cheap, effective, and transparent way to extend the bandwidth of network devices and servers, increase throughput, strengthen network data-processing capacity, and improve network flexibility and availability.
The table-building association among the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2 is implemented; the UDF functions of the data warehouse tool hive are reconstructed.
Distributed web server framework: after the text sub-data is collected with Flume, it is aggregated into text data and the text data is transmitted; the distributed web server framework then collects each piece of text data into the local file pool.
The distributed web server framework extracts the text data in the local file pool: it extracts the router MAC and the timestamp from the file name of the text data, identifies whether the router MAC and the timestamp contain garbled characters, and, when they do, cleans the garbled characters with Python. Python is an object-oriented, interpreted programming language.
The distributed web server framework cumulatively merges the text data into total text data according to the block size of the cloud distributed file system hdfs1.
Local distributed file system hdfs2: the total text data obtained by accumulation in the local file pool is uploaded to the cloud distributed file system hdfs1 of hadoop by means of the local distributed file system hdfs2. Hadoop is a distributed system infrastructure.
Data warehouse tool hive: the data warehouse tool hive sends a computation request to the open-source computing framework TEZ.
Open-source computing framework TEZ: the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1; the compression is ORC compression, and the file storage format is ORC.
Data warehouse tool hive: the UDF functions of the data warehouse tool hive are used to extract URL keywords from the compressed text data, and the types and frequencies of the data accessed by users are output. For example:
8CAB8EC6CC40 186222539** {"mobile phone": 3, "health": 27};
8CAB8EB144A8 138350092** {"car": 11, "pictures": 127, "music": 26, "jobs": 1, "mobile phone": 13, "health": 2907, "life": 8, "video": 4, "shopping": 7, "social": 84, "live streaming": 1};
8CAB8EC00880 136605272** {"music": 1, "mobile phone": 4, "health": 54, "video": 2, "shopping": 1, "social": 4}.
As shown in Fig. 7 and Fig. 8, according to yet another embodiment of the present invention, a data extraction method for massive URLs includes: building the Hadoop 2.7.1 cluster environment (deploying 4 masters and 7 slaves) and configuring environments such as HIVE and HDFS (Hive Metastore, MySQL, and hiveserver2 are set up on one master); setting NameNode HA and ResourceManager HA so that the distributed system achieves high availability; building a tomcat distributed cluster at each node and adding load balancing; implementing the table-building association of hive and hdfs, developing the corresponding hive UDF functions, and testing that the extraction function works normally.
Text data is collected into the local file pool through the distributed web server framework; the files in the pool are cleaned, extracted, and merged, with cumulative merging according to the HDFS block size; the local HDFS then completes the efficient upload of the data; finally, the self-developed hive UDF function getNUM performs the URL keyword extraction on the data, completing the high-efficiency extraction of massive URL data.
The cleaning, merging, uploading, high-compression encoding, and distributed extraction programs are run, and the results are output for further in-depth analysis. Through the UDF functions of hive and the distributed computation of the Hadoop cluster, the extraction computation over massive data is completed; from the types and frequencies of the data accessed by users, the users' online characteristics are quickly obtained, providing a basis for product recommendations and services for users.
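The overall flow above can be summarised as a driver sketch. Every function body here is a placeholder standing in for the real cluster call (Flume collection, HDFS upload, the getNUM UDF); only the chaining of the steps is meant to be informative:

```python
# Driver sketch of the clean -> merge -> upload -> extract pipeline.
# All step bodies are stubs; the real steps run on the cluster.
def clean(files):    return [f for f in files if f.endswith(".log")]   # S21
def merge(files):    return ["merged_0001.log"]                        # S22
def upload(merged):  return [f"hdfs:///pool/{m}" for m in merged]      # S23
def extract(paths):  return {"8CAB8EC6CC40": {"music": 1}}             # S41 (stub result)

def pipeline(files):
    return extract(upload(merge(clean(files))))

result = pipeline(["a.log", "garbled.tmp"])
```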
hive: an open-source Apache technology; data warehouse software that provides querying and management of large data sets stored in distributed storage, itself built on Apache Hadoop. Hive SQL represents a SQL-like language with traditional MapReduce at its core.
The present invention mainly combines the adaptive UDF development functions of hive, the cleaning, merging, and uploading done with Python, and the ORC compression of hive, yielding a high-performance data extraction method.
It should be noted that the above embodiments can be freely combined as needed. The above are only preferred embodiments of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications can be made without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A data extraction method for massive URLs, characterised in that it comprises the following steps:
S10: using a distributed web server framework, collecting each piece of text data into a local file pool respectively;
S20: uploading the total text data obtained by accumulation in the local file pool to the cloud distributed file system hdfs1 of hadoop;
S40: using the data warehouse tool hive of hadoop, performing distributed extraction of URL keywords from the total text data in the cloud distributed file system hdfs1.
2. The data extraction method for massive URLs according to claim 1, characterised in that it further comprises the following steps:
S30: the data warehouse tool hive sends a computation request to the open-source computing framework TEZ;
S31: the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1.
3. The data extraction method for massive URLs according to claim 2, characterised in that step S40 further includes:
S41: using the UDF functions of the data warehouse tool hive, extracting the URL keywords from the compressed text data and outputting the types and frequencies of the data accessed by users.
4. The data extraction method for massive URLs according to claim 1, characterised in that step S20 further includes:
S21: extracting the text data in the local file pool;
S22: cumulatively merging the text data into total text data according to the block size of the cloud distributed file system hdfs1;
S23: uploading the total text data to the cloud distributed file system hdfs1 by means of the local distributed file system hdfs2.
5. The data extraction method for massive URLs according to claim 4, characterised in that step S21 further includes:
S211: extracting the router MAC and the timestamp from the file name of the text data;
S212: identifying whether the router MAC and the timestamp contain garbled characters;
S213: when the router MAC and the timestamp contain garbled characters, cleaning the garbled characters and then jumping to step S22; otherwise, jumping directly to step S22.
6. The data extraction method for massive URLs according to any one of claims 1 to 5, characterised in that it further comprises, before step S10:
S01: building the cluster environment of Hadoop, and configuring the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2;
S02: building a distributed web server cluster at each node in the cluster environment, and adding load balancing;
S03: implementing the table-building association among the data warehouse tool hive, the cloud distributed file system hdfs1, and the local distributed file system hdfs2, and reconstructing the UDF functions of the data warehouse tool hive.
7. The data extraction method for massive URLs according to claim 6, characterised in that step S01 further includes:
S011: building a first predetermined number of master nodes and a second predetermined number of slave nodes on Hadoop; the master nodes are interconnected with one another, and each master node is connected to each slave node.
8. The data extraction method for massive URLs according to claim 7, characterised in that step S01 further includes:
S012: setting up a metadata service component metastore and a relational database MySQL on each master node.
9. A system applying the data extraction method for massive URLs according to any one of claims 1 to 8, characterised in that it includes:
a web server framework, which uses a distributed web server framework to collect each piece of text data into a local file pool respectively;
a local distributed file system hdfs2, which uploads the total text data obtained by accumulation in the local file pool to the cloud distributed file system hdfs1 of hadoop;
a data warehouse tool hive, which uses the data warehouse tool hive of hadoop to perform distributed extraction of the URL keywords from the total text data in the cloud distributed file system hdfs1.
10. The data extraction system for massive URLs according to claim 9, characterised in that it further includes:
an open-source computing framework TEZ, wherein the data warehouse tool hive sends a computation request to the open-source computing framework TEZ, and the open-source computing framework TEZ compresses and encodes the total text data into compressed text data and stores it in the database of the cloud distributed file system hdfs1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610970427.3A CN106570153A (en) | 2016-10-28 | 2016-10-28 | Data extraction method and system for mass URLs |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106570153A true CN106570153A (en) | 2017-04-19 |
Family
ID=58541622
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103955502A (en) * | 2014-04-24 | 2014-07-30 | 科技谷(厦门)信息技术有限公司 | Visualized on-line analytical processing (OLAP) application realizing method and system |
CN104111996A (en) * | 2014-07-07 | 2014-10-22 | 山大地纬软件股份有限公司 | Health insurance outpatient clinic big data extraction system and method based on hadoop platform |
CN104301182A (en) * | 2014-10-22 | 2015-01-21 | 赛尔网络有限公司 | Method and device for inquiring slow website access abnormal information |
CN105512336A (en) * | 2015-12-29 | 2016-04-20 | 中国建设银行股份有限公司 | Method and device for mass data processing based on Hadoop |
CN105677842A (en) * | 2016-01-05 | 2016-06-15 | 北京汇商融通信息技术有限公司 | Log analysis system based on Hadoop big data processing technique |
Non-Patent Citations (2)
Title |
---|
CHEN476328361: "Controlling whether files output by tez are compressed, and specifying the file name", HTTPS://BLOG.CSDN.NET/THINKING2013/ARTICLE/DETAILS/48133137 * |
XIAOJUN_0820: "flume study: flume writes log4j log data into hdfs", HTTPS://BLOG.CSDN.NET/XIAO_JUN_0820/ARTICLE/DETAILS/38110323 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107145542A (en) * | 2017-04-25 | 2017-09-08 | 上海斐讯数据通信技术有限公司 | The high efficiency extraction subscription client ID method and system from URL |
CN107193903A (en) * | 2017-05-11 | 2017-09-22 | 上海斐讯数据通信技术有限公司 | The method and system of efficient process IP address zone location |
CN107256206A (en) * | 2017-05-24 | 2017-10-17 | 北京京东尚科信息技术有限公司 | The method and apparatus of character stream format conversion |
CN107256206B (en) * | 2017-05-24 | 2021-04-30 | 北京京东尚科信息技术有限公司 | Method and device for converting character stream format |
CN108133050A (en) * | 2018-01-17 | 2018-06-08 | 北京网信云服信息科技有限公司 | A kind of extracting method of data, system and device |
CN111935215A (en) * | 2020-06-29 | 2020-11-13 | 广东科徕尼智能科技有限公司 | Internet of things data management method, terminal, system and storage device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20170419 |