CN107145542A - Method and system for efficiently extracting user client IDs from URLs

Info

Publication number: CN107145542A
Application number: CN201710275446.9A
Authority: CN
Inventors: 欧阳涛
Original assignee: Shanghai Feixun Data Communication Technology Co Ltd
Current assignee: Taizhou Jiji Intellectual Property Operation Co.,Ltd.
Priority date: 2017-04-25
Filing date: 2017-04-25
Publication date: 2017-09-08

Abstract

The present invention relates to a method and system for efficiently extracting user client IDs from URLs. The method comprises: S1: collecting log file data by a log file collection unit and storing it in a file pool; S2: preprocessing the data collected in step S1 by ETL in Hive, and collecting the preprocessed data into a Hadoop cluster so that the data can be structured; S3: extracting the client ID by combining a Hive UDF with a function that extracts the client ID from a URL. The advantages are that Hive calls Hadoop to perform distributed computation, and the function that extracts the user client ID from a URL is integrated with a Hive UDF, which improves the efficiency of client ID extraction and reduces resource consumption.

Description

Method and system for efficiently extracting user client IDs from URLs

Technical field

The invention belongs to the field of computer technology, and more particularly relates to a method and system for efficiently extracting user client IDs from URLs.

Background technology

With the rapid development of Internet technology, the applications and services running on the Internet have emerged in large numbers, and the era of big data has arrived. Each website is itself an independent information system, and once these websites are interconnected through the network, the whole Internet becomes one huge information system. While browsing websites, clients leave traces of their visits, and these traces are saved in the form of web log files. For systems, programs, operations and maintenance, transactions and the like, obtaining logs is becoming more and more important, because logs are an important basis for operations such as system recovery, error tracking and security detection.

Because data sources are numerous, system users are varied and operations are frequent, massive web log data on the scale of terabytes or even petabytes can be produced every day. Owing to their limited scalability and processing performance, traditional databases can no longer meet today's requirements for storing and analyzing data volumes that readily reach tens of gigabytes, hundreds of gigabytes, or even terabytes. With so many unstructured log files, how to retrieve data quickly, how to find useful data quickly, and how to perform statistical analysis on logs have become urgent problems. Existing big data query methods can only perform simple row-key searches directly through HBase or retrieval through Hive HQL; retrieval latency is very high and the analysis results are inaccurate, so they cannot meet current needs. Moreover, in big data scenarios, as the data volume grows, computing the client IDs in the URLs accessed by a large number of users directly on a local machine consumes large amounts of resources and memory and is inefficient.

To solve the above technical problems, long-term exploration has been carried out. For example, a Chinese patent discloses a method for querying and analyzing massive web log data [application number: CN201410596395.6], comprising the following steps: Step 1, parsing the data of each data source using ETL in Hive, the parsing comprising four steps, namely extraction, cleaning, transformation and loading, wherein during cleaning the useful information is extracted in a distributed manner by a MapReduce program; Step 2, loading the extracted data into a data warehouse; Step 3, receiving HiveQL statements by the Driver component of Hive; Step 4, optimizing the received statements for skewed data, and obtaining preliminary map results after table join operations; Step 5, converting the received HiveQL statements into MapReduce tasks for execution and storing the query results; Step 6, partitioning the massive web log data; Step 7, analyzing and mining the data using a highly parallel genetic algorithm with global randomized search; Step 8, loading the data produced by the data query and analysis parts into a MySQL database.

The above scheme achieves data mining of big data and improves the accuracy of the analysis results, but it still has shortcomings, for example: 1. it can only identify web access conditions and cannot extract client IDs; 2. traffic cannot be allocated among the machines of the distributed system, which leads to load imbalance.

Summary of the invention

In view of the above problems, one object of the present invention is to provide a method for efficiently extracting user client IDs from URLs.

Another object of the present invention is to provide a system based on the method for efficiently extracting user client IDs from URLs.

To achieve the above objects, the present invention adopts the following technical solution:

The method for efficiently extracting user client IDs from URLs comprises the following steps:

S1: collecting log file data by a log file collection unit and storing it in a file pool;

S2: preprocessing the data collected in step S1 by ETL in Hive, and collecting the preprocessed data into a Hadoop cluster so that the data can be structured;

S3: extracting the client ID by combining a Hive UDF with a function that extracts the client ID from a URL.

With the above technical solution, the custom-development capability of Hive UDFs is combined with the function that extracts the client ID from a URL, achieving efficient extraction of user client IDs.

In the above method for efficiently extracting user client IDs from URLs, in step S1, the log collection unit is a Flume system capable of collecting, aggregating and transmitting distributed massive log files.

In the above method for efficiently extracting user client IDs from URLs, in step S2, the data is structured as follows:

The table structure of the data files is established through Hive, and the Hive tables are associated with HDFS through MySQL so that the data is structured.

In the above method for efficiently extracting user client IDs from URLs, in step S2, the ETL program is deployed in the Hadoop cluster, and the ETL program comprises a series of programs capable of cleaning, merging, uploading, high-compression encoding and distributed extraction of the data.

In the above method for efficiently extracting user client IDs from URLs, the Hadoop distributed system is built as follows:

A Hadoop 2.7.1 cluster environment with at least one master node and at least one slave node is built; the Hive and HDFS environments are set up and configured; Hive Metastore, MySQL and HiveServer2 are deployed on one master node; and NameNode HA and ResourceManager HA are configured to build the distributed system.

In the above method for efficiently extracting user client IDs from URLs, in step S3, the UDF is combined with the function that extracts the client ID from a URL as follows:

S3-1: a corresponding Hive UDF capable of correctly extracting IP addresses is developed, so that Hive has UDF capability;

S3-2: after the program that extracts the client ID from a URL has been completed on the basis of Hive, with a local connection to the Hadoop cluster, the UDF is compiled so that it is combined with the function that extracts the client ID from a URL.
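By way of illustration only, the extraction function referred to in step S3 might look like the following minimal sketch. Hive UDFs are normally written as Java classes; Python is used here as a stand-in that Hive could invoke as a streaming script through its TRANSFORM clause, which is a different mechanism from the compiled UDF the patent describes. The query-parameter names `client_id` and `uid` and the three-column row layout are assumptions, since the patent does not state where in the URL the client ID appears.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical query-parameter names that might carry a client ID.
# The patent does not name any; these are assumptions for illustration.
CLIENT_ID_PARAMS = ("client_id", "uid")


def extract_client_id(url, param_names=CLIENT_ID_PARAMS):
    """Return the first non-empty client ID found in the URL's query string, or None."""
    try:
        query = urlsplit(url).query
    except ValueError:
        # urlsplit rejects some malformed URLs (for example a bad IPv6 host)
        return None
    params = parse_qs(query)  # blank values are dropped by default
    for name in param_names:
        values = params.get(name)
        if values:
            return values[0]
    return None


def transform_lines(lines):
    """Turn tab-separated (router_mac, user_mac, url) rows into
    (router_mac, user_mac, client_id) rows, skipping rows without an ID."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip rows that do not match the assumed layout
        router_mac, user_mac, url = fields
        client_id = extract_client_id(url)
        if client_id is not None:
            yield "\t".join((router_mac, user_mac, client_id))
```

When run as a streaming script, `transform_lines` could be fed from standard input and its results written to standard output, one row per line.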

In the above method for efficiently extracting user client IDs from URLs, a Tomcat distributed cluster is built on each node of the distributed system, and Nginx is used to allocate traffic to the machines hosting Tomcat.

In the above method for efficiently extracting user client IDs from URLs, the following step is further included after step S3:

The extracted client ID results are output, and the results are then further analyzed and/or used to generate a report.

In the above method for efficiently extracting user client IDs from URLs, the output results are displayed visually through a visualization configuration, the visualization configuration comprising any one or a combination of data collection visualization, data access visualization, data computation visualization and data output visualization.

A system based on the method for efficiently extracting user client IDs from URLs.

Compared with the prior art, the method and system of the present invention for efficiently extracting user client IDs from URLs have the following advantages:

1. Hive calls Hadoop to perform distributed computation and complete the extraction of client IDs from URLs, with high efficiency and low resource consumption;

2. Traffic is allocated among the machines, achieving load balancing;

3. The data is structured, which facilitates the extraction of client IDs.

Brief description of the drawings

Fig. 1 is the technical architecture diagram of Embodiment 1 of the present invention;

Fig. 2 is the data flow diagram of Embodiment 1 of the present invention.

Detailed description of the embodiments

The present invention can be used to extract client IDs efficiently and thereby determine users' regional access patterns accurately, bringing users higher-quality service. It solves the prior-art problem that obtaining client IDs directly consumes large amounts of resources and memory and is inefficient.

The following are preferred embodiments of the present invention. The technical solution is further described with reference to the accompanying drawings, but the present invention is not limited to these embodiments.

Embodiment one

As shown in Fig. 1 and Fig. 2, the method for efficiently extracting user client IDs from URLs comprises the following steps:

S1: collecting log file data by a log file collection unit and storing it in a file pool.

The log collection unit is a Flume system capable of collecting, aggregating and transmitting distributed massive log files.

Flume is a highly available, highly reliable, distributed system for collecting, aggregating and transmitting massive logs.

URL: Uniform Resource Locator, a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the address of a standard resource on the Internet. Each file on the Internet has a unique URL, which contains information indicating the file's location and how a browser should handle it.

S2: preprocessing the data collected in step S1 by ETL in Hive, and collecting the preprocessed data into a Hadoop cluster so that the data can be structured.

To structure the data, the table structure of the data files must be established through Hive; data structuring is therefore completed by associating the Hive tables with HDFS. The association between the Hive tables and HDFS is managed by MySQL, which stores the data in separate tables to increase speed and flexibility.
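As a rough illustration of establishing a table structure over files in HDFS, the sketch below assembles a HiveQL `CREATE EXTERNAL TABLE` statement in Python. The table name, column names and HDFS path are hypothetical, as the patent does not disclose its schema. In a typical deployment, the mapping between such a table and its HDFS location is recorded in the Hive Metastore, which in this embodiment is backed by MySQL.

```python
def build_external_table_ddl(table, columns, hdfs_path, delimiter="\\t"):
    """Build a HiveQL statement that maps delimited text files in HDFS to a table.

    columns is a list of (column_name, hive_type) pairs.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_path}'"
    )


# Hypothetical schema: one row per logged request.
ddl = build_external_table_ddl(
    "web_log",
    [("router_mac", "STRING"), ("user_mac", "STRING"), ("url", "STRING")],
    "/data/web_log",
)
print(ddl)
```

An external table is used in this sketch so that dropping the table would leave the underlying log files in HDFS untouched; whether the patent's implementation uses external or managed tables is not stated.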

The ETL program is deployed in the Hadoop cluster, and the ETL program comprises a series of programs capable of cleaning, merging, uploading, high-compression encoding and distributed extraction of the data.

Hadoop: a distributed system infrastructure that provides high-throughput access to application data and is suitable for applications with very large data sets.

Its core design consists of HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over massive data.
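The MapReduce model mentioned above can be pictured with a small single-process sketch: a map phase emits key-value pairs, a shuffle phase groups them by key, and a reduce phase aggregates each group. The example counts requests per host; it only mimics the programming model and does not use Hadoop itself.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_phase(urls):
    """Map: emit a (host, 1) pair for every URL."""
    for url in urls:
        yield urlsplit(url).netloc, 1


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


urls = ["http://a.com/x", "http://b.com/y", "http://a.com/z"]
host_counts = reduce_phase(shuffle(map_phase(urls)))
```

On a real cluster, the map and reduce functions run in parallel on different machines and the framework performs the shuffle between them.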

Hive: an open-source Apache technology. It is data warehouse software that provides querying and management of large data sets held in distributed storage, and it is built on Apache Hadoop. Specifically, Hive is a Hadoop-based data warehouse tool that can map structured data files to database tables and provides full SQL query capability, converting SQL statements into MapReduce tasks for execution.

The advantage of Hive is its low learning cost: simple MapReduce statistics can be implemented quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to the data warehouse statistical analysis required by this embodiment.

ETL processing: describes the process of extracting data from a source, transforming it, and loading it into a destination.

Further, the Hadoop distributed system is built as follows:

A Hadoop 2.7.1 cluster environment with at least one master node and at least one slave node is built; the Hive and HDFS environments are set up and configured; Hive Metastore, MySQL and HiveServer2 are deployed on one master node; NameNode HA and ResourceManager HA are configured; and parameters are set so that the distributed system meets high-availability requirements. Preferably, 4 master nodes and 7 slave nodes are deployed in this embodiment.

Hive Metastore: a relational database used to store the metadata of tables;

MySQL: a relational database management system that stores data in separate tables rather than putting all data in one large repository, which increases speed and flexibility;

HiveServer2: the Hive server;

NameNode HA: the high-availability NameNode (data distribution server);

ResourceManager HA: the high-availability resource manager;

UDF: a Hive user-defined function.

In this embodiment the client ID mainly refers to a JD.com (Jingdong) ID, although it may of course be the ID of another client. Clients include web clients and application clients, for example a Baidu ID, a WeChat ID or a Taobao ID.

S3: extracting the client ID by combining a Hive UDF with a function that extracts the client ID from a URL.

Specifically, in step S3, the UDF is combined with the function that extracts the client ID from a URL as follows:

S3-1: to further complete the extraction of data, a corresponding Hive UDF capable of correctly extracting IP addresses is developed, so that Hive has UDF capability;

S3-2: after the program that extracts the client ID from a URL has been completed on the basis of Hive, with a local connection to the Hadoop cluster, the UDF is compiled so that it is combined with the function that extracts the client ID from a URL.

To add load balancing, a Tomcat distributed cluster is built on each node of the distributed system, and Nginx is used to allocate traffic to the machines hosting Tomcat. Nginx is a high-performance HTTP and reverse proxy server. The load-balancing design balances the traffic of every machine in the cluster, including master and slave nodes, which improves the utilization of each machine and, because the load is balanced, the processing speed of each machine.
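The patent does not state which balancing policy is configured. Nginx's default upstream method is round-robin, so the following sketch models that policy in Python purely to show how requests end up evenly spread across backends. It is not Nginx configuration, and the Tomcat host names are hypothetical.

```python
from collections import Counter
from itertools import cycle


def round_robin(requests, backends):
    """Assign each request to the next backend in turn."""
    if not backends:
        raise ValueError("at least one backend is required")
    rotation = cycle(backends)
    return [(request, next(rotation)) for request in requests]


# Hypothetical Tomcat instances behind the proxy.
backends = ["tomcat-1:8080", "tomcat-2:8080", "tomcat-3:8080"]
assignments = round_robin(range(9), backends)

# Number of requests each backend received.
load = Counter(backend for _, backend in assignments)
```

With nine requests and three backends, each backend receives three requests, which is the balanced distribution the paragraph above describes.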

Further, the following step is included after step S3:

The extracted client ID results are output, and the results are then further analyzed and/or used to generate a report. Taking the JD ID as an example, the report has the following columns: router, user MAC, and frequency of occurrence of the JD ID.

The principle behind the report is analyzed below:

Different router addresses correspond to different URLs. By associating them with the MAC addresses of different home routers and counting the occurrences of different terminals under different routers, efficient distributed computation makes it possible to obtain, from massive user data, the actual terminal access patterns of JD users, and thus to determine users' regional access patterns accurately and bring users better service.
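The counting behind such a report can be sketched as follows. The input is assumed to be a sequence of (router MAC, user MAC, client ID) records produced by the extraction step; the record layout and the sample values are hypothetical.

```python
from collections import Counter


def build_report(records):
    """Count how often each (router MAC, user MAC, client ID) triple occurs.

    Returns rows of (router_mac, user_mac, client_id, frequency),
    most frequent first, with ties broken by the key for stable output.
    """
    counts = Counter()
    for router_mac, user_mac, client_id in records:
        counts[(router_mac, user_mac, client_id)] += 1
    return sorted(
        ((r, u, c, n) for (r, u, c), n in counts.items()),
        key=lambda row: (-row[3], row[0], row[1], row[2]),
    )


# Hypothetical extracted records.
records = [
    ("AA:01", "BB:01", "jd_1001"),
    ("AA:01", "BB:01", "jd_1001"),
    ("AA:02", "BB:07", "jd_2002"),
]
report = build_report(records)
```

In the distributed setting described by the patent, the same grouping would be expressed as an aggregate query in Hive rather than computed on a single machine.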

Further, the output results are displayed visually through a visualization configuration, the visualization configuration comprising any one or a combination of data collection visualization, data access visualization, data computation visualization and data output visualization, and a customizable visualization capability is retained for displaying the results.

The technical architecture of this embodiment is described in detail below with reference to Fig. 1:

Log files are collected and processed by an existing framework, such as the Flume system architecture or a distributed system infrastructure, and stored in a local large-file pool. The files are then accumulated, cleaned, merged and otherwise preprocessed before being uploaded to the HDFS of the Hadoop cluster. The data uploaded to HDFS is extracted in a distributed manner using Hive's own function for extracting client IDs from URLs, to perform a preliminary extraction of client IDs. At the same time, Hive issues computation requests to the Tez computing framework, so that Hive calls Hadoop to perform distributed computation and complete the extraction of JD IDs from URLs with high efficiency and low resource consumption.

Tez is Apache's latest open-source computing framework supporting DAG jobs; it can convert multiple dependent jobs into a single job, greatly improving the performance of DAG jobs. Tez is not aimed directly at end users; rather, it lets developers build faster and more scalable applications for end users. In this embodiment it is used to extend UDFs so that Hive supports custom UDF development and client IDs are extracted more accurately. Hadoop has traditionally been a batch-processing platform for massive data. However, many use cases require near-real-time query performance, and some workloads, such as machine learning, are not well suited to MapReduce. The purpose of Tez is to help Hadoop handle these use cases.

The goal of the Tez project is to be highly customizable so that it can meet the needs of various use cases, letting people complete their work without resorting to other external tools. If projects such as Hive and Pig use Tez rather than MapReduce as their data processing backbone, their response times will improve significantly. Tez is built on YARN, the new resource management framework used by Hadoop.

This embodiment is described in detail below with reference to Fig. 2:

The small log files of the distributed massive logs are collected and aggregated by the Flume system and transmitted to the large-file pool. An ETL program written in Python then cleans and merges the data to separate valid URLs from invalid URLs, where a valid URL is one that contains the specified client ID. The invalid URLs are filtered and checked, and unqualified URLs are deleted. The processed invalid URLs and the valid URLs are then merged to complete a preliminary extraction of the data, and the preliminarily extracted data is uploaded to HDFS for storage. After the ETL program has compressed and encoded the data, the custom-developed UDF is used to extract the data further.
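The cleaning and merging stage might be sketched in Python as below. The patent does not define what makes a URL "unqualified" or which parameter holds the client ID; this sketch assumes that an unqualified URL is one lacking an http/https scheme or a host, and that the ID is carried in a hypothetical `uid` query parameter.

```python
from urllib.parse import urlsplit, parse_qs

ID_PARAM = "uid"  # hypothetical parameter name; not given in the patent


def is_well_formed(url):
    """Assumed qualification check: http(s) scheme and a non-empty host."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def has_client_id(url, id_param=ID_PARAM):
    """A URL counts as valid when it carries a non-empty client ID."""
    return bool(parse_qs(urlsplit(url).query).get(id_param))


def clean_and_merge(urls):
    """Separate valid from invalid URLs, delete unqualified ones, then merge.

    Returns (merged_urls, valid_count); valid URLs come first in the merge.
    """
    valid, invalid = [], []
    for url in urls:
        url = url.strip()
        if not is_well_formed(url):
            continue  # unqualified URL: deleted
        (valid if has_client_id(url) else invalid).append(url)
    return valid + invalid, len(valid)


raw = ["http://a.com/?uid=1", "not a url", "https://b.com/page", " http://c.com/?uid=2 \n"]
merged, valid_count = clean_and_merge(raw)
```

Keeping the valid URLs at the front of the merged list is a design choice made for this sketch so that later stages can tell the two groups apart by position; the patent does not specify how the merged data is laid out.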

Through the Hadoop distributed architecture, this embodiment integrates the function that extracts user client IDs from URLs with a Hive UDF, achieving efficient extraction of user client IDs. The actual terminal access patterns of the users of a specified client can be obtained from massive user data, and users' regional access patterns can thus be determined accurately, bringing users higher-quality service.

Embodiment two

This embodiment provides a system based on the method for efficiently extracting user client IDs from URLs.

The specific embodiments described herein merely illustrate the spirit of the present invention. Those skilled in the art to which the present invention belongs may make various modifications or additions to the described embodiments, or substitute them in a similar manner, without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.

Although terms such as Hive, UDF, Hadoop cluster and MySQL are used frequently herein, the possibility of using other terms is not excluded. These terms are used only to describe and explain the essence of the present invention more conveniently; construing them as any additional limitation would be contrary to the spirit of the present invention.

Claims

1. A method for efficiently extracting user client IDs from URLs, characterized by comprising the following steps:

S1: collecting log file data by a log file collection unit and storing it in a file pool;

S2: preprocessing the data collected in step S1 by ETL in Hive, and collecting the preprocessed data into a Hadoop cluster so that the data can be structured;

S3: extracting the client ID by combining a Hive UDF with a function that extracts the client ID from a URL.

2. the method for the high efficiency extraction subscription client ID according to claim 1 from URL, it is characterised in that in step In S1, described log collection unit is the Flume that distributed massive logs file can be acquired, polymerize and be transmitted System.

3. the method for the high efficiency extraction subscription client ID according to claim 1 from URL, it is characterised in that in step In S2, data structured is handled by the following method：

Set up the table structure of data file by hive, and hive and hdfs build table by Mysql and associate with by data Structuring is handled.

4. the method for the high efficiency extraction subscription client ID according to claim 1 from URL, it is characterised in that in step In S2, described ETL program is deployed in Hadoop clusters, and ETL program includes to clean data, closing And, upload, high compression coding and a series of distributed programs extracted.

5. the method for the high efficiency extraction subscription client ID according to claim 1 from URL, it is characterised in that described Hadoop distributed system is built by the following method：

The cluster environment for the Hadoop2.7.1 for being deployed with least one main frame and at least one slave is built, to HIVE and HDFS Environment with being configured, and by Hive Metastore, mysql and hiveserver2 set up on a main frame, and Namenode HA and ResourceManager HA are configured to build distributed system.

6. the method for the high efficiency extraction subscription client ID according to claim 1 from URL, it is characterised in that in step In S3, UDF functions are combined with extracting the function phase of client id from URL by the following method：

S3-1：By developing, the corresponding hive UDF functions to IP address with normal extraction function make hive have UDF Function；

S3-2 completes after the program of extraction client id, leading to from URL based on hive in locality connection Hadoop clusters The compiling of UDF functions is crossed to complete to be combined with extracting the function phase of client id from URL.

7. the method for the high efficiency extraction subscription client ID according to claim 1 from URL, it is characterised in that in distribution Each node of formula system has built tomcat distributed type assemblies, and the flow of machine where tomcat is adjusted using Nginx Match somebody with somebody.

8. the method for the high efficiency extraction subscription client ID according to claim 1 from URL, it is characterised in that in step It is further comprising the steps of after S3：

9. the method for the high efficiency extraction subscription client ID according to claim 8 from URL, it is characterised in that output As a result carry out visualization by visual configuration to show, described visual configuration includes data collection visualization, data access The configuration of visualization, data calculation visualization and any one or more combination in data output visualization.

10. it is a kind of based on high efficiency extraction subscription client ID method is from URL described in claim 1-9 any one System.