CN107145542A - Method and system for efficiently extracting user client IDs from URLs - Google Patents
- Publication number
- CN107145542A (application number CN201710275446.9A)
- Authority
- CN
- China
- Prior art keywords
- url
- data
- client
- hive
- high efficiency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1805—Append-only file systems, e.g. using logs or journals to store data
- G06F16/1815—Journaling file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention relates to a method and system for efficiently extracting user client IDs from URLs. The method comprises: S1: collecting log-file data with a log-file collection unit and storing it in a file pool; S2: preprocessing the data collected in step S1 with ETL in Hive, and loading the preprocessed data into a Hadoop cluster so that the data can be structured; S3: combining a Hive UDF with a function that extracts the client ID from a URL in order to extract the client ID. The advantage is that Hive invokes Hadoop to perform distributed computation, and the function for extracting the user client ID from the URL is integrated with the Hive UDF, which improves the efficiency of client ID extraction and reduces resource consumption.
Description
Technical field
The invention belongs to the field of computer technology, and more particularly relates to a method and system for efficiently extracting user client IDs from URLs.
Background art
With the rapid development of Internet technology, applications and services running on the Internet have emerged in large numbers, and the era of big data has arrived. Each website is itself an independent information system; once these websites are interconnected through the network, the whole Internet becomes one huge information system. While browsing websites, clients leave traces of their visits, and these traces are saved as web log files. For all kinds of systems, programs, operations and maintenance, and transactions, obtaining logs is becoming increasingly important, because logs are an important basis for operations such as system recovery, error tracking and security detection.
Because data sources are numerous, each system has many users, and operations are frequent, massive web log data on the TB or even PB scale can be produced every day. Owing to limits on scalability and processing performance, traditional databases can no longer meet today's requirements for storing, analyzing and processing data volumes that easily reach tens of GB, hundreds of GB or even TB. Faced with large numbers of unstructured log files, how to retrieve data quickly, how to find useful data quickly, and how to perform statistical analysis on logs have become urgent problems. Existing big-data query methods can only search directly by HBase row key or retrieve with Hive HQL; retrieval latency is high and the analysis results are inaccurate, so current needs cannot be met. Moreover, in big-data application scenarios, as the data volume grows, computing the client IDs in the URLs accessed by a large number of users directly on a local machine consumes large amounts of resources and memory and is inefficient.
To solve the above technical problem, long-term exploration has been carried out. For example, a Chinese patent discloses a method for querying and analyzing massive web log data [application number: CN201410596395.6], comprising the following steps: Step 1, parse the data of each data source with ETL in Hive, where parsing comprises four steps (extraction, cleaning, transformation and loading) and, during cleaning, the useful information is extracted in a distributed manner by MapReduce programs; Step 2, load the extracted data into the data warehouse; Step 3, the Driver component of Hive receives HiveQL statements; Step 4, optimize the received statements for skewed data and obtain preliminary map results after the table join operations; Step 5, convert the received HiveQL statements into MapReduce tasks, execute them and store the query results; Step 6, split the massive web log data; Step 7, analyze and mine the data with a highly parallel, globally randomized search genetic algorithm; Step 8, load the data produced by the query and analysis parts into a MySQL database.
The above scheme achieves data mining on big data and improves the accuracy of the analysis results, but it still has shortcomings, for example: 1. it can only identify web access conditions and cannot extract the client's ID; 2. traffic cannot be allocated among the machines of the distributed system, which leads to load imbalance.
Summary of the invention
In view of the above problems, one object of the present invention is to provide a method for efficiently extracting user client IDs from URLs.
Another object of the present invention is to provide a system based on the method for efficiently extracting user client IDs from URLs.
To achieve the above objects, the present invention adopts the following technical solution:
The method for efficiently extracting user client IDs from URLs comprises the following steps:
S1: Collect log-file data with a log-file collection unit and store it in a file pool;
S2: Preprocess the data collected in step S1 with ETL in Hive, and load the preprocessed data into a Hadoop cluster so that the data can be structured;
S3: Combine a Hive UDF with a function that extracts the client ID from a URL in order to extract the client ID.
With the above technical solution, the custom-development capability of Hive UDFs is combined with the function that extracts the client ID from a URL, achieving efficient extraction of user client IDs.
In the above method for efficiently extracting user client IDs from URLs, in step S1, the log collection unit is a Flume system capable of collecting, aggregating and transmitting massive distributed log files.
In the above method, in step S2, the data is structured as follows:
The table structure of the data files is established in Hive, and MySQL is used to associate the Hive tables with HDFS, so that the data is structured.
In the above method, in step S2, the ETL programs are deployed in the Hadoop cluster, and the ETL programs comprise a series of programs capable of cleaning, merging, uploading, high-compression encoding and distributed extraction of data.
In the above method, the Hadoop distributed system is built as follows:
Build a Hadoop 2.7.1 cluster environment with at least one master node and at least one slave node; set up and configure the Hive and HDFS environments; deploy Hive Metastore, MySQL and HiveServer2 on one master node; and configure NameNode HA and ResourceManager HA to build the distributed system.
In the above method for efficiently extracting user client IDs from URLs, in step S3, the UDF is combined with the function for extracting the client ID from the URL as follows:
S3-1: Develop a corresponding Hive UDF that can correctly extract IP addresses, so that Hive has UDF capability;
S3-2: After the program that extracts the client ID from the URL has been completed, based on Hive, on a local machine connected to the Hadoop cluster, compile it as a UDF to complete the combination with the function for extracting the client ID from the URL.
In the above method, a Tomcat distributed cluster is built on each node of the distributed system, and Nginx is used to allocate traffic among the machines hosting Tomcat.
In the above method, the following step is further included after step S3:
After the extracted client ID results are output, the results are further analyzed and/or reports are generated.
In the above method, the output results are displayed visually through a visualization configuration, which comprises any one or a combination of data collection visualization, data access visualization, data calculation visualization and data output visualization.
A system based on the above method for efficiently extracting user client IDs from URLs.
Compared with the prior art, the method and system of the present invention for efficiently extracting user client IDs from URLs have the following advantages:
1. Hive invokes Hadoop to perform distributed computation to extract the client IDs in URLs, which is efficient and consumes few resources;
2. Traffic is allocated among the machines to achieve load balancing;
3. The data is structured, which facilitates extraction of client IDs.
Brief description of the drawings
Fig. 1 is the technical architecture diagram of Embodiment 1 of the present invention;
Fig. 2 is the data flow diagram of Embodiment 1 of the present invention.
Detailed description
The present invention can be used to extract client IDs efficiently so as to pinpoint the access patterns of users by region and bring users better service. It solves the prior-art problem that obtaining client IDs directly consumes large amounts of resources and memory and is inefficient.
The following are preferred embodiments of the present invention; the technical solution is further described with reference to the accompanying drawings, but the present invention is not limited to these embodiments.
Embodiment 1
As shown in Fig. 1 and Fig. 2, the method for efficiently extracting user client IDs from URLs comprises the following steps:
S1: Collect log-file data with a log-file collection unit and store it in a file pool;
The log collection unit is a Flume system capable of collecting, aggregating and transmitting massive distributed log files.
Flume is a highly available, highly reliable, distributed system for collecting, aggregating and transmitting massive logs.
URL: Uniform Resource Locator, a concise representation of the location of a resource available on the Internet and of the method used to access it; it is the address of a standard resource on the Internet. Each file on the Internet has a unique URL, and the information it contains indicates where the file is located and how a browser should handle it.
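To make the definition concrete, the following minimal Python sketch splits a URL into its access method, host, resource path and query parameters using the standard library. The URL, the host name and the `client_id` parameter are hypothetical and do not come from the patent.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical access-log URL; host, path and parameter names are invented.
url = "https://shop.example.com/item/view?client_id=u12345&ref=home"

parts = urlsplit(url)
query = parse_qs(parts.query)  # maps each parameter name to a list of values

print(parts.scheme)  # access method (protocol)
print(parts.netloc)  # host that serves the resource
print(parts.path)    # location of the resource on that host
print(query)         # query parameters, where a client ID would typically sit
```

A client ID carried in the query string can then be looked up by parameter name in `query`.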
S2: Preprocess the data collected in step S1 with ETL in Hive, and load the preprocessed data into a Hadoop cluster so that the data can be structured.
To structure the data, the table structure of the data files must be established in Hive. The data is therefore structured by associating the Hive tables with HDFS; this association is managed through MySQL, which stores data in separate tables to increase speed and flexibility.
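As a rough illustration of what structuring means here, the sketch below turns one raw, tab-separated log line into a typed record whose fields could map onto the columns of a Hive table. The patent does not specify the log layout, so the four fields (`router_mac`, `user_mac`, `timestamp`, `url`) and the tab separator are assumptions.

```python
from typing import NamedTuple, Optional


class AccessRecord(NamedTuple):
    """One structured log row; the field names are hypothetical."""
    router_mac: str
    user_mac: str
    timestamp: str
    url: str


def parse_line(line: str) -> Optional[AccessRecord]:
    """Parse a tab-separated log line; return None if it is malformed."""
    fields = line.rstrip("\n").split("\t")
    # Reject lines with the wrong number of fields or with empty fields.
    if len(fields) != 4 or not all(fields):
        return None
    return AccessRecord(*fields)
```

Lines that do not parse are returned as `None`, so a caller can drop or log them before loading.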
The ETL programs are deployed in the Hadoop cluster and comprise a series of programs capable of cleaning, merging, uploading, high-compression encoding and distributed extraction of data.
Hadoop: a distributed system infrastructure that provides high-throughput access to application data and is suited to applications with very large data sets.
Its core design consists of HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over massive data.
Hive: an open-source Apache technology. It is data warehouse software that provides querying and management of large data sets held in distributed storage, and it is built on Apache Hadoop. Specifically, Hive is a Hadoop-based data warehouse tool that can map structured data files to database tables and provides full SQL query capability by converting SQL statements into MapReduce tasks for execution.
The advantage of Hive is its low learning cost: simple MapReduce statistics can be implemented quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to the data warehouse statistical analysis required by this embodiment.
ETL processing: the process of extracting data from a source, transforming it, and loading it into a destination.
Further, the Hadoop distributed system is built as follows:
Build a Hadoop 2.7.1 cluster environment with at least one master node and at least one slave node; set up and configure the Hive and HDFS environments; deploy Hive Metastore, MySQL and HiveServer2 on one master node; and configure NameNode HA and ResourceManager HA, setting parameters until the distributed system meets the high-availability requirement. Preferably, 4 master nodes and 7 slave nodes are deployed in this embodiment.
Hive Metastore: a relational database used to store table metadata;
MySQL: a relational database management system that stores data in separate tables, rather than putting all data in one large repository, to increase speed and flexibility;
HiveServer2: the Hive server;
NameNode HA: the high-availability data distribution server;
ResourceManager HA: the high-availability resource manager.
S3: Combine a Hive UDF with a function that extracts the client ID from a URL in order to extract the client ID.
UDF: a Hive user-defined function.
In this embodiment the client ID mainly refers to the JD.com (Jingdong) ID, but it can of course be the ID of another client. Clients include web clients and application clients, for example Baidu ID, WeChat ID or Taobao ID.
Specifically, in step S3, the UDF is combined with the function for extracting the client ID from the URL as follows:
S3-1: To further complete data extraction, develop a corresponding Hive UDF that can correctly extract IP addresses, so that Hive has UDF capability;
S3-2: After the program that extracts the client ID from the URL has been completed, based on Hive, on a local machine connected to the Hadoop cluster, compile it as a UDF to complete the combination with the function for extracting the client ID from the URL.
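The patent does not give the extraction logic itself, and Hive UDFs are normally written in Java and compiled into a jar. The Python sketch below therefore illustrates only the parsing step, under the assumption that the client ID travels in a query parameter; the parameter name `client_id` and both function names are hypothetical. The `transform` helper shows the line-in, line-out shape that a script run through Hive's `TRANSFORM` clause would have.

```python
from typing import Iterable, Iterator, Optional
from urllib.parse import urlsplit, parse_qs

CLIENT_ID_PARAM = "client_id"  # hypothetical parameter name, not from the patent


def extract_client_id(url: str, param: str = CLIENT_ID_PARAM) -> Optional[str]:
    """Return the first value of `param` in the URL's query string, or None."""
    try:
        query = urlsplit(url).query
    except ValueError:
        # urlsplit rejects some malformed URLs, e.g. a broken IPv6 host.
        return None
    values = parse_qs(query).get(param)
    return values[0] if values else None


def transform(lines: Iterable[str]) -> Iterator[str]:
    """Map each input URL line to 'url<TAB>client_id' (empty when no ID is found)."""
    for line in lines:
        url = line.rstrip("\n")
        client_id = extract_client_id(url)
        yield f"{url}\t{client_id or ''}"
```

In a streaming setup, `transform` would be fed `sys.stdin` and its results printed one per line; each Hadoop task then processes only its own split of the URLs, which is where the distributed speed-up comes from.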
To provide load balancing, a Tomcat distributed cluster is built on each node of the distributed system, and Nginx allocates traffic among the machines hosting Tomcat. Nginx is a high-performance HTTP and reverse proxy server. The load-balancing design evens out the traffic of every machine in the cluster, master and slave nodes alike, which raises the utilization of each machine and, because the load is balanced, improves the processing speed of each machine.
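The patent does not state which balancing method Nginx is configured with. As an illustration only, the sketch below simulates plain round-robin, which is Nginx's default way of distributing requests across an upstream group, to show how requests spread evenly over the Tomcat machines. The backend addresses are hypothetical.

```python
from collections import Counter
from itertools import cycle

# Hypothetical Tomcat backends; the patent names no hosts or ports.
backends = ["node1:8080", "node2:8080", "node3:8080"]

# Round-robin: hand each incoming request to the next backend in turn.
picker = cycle(backends)
assignments = [next(picker) for _ in range(9)]

# Count how many requests each backend received.
load = Counter(assignments)
print(load)
```

With nine requests and three backends, every machine receives the same share, which is the balanced state the embodiment aims for.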
Further, the following step is included after step S3:
After the extracted client ID results are output, the results are further analyzed and/or reports are generated. For example, taking the JD ID as an example, the report has the following columns: router, user MAC, JD ID, frequency of occurrence.
The principle behind the report is as follows:
Different router addresses correspond to different URL addresses. By associating records with the MACs of different home routers and counting the occurrences of different terminals under different routers, efficient distributed computation can obtain the real terminal access patterns of JD users from massive user data and use them to pinpoint user access by region, bringing users better service.
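The report described above amounts to a grouped count. The sketch below shows that aggregation on a handful of in-memory rows; in the patent's setting it would run as a distributed Hive query instead. The sample rows and the `build_report` name are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str]  # (router MAC, user MAC, client ID)

# Invented sample data standing in for extracted results.
records: List[Row] = [
    ("r1", "u1", "jd_001"),
    ("r1", "u1", "jd_001"),
    ("r1", "u2", "jd_002"),
    ("r2", "u3", "jd_001"),
]


def build_report(rows: Iterable[Row]) -> List[Tuple[str, str, str, int]]:
    """Count occurrences of each (router, user MAC, client ID) combination."""
    counts = Counter(rows)
    return sorted((router, user, cid, n) for (router, user, cid), n in counts.items())


for line in build_report(records):
    print(line)
```

Each output row carries the four report columns named in the text: router, user MAC, client ID and frequency of occurrence.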
Further, the output results are displayed visually through a visualization configuration, which comprises any one or a combination of data collection visualization, data access visualization, data calculation visualization and data output visualization, and a customizable visualization function is retained for displaying the results.
The technical architecture of this embodiment is described below with reference to Fig. 1:
Log files are collected and processed by existing frameworks, such as the Flume system architecture and the distributed system infrastructure, and then stored in a local large-file pool. The files are then accumulated, cleaned, merged and otherwise preprocessed, and uploaded to the HDFS of the Hadoop cluster. Hive's own function for extracting client IDs from URLs is applied to the data uploaded to HDFS to perform distributed extraction, giving a preliminary extraction of client IDs. Meanwhile, Hive submits computation requests to the Tez computing framework, so that Hive invokes Hadoop to perform distributed computation to extract the JD IDs in the URLs, which is efficient and consumes few resources.
Tez is Apache's latest open-source computing framework supporting DAG jobs. It can convert multiple dependent jobs into a single job, which greatly improves the performance of DAG jobs. Tez is not aimed directly at end users; in fact it lets developers build faster, more scalable applications for end users. In this embodiment it is used to extend the UDF capability, so that Hive has UDF custom-development support and extracts client IDs more accurately. Hadoop has traditionally been a batch-processing platform for massive data. However, many use cases require near-real-time query performance, and some workloads, such as machine learning, are not well suited to MapReduce. The purpose of Tez is to help Hadoop handle these use cases.
The goal of the Tez project is to support a high degree of customization, so that it can meet the needs of various use cases and let people complete their work without resorting to other external approaches. If projects such as Hive and Pig use Tez rather than MapReduce as the backbone of their data processing, their response times will improve significantly. Tez is built on YARN, the new resource management framework used by Hadoop.
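The idea of treating several dependent jobs as one DAG can be illustrated with a topological ordering. The sketch below is not Tez itself; it only shows, with Python's standard `graphlib` module (Python 3.9 or later), how a set of jobs with dependencies resolves into one valid execution order. The job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on; the names are invented.
dag = {
    "clean": set(),
    "extract_ids": {"clean"},
    "count_by_router": {"clean"},
    "report": {"extract_ids", "count_by_router"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because `extract_ids` and `count_by_router` depend only on `clean`, a DAG engine is free to run them in parallel within the same job, instead of chaining separate MapReduce jobs.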
This embodiment is described below with reference to Fig. 2:
The Flume system collects and aggregates the massive distributed small log files and transmits them to the large-file pool. ETL programs written in Python then clean and merge the data to separate valid URLs from invalid URLs. Invalid URLs undergo a filtering check, and unqualified URLs are deleted; valid URLs are URLs that contain the specified client ID. The processed invalid URLs are merged with the valid URLs to complete a preliminary extraction of the data. The preliminarily extracted data is then uploaded to HDFS for storage. After the ETL programs have compression-encoded the data, the custom-developed UDF performs further extraction on the data.
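The cleaning and merging step can be sketched in Python, the language the embodiment names for its ETL programs. The patent does not define the filtering check, so this sketch assumes that a URL without a client ID is "qualified" when it is a well-formed http or https URL; that rule, the `client_id` parameter name and the function names are all assumptions.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, parse_qs

CLIENT_ID_PARAM = "client_id"  # hypothetical parameter name


def has_client_id(url: str) -> bool:
    """A URL counts as valid when it carries the specified client ID."""
    try:
        query = urlsplit(url).query
    except ValueError:
        return False
    return CLIENT_ID_PARAM in parse_qs(query)


def is_well_formed(url: str) -> bool:
    """Assumed filtering check: an http(s) scheme and a host must be present."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def clean_and_merge(urls: Iterable[str]) -> List[str]:
    """Split URLs into valid and invalid, drop unqualified ones, then merge."""
    valid: List[str] = []
    invalid: List[str] = []
    for raw in urls:
        url = raw.strip()
        if has_client_id(url):
            valid.append(url)
        elif is_well_formed(url):
            invalid.append(url)
        # Anything else fails the check and is deleted.
    return valid + invalid
```

For example, `clean_and_merge(["http://a.example/?client_id=1\n", "http://a.example/home", "not a url"])` keeps the first two entries and drops the third.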
Through the distributed Hadoop architecture, this embodiment integrates the function for extracting user client IDs from URLs with Hive UDFs, achieving efficient extraction of user client IDs. The real terminal access patterns of users of the specified client can be obtained from massive user data and used to pinpoint user access by region, bringing users better service.
Embodiment 2
This embodiment provides a system based on the method for efficiently extracting user client IDs from URLs.
The specific embodiments described herein merely illustrate the spirit of the present invention. Those skilled in the art to which the invention belongs may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.
Although terms such as Hive, UDF, Hadoop cluster and MySQL are used frequently herein, the possibility of using other terms is not excluded. These terms are used only to describe and explain the essence of the invention more conveniently; construing them as any additional limitation would be contrary to the spirit of the invention.
Claims (10)
1. A method for efficiently extracting user client IDs from URLs, characterized by comprising the following steps:
S1: Collect log-file data with a log-file collection unit and store it in a file pool;
S2: Preprocess the data collected in step S1 with ETL in Hive, and load the preprocessed data into a Hadoop cluster so that the data can be structured;
S3: Combine a Hive UDF with a function that extracts the client ID from a URL in order to extract the client ID.
2. The method for efficiently extracting user client IDs from URLs according to claim 1, characterized in that, in step S1, the log collection unit is a Flume system capable of collecting, aggregating and transmitting massive distributed log files.
3. The method for efficiently extracting user client IDs from URLs according to claim 1, characterized in that, in step S2, the data is structured as follows:
The table structure of the data files is established in Hive, and MySQL is used to associate the Hive tables with HDFS, so that the data is structured.
4. The method for efficiently extracting user client IDs from URLs according to claim 1, characterized in that, in step S2, the ETL programs are deployed in the Hadoop cluster, and the ETL programs comprise a series of programs capable of cleaning, merging, uploading, high-compression encoding and distributed extraction of data.
5. The method for efficiently extracting user client IDs from URLs according to claim 1, characterized in that the Hadoop distributed system is built as follows:
Build a Hadoop 2.7.1 cluster environment with at least one master node and at least one slave node; set up and configure the Hive and HDFS environments; deploy Hive Metastore, MySQL and HiveServer2 on one master node; and configure NameNode HA and ResourceManager HA to build the distributed system.
6. The method for efficiently extracting user client IDs from URLs according to claim 1, characterized in that, in step S3, the UDF is combined with the function for extracting the client ID from the URL as follows:
S3-1: Develop a corresponding Hive UDF that can correctly extract IP addresses, so that Hive has UDF capability;
S3-2: After the program that extracts the client ID from the URL has been completed, based on Hive, on a local machine connected to the Hadoop cluster, compile it as a UDF to complete the combination with the function for extracting the client ID from the URL.
7. The method for efficiently extracting user client IDs from URLs according to claim 1, characterized in that a Tomcat distributed cluster is built on each node of the distributed system, and Nginx is used to allocate traffic among the machines hosting Tomcat.
8. The method for efficiently extracting user client IDs from URLs according to claim 1, characterized in that the following step is further included after step S3:
After the extracted client ID results are output, the results are further analyzed and/or reports are generated.
9. The method for efficiently extracting user client IDs from URLs according to claim 8, characterized in that the output results are displayed visually through a visualization configuration, which comprises any one or a combination of data collection visualization, data access visualization, data calculation visualization and data output visualization.
10. A system based on the method for efficiently extracting user client IDs from URLs according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710275446.9A CN107145542A (en) | 2017-04-25 | 2017-04-25 | The high efficiency extraction subscription client ID method and system from URL |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710275446.9A CN107145542A (en) | 2017-04-25 | 2017-04-25 | The high efficiency extraction subscription client ID method and system from URL |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107145542A true CN107145542A (en) | 2017-09-08 |
Family
ID=59774748
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710275446.9A Pending CN107145542A (en) | 2017-04-25 | 2017-04-25 | The high efficiency extraction subscription client ID method and system from URL |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107145542A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111259068A (en) * | 2020-04-28 | 2020-06-09 | 成都四方伟业软件股份有限公司 | Data development method and system based on data warehouse |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103377260A (en) * | 2012-04-28 | 2013-10-30 | 阿里巴巴集团控股有限公司 | Analysis method and device of URLs (Uniform Resource Locator) of weblog |
CN104239532A (en) * | 2014-09-19 | 2014-12-24 | 浪潮(北京)电子信息产业有限公司 | Method and device for self-making user extraction information tool in Hive |
CN106570152A (en) * | 2016-10-28 | 2017-04-19 | 上海斐讯数据通信技术有限公司 | Mobile phone number volume extracting method and system |
CN106570153A (en) * | 2016-10-28 | 2017-04-19 | 上海斐讯数据通信技术有限公司 | Data extraction method and system for mass URLs |
-
2017
- 2017-04-25 CN CN201710275446.9A patent/CN107145542A/en active Pending
Non-Patent Citations (1)
Title |
---|
KALOR: "使用Hive UDF和GeoIP库为Hive加入IP识别功能", 《HTTPS://WWW.CNBLOGS.COM/LIKAI198981/P/3465365.HTML》 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20201105. Address after: No. 2-3167, zone a, Nonggang City, No. 2388, Donghuan Avenue, Hongjia street, Jiaojiang District, Taizhou City, Zhejiang Province. Applicant after: Taizhou Jiji Intellectual Property Operation Co.,Ltd. Address before: 201616 Shanghai city Songjiang District Sixian Road No. 3666. Applicant before: Phicomm (Shanghai) Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170908 |