CN109359231A - Information crawling method, server and storage medium for a distributed web crawler - Google Patents

Info

Publication number
CN109359231A
CN109359231A CN201711478979.3A CN201711478979A CN109359231A CN 109359231 A CN109359231 A CN 109359231A CN 201711478979 A CN201711478979 A CN 201711478979A CN 109359231 A CN109359231 A CN 109359231A
Authority
CN
China
Prior art keywords
crawler
url
information
distributed network
redis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711478979.3A
Other languages
Chinese (zh)
Inventor
徐松柏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen TCL New Technology Co Ltd
Original Assignee
Guangzhou Tcl Smart Home Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Tcl Smart Home Technology Co Ltd filed Critical Guangzhou Tcl Smart Home Technology Co Ltd
Priority to CN201711478979.3A priority Critical patent/CN109359231A/en
Publication of CN109359231A publication Critical patent/CN109359231A/en
Pending legal-status Critical Current

Landscapes

  • Computer And Data Communications (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides an information crawling method, a server and a storage medium for a distributed web crawler. Network URLs are crawled simultaneously using multiple acquired IPs, and each crawled URL is encoded into a key value and stored in a redis cluster; multiple crawler clients then obtain URLs from the redis cluster simultaneously and parse target information from the obtained URLs. In the information crawling method provided by the present invention, multiple devices cooperate using multiple IPs to crawl all URLs on the Internet simultaneously, so that useful information is obtained from massive network resources better, faster and more accurately.

Description

Information crawling method, server and storage medium for a distributed web crawler
Technical field
The present invention relates to the field of information technology, and in particular to an information crawling method, a server and a storage medium for a distributed web crawler.
Background
According to statistics, the Internet currently holds more than 20 billion web pages. Research shows that close to 30% of these pages are duplicates, and a large number of dynamic pages also exist. The use of client-side and server-side scripting languages causes the number of URLs (Uniform Resource Locators) pointing to the same Web (World Wide Web) information to grow exponentially. If a single server is used to crawl the required information from the web pages of the Internet, a large amount of time is needed and the user cannot obtain the required information in time, which causes considerable inconvenience.
Therefore, the prior art needs further improvement.
Summary of the invention
In view of the above shortcomings of the prior art, the object of the present invention is to provide users with an information crawling method, a server and a storage medium for a distributed web crawler, overcoming the defect in the prior art that required information cannot be quickly crawled from massive network resources.
A first embodiment provided by the present invention is an information crawling method for a distributed web crawler, comprising the following steps:
crawling network URLs using multiple acquired IPs, encoding each crawled URL into a key value, and storing it in a redis cluster;
multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs.
Optionally, before the step of crawling network URLs using the multiple acquired IPs, the method further includes:
obtaining idle IPs on the network and storing the idle IPs in MongoDB;
multiple crawler clients obtaining IPs from the MongoDB.
Optionally, the step of encoding each crawled URL into a key value and storing it in the redis cluster further includes:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, under a first primary key of the redis cluster.
Optionally, the method further includes:
transferring URLs whose target information has been parsed to a second primary key of the redis cluster.
Optionally, the step of the multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs further includes:
the multiple crawler clients performing duplicate-check screening according to the records under the first primary key and the second primary key, and selecting from the first primary key URLs that have not been parsed as target URLs;
parsing the target URLs to obtain the target information.
A second embodiment provided by the present invention is a server, comprising: a processor, a memory, and an information crawling control program of a distributed web crawler that is stored in the memory and executable on the processor, wherein the information crawling control program, when executed by the processor, implements the following steps:
crawling network URLs using multiple acquired IPs, encoding each crawled URL into a key value, and storing it in a redis cluster;
multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs.
Optionally, the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
obtaining idle IPs on the network and storing the idle IPs in MongoDB;
multiple crawler clients obtaining IPs from the MongoDB.
Optionally, the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, under a first primary key of the redis cluster;
transferring URLs whose target information has been parsed to a second primary key of the redis cluster.
Optionally, the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
the multiple crawler clients performing duplicate-check screening according to the records under the first primary key and the second primary key, and selecting from the first primary key URLs that have not been parsed as target URLs;
parsing the target URLs to obtain the target information.
A third embodiment provided by the present invention is a computer-readable storage medium storing an information crawling control program of a distributed web crawler, wherein the information crawling control program, when executed by a processor, implements the steps of the information crawling method of the distributed web crawler.
Beneficial effects: the present invention provides an information crawling method, a server and a storage medium for a distributed web crawler. Network URLs are crawled using multiple acquired IPs, each crawled URL is encoded into a key value and stored in a redis cluster, and URLs are obtained from the redis cluster and target information is parsed from them. In the information crawling method provided by the present invention, multiple devices cooperate using multiple IPs to crawl all URLs on the Internet simultaneously, so that useful information is obtained from the network better, faster and more accurately.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the information crawling method for a distributed web crawler according to the present invention.
Fig. 2 is a schematic diagram of the control principle in a specific application embodiment of the method of the present invention.
Fig. 3 is a schematic structural diagram of the server according to the present invention.
Detailed description
To make the objects, technical solutions and advantages of the present invention clearer and more explicit, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.
Because crawling network resources with a single server takes a large amount of time, the present invention discloses a method in which multiple distributed servers cooperate to crawl network resources. Using MongoDB and redis, multiple servers crawl resources simultaneously, so that target information is obtained quickly from massive network resources.
It should be noted here that MongoDB is a database based on distributed file storage; it is a high-performance, open-source, schema-free document database. Redis is an open-source key-value database written in ANSI C that supports networking, can run in memory or with persistence, is log-structured, and provides APIs (Application Programming Interfaces) for multiple languages. A redis cluster is a set of multiple redis instances.
A first embodiment provided by the present invention is an information crawling method for a distributed web crawler which, as shown in Fig. 1, comprises the following steps:
Step S1: crawl network URLs using multiple acquired IPs, encode each crawled URL into a key value, and store it in the redis cluster.
To crawl URLs more efficiently, this step uses multiple IPs to collect URL information simultaneously. In a specific embodiment, the multiple IPs are either used as request IPs or packaged directly into client browsers to crawl website URLs.
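As an illustration of packaging an IP into a client browser, the following is a minimal sketch using only the Python standard library. It assumes that each acquired IP is an HTTP proxy reachable on a known port; the function name `build_client`, the port 8080, the User-Agent string and the example addresses are hypothetical and do not come from the patent. The sketch only builds the clients and sends no request.

```python
import urllib.request


def build_client(proxy_ip, port, user_agent="Mozilla/5.0 (sketch)"):
    """Return an opener that sends its requests through one proxy IP.

    Assumption: the acquired IP is an HTTP proxy listening on `port`.
    """
    proxy_url = f"http://{proxy_ip}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    # A browser-like User-Agent header stands in for "simulating a client browser".
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


# One client per acquired IP (documentation-range addresses, purely illustrative).
clients = [build_client(ip, 8080) for ip in ("192.0.2.10", "192.0.2.11")]
```

Each opener could then fetch a page with `opener.open(url)`, so that different clients reach the same website from different IPs.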
To better store the crawled URLs, in this step each crawled URL is encoded into a key value and then stored in the redis cluster. Because key values in the redis cluster are unique, a duplicate URL is automatically eliminated, which avoids storing the same URL twice and also avoids repeatedly crawling the information of that URL.
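The key-value storage described above can be sketched as follows. This is not the patent's implementation: the class `FakeRedisHash` is a hypothetical in-memory stand-in for one redis hash, and its `hsetnx` method only imitates the behaviour of the Redis HSETNX command (set a field only if it does not yet exist), so that the example runs without a redis server.

```python
import base64


class FakeRedisHash:
    """In-memory stand-in (an assumption) for one redis hash: field -> value."""

    def __init__(self):
        self._fields = {}

    def hsetnx(self, field, value):
        """Set the field only if it is absent; return 1 if set, 0 otherwise."""
        if field in self._fields:
            return 0
        self._fields[field] = value
        return 1

    def hlen(self):
        return len(self._fields)


def url_to_key(url):
    """Base64-encode a URL into the key value under which it is stored."""
    return base64.b64encode(url.encode("utf-8")).decode("ascii")


def store_url(url_hash, url):
    """Store a crawled URL; return True only if it was not stored before."""
    return url_hash.hsetnx(url_to_key(url), url) == 1


first_key = FakeRedisHash()
added = [store_url(first_key, u) for u in
         ("http://example.com/a", "http://example.com/b", "http://example.com/a")]
```

In this sketch the third call returns False because the same URL always produces the same base64 key, which is the deduplication effect the paragraph describes.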
However, many websites block an IP when the number of consecutive accesses from that single IP becomes too large, so an IP that accesses a website too many times will be blocked by that website. To allow information to be accessed repeatedly on such websites, this step preferably further includes: obtaining idle IPs on the network and storing the idle IPs in MongoDB; and multiple crawler clients obtaining IPs from the MongoDB. Because idle IPs are obtained from the network and stored in MongoDB, a crawler client can conveniently obtain a stored IP from MongoDB, replace an IP that has been blocked by a website with a new IP obtained from MongoDB, and re-initiate access to that website.
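The IP replacement described above can be sketched as follows. The names `IdleIpPool`, `CrawlerClient` and `send` are hypothetical; `IdleIpPool` is an in-memory stand-in for the MongoDB collection of idle IPs, and treating HTTP status 403 as the sign of a blocked IP is an assumption made only for this example.

```python
from collections import deque


class IdleIpPool:
    """In-memory stand-in (an assumption) for the idle IPs kept in MongoDB."""

    def __init__(self, ips):
        self._idle = deque(ips)

    def acquire(self):
        """Hand out the next idle IP, or fail if the pool is exhausted."""
        if not self._idle:
            raise LookupError("no idle IP left in the pool")
        return self._idle.popleft()


class CrawlerClient:
    """Crawler client that swaps its IP when a website blocks it."""

    BLOCKED_STATUS = 403  # assumption: the site answers 403 to a blocked IP

    def __init__(self, pool):
        self.pool = pool
        self.ip = pool.acquire()

    def fetch(self, url, send):
        """Call send(url, ip); on a block, take a new IP and retry once."""
        status = send(url, self.ip)
        if status == self.BLOCKED_STATUS:
            self.ip = self.pool.acquire()
            status = send(url, self.ip)
        return status
```

Here `send` is any callable that performs the request through the given IP and returns a status code, which keeps the sketch independent of a particular HTTP library.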
It is conceivable that in this step multiple clients can act as producers and cooperate, each crawling URLs with an IP obtained from MongoDB, so as to crawl URLs more efficiently.
Because this step uses a MongoDB database to store the IPs, distributed storage of the information can be achieved, which enables distributed cooperation in information crawling.
Step S2: multiple crawler clients obtain URLs from the redis cluster and parse target information from the obtained URLs.
After multiple crawler clients (clients that implement the web crawler function, where the web crawler function may be a program or script that automatically grabs web information according to certain rules) have stored all the crawled network URLs in the redis cluster, multiple other crawler clients can obtain the stored URLs and, acting as consumers, crawl the web page information of the obtained URLs, compare the target task to be processed with the web page information, and crawl out the corresponding target information.
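The patent does not specify how a consumer compares the target task with the page information. As one possible reading, the sketch below assumes that the target task is a keyword and that the target information is every text fragment of the page containing it. The names `KeywordExtractor` and `parse_target_info` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class KeywordExtractor(HTMLParser):
    """Collect the text fragments of a page that mention a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.matches = []

    def handle_data(self, data):
        # Called by HTMLParser for the text found between tags.
        text = data.strip()
        if text and self.keyword in text.lower():
            self.matches.append(text)


def parse_target_info(html, keyword):
    """Return the text fragments of `html` that contain `keyword`."""
    extractor = KeywordExtractor(keyword)
    extractor.feed(html)
    extractor.close()
    return extractor.matches


page = "<html><body><h1>TV prices</h1><p>Smart TV: 2999</p><p>Fridge: 1999</p></body></html>"
info = parse_target_info(page, "tv")
```

A real consumer would first download the page for the URL taken from the redis cluster and then pass its HTML to a function of this kind.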
To prevent clients acting as consumers from repeatedly crawling information from the same URL, the step of storing the crawled URLs in the redis cluster further includes:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, under a first primary key of the redis cluster.
transferring URLs whose target information has been parsed to a second primary key of the redis cluster.
In this step, different primary keys are used to store parsed and unparsed URLs respectively; that is, two different primary keys of the redis cluster store, respectively, the URLs from which the required information has been crawled and the URLs from which it has not. Consumed and unconsumed URLs can thus be told apart, which prevents clients from parsing a URL repeatedly and increasing the parsing workload, and also prevents unparsed URLs from being missed.
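The transfer between the two primary keys can be sketched with plain dictionaries standing in for the redis cluster. The key names `crawler:pending` and `crawler:parsed` and the function `mark_parsed` are hypothetical. Against a real redis server the same move would presumably be a field deletion under the first key and a field write under the second, ideally executed atomically, but that is an assumption about the implementation and not something the patent states.

```python
import base64

FIRST_KEY = "crawler:pending"   # hypothetical name of the first primary key
SECOND_KEY = "crawler:parsed"   # hypothetical name of the second primary key


def url_to_key(url):
    """Base64-encode a URL into its key value."""
    return base64.b64encode(url.encode("utf-8")).decode("ascii")


def mark_parsed(store, key_value):
    """Move one URL from the first primary key to the second."""
    url = store[FIRST_KEY].pop(key_value)
    store[SECOND_KEY][key_value] = url
    return url


# Dictionaries stand in for the two primary keys of the redis cluster.
store = {FIRST_KEY: {}, SECOND_KEY: {}}
for u in ("http://example.com/a", "http://example.com/b"):
    store[FIRST_KEY][url_to_key(u)] = u

moved = mark_parsed(store, url_to_key("http://example.com/a"))
```

After the call, the first primary key holds only URLs that still need parsing, and the second holds the record of what has been consumed.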
In a specific embodiment, when multiple clients act as producers to crawl network URLs, duplicate URLs may be stored in the redis cluster, causing clients acting as consumers to parse already-parsed URLs again and increasing the parsing workload. To avoid this, optionally, the step of the multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs further includes:
the multiple crawler clients performing duplicate-check screening according to the records under the first primary key and the second primary key, and selecting from the first primary key URLs that have not been parsed as target URLs;
parsing the target URLs to obtain the target information.
The duplicate-check screening in the above steps deletes duplicate URLs that were crawled by different crawler clients and stored under the first primary key and the second primary key of the redis cluster, which avoids assigning and parsing the same URL more than once and improves parsing efficiency.
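The duplicate-check screening can be sketched as a comparison of the records under the two primary keys. The sketch assumes that a producer may have stored again, under the first primary key, a URL that is already recorded under the second; the function name `screen_targets` and the sample key values are hypothetical, and plain dictionaries stand in for the redis records.

```python
def screen_targets(first, second):
    """Delete already-parsed entries from `first` and return the target URLs.

    `first` and `second` map key values to URLs, as stored under the first
    and second primary keys respectively.
    """
    for key_value in list(first):
        if key_value in second:
            del first[key_value]   # duplicate of a URL that was already parsed
    return dict(first)             # the remaining URLs have not been parsed yet


first = {"k1": "http://example.com/a", "k2": "http://example.com/b",
         "k3": "http://example.com/c"}
second = {"k2": "http://example.com/b"}
targets = screen_targets(first, second)
```

Only the URLs returned by the screening are handed to the parsing step, so no client is assigned a URL that another client has already consumed.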
The information crawling method for a distributed web crawler provided by the present invention combines redis distributed storage with automatic management of unique key values and MongoDB non-relational sharded data storage, and uses multiple devices cooperating in a producer-consumer pattern to crawl all URLs on the Internet simultaneously. The crawling speed thereby grows geometrically. Because a MongoDB database has a large storage capacity and supports sharded storage of information, there is no concern that the data volume is too large to store, and clients can conveniently obtain different IPs from different storage regions, which facilitates both the acquisition and the storage of network information.
With reference to Fig. 2, a specific implementation of the information crawling method for a distributed web crawler of the present invention comprises the following steps:
Step H1: obtain idle IPs from the network.
Step H2, simulate a client browser: obtain the previously stored idle IPs from MongoDB and have the program package each of them into a client browser, which prevents the software from being blocked when crawling web pages.
Step H3, link acquisition: start multiple crawler clients as producers (which store the objects to be processed in a given area) to traverse all URLs of the Internet; base64-encode each URL into a key value and store it in redis, whose built-in unique key values keep every stored URL unique, so that the producers cooperate across multiple machines in a distributed manner to obtain all URLs.
Step H4, information acquisition: screen out the URLs that have not yet been processed and parse the required data from them; the URLs processed by the crawlers on different devices are therefore also different, which achieves distributed cooperation.
Step H5, URL transfer: multiple consumers (which obtain the tasks to be processed from a given object) obtain URLs from redis and each store the URLs they have processed under another primary key of redis, so that it is known which URLs have already been processed.
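Steps H3 to H5 can be sketched as a small producer-consumer program. This is a simplified illustration under several assumptions: threads in one process stand in for separate devices, the hypothetical class `SharedStore` stands in for the redis cluster, all producers finish before the consumers start, and a URL is recorded as processed at the moment a consumer claims it rather than after parsing succeeds.

```python
import base64
import threading


class SharedStore:
    """In-memory stand-in (an assumption) for the redis cluster."""

    def __init__(self):
        self._lock = threading.Lock()
        self.pending = {}   # first primary key: URLs not yet processed
        self.parsed = {}    # second primary key: URLs already processed

    def add(self, url):
        """Step H3: store a URL under its base64 key unless already known."""
        key = base64.b64encode(url.encode("utf-8")).decode("ascii")
        with self._lock:
            if key in self.pending or key in self.parsed:
                return False
            self.pending[key] = url
            return True

    def claim(self):
        """Steps H4 and H5: take one unprocessed URL and record it as processed."""
        with self._lock:
            if not self.pending:
                return None
            key, url = self.pending.popitem()
            self.parsed[key] = url
            return url


def producer(store, urls):
    for url in urls:
        store.add(url)


def consumer(store, parse, results):
    while (url := store.claim()) is not None:
        results.append(parse(url))


def run(url_batches, n_consumers, parse):
    """Run one producer per batch, then `n_consumers` consumers."""
    store, results = SharedStore(), []
    producers = [threading.Thread(target=producer, args=(store, batch))
                 for batch in url_batches]
    consumers = [threading.Thread(target=consumer, args=(store, parse, results))
                 for _ in range(n_consumers)]
    for group in (producers, consumers):
        for thread in group:
            thread.start()
        for thread in group:
            thread.join()
    return store, results
```

Because `claim` moves a URL between the two dictionaries under a lock, two consumers can never receive the same URL, which mirrors the goal of having each device process different URLs.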
Because the data volume is very large, this information is stored using MongoDB sharding, which also makes it convenient to read.
A second embodiment provided by the present invention is a server. As shown in Fig. 3, the server 30 includes: a processor 310, a memory 320, and an information crawling control program of a distributed web crawler that is stored in the memory 320 and executable on the processor 310, wherein the information crawling control program, when executed by the processor, implements the following steps:
crawling network URLs using multiple acquired IPs, encoding each crawled URL into a key value, and storing it in a redis cluster; the function implemented is as described in step S1.
multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs; the function implemented is as described in step S2.
Specifically, the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
obtaining idle IPs on the network and storing the idle IPs in MongoDB;
multiple crawler clients obtaining IPs from the MongoDB.
Specifically, the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, under a first primary key of the redis cluster;
transferring URLs whose target information has been parsed to a second primary key of the redis cluster.
Specifically, the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
the multiple crawler clients performing duplicate-check screening according to the records under the first primary key and the second primary key, and selecting from the first primary key URLs that have not been parsed as target URLs;
parsing the target URLs to obtain the target information.
The server provided by the present invention executes the control program corresponding to the information crawling method provided by the present invention, uses a MongoDB database for distributed storage of IPs and crawled information, and uses redis to store the crawled URL information, so that multiple crawler clients cooperate to obtain target information from the network quickly.
The storage device 320, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules. By running the non-volatile software programs, instructions and modules stored in the storage device 320, the processor 310 executes the various functional applications and data processing of the server, that is, implements the information crawling method for a distributed web crawler of the above method embodiment.
The storage device 320 may include a program storage area and a data storage area, where the program storage area can store an operating system and an application program required by at least one function, and the data storage area can store data created according to the use of the wall hole design system, and the like. In addition, the storage device 320 may include high-speed random access memory and may also include non-volatile storage, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the storage device 320 optionally includes storage devices remotely located relative to the processor 310, and these remote storage devices can be connected to the server through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network and combinations thereof.
A third embodiment provided by the present invention is a computer-readable storage medium storing an information crawling control program of a distributed web crawler, wherein the information crawling control program, when executed by a processor, implements the steps of the information crawling method of the distributed web crawler.
The present invention provides an information crawling method, a server and a storage medium for a distributed web crawler: idle IPs on the network are obtained and the multiple obtained idle IPs are stored in a MongoDB database; each idle IP is packaged into a client browser; each packaged client browser acts as a producer to crawl network URLs and stores the crawled URLs in the redis cluster; multiple client browsers act as consumers to obtain URLs from the redis cluster and parse target information from the URLs each has obtained. In the information crawling method provided by the present invention, multiple devices cooperate to crawl all URLs on the Internet simultaneously, so that useful information is obtained from the network better, faster and more accurately.
It will be understood that those of ordinary skill in the art can make equivalent substitutions or changes according to the technical solution of the present invention and its inventive concept, and all such changes or substitutions shall fall within the scope of protection of the appended claims of the present invention.

Claims (10)

1. An information crawling method for a distributed web crawler, characterized by comprising the following steps:
crawling network URLs using multiple acquired IPs, encoding each crawled URL into a key value, and storing it in a redis cluster;
multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs.
2. The information crawling method for a distributed web crawler according to claim 1, characterized in that, before the step of crawling network URLs using the multiple acquired IPs, the method further includes:
obtaining idle IPs on the network and storing the idle IPs in MongoDB;
multiple crawler clients obtaining IPs from the MongoDB.
3. The information crawling method for a distributed web crawler according to claim 1, characterized in that the step of encoding each crawled URL into a key value and storing it in the redis cluster further includes:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, under a first primary key of the redis cluster.
4. The information crawling method for a distributed web crawler according to claim 1, characterized in that the method further includes:
transferring URLs whose target information has been parsed to a second primary key of the redis cluster.
5. The information crawling method for a distributed web crawler according to claim 4, characterized in that the step of the multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs further includes:
the multiple crawler clients performing duplicate-check screening according to the records under the first primary key and the second primary key, and selecting from the first primary key URLs that have not been parsed as target URLs;
parsing the target URLs to obtain the target information.
6. A server, characterized in that the server includes: a processor, a memory, and an information crawling control program of a distributed web crawler that is stored in the memory and executable on the processor, wherein the information crawling control program of the distributed web crawler, when executed by the processor, implements the following steps:
crawling network URLs using multiple acquired IPs, encoding each crawled URL into a key value, and storing it in a redis cluster;
multiple crawler clients obtaining URLs from the redis cluster and parsing target information from the obtained URLs.
7. The server according to claim 6, characterized in that the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
obtaining idle IPs on the network and storing the idle IPs in MongoDB;
multiple crawler clients obtaining IPs from the MongoDB.
8. The server according to claim 6, characterized in that the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, under a first primary key of the redis cluster;
transferring URLs whose target information has been parsed to a second primary key of the redis cluster.
9. The server according to claim 8, characterized in that the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
the multiple crawler clients performing duplicate-check screening according to the records under the first primary key and the second primary key, and selecting from the first primary key URLs that have not been parsed as target URLs;
parsing the target URLs to obtain the target information.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores an information crawling control program of a distributed web crawler, and the information crawling control program of the distributed web crawler, when executed by a processor, implements the steps of the information crawling method for a distributed web crawler according to any one of claims 1 to 5.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711478979.3A CN109359231A (en) 2017-12-29 2017-12-29 Information crawling method, server and storage medium for a distributed web crawler

Publications (1)

Publication Number Publication Date
CN109359231A true CN109359231A (en) 2019-02-19

Family

ID=65349598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711478979.3A Pending CN109359231A (en) 2017-12-29 2017-12-29 Information crawling method, server and storage medium for a distributed web crawler

Country Status (1)

Country Link
CN (1) CN109359231A (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
  Effective date of registration: 20190321
  Address after: 518000 No. 5, Industrial Avenue, Shekou Industrial Zone, Merchants Street, Nanshan District, Shenzhen City, Guangdong Province
  Applicant after: SHENZHEN TCL NEW TECHNOLOGY Co.,Ltd.
  Address before: 510000 Building A2, Science Avenue 187 Business Plaza, Science City, Luogang District, Guangzhou City, Guangdong Province
  Applicant before: GUANGZHOU TCL SMART HOME TECHNOLOGY Co.,Ltd.
RJ01 Rejection of invention patent application after publication
  Application publication date: 20190219