CN109359231A - Information crawling method, server and storage medium for a distributed web crawler - Google Patents
- Publication number
- CN109359231A (application CN201711478979.3A)
- Authority
- CN
- China
- Prior art keywords
- crawler
- url
- information
- distributed network
- redis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Computer And Data Communications (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention provides an information crawling method, server and storage medium for a distributed web crawler. Network URLs are crawled simultaneously using multiple acquired IPs, and each crawled URL is encoded as a key value and stored in a Redis cluster; multiple crawler clients then obtain URLs from the Redis cluster at the same time and parse target information from the obtained URLs. The method uses multiple devices and multiple IPs in cooperation to crawl URLs on the Internet concurrently, so that useful information can be obtained from massive network resources better, faster and more accurately.
Description
Technical field
The present invention relates to the field of information technology, and in particular to an information crawling method, server and storage medium for a distributed web crawler.
Background
According to statistics, the number of pages on the Internet currently exceeds 20 billion. Research shows that close to 30% of these pages are duplicates, and there are also large numbers of dynamic pages. The use of client-side and server-side scripting languages causes the number of URLs (Uniform Resource Locators) pointing to the same Web (World Wide Web) information to grow exponentially. If a single server is used to crawl the required information from Internet pages, a great deal of time is needed and users cannot obtain the required information in time, which causes considerable inconvenience.
Therefore, the existing technology needs further improvement.
Summary of the invention
In view of the above shortcomings of the prior art, the object of the present invention is to provide an information crawling method, server and storage medium for a distributed web crawler, overcoming the inability of the prior art to quickly crawl required information from massive network resources.
A first embodiment provided by the present invention is an information crawling method of a distributed web crawler, comprising the following steps:
Crawling network URLs using multiple acquired IPs, encoding each crawled URL as a key value, and storing it in a Redis cluster;
Multiple crawler clients obtaining URLs from the Redis cluster and parsing target information from the obtained URLs.
Optionally, before the step of crawling network URLs using the multiple acquired IPs, the method further includes:
Obtaining idle IPs on the network and storing the idle IPs in MongoDB;
Multiple crawler clients obtaining IPs from the MongoDB.
Optionally, the step of encoding each crawled URL as a key value and storing it in the Redis cluster further includes:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, in a first primary key of the Redis cluster.
Optionally, the method further includes:
Transferring the URLs from which target information has been parsed to a second primary key of the Redis cluster.
Optionally, the step in which the multiple crawler clients obtain URLs from the Redis cluster and parse target information from the obtained URLs further includes:
The multiple crawler clients performing duplicate-check screening according to the records in the first primary key and the second primary key, and selecting from the first primary key the URLs that have not been parsed as target URLs;
Parsing the target URLs to obtain the target information.
A second embodiment provided by the present invention is a server, wherein the server includes a processor, a memory, and an information crawling control program of a distributed web crawler that is stored on the memory and can run on the processor. When executed by the processor, the information crawling control program of the distributed web crawler implements the following steps:
Crawling network URLs using multiple acquired IPs, encoding each crawled URL as a key value, and storing it in a Redis cluster;
Multiple crawler clients obtaining URLs from the Redis cluster and parsing target information from the obtained URLs.
Optionally, when executed by the processor, the information crawling control program of the distributed web crawler further implements the following steps:
Obtaining idle IPs on the network and storing the idle IPs in MongoDB;
Multiple crawler clients obtaining IPs from the MongoDB.
Optionally, when executed by the processor, the information crawling control program of the distributed web crawler further implements the following steps:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, in a first primary key of the Redis cluster;
Transferring the URLs from which target information has been parsed to a second primary key of the Redis cluster.
Optionally, when executed by the processor, the information crawling control program of the distributed web crawler further implements the following steps:
The multiple crawler clients performing duplicate-check screening according to the records in the first primary key and the second primary key, and selecting from the first primary key the URLs that have not been parsed as target URLs;
Parsing the target URLs to obtain the target information.
A third embodiment provided by the present invention is a computer-readable storage medium, wherein an information crawling control program of a distributed web crawler is stored on the computer-readable storage medium, and the information crawling control program, when executed by a processor, implements the steps of the information crawling method of the distributed web crawler.
Beneficial effects: the present invention provides an information crawling method, server and storage medium for a distributed web crawler. Network URLs are crawled using multiple acquired IPs, each crawled URL is encoded as a key value and stored in a Redis cluster, and URLs are then obtained from the Redis cluster and target information is parsed from them. The method uses multiple devices and multiple IPs in cooperation to crawl URLs on the Internet at the same time, so that useful information on the network can be obtained better, faster and more accurately.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the information crawling method of the distributed web crawler of the present invention.
Fig. 2 is a schematic diagram of the control principle of the client browser in a specific application embodiment of the method of the present invention.
Fig. 3 is a schematic structural diagram of the server of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention and are not intended to limit it.
Because crawling network resources with a single server takes a large amount of time, the invention discloses a method in which multiple distributed servers cooperate to crawl network resources. Using MongoDB and Redis, multiple servers crawl resources simultaneously, so that target information can be obtained quickly from massive network resources.
It should be noted here that MongoDB is a database based on distributed file storage; it is a high-performance, open-source, schema-free document database. Redis is an open-source, log-type key-value database written in ANSI C that supports networking, can be memory-based or persistent, and provides APIs (Application Programming Interfaces) for multiple languages. A Redis cluster is a set of multiple Redis instances.
A first embodiment provided by the present invention is an information crawling method of a distributed web crawler which, as shown in Fig. 1, comprises the following steps:
Step S1: crawling network URLs using multiple acquired IPs, encoding each crawled URL as a key value, and storing it in a Redis cluster.
To crawl URLs more efficiently, this step uses multiple IPs to collect URL information simultaneously. In a specific implementation, the multiple IPs are used directly as request IPs, or are packaged into client browsers, to crawl the URLs of websites.
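One way to read "packaging an IP into a client browser" is to send requests through that IP as an HTTP proxy while presenting a browser-like User-Agent header. The following is a minimal sketch under that assumption, using only the Python standard library; the function name `build_browser_like_opener`, the port number and the User-Agent string are illustrative and do not come from the patent.

```python
import urllib.request


def build_browser_like_opener(proxy_ip, proxy_port,
                              user_agent="Mozilla/5.0 (compatible; example-crawler)"):
    """Return an opener that routes requests through one proxy IP.

    Assumption: the patent's "idle IP" is usable as an HTTP proxy address.
    """
    proxy = "http://%s:%d" % (proxy_ip, proxy_port)
    # ProxyHandler maps URL schemes to the proxy that should carry them.
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    # Present a browser-like identity instead of the default Python one.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


# Building the opener performs no network access; a crawl would then call
# opener.open(url) for each page to fetch.
opener = build_browser_like_opener("203.0.113.7", 8080)
```

Each crawler client could hold one such opener per IP and create a new one whenever it switches IPs.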
To store the crawled URLs more effectively, in this step each crawled URL is encoded as a key value and then stored in the Redis cluster. Because the Redis cluster automatically eliminates duplicate URLs (a given key value can exist only once), storing the same URL repeatedly is avoided, and so is crawling the information at that URL more than once.
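The encoding and de-duplicating store described here can be sketched as follows. This is an illustration rather than the patent's implementation: a plain Python dict stands in for the Redis cluster so the example runs without a server, and the helper names `url_to_key`, `key_to_url` and `store_url` are hypothetical. Against a real Redis, a set-if-absent operation such as the HSETNX command would be one possible counterpart to `store_url`.

```python
import base64


def url_to_key(url):
    """base64-encode a URL so it can serve as a key value."""
    return base64.b64encode(url.encode("utf-8")).decode("ascii")


def key_to_url(key):
    """Recover the original URL from its key value."""
    return base64.b64decode(key.encode("ascii")).decode("utf-8")


def store_url(store, url):
    """Store url under its key; return True only if it was not stored before.

    `store` is a dict standing in for the Redis cluster (an assumption).
    """
    key = url_to_key(url)
    if key in store:
        return False  # duplicate URL: the key already exists
    store[key] = url
    return True


store = {}
store_url(store, "http://example.com/a")
store_url(store, "http://example.com/a")  # ignored, same key
```

Because the key is derived deterministically from the URL, two producers that crawl the same URL always produce the same key, which is what makes key uniqueness usable for de-duplication.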
However, many websites block an IP that makes too many consecutive requests: if one IP accesses a website too often, the website blocks that IP. To allow repeated access to such websites, this step preferably further includes: obtaining idle IPs on the network and storing the idle IPs in MongoDB; multiple crawler clients obtaining IPs from the MongoDB. Because idle IPs are obtained from the network and stored in MongoDB, a crawler client can conveniently obtain a stored IP from MongoDB, use the new IP to replace one it was using that has been blocked by a website, and re-initiate access to that website.
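The IP replacement logic in this paragraph can be sketched as a small pool with a retry loop. In this sketch an in-memory list stands in for the MongoDB collection of idle IPs, and the class `IdleIpPool`, the function `fetch_with_rotation`, and the convention that the `fetch` callback returns `None` when an IP is blocked are all assumptions made for illustration.

```python
class IdleIpPool:
    """In-memory stand-in for the MongoDB store of idle IPs (an assumption)."""

    def __init__(self, ips):
        self._idle = list(ips)
        self.blocked = set()

    def acquire(self):
        """Hand out one idle IP, or None when the pool is exhausted."""
        return self._idle.pop(0) if self._idle else None

    def mark_blocked(self, ip):
        """Record an IP that a website has blocked."""
        self.blocked.add(ip)


def fetch_with_rotation(url, pool, fetch, max_attempts=3):
    """Request url through successive IPs, replacing each blocked IP.

    `fetch(url, ip)` is a caller-supplied function that returns the page
    body, or None if the website blocked that IP.
    """
    for _ in range(max_attempts):
        ip = pool.acquire()
        if ip is None:
            break  # no idle IP left to try
        body = fetch(url, ip)
        if body is not None:
            return body
        pool.mark_blocked(ip)  # swap in a new IP on the next iteration
    return None
```

In a deployment, `acquire` would be a query against the MongoDB collection and `fetch` would perform the actual HTTP request through the chosen IP.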
It is conceivable that in this step multiple clients can act as producers and cooperate, each using IP values obtained from MongoDB to crawl URLs, so that URLs are crawled more efficiently.
Because this step uses a MongoDB database to store the IPs, the information can be stored in a distributed manner, which enables distributed cooperation in information crawling.
Step S2: multiple crawler clients obtain URLs from the Redis cluster and parse target information from the obtained URLs.
After multiple crawler clients (clients that implement the web crawler function, where the web crawler function can be a program or script that automatically grabs web information according to certain rules) have stored all the network URLs they crawled in the Redis cluster, other crawler clients can obtain the stored URLs and, acting as consumers, crawl the web page information at the obtained URLs, compare the target task to be processed against that web page information, and extract the corresponding target information.
To prevent clients acting as consumers from repeatedly crawling information from the same URL, the step of storing the crawled URLs in the Redis cluster further includes:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, in a first primary key of the Redis cluster.
Transferring the URLs from which target information has been parsed to a second primary key of the Redis cluster.
In this step, parsed and unparsed URLs are stored under different primary keys; that is, two different primary keys of the Redis cluster store, respectively, the URLs from which the required information has been crawled and the URLs from which it has not. Consumed and unconsumed URLs can thus be distinguished more easily, which on the one hand prevents clients from parsing a URL repeatedly and increasing the parsing workload, and on the other hand prevents unparsed URLs from being missed.
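The transfer between the two primary keys can be illustrated with two dicts, one per primary key. This sketch reads "transfer" as a move (the entry leaves the first store when it enters the second), which is an interpretation of the text; the function name `transfer_parsed` is hypothetical.

```python
def transfer_parsed(first_key, second_key, key):
    """Move one entry from the unparsed store to the parsed store.

    `first_key` and `second_key` are dicts standing in for the two Redis
    primary keys. Returns False if `key` is not in the unparsed store,
    for example because another consumer already transferred it.
    """
    url = first_key.pop(key, None)
    if url is None:
        return False
    second_key[key] = url
    return True


first_key = {"a2V5": "http://example.com/a"}
second_key = {}
transfer_parsed(first_key, second_key, "a2V5")
```

With a shared Redis cluster and several consumers, the remove-and-add pair would need to be performed atomically so that two clients cannot both claim the same entry; the sketch sidesteps this by running in a single process.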
In a specific embodiment, as more clients act as producers crawling network URLs, duplicate URLs may be stored in the Redis cluster, which would cause clients acting as consumers to parse already-parsed URLs again and increase the parsing workload. To avoid this, optionally, the step in which the multiple crawler clients obtain URLs from the Redis cluster and parse target information from the obtained URLs further includes:
The multiple crawler clients performing duplicate-check screening according to the records in the first primary key and the second primary key, and selecting from the first primary key the URLs that have not been parsed as target URLs;
Parsing the target URLs to obtain the target information.
Through the duplicate-check screening in the above steps, duplicate URLs that were crawled by different crawler clients and stored in the first and second primary keys of the Redis cluster are deleted, which avoids assigning and parsing the same URL more than once and improves parsing efficiency.
The information crawling method of the distributed web crawler provided by the invention combines Redis distributed storage, with its automatic management of unique key values, and MongoDB sharded non-relational data storage, and uses multiple devices cooperating in a producer-consumer pattern to crawl URLs on the Internet simultaneously. Crawling speed can thereby increase geometrically. Because a MongoDB database has a large storage capacity and supports sharded storage, there is no concern that the data volume is too large to store, and clients can conveniently obtain different IPs from different storage regions, which makes acquiring and storing network information more convenient.
With reference to Fig. 2, a specific implementation of the information crawling method of the distributed web crawler of the present invention includes the following steps:
Step H1, obtaining idle IPs on the network.
Step H2, simulating a client browser: the previously stored idle IPs are obtained from MongoDB and the program is packaged as a client browser, to prevent the software from being blocked when crawling web pages.
Step H3, link acquisition: multiple crawler clients are started as producers (which store the objects to be processed in some region) to traverse the URLs of the Internet. Each URL is base64-encoded into a key value and stored in Redis, and Redis's built-in unique key values ensure the uniqueness of every URL, so that distributed producers on multiple machines cooperate to obtain all the URLs.
Step H4, information acquisition: URLs that have not yet been processed are screened out, and the required data is parsed from them. The URLs processed by crawlers on different devices are therefore different, which achieves distributed cooperation.
Step H5, URL transfer: multiple consumers (which obtain the tasks to be processed from some object) obtain URLs from Redis and store the URLs they have processed under another primary key of Redis, so that it is known which URLs have already been processed.
Because the data volume is very large, the information is stored using MongoDB sharding, which also makes it convenient to read.
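Steps H3 to H5 can be tied together in a small single-process simulation of the producer-consumer pattern. This is only a sketch under stated assumptions: threads stand in for crawler clients on separate devices, the `UrlStore` class with its lock stands in for the shared Redis cluster, and the names `UrlStore`, `add`, `claim` and `run_consumers` are hypothetical. For brevity, `claim` moves a URL to the processed store before it is parsed, whereas the text moves it afterwards.

```python
import base64
import threading


class UrlStore:
    """Thread-safe in-memory stand-in for the two Redis primary keys."""

    def __init__(self):
        self._lock = threading.Lock()
        self.pending = {}  # first primary key: URLs not yet processed
        self.parsed = {}   # second primary key: URLs already processed

    def add(self, url):
        """Producer side (step H3): store a URL once under its base64 key."""
        key = base64.b64encode(url.encode("utf-8")).decode("ascii")
        with self._lock:
            if key in self.pending or key in self.parsed:
                return False
            self.pending[key] = url
            return True

    def claim(self):
        """Consumer side (steps H4/H5): atomically take one pending URL."""
        with self._lock:
            if not self.pending:
                return None
            key, url = self.pending.popitem()
            self.parsed[key] = url
            return url


def run_consumers(store, parse, workers=4):
    """Run `workers` consumer threads until no pending URL remains."""
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            url = store.claim()
            if url is None:
                return
            info = parse(url)  # caller-supplied parsing function
            with results_lock:
                results.append((url, info))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every URL is claimed under the lock, each one is handed to exactly one consumer, which mirrors the statement that crawlers on different devices process different URLs.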
A second embodiment provided by the present invention is a server. As shown in Fig. 3, the server 30 includes a processor 310, a memory 320, and an information crawling control program of a distributed web crawler that is stored on the memory 320 and can run on the processor 310. When executed by the processor, the information crawling control program of the distributed web crawler implements the following steps:
Crawling network URLs using multiple acquired IPs, encoding each crawled URL as a key value, and storing it in a Redis cluster; the function is implemented as described in step S1.
Multiple crawler clients obtaining URLs from the Redis cluster and parsing target information from the obtained URLs; the function is implemented as described in step S2.
Specifically, when executed by the processor, the information crawling control program of the distributed web crawler further implements the following steps:
Obtaining idle IPs on the network and storing the idle IPs in MongoDB;
Multiple crawler clients obtaining IPs from the MongoDB.
Specifically, when executed by the processor, the information crawling control program of the distributed web crawler further implements the following steps:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, in a first primary key of the Redis cluster;
Transferring the URLs from which target information has been parsed to a second primary key of the Redis cluster.
Specifically, when executed by the processor, the information crawling control program of the distributed web crawler further implements the following steps:
The multiple crawler clients performing duplicate-check screening according to the records in the first primary key and the second primary key, and selecting from the first primary key the URLs that have not been parsed as target URLs;
Parsing the target URLs to obtain the target information.
The server provided by the present invention executes the control program corresponding to the information crawling method provided by the invention, uses a MongoDB database for distributed storage of IPs and crawled information, and uses Redis to store the crawled URL information, so that multiple crawler clients cooperate and target information on the network is obtained more quickly.
The memory 320, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules. By running the non-volatile software programs, instructions and modules stored in the memory 320, the processor 310 executes the various functional applications and data processing of the server, that is, implements the information crawling method of the distributed web crawler of the above method embodiment.
The memory 320 may include a program storage area and a data storage area, where the program storage area can store an operating system and the application program required by at least one function, and the data storage area can store data created through use of the server. In addition, the memory 320 may include high-speed random access memory and may also include non-volatile memory, for example at least one disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory 320 optionally includes memory located remotely from the processor 310, and such remote memory can be connected to the server through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
A third embodiment provided by the present invention is a computer-readable storage medium on which an information crawling control program of a distributed web crawler is stored; when executed by a processor, the information crawling control program implements the steps of the information crawling method of the distributed web crawler.
The present invention provides an information crawling method, server and storage medium for a distributed web crawler. Idle IPs on the network are obtained and stored in a MongoDB database; each idle IP is packaged into a client browser; the packaged client browsers act as producers that crawl network URLs and store the crawled URLs in a Redis cluster; multiple client browsers act as consumers that obtain URLs from the Redis cluster and parse target information from the URLs each of them obtained. The method uses multiple devices in cooperation to crawl URLs on the Internet at the same time, so that useful information on the network can be obtained better, faster and more accurately.
It can be understood that those of ordinary skill in the art can make equivalent substitutions or changes according to the technical solution of the present invention and its inventive concept, and all such changes or substitutions shall fall within the protection scope of the appended claims of the present invention.
Claims (10)
1. An information crawling method of a distributed web crawler, characterized by comprising the following steps:
Crawling network URLs using multiple acquired IPs, encoding each crawled URL as a key value, and storing it in a Redis cluster;
Multiple crawler clients obtaining URLs from the Redis cluster and parsing target information from the obtained URLs.
2. The information crawling method of a distributed web crawler according to claim 1, characterized in that, before the step of crawling network URLs using the multiple acquired IPs, the method further includes:
Obtaining idle IPs on the network and storing the idle IPs in MongoDB;
Multiple crawler clients obtaining IPs from the MongoDB.
3. The information crawling method of a distributed web crawler according to claim 1, characterized in that the step of encoding each crawled URL as a key value and storing it in the Redis cluster further includes:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, in a first primary key of the Redis cluster.
4. The information crawling method of a distributed web crawler according to claim 1, characterized in that the method further includes:
Transferring the URLs from which target information has been parsed to a second primary key of the Redis cluster.
5. The information crawling method of a distributed web crawler according to claim 4, characterized in that the step in which the multiple crawler clients obtain URLs from the Redis cluster and parse target information from the obtained URLs further includes:
The multiple crawler clients performing duplicate-check screening according to the records in the first primary key and the second primary key, and selecting from the first primary key the URLs that have not been parsed as target URLs;
Parsing the target URLs to obtain the target information.
6. A server, characterized in that the server includes a processor, a memory, and an information crawling control program of a distributed web crawler that is stored on the memory and can run on the processor, wherein the information crawling control program of the distributed web crawler, when executed by the processor, implements the following steps:
Crawling network URLs using multiple acquired IPs, encoding each crawled URL as a key value, and storing it in a Redis cluster;
Multiple crawler clients obtaining URLs from the Redis cluster and parsing target information from the obtained URLs.
7. The server according to claim 6, characterized in that the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
Obtaining idle IPs on the network and storing the idle IPs in MongoDB;
Multiple crawler clients obtaining IPs from the MongoDB.
8. The server according to claim 6, characterized in that the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
base64-encoding each crawled URL into a key value, and saving the key value and the URL, in one-to-one correspondence, in a first primary key of the Redis cluster;
Transferring the URLs from which target information has been parsed to a second primary key of the Redis cluster.
9. The server according to claim 8, characterized in that the information crawling control program of the distributed web crawler, when executed by the processor, further implements the following steps:
The multiple crawler clients performing duplicate-check screening according to the records in the first primary key and the second primary key, and selecting from the first primary key the URLs that have not been parsed as target URLs;
Parsing the target URLs to obtain the target information.
10. A computer-readable storage medium, characterized in that an information crawling control program of a distributed web crawler is stored on the computer-readable storage medium, and the information crawling control program of the distributed web crawler, when executed by a processor, implements the steps of the information crawling method of a distributed web crawler according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711478979.3A CN109359231A (en) | 2017-12-29 | 2017-12-29 | A kind of information crawler method, server and the storage medium of distributed network crawler |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711478979.3A CN109359231A (en) | 2017-12-29 | 2017-12-29 | A kind of information crawler method, server and the storage medium of distributed network crawler |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109359231A true CN109359231A (en) | 2019-02-19 |
Family
ID=65349598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711478979.3A Pending CN109359231A (en) | 2017-12-29 | 2017-12-29 | A kind of information crawler method, server and the storage medium of distributed network crawler |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109359231A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109948079A (en) * | 2019-03-11 | 2019-06-28 | 湖南衍金征信数据服务有限公司 | A kind of method that distributed capture discloses page data |
CN110929126A (en) * | 2019-12-02 | 2020-03-27 | 杭州安恒信息技术股份有限公司 | Distributed crawler scheduling method based on remote procedure call |
CN112347325A (en) * | 2019-08-07 | 2021-02-09 | 国际商业机器公司 | Web crawler platform |
CN112422707A (en) * | 2020-10-22 | 2021-02-26 | 北京安博通科技股份有限公司 | Domain name data mining method and device and Redis server |
CN112487268A (en) * | 2020-12-14 | 2021-03-12 | 安徽经邦软件技术有限公司 | Data crawling implementation method based on distributed crawler technology |
CN113742549A (en) * | 2020-05-28 | 2021-12-03 | 上海交通大学 | Distributed crawler scheduling system and method based on computing resources |
CN113992700A (en) * | 2020-07-09 | 2022-01-28 | Tcl科技集团股份有限公司 | Instruction analysis method based on distributed network, terminal and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102932448A (en) * | 2012-10-30 | 2013-02-13 | 工业和信息化部电信传输研究所 | Distributed network crawler URL (uniform resource locator) duplicate removal system and method |
US20150161257A1 (en) * | 2013-12-11 | 2015-06-11 | Ebay Inc. | Web crawler optimization system |
CN105677918A (en) * | 2016-03-03 | 2016-06-15 | 浪潮软件股份有限公司 | Distributed crawler architecture based on Kafka and Quartz and implementation method thereof |
CN105893583A (en) * | 2016-04-01 | 2016-08-24 | 北京鼎泰智源科技有限公司 | Data acquisition method and system based on artificial intelligence |
CN107193960A (en) * | 2017-05-24 | 2017-09-22 | 南京大学 | A kind of distributed reptile system and periodicity increment grasping means |
CN107506502A (en) * | 2017-10-10 | 2017-12-22 | 山东浪潮云服务信息科技有限公司 | A kind of data collecting system and collecting method |
- 2017-12-29 CN CN201711478979.3A patent/CN109359231A/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102932448A (en) * | 2012-10-30 | 2013-02-13 | 工业和信息化部电信传输研究所 | Distributed network crawler URL (uniform resource locator) duplicate removal system and method |
US20150161257A1 (en) * | 2013-12-11 | 2015-06-11 | Ebay Inc. | Web crawler optimization system |
US9652538B2 (en) * | 2013-12-11 | 2017-05-16 | Ebay Inc. | Web crawler optimization system |
CN105677918A (en) * | 2016-03-03 | 2016-06-15 | 浪潮软件股份有限公司 | Distributed crawler architecture based on Kafka and Quartz and implementation method thereof |
CN105893583A (en) * | 2016-04-01 | 2016-08-24 | 北京鼎泰智源科技有限公司 | Data acquisition method and system based on artificial intelligence |
CN107193960A (en) * | 2017-05-24 | 2017-09-22 | 南京大学 | A kind of distributed reptile system and periodicity increment grasping means |
CN107506502A (en) * | 2017-10-10 | 2017-12-22 | 山东浪潮云服务信息科技有限公司 | A kind of data collecting system and collecting method |
Non-Patent Citations (1)
Title |
---|
罗娇敏等: "一种基于Redis的分布式爬虫系统设计与实现", 《软件》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109948079A (en) * | 2019-03-11 | 2019-06-28 | 湖南衍金征信数据服务有限公司 | A kind of method that distributed capture discloses page data |
CN112347325A (en) * | 2019-08-07 | 2021-02-09 | 国际商业机器公司 | Web crawler platform |
CN110929126A (en) * | 2019-12-02 | 2020-03-27 | 杭州安恒信息技术股份有限公司 | Distributed crawler scheduling method based on remote procedure call |
CN113742549A (en) * | 2020-05-28 | 2021-12-03 | 上海交通大学 | Distributed crawler scheduling system and method based on computing resources |
CN113992700A (en) * | 2020-07-09 | 2022-01-28 | Tcl科技集团股份有限公司 | Instruction analysis method based on distributed network, terminal and storage medium |
CN113992700B (en) * | 2020-07-09 | 2023-12-26 | Tcl科技集团股份有限公司 | Instruction analysis method, terminal and storage medium based on distributed network |
CN112422707A (en) * | 2020-10-22 | 2021-02-26 | 北京安博通科技股份有限公司 | Domain name data mining method and device and Redis server |
CN112487268A (en) * | 2020-12-14 | 2021-03-12 | 安徽经邦软件技术有限公司 | Data crawling implementation method based on distributed crawler technology |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109359231A (en) | A kind of information crawler method, server and the storage medium of distributed network crawler | |
US10362050B2 (en) | System and methods for scalably identifying and characterizing structural differences between document object models | |
EP2817730B1 (en) | System and method for context specific website optimization | |
CN106126693A (en) | The sending method of the related data of a kind of webpage and device | |
US20120016857A1 (en) | System and method for providing search engine optimization analysis | |
CN103412890A (en) | Webpage loading method and device | |
US11055268B2 (en) | Automatic updates for a virtual index server | |
US8977969B2 (en) | Dynamic web portal page | |
WO2017124692A1 (en) | Method and apparatus for searching for conversion relationship between form pages and target pages | |
CN103716319B (en) | A kind of apparatus and method of web access optimization | |
Mendoza et al. | BrowStEx: A tool to aggregate browser storage artifacts for forensic analysis | |
CN103905434A (en) | Method and device for processing network data | |
CN107391528A (en) | Front end assemblies Dependency Specification searching method and equipment | |
Eltahir et al. | Extracting knowledge from web server logs using web usage mining | |
CN110321510A (en) | Page rendering method and system | |
CN109714397A (en) | Internet proxy server management system | |
US20190286735A1 (en) | Construction and Use of a Virtual Index Server | |
US20180203907A1 (en) | Method and system for querying semantic information stored across several semantically enhanced resources of a resource structure | |
US9747262B1 (en) | Methods, systems, and computer program products for retrieving information from a webpage and organizing the information in a table | |
Huang et al. | Achieving fast page load for websites across multiple domains | |
Chen et al. | Optimization research and application of enterprise website based on web service | |
Bakariya et al. | An inclusive survey on data preprocessing methods used in web usage mining | |
Noskov | Smart City Webgis Applications: Proof of Work Concept For High-Level Quality-Of-Service Assurance | |
Yang et al. | Incorporating site-level knowledge for incremental crawling of web forums: A list-wise strategy | |
Verma et al. | Web Usage mining framework for Data Cleaning and IP address Identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20190321. Address after: 518000 No. 5, Industrial Avenue, Shekou Industrial Zone, Merchants Street, Nanshan District, Shenzhen City, Guangdong Province. Applicant after: SHENZHEN TCL NEW TECHNOLOGY Co.,Ltd. Address before: 510000 Building A2, Science Avenue 187 Business Plaza, Science City, Luogang District, Guangzhou City, Guangdong Province. Applicant before: GUANGZHOU TCL SMART HOME TECHNOLOGY Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190219 |