CN107590188A - Automated crawling method for vertical niche domains and its management system - Google Patents
- Publication number
- CN107590188A, CN201710673166.3A
- Authority
- CN
- China
- Prior art keywords
- crawler
- engine
- time
- request
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The present invention relates to crawler crawling and management-and-scheduling technology, and aims to provide an automated crawling method for vertical niche domains and a management system for it. The method comprises the following processes: predicting crawler run time; optimizing the scheduling of a batch of crawlers according to the predicted times and the degree of parallelism; and crawling. For vertical niche-domain crawlers the invention is more efficient than the prior art: it introduces a run-time prediction model based on the characteristics and start-up parameters of vertical niche-domain crawlers, and schedules parallel crawlers efficiently with the Longest Processing Time first algorithm, which shortens the crawl time.
Description
Technical field
The present invention relates to the technical field of crawler crawling and management scheduling, and in particular to an automated crawling method for vertical niche domains and its management system.
Background art
Although the information age, with its explosion of data, holds massive amounts of information and data about every industry, people's capacity to receive and process information is limited. Valuable time is often taken up by large amounts of useless, redundant information, and it keeps getting harder to obtain personalized information. Vertical niche domains and personalized recommendation emerged in response. A vertical niche domain concentrates attention and services on one specific category, and crawling the data of a vertical niche domain is an important, foundational task for services such as personalized recommendation.
A web crawler is a program that automatically fetches web page content and performs operations such as structuring and persisting that information. By function, crawlers fall roughly into whole-web crawlers and vertical crawlers. A whole-web crawler mainly serves data collection for search engines; it crawls to a large depth and can capture massive amounts of data efficiently. A vertical crawler targets specific websites or specific pages; it crawls to a small depth, its targets usually have an obvious structure, and it serves data collection for a vertical domain exclusively.
A typical whole-web crawler takes a queue of starting URLs as input and parses the URLs layer by layer using a depth-first or breadth-first algorithm; combined with frameworks such as distributed platforms, it can capture massive amounts of data efficiently. In vertical niche domains, however, where specific data features must be collected for many targets, the quality of the data obtained by a whole-web crawler is unsatisfactory. Research on the customization, scheduling and management of vertical niche-domain crawlers is therefore particularly urgent. Once the number of crawlers reaches a certain order of magnitude, the stages of crawler configuration, crawler scheduling, crawler execution and data processing need to be tightly integrated into a complete management and scheduling framework for vertical-domain crawlers.
The implementations most similar to the present invention are the following Chinese invention patent applications: "Web page content extraction method, apparatus and system" (application number: 201510124714.8), "A crawler technology based on web page crawling" (application number: 201310040090.2), "Construction method of a web crawler based on news deduplication" (application number: 200910153588.3), "A plug-in configurable vertical-domain web crawler implementation method" (application number: 201510131253.7), and "Distributed crawler task scheduling method based on a weighted round-robin algorithm" (application number: 201410073829.4).
Invention 1 ("Web page content extraction method, apparatus and system") proposes a web page content extraction method, apparatus and system. A business layer sends a request to the extraction system to extract a web page URL; on that request the extraction system calls a web crawler system to crawl the original page content specified by the URL; the extraction system then extracts from the original page content using an agreed template document as the matching standard, and returns the extracted content to the business layer. That invention makes full use of the back end's ability to crawl web pages, and by parsing the original page against an extraction template it can extract the content of specified tags. The scheme is described as adapting to all web page formats when extracting the content of specified tags, which improves both the ability to extract original pages and the flexibility of content extraction. However, it places fairly strict format requirements on the target website and crawls poorly when target pages are not uniformly formatted, so it is unsuitable for crawling the highly varied pages found in vertical niche domains.
Invention 2 crawls the corresponding resources from the Internet according to tasks that users create for Internet objects they specify, rewrites the URLs and stores them, so that Internet information is collected in a targeted way. In its embodiment, to raise system throughput and resource utilization, a task request is split into task shards after it is received. Each shard contains only one website, and the shards are executed in parallel by multiple crawlers, so the actual granularity of task scheduling is the task shard, which raises the throughput and resource utilization of the system. That invention only returns the page content requested by the user and the related links, without any further processing. When the number of crawler tasks grows and parallel capacity is limited, it cannot make full use of the available parallelism to reach the optimal crawl time.
The technical idea of Invention 3, the construction method of a web crawler based on news deduplication, is as follows. The text of a news headline is processed with Chinese word segmentation to extract the keywords in the text and the weight of each keyword. Based on experience, the N keywords with the highest weights are chosen to form a set of (keyword, weight) pairs, $C = \{(t_1, w_1), (t_2, w_2), (t_3, w_3), \dots, (t_N, w_N)\}$, where $t_i$ is the i-th keyword and $w_i$ is the weight of the i-th keyword. The elements of $C$ are sorted by weight $w_i$ in descending order, and the elements of each subset $C_i$ in the news collection are likewise sorted in descending order of keyword weight. A threshold is set for the similarity between $C$ and $C_i$, where similarity is measured by the number of keywords that occupy the same rank position in both sets. $C$ is compared with every $C_i$ in the news collection to decide whether their similarity exceeds the threshold: if it does, $C$ is considered a duplicate news item; otherwise $C$ is considered a non-duplicate news item. Compared with that algorithm, simhash takes more factors into account, its complexity is not high, and it is more accurate.
Invention 4 discloses a plug-in configurable vertical-domain web crawler implementation method comprising a fetching stage and an extraction stage. The fetching stage consists of a fetch configuration phase and a fetch program execution phase; the extraction stage consists of an extraction configuration phase and an extraction program execution phase. Through configuration it can fetch web pages and extract information for multiple domains with high accuracy, which addresses the unclear intent and low accuracy of traditional search engines. That invention likewise customizes crawlers for vertical niche domains, but defining the fetch and parse parameters through configuration files is inflexible and gives a poor user experience. It also ignores crawler parallelism, so its efficiency is low.
Invention 5 proposes a distributed crawler task scheduling method based on a weighted round-robin algorithm, comprising: 1) dividing web crawlers by scale into five classes: single-machine multi-threaded, homogeneous centralized, heterogeneous centralized, small distributed and large distributed; 2) deploying a master-slave architecture; 3) when a crawler node first connects to the master node, the master node assigns it an initial weight; 4) the master node repeatedly selects a crawler node according to the weighted round-robin scheduling algorithm and assigns it a URL task waiting to be crawled; 5) when a crawler node finishes a URL task it returns the result to the master node, and the master node updates that node's weight. That invention updates a crawler node's weight from the node's most recent task completion time and its number of unfinished tasks, and assigns the next task to a crawler node whose weight is below that of the master node. Because it does not take the estimated duration of a crawler into account during scheduling, it cannot optimize scheduling to the fullest extent.
Although the five patents above cover crawler scheduling strategies and deduplication of crawled content, when it comes to individually configured crawlers for vertical domains they have the following deficiencies:
1. They are all fairly general-purpose methods; in application scenarios that need large numbers of customized crawlers they cannot predict crawler run time or schedule crawlers efficiently.
2. None of them forms a complete, reasonably mature system spanning crawler configuration, crawler scheduling, crawler execution and data processing.
Summary of the invention
The main object of the present invention is to overcome the deficiencies of the prior art by providing an automated crawling and management-scheduling method for vertical niche domains based on individually customized crawlers. To solve the above technical problem, the solution of the present invention is as follows:
An automated crawling method for vertical niche domains is provided, comprising the following processes:
1. Predicting crawler run time.
With the parallel channels and crawler tasks fixed, that is, with the degree of crawler parallelism and the target websites fixed, the run time of each new crawler task is predicted with a linear regression model.
2. Optimizing the scheduling of a batch of crawlers according to the predicted times and the degree of parallelism.
Let the degree of crawler parallelism be m and let there be n independent crawler tasks, where crawler task i has predicted run time $t_i$. The Longest Processing Time first algorithm (LPT algorithm) is used so that the n crawler tasks are completed by the m parallel channels in as short a time as possible.
The LPT algorithm sorts the n crawler tasks by predicted run time and then assigns the longest remaining task, in turn, to the parallel channel that finishes earliest. According to the proof in the paper "Bounds on Multiprocessing Timing Anomalies", this greedy strategy attains an upper bound of $(4/3 - 1/(3m)) \cdot \mathrm{OPT}$, where m is the degree of crawler parallelism and OPT is the optimal time (the theoretical shortest run time).
3. Crawling.
Crawling comprises a crawl core and a data processing part, which together crawl the target websites of the automated vertical niche-domain crawlers.
The crawl core sends requests to the target website (the crawler's target website), parses the returned results and extracts their content to obtain structured content.
The data processing part filters and screens the structured content parsed by the crawl core and persists it to the database.
In the present invention, the linear regression model of process 1 is retrained at regular intervals. The training samples are the input parameters and actual run times of past crawlers, so a certain amount of crawler run data has to be accumulated at the outset, after which the model parameters are updated continually. Training the linear regression model comprises the following steps:
Step 1a): Quantify each qualitative variable (target website, crawled data category) among the crawler's start-up parameters. A qualitative variable with k possible values (k is a constant denoting the number of target websites or the number of crawled data categories) is converted into k-1 dummy 0/1 independent variables. For example, the crawled data categories are link, short text, long text and picture, 4 classes in all, so 3 0/1 variables are introduced to represent the data category; with k target websites, k-1 0/1 variables are introduced to represent the target website. One is subtracted because k 0/1 variables would be perfectly multicollinear, whereas multiple linear regression assumes that no linear relationship exists among the variables, that is, no variable may be a linear combination of the others. Adding the quantitative variables (number of pages requested, interval between requests) gives the linear regression feature values, that is, the quantified input features.
Define the quantified input features of a crawler as $X_i = (x_1, \dots, x_D)^{T}$ and its run time as $t_i$. The linear regression model is then:
$$t_i = t(X_i, W) = W^{T}\phi(X_i) \tag{1.1}$$
where $W = (\omega_0, \dots, \omega_D)^{T}$ and $\phi(X_i) = (1, x_1, \dots, x_D)^{T}$; D is the number of features in the input $X_i$, $x_i$ (i = 1, 2, ..., D) are the independent variables, and $\omega_i$ (i = 0, 1, ..., D) are the model parameters to be determined.
Step 1b): Using the least squares method, minimize the sum of squared differences between the predicted and actual times, and define the loss function:
$$E(W) = \sum_{i=1}^{N} \bigl(t_i - W^{T}\phi(X_i)\bigr)^{2} \tag{1.2}$$
where N is the number of samples, $t = (t_1, \dots, t_N)^{T}$ and $X = (X_1, \dots, X_N)^{T}$; $t_i$ (i = 1, 2, ..., N) is the actual run time of crawler i, $X_i$ (i = 1, 2, ..., N) is the input feature vector of crawler i, and W is the parameter vector to be determined.
Step 1c): Take the partial derivative of formula (1.2) with respect to W and set it to 0 to obtain the optimal parameter W that minimizes E(W):
$$W = (X^{T}X)^{-1}X^{T}t \tag{1.3}$$
where X is the matrix formed from the input feature vectors $X_i$ described above and t is the vector formed from the actual crawler run times described above.
Once the model parameter W has been trained, it can be used before crawling to predict the run time of each crawler task that is about to run.
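As an illustration of steps 1a) to 1c), the following minimal sketch encodes the qualitative variables as k-1 dummy variables and solves the normal equations with the standard library only. The website names, the dictionary keys (`site`, `category`, `pages`, `interval`, `runtime`) and the helper names are assumptions made for the example and do not come from the patent; the first level of each qualitative variable is treated as the baseline.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical levels; the first entry of each list is the baseline (all dummies 0).
SITES = ["site_a", "site_b", "site_c"]                      # k = 3 target websites
CATEGORIES = ["link", "short_text", "long_text", "image"]   # 4 data categories


def dummy(value: str, levels: Sequence[str]) -> List[float]:
    """Step 1a: encode a qualitative variable with k levels as k-1 0/1 dummies."""
    return [1.0 if value == level else 0.0 for level in levels[1:]]


def features(task: Dict[str, Any]) -> List[float]:
    """Build phi(X_i) = (1, x_1, ..., x_D): intercept, dummies, quantitative variables."""
    return ([1.0]
            + dummy(task["site"], SITES)
            + dummy(task["category"], CATEGORIES)
            + [float(task["pages"]), float(task["interval"])])


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; the history is not varied enough")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(history: List[Dict[str, Any]]) -> List[float]:
    """Steps 1b/1c: least squares via the normal equations X^T X W = X^T t."""
    rows = [features(task) for task in history]
    times = [float(task["runtime"]) for task in history]
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xtt = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(d)]
    return solve(xtx, xtt)


def predict(w: List[float], task: Dict[str, Any]) -> float:
    """Predicted run time W^T phi(X) for a new crawler task."""
    return sum(wi * xi for wi, xi in zip(w, features(task)))
```

Calling `fit` on a list of past tasks and then `predict` on a new task gives the estimate that the scheduling process consumes. Solving the normal equations directly keeps the sketch close to formula (1.3); a production system would more likely rely on a numerical library.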
In the present invention, the LPT algorithm of process 2 works as follows:
If the number of crawler tasks n ≤ the degree of parallelism m, each crawler task is assigned to its own batch program (that is, parallel program), and the shortest scheduling time equals the maximum predicted run time among the n crawler tasks.
If the number of crawler tasks n > the degree of parallelism m, the following operations are repeated until all n crawler tasks have been assigned:
Step 2a): Build the n crawler tasks into a max-heap H1 keyed on predicted run time.
Step 2b): Build the m parallel channels into a min-heap H2 keyed on the time at which each channel becomes available.
Step 2c): Assign the job at the top of H1 to the channel at the top of H2.
Step 2d): Add the processing time of the H1 top job to the H2 top channel and reinsert the channel into H2.
Step 2e): Delete the top element of heap H1.
Step 2f): Repeat steps 2c) to 2e) until every element of H1 has been deleted; the top element of heap H2 is then the shortest scheduling time.
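The two-heap procedure of steps 2a) to 2f) can be sketched with Python's `heapq` module as follows. The function name `lpt_schedule` and the task names are assumptions for illustration. Because `heapq` only provides a min-heap, H1 stores negated run times. The batch finishes when the last channel becomes free, so the sketch reports the largest available time left in H2 as the total schedule length.

```python
import heapq
from typing import Dict, List, Tuple


def lpt_schedule(predicted: Dict[str, float], m: int) -> Tuple[float, List[List[str]]]:
    """Assign crawler tasks to m parallel channels, longest predicted time first."""
    if m < 1:
        raise ValueError("m must be at least 1")
    # H1: max-heap on predicted run time, emulated by negating the key.
    h1 = [(-t, name) for name, t in predicted.items()]
    heapq.heapify(h1)
    # H2: min-heap of (time at which the channel becomes available, channel index).
    h2 = [(0.0, channel) for channel in range(m)]
    heapq.heapify(h2)
    plan: List[List[str]] = [[] for _ in range(m)]
    while h1:
        neg_time, name = heapq.heappop(h1)        # steps 2c and 2e: longest remaining task
        free_at, channel = heapq.heappop(h2)      # channel that becomes free earliest
        plan[channel].append(name)
        heapq.heappush(h2, (free_at - neg_time, channel))  # step 2d
    makespan = max(free_at for free_at, _ in h2)  # time at which the last channel finishes
    return makespan, plan


if __name__ == "__main__":
    times = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 2}
    print(lpt_schedule(times, 2))
```

When n ≤ m the same loop places every task on its own channel, so the returned time is simply the largest predicted run time, which matches the first case described above.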
In the present invention, the crawling of process 3 comprises the following steps:
Step 3a): The engine opens a website, finds the spider that handles that website, and asks the spider for the first URL to crawl.
Step 3b): The engine gets the first URL to crawl from the spider and schedules it in the scheduler as a Request.
Step 3c): The engine asks the scheduler for the next URL to crawl.
Step 3d): The scheduler returns the next URL to crawl to the engine, and the engine forwards the URL to the downloader through the downloader middleware.
Step 3e): Once the page has finished downloading, the downloader generates a Response for the page and sends it to the engine through the downloader middleware.
Step 3f): The engine receives the Response from the downloader and sends it through the spider middleware to the spider for processing.
Step 3g): The spider processes the Response and returns the scraped items and new Requests to the engine.
Step 3h): The engine passes the scraped items to the item pipeline, which further screens, cleans and persists the data, and passes the Requests to the scheduler.
Step 3i): Jump back to step 3b) and repeat until the scheduler has no more requests, at which point the engine closes the website.
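The control loop of steps 3a) to 3i) can be illustrated with a toy, single-threaded model. This is not Scrapy code: `run_engine`, `download`, `parse` and `pipeline` are hypothetical stand-ins for the engine, downloader, spider and item pipeline, middleware is omitted, and the `seen` set that skips already scheduled URLs is an assumption added so the loop terminates on cyclic links.

```python
from collections import deque
from typing import Any, Callable, Dict, Iterable, List, Union

Item = Dict[str, Any]


def run_engine(
    start_urls: List[str],
    download: Callable[[str], Any],
    parse: Callable[[str, Any], Iterable[Union[str, Item]]],
    pipeline: Callable[[Item], Item],
) -> List[Item]:
    """Toy engine loop: runs until the scheduler queue is empty (step 3i)."""
    scheduler = deque(start_urls)      # step 3b: schedule the first URLs
    seen = set(start_urls)             # assumption: never schedule a URL twice
    stored: List[Item] = []
    while scheduler:
        url = scheduler.popleft()      # steps 3c/3d: next URL from the scheduler
        response = download(url)       # steps 3d/3e: downloader produces a response
        for result in parse(url, response):   # steps 3f/3g: spider yields items and URLs
            if isinstance(result, str):       # a new request, represented here as a URL
                if result not in seen:
                    seen.add(result)
                    scheduler.append(result)  # step 3h: requests go back to the scheduler
            else:
                stored.append(pipeline(result))  # step 3h: items go to the pipeline
    return stored
```

In the real framework these hand-offs are asynchronous and pass through the middleware layers, so the sketch only shows the order in which the components exchange requests, responses and items.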
In the present invention, in process 3 the crawl core uses the simhash algorithm to convert web page content into a low-dimensional vector for similarity computation and deduplicates near-duplicate pages, comprising the following steps:
Step i): Extract from the database the simhash values of the page content crawled in the previous n days (the number of days depends on the propagation characteristics of the specific domain). For each newly crawled page (at the time its content is extracted), perform the following operations:
Word segmentation: extract the text content of the newly crawled page and segment it into feature words, remove the stop words from the feature words, then compute the term frequency of each feature word (the ratio of the number of times the word occurs in the text to the total number of words in the text) as its weight.
Quantification: apply a hash operation to each feature word to obtain a 0/1 hash string.
Merging: multiply the hash sequence value (the 0/1 hash string) computed for each feature word by that feature word's term-frequency weight, then sum all the hash sequence values bit position by bit position to form a single sequence string.
Dimensionality reduction: turn the accumulated sequence string into a 0/1 string (each position is set to 1 if it is greater than 0 and to 0 if it is less than 0), which is the final simhash value of the page content.
Step ii): Near-duplicate page filtering: compare the simhash value of the newly crawled page with the simhash values of existing pages by computing the Hamming distance between the two hash values. If the Hamming distance is less than 3, the two pages are considered near duplicates and the newly fetched page is discarded; otherwise the page is stored in the database and the existing simhash library is updated.
The simhash library holds the simhash values of all pages crawled so far.
In the present invention, the crawl core performs near-duplicate page deduplication only on news pages.
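The segmentation, quantification, merging and dimensionality-reduction steps can be sketched as below. Several choices are assumptions made for the example and are not specified in the source: whitespace splitting stands in for Chinese word segmentation, the stop-word list is illustrative, the per-word hash is the first 64 bits of SHA-256, the term frequency is computed over the words that remain after stop-word removal, and a 0 bit contributes the negative weight (the usual simhash convention, which is what makes the sign test in the last step meaningful).

```python
import hashlib
from collections import Counter
from typing import Iterable

BITS = 64
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative list only


def simhash(text: str) -> int:
    """Compute a 64-bit simhash with term-frequency weights."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    if not words:
        return 0
    total = len(words)
    acc = [0.0] * BITS
    for word, count in Counter(words).items():
        weight = count / total                       # term frequency as weight
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # quantification: 64-bit hash
        for bit in range(BITS):                      # merging: weighted per-bit sum
            acc[bit] += weight if (h >> bit) & 1 else -weight
    # dimensionality reduction: positive positions become 1, the rest 0
    return sum(1 << bit for bit in range(BITS) if acc[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(new_hash: int, known: Iterable[int], threshold: int = 3) -> bool:
    """Step ii: a page is a near duplicate if any stored hash is closer than threshold."""
    return any(hamming(new_hash, h) < threshold for h in known)
```

A new page would be stored, and its hash added to the library, only when `is_near_duplicate` returns False. Comparing against every stored hash is linear in the size of the library, which is one reason to limit the comparison to the previous n days as described above.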
A management system for the automated crawling method for vertical niche domains is provided, for configuring crawler parameters and for managing and monitoring crawlers in real time. The management system comprises a crawler crawl core layer and a crawler control management layer.
The crawler crawl core layer is based on the Scrapy crawler application framework. Scrapy uses the Twisted asynchronous networking library to handle network communication and includes various middleware interfaces, so it can meet a wide range of requirements.
The crawler crawl core layer comprises the following components:
Engine: controls the data processing flow of the whole system and triggers transactions.
Scheduler: accepts requests from the engine and enqueues them, and returns them to the engine when the engine asks for them.
Downloader: fetches web pages and returns the page content to the spider.
Spider: defines the crawling and parsing rules for a specific website (using parsing tools such as XPath).
Item pipeline: processes the items returned by the spider; its main tasks are cleaning, validation and data storage.
Downloader middleware: handles the requests and responses passing between the engine and the downloader.
Spider middleware: handles the spider's response input and request output.
Scheduler middleware: handles the requests and responses the engine sends to the scheduler.
The crawler control management layer comprises a crawler management back-end module and a crawler service layer module.
The crawler management back-end module uses the MVC model and, by calling the crawler service layer module, provides user-friendly interface-based management of the crawler service layer module. The management services offered through the interface include crawler parameter configuration, batch crawler creation, batch crawler configuration, crawler start, batch crawler start, crawler timing, crawler log viewing, crawler result viewing and crawler log persistence.
The crawler parameter configuration flow is: obtain all crawler information provided by the crawler service layer module and store it in the database, configure the parameter information for each crawler, and configure startable batch crawlers.
The crawler start flow is: obtain the crawler's parameter configuration, prepare a start request from it, and send the start request to the crawler service layer module; if the request succeeds, record the crawler's jobid and periodically send requests to the crawler service layer module to obtain the crawler's status until the crawler finishes.
The batch crawler start flow is: obtain the batch crawler's configuration, run each crawler in turn in the configured order following the crawler start flow, and send a request to persist the crawler results.
The timed batch crawler flow is: select a batch crawler, set the start time, and turn on timed crawling. Each time the timer fires, the batch crawler's configuration is fetched and checked to see whether its timed task has been cancelled; if it has, the timer is cancelled and a log entry is written; if not, the batch crawler start flow is executed. If the server restarts, all previously set timed tasks are restarted when the project starts up, by means of the configured container.
The crawler log viewing flow is: send a request to the crawler service layer module with the crawler's project, spider, page and pageSize as parameters to obtain the log information of the corresponding crawler.
The crawler service layer module wraps the operations of the crawler crawl core layer (starting, pausing, run monitoring, log viewing, crawl result viewing, persistence and so on) as a web service and provides a JSON API for deploying and controlling crawlers. This supports remote invocation and parallel scaling: unifying the various crawler operations in web services makes the whole crawler system easier to manage and schedule.
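The crawler start flow and the batch start flow described above amount to "submit, record the job id, poll until finished". A minimal sketch follows, assuming a hypothetical `CrawlerService` client whose `start` and `status` methods wrap the service layer's JSON API; the method names, the `"finished"` status string and the injectable `sleep` parameter are all assumptions, since the source does not specify the API.

```python
import time
from typing import Callable, Dict, List, Protocol


class CrawlerService(Protocol):
    """Hypothetical client for the crawler service layer."""

    def start(self, project: str, spider: str, params: Dict[str, str]) -> str: ...

    def status(self, job_id: str) -> str: ...


def start_and_wait(
    service: CrawlerService,
    project: str,
    spider: str,
    params: Dict[str, str],
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Crawler start flow: send the start request, record the job id, poll until done."""
    job_id = service.start(project, spider, params)
    if not job_id:
        raise RuntimeError("the service did not return a job id")
    while service.status(job_id) != "finished":   # assumed status value
        sleep(poll_seconds)
    return job_id


def start_batch(
    service: CrawlerService,
    project: str,
    spiders: List[str],
    params: Dict[str, Dict[str, str]],
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Batch start flow: run each configured crawler in order, one after another."""
    return [
        start_and_wait(service, project, spider, params.get(spider, {}), sleep=sleep)
        for spider in spiders
    ]
```

Passing `sleep` as a parameter keeps the polling loop easy to exercise without real waiting; a timed batch crawler would call `start_batch` from its timer callback after checking that the timed task has not been cancelled.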
Compared with the prior art, the beneficial effects of the present invention are:
For vertical niche-domain crawlers, crawling is more efficient than in the prior art: a run-time prediction model is introduced based on the characteristics and start-up parameters of vertical niche-domain crawlers, and parallel crawlers are scheduled efficiently with the Longest Processing Time first algorithm, which shortens the crawl time.
The whole flow of configuring, managing, scheduling and running vertical niche-domain crawlers and processing their data is combined into one automated, efficient system that is easy to manage (crawler status is monitored in real time, and crawling and data processing run on a timer) and highly scalable (crawlers are quick and convenient to configure).
Brief description of the drawings
Fig. 1 is the overall flow of the crawler scheduling algorithm.
Fig. 2 is the framework diagram of the crawler management system.
Fig. 3 is the flow chart of the crawler core layer algorithm.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments:
As shown in Fig. 2, a management system for the automated crawling method for vertical niche domains comprises a crawler crawl core layer and a crawler control management layer, and is used to configure crawler parameters and to manage and monitor crawlers in real time.
The crawler crawl core layer is based on the Scrapy crawler application framework. Scrapy uses the Twisted asynchronous networking library to handle network communication, has a clear architecture, and includes various middleware interfaces, so it can flexibly meet a wide range of requirements. Fig. 3 is the flow chart of the crawler core layer algorithm.
The crawler crawl core layer comprises the following components:
Engine: controls the data processing flow of the whole system and triggers transactions.
Scheduler: accepts requests from the engine and enqueues them, and returns them to the engine when the engine asks for them.
Downloader: fetches web pages and returns the page content to the spider.
Spider: defines the crawling and parsing rules for a specific website (using parsing tools such as XPath).
Item pipeline: processes the items returned by the spider; its main tasks are cleaning, validation and data storage.
Downloader middleware: handles the requests and responses passing between the engine and the downloader.
Spider middleware: handles the spider's response input and request output.
Scheduler middleware: handles the requests and responses the engine sends to the scheduler.
The crawler control management layer comprises a crawler management back-end module and a crawler service layer module.
The crawler management back-end module uses the MVC model and, by calling the crawler service layer module, provides user-friendly interface-based management of the crawler service layer module. The management services offered through the interface include crawler parameter configuration, batch crawler creation, batch crawler configuration, crawler start, batch crawler start, crawler timing, crawler log viewing, crawler result viewing and crawler log persistence.
The crawler parameter configuration flow is: obtain all crawler information provided by the crawler service layer module and store it in the database, configure the parameter information for each crawler, and configure startable batch crawlers.
The crawler start flow is: obtain the crawler's parameter configuration, prepare a start request from it, and send the start request to the crawler service layer module; if the request succeeds, record the crawler's jobid and periodically send requests to the crawler service layer module to obtain the crawler's status until the crawler finishes.
The batch crawler start flow is: obtain the batch crawler's configuration, run each crawler in turn in the configured order following the crawler start flow, and send a request to persist the crawler results.
The timed batch crawler flow is: select a batch crawler, set the start time, and turn on timed crawling. Each time the timer fires, the batch crawler's configuration is fetched and checked to see whether its timed task has been cancelled; if it has, the timer is cancelled and a log entry is written; if not, the batch crawler start flow is executed. If the server restarts, all previously set timed tasks are restarted when the project starts up, by means of the configured container.
The crawler log viewing flow is: send a request to the crawler service layer module with the crawler's project, spider, page and pageSize as parameters to obtain the log information of the corresponding crawler.
The crawler service layer module wraps the operations of the crawler crawl core layer, such as starting, pausing, run monitoring, log viewing, crawl result viewing and persistence, as a web service and provides a JSON API for deploying and controlling crawlers. This supports remote invocation and parallel scaling: unifying the various crawler operations in web services makes the whole crawler system easier to manage and schedule.
As shown in Fig. 1, an automated crawling method for vertical niche domains comprises the following processes:
1. Predicting crawler run time.
2. Optimizing the scheduling of a batch of crawlers according to the predicted times and the degree of parallelism.
3. Crawling.
Process 1:
Vertical niche-domain crawlers draw on a very large number of target websites, and news-type information in particular is updated frequently, so a large number of separate crawler tasks have to be started every day. Because their target websites and crawl parameters differ, the individual crawlers also vary greatly in crawl time. With the parallel channels and crawler tasks fixed, that is, with the degree of crawler parallelism and the target websites fixed, accurately predicting the run time of each new crawler task with a linear regression model and optimizing the crawler scheduling order accordingly can greatly improve crawl efficiency and shorten the crawl time.
The run time of each individual crawler is determined mainly by the target website and the crawler parameters, so a multiple linear regression model for time prediction built from completed crawler tasks can effectively predict the approximate run time of a new task.
The training of linear regression model (LRM) specifically includes following step:
Step 1a): Quantify each qualitative variable among the parameters. A qualitative variable with k possible values (for example, k categories of information) is converted into k-1 virtual (dummy) 0-1 independent variables; together with the quantitative variables (number of requested pages, interval between requests), these give the linear regression feature values;
Here k is a constant that denotes, for example, the number of target websites or the number of categories of crawled data;
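The encoding in step 1a) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `encode_features`, the parameter name `site`, and the sample values are assumptions, and the first listed level is arbitrarily chosen as the all-zeros baseline.

```python
def encode_features(categorical, levels, quantitative):
    """Encode one crawler's parameters as a numeric feature list.

    categorical:  dict name -> observed value
    levels:       dict name -> ordered list of the k possible values
    quantitative: list of numeric parameters, appended unchanged
    """
    features = []
    for name in sorted(levels):
        values = levels[name]
        # The first level is the baseline (all zeros), so k levels need k-1 dummies.
        for level in values[1:]:
            features.append(1.0 if categorical[name] == level else 0.0)
    features.extend(float(q) for q in quantitative)
    return features

# Hypothetical example: 3 target sites (k = 3 gives 2 dummy variables),
# plus 120 requested pages and a 0.5 s request interval.
levels = {"site": ["site_a", "site_b", "site_c"]}
x = encode_features({"site": "site_c"}, levels, [120, 0.5])
```

Using k-1 rather than k dummy variables avoids perfect collinearity with the constant term of the regression model.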
Define the quantized input features of crawler i as $X_i = (x_1, \ldots, x_D)^T$ and its run time as $t_i$. The linear regression model is then:
$$t_i = t(X_i, W) = W^T \phi(X_i) \tag{1.1}$$
where $W = (\omega_0, \ldots, \omega_D)^T$ and $\phi(X_i) = (1, x_1, \ldots, x_D)^T$; D is the number of features in the input $X_i$, $x_j$ (j = 1, 2, ..., D) are the independent variables, and $\omega_j$ (j = 0, 1, ..., D) are the model parameters to be determined;
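Step 1b): Using the least squares method, minimize the sum of squared differences between the predicted and actual times, and define the loss function:
$$E(W) = \frac{1}{2}\sum_{n=1}^{N}\left\{t_n - W^T\phi(X_n)\right\}^2 = \frac{1}{2}(t - XW)^T(t - XW) \tag{1.2}$$
where N is the number of samples, $t = (t_1, \ldots, t_N)^T$, and $X = (X_1, \ldots, X_N)^T$; $t_i$ (i = 1, 2, ..., N) is the actual run time of crawler i, $X_i$ (i = 1, 2, ..., N) is the input feature vector of crawler i, and W is the parameter vector to be determined;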
Step 1c): Take the partial derivative of formula (1.2) with respect to W and set it to 0 to obtain the optimal parameter W that minimizes E(W):
$$W = (X^T X)^{-1} X^T t \tag{1.3}$$
where X is the matrix formed from the input feature vectors $X_i$ described above, and t is the vector formed from the actual crawler run times described above.
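Steps 1b) and 1c) can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions: the function names, the Gauss-Jordan solver (used instead of forming the inverse in formula (1.3) explicitly), and the synthetic training data are not from the source.

```python
def fit_normal_equation(X, t):
    """Solve (X^T X) W = X^T t by Gauss-Jordan elimination.

    X: list of feature rows (without the leading 1); t: list of run times.
    Returns W = [w0, w1, ..., wD].
    """
    # phi(X_i) = (1, x_1, ..., x_D): prepend the constant term to every row.
    phi = [[1.0] + [float(v) for v in row] for row in X]
    d = len(phi[0])
    # Augmented matrix [X^T X | X^T t].
    A = [[sum(r[i] * r[j] for r in phi) for j in range(d)]
         + [sum(r[i] * ti for r, ti in zip(phi, t))] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(d):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][d] for i in range(d)]

def predict(W, x):
    """Predicted run time t = W^T phi(x), as in formula (1.1)."""
    return W[0] + sum(w * v for w, v in zip(W[1:], x))

# Synthetic data generated from t = 2 + 3*pages + 10*interval (illustrative only).
X = [[1, 0.5], [2, 0.5], [3, 1.0], [4, 2.0], [5, 1.0]]
t = [2 + 3 * p + 10 * s for p, s in X]
W = fit_normal_equation(X, t)
estimate = predict(W, [10, 1.0])
```

Because the synthetic run times are exactly linear in the features, the fitted W should recover the generating coefficients up to floating-point error; real crawl times would of course leave a residual.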
Process two:
Let the degree of crawler parallelism be m and let there be n independent crawler tasks, where crawler task i has predicted run time $t_i$. The Longest Processing Time first (LPT) algorithm is used so that the n crawler tasks are completed by the m parallel channels in as short a time as possible.
The LPT algorithm sorts the n crawler tasks by predicted run time and then assigns them, longest first, to whichever parallel channel finishes earliest. This greedy strategy guarantees a total time of at most (4/3 - 1/(3m)) OPT, where OPT is the optimal time. Specifically:
If the number of crawler tasks n ≤ the degree of parallelism m, crawler task i is assigned to batch program i, and the shortest scheduling time equals the maximum predicted run time among the n crawler tasks;
If the number of crawler tasks n > the degree of parallelism m, the following operations are repeated until all n crawler tasks have been assigned:
Step 2a): Build a max-heap H1 of the n crawler tasks keyed by predicted run time;
Step 2b): Build a min-heap H2 of the m parallel channels keyed by the time at which each becomes available;
Step 2c): Assign the job at the top of H1 to the channel at the top of H2;
Step 2d): Add the processing time of the job at the top of H1 to the channel at the top of H2 and reinsert that channel into H2;
Step 2e): Delete the top element of H1;
Step 2f): Repeat steps 2c) to 2e) until all elements of H1 have been deleted; the scheduling time is then the largest available time among the channels in H2, that is, the time at which the last channel finishes.
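The heap-based procedure in steps 2a) to 2f) can be sketched with Python's standard `heapq` module. The function name `lpt_schedule` and the sample run times are assumptions for illustration; `heapq` only provides a min-heap, so the max-heap H1 is emulated by negating the keys, and the total schedule length is read as the largest availability time left in H2.

```python
import heapq

def lpt_schedule(predicted_times, m):
    """Assign tasks to m channels longest-first; return (assignment, makespan)."""
    # H1: max-heap of tasks by predicted time (keys negated for heapq).
    h1 = [(-t, task) for task, t in enumerate(predicted_times)]
    heapq.heapify(h1)
    # H2: min-heap of channels by the time at which each becomes available.
    h2 = [(0.0, channel) for channel in range(m)]
    heapq.heapify(h2)
    assignment = {channel: [] for channel in range(m)}
    while h1:
        neg_t, task = heapq.heappop(h1)              # steps 2c) and 2e)
        available, channel = heapq.heappop(h2)
        assignment[channel].append(task)
        heapq.heappush(h2, (available - neg_t, channel))  # step 2d)
    # The schedule ends when the busiest channel finishes.
    makespan = max(available for available, _ in h2)
    return assignment, makespan

# Hypothetical predicted run times (e.g. minutes) for six crawler tasks, two channels.
assignment, makespan = lpt_schedule([7, 5, 4, 3, 3, 2], m=2)
```

The same function also covers the n ≤ m case: each task lands on its own idle channel and the result is the longest single predicted time.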
Process three:
Crawling comprises a crawl core part and a data processing part, which together implement automated crawling of target websites by crawlers for vertically subdivided fields. The crawl core part sends requests to the crawler's target website, then parses the returned results and extracts content to obtain structured content. The data processing part filters, screens, and persists to the database the structured content parsed by the crawl core part.
The specific crawling steps are as follows:
Step 3a): The engine opens a website, finds the spider that handles that website, and asks the spider for the first URL to crawl;
Step 3b): The engine gets the first URL to crawl from the spider and schedules it in the scheduler as a Request;
Step 3c): The engine asks the scheduler for the next URL to crawl;
Step 3d): The scheduler returns the next URL to crawl to the engine, and the engine forwards the URL to the downloader through the downloader middleware;
Step 3e): Once the page has been downloaded, the downloader generates a Response for the page and sends it to the engine through the downloader middleware;
Step 3f): The engine receives the Response from the downloader and sends it to the spider for processing through the spider middleware;
Step 3g): The spider processes the Response and returns the crawled items and new Requests to the engine;
Step 3h): The engine sends the crawled items to the item pipeline for further screening, cleaning, and persistence, and sends the Requests to the scheduler;
Step 3i): Jump back to step 3b) and repeat until there are no more requests in the scheduler, at which point the engine closes the website.
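The control flow of steps 3a) to 3i) can be imitated with a small in-memory simulation. This is a toy sketch, not Scrapy code: `FAKE_SITE`, `download`, `parse`, and `run_engine` are hypothetical stand-ins for the website, downloader, spider, and engine, the middleware layers are omitted, and the `seen` set that filters repeated requests is an added assumption.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (title, outgoing links).
FAKE_SITE = {
    "/": ("Home", ["/a", "/b"]),
    "/a": ("Article A", ["/b"]),
    "/b": ("Article B", []),
}

def download(url):
    """Downloader stand-in: returns a response dict instead of doing HTTP."""
    title, links = FAKE_SITE[url]
    return {"url": url, "title": title, "links": links}

def parse(response):
    """Spider stand-in: returns the crawled items and the follow-up requests."""
    items = [{"url": response["url"], "title": response["title"]}]
    return items, response["links"]

def run_engine(start_url):
    scheduler = deque([start_url])      # 3b: the first request is scheduled
    seen = {start_url}                  # assumption: skip already-scheduled URLs
    stored = []                         # item pipeline stand-in
    while scheduler:                    # 3i: loop until no requests remain
        url = scheduler.popleft()       # 3c/3d: next request from the scheduler
        response = download(url)        # 3e: downloader produces a response
        items, new_urls = parse(response)   # 3f/3g: spider handles the response
        stored.extend(items)            # 3h: items go to the pipeline
        for new_url in new_urls:        # 3h: new requests go to the scheduler
            if new_url not in seen:
                seen.add(new_url)
                scheduler.append(new_url)
    return stored

crawled = run_engine("/")
```

In a real Scrapy project the engine, scheduler, and downloader are provided by the framework and run asynchronously; only the spider and pipeline logic would normally be written by hand.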
Because the invention targets customized crawlers in highly vertical subdivided fields, the crawled topics are similar, and the Internet currently contains many web pages that are mirrors, copies, pages with embedded advertisements, or pages with minor modifications. This applies to news content in particular: a single headline story may be republished by multiple websites within a few days. If the large number of duplicate pages is not detected and filtered, the data become redundant and waste storage space, and duplicated results also appear when the data are later used to build, for example, a search engine. Adding near-duplicate web page removal is therefore necessary (especially when crawling long-text news content).
The main idea of deduplication is to compare the similarity of the contents of two web pages against a similarity threshold: if the similarity exceeds the threshold, the page is considered a near-duplicate and is discarded. The key to deduplication is therefore the similarity computation for web page content. The crawl core part of the invention uses the simhash algorithm to convert web page content into a low-dimensional vector for similarity computation and removes near-duplicate pages, specifically through the following steps:
Step i): For the web page content crawled in the previous n days (stored in the database; the number of days depends on the propagation characteristics of the specific field) and for each newly crawled web page, perform the following operations on each page:
Word segmentation: extract the text content of the page, segment it to obtain feature words, remove the stop words among the feature words, and then compute the tf-idf of each feature word as its weight;
Hashing: apply a hash function to each feature word to obtain a 0-1 hash string;
Merging: for the hash sequence computed for each feature word, first multiply it by the weight of that feature word (in the standard simhash algorithm, a 1 bit contributes +weight and a 0 bit contributes -weight), then sum all hash sequences bit position by bit position to obtain a single sequence;
Dimension reduction: convert the accumulated sequence into a 0-1 string (each position is set to 1 if its value is greater than 0 and to 0 if it is less than 0), which gives the final simhash value of the web page content;
Step ii): Near-duplicate page filtering: compare the simhash value of the newly crawled page with the simhash values of existing pages by computing the Hamming distance between the two hash values. If the Hamming distance is less than 3, the two pages are considered near-duplicates and the newly crawled page is discarded; otherwise, the page is stored in the database and the existing simhash store is updated;
The existing simhash store holds the simhash values of all pages crawled in the previous n days.
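A compact sketch of steps i) and ii) is given below. Several simplifications are assumptions made for illustration and are not from the source: the input is an already segmented list of feature words with stop words removed, plain term frequency stands in for the tf-idf weight (tf-idf needs corpus statistics), and the per-word hash is the first 64 bits of an MD5 digest.

```python
import hashlib
from collections import Counter

def simhash(words, bits=64):
    """Compute a simhash fingerprint from a list of feature words."""
    weights = Counter(words)            # assumption: term frequency as the weight
    totals = [0] * bits
    for word, weight in weights.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Merging: a 1 bit adds the weight, a 0 bit subtracts it.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        # Dimension reduction: positive sums become 1, the rest stay 0.
        if totals[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(new_hash, known_hashes, threshold=3):
    """True if any stored fingerprint is within the Hamming distance threshold."""
    return any(hamming_distance(new_hash, h) < threshold for h in known_hashes)

page = "company announces quarterly results and new product line".split()
fingerprint = simhash(page)
```

A newly crawled page whose fingerprint passes `is_near_duplicate` against the stored fingerprints would be discarded; otherwise its fingerprint would be appended to the store.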
Because the spread of news pages is time-sensitive, reprints usually appear soon after the original publication, so the number of page simhash values that must be compared is small. Together with the efficiency of the simhash algorithm, deduplication therefore has little impact on crawler efficiency in either time or space complexity.
Finally, it should be noted that the above are only specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, and many variations are possible. All variations that a person of ordinary skill in the art can directly derive or conceive from the disclosure of the invention shall be considered within the scope of protection of the invention.
Claims (7)
1. An automated crawler crawling method for vertically subdivided fields, characterized by comprising the following processes:
Process one: predicting the crawler run time;
When the parallel channels and the crawler tasks are fixed, that is, when the degree of crawler parallelism and the target websites are determined, the run time of each new crawler task is predicted using a linear regression model;
Process two: optimizing batch crawler scheduling according to the predicted times and the degree of parallelism;
Let the degree of crawler parallelism be m and let there be n independent crawler tasks, where crawler task i has a corresponding predicted run time; the longest processing time first algorithm is used so that the n crawler tasks are completed by the m parallel channels in as short a time as possible;
The longest processing time first algorithm sorts the n crawler tasks by predicted run time and then assigns them, longest first, to whichever parallel channel finishes earliest; this greedy strategy obtains an upper bound of (4/3 - 1/(3m)) OPT, where m is the degree of crawler parallelism and OPT is the optimal time;
Process three: crawling;
Crawling comprises a crawl core part and a data processing part, which together implement automated crawling of target websites by crawlers for vertically subdivided fields;
The crawl core part is used to send requests to the target website, and to parse the returned results and extract content to obtain structured content;
The data processing part is used to filter, screen, and persist to the database the structured content parsed by the crawl core part.
2. The automated crawler crawling method for vertically subdivided fields according to claim 1, characterized in that the linear regression model in process one is retrained at regular intervals, and training the linear regression model specifically comprises the following steps:
Step 1a): Quantify each qualitative variable among the start-up parameters of the crawler: each qualitative variable with k possible values is converted into k-1 virtual (dummy) independent variables taking the value 0 or 1; together with the quantitative variables, these give the linear regression feature values, that is, the quantized input features;
Define the quantized input features of crawler i as $X_i = (x_1, \ldots, x_D)^T$ and its run time as $t_i$; the linear regression model is then:
$$t_i = t(X_i, W) = W^T \phi(X_i) \tag{1.1}$$
where $W = (\omega_0, \ldots, \omega_D)^T$ and $\phi(X_i) = (1, x_1, \ldots, x_D)^T$; D is the number of features in the input $X_i$, $x_j$ (j = 1, 2, ..., D) are the independent variables, and $\omega_j$ (j = 0, 1, ..., D) are the model parameters to be determined;
Step 1b): Using the least squares method, minimize the sum of squared differences between the predicted and actual times, and define the loss function:
$$E(W) = \frac{1}{2}\sum_{n=1}^{N}\left\{t_n - W^T\phi(X_n)\right\}^2 = \frac{1}{2}(t - XW)^T(t - XW) \tag{1.2}$$
where N is the number of samples, $t = (t_1, \ldots, t_N)^T$, and $X = (X_1, \ldots, X_N)^T$; $t_i$ (i = 1, 2, ..., N) is the actual run time of crawler i, $X_i$ (i = 1, 2, ..., N) is the input feature vector of crawler i, and W is the parameter vector to be determined;
Step 1c): Take the partial derivative of formula (1.2) with respect to W and set it to 0 to obtain the optimal parameter W that minimizes E(W):
$$W = (X^T X)^{-1} X^T t \tag{1.3}$$
where X is the matrix formed from the input feature vectors $X_i$ described above, and t is the vector formed from the actual crawler run times described above;
Once the model parameters W have been trained, they can be used before crawling to predict the run time of each crawler task that is about to run.
3. The automated crawler crawling method for vertically subdivided fields according to claim 1, characterized in that, in process two, the longest processing time first algorithm specifically means:
If the number of crawler tasks n ≤ the degree of parallelism m, each crawler task is assigned to its own batch program (that is, parallel program), and the shortest scheduling time equals the maximum predicted run time among the n crawler tasks;
If the number of crawler tasks n > the degree of parallelism m, the following operations are repeated until all n crawler tasks have been assigned:
Step 2a): Build a max-heap H1 of the n crawler tasks keyed by predicted run time;
Step 2b): Build a min-heap H2 of the m parallel channels keyed by the time at which each becomes available;
Step 2c): Assign the job at the top of H1 to the channel at the top of H2;
Step 2d): Add the processing time of the job at the top of H1 to the channel at the top of H2 and reinsert that channel into H2;
Step 2e): Delete the top element of H1;
Step 2f): Repeat steps 2c) to 2e) until all elements of H1 have been deleted; the scheduling time is then the largest available time among the channels in H2, that is, the time at which the last channel finishes.
4. The automated crawler crawling method for vertically subdivided fields according to claim 1, characterized in that the crawling of process three specifically comprises the following steps:
Step 3a): The engine opens a website, finds the spider that handles that website, and asks the spider for the first URL to crawl;
Step 3b): The engine gets the first URL to crawl from the spider and schedules it in the scheduler as a Request;
Step 3c): The engine asks the scheduler for the next URL to crawl;
Step 3d): The scheduler returns the next URL to crawl to the engine, and the engine forwards the URL to the downloader through the downloader middleware;
Step 3e): Once the page has been downloaded, the downloader generates a Response for the page and sends it to the engine through the downloader middleware;
Step 3f): The engine receives the Response from the downloader and sends it to the spider for processing through the spider middleware;
Step 3g): The spider processes the Response and returns the crawled items and new Requests to the engine;
Step 3h): The engine sends the crawled items to the item pipeline for further screening, cleaning, and persistence, and sends the Requests to the scheduler;
Step 3i): Jump back to step 3b) and repeat until there are no more requests in the scheduler, at which point the engine closes the website.
5. The automated crawler crawling method for vertically subdivided fields according to claim 1, characterized in that, in process three, the crawl core part uses the simhash algorithm to convert web page content into a low-dimensional vector for similarity computation and removes near-duplicate web pages, specifically through the following steps:
Step i): Extract from the database the simhash values of the web page content crawled in the previous n days and, for each newly crawled web page, perform the following operations:
Word segmentation: extract the text content of the newly crawled web page, segment it to obtain feature words, remove the stop words among the feature words, and then compute the term frequency of each feature word as its weight;
Hashing: apply a hash function to each feature word to obtain a 0-1 hash string;
Merging: for the hash sequence computed for each feature word, first multiply it by the term-frequency weight of that feature word, then sum all hash sequences bit position by bit position to obtain a single sequence;
Dimension reduction: convert the accumulated sequence into a 0-1 string, which gives the final simhash value of the web page content;
Step ii): Near-duplicate page filtering: compare the simhash value of the newly crawled web page with the simhash values of existing web pages by computing the Hamming distance between the two hash values; if the Hamming distance is less than 3, the two web pages are considered near-duplicates and the newly crawled web page is discarded; otherwise, the web page is stored in the database and the existing simhash store is updated;
The simhash store holds the simhash values of all crawled web pages.
6. The automated crawler crawling method for vertically subdivided fields according to claim 5, characterized in that the crawl core part performs near-duplicate web page removal only on news web pages.
7. A management system for the automated crawler crawling method for vertically subdivided fields according to claim 1, used for parameter configuration, run management, and real-time monitoring of crawlers, characterized in that the management system comprises a crawler core layer and a crawler control and management layer;
The crawler core layer is based on the Scrapy crawler application framework; Scrapy uses the Twisted asynchronous networking library to handle network communication and includes various middleware interfaces, so it can meet a variety of needs;
The crawler core layer specifically comprises the following components:
Engine: controls the data processing flow of the whole system and triggers transaction processing;
Scheduler: receives requests from the engine and enqueues them, and returns them to the engine when the engine asks for them;
Downloader: fetches web pages and returns the web page content to the spider;
Spider: defines the crawling and parsing rules for a specific website;
Item pipeline: processes the items returned from the spider; its main tasks are cleaning, validating, and storing data;
Downloader middleware: handles the requests and responses passing between the engine and the downloader;
Spider middleware: handles the response input and request output of the spider;
Scheduler middleware: handles the requests and responses sent from the engine to the scheduler;
The crawler control and management layer comprises a crawler management back-end module and a crawler service layer module;
The crawler management back-end module uses the MVC model and, by calling the crawler service layer module, provides friendly interface-based management of the crawler service layer module; the management services provided include crawler parameter configuration, batch crawler creation, batch crawler configuration, crawler start, batch crawler start, crawler timing, crawler log viewing, crawler result viewing, and crawler log persistence;
The crawler parameter configuration flow is: obtain all crawler information provided by the crawler service layer module and store it in the database, configure parameter information for each crawler, and configure startable batch crawlers;
The crawler start flow is: obtain the parameter configuration of the crawler, prepare a start request according to the parameter configuration, and send the start request to the crawler service layer module; if it succeeds, record the jobid of the crawler and periodically send requests to the crawler service layer module to obtain the state of the crawler until the crawler terminates;
The batch crawler start flow is: obtain the batch crawler configuration, run each crawler in turn in the configured order following the crawler start flow, and send a request to persist the crawler results;
The timed batch crawler flow is: select a batch crawler, set the start time, and enable the timed crawler; each time the timer fires, obtain the configuration of the batch crawler and check whether its timer has been cancelled; if it has been cancelled, cancel the timer and record a log entry; if it has not been cancelled, execute the batch crawler start flow; if the server restarts, all previously set timers are restarted when the project starts, by configuring the container;
The crawler log viewing flow is: send a request to the crawler service layer module with the crawler's project, spider, page, and pageSize as parameters to obtain the log information of the corresponding crawler;
The crawler service layer module encapsulates the operations of the crawler core layer as a web service and provides a JSON API for deploying and controlling crawlers, thereby supporting remote invocation and parallel scaling.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710673166.3A CN107590188B (en) | 2017-08-08 | 2017-08-08 | Crawler crawling method and management system for automatic vertical subdivision field |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710673166.3A CN107590188B (en) | 2017-08-08 | 2017-08-08 | Crawler crawling method and management system for automatic vertical subdivision field |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107590188A (en) | 2018-01-16 |
CN107590188B (en) | 2020-02-14 |
Family
ID=61043186
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710673166.3A Active CN107590188B (en) | 2017-08-08 | 2017-08-08 | Crawler crawling method and management system for automatic vertical subdivision field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107590188B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090204478A1 (en) * | 2008-02-08 | 2009-08-13 | Vertical Acuity, Inc. | Systems and Methods for Identifying and Measuring Trends in Consumer Content Demand Within Vertically Associated Websites and Related Content |
CN103605764A (en) * | 2013-11-26 | 2014-02-26 | Tcl集团股份有限公司 | Web crawler system and web crawler multitask executing and scheduling method |
CN103870329A (en) * | 2014-03-03 | 2014-06-18 | 同济大学 | Distributed crawler task scheduling method based on weighted round-robin algorithm |
CN105956175A (en) * | 2016-05-24 | 2016-09-21 | 考拉征信服务有限公司 | Webpage content crawling method and device |
CN106209685A (en) * | 2016-07-08 | 2016-12-07 | 武汉烽火普天信息技术有限公司 | A kind of web crawlers distribution method of dynamic bandwidth towards mass data source and system |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109241391A (en) * | 2018-09-20 | 2019-01-18 | 四川长虹电器股份有限公司 | A kind of anti-crawler method climbed of solution font |
CN110968560A (en) * | 2018-09-29 | 2020-04-07 | 北京国双科技有限公司 | Log collector configuration method, device and system |
CN110968560B (en) * | 2018-09-29 | 2023-05-23 | 北京国双科技有限公司 | Configuration method, device and system of log collector |
CN111125482A (en) * | 2018-10-31 | 2020-05-08 | 北京国双科技有限公司 | Method and device for adjusting data crawling frequency, storage medium and processor |
CN111125482B (en) * | 2018-10-31 | 2023-04-07 | 北京国双科技有限公司 | Method and device for adjusting data crawling frequency, storage medium and processor |
CN109815537A (en) * | 2018-12-19 | 2019-05-28 | 清华大学 | A kind of high-throughput material simulation calculation optimization method based on time prediction |
CN109670101A (en) * | 2018-12-28 | 2019-04-23 | 北京奇安信科技有限公司 | Crawler dispatching method, device, electronic equipment and storage medium |
CN109918554A (en) * | 2019-02-13 | 2019-06-21 | 平安科技(深圳)有限公司 | Web data crawling method, device, system and computer readable storage medium |
CN110929128A (en) * | 2019-12-11 | 2020-03-27 | 北京启迪区块链科技发展有限公司 | Data crawling method, device, equipment and medium |
CN111026947B (en) * | 2019-12-18 | 2022-08-12 | 烽火通信科技股份有限公司 | Crawler method and embedded crawler implementation method based on browser |
CN111026947A (en) * | 2019-12-18 | 2020-04-17 | 烽火通信科技股份有限公司 | Crawler method and embedded crawler implementation method based on browser |
CN111274466A (en) * | 2019-12-18 | 2020-06-12 | 成都迪普曼林信息技术有限公司 | Non-structural data acquisition system and method for overseas server |
CN111125487A (en) * | 2019-12-24 | 2020-05-08 | 个体化细胞治疗技术国家地方联合工程实验室(深圳) | Crawling method and device for web crawler |
CN111552864A (en) * | 2020-03-20 | 2020-08-18 | 上海恒生聚源数据服务有限公司 | Method, system, storage medium and electronic equipment for removing duplicate information |
CN111552864B (en) * | 2020-03-20 | 2023-09-12 | 上海恒生聚源数据服务有限公司 | Information deduplication method, system, storage medium and electronic equipment |
CN111488508A (en) * | 2020-04-10 | 2020-08-04 | 长春博立电子科技有限公司 | Internet information acquisition system and method supporting multi-protocol distributed high concurrency |
CN112347394A (en) * | 2020-11-30 | 2021-02-09 | 广州至真信息科技有限公司 | Method and device for acquiring webpage information, computer equipment and storage medium |
CN113220968A (en) * | 2021-05-26 | 2021-08-06 | 西安热工研究院有限公司 | Clustered network crawler-based automatic power technology standard updating system and method |
CN113220968B (en) * | 2021-05-26 | 2023-03-14 | 西安热工研究院有限公司 | Clustered network crawler-based automatic power technology standard updating system and method |
Also Published As
Publication number | Publication date |
---|---|
CN107590188B (en) | 2020-02-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | ||
Address after: 310000 room 507, building 13, No.199, Wensan Road, Xihu District, Hangzhou City, Zhejiang Province Patentee after: HANGZHOU JZTDATA TECHNOLOGY Co.,Ltd. Address before: Hangzhou City, Zhejiang province 310030 Xihu District Yaojiang Arphic court room 8-1603 Patentee before: HANGZHOU LINGHAO TECHNOLOGY Co.,Ltd. |