CN107590188A - An automated crawler crawling method for vertically subdivided fields and its management system - Google Patents

An automated crawler crawling method for vertically subdivided fields and its management system

Publication number
CN107590188A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710673166.3A
Other languages
Chinese (zh)
Other versions
CN107590188B (en)
Inventor
郑小林
张建勇
林炜华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HANGZHOU JZTDATA TECHNOLOGY Co.,Ltd.
Original Assignee
Hangzhou Ling Hao Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Ling Hao Technology Co Ltd
Priority to CN201710673166.3A
Publication of CN107590188A
Application granted
Publication of CN107590188B
Legal status: Active
Anticipated expiration

Abstract

The present invention relates to crawler crawling and management scheduling technology, and aims to provide an automated crawler crawling method for vertically subdivided fields and its management system. The method comprises the following processes: predicting crawler run time; optimizing the scheduling of a batch of crawlers according to the predicted times and the degree of parallelism; and crawling. For crawlers in vertically subdivided fields, the invention crawls more efficiently than the prior art: it introduces a crawler run-time prediction model built on the characteristics and startup parameters of vertically subdivided crawlers, and schedules parallel crawlers efficiently with the longest-processing-time-first algorithm, which saves crawling time.

Description

An automated crawler crawling method for vertically subdivided fields and its management system
Technical field
The present invention relates to the technical field of crawler crawling and management scheduling, and in particular to an automated crawler crawling method for vertically subdivided fields and its management system.
Background
Although the information age of data explosion holds massive amounts of information and data from every industry, the amount of information people can take in and their ability to process it are limited. Large volumes of redundant, useless information often take up valuable time, and it keeps getting harder to obtain personalized information. Vertically subdivided fields and personalized recommendation emerged in response. A vertically subdivided field concentrates attention and services on one specific category, and crawling data for such a field is important groundwork for services such as personalized recommendation.
A web crawler is a program that automatically fetches web page content and then performs operations such as structuring and persisting that information. By function, crawlers fall roughly into whole-web crawlers and vertical crawlers. Whole-web crawlers mainly serve data collection for search engines; they crawl to a large depth and can capture massive amounts of data efficiently. Vertical crawlers target specific websites or specific pages; they crawl to a small depth, their targets are usually clearly structured, and they serve data collection for vertical fields.
A whole-web crawler generally takes a queue of starting URLs as input and parses URLs layer by layer using a depth-first or breadth-first algorithm; combined with frameworks such as distributed platforms, it can capture massive amounts of data efficiently. In vertically subdivided fields, however, where specific data features of many targets must be collected, the quality of the data a whole-web crawler obtains is unsatisfactory. Research on customizing, scheduling, and managing crawlers for vertically subdivided fields is therefore urgent. Once the number of crawlers reaches a certain magnitude, crawler configuration, crawler scheduling, crawler execution, data processing, and other steps need to be tightly integrated to form a complete management and scheduling framework for vertical-field crawlers.
The implementations most similar to the present invention are the following Chinese invention patent applications: "Web page content extraction method, apparatus and system" (application number: 201510124714.8), "A crawler technology based on web page crawling" (application number: 201310040090.2), "Construction method of a web crawler based on news deduplication" (application number: 200910153588.3), "A plug-in configurable vertical-field web crawler implementation method" (application number: 201510131253.7), and "Distributed crawler task scheduling method based on a weighted round-robin algorithm" (application number: 201410073829.4).
Invention 1 (web page content extraction method, apparatus and system) proposes a web page content extraction method, apparatus, and system. The business layer sends a request to the extraction system to extract a web page URL; according to that request, the extraction system calls the web crawler system to crawl the original content of the page the URL specifies; the extraction system then extracts from the original page content using an agreed template document as the matching standard and returns the extracted content to the business layer. That invention makes full use of the backend's ability to crawl web pages, and by parsing the original page and the extraction template it can extract the content of specified tags from the original page. The scheme adapts to all web page formats when extracting the content of specified tags, which improves both the ability to extract from original pages and the flexibility of content extraction. However, the invention places fairly strict format requirements on target websites and crawls poorly on target pages whose formats are not uniform, so it is unsuitable for crawling the diverse web pages of vertically subdivided fields.
Invention 2 crawls the corresponding resources from the Internet based on Internet objects set by the user and tasks created by the user, rewrites the URLs, and stores the results, so that Internet information is collected in a targeted way. In its embodiment, to improve system throughput and resource utilization, a task is split into task shards after the task request is received. Each task shard contains only one website, and the shards are executed in parallel by multiple crawlers, so the task scheduling granularity is actually the task shard, which improves throughput and resource utilization. That invention only returns the web page content the user requested and the related links, without further processing; and when the number of crawler tasks grows while parallel capacity is limited, it cannot make full use of parallelism to reach the optimal crawl time.
Invention 3, the construction method of a web crawler based on news deduplication, has the following technical concept. A text word-segmentation method is used to extract the keywords in a news headline and the weight of each keyword. Based on experience, the N keywords with the highest weights in the text are chosen to form a set of (keyword, weight) pairs $C=\{(t_1,w_1),(t_2,w_2),(t_3,w_3),\ldots,(t_N,w_N)\}$, where $t_i$ is the i-th keyword and $w_i$ is the weight of the i-th keyword. The elements of C are sorted by weight $w_i$ in descending order, and the elements of each subset $C_i$ in the news set are likewise sorted by keyword weight in descending order. A threshold is set for the similarity between C and $C_i$, where similarity is characterized by the number of keywords that occupy the same sort position in the two sets. C is compared with each $C_i$ in the news set to judge whether their similarity is above the threshold: if it is, C is considered duplicate news; if not, C is considered non-duplicate news. Compared with that algorithm, simhash takes more factors into account, its complexity is not high, and it is more accurate.
Invention 4 discloses a plug-in configurable vertical-field web crawler implementation method comprising a crawl stage and an extraction stage. The crawl stage includes a crawl configuration stage and a crawl program execution stage, and the extraction stage includes an extraction configuration stage and an extraction program execution stage. Through configuration, that invention can crawl web pages and extract information in multiple fields with high accuracy, addressing the unclear intent and low accuracy of traditional search engines. It likewise customizes crawlers for vertically subdivided fields, but defining crawl parameters and parsing parameters through configuration files is inflexible and gives a poor user experience. It also does not consider crawler parallelism, so its efficiency is low.
Invention 5 proposes a distributed crawler task scheduling method based on a weighted round-robin algorithm, comprising: 1) according to scale, dividing web crawlers into five classes: single-machine multithreaded, homogeneous centralized, heterogeneous centralized, small distributed, and large distributed; 2) deploying a master-slave architecture; 3) when a crawler node first connects to the master node, the master node gives it an initial weight; 4) the master node repeatedly selects a crawler node according to the weighted round-robin scheduling algorithm and assigns it a URL task waiting to be crawled; 5) when a crawler node finishes crawling a URL task, it returns the result to the master node, which updates that node's weight. That invention updates a crawler node's weight from its most recent task completion time and its number of unfinished tasks, and assigns the next task to a crawler node whose weight is lower than the master node's. It does not take the expected duration of crawlers into account during scheduling, so it cannot optimize scheduling to the fullest extent.
Although the five patents above touch on crawler scheduling strategies and on deduplicating crawled content, for crawlers individually configured for vertical fields they have the following deficiencies:
1. They are all fairly general-purpose methods; in application scenarios that need a large number of customized crawlers, they cannot predict crawler run time or schedule crawlers efficiently.
2. None of them forms a complete, reasonably mature system spanning crawler configuration, crawler scheduling, crawler execution, and data processing.
Summary of the invention
The main purpose of the present invention is to overcome the deficiencies of the prior art by providing an automated crawling and management scheduling method for crawlers in vertically subdivided fields, based on individually customized crawlers. To solve the above technical problems, the solution of the present invention is as follows:
An automated crawler crawling method for vertically subdivided fields is provided, comprising the following processes:
Process one: crawler run-time prediction;
With the parallel channels and crawler tasks fixed, that is, with the degree of parallelism and the target websites determined, the run time of each new crawler task is predicted using a linear regression model;
Process two: batch crawler scheduling optimization according to the predicted times and the degree of parallelism;
Let the degree of parallelism be m and let there be n independent crawler tasks, with crawler task i having predicted run time $t_i$. The longest-processing-time-first algorithm (Longest Processing Time, LPT algorithm) is used so that the n crawler tasks are completed by the m parallel channels in as short a time as possible;
The longest-processing-time-first algorithm sorts the n crawler tasks by predicted run time and then assigns the longest remaining task, in turn, to the parallel channel that becomes free earliest. According to the proof in the paper "Bounds on Multiprocessing Timing Anomalies", this greedy strategy attains an upper bound of (4/3 − 1/(3m))·OPT, where m is the degree of parallelism and OPT is the optimal time (the theoretical shortest run time);
Process three: crawling;
Crawling comprises a crawl core and a data processing part, which together carry out the automated crawling of target websites by vertically subdivided field crawlers;
The crawl core sends requests to the target website (the crawler's target website), parses the returned results, and extracts content to obtain structured content;
The data processing part filters and screens the structured content parsed by the crawl core and persists it to the database.
In the present invention, the linear regression model in process one is retrained at regular intervals. The training samples are the input parameters and actual run times of past crawlers, so a certain amount of crawler run data must be accumulated at the start, after which the model parameters are updated continually. Training the linear regression model specifically comprises the following steps:
Step 1a): Quantize each qualitative variable (target website, category of crawled data) in the crawler's startup parameters. A qualitative variable with k possible values (k is a constant denoting the number of target websites or the number of crawled-data categories) is converted into k−1 dummy 0/1 independent variables. For example, crawled data falls into 4 categories (link, short text, long text, picture), so 3 0/1 variables are introduced to represent the data-category variable; with k target websites, k−1 0/1 variables are introduced to represent the target website. One is subtracted because k 0/1 variables would be perfectly multicollinear, and one assumption of the multiple linear regression model is that no linear relationship exists among the variables, that is, no variable can be a linear combination of the others. Adding the quantitative variables (number of requested pages, interval between requests) gives the linear regression feature values, that is, the quantized input features;
Define the quantized input features of a crawler as $X_i=(x_1,\ldots,x_D)^T$ and its run time as $t_i$; the linear regression model is then:
$t_i = t(X_i, W) = W^T\phi(X_i)$ (1.1)
where $W=(\omega_0,\ldots,\omega_D)^T$ and $\phi(X_i)=(1,x_1,\ldots,x_D)^T$; D is the number of features in the input $X_i$, $x_i\ (i=1,2,\ldots,D)$ are the independent variables, and $\omega_i\ (i=0,1,\ldots,D)$ are the model parameters to be solved;
Step 1b): Use the least squares method, so that the sum of squared differences between predicted time and actual time is minimized, and define the loss function:
$E(W)=\sum_{i=1}^{N}\left(t_i-W^T\phi(X_i)\right)^2$ (1.2)
where N is the number of samples, $t=(t_1,\ldots,t_N)^T$, and $X=(X_1,\ldots,X_N)^T$; $t_i\ (i=1,2,\ldots,N)$ is the actual run time of crawler i, $X_i\ (i=1,2,\ldots,N)$ is the input feature vector of crawler i, and W is the parameter vector to be solved;
Step 1c): Take the partial derivative of formula (1.2) with respect to W and set it to 0 to obtain the optimal parameter W that minimizes E(W):
$W=(X^TX)^{-1}X^Tt$ (1.3)
where X is the matrix formed from the input feature vectors $X_i$ described above, and t is the vector formed from the actual crawler run times described above;
Once the model parameters W have been trained, they can be used, before crawling, to predict the run time of each crawler task that is about to run.
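Steps 1a) to 1c) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the site names in `SITES`, the helper names (`dummies`, `features`, `solve`, `fit`, `predict`), and the choice of the first level of each qualitative variable as the all-zero baseline are assumptions made for the example. The bias term 1 is included in every row, so the row corresponds to $\phi(X_i)$, and formula (1.3) is evaluated by solving $(X^TX)W=X^Tt$ rather than forming an explicit inverse.

```python
from typing import List, Sequence

# Hypothetical category levels; the first level of each list is the baseline
# that is encoded as all zeros (k levels -> k-1 dummy variables).
SITES = ["site_a", "site_b", "site_c"]
DATA_TYPES = ["link", "short_text", "long_text", "picture"]


def dummies(value: str, levels: Sequence[str]) -> List[float]:
    """Step 1a: encode one qualitative variable with k levels as k-1 0/1 values."""
    return [1.0 if value == level else 0.0 for level in levels[1:]]


def features(site: str, data_type: str, pages: int, interval: float) -> List[float]:
    """Build phi(X_i) = (1, x_1, ..., x_D): bias, dummies, quantitative variables."""
    return ([1.0] + dummies(site, SITES) + dummies(data_type, DATA_TYPES)
            + [float(pages), float(interval)])


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a * w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; more varied samples are needed")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows: List[List[float]], times: List[float]) -> List[float]:
    """Steps 1b/1c: least-squares parameters from (X^T X) W = X^T t."""
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xtt = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(d)]
    return solve(xtx, xtt)


def predict(w: List[float], row: List[float]) -> float:
    """Formula (1.1): predicted run time W^T phi(X_i)."""
    return sum(wi * xi for wi, xi in zip(w, row))
```

With history rows built by `features(...)` and their measured run times, `fit(rows, times)` returns W, and `predict(w, features("site_b", "picture", 15, 1.5))` gives the estimate for a new task. The singularity check is where the k−1 rule matters: encoding all k levels alongside the bias column would make $X^TX$ non-invertible.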
In the present invention, the longest-processing-time-first algorithm in process two specifically means:
If the number of crawler tasks n ≤ the degree of parallelism m, each crawler task is assigned to its own parallel channel, and the shortest scheduling time equals the largest predicted run time among the n crawler tasks;
If the number of crawler tasks n > the degree of parallelism m, the following operations are repeated until all n crawler tasks have been assigned:
Step 2a): Build a max-heap H1 of the n crawler tasks keyed on predicted run time;
Step 2b): Build a min-heap H2 of the m parallel channels keyed on the time at which each becomes available;
Step 2c): Assign the job at the top of H1 to the channel at the top of H2;
Step 2d): Add the processing time of the job at the top of H1 to the available time of the channel at the top of H2, and reinsert that channel into H2;
Step 2e): Delete the top element of heap H1;
Step 2f): Repeat steps 2c) to 2e) until every element of H1 has been deleted; the shortest scheduling time is then the latest available time among the channels in H2 (the top of the min-heap H2 is only the channel that finishes earliest).
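The two-heap procedure in steps 2a) to 2f) might look like the following sketch, which uses Python's standard `heapq` module. The function name `lpt_schedule`, the task-name keys, and the returned plan structure are assumptions for illustration. Because `heapq` only provides a min-heap, H1 stores negated run times to behave as a max-heap.

```python
import heapq
from typing import Dict, List, Tuple


def lpt_schedule(predicted: Dict[str, float], m: int) -> Tuple[Dict[int, List[str]], float]:
    """Assign tasks to m parallel channels, longest predicted run time first.

    Returns (channel index -> ordered task names, total schedule length).
    """
    if m <= 0:
        raise ValueError("m must be positive")

    # Step 2a: H1 is a max-heap on predicted run time (negated for heapq).
    h1 = [(-t, name) for name, t in predicted.items()]
    heapq.heapify(h1)

    # Step 2b: H2 is a min-heap on the time each channel becomes available.
    h2 = [(0.0, channel) for channel in range(m)]
    heapq.heapify(h2)

    plan: Dict[int, List[str]] = {channel: [] for channel in range(m)}
    while h1:
        neg_time, name = heapq.heappop(h1)      # steps 2c/2e: longest remaining task
        free_at, channel = heapq.heappop(h2)    # step 2c: earliest-available channel
        plan[channel].append(name)
        heapq.heappush(h2, (free_at - neg_time, channel))  # step 2d

    # The schedule ends when the busiest channel finishes.
    makespan = max(free_at for free_at, _ in h2)
    return plan, makespan
```

The same loop also covers the n ≤ m case, since each task then lands on an empty channel and the schedule length is the largest single run time. Each assignment costs O(log n + log m), so the whole schedule is O(n log n) for n > m.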
In the present invention, the crawling of process three specifically comprises the following steps:
Step 3a): The engine opens a website, finds the spider that handles that website, and asks the spider for the first URLs to crawl;
Step 3b): The engine gets the first URLs to crawl from the spider and schedules them in the scheduler as Requests;
Step 3c): The engine asks the scheduler for the next URL to crawl;
Step 3d): The scheduler returns the next URL to crawl to the engine, and the engine forwards the URL to the downloader through the downloader middleware;
Step 3e): Once the page has finished downloading, the downloader generates a Response for the page and sends it to the engine through the downloader middleware;
Step 3f): The engine receives the Response from the downloader and sends it to the spider for processing through the spider middleware;
Step 3g): The spider processes the Response and returns the crawled items and new Requests to the engine;
Step 3h): The engine passes the crawled items to the item pipeline, which further screens, cleans, and persists the data, and passes the Requests to the scheduler;
Step 3i): Jump back to step 3b) and repeat until there are no more requests in the scheduler, at which point the engine closes the website.
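The control flow of steps 3a) to 3i) can be illustrated with a toy, single-threaded simulation. It does not use Scrapy and is not Scrapy's API: the in-memory `FAKE_SITE`, the `download` and `parse` stand-ins, and the `seen` set (added here only so the toy loop terminates on pages that link to each other) are all assumptions for the example, and the middleware layers are omitted.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical in-memory "website": URL -> (page text, outgoing links).
FAKE_SITE: Dict[str, Tuple[str, List[str]]] = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/b"]),
    "/b": ("page b", []),
}


def download(url: str) -> Tuple[str, List[str]]:
    """Downloader stand-in: returns the 'Response' for a URL (steps 3d/3e)."""
    return FAKE_SITE[url]


def parse(url: str, response: Tuple[str, List[str]]) -> Tuple[List[dict], List[str]]:
    """Spider stand-in: returns crawled items and new requests (step 3g)."""
    text, links = response
    return [{"url": url, "text": text}], links


def run_engine(start_urls: List[str]) -> List[dict]:
    """Engine loop: schedule, download, parse, store, until the queue is empty."""
    scheduler = deque(start_urls)       # steps 3a/3b: first URLs are scheduled
    seen = set(start_urls)
    stored: List[dict] = []             # item pipeline stand-in
    while scheduler:                    # step 3i: stop when no requests remain
        url = scheduler.popleft()       # steps 3c/3d: next URL from the scheduler
        response = download(url)        # steps 3d/3e
        items, requests = parse(url, response)  # steps 3f/3g
        stored.extend(items)            # step 3h: items go to the pipeline
        for next_url in requests:       # step 3h: new requests go to the scheduler
            if next_url not in seen:
                seen.add(next_url)
                scheduler.append(next_url)
    return stored
```

Calling `run_engine(["/"])` visits each fake page once in breadth-first order. In the real framework the downloader is asynchronous and the engine interleaves many requests, which this sequential sketch does not model.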
In the present invention, in process three, the crawl core uses the simhash algorithm to convert web page content into a low-dimensional vector for similarity calculation and deduplicates near-duplicate web pages, which specifically comprises the following steps:
Step i): Extract from the database the simhash values of the web page content crawled in the previous n days (the number of days depends on the propagation characteristics of the specific field), and for each newly crawled web page (at the time its content is extracted) perform the following operations:
Word segmentation: extract the text content of the newly crawled page and segment it into feature words, remove the stop words from the feature words, and compute the term frequency of each feature word (the ratio of the number of times the word occurs in the text to the total number of words in the text) as its weight;
Quantization: apply a hash operation to each feature word to obtain a 0-1 hash string;
Merging: multiply the hash sequence value (the 0-1 hash string) computed for each feature word by that feature word's term-frequency weight, then sum all the hash sequence values bit by bit to obtain a single sequence string;
Dimensionality reduction: convert the accumulated sequence string into a 0-1 string (each position is set to 1 if its value is greater than 0 and to 0 if it is less than 0), which gives the final simhash value of the web page content;
Step ii): Near-duplicate page filtering: compare the simhash value of the newly crawled page with the simhash values of existing pages by computing the Hamming distance between the two hash values. If the Hamming distance is less than 3, the two pages are considered near-duplicates and the newly crawled page is discarded; otherwise, the page is stored in the database and the existing simhash library is updated;
The simhash library holds the simhash values of all crawled web pages.
In the present invention, the crawl core performs near-duplicate deduplication only on news web pages.
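The segmentation, quantization, merging, and dimensionality-reduction steps can be sketched as below. Several details are assumptions rather than statements from the text: whitespace tokenization stands in for a real word segmenter, the tiny `STOP_WORDS` set is a placeholder, the 64-bit fingerprint is taken from the first 8 bytes of an MD5 digest, and a 0 bit contributes −weight during merging (the usual simhash convention, which is what lets an accumulated position fall below 0).

```python
import hashlib
from collections import Counter
from typing import Iterable, List

BITS = 64
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # placeholder stop-word list


def tokenize(text: str) -> List[str]:
    """Segmentation stand-in: lowercase whitespace split minus stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def hash_bits(word: str) -> int:
    """Quantization: a 64-bit hash of one feature word."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")


def simhash(text: str) -> int:
    """Merge term-frequency-weighted word hashes into one 64-bit fingerprint."""
    words = tokenize(text)
    if not words:
        return 0
    total = len(words)
    acc = [0.0] * BITS
    for word, count in Counter(words).items():
        weight = count / total              # term frequency as the weight
        h = hash_bits(word)
        for bit in range(BITS):
            acc[bit] += weight if (h >> bit) & 1 else -weight
    value = 0
    for bit in range(BITS):                 # dimensionality reduction to 0/1
        if acc[bit] > 0:
            value |= 1 << bit
    return value


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(new_hash: int, known: Iterable[int], threshold: int = 3) -> bool:
    """Step ii: near-duplicate if any stored fingerprint is closer than threshold."""
    return any(hamming(new_hash, h) < threshold for h in known)
```

A page would be fingerprinted with `simhash(page_text)` and discarded when `is_near_duplicate` is true against the fingerprints of the last n days. The linear scan over stored fingerprints is the simplest choice; a larger library would likely need an index over fingerprint segments.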
A management system for the automated crawler crawling method for vertically subdivided fields is provided, used to configure crawler parameters and to manage and monitor crawler operation in real time. The management system comprises a crawler crawl core layer and a crawler control management layer;
The crawler crawl core layer is based on the Scrapy crawler application framework. Scrapy uses the Twisted asynchronous networking library to handle network communication and includes various middleware interfaces that can meet a variety of needs;
The crawler crawl core layer specifically comprises the following components:
Engine: controls the data processing flow of the whole system and triggers transactions;
Scheduler: receives requests from the engine, enqueues them, and returns them to the engine when the engine asks for them;
Downloader: fetches web pages and returns the page content to the spider;
Spider: defines the crawling and parsing rules for a specific website (using parsing tools such as xpath);
Item pipeline: processes the items returned from the spider; its main tasks are cleaning, validating, and storing data;
Downloader middleware: handles the requests and responses between the engine and the downloader;
Spider middleware: handles the spider's response input and request output;
Scheduler middleware: handles the requests and responses sent from the engine to the scheduler;
The crawler control management layer comprises a crawler management backend module and a crawler service layer module;
The crawler management backend module uses the MVC model and, by calling the crawler service layer module, provides user-friendly, interface-based management of that module. The management services cover crawler parameter configuration, batch crawler creation, batch crawler configuration, crawler startup, batch crawler startup, crawler timing, crawler log viewing, crawler result viewing, and crawler log persistence;
The crawler parameter configuration flow is: obtain all the crawler information provided by the crawler service layer module and store it in the database, configure parameter information for each crawler, and configure startable batches of crawlers;
The crawler startup flow is: obtain the crawler's parameter configuration, prepare a crawler start request according to that configuration, and send the start request to the crawler service layer module; if it succeeds, record the crawler's jobid and periodically send requests to the crawler service layer module to obtain the crawler's state until the crawler finishes;
The batch crawler startup flow is: obtain the batch crawler configuration, execute each crawler in turn, following the crawler startup flow, in the order in which the crawlers are configured, and send a request to persist the crawler results;
The timed batch crawler flow is: select a batch of crawlers, set the start time, and turn on timed crawling. Each time the timer fires, the batch crawler configuration is fetched and checked to see whether its timer has been cancelled; if it has, the timer is cancelled and a log entry is written; if it has not, the batch crawler startup flow is executed. If the server restarts, all previously set timed jobs are started again when the project starts up, by way of container configuration;
The crawler log viewing flow is: send a request to the crawler service layer module, with the crawler's project, spider, page, and pageSize as parameters, to obtain the log information of the corresponding crawler.
The crawler service layer module wraps the operations on the crawler crawl core layer (crawler startup, pausing, run monitoring, log viewing, crawl result viewing, persistence, and so on) as web services and provides JSON API calls for deploying and controlling crawlers. This supports remote invocation and parallel scaling: unifying the various crawler operations in web services simplifies the management and scheduling of the whole crawler system.
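The crawler startup flow (send a start request, record the jobid, then poll the state until the crawler finishes) can be sketched without committing to any particular service API. In the example below, `start_job` and `get_state` are hypothetical callables that would wrap the service layer's JSON API, and the state string `"finished"` is an assumed value; the text does not specify endpoint names or state values.

```python
import time
from typing import Callable, Dict


def run_crawler(config: Dict[str, str],
                start_job: Callable[[Dict[str, str]], str],
                get_state: Callable[[str], str],
                poll_seconds: float = 5.0,
                sleep: Callable[[float], None] = time.sleep) -> str:
    """Start one crawler, keep its job id, and poll until it reports finished.

    start_job and get_state are placeholders for calls to the crawler
    service layer; they are injected so the flow can be tested offline.
    """
    job_id = start_job(config)          # start request; the returned id is recorded
    state = get_state(job_id)
    while state != "finished":          # assumed terminal state name
        sleep(poll_seconds)             # periodic polling interval
        state = get_state(job_id)
    return job_id


def run_batch(configs, start_job, get_state, **kwargs):
    """Batch startup: run each configured crawler in order, one after another."""
    return [run_crawler(cfg, start_job, get_state, **kwargs) for cfg in configs]
```

Injecting the two callables keeps the polling logic independent of the transport, so the same loop works whether the service layer is reached over HTTP or replaced by a stub in tests. A production version would likely also want a timeout or a maximum number of polls.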
Compared with the prior art, the beneficial effects of the present invention are:
For crawlers in vertically subdivided fields, it crawls more efficiently than the prior art: it introduces a crawler run-time prediction model built on the characteristics and startup parameters of vertically subdivided crawlers, and schedules parallel crawlers efficiently with the longest-processing-time-first algorithm, which saves crawling time.
It combines the whole flow of configuring, managing, scheduling, and crawling with vertically subdivided field crawlers, plus data processing, into one automated and efficient system that is easy to manage (real-time monitoring of crawler state, timed crawling and data processing) and highly scalable (crawlers are quick and convenient to configure).
Brief description of the drawings
Fig. 1 is the overall flow of the crawler scheduling and running algorithm.
Fig. 2 is the framework diagram of the crawler management system.
Fig. 3 is the flow chart of the crawler core layer algorithm.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments:
As shown in Fig. 2, a management system for the automated crawler crawling method for vertically subdivided fields comprises a crawler crawl core layer and a crawler control management layer, and is used to configure crawler parameters and to manage and monitor crawler operation in real time.
The crawler crawl core layer is based on the Scrapy crawler application framework. Scrapy uses the Twisted asynchronous networking library to handle network communication; its architecture is clear, and it includes various middleware interfaces that can flexibly meet a variety of needs. Fig. 3 is the flow chart of the crawler core layer algorithm.
The crawler crawl core layer specifically comprises the following components:
Engine: controls the data processing flow of the whole system and triggers transactions;
Scheduler: receives requests from the engine, enqueues them, and returns them to the engine when the engine asks for them;
Downloader: fetches web pages and returns the page content to the spider;
Spider: defines the crawling and parsing rules for a specific website (using parsing tools such as xpath);
Item pipeline: processes the items returned from the spider; its main tasks are cleaning, validating, and storing data;
Downloader middleware: handles the requests and responses between the engine and the downloader;
Spider middleware: handles the spider's response input and request output;
Scheduler middleware: handles the requests and responses sent from the engine to the scheduler;
The crawler control management layer comprises a crawler management backend module and a crawler service layer module.
The crawler management backend module uses the MVC model and, by calling the crawler service layer module, provides user-friendly, interface-based management of that module. The management services cover crawler parameter configuration, batch crawler creation, batch crawler configuration, crawler startup, batch crawler startup, crawler timing, crawler log viewing, crawler result viewing, and crawler log persistence.
The crawler parameter configuration flow is: obtain all the crawler information provided by the crawler service layer module and store it in the database, configure parameter information for each crawler, and configure startable batches of crawlers;
The crawler startup flow is: obtain the crawler's parameter configuration, prepare a crawler start request according to that configuration, and send the start request to the crawler service layer module; if it succeeds, record the crawler's jobid and periodically send requests to the crawler service layer module to obtain the crawler's state until the crawler finishes;
The batch crawler startup flow is: obtain the batch crawler configuration, execute each crawler in turn, following the crawler startup flow, in the order in which the crawlers are configured, and send a request to persist the crawler results;
The timed batch crawler flow is: select a batch of crawlers, set the start time, and turn on timed crawling. Each time the timer fires, the batch crawler configuration is fetched and checked to see whether its timer has been cancelled; if it has, the timer is cancelled and a log entry is written; if it has not, the batch crawler startup flow is executed. If the server restarts, all previously set timed jobs are started again when the project starts up, by way of container configuration;
The crawler log viewing flow is: send a request to the crawler service layer module, with the crawler's project, spider, page, and pageSize as parameters, to obtain the log information of the corresponding crawler.
The crawler service layer module wraps the operations on the crawler crawl core layer, such as crawler startup, pausing, run monitoring, log viewing, crawl result viewing, and persistence, as web services, and provides JSON API calls for deploying and controlling crawlers. This supports remote invocation and parallel scaling: unifying the various crawler operations in web services simplifies the management and scheduling of the whole crawler system.
As shown in Fig. 1, an automated crawler crawling method for vertically subdivided fields comprises the following processes:
Process one: crawler run-time prediction;
Process two: batch crawler scheduling optimization according to the predicted times and the degree of parallelism;
Process three: crawling.
Process one:
Crawlers for vertically subdivided fields draw on a great many target websites, and news-type information in particular is updated very frequently, so a large number of independent crawler tasks must be started every day. Because their target websites and crawl parameters differ, the independent crawlers also differ greatly in crawl time. With the parallel channels and crawler tasks fixed, that is, with the degree of parallelism and the target websites determined, accurately predicting the run time of each new crawler task with a linear regression model and optimizing the crawler scheduling order can greatly improve crawl efficiency and save crawl time.
The run time of each independent crawler is mainly influenced by the target website and the crawler parameters, so building a multiple linear regression time-prediction model from completed crawler tasks can effectively predict the approximate run time of a new task.
The training of linear regression model (LRM) specifically includes following step:
Step 1a): Quantify each qualitative variable among the parameters: a qualitative variable with k possible values (for example, k categories of information) is converted into k-1 dummy (0-1) independent variables; together with the quantitative variables (number of requested pages, interval between requests), these give the linear regression features;
Here k is a constant denoting, for example, the number of target websites or the number of categories of crawled data;
Let the quantified input features of crawler i be $X_i = (x_1, \ldots, x_D)^T$ and its run time be $t_i$. The linear regression model is then:
$$t_i = t(X_i, W) = W^T \phi(X_i) \qquad (1.1)$$
where $W = (\omega_0, \ldots, \omega_D)^T$ and $\phi(X_i) = (1, x_1, \ldots, x_D)^T$; D is the number of features in the input $X_i$, $x_j$ (j = 1, 2, ..., D) are the independent variables, and $\omega_j$ (j = 0, 1, ..., D) are the model parameters to be determined;
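Step 1b): Using the least-squares method, minimize the sum of squared differences between the predicted and actual times; define the loss function:
$$E(W) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - W^T \phi(X_n) \}^2 = \frac{1}{2} (t - XW)^T (t - XW) \qquad (1.2)$$
where N is the number of samples, $t = (t_1, \ldots, t_N)^T$, and $X = (X_1, \ldots, X_N)^T$ (with each $X_n$ taken in its extended form $\phi(X_n)$ so that the dimensions match W); $t_i$ (i = 1, 2, ..., N) is the actual run time of crawler i, $X_i$ (i = 1, 2, ..., N) is the input feature vector of crawler i, and W is the parameter vector to be determined;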
Step 1c): Take the partial derivative of formula (1.2) with respect to W and set it to zero to obtain the optimal parameter W that minimizes E(W):
$$W = (X^T X)^{-1} X^T t \qquad (1.3)$$
where X is the matrix formed from the input feature vectors $X_i$ described above, and t is the vector of actual crawler run times described above.
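Steps 1a) to 1c) can be illustrated with a small, dependency-free sketch. Everything concrete in it is an assumption made for illustration: the three site names, the two quantitative features (page count and request interval), and the synthetic history, which was generated from an exactly linear rule so that the fit is easy to check. A production system would more likely rely on a numerical library than on the hand-written solver shown here.

```python
from typing import List, Sequence

SITES = ["site_a", "site_b", "site_c"]  # hypothetical target websites (k = 3)


def encode(site: str, pages: float, interval: float) -> List[float]:
    """Step 1a: k-1 dummy variables for the site, plus the quantitative variables."""
    dummies = [1.0 if site == s else 0.0 for s in SITES[1:]]
    return dummies + [float(pages), float(interval)]


def solve(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a.x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: features are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(features: Sequence[Sequence[float]], times: Sequence[float]) -> List[float]:
    """Steps 1b-1c: W = (X^T X)^{-1} X^T t with phi(X_i) = (1, x_1, ..., x_D)."""
    phi = [[1.0] + list(x) for x in features]
    d = len(phi[0])
    xtx = [[sum(r[i] * r[j] for r in phi) for j in range(d)] for i in range(d)]
    xtt = [sum(r[i] * t for r, t in zip(phi, times)) for i in range(d)]
    return solve(xtx, xtt)


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    """Formula (1.1): t = W^T phi(X)."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


# Synthetic history of finished tasks: (site, pages, interval, run time in seconds).
HISTORY = [
    ("site_a", 10, 1, 28.0), ("site_a", 20, 2, 51.0), ("site_a", 30, 3, 74.0),
    ("site_b", 10, 2, 41.0), ("site_b", 30, 1, 78.0),
    ("site_c", 20, 1, 68.0), ("site_c", 10, 3, 54.0),
]
W = fit([encode(s, p, i) for s, p, i, _ in HISTORY], [t for *_, t in HISTORY])
```

Calling `predict(W, encode("site_b", 20, 2))` then gives the estimated run time of a new task against `site_b`; the estimate feeds the scheduling step that follows.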
Process two:
Let the crawler parallelism be m and let there be n independent crawler tasks, with predicted run time $t_i$ for crawler task i. The longest-processing-time-first algorithm (Longest Processing Time, LPT) is used so that the n crawler tasks are completed by the m parallel channels in as short a time as possible.
The LPT algorithm sorts the n crawler tasks by predicted run time and then assigns them, longest first, to whichever parallel channel becomes free earliest. This greedy strategy guarantees a completion time of at most (4/3 - 1/(3m))·OPT, where OPT is the optimal completion time. Specifically:
If the number of crawler tasks n ≤ the parallelism m, crawler task i is assigned to batch program (parallel channel) i, and the shortest scheduling time equals the maximum predicted run time among the n crawler tasks;
If the number of crawler tasks n > the parallelism m, the following operations are repeated until all n crawler tasks have been assigned:
Step 2a): Build the n crawler tasks into a max-heap H1 keyed on predicted run time;
Step 2b): Build the m parallel channels into a min-heap H2 keyed on the time at which each channel becomes available;
Step 2c): Assign the job at the top of H1 to the channel at the top of H2;
Step 2d): Add the processing time of the job at the top of H1 to the channel at the top of H2 and reinsert that channel into H2;
Step 2e): Delete the top element of heap H1;
Step 2f): Repeat steps 2c) to 2e) until all elements of H1 have been deleted; the scheduling time is then the largest completion time among the channels in H2 (the top of the min-heap H2 is only the channel that finishes earliest).
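Steps 2a) to 2f) map directly onto Python's standard `heapq` module. The sketch below is a minimal illustration, not the patented implementation; `lpt_schedule` and the task names are hypothetical. It reports the latest finish time across all channels as the overall completion time, since the top of a min-heap is the earliest-finishing channel.

```python
import heapq
from typing import Dict, List, Tuple


def lpt_schedule(predicted: Dict[str, float], m: int) -> Tuple[float, List[List[str]]]:
    """Assign tasks to m channels, longest predicted run time first.

    Returns (overall completion time, list of task names per channel).
    """
    if m < 1:
        raise ValueError("need at least one parallel channel")
    # H1: max-heap of tasks by predicted time (negated, since heapq is a min-heap).
    h1 = [(-t, name) for name, t in predicted.items()]
    heapq.heapify(h1)
    # H2: min-heap of channels keyed on the moment each one becomes free.
    h2 = [(0.0, channel) for channel in range(m)]
    heapq.heapify(h2)
    assignment: List[List[str]] = [[] for _ in range(m)]
    while h1:
        neg_time, name = heapq.heappop(h1)      # longest remaining task
        free_at, channel = heapq.heappop(h2)    # channel that frees up first
        assignment[channel].append(name)
        heapq.heappush(h2, (free_at - neg_time, channel))
    makespan = max(free_at for free_at, _ in h2)
    return makespan, assignment
```

The same function covers the n ≤ m case: each task lands on its own idle channel, so the completion time is simply the longest predicted run time.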
Process three:
Crawling comprises a crawling core part and a data processing part, which together realize the crawling of target websites by the automated vertical-field crawler. The crawling core part sends requests to the target website, parses the returned results, and extracts content, yielding structured content. The data processing part filters, screens, and persists to the database the structured content parsed by the crawling core part.
The specific steps of crawling are as follows:
Step 3a): The engine opens a website, finds the spider that handles that website, and asks the spider for the first URL to crawl;
Step 3b): The engine obtains the first URL to crawl from the spider and schedules it in the scheduler as a Request;
Step 3c): The engine asks the scheduler for the next URL to crawl;
Step 3d): The scheduler returns the next URL to crawl to the engine, and the engine forwards it to the downloader through the downloader middleware;
Step 3e): Once the page has finished downloading, the downloader generates a Response for the page and sends it to the engine through the downloader middleware;
Step 3f): The engine receives the Response from the downloader and sends it to the spider for processing through the spider middleware;
Step 3g): The spider processes the Response and returns the scraped items and new Requests to the engine;
Step 3h): The engine passes the scraped items to the item pipeline for further screening, cleaning, and persistence, and passes the Requests to the scheduler;
Step 3i): Jump back to step 3b) and repeat until there are no more requests in the scheduler, at which point the engine closes the website.
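The control loop in steps 3a) to 3i) can be imitated without any framework. The sketch below is a simplified, synchronous stand-in and does not use Scrapy's actual classes: `run_engine`, `FAKE_SITE`, and the three callbacks are hypothetical, middleware is omitted, and the `seen` set that skips already-queued URLs is an added assumption so that the loop terminates on pages that link to each other.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Union


@dataclass(frozen=True)
class Request:
    url: str


@dataclass(frozen=True)
class Response:
    url: str
    body: str


Item = Dict[str, str]


def run_engine(start_urls: Iterable[str],
               download: Callable[[Request], Response],
               parse: Callable[[Response], Iterator[Union[Item, Request]]],
               pipeline: Callable[[Item], Optional[Item]]) -> List[Item]:
    scheduler = deque(Request(u) for u in start_urls)     # steps 3a-3b
    seen = {r.url for r in scheduler}
    stored: List[Item] = []
    while scheduler:                                      # step 3i: until queue is empty
        request = scheduler.popleft()                     # step 3c
        response = download(request)                      # steps 3d-3e
        for result in parse(response):                    # steps 3f-3g
            if isinstance(result, Request):               # step 3h: new requests
                if result.url not in seen:
                    seen.add(result.url)
                    scheduler.append(result)
            else:                                         # step 3h: scraped items
                cleaned = pipeline(result)
                if cleaned is not None:
                    stored.append(cleaned)
    return stored


# A tiny in-memory "website": body format is "title|comma-separated links".
FAKE_SITE = {
    "/index": "|/news/1,/news/2",
    "/news/1": "First story|/news/2",
    "/news/2": "  Second story  |",
}


def download(request: Request) -> Response:
    return Response(request.url, FAKE_SITE[request.url])


def parse(response: Response) -> Iterator[Union[Item, Request]]:
    title, _, links = response.body.partition("|")
    if title.strip():
        yield {"url": response.url, "title": title}
    for link in filter(None, links.split(",")):
        yield Request(link)


def pipeline(item: Item) -> Optional[Item]:
    title = item["title"].strip()          # cleaning step
    return {**item, "title": title} if title else None


ITEMS = run_engine(["/index"], download, parse, pipeline)
```

In a real deployment the downloader and the pipeline would be asynchronous, and the list `stored` would be replaced by database persistence.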
Because the invention targets customized crawlers for highly vertical subdivided fields, the topics crawled are similar, and the Internet currently contains a large number of mirrored, plagiarized, advertisement-embedded, or slightly modified pages. This is especially true of news content: a popular story may be republished by multiple websites over several days. If this large volume of duplicate pages is not detected and filtered, it causes data redundancy and wastes storage on the one hand, and on the other leads to duplicated data in downstream uses such as building a search engine. Adding near-duplicate page removal is therefore necessary, especially when crawling long-text news content.
The main idea of deduplication is to compare the similarity of two pages' content against a similarity threshold; a page whose similarity exceeds the threshold is considered a near-duplicate and discarded. The key to deduplication is therefore computing the similarity of page content. The crawling core part of the invention uses the simhash algorithm to convert page content into a low-dimensional vector for similarity computation and removes near-duplicate pages, in the following steps:
Step i): For the page content crawled in the previous n days (stored in the database; the number of days depends on the propagation characteristics of the specific field) and for each newly crawled page, perform the following operations on each page:
Tokenization: extract the page's text content, segment it to obtain feature words, remove the stop words, and compute the tf-idf of each feature word as its weight;
Hashing: apply a hash operation to each feature word to obtain a 0-1 hash string;
Merging: for the hash sequence of each feature word, first apply the word's weight to every bit (in standard simhash a 1 bit contributes +weight and a 0 bit contributes -weight), then sum across all hash sequences bit position by bit position to form a single sequence;
Dimensionality reduction: convert the accumulated sequence into a 0-1 string (each position becomes 1 if its value is greater than 0 and 0 otherwise), which is the final simhash value of the page content;
Step ii): Near-duplicate filtering: compare the simhash value of the newly crawled page with the simhash values of existing pages by computing the Hamming distance between the two hash values. If the Hamming distance is less than 3, the two pages are considered near-duplicates and the newly crawled page is discarded; otherwise the page is stored in the database and the existing simhash store is updated;
The existing simhash store holds the simhash values of all pages crawled in the previous n days.
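The simhash steps above can be sketched with the Python standard library alone. Several choices below are assumptions made for illustration rather than details taken from the text: whitespace splitting stands in for a real word segmenter, the stop-word list is a toy one, plain term frequency is used as the weight, and the 64-bit word hash is taken from the first 8 bytes of an MD5 digest.

```python
import hashlib
from collections import Counter
from typing import FrozenSet, Iterable

STOP_WORDS: FrozenSet[str] = frozenset({"the", "a", "an", "of", "and", "to", "in"})
BITS = 64


def simhash(text: str, stop_words: FrozenSet[str] = STOP_WORDS, bits: int = BITS) -> int:
    # Tokenization: split into words, drop stop words, weight by term frequency.
    words = [w for w in text.lower().split() if w not in stop_words]
    weights = Counter(words)
    acc = [0] * bits
    for word, weight in weights.items():
        # Hashing: a fixed-width hash of each feature word.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Merging: a 1 bit adds the weight, a 0 bit subtracts it.
        for i in range(bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    # Dimensionality reduction: positive sums become 1, everything else 0.
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(new_fp: int, existing: Iterable[int], threshold: int = 3) -> bool:
    """Step ii: near-duplicate if the Hamming distance to any stored value is < threshold."""
    return any(hamming(new_fp, fp) < threshold for fp in existing)
```

A crawler would call `is_near_duplicate` with the fingerprints stored for the previous n days and, when it returns `False`, persist the page and add its fingerprint to the store.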
Because the spread of news pages is time-limited, reprints usually appear soon after the original is published, so the number of simhash values that must be compared is small. Combined with the efficiency of the simhash algorithm, deduplication therefore has little impact on crawler efficiency in either time or space complexity.
Finally, it should be noted that the above are merely specific embodiments of the invention. The invention is obviously not limited to the above embodiments, and many variations are possible. All variations that a person of ordinary skill in the art can directly derive or infer from the disclosure of the invention shall be considered within the scope of protection of the invention.

Claims (7)

1. An automated crawler crawling method for vertical subdivided fields, characterized by comprising the following processes:
Process one: crawler run-time prediction;
When the parallel channels and the crawler tasks are fixed, i.e. when the crawler parallelism and the target websites are determined, a linear regression model is used to predict the run time of each new crawler task;
Process two: batch crawler scheduling optimization according to the predicted times and the parallelism;
Let the crawler parallelism be m and let there be n independent crawler tasks, each crawler task i having a corresponding predicted run time; the longest-processing-time-first algorithm is used so that the n crawler tasks are completed by the m parallel channels in as short a time as possible;
The longest-processing-time-first algorithm sorts the n crawler tasks by predicted run time and then assigns them, longest first, to the parallel channel that becomes free earliest; this greedy strategy attains an upper bound of (4/3 - 1/(3m))·OPT, where m is the crawler parallelism and OPT is the optimal time;
Process three: crawling;
Crawling comprises a crawling core part and a data processing part, which together realize the crawling of target websites by the automated vertical-field crawler;
The crawling core part sends requests to the target website, parses the returned results, and extracts content, obtaining structured content;
The data processing part filters, screens, and persists to the database the structured content parsed by the crawling core part.
2. The automated crawler crawling method for vertical subdivided fields according to claim 1, characterized in that the linear regression model in process one is retrained at regular intervals, and training the linear regression model comprises the following steps:
Step 1a): Quantify each qualitative variable among the crawler's start-up parameters: a qualitative variable with k possible values is converted into k-1 dummy variables taking the value 0 or 1; together with the quantitative variables, these give the linear regression features, i.e. the quantified input features;
Let the quantified input features of crawler i be $X_i = (x_1, \ldots, x_D)^T$ and its run time be $t_i$; the linear regression model is then:
$$t_i = t(X_i, W) = W^T \phi(X_i) \qquad (1.1)$$
where $W = (\omega_0, \ldots, \omega_D)^T$ and $\phi(X_i) = (1, x_1, \ldots, x_D)^T$; D is the number of features in the input $X_i$, $x_j$ (j = 1, 2, ..., D) are the independent variables, and $\omega_j$ (j = 0, 1, ..., D) are the model parameters to be determined;
Step 1b): Using the least-squares method, minimize the sum of squared differences between the predicted and actual times; define the loss function:
$$E(W) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - W^T \phi(X_n) \}^2 = \frac{1}{2} (t - XW)^T (t - XW) \qquad (1.2)$$
where N is the number of samples, $t = (t_1, \ldots, t_N)^T$, and $X = (X_1, \ldots, X_N)^T$; $t_i$ (i = 1, 2, ..., N) is the actual run time of crawler i, $X_i$ (i = 1, 2, ..., N) is the input feature vector of crawler i, and W is the parameter vector to be determined;
Step 1c): Take the partial derivative of formula (1.2) with respect to W and set it to zero to obtain the optimal parameter W that minimizes E(W):
$$W = (X^T X)^{-1} X^T t \qquad (1.3)$$
where X is the matrix formed from the input feature vectors $X_i$ described above, and t is the vector of actual crawler run times described above;
Once the model parameters W have been trained, they can be used before crawling to predict the run time of each crawler task that is about to run.
3. The automated crawler crawling method for vertical subdivided fields according to claim 1, characterized in that, in process two, the longest-processing-time-first algorithm specifically means:
If the number of crawler tasks n ≤ the parallelism m, each crawler task is assigned to its own batch program (i.e. parallel program), and the shortest scheduling time equals the maximum predicted run time among the n crawler tasks;
If the number of crawler tasks n > the parallelism m, the following operations are repeated until all n crawler tasks have been assigned:
Step 2a): Build the n crawler tasks into a max-heap H1 keyed on predicted run time;
Step 2b): Build the m parallel channels into a min-heap H2 keyed on the time at which each channel becomes available;
Step 2c): Assign the job at the top of H1 to the channel at the top of H2;
Step 2d): Add the processing time of the job at the top of H1 to the channel at the top of H2 and reinsert that channel into H2;
Step 2e): Delete the top element of heap H1;
Step 2f): Repeat steps 2c) to 2e) until all elements of H1 have been deleted; the scheduling time is then the largest completion time among the channels in H2 (the top of the min-heap H2 is only the channel that finishes earliest).
4. The automated crawler crawling method for vertical subdivided fields according to claim 1, characterized in that the crawling of process three specifically comprises the following steps:
Step 3a): The engine opens a website, finds the spider that handles that website, and asks the spider for the first URL to crawl;
Step 3b): The engine obtains the first URL to crawl from the spider and schedules it in the scheduler as a Request;
Step 3c): The engine asks the scheduler for the next URL to crawl;
Step 3d): The scheduler returns the next URL to crawl to the engine, and the engine forwards it to the downloader through the downloader middleware;
Step 3e): Once the page has finished downloading, the downloader generates a Response for the page and sends it to the engine through the downloader middleware;
Step 3f): The engine receives the Response from the downloader and sends it to the spider for processing through the spider middleware;
Step 3g): The spider processes the Response and returns the scraped items and new Requests to the engine;
Step 3h): The engine passes the scraped items to the item pipeline for further screening, cleaning, and persistence, and passes the Requests to the scheduler;
Step 3i): Jump back to step 3b) and repeat until there are no more requests in the scheduler, at which point the engine closes the website.
5. The automated crawler crawling method for vertical subdivided fields according to claim 1, characterized in that, in process three, the crawling core part uses the simhash algorithm to convert page content into a low-dimensional vector for similarity computation and removes near-duplicate pages, specifically comprising the following steps:
Step i): Extract from the database the simhash values of the page content crawled in the previous n days, and for each newly crawled page perform the following operations:
Tokenization: extract the text content of the newly crawled page, segment it to obtain feature words, remove the stop words, and compute the term frequency of each feature word as its weight;
Hashing: apply a hash operation to each feature word to obtain a 0-1 hash string;
Merging: for the hash sequence of each feature word, first apply the word's term-frequency weight to every bit, then sum across all hash sequences bit position by bit position to form a single sequence;
Dimensionality reduction: convert the accumulated sequence into a 0-1 string, which is the final simhash value of the page content;
Step ii): Near-duplicate filtering: compare the simhash value of the newly crawled page with the simhash values of existing pages by computing the Hamming distance between the two hash values; if the Hamming distance is less than 3, the two pages are considered near-duplicates and the newly crawled page is discarded; otherwise the page is stored in the database and the existing simhash store is updated;
The simhash store holds the simhash values of all crawled pages.
6. The automated crawler crawling method for vertical subdivided fields according to claim 5, characterized in that the crawling core part performs near-duplicate page removal only on news pages.
7. A management system for the automated crawler crawling method for vertical subdivided fields according to claim 1, used for parameter configuration, run management, and real-time monitoring of crawlers, characterized in that the management system comprises a crawling core layer and a crawler control and management layer;
The crawling core layer is based on the Scrapy crawler application framework; Scrapy uses the Twisted asynchronous networking library to handle network communication and provides various middleware interfaces, so it can meet a variety of requirements;
The crawling core layer specifically comprises the following components:
Engine: controls the data processing flow of the whole system and triggers transaction processing;
Scheduler: accepts requests from the engine and enqueues them, returning them when the engine asks for them;
Downloader: fetches web pages and returns the page content to the spider;
Spider: defines the crawling and parsing rules for a specific website;
Item pipeline: processes the items returned by the spider; its main tasks are cleaning, validating, and storing data;
Downloader middleware: handles the requests and responses passing between the engine and the downloader;
Spider middleware: handles the spider's response input and request output;
Scheduler middleware: handles the requests and responses sent from the engine to the scheduler;
The crawler control and management layer comprises a crawler management back-end module and a crawler service layer module;
The crawler management back-end module uses the MVC model and, by calling the crawler service layer module, provides user-friendly interface-based management of it; the management services provided through the interface include crawler parameter configuration, batch crawler creation, batch crawler configuration, crawler start-up, batch crawler start-up, crawler scheduling, crawler log viewing, crawler result viewing, and crawler log persistence;
The flow for crawler parameter configuration is: obtain all the crawler information provided by the crawler service layer module and store it in the database, configure the parameter information for each crawler, and configure batch crawlers that can be started;
The flow for starting a crawler is: obtain the crawler's parameter configuration, prepare a start request according to it, and send the request to the crawler service layer module; if the request succeeds, record the crawler's jobid and periodically send requests to the crawler service layer module to obtain the crawler's status until the crawler finishes;
The flow for starting a batch crawler is: obtain the batch crawler's configuration, execute each crawler in turn in the configured order according to the crawler start-up flow, and send a request to persist the crawler results;
The flow for a scheduled batch crawl is: select a batch crawler, set the start time point, and enable the timer; each time the timer fires, the configuration of the batch crawler is fetched and checked to see whether its schedule has been cancelled; if it has been cancelled, the timer is cancelled and a log entry is written; if it has not, the batch crawler start-up flow is executed; if the server restarts, the container is configured so that all previously set timers are started again when the project starts up;
The flow for viewing crawler logs is: send a request to the crawler service layer module, with the crawler's project, spider, page, and pageSize as parameters, to obtain the log information of the corresponding crawler;
The crawler service layer module wraps the operations on the crawling core layer as a web service and provides a JSON API for deploying and controlling crawlers, thereby supporting remote invocation and parallel scaling.
CN201710673166.3A 2017-08-08 2017-08-08 Crawler crawling method and management system for automatic vertical subdivision field Active CN107590188B (en)


Publications (2)

Publication Number Publication Date
CN107590188A true CN107590188A (en) 2018-01-16
CN107590188B CN107590188B (en) 2020-02-14

Family

ID=61043186


Country Status (1)

Country Link
CN (1) CN107590188B (en)



Also Published As

Publication number Publication date
CN107590188B (en) 2020-02-14

CN111882113A (en) Enterprise mobile banking user prediction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 310000 room 507, building 13, No.199, Wensan Road, Xihu District, Hangzhou City, Zhejiang Province

Patentee after: HANGZHOU JZTDATA TECHNOLOGY Co.,Ltd.

Address before: Hangzhou City, Zhejiang province 310030 Xihu District Yaojiang Arphic court room 8-1603

Patentee before: HANGZHOU LINGHAO TECHNOLOGY Co.,Ltd.