CN107590188A

CN107590188A - An automated crawler crawling method for vertical subdivided fields and its management system

Info

Publication number: CN107590188A
Application number: CN201710673166.3A
Authority: CN
Inventors: 郑小林; 张建勇; 林炜华
Original assignee: Hangzhou Ling Hao Technology Co Ltd
Current assignee: HANGZHOU JZTDATA TECHNOLOGY Co.,Ltd.
Priority date: 2017-08-08
Filing date: 2017-08-08
Publication date: 2018-01-16
Anticipated expiration: 2037-08-08
Also published as: CN107590188B

Abstract

The present invention relates to crawler crawling, management and scheduling technology, and aims to provide an automated crawler crawling method for vertical subdivided fields and its management system. The method comprises the following processes: predicting crawler run time; optimizing the scheduling of a batch of crawlers according to the predicted times and the degree of parallelism; and crawling. Compared with the prior art, the invention crawls more efficiently in vertical subdivided fields: it introduces a run-time prediction model based on the characteristics and start-up parameters of vertical subdivided-field crawlers, and combines it with the Longest Processing Time first algorithm to schedule parallel crawlers efficiently, which saves crawling time.

Description

An automated crawler crawling method for vertical subdivided fields and its management system

Technical field

The present invention relates to the technical field of crawler crawling, management and scheduling, and more particularly to an automated crawler crawling method for vertical subdivided fields and its management system.

Background technology

Although the information age of exploding data contains massive amounts of information and data from every industry, the amount of information people can take in and their ability to process it are limited. Valuable time is often consumed by large quantities of redundant, useless information, and obtaining personalized information keeps getting harder. Vertical subdivided fields and personalized recommendation have emerged in response. A vertical subdivided field concentrates attention and services on one specific category, and crawling the data of a vertical subdivided field is an important and basic task for services such as personalized recommendation.

A web crawler is a program that automatically fetches web page content and performs operations such as structuring and persisting that information. Functionally, crawlers can be roughly divided into whole-web crawlers and vertical crawlers. A whole-web crawler mainly serves data collection for search engines; it crawls to a large depth and can capture massive amounts of data efficiently. A vertical crawler targets specific websites or specific web pages; it crawls to a small depth, its targets usually have an obvious structure, and it serves data collection for vertical fields specifically.

A whole-web crawler generally takes an initial URL queue as input, parses URLs layer by layer using a depth-first or breadth-first algorithm, and, combined with frameworks such as distributed platforms, can capture massive amounts of data efficiently. In vertical subdivided fields, however, where specific data features must be collected for multiple targets, the quality of the data obtained by a whole-web crawler is unsatisfactory. Research on the customization, scheduling and management of crawlers for vertical subdivided fields is therefore particularly urgent. Once the number of crawlers rises to a certain order of magnitude, crawler configuration, crawler scheduling, crawler execution, data processing and other links need to be tightly combined to form a complete management and scheduling framework for vertical-field crawlers.

The implementations most similar to the present invention are the following Chinese invention patent applications: "Webpage content extraction method, apparatus and system" (application number: 201510124714.8), "A crawler technology based on web page crawling" (application number: 201310040090.2), "Construction method of a web crawler based on news deduplication" (application number: 200910153588.3), "A plug-in configurable vertical-field web crawler implementation method" (application number: 201510131253.7), and "Distributed crawler task scheduling method based on a weighted round-robin algorithm" (application number: 201410073829.4).

Invention 1 ("Webpage content extraction method, apparatus and system") proposes a web page content extraction method, apparatus and system. The business layer sends the extraction system a request to extract a web page URL; according to that request, the extraction system calls the web crawler system to crawl the original content of the page the URL specifies; the extraction system then extracts from the original page content using an agreed template document as the matching standard, and returns the extracted content to the business layer. That invention makes full use of the back end's ability to crawl web pages and, by parsing the original web page and the extraction template, can extract the content of specified tags from the original page. The scheme adapts to all web page formats when extracting the content of specified tags, improving both the ability to extract original web pages and the flexibility of content extraction. However, that invention places high format requirements on the target website and crawls poorly when the target pages are not uniformly formatted, so it is not suitable for crawling the diverse web pages of a vertical subdivided field.

Invention 2, based on Internet objects set by the user, crawls the corresponding resources from the Internet according to tasks the user creates, rewrites the URLs and stores them, so that Internet information is collected in a targeted way. In its embodiments, to improve system throughput and resource utilization, a task is split into task shards after the task request is received; each task shard contains only one website and is executed in parallel by multiple crawlers. The task scheduling granularity is thus the task shard, which improves the throughput and resource utilization of the system. That invention only returns the web page content requested by the user and the related links, without further processing. When the number of crawler tasks grows while parallel capacity is limited, it cannot make full use of parallelism to reach the optimal crawl time.

The technical concept of Invention 3, the construction method of a web crawler based on news deduplication, is as follows. Chinese word segmentation is applied to the text of a news headline to extract the keywords in the text and the weight of each keyword. The N keywords with the highest weights in the text are chosen empirically to form a set of (keyword, weight) pairs $C = \{(t_1, w_1), (t_2, w_2), (t_3, w_3), \ldots, (t_N, w_N)\}$, where $t_i$ is the i-th keyword and $w_i$ is the weight of the i-th keyword. The elements of set C are sorted by weight $w_i$ in descending order, and the elements of each subset $C_i$ in the news set are likewise sorted by keyword weight in descending order. A threshold is set for the similarity between C and $C_i$, where the similarity is characterized by the number of keywords that occupy the same sorted position in the two sets. Set C is compared with each $C_i$ in the news set to judge whether their similarity is higher than the threshold: if it is, C is considered duplicate news; if it is lower, C is considered non-duplicate news. Compared with that algorithm, simhash takes more factors into account, its complexity is not high, and its accuracy is better.

Invention 4 discloses a plug-in configurable vertical-field web crawler implementation method comprising a crawl stage and an extraction stage. The crawl stage includes a crawl configuration phase and a crawl program execution phase, and the extraction stage includes an extraction configuration phase and an extraction program execution phase. That invention achieves web page crawling and information extraction for multiple fields through configuration, with high accuracy, and addresses the unclear intent and low accuracy of traditional search engines. It likewise customizes crawlers for vertical subdivided fields, but defining crawl parameters and parsing parameters through configuration files is inflexible and gives a poor user experience. In addition, it does not consider crawler parallelism, so its efficiency is low.

Invention 5 proposes a distributed crawler task scheduling method based on a weighted round-robin algorithm, comprising: 1) dividing web crawlers by scale into five classes: single-machine multi-threaded, homogeneous centralized, heterogeneous centralized, small distributed and large distributed; 2) deploying a master-slave architecture; 3) when a crawler node first connects to the master node, the master node gives it an initial weight; 4) the master node repeatedly selects a crawler node according to the weighted round-robin scheduling algorithm and assigns it a URL task waiting to be crawled; 5) when a crawler node finishes crawling a URL task, it returns the result to the master node, and the master node updates that crawler node's weight. That invention updates a crawler node's weight from its most recent task completion time and its number of unfinished tasks, and assigns the next task to crawler nodes whose weight is below the master node's weight. It does not take the estimated duration of crawlers into account during scheduling, so it cannot optimize scheduling to the fullest.

Although the five patents above touch on crawler scheduling strategies and the deduplication of crawled content, when it comes to individually customized crawlers for vertical fields they have the following deficiencies:

1. They are all fairly general-purpose methods; in application scenarios that require a large number of customized crawlers, they cannot predict crawler run time or schedule crawlers efficiently;

2. They do not form a relatively complete system that spans crawler configuration, crawler scheduling, crawler execution and data processing.

Summary of the invention

The main object of the present invention is to overcome the deficiencies of the prior art and to provide an automated crawling, management and scheduling method for vertical subdivided-field crawlers, based on personalized, customized crawlers. To solve the above technical problem, the solution of the present invention is:

An automated crawler crawling method for vertical subdivided fields is provided, comprising the following processes:

1. Predicting crawler run time;

With the parallel channels and the crawler tasks fixed, that is, with the crawler degree of parallelism and the target websites determined, a linear regression model is used to predict the crawler run time of each new crawler task;

2. Optimizing the scheduling of a batch of crawlers according to the predicted times and the degree of parallelism;

Let the crawler degree of parallelism be m, and let there be n independent crawler tasks, where crawler task i has a predicted run time $t_i$. The Longest Processing Time first algorithm (LPT algorithm) is used so that the n crawler tasks are completed by the m parallel channels in as short a time as possible;

The Longest Processing Time first algorithm sorts the n crawler tasks by predicted crawler run time and then assigns the longest remaining crawler task, in turn, to the parallel channel that finishes earliest. As proved in the paper "Bounds on Multiprocessing Timing Anomalies", this greedy strategy achieves an upper bound of $(4/3 - 1/(3m))\,\mathrm{OPT}$, where m is the crawler degree of parallelism and OPT is the optimal time (the theoretically shortest run time);

3. Crawling;

Crawling comprises a crawl core part and a data processing part, which together realize automated crawling of target websites by vertical subdivided-field crawlers;

The crawl core part sends requests to the target website (the website the crawler targets), then parses the returned results and extracts content to obtain structured content;

The data processing part filters and screens the structured content parsed by the crawl core part and persists it to the database.

In the present invention, the linear regression model in process 1 is retrained at regular intervals (the training samples are the input parameters and actual run times of historical crawlers, so a certain amount of crawler run data must be accumulated at the start, after which the model parameters are continually updated). Training the linear regression model comprises the following steps:

Step 1a): Quantify each qualitative variable (target website, category of crawled data) in the crawler's start-up parameters. For each qualitative variable, if it can take k possible values (k is a constant denoting the number of target websites or the number of crawled data categories), it is converted into k-1 dummy 0-1 independent variables. (Crawled data falls into 4 categories: link, short text, long text and picture, so 3 0-1 variables are introduced to represent the qualitative variable "data category"; with k target websites, k-1 0-1 variables are introduced to represent the target website. One is subtracted because k 0-1 independent variables would be perfectly multicollinear, and one assumption of the multiple linear regression model is that no linear relationship exists among the variables, i.e. no variable can be a linear combination of the others.) Adding the quantitative variables (number of requested pages, interval between requests) gives the linear regression feature values, i.e. the quantified input features;

Define the quantified input features of a crawler as $X_i = (x_1, \ldots, x_D)^T$ and the crawler run time as $t_i$; the linear regression model is then:

$$t_i = t(X_i, W) = W^T \phi(X_i) \tag{1.1}$$

where $W = (\omega_0, \ldots, \omega_D)^T$ and $\phi(X_i) = (1, x_1, \ldots, x_D)^T$; D is the number of features in the input feature vector $X_i$, $x_j$ ($j = 1, 2, \ldots, D$) are the independent variables, and $\omega_j$ ($j = 0, 1, \ldots, D$) are the model parameters to be solved;

Step 1b): Using the least squares method, so that the sum of squared differences between the predicted time and the actual time is minimized, define the loss function:

$$E(W) = \sum_{i=1}^{N} \left( t_i - W^T \phi(X_i) \right)^2 \tag{1.2}$$

where N is the number of samples, $t = (t_1, \ldots, t_N)^T$ and $X = (X_1, \ldots, X_N)^T$; $t_i$ ($i = 1, 2, \ldots, N$) is the actual run time of crawler i, $X_i$ ($i = 1, 2, \ldots, N$) is the input feature vector of crawler i, and W is the parameter vector to be solved;

Step 1c): Take the partial derivative of formula (1.2) with respect to W and set it to 0 to obtain the optimal parameter W that minimizes E(W):

$$W = (X^T X)^{-1} X^T t \tag{1.3}$$

where X is the matrix formed from the input feature vectors $X_i$ described above, and t is the vector formed from the actual crawler run times described above;

Once the model parameters W have been trained, they can be used, before crawling starts, to predict the crawler run time of each crawler task about to be run.
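The dummy-variable encoding of step 1a) can be sketched as follows. This is only an illustration under stated assumptions: the website names, the function names `dummy_encode` and `build_features`, and the choice of the first category as the all-zero baseline are hypothetical and do not come from the patent text; only the four data categories and the two quantitative variables do.

```python
# Hypothetical category lists; the text fixes only the four data categories.
DATA_CATEGORIES = ["link", "short_text", "long_text", "picture"]
TARGET_SITES = ["site_a", "site_b", "site_c"]  # assumed example websites


def dummy_encode(value, categories):
    """Encode one qualitative value as k-1 zero/one variables.

    The first category is the baseline and maps to all zeros, which avoids
    the perfect multicollinearity that k indicator columns would cause.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories[1:]]


def build_features(site, category, page_count, request_interval):
    """Concatenate the dummy variables with the two quantitative variables."""
    return (
        dummy_encode(site, TARGET_SITES)
        + dummy_encode(category, DATA_CATEGORIES)
        + [page_count, request_interval]
    )


if __name__ == "__main__":
    # Three sites give 2 dummies, four categories give 3 dummies, plus 2 numbers.
    print(build_features("site_b", "picture", 200, 0.5))
```

With three websites and four data categories, each crawler is described by D = 2 + 3 + 2 = 7 features.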

In the present invention, the Longest Processing Time first algorithm in process 2 specifically means:

If the number of crawler tasks n ≤ the degree of parallelism m, each crawler task is assigned to its own batch program (i.e. parallel program), and the shortest scheduling time equals the maximum predicted run time among the n crawler tasks;

If the number of crawler tasks n > the degree of parallelism m, the following operations are repeated until all n crawler tasks have been assigned:

Step 2a): Build the n crawler tasks into a max-heap H1 keyed by predicted run time;

Step 2b): Build the m parallel channels into a min-heap H2 keyed by the time at which each channel becomes available;

Step 2c): Assign the job at the top of H1 to the channel at the top of H2;

Step 2d): Add the processing time of the job at the top of H1 to the channel at the top of H2, and reinsert that channel into H2;

Step 2e): Delete the top element of heap H1;

Step 2f): Repeat steps 2c) to 2e) until all elements of H1 have been deleted; the heap top element of H2 is then the shortest scheduling time.
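Steps 2a) to 2f) can be sketched with Python's standard `heapq` module. This is a minimal sketch, not the patent's implementation: the function name `lpt_schedule` and the returned assignment list are assumptions made for illustration, and the max-heap H1 is emulated by negating the predicted times because `heapq` only provides a min-heap.

```python
import heapq


def lpt_schedule(predicted_times, m):
    """Assign tasks to m parallel channels, longest predicted time first.

    Returns (makespan, assignment), where assignment[c] lists the indices
    of the tasks given to channel c.
    """
    if m <= 0:
        raise ValueError("m must be positive")
    # H1: max-heap of tasks, emulated by negating the predicted time.
    h1 = [(-t, i) for i, t in enumerate(predicted_times)]
    heapq.heapify(h1)
    # H2: min-heap of channels keyed by the moment they become available.
    h2 = [(0.0, c) for c in range(m)]
    heapq.heapify(h2)
    assignment = [[] for _ in range(m)]
    while h1:
        neg_t, task = heapq.heappop(h1)                  # steps 2c and 2e
        free_at, channel = heapq.heappop(h2)
        assignment[channel].append(task)
        heapq.heappush(h2, (free_at - neg_t, channel))   # step 2d
    # The batch ends when the last channel finishes.
    makespan = max(free_at for free_at, _ in h2)
    return makespan, assignment


if __name__ == "__main__":
    print(lpt_schedule([7, 5, 4, 3, 3, 2], m=2))
```

The sketch reports the finish time of the whole batch as the largest value left in H2, since the top of a min-heap is the channel that becomes free earliest. The same loop also covers the case n ≤ m, where every task ends up on its own channel.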

In the present invention, the crawling of process 3 specifically comprises the following steps:

Step 3a): The engine opens a website, finds the spider that handles that website, and asks the spider for the first URL to crawl;

Step 3b): The engine gets the first URL to crawl from the spider and schedules it in the scheduler as a Request;

Step 3c): The engine asks the scheduler for the next URL to crawl;

Step 3d): The scheduler returns the next URL to crawl to the engine, and the engine forwards the URL to the downloader through the downloader middleware;

Step 3e): Once the page has finished downloading, the downloader generates a Response for the page and sends it to the engine through the downloader middleware;

Step 3f): The engine receives the Response from the downloader and sends it through the spider middleware to the spider for processing;

Step 3g): The spider processes the Response and returns the crawled items and new Requests to the engine;

Step 3h): The engine passes the crawled items to the item pipeline, which further screens, cleans and persists the data, and passes the Requests to the scheduler;

Step 3i): Jump back to step 3b) and repeat until there are no more requests in the scheduler, at which point the engine closes the website.
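The control flow of steps 3a) to 3i) can be illustrated with a toy, single-threaded loop. This is not Scrapy code and omits the middlewares and asynchronous I/O entirely; the in-memory `FAKE_WEB`, the function names and the `seen` set (added here only so that the loop terminates on cyclic links) are all assumptions made for the sketch.

```python
from collections import deque

# Assumed in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", []),
}


def download(url):
    """Downloader stand-in: return a response dict for the URL."""
    text, links = FAKE_WEB.get(url, ("", []))
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Spider stand-in: yield one item plus follow-up requests."""
    yield {"item": {"url": response["url"], "text": response["text"]}}
    for link in response["links"]:
        yield {"request": link}


def run_engine(start_url):
    """Drive the scheduler/downloader/spider/pipeline loop until idle."""
    scheduler = deque([start_url])    # steps 3a-3b: first URL is scheduled
    seen = {start_url}                # assumption: skip URLs already queued
    stored = []                       # item pipeline stand-in
    while scheduler:                  # step 3i: stop when no requests remain
        url = scheduler.popleft()     # steps 3c-3d: next URL from scheduler
        response = download(url)      # step 3e: downloader builds a response
        for out in parse(response):   # steps 3f-3g: spider handles response
            if "item" in out:         # step 3h: items go to the pipeline
                stored.append(out["item"])
            elif out["request"] not in seen:
                seen.add(out["request"])
                scheduler.append(out["request"])  # step 3h: requests go back
    return stored


if __name__ == "__main__":
    for item in run_engine("http://example.com/"):
        print(item)
```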

In the present invention, in process 3, the crawl core part uses the simhash algorithm to convert web page content into a low-dimensional vector for similarity calculation and deduplicates near-duplicate web pages, specifically through the following steps:

Step i): Extract from the database the simhash values of the web page content crawled in the previous n days (the number of days depends on the propagation characteristics of the specific field), and perform the following operations on each newly crawled web page (when its content is extracted):

Word segmentation: extract the text content of the newly crawled web page and segment it to obtain feature words, remove the stop words among the feature words, and then compute the term frequency of each feature word (the ratio of the number of times the word occurs in the text to the total number of words in the text) as its weight;

Quantization: apply a hash operation to each feature word to obtain a 0-1 hash string;

Merging: for the hash sequence value (the 0-1 hash string) computed for each feature word, first multiply it by the term-frequency weight of that feature word, then sum all hash sequence values bit position by bit position to obtain a single sequence string;

Dimensionality reduction: convert the accumulated sequence string into a 0-1 string (each position is set to 1 if it is greater than 0 and to 0 if it is less than 0), which gives the final simhash value of the web page content;

Step ii): Near-duplicate web page filtering: compare the simhash value of the newly crawled web page with the simhash values of existing web pages by computing the Hamming distance between the two hash values. If the Hamming distance is less than 3, the two web pages are considered near-duplicates and the newly crawled page is discarded; otherwise the page is stored in the database and the existing simhash library is updated;

The simhash library holds the simhash values of all crawled web pages.
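The four operations of step i) and the filter of step ii) can be sketched in plain Python as below. Several choices are assumptions rather than details from the text: whitespace tokenization stands in for Chinese word segmentation, the stop-word list is a tiny placeholder, the fingerprint is 64 bits taken from an MD5 digest, and, following the usual simhash convention, a 0 bit contributes the negative weight so that the accumulated sums can fall below zero.

```python
import hashlib
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # assumed tiny stop list
BITS = 64  # assumed fingerprint width


def simhash(text):
    """Compute a BITS-wide simhash using term frequency as the weight."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    if not words:
        return 0
    counts = Counter(words)
    totals = [0.0] * BITS
    for word, count in counts.items():
        weight = count / len(words)                     # term frequency
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:BITS // 8], "big")   # 0-1 hash string
        for bit in range(BITS):
            # Merging: add the weight for a 1 bit, subtract it for a 0 bit.
            totals[bit] += weight if (h >> bit) & 1 else -weight
    # Dimensionality reduction: positive sums become 1, the rest 0.
    return sum(1 << bit for bit in range(BITS) if totals[bit] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(new_hash, known_hashes, threshold=3):
    """True when some stored hash is at a Hamming distance below threshold."""
    return any(hamming_distance(new_hash, h) < threshold for h in known_hashes)


if __name__ == "__main__":
    h1 = simhash("crawler schedules news pages for the news site")
    h2 = simhash("crawler schedules news pages for a news site")
    print(hamming_distance(h1, h2), is_near_duplicate(h2, [h1]))
```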

In the present invention, the crawl core part performs near-duplicate web page deduplication only on news web pages.

A management system for the automated crawler crawling method for vertical subdivided fields is provided, for configuring crawler parameters and for managing and monitoring crawler operation in real time; the management system comprises a crawler crawl core layer and a crawler control management layer;

The crawler crawl core layer is based on the Scrapy crawler application framework. Scrapy uses the Twisted asynchronous networking library to handle network communication and includes various middleware interfaces, so it can meet a variety of requirements;

The crawler crawl core layer comprises the following components:

Engine: controls the data processing flow of the whole system and triggers transactions;

Scheduler: receives requests from the engine and enqueues them, and returns them to the engine when the engine asks for them;

Downloader: fetches web pages and returns the page content to the spider;

Spider: defines the crawling and parsing rules for a specific website (using parsing tools such as XPath);

Item pipeline: processes the items returned by the spider; its main tasks are cleaning, validating and storing data;

Downloader middleware: processes the requests and responses passing between the engine and the downloader;

Spider middleware: processes the spider's response input and request output;

Scheduler middleware: processes the requests and responses sent from the engine to the scheduler;

The crawler control management layer comprises a crawler management back-end module and a crawler service layer module;

The crawler management back-end module uses the MVC model and, by calling the crawler service layer module, provides user-friendly interface-based management of it. The management functions offered through the interface include crawler parameter configuration, batch crawler creation, batch crawler configuration, crawler start-up, batch crawler start-up, crawler timing, crawler log viewing, crawler result viewing and crawler log persistence;

The flow of crawler parameter configuration is: obtain all crawler information provided by the crawler service layer module and store it in the database, configure parameter information for each crawler, and configure the batch crawlers that can be started;

The flow of crawler start-up is: obtain the crawler's parameter configuration, prepare the request to start the crawler according to that configuration, and send the request to the crawler service layer module; if it succeeds, record the crawler's jobid and then periodically send requests to the crawler service layer module to obtain the crawler's state until the crawler finishes;

The flow of batch crawler start-up is: obtain the configuration of the batch crawler, run each crawler in turn, in the configured order, following the crawler start-up flow, and send a request to persist the crawler results;

The flow of timed batch crawling is: select a batch crawler, set the start time point, and turn on timed crawling. Each time the timer fires, obtain the configuration of the batch crawler and check whether its timing has been cancelled: if it has, cancel the timer and write a log entry; if it has not, run the batch crawler start-up flow. If the server restarts, all previously set timers are started again when the project starts, by configuring the container;

The flow of crawler log viewing is: send a request to the crawler service layer module with the crawler's project, spider, page and pageSize as parameters to obtain the log information of the corresponding crawler;

The crawler service layer module wraps the operations on crawlers in the crawl core layer (crawler start-up, pause, run monitoring, log viewing, crawl result viewing, persistence, etc.) as a web service and provides a JSON API for deploying and controlling crawlers, thereby supporting remote invocation and parallel scaling (that is, unifying the various crawler operations in a web service makes the whole crawler system easier to manage and schedule).
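The crawler start-up flow (send a start request, record the jobid, poll the state until the crawler finishes) can be illustrated against a stand-in service. Everything here is hypothetical: the text does not specify the JSON API, so the `action`, `state` and `params` fields, the `FakeCrawlerService` class and the `start_and_wait` function are invented for the sketch, and no network call is made.

```python
import json
import time


class FakeCrawlerService:
    """In-process stand-in for the crawler service layer; not a real API."""

    def __init__(self, polls_until_done=2):
        self._remaining = {}
        self._polls = polls_until_done
        self._next_id = 1

    def handle(self, request_json):
        """Accept a JSON request string and return a JSON reply string."""
        req = json.loads(request_json)
        if req["action"] == "start":
            jobid = f"job-{self._next_id}"
            self._next_id += 1
            self._remaining[jobid] = self._polls
            return json.dumps({"status": "ok", "jobid": jobid})
        if req["action"] == "status":
            left = self._remaining[req["jobid"]]
            self._remaining[req["jobid"]] = max(0, left - 1)
            state = "finished" if left == 0 else "running"
            return json.dumps({"status": "ok", "state": state})
        return json.dumps({"status": "error"})


def start_and_wait(service, project, spider, params, poll_seconds=0.0):
    """Start one crawler, record its jobid, then poll until it finishes."""
    reply = json.loads(service.handle(json.dumps(
        {"action": "start", "project": project, "spider": spider, "params": params}
    )))
    if reply["status"] != "ok":
        raise RuntimeError("crawler failed to start")
    jobid = reply["jobid"]            # recorded only after a successful start
    polls = 0
    while True:
        polls += 1
        status = json.loads(service.handle(json.dumps(
            {"action": "status", "jobid": jobid}
        )))
        if status["state"] == "finished":
            return jobid, polls
        time.sleep(poll_seconds)      # periodic polling interval


if __name__ == "__main__":
    print(start_and_wait(FakeCrawlerService(), "news", "site_a", {"pages": 10}))
```

In a real deployment the `service.handle` call would presumably be replaced by an HTTP request to the web service, and a batch start-up would call `start_and_wait` once per crawler in the configured order.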

Compared with the prior art, the beneficial effects of the present invention are:

Crawling in vertical subdivided fields is more efficient than in the prior art: a run-time prediction model based on the characteristics and start-up parameters of vertical subdivided-field crawlers is introduced and combined with the Longest Processing Time first algorithm to schedule parallel crawlers efficiently, saving crawling time.

The whole flow of configuring, managing, scheduling and crawling with vertical subdivided-field crawlers, together with data processing, is combined into one automated, efficient system that is easy to manage (real-time monitoring of crawler state, timed crawling and data processing) and highly extensible (crawlers are quick and convenient to configure).

Brief description of the drawings

Fig. 1 is the overall flow of the crawler scheduling and running algorithm.

Fig. 2 is the framework diagram of the crawler management system.

Fig. 3 is the algorithm flow chart of the crawler core layer.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings and embodiments:

As shown in Fig. 2, a management system for the automated crawler crawling method for vertical subdivided fields comprises a crawler crawl core layer and a crawler control management layer, and is used to configure crawler parameters and to manage and monitor crawler operation in real time.

The crawler crawl core layer is based on the Scrapy crawler application framework. Scrapy uses the Twisted asynchronous networking library to handle network communication, has a clear architecture and includes various middleware interfaces, so it can flexibly meet a variety of requirements. Fig. 3 is the algorithm flow chart of the crawler core layer.

The crawler crawl core layer comprises the following components:

Spider middleware: processes the spider's response input and request output;

The crawler control management layer comprises a crawler management back-end module and a crawler service layer module.

The crawler management back-end module uses the MVC model and, by calling the crawler service layer module, provides user-friendly interface-based management of it. The management functions offered through the interface include crawler parameter configuration, batch crawler creation, batch crawler configuration, crawler start-up, batch crawler start-up, crawler timing, crawler log viewing, crawler result viewing and crawler log persistence.

The crawler service layer module wraps operations on crawlers in the crawl core layer, such as start-up, pause, run monitoring, log viewing, crawl result viewing and persistence, as a web service and provides a JSON API for deploying and controlling crawlers, thereby supporting remote invocation and parallel scaling; that is, unifying the various crawler operations in a web service makes the whole crawler system easier to manage and schedule.

As shown in Fig. 1, an automated crawler crawling method for vertical subdivided fields comprises the following processes:

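1. Predicting crawler run time;

2. Optimizing the scheduling of a batch of crawlers according to the predicted times and the degree of parallelism;

3. Crawling.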

Process one：

Vertical subdivided-field crawlers have a great many target website sources, and news-type information in particular is updated very frequently, so a large number of independent crawler tasks are started every day. Because their target websites and crawl parameters differ, the independent crawlers also differ greatly in crawl time. With the parallel channels and the crawler tasks fixed, that is, with the crawler degree of parallelism and the target websites determined, accurately predicting the run time of each new crawler task with a linear regression model and optimizing the crawler scheduling order accordingly can greatly improve crawling efficiency and save crawling time.

The run time of each independent crawler is affected mainly by the target website and the crawler parameters, so a multiple linear regression model for time prediction, built from completed crawler tasks, can effectively predict the approximate run time of a new task.

Training the linear regression model comprises the following steps:

Step 1a): Quantify each qualitative variable in the parameters: a qualitative variable with k values (for example, k categories of information) is converted into k-1 dummy 0-1 independent variables. Adding the quantitative variables (number of requested pages, interval between requests) gives the linear regression feature values;

where k is a constant denoting the number of target websites or the number of crawled data categories;

Define the quantified input features of a crawler as $X_i = (x_1, \ldots, x_D)^T$ and the crawler run time as $t_i$; the linear regression model is then:

$$t_i = t(X_i, W) = W^T \phi(X_i) \tag{1.1}$$

$$W = (X^T X)^{-1} X^T t \tag{1.3}$$

where X is the matrix formed from the input feature vectors $X_i$ described above, and t is the vector formed from the actual crawler run times described above.
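Formulas (1.1) and (1.3) can be sketched in plain Python without external libraries. The sketch makes two assumptions that the text leaves implicit: each row of X is extended with a leading 1 so that it matches $\phi(X_i)$ and the intercept $\omega_0$, and the normal equations $(X^T X) W = X^T t$ are solved by Gaussian elimination instead of forming the inverse explicitly. The function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a*w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):          # back substitution
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def fit(features, times):
    """Return W solving (X^T X) W = X^T t, with a leading 1 added per row."""
    x = [[1.0] + list(row) for row in features]
    d = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(d)] for i in range(d)]
    xtt = [sum(r[i] * t for r, t in zip(x, times)) for i in range(d)]
    return solve_linear_system(xtx, xtt)


def predict(w, row):
    """Predicted run time W^T phi(X) for one feature vector."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))


if __name__ == "__main__":
    # Made-up history: one 0-1 dummy variable and the number of requested pages.
    history = [[0, 10], [1, 10], [0, 20], [1, 30]]
    run_times = [25, 28, 45, 68]
    w = fit(history, run_times)
    print(w, predict(w, [1, 20]))
```

The singularity check is where the k-1 dummy-variable rule matters: with k indicator columns plus the constant 1, $X^T X$ would not be invertible.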

Process two：

Let the crawler degree of parallelism be m, and let there be n independent crawler tasks, where crawler task i has a predicted run time $t_i$. The Longest Processing Time first algorithm (LPT algorithm) is used so that the n crawler tasks are completed by the m parallel channels in as short a time as possible.

The Longest Processing Time first algorithm sorts the n crawler tasks by predicted crawler run time and then assigns the longest remaining crawler task, in turn, to the parallel channel that finishes earliest; this greedy strategy achieves an upper bound of $(4/3 - 1/(3m))\,\mathrm{OPT}$. Specifically:

If the number of crawler tasks n ≤ the degree of parallelism m, crawler task i is assigned to batch program i, and the shortest scheduling time equals the maximum predicted run time among the n crawler tasks;

Step 2c): Assign the job at the top of H1 to the channel at the top of H2;

Step 2e): Delete the top element of heap H1;
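To make the bound concrete, here is a short worked example; the task times are chosen purely for illustration. For $m = 2$ channels and predicted times $\{3, 3, 2, 2, 2\}$, the algorithm places the two tasks of length 3 on separate channels, then the tasks of length 2 alternately on the channel that frees up first, so the channels finish at 7 and 5:

```latex
\begin{align*}
T_{\mathrm{LPT}} &= \max(3 + 2 + 2,\; 3 + 2) = 7, \\
\mathrm{OPT} &= \max(3 + 3,\; 2 + 2 + 2) = 6, \\
\frac{T_{\mathrm{LPT}}}{\mathrm{OPT}} &= \frac{7}{6}
  = \frac{4}{3} - \frac{1}{3 \cdot 2}.
\end{align*}
```

This instance meets the bound exactly for $m = 2$. For $m = 4$ the factor is $4/3 - 1/12 = 5/4$, so the schedule is guaranteed to take at most 25% longer than the optimum.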

Process three：

Crawling comprises a crawl core part and a data processing part, which together realize automated crawling of target websites by vertical subdivided-field crawlers. The crawl core part sends requests to the crawler's target website, then parses the returned results and extracts content to obtain structured content. The data processing part filters and screens the structured content parsed by the crawl core part and persists it to the database.

The specific steps of crawling are as follows:

Step 3c): The engine asks the scheduler for the next URL to crawl;

Due to the present invention is directed the customization reptile in high perpendicular subdivision field, theme is similar, and internet is deposited at present In a large amount of mirror images, content duplication, embedded advertisement, the webpage changed on a small quantity.Especially news content, a highlight may Repeating to issue by multiple websites within these few days.For the webpage largely repeated, filtered if not done by detection, on the one hand meeting So that data redundancy, takes up space, on the other hand data are also resulted in during follow-up data are used and established such as search engine Repeat.Therefore it is a necessary job (when especially crawling big Text news content) to add nearly similar web page duplicate removal processing.

The main idea of deduplication is to compare the similarity of two web pages' contents against a similarity threshold; if the similarity exceeds the threshold, the page is considered a near-duplicate and is discarded. The key to deduplication is therefore the similarity calculation of web page contents. The crawl core part of the present invention uses the simhash algorithm to convert web page content into a low-dimensional vector for similarity calculation and removes near-duplicate web pages, specifically through the following steps:

Step i): For the web page contents crawled during the previous n days (stored in the database; the number of days depends on the propagation characteristics of the specific field) and for the newly crawled web pages, perform the following operations on each web page:

Word segmentation: Extract the text content of the web page, segment it to obtain feature words, remove the stop words from the feature words, and then compute the tf-idf of each feature word as its weight;

Merge: Multiply the hash sequence value computed for each feature word by the term-frequency weight of that feature word, then sum all the weighted hash sequence values bit position by bit position to form a single sequence string;

Step ii): Near-duplicate web page filtering: Compare the simhash value of the newly crawled web page with the simhash values of the existing web pages by computing the Hamming distance between the two hash values. If the Hamming distance is less than 3, the two web pages are considered near-duplicates and the newly crawled web page is discarded; otherwise, the web page is stored in the database and the existing simhash store is updated;

The existing simhash store holds the simhash values of all web pages crawled during the previous n days.

Because the propagation of news web pages is time-sensitive, reprints of newly published news generally do not appear long after the original. The number of web pages whose simhash values must be compared is therefore small, and, given the efficiency of the simhash algorithm, deduplication has little impact on crawler efficiency in terms of time and space complexity.
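The deduplication steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the patented implementation: it uses the common simhash formulation in which each hash bit adds the weight when it is 1 and subtracts it when it is 0, takes plain term frequency as the weight, fixes the fingerprint at 64 bits, and derives per-word hashes from SHA-256. Word segmentation and stop-word removal are assumed to have happened upstream, and the function names are hypothetical.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint length; an assumption, the text does not fix it

def simhash(feature_words):
    """Fingerprint a page from its feature words (stop words removed)."""
    weights = Counter(feature_words)          # term frequency as weight
    acc = [0] * BITS
    for word, weight in weights.items():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:BITS // 8], "big")
        for bit in range(BITS):
            # Merge: add the weight for a 1 bit, subtract it for a 0 bit.
            acc[bit] += weight if (h >> bit) & 1 else -weight
    # Dimensionality reduction: positive sums become 1, all others 0.
    return sum(1 << bit for bit in range(BITS) if acc[bit] > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(new_hash, known_hashes, threshold=3):
    """True if any stored fingerprint is within the distance threshold."""
    return any(hamming_distance(new_hash, h) < threshold
               for h in known_hashes)
```

In use, `known_hashes` would be the fingerprints of the pages crawled in the previous n days; a new page is stored, and its fingerprint added to that collection, only when `is_near_duplicate` returns False.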

Finally, it should be noted that the above are only specific embodiments of the present invention. Obviously, the present invention is not limited to the above embodiments, and many variations are possible. All variations that a person of ordinary skill in the art can directly derive or infer from the disclosure of the present invention shall be considered within the protection scope of the present invention.

Claims

1. An automated crawler crawling method for vertical subdivision fields, characterized in that it comprises the following processes:

First, crawler run time prediction;

When the parallel channels and crawler tasks are fixed, i.e., when the crawler's degree of parallelism and the target websites are fixed, the run time of each new crawler task is predicted using a linear regression model;

Let the crawler's degree of parallelism be m and let there be n independent crawler tasks, where crawler task i has a corresponding predicted run time; the longest-processing-time-first algorithm is used so that the n crawler tasks are completed by the m parallel channels in as short a time as possible;

The longest-processing-time-first algorithm sorts the n crawler tasks by predicted run time and then assigns the longest remaining task, in turn, to the parallel channel with the earliest end time; this greedy strategy achieves an upper bound of (4/3 - 1/(3m))·OPT, where m is the crawler's degree of parallelism and OPT is the optimal time;

Third, crawler crawling;

Crawler crawling comprises a crawl core part and a data processing part, which together implement the crawling of target websites by the automated vertical subdivision field crawler;

The crawl core part sends requests to the target website, then parses the returned results and extracts their content to obtain structured content;

The data processing part filters and screens the structured content parsed by the crawl core part and persists it to the database.

2. The automated crawler crawling method for vertical subdivision fields according to claim 1, characterized in that the linear regression model in process one is retrained at regular intervals, and the training of the linear regression model specifically comprises the following steps:

Step 1a): Quantify each qualitative variable in the crawler's start-up parameters: a qualitative variable with k possible values is converted into k-1 dummy variables, each taking the value 0 or 1; together with the quantitative variables, these give the linear regression feature values, i.e., the quantized input features;

Define the quantized input feature vector of a crawler as $X_i = (x_1, \ldots, x_D)^T$ and the crawler run time as $t_i$; the linear regression model is then:

$$t_i = t(X_i, W) = W^T \phi(X_i) \tag{1.1}$$

where $W = (\omega_0, \ldots, \omega_D)^T$ and $\phi(X_i) = (1, x_1, \ldots, x_D)^T$; D is the number of features in the input feature vector $X_i$, $x_i$ (i = 1, 2, ..., D) are the independent variables, and $\omega_i$ (i = 0, 1, ..., D) are the model parameters to be solved;

Step 1b): Use the least squares method to minimize the sum of squared differences between the predicted time and the actual time, and define the loss function:

$$E(W) = \frac{1}{2}\sum_{n=1}^{N}\left\{t_n - W^T\phi(X_n)\right\}^2 = \frac{1}{2N}(t - XW)^T(t - XW) \tag{1.2}$$

where N is the number of samples, $t = (t_1, \ldots, t_N)^T$, and $X = (X_1, \ldots, X_N)^T$; $t_i$ (i = 1, 2, ..., N) is the actual run time of crawler i, $X_i$ (i = 1, 2, ..., N) is the input feature vector of crawler i, and W is the parameter vector to be solved;

$$W = (X^T X)^{-1} X^T t \tag{1.3}$$

where X is the matrix formed from the input feature vectors $X_i$ described above, and t is the vector formed from the actual crawler run times described above;

Once the model parameters W have been trained, they can be used, before crawling begins, to predict the run time of each crawler task that is about to run.

3. The automated crawler crawling method for vertical subdivision fields according to claim 1, characterized in that, in process two, the longest-processing-time-first algorithm specifically comprises:

If the number of crawler tasks n ≤ the degree of parallelism m, each crawler task is assigned to its own batch program (i.e., parallel program), and the shortest scheduling time equals the maximum predicted run time among the n crawler tasks;

Step 2c): Assign the job at the top of heap H1 to the channel at the top of heap H2;

Step 2e): Delete the top element of heap H1;

Step 2f): Repeat step 2c) to step 2e) until all elements in H1 have been deleted; the top element of heap H2 is then the shortest scheduling time.

4. The automated crawler crawling method for vertical subdivision fields according to claim 1, characterized in that the crawler crawling of process three specifically comprises the following steps:

Step 3a): The engine opens a website, finds the spider that handles that website, and asks the spider for the first URL to crawl;

Step 3b): The engine obtains the first URL to crawl from the spider and schedules it in the scheduler as a Request;

Step 3c): The engine requests the next URL to crawl from the scheduler;

Step 3d): The scheduler returns the next URL to crawl to the engine, and the engine forwards the URL to the downloader through the downloader middleware;

Step 3e): Once the page has been downloaded, the downloader generates a Response for the page and sends it to the engine through the downloader middleware;

Step 3f): The engine receives the Response from the downloader and sends it to the spider for processing through the spider middleware;

Step 3h): The engine sends the crawled items to the item pipeline for further data screening, cleaning, and persistence, and sends the Requests to the scheduler;

5. The automated crawler crawling method for vertical subdivision fields according to claim 1, characterized in that, in process three, the crawl core part uses the simhash algorithm to convert web page content into a low-dimensional vector for similarity calculation and removes near-duplicate web pages, specifically through the following steps:

Step i): Extract from the database the simhash values of the web page contents crawled during the previous n days, and perform the following operations on each newly crawled web page:

Word segmentation: Extract the text content of the newly crawled web page, segment it to obtain feature words, remove the stop words from the feature words, and then compute the term frequency of each feature word as its weight;

Dimensionality reduction: Convert the accumulated sequence string into a 0-1 string, which gives the final simhash value of the web page content;

Step ii): Near-duplicate web page filtering: Compare the simhash value of the newly crawled web page with the simhash values of the existing web pages by computing the Hamming distance between the two hash values. If the Hamming distance is less than 3, the two web pages are considered near-duplicates and the newly crawled web page is discarded; otherwise, the web page is stored in the database and the existing simhash store is updated;

6. The automated crawler crawling method for vertical subdivision fields according to claim 5, characterized in that the crawl core part performs near-duplicate web page removal only on news web pages.

7. A management system for the automated crawler crawling method for vertical subdivision fields according to claim 1, used for parameter configuration, operation management, and real-time monitoring of crawlers, characterized in that the management system comprises a crawler crawling core layer and a crawler control management layer;

The crawler crawling core layer is based on the Scrapy crawler application framework; Scrapy uses the Twisted asynchronous networking library to handle network communication and includes various middleware interfaces that can meet a variety of requirements:

The crawler crawling core layer specifically comprises the following components:

spider: defines the crawling and parsing rules for a specific website;

Spider middlewares: handle the spider's response input and request output;

The crawler management back-end module uses the MVC model and, by calling the crawler service layer module, provides user-friendly interface-based management of the crawler service layer module; the interface-based management services include crawler parameter configuration, batch crawler creation, batch crawler configuration, crawler startup, batch crawler startup, crawler timing, crawler log viewing, crawler result viewing, and crawler log persistence;

The crawler parameter configuration flow is: obtain all crawler information provided by the crawler service layer module and store it in the database, configure parameter information for each crawler, and configure startable batch crawlers;

The crawler startup flow is: obtain the parameter configuration of the crawler, prepare the crawler start request according to the parameter configuration, and send the start request to the crawler service layer module; if the request succeeds, record the jobid of the crawler and periodically send requests to the crawler service layer module to obtain the crawler's state until the crawler terminates;

The batch crawler startup flow is: obtain the configuration of the batch crawler, execute each crawler in turn according to the crawler startup flow in the configured order of the crawlers, and send a request to persist the crawler results;

The timed batch crawler flow is: select a batch crawler, set the start time point, and enable the timed crawler; each time the timer fires, obtain the configuration of the batch crawler and determine whether its timed task has been cancelled; if it has been cancelled, cancel the timer and record a log entry; if it has not been cancelled, execute the batch crawler startup flow; if the server restarts, all previously set timed tasks are started when the project starts, by means of container configuration;

The crawler log viewing flow is: send a request to the crawler service layer module with the crawler's project, spider, page, and pageSize as parameters to obtain the log information of the corresponding crawler;

The crawler service layer module encapsulates the operations of the crawler crawling core layer as a web service and provides a JSON API for deploying and controlling crawlers, thereby supporting remote invocation and parallel scaling.