CN106547824B - Crawling path planning method and device - Google Patents

Crawling path planning method and device

Info

Publication number
CN106547824B
Authority
CN
China
Prior art keywords
page
path
crawled
feature
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610867888.8A
Other languages
Chinese (zh)
Other versions
CN106547824A (en)
Inventor
张煜苒
帅伟良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201610867888.8A
Publication of CN106547824A
Application granted
Publication of CN106547824B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose a crawling path planning method and device. The method includes: crawling, according to a preset crawling strategy and starting from a preset portal page, the pages of the website to which the preset portal page belongs; collecting the page features of each crawled page and recording the path examples that reach each crawled page from the preset portal page; selecting, according to the recorded path examples and the page features of each crawled page, the path examples that reach pages similar to a preset target page; and performing path planning according to the selected path examples and the page features of each page in them, to generate a path planning result. Applying the embodiments improves the efficiency of path planning and reduces the crawling burden.

Description

Crawling path planning method and device
Technical field
The present invention relates to the field of Internet technology, and in particular to a crawling path planning method and device.
Background
A web crawler automatically fetches web pages: it downloads pages from the World Wide Web for a search engine and is an important component of one. Web crawlers have become the main means of acquiring massive amounts of data from the Internet, and many excellent open-source crawler frameworks have appeared. Web crawlers fall into two main classes. The first is the search crawler used by search engines, whose crawl target is the entire Internet. The second is the directed crawler, whose crawl target is a specific subset of all websites, or even a single website. A directed crawler that crawls pages from a single website is currently implemented in one of two ways. In the first, a developer defines and plans an accurate, executable crawl path result, and the directed crawler crawls according to it. In the second, no accurate, executable crawl path result is defined, and the whole site is crawled directly.
These two implementations have the following problems:
In the first approach, planning an accurate crawl path result requires a developer to analyze and study the web page code, which is complex, so efficiency is low.
The second approach reduces the developer's workload, but because a website contains redundant pages, crawling the whole site directly downloads too many useless pages and adds to the crawling burden.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a crawling path planning method and device that improve the efficiency of path planning and reduce the crawling burden.
To achieve this purpose, the embodiments of the invention disclose a crawling path planning method and device. The technical solution is as follows:
In a first aspect, a crawling path planning method provided by an embodiment of the present invention comprises:
crawling, according to a preset crawling strategy and starting from a preset portal page, the pages of the website to which the preset portal page belongs;
collecting the page features of each crawled page, and recording the path examples that reach each crawled page from the preset portal page;
selecting, according to the recorded path examples and the page features of each crawled page, the path examples that reach pages similar to a preset target page;
performing path planning according to the selected path examples and the page features of each page in the selected path examples, and generating a path planning result.
Preferably, selecting, according to the recorded path examples and the page features of each crawled page, the path examples that reach pages similar to the preset target page comprises:
classifying the crawled pages according to the page features of each crawled page;
determining, according to the page classification result, the category to which the preset target page belongs;
selecting, from the recorded path examples, the path examples that reach the pages of the determined category.
Preferably, performing path planning according to the selected path examples and the page features of each page in the selected path examples, and generating a path planning result, comprises:
determining, according to the page classification result, the category to which each page in the selected path examples belongs;
generating, according to the selected path examples and the determined categories, effective paths with categories as nodes;
mining, according to a process mining algorithm and the page classification result, from the obtained effective paths a crawl path graph that meets a preset rule and a description file of the crawl path graph, wherein the description file includes the relationships between the category nodes in the crawl path graph, and the page features of each category node are obtained from the page features of the pages corresponding to that category node;
generating, according to the description file and the page features of each category node, the extraction relationships between the category nodes in the crawl path graph, wherein the page features of each category node are obtained from the page features of the pages corresponding to that category node;
generating a path planning result according to the extraction relationships, wherein the path planning result includes the extraction relationships described with syntax rules.
Preferably, where the page features include the page link and the page source code structure, classifying the crawled pages according to the page features of each crawled page comprises:
calculating, for every two pages among the crawled pages, a first similarity of their page links and a second similarity of their page source code structures;
computing a weighted sum of the first similarity and the second similarity according to preset weights, to obtain a comprehensive similarity;
classifying the crawled pages according to the obtained comprehensive similarity and a preset similarity threshold.
Preferably, the preset crawling strategy is specifically a breadth-first crawling strategy.
In a second aspect, a crawling path planning device provided by an embodiment of the present invention comprises:
a crawling module, configured to crawl, according to a preset crawling strategy and starting from a preset portal page, the pages of the website to which the preset portal page belongs;
a processing module, configured to collect the page features of each crawled page and record the path examples that reach each crawled page from the preset portal page;
a selection module, configured to select, according to the recorded path examples and the page features of each crawled page, the path examples that reach pages similar to a preset target page;
a planning module, configured to perform path planning according to the selected path examples and the page features of each page in the selected path examples, and generate a path planning result.
Preferably, the selection module comprises:
a classification unit, configured to classify the crawled pages according to the page features of each crawled page;
a first determination unit, configured to determine, according to the page classification result, the category to which the preset target page belongs;
a selection unit, configured to select, from the recorded path examples, the path examples that reach the pages of the determined category.
Preferably, the planning module comprises:
a second determination unit, configured to determine, according to the page classification result, the category to which each page in the selected path examples belongs;
a first generation unit, configured to generate, according to the selected path examples and the determined categories, effective paths with categories as nodes;
a mining unit, configured to mine, according to a process mining algorithm and the page classification result, from the obtained effective paths a crawl path graph that meets a preset rule and a description file of the crawl path graph, wherein the description file includes the relationships between the category nodes in the crawl path graph, and the page features of each category node are obtained from the page features of the pages corresponding to that category node;
a second generation unit, configured to generate, according to the description file and the page features of each category node, the extraction relationships between the category nodes in the crawl path graph, wherein the page features of each category node are obtained from the page features of the pages corresponding to that category node;
a third generation unit, configured to generate a path planning result according to the extraction relationships, wherein the path planning result includes the extraction relationships described with syntax rules.
Preferably, the page features include the page link and the page source code structure, and the classification unit comprises:
a calculation subunit, configured to calculate, for every two pages among the crawled pages, a first similarity of their page links and a second similarity of their page source code structures;
an obtaining subunit, configured to compute a weighted sum of the first similarity and the second similarity according to preset weights, to obtain a comprehensive similarity;
a classification subunit, configured to classify the crawled pages according to the obtained comprehensive similarity and a preset similarity threshold.
Preferably, the preset crawling strategy is specifically a breadth-first crawling strategy.
With the embodiment of the present invention, the pages of the website to which the preset portal page belongs are crawled starting from the preset portal page; the path examples that reach each crawled page from the preset portal page are recorded, and the page features of each crawled page are collected. This concrete crawling operation yields path examples between concrete pages and completes the sampling. According to the recorded path examples and the page features of each crawled page, the path examples that reach pages similar to the preset target page are selected; path planning is performed according to the selected path examples and the page features of the corresponding pages, and a path planning result is generated. The process involves no developer and does not require a developer to study complex web page code, so path planning is more efficient. Because the samples on which path planning relies are targeted, fewer unnecessary path planning results are generated and the comprehensiveness of the crawl result is ensured to a certain extent; compared with the whole-site crawling of the prior art, the crawling burden is greatly reduced.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a crawling path planning method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a crawling path planning method provided by another embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a crawling path planning device provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a crawling path planning device provided by another embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To improve the efficiency of path planning and reduce the crawling burden, the embodiments of the present invention provide a crawling path planning method and device.
The crawling path planning method provided by an embodiment of the present invention is introduced first.
Referring to Fig. 1, which is a schematic flowchart of a crawling path planning method provided by an embodiment of the present invention, the method comprises the following steps:
S101: according to a preset crawling strategy and starting from a preset portal page, crawl the pages of the website to which the preset portal page belongs;
Crawling strategies include the breadth-first crawling strategy and the depth-first crawling strategy. The basic idea of the breadth-first strategy is to crawl pages by the depth of their directory level: pages at shallower levels are crawled first, and only when all pages at one level have been crawled does the crawler move down to the next level. This strategy effectively controls the crawl depth and avoids the problem of never finishing when an infinitely deep branch is encountered. The basic idea of the depth-first strategy is to follow links in order of increasing depth, visiting the next-level page link each time until it can go no deeper; this strategy easily wastes resources. In this step, the breadth-first crawling strategy is preferred.
With the breadth-first strategy, in a specific implementation, all sub-links of a downloaded page are extracted and the pages they point to are downloaded; this continues until crawling ends. When crawling ends can be set by a time limit: if many machines are available for crawling, crawling can be set to stop after a shorter time; if few machines are available, after a longer time. For example, with 5 machines crawling may be limited to 3 hours, and with 3 machines to 5 hours.
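The time-limited breadth-first crawl described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `crawl_breadth_first`, the `get_links` callback and the in-memory `SITE` dictionary are hypothetical stand-ins for real page downloads, and for brevity the sketch keeps only the first path that reaches each page, whereas the method records every path example.

```python
from collections import deque
import time

def crawl_breadth_first(entry, get_links, time_limit_s=1.0, max_pages=1000):
    """Breadth-first crawl from `entry`; stops when the time limit is reached."""
    deadline = time.monotonic() + time_limit_s
    paths = {entry: [entry]}          # page -> first path that reached it
    queue = deque([entry])
    while queue and len(paths) < max_pages and time.monotonic() < deadline:
        page = queue.popleft()
        for link in get_links(page):  # sub-links extracted from the downloaded page
            if link not in paths:     # each page is queued only once
                paths[link] = paths[page] + [link]
                queue.append(link)
    return paths

# Hypothetical in-memory site standing in for real downloads.
SITE = {"s": ["a", "b1"], "a": ["b1", "b2"], "b1": ["e1"], "b2": ["e1"], "e1": []}
paths = crawl_breadth_first("s", lambda page: SITE.get(page, []))
```

Because pages are taken from a first-in, first-out queue, every page at one level is processed before any page at the next level, which is what bounds the crawl depth.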
To obtain a more accurate path planning result, and thus to crawl more pages similar to the preset target page, the preset portal page is chosen according to the crawling requirement and is usually the homepage of a website. For example, to crawl from the homepage of the Tencent website in order to obtain all movie information, the homepage is given as the page link "http://v.qq.com/".
S102: collect the page features of each crawled page, and record the path examples that reach each crawled page from the preset portal page;
In a specific implementation, the collected page features of each crawled page may include the page link and the page source code structure. Examples of page links are: http://v.qq.com/x/movielist/?cate=10001&offset=0&sort=5&pay=-1 and http://v.qq.com/cover/3/3ew17ydbfgmy79r.html.
The page source code structure refers to the HyperText Markup Language tags, generally called html tags; it is represented as the string concatenation of all html tags of the page.
Obtaining the page link and the html tag string of a concrete page belongs to the prior art and is not described here.
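As an illustration of the page source code structure feature, the sketch below concatenates the opening html tags of a page in document order using Python's standard `html.parser` module. The class and function names and the `/` separator are assumptions made for this example; the text only states that the tags are concatenated into a string.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)  # tag names arrive lower-cased

def source_structure(html):
    """Return the page structure as a concatenation of its html tag names."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return "/".join(collector.tags)  # separator is an assumption

structure = source_structure(
    "<html><body><ul><li><a href='/x'>title</a></li></ul></body></html>"
)
```

Two pages generated from the same template would produce nearly identical strings, which is what makes this feature usable for similarity comparison.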
In this step, the concrete path examples that were crawled are recorded. The nodes in a path example are concrete pages, and a single page node can be represented by its page link. For example, for a breadth-first crawl starting from portal page s, the recorded path examples are s->b1->c1->e1, s->b2->c2->e2, s->b2->c3->e3, s->a->b2->e1, s->a->b1->e1->f1, s->m1->n1->e2->f2 and s->m2->n2->e2, where the letters stand for page links for ease of expression. Recording a whole path means that the paths reaching every page on it are also recorded. For example, recording the path example s->b1->c1->e1 implicitly records the paths s->b1, s->b1->c1 and s->b1->c1->e1.
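The implicit recording of sub-paths can be made concrete with a small sketch. The helper name `implied_paths` and the `->` string encoding of a path example are assumptions for illustration only.

```python
def implied_paths(path_example):
    """List every path from the portal page that a recorded path example implies."""
    pages = path_example.split("->")
    # A path needs at least the portal page and one further page.
    return ["->".join(pages[:i]) for i in range(2, len(pages) + 1)]

prefixes = implied_paths("s->b1->c1->e1")
```

Here `prefixes` holds the three paths named in the example above, so only the longest path needs to be stored.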
S103: according to the recorded path examples and the page features of each crawled page, select the path examples that reach pages similar to the preset target page;
It should be noted that the target page is a specific web page, for example http://v.qq.com/x/cover/3ew17ydbfgmy79r/x002159scet.html. Pages similar to the preset target page are very likely the pages one wants to crawl, because they carry more target-related content.
In practice, so that the crawl path graph eventually mined has more comprehensive coverage, preferably all path examples that reach pages similar to the preset target page are selected.
S104: perform path planning according to the selected path examples and the page features of each page in the selected path examples, and generate a path planning result.
With the embodiment shown in Fig. 1, the pages of the website to which the preset portal page belongs are crawled starting from the preset portal page; the path examples that reach each crawled page from the preset portal page are recorded, and the page features of each crawled page are collected. This concrete crawling operation yields path examples between concrete pages and completes the sampling. According to the recorded path examples and the page features of each crawled page, the path examples that reach pages similar to the preset target page are selected; path planning is performed according to the selected path examples and the page features of the corresponding pages, and a path planning result is generated. The process involves no developer and does not require a developer to study complex web page code, so path planning is more efficient. Because the samples on which path planning relies are targeted, fewer unnecessary path planning results are generated and the comprehensiveness of the crawl result is ensured to a certain extent; compared with the whole-site crawling of the prior art, the crawling burden is greatly reduced.
In another embodiment of the present invention, referring to Fig. 2, which is a schematic flowchart of a crawling path planning method provided by another embodiment of the present invention, compared with the embodiment shown in Fig. 1, selecting, according to the recorded path examples and the page features of each crawled page, the path examples that reach pages similar to the preset target page may include:
S1031: classify the crawled pages according to the page features of each crawled page;
The classification in this step rests on the fact that similar pages of the same website have almost identical page features. The classification proceeds as follows:
(1) For every two pages among the crawled pages, calculate the first similarity of their page links and the second similarity of their page source code structures;
The first similarity is calculated with a similarity algorithm for page links, and the second similarity with a similarity algorithm for page source code structures; both algorithms belong to the prior art and are not described here.
For example, suppose all recorded path examples are s->b1->c1->e1, s->b2->c2->e2, s->b2->c3->e3, s->a->b2->e1, s->a->b1->e1->f1, s->m1->n1->e2->f2 and s->m2->n2->e2. Then, using these similarity algorithms, the first similarity of page b1 and page b2 may be calculated as 0.9 and that of page a and page b1 as 0.3, and the second similarity of page b1 and page b2 as 0.91 and that of page a and page b1 as 0.2.
(2) According to preset weights, compute the weighted sum of the first similarity and the second similarity to obtain the comprehensive similarity;
In practice, the first and second similarities are combined as a weighted sum. The preset weights can be set according to the general characteristics of major websites. If the page links of a website's similar pages are highly similar while their page source code structures are less similar, the weight of the first similarity can be set larger and that of the second similarity smaller. Likewise, if the page source code structures of a website's similar pages are highly similar while their page links are less similar, the weight of the first similarity can be set smaller and that of the second similarity larger. The weights may also be set in other ways; the embodiment of the present invention does not specifically limit how the weights are set.
For example, if the weights set for the page link and the page source code structure are 0.8 and 0.2 respectively, then for page b1 and page b2 the comprehensive similarity is 0.9 × 0.8 + 0.91 × 0.2 = 0.902, and for page a and page b1 it is 0.3 × 0.8 + 0.2 × 0.2 = 0.28.
(3) Classify the crawled pages according to the obtained comprehensive similarity and a preset similarity threshold.
In this step, the comprehensive similarity is compared with the preset similarity threshold: if the comprehensive similarity is greater than the threshold, the two corresponding pages are considered to belong to the same class; otherwise they are not. For example, the comprehensive similarity of page b1 and page b2 is 0.902 and that of page a and page b1 is 0.28. Compared with the preset similarity threshold 0.85, 0.902 is greater than 0.85, indicating that page b1 and page b2 are highly similar and belong to one class, whereas 0.28 is less than 0.85, indicating that page a and page b1 have low similarity and do not belong to one class.
Finally, according to the comprehensive similarities, the pages on the paths s->b1->c1->e1, s->b2->c2->e2, s->b2->c3->e3, s->a->b2->e1, s->a->b1->e1->f1, s->m1->n1->e2->f2 and s->m2->n2->e2 can be classified: s belongs to class S; a to class A; b1 and b2 to class B; c1, c2 and c3 to class C; e1, e2 and e3 to class E; f1 and f2 to class F; m1 and m2 to class M; n1 and n2 to class N.
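Steps (1) to (3) can be sketched with the numbers from the example. The weights 0.8 and 0.2 and the threshold 0.85 come from the text; the function names, and the choice to merge pages transitively with a union-find structure, are assumptions of this sketch, since the text does not say how pairwise decisions are combined into classes.

```python
import math

def comprehensive_similarity(link_sim, structure_sim,
                             link_weight=0.8, structure_weight=0.2):
    """Weighted sum of the first (link) and second (structure) similarity."""
    return link_weight * link_sim + structure_weight * structure_sim

def classify(pages, pair_sims, threshold=0.85):
    """Group pages whose comprehensive similarity exceeds the threshold."""
    parent = {page: page for page in pages}

    def find(page):
        while parent[page] != page:
            parent[page] = parent[parent[page]]  # path compression
            page = parent[page]
        return page

    for (p, q), (link_sim, structure_sim) in pair_sims.items():
        if comprehensive_similarity(link_sim, structure_sim) > threshold:
            parent[find(p)] = find(q)            # same class: merge the groups

    groups = {}
    for page in pages:
        groups.setdefault(find(page), []).append(page)
    return sorted(sorted(group) for group in groups.values())

# (first similarity, second similarity) per page pair, as in the example.
pair_sims = {("b1", "b2"): (0.9, 0.91), ("a", "b1"): (0.3, 0.2)}
classes = classify(["a", "b1", "b2"], pair_sims)
assert math.isclose(comprehensive_similarity(0.9, 0.91), 0.902)
```

With these inputs, b1 and b2 end up in one class and a stays in a class of its own, matching the worked example.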
Besides the above classification method, the page features of the crawled pages can also be used as training samples to train a learning machine, yielding a classifier suited to the classification in the present invention; the concrete pages can then be classified with that classifier.
S1032: determine, according to the page classification result, the category to which the preset target page belongs;
In a specific implementation of this step, the page link and page source code structure of the preset target page are first collected; then, according to the classification result, the similarity between a page feature of the preset target page and that of each class of pages is calculated, and the class with the maximum similarity is the class to which the preset target page belongs. For example, the preset target page is determined to belong to class E above.
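Choosing the class with the maximum similarity can be sketched as below. The text does not name the similarity algorithm, so `difflib.SequenceMatcher` from the Python standard library is used here purely as a stand-in; the function names, the target link and the one-representative-link-per-class input are likewise hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in string similarity in [0, 1]; not the algorithm of the patent."""
    return SequenceMatcher(None, a, b).ratio()

def target_class(target_link, class_representatives):
    """Return the class whose representative link is most similar to the target."""
    scores = {cls: similarity(target_link, link)
              for cls, link in class_representatives.items()}
    return max(scores, key=scores.get)

# Hypothetical representative links, one per class.
representatives = {
    "C": "http://v.qq.com/x/movielist/list1.html",
    "E": "http://v.qq.com/cover/3/3ew17ydbfgmy79r.html",
}
best = target_class("http://v.qq.com/cover/3/abc.html", representatives)
```

In a fuller version the comparison would combine the link and source code structure features, as in the classification step.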
S1033: from the recorded path examples, select the path examples that reach the pages of the determined category.
For example, if all recorded path examples are s->b1->c1->e1, s->b2->c2->e2, s->b2->c3->e3, s->a->b2->e1, s->a->b1->e1->f1, s->m1->n1->e2->f2 and s->m2->n2->e2, and the preset target page is determined to belong to class E, then the selected path examples that reach class E pages are: s->b1->c1->e1, s->b2->c2->e2, s->b2->c3->e3, s->a->b2->e1, s->a->b1->e1, s->m1->n1->e2 and s->m2->n2->e2.
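The selection in this example amounts to keeping, for each recorded path, the prefix that ends on a page of the target class. A minimal sketch, assuming paths are encoded as `->`-joined strings and that `select_paths` and the `PAGE_CLASS` mapping are hypothetical names:

```python
def select_paths(path_examples, page_class, target):
    """Keep each prefix of a recorded path that ends on a page of the target class."""
    selected = []
    for example in path_examples:
        pages = example.split("->")
        for i, page in enumerate(pages):
            if page_class.get(page) == target:
                prefix = "->".join(pages[: i + 1])
                if prefix not in selected:   # avoid duplicate path examples
                    selected.append(prefix)
    return selected

RECORDED = [
    "s->b1->c1->e1", "s->b2->c2->e2", "s->b2->c3->e3", "s->a->b2->e1",
    "s->a->b1->e1->f1", "s->m1->n1->e2->f2", "s->m2->n2->e2",
]
PAGE_CLASS = {"e1": "E", "e2": "E", "e3": "E"}  # only class E matters here
selected = select_paths(RECORDED, PAGE_CLASS, "E")
```

The trailing pages f1 and f2 are dropped because they lie beyond the class E page on their paths.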
Compared with the embodiment shown in Fig. 1, in this embodiment, performing path planning according to the selected path examples and the page features of each page in the selected path examples, and generating a path planning result, may include:
S1041: determine, according to the page classification result, the category to which each page in the selected path examples belongs;
For example, the selected path examples are s->b1->c1->e1, s->b2->c2->e2, s->b2->c3->e3, s->a->b2->e1, s->a->b1->e1, s->m1->n1->e2 and s->m2->n2->e2, where s belongs to class S; a to class A; b1 and b2 to class B; c1, c2 and c3 to class C; e1, e2 and e3 to class E; m1 and m2 to class M; n1 and n2 to class N.
S1042: generate, according to the selected path examples and the determined categories, effective paths with categories as nodes;
For example, from the above example the effective paths with categories as nodes are: S->B->C->E, S->A->B->E and S->M->N->E.
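Generating category-level paths is a matter of replacing each page with its class and removing duplicates, as the following sketch shows. The function name is hypothetical, and deriving the class from the first letter of the page name is only a convenience that works for the naming used in this example.

```python
def category_paths(selected, page_class):
    """Replace each page by its class and drop duplicate class-level paths."""
    result = []
    for example in selected:
        path = "->".join(page_class[page] for page in example.split("->"))
        if path not in result:
            result.append(path)
    return result

SELECTED = [
    "s->b1->c1->e1", "s->b2->c2->e2", "s->b2->c3->e3", "s->a->b2->e1",
    "s->a->b1->e1", "s->m1->n1->e2", "s->m2->n2->e2",
]
# Assumption for the example only: the class is the page name's first letter.
pages = {page for example in SELECTED for page in example.split("->")}
PAGE_CLASS = {page: page[0].upper() for page in pages}
effective = category_paths(SELECTED, PAGE_CLASS)
```

Seven page-level path examples collapse into three class-level paths, which is what makes the later mining step tractable.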
S1043: according to a process mining algorithm and the page classification result, mine from the obtained effective paths a crawl path graph that meets a preset rule and a description file of the crawl path graph, where the description file includes the relationships between the category nodes in the crawl path graph;
The crawl path graph is obtained by data mining; it is a simplified path graph with categories as nodes, and may be a single path or a mesh-like graph with branches. The relationships between category nodes are the connection relationships of the category nodes in the crawl path graph, for example that class A leads directly to class B, or that class B connects to class D.
In a specific implementation, the preset rule is that, among the crawled pages covered by the mined crawl path graph, pages similar to the preset target page are as many as possible and intermediate pages are as few as possible. The preset rule can be expressed as the ratio of intermediate pages to pages similar to the preset target page, or as the proportion of intermediate pages among the covered pages; the two ratios are equivalent, and the smaller the ratio, the better the corresponding crawl path graph. A ratio threshold is set: a ratio greater than the threshold does not meet the preset rule, and a ratio less than the threshold meets it.
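The ratio test of the preset rule can be sketched as follows, using the first form of the ratio (intermediate pages to target-like pages). The function name, the threshold values, counting the portal page as an intermediate page, and treating a ratio equal to the threshold as failing are all assumptions of this sketch; the text leaves them open.

```python
def meets_preset_rule(covered_pages, page_class, target, ratio_threshold):
    """True if intermediate pages are few enough relative to target-like pages."""
    targets = sum(1 for page in covered_pages if page_class[page] == target)
    if targets == 0:
        return False                      # a graph reaching no target page is useless
    intermediates = len(covered_pages) - targets
    return intermediates / targets < ratio_threshold

# Pages covered by the candidate graph S->B->C->E in the running example.
covered = ["s", "b1", "b2", "c1", "c2", "c3", "e1", "e2", "e3"]
page_class = {page: page[0].upper() for page in covered}
ok = meets_preset_rule(covered, page_class, "E", ratio_threshold=2.5)
```

Here six intermediate pages lead to three class E pages, a ratio of 2.0, so the candidate passes a threshold of 2.5 and would fail a threshold of 1.5.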
S1044: according to the description file and the page features of each category node, generate the extraction relationships between the category nodes in the crawl path graph, where the page features of each category node are obtained from the page features of the pages corresponding to that category node;
In this step, the main features are first extracted from the page features of the pages corresponding to each category node and used as the page features of that category node. These comprise a first page feature and a second page feature for each category node: the first is extracted from the page links, and the second from the page source code structures. For example, the page features of class C are extracted from the page features of the concrete pages c1, c2 and c3. For the page link, the part shared by the page links of c1, c2 and c3 is obtained and expressed as a regular expression; this is the first page feature of class C. For the page source code structure, the part shared by the html tags of c1, c2 and c3 is obtained and expressed as a concatenated string of html tags; this is the second page feature of class C.
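Deriving the first page feature of a class can be sketched as follows. This sketch assumes that the "shared part" of the links is their common prefix, which is only one possible reading; the function name and the second example link are hypothetical.

```python
import os
import re

def link_feature(links):
    """Build a regular expression from the prefix shared by all links of a class."""
    prefix = os.path.commonprefix(links)   # character-wise common prefix
    return "^" + re.escape(prefix) + ".+"

links = [
    "http://v.qq.com/cover/3/3ew17ydbfgmy79r.html",
    "http://v.qq.com/cover/a/abc123.html",   # hypothetical second page of the class
]
pattern = link_feature(links)
matches_all = all(re.match(pattern, link) for link in links)
```

`re.escape` keeps literal dots in the host name from matching arbitrary characters, so the pattern is slightly stricter than an unescaped prefix would be.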
If the description file indicates that class A can reach class C directly, then the css path or jsonpath of class C within class A pages can be determined from the second page feature of class A and the first page feature of class C. In this way the css path or jsonpath of a child category node within its parent category node can be determined, so that the extraction relationships between category nodes can subsequently be obtained.
With this procedure it can be determined from which specific location of a class A page the class C pages are extracted, and thus the extraction relationships between the category nodes can be generated.
S1045: generate a path planning result according to the extraction relationships, where the path planning result includes the extraction relationships described with syntax rules.
Extraction relationship is described using syntax rule, particular by regular expression, css path or json path Any extract sublink from a kind of page.For example, for qq website, from class page http://v.q_q.com/x/ Movielist/ cate=10001&offset=0&sort=5&pay=-1, can by css path:#vid_eos > ul > Li > strong > a can also pass through regular expression: ^http: //v.qq.com/cover/.+ navigates to such page and is wrapped The broadcasting link of all single album class pages contained.
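The regular-expression variant of such an extraction rule can be sketched as follows. The pattern is the one quoted above; the sample HTML, its link targets and the helper names are invented for illustration. Only the regular-expression route is shown because the Python standard library has no CSS selector engine.

```python
import re
from html.parser import HTMLParser

# Regular expression quoted in the text for single-album playback links.
ALBUM_LINK = re.compile(r"^http://v.qq.com/cover/.+")


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_sublinks(html, pattern):
    """Return the links in the page that match the extraction rule."""
    collector = _LinkCollector()
    collector.feed(html)
    return [link for link in collector.links if pattern.match(link)]


# Hypothetical list page used only to exercise the rule.
SAMPLE = """
<div id="vid_eos"><ul>
  <li><strong><a href="http://v.qq.com/cover/a/abc123.html">Film A</a></strong></li>
  <li><strong><a href="http://v.qq.com/cover/b/def456.html">Film B</a></strong></li>
</ul></div>
<a href="http://v.qq.com/help">Help</a>
"""

album_links = extract_sublinks(SAMPLE, ALBUM_LINK)
```

Here `album_links` holds the two cover links and omits the help link, which does not match the rule.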
In use, the path planning result produced by the embodiment of the present invention is embedded into a crawler system, and the crawler system crawls according to that result, which allows a better whole-network search service to be provided to users.
With the embodiment shown in Fig. 2, the crawled pages are classified and the category to which the preset target page belongs is determined; the pages corresponding to that category are identified, and the path instances that reach those pages are selected. By replacing each specific page in the path instances with its category, the path instances are aggregated into effective paths whose nodes are categories. According to the mining algorithm and the page classification result, a crawl path graph that meets the preset rule, together with its description file, is mined from the effective paths. According to the description file and the page feature of each category node, the extraction relationships between the category nodes in the crawl path graph are generated, and finally the path planning result is generated from the extraction relationships. The path planning result describes how to extract from one class of pages to one or more other classes and finally to the target class of pages. Compared with manual path planning in the prior art, this saves labor and improves path planning efficiency, and it avoids the omission of target-class pages caused by the subjectivity of manual planning. Compared with whole-site crawling in the prior art, it not only reduces the downloading of a large number of useless pages but also allows the target-class pages to be crawled more completely.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a crawl path planning device.
Referring to Fig. 3, which is a schematic structural diagram of the crawl path planning device provided by an embodiment of the present invention, the crawl path planning device includes:
a crawling module 31, configured to crawl, according to a preset crawling strategy and starting from a preset entry page, the pages of the website to which the preset entry page belongs; preferably, the preset crawling strategy is a breadth-first crawling strategy;
a processing module 32, configured to record the path instances from the preset entry page to each crawled page and to collect the page feature of each crawled page;
a selection module 33, configured to select, according to the recorded path instances and the page feature of each crawled page, the path instances that reach pages similar to the preset target page;
a planning module 34, configured to perform path planning according to the selected path instances and the page feature of each page in the selected path instances, and to generate a path planning result.
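The behaviour of the crawling module and the path-recording part of the processing module can be pictured with a small breadth-first sketch. It is an assumption-laden simplification: `get_links` stands in for downloading a page and parsing its links, the `max_pages` cap is hypothetical, and only the first path found to each page is kept.

```python
from collections import deque


def crawl_breadth_first(entry, get_links, max_pages=1000):
    """Visit pages breadth-first from the entry page.

    Returns a dict mapping each crawled page to one path instance,
    i.e. the list of pages leading from the entry page to it.
    """
    paths = {entry: [entry]}
    queue = deque([entry])
    while queue:
        page = queue.popleft()
        for link in get_links(page):
            if link in paths or len(paths) >= max_pages:
                continue  # already reached, or page budget exhausted
            paths[link] = paths[page] + [link]
            queue.append(link)
    return paths


# Hypothetical in-memory site used instead of real HTTP requests.
SITE = {
    "home": ["list", "about"],
    "list": ["item1", "item2"],
    "about": [],
    "item1": ["home"],
    "item2": [],
}

path_instances = crawl_breadth_first("home", lambda page: SITE.get(page, []))
```

In this toy site the path instance recorded for `item1` is `["home", "list", "item1"]`.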
With the embodiment shown in Fig. 3, the pages of the website to which the preset entry page belongs are crawled starting from the preset entry page; the path instances from the preset entry page to each crawled page are recorded, and the page feature of each crawled page is collected. This actual crawling yields path instances between specific pages and completes the sampling. According to the recorded path instances and the page feature of each crawled page, the path instances that reach pages similar to the preset target page are selected; path planning is then performed according to the selected path instances and the page features of the corresponding pages, and a path planning result is generated. The process requires no involvement from developers, who no longer need to study complex web page code, so path planning becomes more efficient. Because the samples on which path planning relies are targeted, fewer unnecessary path planning results are generated and the comprehensiveness of the crawl result can be ensured to a certain extent; compared with full crawling in the prior art, the crawling burden is greatly reduced.
In another specific embodiment of the present invention, referring to Fig. 4, which is a schematic structural diagram of a crawl path planning device provided by another embodiment of the present invention, the selection module 33 in this embodiment, compared with the embodiment shown in Fig. 3, specifically includes the following units:
a classification unit 331, configured to classify the crawled pages according to the page feature of each crawled page;
a first determination unit 332, configured to determine, according to the page classification result, the category to which the preset target page belongs;
a selection unit 333, configured to select, from the recorded path instances, the path instances that reach the pages corresponding to the determined category.
The page feature includes a page link and a page source code structure, and the classification unit 331 includes:
a calculation subunit, configured to calculate, for every two pages among the crawled pages, a first similarity of the page links and a second similarity of the page source code structures;
an obtaining subunit, configured to sum the first similarity and the second similarity according to preset weights to obtain a comprehensive similarity;
a classification subunit, configured to classify the crawled pages according to the obtained comprehensive similarity and a preset similarity threshold.
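The three subunits can be illustrated with the sketch below. The text does not name a similarity measure or a grouping procedure, so `difflib.SequenceMatcher` and the greedy grouping against the first page of each category are assumptions, as are the default weight of 0.5, the threshold of 0.8 and the dictionary layout of a page.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] between two sequences (strings or lists of tags)."""
    return SequenceMatcher(None, a, b).ratio()


def comprehensive_similarity(page_a, page_b, link_weight=0.5):
    """Weighted sum of link similarity and source-structure similarity."""
    first = similarity(page_a["url"], page_b["url"])    # first similarity
    second = similarity(page_a["tags"], page_b["tags"])  # second similarity
    return link_weight * first + (1 - link_weight) * second


def classify(pages, threshold=0.8, link_weight=0.5):
    """Greedy grouping: a page joins the first category whose first page
    is at least `threshold` similar to it, otherwise it starts a new category."""
    categories = []
    for page in pages:
        for category in categories:
            if comprehensive_similarity(page, category[0], link_weight) >= threshold:
                category.append(page)
                break
        else:
            categories.append([page])
    return categories
```

Each page is assumed to be a dict with a `url` string and a `tags` list, for example the tag sequence of its HTML source.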
In this embodiment, compared with the embodiment shown in Fig. 3, the planning module 34 specifically includes:
a second determination unit 341, configured to determine, according to the page classification result, the category to which each page in the selected path instances belongs;
a first generation unit 342, configured to generate, according to the selected path instances and the determined categories, effective paths whose nodes are categories;
a mining unit 343, configured to mine, according to a process model mining algorithm and the page classification result, a crawl path graph that meets a preset rule and a description file of the crawl path graph from the obtained effective paths, wherein the description file includes the relationships between the category nodes in the crawl path graph;
a second generation unit 344, configured to generate, according to the description file and the page feature of each category node, the extraction relationships between the category nodes in the crawl path graph, wherein the page feature of each category node is obtained from the page features of the pages corresponding to that category node;
a third generation unit 345, configured to generate a path planning result according to the extraction relationships, wherein the path planning result includes the extraction relationships described using syntax rules.
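The work of the second determination unit and the first generation unit, replacing each page in a path instance with its category and merging the results, might look like the following sketch. The function name, the use of a count per merged path and the sample data are assumptions for illustration.

```python
def to_category_paths(path_instances, category_of):
    """Replace every page in each path instance with its category and
    merge identical results, counting how many instances support each path."""
    merged = {}
    for instance in path_instances:
        category_path = tuple(category_of[page] for page in instance)
        merged[category_path] = merged.get(category_path, 0) + 1
    return merged


# Hypothetical path instances and page-to-category mapping.
instances = [
    ["home", "list1", "c1"],
    ["home", "list2", "c2"],
    ["home", "c3"],
]
category_of = {
    "home": "A", "list1": "B", "list2": "B",
    "c1": "C", "c2": "C", "c3": "C",
}

effective_paths = to_category_paths(instances, category_of)
```

The two instances that pass through a list page collapse into the single category path A, B, C, while the direct instance yields A, C. The support counts are one possible input to the later mining step, which this sketch does not attempt.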
With the embodiment shown in Fig. 4, the crawled pages are classified and the category to which the preset target page belongs is determined; the pages corresponding to that category are identified, and the path instances that reach those pages are selected. By replacing each specific page in the path instances with its category, the path instances are aggregated into effective paths whose nodes are categories. According to the mining algorithm and the page classification result, a crawl path graph that meets the preset rule, together with its description file, is mined from the effective paths. According to the description file and the page feature of each category node, the extraction relationships between the category nodes in the crawl path graph are generated, and finally the path planning result is generated from the extraction relationships. The path planning result describes how to extract from one class of pages to one or more other classes and finally to the target class of pages. Compared with manual path planning in the prior art, this saves labor and improves path planning efficiency, and it avoids the omission of target-class pages caused by the subjectivity of manual planning. Compared with whole-site crawling in the prior art, it not only reduces the downloading of a large number of useless pages but also allows the target-class pages to be crawled more completely.
It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for related details, refer to the description of the method embodiments.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be implemented by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims (10)

1. A crawl path planning method, characterized by comprising:
crawling, according to a preset crawling strategy and starting from a preset entry page, the pages of the website to which the preset entry page belongs;
collecting the page feature of each crawled page, and recording the path instances from the preset entry page to each crawled page;
selecting, according to the recorded path instances and the page feature of each crawled page, the path instances that reach pages similar to a preset target page;
performing path planning according to the selected path instances and the page feature of each page in the selected path instances, and generating a path planning result.
2. The method according to claim 1, characterized in that selecting, according to the recorded path instances and the page feature of each crawled page, the path instances that reach pages similar to the preset target page comprises:
classifying the crawled pages according to the page feature of each crawled page;
determining, according to the page classification result, the category to which the preset target page belongs;
selecting, from the recorded path instances, the path instances that reach the pages corresponding to the determined category.
3. The method according to claim 2, characterized in that performing path planning according to the selected path instances and the page feature of each page in the selected path instances, and generating a path planning result, comprises:
determining, according to the page classification result, the category to which each page in the selected path instances belongs;
generating, according to the selected path instances and the determined categories, effective paths whose nodes are categories;
mining, according to a process model mining algorithm and the page classification result, a crawl path graph that meets a preset rule and a description file of the crawl path graph from the obtained effective paths, wherein the description file includes the relationships between the category nodes in the crawl path graph;
generating, according to the description file and the page feature of each category node, the extraction relationships between the category nodes in the crawl path graph, wherein the page feature of each category node is obtained from the page features of the pages corresponding to that category node;
generating a path planning result according to the extraction relationships, wherein the path planning result includes the extraction relationships described using syntax rules.
4. The method according to claim 2, characterized in that, where the page feature includes a page link and a page source code structure, classifying the crawled pages according to the page feature of each crawled page comprises:
calculating, for every two pages among the crawled pages, a first similarity of the page links and a second similarity of the page source code structures;
summing the first similarity and the second similarity according to preset weights to obtain a comprehensive similarity;
classifying the crawled pages according to the obtained comprehensive similarity and a preset similarity threshold.
5. The method according to any one of claims 1-4, characterized in that the preset crawling strategy is a breadth-first crawling strategy.
6. A crawl path planning device, characterized by comprising:
a crawling module, configured to crawl, according to a preset crawling strategy and starting from a preset entry page, the pages of the website to which the preset entry page belongs;
a processing module, configured to collect the page feature of each crawled page and to record the path instances from the preset entry page to each crawled page;
a selection module, configured to select, according to the recorded path instances and the page feature of each crawled page, the path instances that reach pages similar to a preset target page;
a planning module, configured to perform path planning according to the selected path instances and the page feature of each page in the selected path instances, and to generate a path planning result.
7. The device according to claim 6, characterized in that the selection module comprises:
a classification unit, configured to classify the crawled pages according to the page feature of each crawled page;
a first determination unit, configured to determine, according to the page classification result, the category to which the preset target page belongs;
a selection unit, configured to select, from the recorded path instances, the path instances that reach the pages corresponding to the determined category.
8. The device according to claim 7, characterized in that the planning module comprises:
a second determination unit, configured to determine, according to the page classification result, the category to which each page in the selected path instances belongs;
a first generation unit, configured to generate, according to the selected path instances and the determined categories, effective paths whose nodes are categories;
a mining unit, configured to mine, according to a process model mining algorithm and the page classification result, a crawl path graph that meets a preset rule and a description file of the crawl path graph from the obtained effective paths, wherein the description file includes the relationships between the category nodes in the crawl path graph;
a second generation unit, configured to generate, according to the description file and the page feature of each category node, the extraction relationships between the category nodes in the crawl path graph, wherein the page feature of each category node is obtained from the page features of the pages corresponding to that category node;
a third generation unit, configured to generate a path planning result according to the extraction relationships, wherein the path planning result includes the extraction relationships described using syntax rules.
9. The device according to claim 7, characterized in that the page feature includes a page link and a page source code structure, and the classification unit comprises:
a calculation subunit, configured to calculate, for every two pages among the crawled pages, a first similarity of the page links and a second similarity of the page source code structures;
an obtaining subunit, configured to sum the first similarity and the second similarity according to preset weights to obtain a comprehensive similarity;
a classification subunit, configured to classify the crawled pages according to the obtained comprehensive similarity and a preset similarity threshold.
10. The device according to any one of claims 6-9, characterized in that the preset crawling strategy is a breadth-first crawling strategy.
CN201610867888.8A 2016-09-29 2016-09-29 One kind crawling paths planning method and device Active CN106547824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610867888.8A CN106547824B (en) 2016-09-29 2016-09-29 One kind crawling paths planning method and device

Publications (2)

Publication Number Publication Date
CN106547824A CN106547824A (en) 2017-03-29
CN106547824B true CN106547824B (en) 2019-11-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant