CN106547824B - Crawl path planning method and device - Google Patents
Crawl path planning method and device
- Publication number
- CN106547824B (application number CN201610867888.8A)
- Authority
- CN
- China
- Prior art keywords
- page
- path
- crawled
- feature
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the present invention discloses a crawl path planning method and device. The method includes: crawling, according to a preset crawl strategy and starting from a preset entry page, the pages of the website to which the preset entry page belongs; collecting the page features of each crawled page and recording the path instances that lead from the preset entry page to each crawled page; selecting, according to the recorded path instances and the page features of each crawled page, the path instances that reach pages similar to a preset target page; and performing path planning according to the selected path instances and the page features of each page in them, to generate a path planning result. Applying the embodiment improves the efficiency of path planning and reduces the crawling burden.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a crawl path planning method and device.
Background
A web crawler automatically fetches web pages: it downloads pages from the World Wide Web for a search engine and is an important component of the search engine. Web crawlers have become the main means of collecting massive amounts of data from the Internet, and many excellent open-source crawler frameworks have appeared. Web crawlers fall broadly into two classes. One is the search crawler used by search engines, whose crawl target is the entire Internet. The other is the directed crawler, whose crawl target is a specific subset of all websites, possibly a single website. For a directed crawler that crawls pages from a single website, there are currently two implementations. In the first, developers define and plan an accurate, executable crawl path result, and the directed crawler crawls according to it. In the second, no accurate, executable crawl path result is defined, and the whole site is crawled directly.
The two implementations have the following problems, respectively:
For the first, planning an accurate crawl path result requires developers to analyze and study the web page code, and because web page code is complex, this is inefficient.
For the second, although the developers' workload is reduced, a website contains redundant pages, so crawling the whole site directly downloads too many useless pages and increases the crawling burden.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a crawl path planning method and device that improve the efficiency of path planning and reduce the crawling burden.
To achieve this purpose, the embodiments of the present invention disclose a crawl path planning method and device. The technical solution is as follows:
A crawl path planning method provided by an embodiment of the present invention includes:
crawling, according to a preset crawl strategy and starting from a preset entry page, the pages of the website to which the preset entry page belongs;
collecting the page features of each crawled page, and recording the path instances that lead from the preset entry page to each crawled page;
selecting, according to the recorded path instances and the page features of each crawled page, the path instances that reach pages similar to a preset target page;
performing path planning according to the selected path instances and the page features of each page in the selected path instances, to generate a path planning result.
Preferably, selecting, according to the recorded path instances and the page features of each crawled page, the path instances that reach pages similar to the preset target page includes:
classifying the crawled pages according to the page features of each crawled page;
determining, according to the page classification result, the category node to which the preset target page belongs;
selecting, from the recorded path instances, the path instances that reach the pages corresponding to the determined category node.
Preferably, performing path planning according to the selected path instances and the page features of each page in the selected path instances, to generate a path planning result, includes:
determining, according to the page classification result, the category to which each page in the selected path instances belongs;
generating, according to the selected path instances and the determined categories, effective paths whose nodes are categories;
mining, according to a process mining algorithm and the page classification result, from the obtained effective paths, a crawl path graph that meets a preset rule and a description file of the crawl path graph, wherein the description file includes the relationships between the category nodes in the crawl path graph;
generating, according to the description file and the page features of each category node, the extraction relationships between the category nodes in the crawl path graph, wherein the page features of each category node are obtained from the page features of the pages corresponding to that category node;
generating a path planning result according to the extraction relationships, wherein the path planning result includes the extraction relationships described using syntax rules.
Preferably, when the page features include page links and page source code structure, classifying the crawled pages according to the page features of each crawled page includes:
for every two pages among the crawled pages, calculating a first similarity of the page links and a second similarity of the page source code structures;
summing the first similarity and the second similarity according to preset weights to obtain a comprehensive similarity;
classifying the crawled pages according to the obtained comprehensive similarity and a preset similarity standard value.
Preferably, the preset crawl strategy is specifically a breadth-first crawl strategy.
In a second aspect, a crawl path planning device provided by an embodiment of the present invention includes:
a crawling module, configured to crawl, according to a preset crawl strategy and starting from a preset entry page, the pages of the website to which the preset entry page belongs;
a processing module, configured to collect the page features of each crawled page and to record the path instances that lead from the preset entry page to each crawled page;
a selection module, configured to select, according to the recorded path instances and the page features of each crawled page, the path instances that reach pages similar to a preset target page;
a planning module, configured to perform path planning according to the selected path instances and the page features of each page in the selected path instances, to generate a path planning result.
Preferably, the selection module includes:
a classification unit, configured to classify the crawled pages according to the page features of each crawled page;
a first determination unit, configured to determine, according to the page classification result, the category to which the preset target page belongs;
a selection unit, configured to select, from the recorded path instances, the path instances that reach the pages corresponding to the determined category.
Preferably, the planning module includes:
a second determination unit, configured to determine, according to the page classification result, the category to which each page in the selected path instances belongs;
a first generation unit, configured to generate, according to the selected path instances and the determined categories, effective paths whose nodes are categories;
a mining unit, configured to mine, according to a process mining algorithm and the page classification result, from the obtained effective paths, a crawl path graph that meets a preset rule and a description file of the crawl path graph, wherein the description file includes the relationships between the category nodes in the crawl path graph;
a second generation unit, configured to generate, according to the description file and the page features of each category node, the extraction relationships between the category nodes in the crawl path graph, wherein the page features of each category node are obtained from the page features of the pages corresponding to that category node;
a third generation unit, configured to generate a path planning result according to the extraction relationships, wherein the path planning result includes the extraction relationships described using syntax rules.
Preferably, the page features include page links and page source code structure, and the classification unit includes:
a calculation subunit, configured to calculate, for every two pages among the crawled pages, a first similarity of the page links and a second similarity of the page source code structures;
an obtaining subunit, configured to sum the first similarity and the second similarity according to preset weights to obtain a comprehensive similarity;
a classification subunit, configured to classify the crawled pages according to the obtained comprehensive similarity and a preset similarity standard value.
Preferably, the preset crawl strategy is specifically a breadth-first crawl strategy.
With the embodiment of the present invention, the pages of the website to which the preset entry page belongs are crawled starting from the preset entry page; the path instances that lead from the preset entry page to each crawled page are recorded, and the page features of each crawled page are collected. This concrete crawl yields path instances between specific pages and completes the sampling. According to the recorded path instances and the page features of each crawled page, the path instances that reach pages similar to the preset target page are selected, and path planning is performed according to the selected path instances and the page features of the corresponding pages to generate a path planning result. No developer takes part in this process and no developer needs to study complex web page code, so the efficiency of path planning is improved. The samples on which path planning relies are targeted, which reduces unnecessary path planning results while ensuring, to a certain degree, that the crawl results are comprehensive; compared with whole-site crawling in the prior art, the crawling burden is greatly reduced.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a crawl path planning method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a crawl path planning method provided by another embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a crawl path planning device provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a crawl path planning device provided by another embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To improve the efficiency of path planning and reduce the crawling burden, the embodiments of the present invention provide a crawl path planning method and device.
The crawl path planning method provided by an embodiment of the present invention is introduced first.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of a crawl path planning method provided by an embodiment of the present invention. The method includes the following steps:
S101: crawl, according to a preset crawl strategy and starting from a preset entry page, the pages of the website to which the preset entry page belongs;
Crawl strategies include the breadth-first crawl strategy and the depth-first crawl strategy. The basic idea of the breadth-first strategy is to crawl pages by the depth of their directory level: pages at shallower directory levels are crawled first, and only when all pages at one level have been crawled does the crawler go down to the next level. This strategy effectively controls the crawl depth and avoids the problem of the crawl never terminating when it meets an infinitely deep branch. The basic idea of the depth-first strategy is to follow links in order of depth from shallow to deep, visiting next-level page links in turn until it can go no deeper. This strategy easily wastes resources. In this step, the breadth-first crawl strategy is preferred.
In a specific implementation of the breadth-first crawl strategy, all sub-links of a downloaded page are extracted, the pages corresponding to those sub-links are downloaded, and this continues until the crawl ends. When the crawl ends can be limited by defining a time: if many machines are used for crawling, the crawl can be set to stop after a shorter time; if few machines are used, it can be set to stop after a longer time. For example, with 5 crawling machines the crawl can be set to run for 3 hours, and with 3 crawling machines it can be set to run for 5 hours.
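The time-limited breadth-first crawl described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `get_links` is a hypothetical callback standing in for "download the page and extract its sub-links", the site graph is toy data, and the sketch keeps only the first path found to each page, whereas the description records every path instance.

```python
import time
from collections import deque


def bfs_crawl(entry, get_links, time_budget_s=1.0, now=time.monotonic):
    """Breadth-first crawl from `entry` with a wall-clock budget.

    Returns a dict mapping each reached page to one path instance
    (a list of pages) leading from the entry page to it.
    `get_links(page)` is a hypothetical stand-in for downloading a
    page and extracting its sub-links.
    """
    deadline = now() + time_budget_s
    paths = {entry: [entry]}      # page -> path from the entry page
    queue = deque([entry])        # FIFO queue gives level-by-level order
    while queue and now() < deadline:
        page = queue.popleft()
        for link in get_links(page):
            if link not in paths:             # visit each page once
                paths[link] = paths[page] + [link]
                queue.append(link)
    return paths


# Toy site graph (assumed data, not taken from the patent).
SITE = {"s": ["a", "b1"], "a": ["b1", "b2"], "b1": ["c1"], "b2": [], "c1": []}
paths = bfs_crawl("s", lambda page: SITE.get(page, []), time_budget_s=5.0)
```

A real crawler would replace the lambda with a function that fetches the page over HTTP, and would pick the time budget from the number of machines available, as the text describes.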
To obtain a more accurate path planning result and thereby crawl more pages similar to the preset target page, the preset entry page is chosen according to the crawl requirement and is usually the home page of a website. For example, if the crawl must start from the home page of the www.qq.com site in order to obtain all film information, the home page of the Tencent website is given in the form of a page link, namely "http://v.qq.com/".
S102: collect the page features of each crawled page, and record the path instances that lead from the preset entry page to each crawled page;
In a specific implementation, the collected page features of each crawled page may include the page link and the page source code structure. Examples of page links are: http://v.qq.com/x/movielist/?cate=10001&offset=0&sort=5&pay=-1 and http://v.qq.com/cover/3/3ew17ydbfgmy79r.html.
The page source code structure refers to the HyperText Markup Language (HTML) tags of the page, generally called html tags. It is represented as the string concatenation of all html tags of the page.
Obtaining the page link and the html tag string of a specific page belongs to the prior art and is not described here.
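As an illustration of the "string concatenation of all html tags" representation, the sketch below collects the opening tags of a page with Python's standard `html.parser` module. The space separator and the choice to record only opening tags are assumptions made for this example; the source does not specify either.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes and text content are ignored; only the tag
        # skeleton of the page is kept.
        self.tags.append(tag)


def tag_structure(html):
    """Return the page's tag sequence as one space-separated string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.tags)


structure = tag_structure(
    "<html><body><div><a href='x'>t</a></div></body></html>"
)
```

Two pages generated from the same template tend to yield the same or nearly the same string, which is what makes this feature usable for the similarity comparison described later.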
In this step, the specific crawled path instances are recorded. The nodes in a path instance are specific pages, and a single page node can be represented by its page link. For example, for pages crawled breadth-first starting from entry page s, the recorded path instances are: s -> b1 -> c1 -> e1, s -> b2 -> c2 -> e2, s -> b2 -> c3 -> e3, s -> a -> b2 -> e1, s -> a -> b1 -> e1 -> f1, s -> m1 -> n1 -> e2 -> f2, s -> m2 -> n2 -> e2. Here the letters stand for page links, for ease of expression. Recording a whole path also records the paths to every page on it. For example, recording the path instance s -> b1 -> c1 -> e1 implicitly records the paths s -> b1, s -> b1 -> c1, and s -> b1 -> c1 -> e1.
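The point that one recorded path implicitly records the path to every page on it can be shown with a short sketch. The function name `implied_paths` is hypothetical and the pages are the letter placeholders used in the example above.

```python
def implied_paths(path):
    """Return every prefix of `path` that starts at the entry page
    and ends at a later page on the path."""
    return [path[:end] for end in range(2, len(path) + 1)]


prefixes = implied_paths(["s", "b1", "c1", "e1"])
```

For the path s -> b1 -> c1 -> e1 this yields the three paths named in the text: s -> b1, s -> b1 -> c1, and s -> b1 -> c1 -> e1.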
S103: select, according to the recorded path instances and the page features of each crawled page, the path instances that reach pages similar to the preset target page;
It should be noted that the target page is a specific web page, for example http://v.qq.com/x/cover/3ew17ydbfgmy79r/x002159scet.html. Pages similar to the preset target page are very likely to be pages that need to be crawled, and they contain more content related to the target.
In practice, so that the crawl path graph finally mined has fuller coverage, preferably all path instances that reach pages similar to the preset target page are selected.
S104: perform path planning according to the selected path instances and the page features of each page in the selected path instances, to generate a path planning result.
With the embodiment shown in Fig. 1, the pages of the website to which the preset entry page belongs are crawled starting from the preset entry page; the path instances that lead from the preset entry page to each crawled page are recorded, and the page features of each crawled page are collected. This concrete crawl yields path instances between specific pages and completes the sampling. According to the recorded path instances and the page features of each crawled page, the path instances that reach pages similar to the preset target page are selected, and path planning is performed according to the selected path instances and the page features of the corresponding pages to generate a path planning result. No developer takes part in this process and no developer needs to study complex web page code, so the efficiency of path planning is improved. The samples on which path planning relies are targeted, which reduces unnecessary path planning results while ensuring, to a certain degree, that the crawl results are comprehensive; compared with whole-site crawling in the prior art, the crawling burden is greatly reduced.
In another embodiment of the present invention, referring to Fig. 2, Fig. 2 is a schematic flowchart of a crawl path planning method provided by another embodiment of the present invention. Compared with the embodiment shown in Fig. 1, in this embodiment, selecting, according to the recorded path instances and the page features of each crawled page, the path instances that reach pages similar to the preset target page may include:
S1031: classify the crawled pages according to the page features of each crawled page;
The classification in this step rests on the fact that pages of the same kind on the same website have almost identical page features. The specific classification method follows these steps:
(1) For every two pages among the crawled pages, calculate a first similarity of the page links and a second similarity of the page source code structures;
The first similarity is calculated with a page link similarity algorithm, and the second similarity with a page source code structure similarity algorithm. Both algorithms belong to the prior art and are not described here.
For example, if all the recorded path instances are s -> b1 -> c1 -> e1, s -> b2 -> c2 -> e2, s -> b2 -> c3 -> e3, s -> a -> b2 -> e1, s -> a -> b1 -> e1 -> f1, s -> m1 -> n1 -> e2 -> f2, s -> m2 -> n2 -> e2, then these similarity algorithms may give a first similarity of 0.9 between page b1 and page b2 and of 0.3 between page a and page b1, and a second similarity of 0.91 between page b1 and page b2 and of 0.2 between page a and page b1.
(2) Sum the first similarity and the second similarity according to preset weights to obtain a comprehensive similarity;
In practice, the first similarity and the second similarity are combined in a weighted sum. The preset weights can be set according to the general characteristics of major websites. If pages of the same kind on a website have highly similar page links but less similar page source code structures, the weight of the first similarity can be set larger and the weight of the second similarity smaller. Conversely, if pages of the same kind have highly similar page source code structures but less similar page links, the weight of the first similarity can be set smaller and the weight of the second similarity larger. The weights can of course be set in other ways; the embodiment of the present invention does not specifically limit how they are set.
For example, if the weights set for page link and page source code structure are 0.8 and 0.2 respectively, then for page b1 and page b2, 0.9 × 0.8 + 0.91 × 0.2 gives a comprehensive similarity of 0.902; for page a and page b1, 0.3 × 0.8 + 0.2 × 0.2 gives a comprehensive similarity of 0.28.
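The weighted sum can be checked with a few lines of code. The function and parameter names are hypothetical; the weights 0.8 and 0.2 and the similarity values are those of the example above.

```python
def comprehensive_similarity(link_sim, structure_sim,
                             link_weight=0.8, structure_weight=0.2):
    """Weighted sum of the page link similarity (first similarity)
    and the page source code structure similarity (second similarity)."""
    return link_weight * link_sim + structure_weight * structure_sim


sim_b1_b2 = comprehensive_similarity(0.9, 0.91)   # pages b1 and b2
sim_a_b1 = comprehensive_similarity(0.3, 0.2)     # pages a and b1
```

Up to floating-point rounding, `sim_b1_b2` is 0.902 and `sim_a_b1` is 0.28, matching the figures in the text.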
(3) Classify the crawled pages according to the obtained comprehensive similarity and a preset similarity standard value.
In this step, the comprehensive similarity can be compared with the preset similarity standard value. If the comprehensive similarity is greater than the preset similarity standard value, the two corresponding pages are considered to belong to the same class; otherwise they are not. For example, the comprehensive similarity of page b1 and page b2 is 0.902 and that of page a and page b1 is 0.28, and both are compared with the preset similarity standard value 0.85. Since 0.902 is greater than 0.85, page b1 and page b2 have a high similarity and are considered to be of one class; since 0.28 is less than 0.85, page a and page b1 have a low similarity and are not considered to be of one class.
Finally, according to the comprehensive similarity, the pages on the paths s -> b1 -> c1 -> e1, s -> b2 -> c2 -> e2, s -> b2 -> c3 -> e3, s -> a -> b2 -> e1, s -> a -> b1 -> e1 -> f1, s -> m1 -> n1 -> e2 -> f2, s -> m2 -> n2 -> e2 can be classified: s belongs to class S; b1 and b2 belong to class B; c1, c2, c3 belong to class C; e1, e2, e3 belong to class E; f1, f2, f3 belong to class F; m1 and m2 belong to class M.
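One way to turn the pairwise comparison into classes is to merge every pair whose comprehensive similarity exceeds the standard value, as in the sketch below. The transitive merging via union-find is an assumption of this example (the source only states the pairwise rule), and the similarity between a and b2 is made-up toy data.

```python
def classify(pages, similarity, standard_value=0.85):
    """Group pages so that any two pages whose similarity exceeds
    `standard_value` end up in the same class (merged transitively)."""
    parent = {page: page for page in pages}

    def find(page):
        # Follow parent links to the class representative,
        # shortening the chain along the way.
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    for i, p in enumerate(pages):
        for q in pages[i + 1:]:
            if similarity(p, q) > standard_value:
                parent[find(p)] = find(q)

    groups = {}
    for page in pages:
        groups.setdefault(find(page), []).append(page)
    return list(groups.values())


# 0.902 and 0.28 come from the example above; 0.30 is assumed.
SIM = {
    frozenset(("b1", "b2")): 0.902,
    frozenset(("a", "b1")): 0.28,
    frozenset(("a", "b2")): 0.30,
}
classes = classify(["a", "b1", "b2"], lambda p, q: SIM[frozenset((p, q))])
```

With this data, b1 and b2 form one class and a forms a class of its own.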
Besides the above classification method, the page features of the crawled pages can also be used as training samples to train a learning machine, so as to obtain a classifier suited to the classification in the present invention; specific pages can then be classified with this classifier.
S1032: determine, according to the page classification result, the category to which the preset target page belongs;
In a specific implementation of this step, the page link and the page source code structure of the preset target page are collected first. Then, according to the classification result, the similarity between the page features of the preset target page and those of the pages of each class is calculated; the class with the greatest similarity is the category to which the preset target page belongs. For example, the preset target page is determined to belong to class E above.
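Choosing the class with the greatest similarity can be sketched as follows. Scoring a class by its best-matching member page is an assumption of this example (an average over members would be another reading), and the similarity scores are invented for illustration.

```python
def assign_class(target, classes, similarity):
    """Return the name of the class whose best-matching member page
    is most similar to `target`.

    `classes` maps a class name to the list of its member pages.
    """
    return max(
        classes,
        key=lambda name: max(similarity(target, page) for page in classes[name]),
    )


# Hypothetical similarity of each crawled page to the target page.
SCORES = {"b1": 0.20, "b2": 0.25, "e1": 0.93, "e2": 0.90, "e3": 0.88}
CLASSES = {"B": ["b1", "b2"], "E": ["e1", "e2", "e3"]}
target_class = assign_class("target", CLASSES, lambda t, page: SCORES[page])
```

Here the target page is assigned to class E, as in the example in the text.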
S1033: select, from the recorded path instances, the path instances that reach the pages corresponding to the determined category.
For example, if all the recorded path instances are s -> b1 -> c1 -> e1, s -> b2 -> c2 -> e2, s -> b2 -> c3 -> e3, s -> a -> b2 -> e1, s -> a -> b1 -> e1 -> f1, s -> m1 -> n1 -> e2 -> f2, s -> m2 -> n2 -> e2, and the preset target page is determined to belong to class E, then the selected path instances that reach the pages of class E are: s -> b1 -> c1 -> e1, s -> b2 -> c2 -> e2, s -> b2 -> c3 -> e3, s -> a -> b2 -> e1, s -> a -> b1 -> e1, s -> m1 -> n1 -> e2, s -> m2 -> n2 -> e2.
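The selection in this example, including the trimming of pages that follow the target-class page, can be reproduced with the sketch below. Cutting each path at its last target-class page and dropping duplicates are assumptions of this example; the function name is hypothetical.

```python
def select_paths(paths, target_pages):
    """Keep, for each recorded path that contains a target-class page,
    the prefix ending at its last target-class page."""
    selected = []
    for path in paths:
        hits = [i for i, page in enumerate(path) if page in target_pages]
        if hits:
            prefix = path[: hits[-1] + 1]
            if prefix not in selected:      # drop duplicate prefixes
                selected.append(prefix)
    return selected


RECORDED = [
    ["s", "b1", "c1", "e1"],
    ["s", "b2", "c2", "e2"],
    ["s", "b2", "c3", "e3"],
    ["s", "a", "b2", "e1"],
    ["s", "a", "b1", "e1", "f1"],
    ["s", "m1", "n1", "e2", "f2"],
    ["s", "m2", "n2", "e2"],
]
selected = select_paths(RECORDED, {"e1", "e2", "e3"})
```

The two paths that continue to f1 and f2 are cut back to end at e1 and e2, which gives the seven selected path instances listed in the text.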
Compared with the embodiment shown in Fig. 1, in this embodiment, performing path planning according to the selected path instances and the page features of each page in the selected path instances, to generate a path planning result, may include:
S1041: determine, according to the page classification result, the category to which each page in the selected path instances belongs;
For example, the selected path instances are: s -> b1 -> c1 -> e1, s -> b2 -> c2 -> e2, s -> b2 -> c3 -> e3, s -> a -> b2 -> e1, s -> a -> b1 -> e1, s -> m1 -> n1 -> e2, s -> m2 -> n2 -> e2, where s belongs to class S; b1 and b2 belong to class B; c1, c2, c3 belong to class C; e1, e2, e3 belong to class E; f1, f2, f3 belong to class F; m1 and m2 belong to class M.
S1042: generate, according to the selected path instances and the determined categories, effective paths whose nodes are categories;
For example, following the example above, the effective paths with categories as nodes are: S -> B -> C -> E, S -> A -> B -> E, S -> M -> N -> E.
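Replacing each page by its category and merging identical results gives the class-level effective paths. In the sketch below the page-to-class table follows the running example, with a assigned to class A and n1, n2 to class N as implied by the effective paths listed above; the function name is hypothetical.

```python
def to_effective_paths(paths, page_class):
    """Map each page to its class and keep each distinct class-level
    path once, in order of first appearance."""
    effective = []
    for path in paths:
        class_path = tuple(page_class[page] for page in path)
        if class_path not in effective:
            effective.append(class_path)
    return effective


PAGE_CLASS = {
    "s": "S", "a": "A", "b1": "B", "b2": "B",
    "c1": "C", "c2": "C", "c3": "C",
    "e1": "E", "e2": "E", "e3": "E",
    "m1": "M", "m2": "M", "n1": "N", "n2": "N",
}
SELECTED = [
    ["s", "b1", "c1", "e1"], ["s", "b2", "c2", "e2"], ["s", "b2", "c3", "e3"],
    ["s", "a", "b2", "e1"], ["s", "a", "b1", "e1"],
    ["s", "m1", "n1", "e2"], ["s", "m2", "n2", "e2"],
]
effective_paths = to_effective_paths(SELECTED, PAGE_CLASS)
```

The seven page-level paths collapse into the three effective paths S -> B -> C -> E, S -> A -> B -> E, and S -> M -> N -> E.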
S1043: mine, according to a process mining algorithm and the page classification result, from the obtained effective paths, a crawl path graph that meets a preset rule and a description file of the crawl path graph, wherein the description file includes the relationships between the category nodes in the crawl path graph;
The crawl path graph is obtained by data mining. It is a simplified path graph whose nodes are categories; it may be a single path, or a mesh path graph with branches. The relationships between category nodes specifically include the connection relationships of the category nodes in the crawl path graph, for example, class A leads directly to class B, class B leads directly to class D, and so on.
In a specific implementation, the preset rule satisfies the following condition: among the crawled pages covered by the mined crawl path graph, pages similar to the preset target page are as many as possible and intermediate pages are as few as possible. The preset rule can be expressed as the ratio of intermediate pages to pages similar to the preset target page, or as the proportion of intermediate pages among the covered pages. The two ratios are equivalent: the smaller the ratio, the better the corresponding crawl path graph. A ratio threshold is set; if the ratio used is greater than the threshold, the preset rule is not met, and if it is less than the threshold, the preset rule is met.
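The ratio test can be illustrated with a small sketch. It uses the first form of the rule (intermediate pages divided by target-like pages); counting the entry page as an intermediate page and the threshold values are assumptions of this example, and the function name is hypothetical.

```python
def meets_preset_rule(covered_pages, target_like_pages, ratio_threshold):
    """Return True when the ratio of intermediate pages to target-like
    pages among `covered_pages` is below `ratio_threshold`."""
    target_count = sum(1 for page in covered_pages if page in target_like_pages)
    if target_count == 0:
        return False            # a graph that reaches no target page fails
    intermediate_count = len(covered_pages) - target_count
    return intermediate_count / target_count < ratio_threshold


# Pages covered by the class path S -> B -> C -> E in the running example.
COVERED = ["s", "b1", "b2", "c1", "c2", "c3", "e1", "e2", "e3"]
TARGET_LIKE = {"e1", "e2", "e3"}
ok_loose = meets_preset_rule(COVERED, TARGET_LIKE, ratio_threshold=3.0)
ok_strict = meets_preset_rule(COVERED, TARGET_LIKE, ratio_threshold=1.5)
```

Here six intermediate pages lead to three target-like pages, a ratio of 2.0, so the graph passes with a threshold of 3.0 and fails with a threshold of 1.5.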
S1044: generate, according to the description file and the page features of each category node, the extraction relationships between the category nodes in the crawl path graph, wherein the page features of each category node are obtained from the page features of the pages corresponding to that category node;
In this step, the main features are first extracted from the page features of the pages corresponding to each category node and used as the page features of that category node. They comprise a first page feature and a second page feature for each category node: the first page feature is extracted from the page links, and the second page feature is extracted from the page source code structure. For example, the page features of class C are extracted from the page features of the specific pages c1, c2, c3. For page links, the part shared by the page links of c1, c2, c3 is obtained; this shared part, expressed as a regular expression, is the first page feature of class C. For page source code structure, the part shared by the html tags of c1, c2, c3 is obtained and expressed as a concatenated html tag string; it is the second page feature of class C.
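A simple way to express the shared part of a set of page links as a regular expression is to take their common prefix, as sketched below. Using the common prefix is an assumption of this example, since the source does not say how the shared part is derived; the second link and the function name are made up for illustration.

```python
import os
import re


def shared_link_pattern(links):
    """Build a regular expression that matches any link starting with
    the character-wise common prefix of `links`."""
    prefix = os.path.commonprefix(links)
    return "^" + re.escape(prefix) + ".+"


LINKS = [
    "http://v.qq.com/cover/3/3ew17ydbfgmy79r.html",   # from the text
    "http://v.qq.com/cover/a/abc.html",               # hypothetical
]
pattern = shared_link_pattern(LINKS)
```

The resulting pattern matches links under http://v.qq.com/cover/ and rejects links elsewhere on the site, such as the movie list page.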
If the description file indicates that class A can reach class C directly, then the css path or jsonpath of class C within a class A page can be determined from the second page feature of class A and the first page feature of class C. In this way the css path or jsonpath of a child category node within its parent category node can be determined, so that the extraction relationships between category nodes can be obtained subsequently.
Following this process, it can be determined from which specific position in a class A page the class C pages are extracted, and the extraction relationships between the category nodes can thus be generated.
S1045: generate a path planning result according to the extraction relationships, wherein the path planning result includes the extraction relationships described using syntax rules.
Describing an extraction relationship with syntax rules means extracting sub-links from a class of pages by means of a regular expression, a css path, or a json path. For example, for the qq website, from the class page http://v.qq.com/x/movielist/?cate=10001&offset=0&sort=5&pay=-1, the play links of all the single album class pages contained in that page can be located either by the css path #vid_eos > ul > li > strong > a or by the regular expression ^http://v.qq.com/cover/.+.
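Applying the regular-expression form of an extraction relationship can be sketched as follows. The pattern is the one quoted in the text; the function name and the list of candidate links are assumptions for illustration, standing in for the links found on a downloaded list page.

```python
import re

# Regular expression quoted in the text for single album pages.
ALBUM_LINK = re.compile(r"^http://v.qq.com/cover/.+")


def extract_sublinks(links, pattern=ALBUM_LINK):
    """Keep only the links that match the extraction pattern."""
    return [link for link in links if pattern.match(link)]


CANDIDATES = [
    "http://v.qq.com/cover/3/3ew17ydbfgmy79r.html",
    "http://v.qq.com/x/movielist/?cate=10001&offset=0&sort=5&pay=-1",
]
album_links = extract_sublinks(CANDIDATES)
```

Only the first candidate, the album page, is kept. A css path would serve the same purpose by selecting the anchor elements at a fixed position in the page structure instead of filtering by link text.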
In practice, the path planning result produced by the embodiment of the present invention is embedded into a crawler system, and the crawler system crawls according to the corresponding path planning result, so that a better whole-network search service can be provided to users.
With the embodiment shown in Fig. 2, the crawled pages are classified and the category of the preset target page is determined. The pages belonging to this category are identified, and the path instances reaching these pages are selected. By replacing each specific page in the path instances with its category, the path instances are aggregated into effective paths whose nodes are categories. According to the mining algorithm and the page classification result, a crawl path graph that meets the preset rule and its description file are mined from the effective paths. According to the description file and the page features of each category node, the extraction relationships between the category nodes in the crawl path graph are generated. Finally, the path planning result is generated from the extraction relationships. The path planning result describes how to extract from one class of pages to one or more other classes until the target class pages are reached. Compared with manual path planning in the prior art, this frees up labor and improves path planning efficiency, and it avoids target class pages being missed because of the subjectivity of manual planning. Compared with whole-site crawling in the prior art, it not only reduces the downloading of a large number of useless pages but also crawls the target class pages more completely.
Corresponding to the above method embodiment, an embodiment of the present invention also provides a
crawling path planning device.
Referring to Fig. 3, Fig. 3 is a schematic structural diagram of the crawling path planning device
provided by an embodiment of the present invention. The crawling path planning device includes:
a crawling module 31, configured to crawl, according to a preset crawling strategy and starting from a
preset entry page, the pages of the website to which the preset entry page belongs, where the preset
crawling strategy is preferably a breadth-first crawling strategy;
a processing module 32, configured to record the path instances that reach each crawled page from the
preset entry page and to collect the page features of each crawled page;
a selection module 33, configured to select, according to the recorded path instances and the page
features of each crawled page, the path instances that reach pages similar to the preset target page;
a planning module 34, configured to perform path planning according to the selected path instances and
the page features of each page in the selected path instances, and to generate a path planning result.
With the embodiment shown in Fig. 3, the pages of the website to which the preset entry page belongs are
crawled starting from the preset entry page, the path instances that reach each crawled page from the
preset entry page are recorded, and the page features of each crawled page are collected. Through these
concrete crawling operations, path instances between specific pages are obtained, which completes the
sampling. According to the recorded path instances and the page features of each crawled page, the path
instances that reach pages similar to the preset target page are selected. Path planning is then
performed according to the selected path instances and the page features of the corresponding pages,
and a path planning result is generated. This process requires no participation by developers, who do
not need to study complex web page code, so the efficiency of path planning can be improved. The samples
on which path planning relies are targeted, which reduces the generation of unnecessary path planning
results and ensures, to a certain extent, the comprehensiveness of the crawling result. Compared with
whole-site crawling in the prior art, the crawling burden can be greatly reduced.
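The sampling step described above can be sketched as a breadth-first traversal that stores, for every
crawled page, the path by which it was first reached from the entry page. This is a minimal sketch
under stated assumptions: the function `crawl_breadth_first`, the `get_links` callback, and the
in-memory `SITE` link table are hypothetical, and page feature collection and real network access are
omitted.

```python
from collections import deque


def crawl_breadth_first(entry_page, get_links, max_pages=1000):
    """Crawl breadth-first from `entry_page`.

    Returns a dict mapping each crawled page to the list of pages on the
    path by which it was first reached from the entry page.
    """
    paths = {entry_page: [entry_page]}
    queue = deque([entry_page])
    while queue:
        page = queue.popleft()
        for link in get_links(page):
            # Skip pages already reached and stop growing past the page budget.
            if link in paths or len(paths) >= max_pages:
                continue
            paths[link] = paths[page] + [link]
            queue.append(link)
    return paths


# Hypothetical site: each page maps to the links found on it.
SITE = {
    "/": ["/movies", "/help"],
    "/movies": ["/movies/1", "/movies/2", "/"],
    "/help": [],
    "/movies/1": ["/play/1"],
    "/movies/2": ["/play/2"],
    "/play/1": [],
    "/play/2": [],
}

path_records = crawl_breadth_first("/", lambda page: SITE.get(page, []))
```

Because the traversal is breadth-first, the path stored for a page is one of the shortest link paths to
it from the entry page.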
In another specific embodiment of the present invention, referring to Fig. 4, Fig. 4 is a schematic
structural diagram of a crawling path planning device provided by another embodiment of the present
invention. Compared with the embodiment shown in Fig. 3, in this embodiment the selection module 33
specifically includes the following units:
a classification unit 331, configured to classify the crawled pages according to the page features of
each crawled page;
a first determination unit 332, configured to determine, according to the page classification result,
the category to which the preset target page belongs;
a selection unit 333, configured to select, from the recorded path instances, the path instances that
reach the pages corresponding to the determined category.
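The selection performed by these three units can be sketched as a filter over the recorded paths. This
is a minimal sketch that assumes the classification result is already available as a page-to-category
mapping; the function `select_paths_to_target_category` and the sample data are hypothetical.

```python
def select_paths_to_target_category(path_records, page_category, target_page):
    """Keep the recorded paths whose end page is in the target page's category.

    `path_records` maps each crawled page to the path that reached it, and
    `page_category` maps each page to its category label.
    """
    target_category = page_category[target_page]
    return {
        page: path
        for page, path in path_records.items()
        if page_category.get(page) == target_category
    }


# Hypothetical recorded paths and classification result.
recorded = {
    "/": ["/"],
    "/movies": ["/", "/movies"],
    "/movies/1": ["/", "/movies", "/movies/1"],
    "/play/1": ["/", "/movies", "/movies/1", "/play/1"],
    "/play/2": ["/", "/movies", "/movies/2", "/play/2"],
}
categories = {
    "/": "home",
    "/movies": "list",
    "/movies/1": "album",
    "/play/1": "play",
    "/play/2": "play",
}

selected = select_paths_to_target_category(recorded, categories, "/play/1")
```

With "/play/1" as the target page, only the two paths ending in pages of the "play" category remain.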
The page features include the page link and the page source code structure, and the classification
unit 331 includes:
a calculation subunit, configured to calculate, for every two pages among the crawled pages, a first
similarity of the page links and a second similarity of the page source code structures;
an obtaining subunit, configured to sum the first similarity and the second similarity according to
preset weights to obtain a comprehensive similarity;
a classification subunit, configured to classify the crawled pages according to the obtained
comprehensive similarity and a preset similarity measurement value.
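The weighted combination of the two similarities can be sketched as follows. This is a minimal sketch
and not the measure defined by the invention: the similarity function (`difflib.SequenceMatcher`), the
tag-sequence representation of the source code structure, the greedy grouping rule, and the values 0.5
and 0.8 are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def link_similarity(url_a, url_b):
    """First similarity: character-level similarity of two page links."""
    return SequenceMatcher(None, url_a, url_b).ratio()


def structure_similarity(tags_a, tags_b):
    """Second similarity: similarity of two tag sequences (assumed structure feature)."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def comprehensive_similarity(page_a, page_b, link_weight=0.5):
    """Weighted sum of the first and second similarity."""
    first = link_similarity(page_a["url"], page_b["url"])
    second = structure_similarity(page_a["tags"], page_b["tags"])
    return link_weight * first + (1.0 - link_weight) * second


def classify_pages(pages, threshold=0.8, link_weight=0.5):
    """Greedy grouping: a page joins the first group whose first member is similar enough."""
    groups = []
    for page in pages:
        for group in groups:
            if comprehensive_similarity(group[0], page, link_weight) >= threshold:
                group.append(page)
                break
        else:
            groups.append([page])
    return groups


# Hypothetical crawled pages with invented features.
album_a = {"url": "/cover/a1.html", "tags": ["html", "body", "div", "video"]}
album_b = {"url": "/cover/b2.html", "tags": ["html", "body", "div", "video"]}
listing = {"url": "/x/movielist/?offset=0", "tags": ["html", "body", "ul", "li", "li", "li"]}

groups = classify_pages([album_a, album_b, listing])
```

In this sample the two album pages share a link pattern and an identical tag sequence, so they fall
into one group, while the list page forms its own group.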
In this embodiment, compared with the embodiment shown in Fig. 3, the planning module 34 specifically
includes:
a second determination unit 341, configured to determine, according to the page classification result,
the category to which each page in the selected path instances belongs;
a first generation unit 342, configured to generate, according to the selected path instances and the
determined categories, valid paths whose nodes are categories;
a mining unit 343, configured to mine, from the obtained valid paths and according to a process mining
algorithm and the page classification result, a crawl path graph that meets preset rules and a
description file of the crawl path graph, where the description file includes the relationships between
the category nodes in the crawl path graph;
a second generation unit 344, configured to generate, according to the description file and the page
features of each category node, the extraction relationships between the category nodes in the crawl
path graph, where the page features of each category node are obtained from the page features of the
pages corresponding to that category node;
a third generation unit 345, configured to generate a path planning result according to the extraction
relationships, where the path planning result includes the extraction relationships described using
syntax rules.
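The first two planning steps can be sketched as replacing each page in a selected path with its
category and then counting category-to-category transitions. This is only a simple stand-in for the
process mining step, which the text does not specify in detail: the functions `to_category_path` and
`mine_crawl_graph`, the `min_support` rule, and the sample data are hypothetical.

```python
from collections import Counter


def to_category_path(path, page_category):
    """Replace each page in a path with its category label."""
    return tuple(page_category[page] for page in path)


def mine_crawl_graph(category_paths, min_support=1):
    """Count directly-following category pairs and keep the frequent ones.

    Returns a dict mapping (parent_category, child_category) to the number
    of category paths' steps in which that transition occurs.
    """
    edge_counts = Counter()
    for path in category_paths:
        edge_counts.update(zip(path, path[1:]))
    return {edge: count for edge, count in edge_counts.items() if count >= min_support}


# Hypothetical classification result and selected paths.
page_category = {
    "/": "home",
    "/movies": "list",
    "/movies/1": "album",
    "/movies/2": "album",
    "/play/1": "play",
    "/play/2": "play",
}
selected_paths = [
    ["/", "/movies", "/movies/1", "/play/1"],
    ["/", "/movies", "/movies/2", "/play/2"],
]

category_paths = [to_category_path(path, page_category) for path in selected_paths]
crawl_graph = mine_crawl_graph(category_paths, min_support=2)
```

Both sample paths collapse to the same category path, so each of its three transitions is counted
twice and kept. Each remaining edge is a parent and child category pair for which an extraction rule
would then be derived.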
With the embodiment shown in Fig. 4, the crawled pages are classified, the category to which the preset
target page belongs is determined, the pages corresponding to that category are identified, and the path
instances that reach those pages are selected. By replacing each specific page in the path instances
with its category, valid paths whose nodes are categories can be obtained by aggregation. According to
the mining algorithm and the page classification result, a crawl path graph that meets the preset rules
and its description file are mined from the valid paths. According to the description file and the page
features of each category node, the extraction relationships between the category nodes in the crawl
path graph are generated. Finally, the path planning result is generated from the extraction
relationships. Because the generated path planning result extracts from one class of pages to one or
several other classes of pages until the target class pages are reached, labor is saved and path
planning efficiency is improved compared with manual path planning in the prior art. It also avoids the
omission of target class pages caused by the subjectivity of manual path planning. Compared with
whole-site crawling in the prior art, it not only reduces the downloading of a large number of useless
pages but also ensures that the target class pages are crawled more comprehensively.
It should be noted that, in this document, relational terms such as "first" and "second" are used only
to distinguish one entity or operation from another entity or operation, and do not necessarily require
or imply any actual relationship or order between these entities or operations. Moreover, the terms
"include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so
that a process, method, article, or device that includes a series of elements includes not only those
elements but also other elements not explicitly listed, or elements inherent to such a process, method,
article, or device. Without further limitation, an element defined by the phrase "including a ..." does
not exclude the existence of other identical elements in the process, method, article, or device that
includes the element.
The embodiments in this specification are described in a related manner. For the same or similar parts,
the embodiments may refer to each other, and each embodiment focuses on its differences from the other
embodiments. In particular, since the device embodiment is substantially similar to the method
embodiment, it is described relatively briefly; for relevant details, refer to the description of the
method embodiment.
Those of ordinary skill in the art will understand that all or part of the steps in the above method
embodiments can be completed by a program instructing the relevant hardware. The program can be stored
in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit its
protection scope. Any modification, equivalent replacement, or improvement made within the spirit and
principles of the present invention falls within the protection scope of the present invention.
Claims (10)
1. A crawling path planning method, characterized by comprising:
crawling, according to a preset crawling strategy and starting from a preset entry page, the pages of
the website to which the preset entry page belongs;
collecting the page features of each crawled page, and recording the path instances that reach each
crawled page from the preset entry page;
selecting, according to the recorded path instances and the page features of each crawled page, the
path instances that reach pages similar to a preset target page;
performing path planning according to the selected path instances and the page features of each page in
the selected path instances, and generating a path planning result.
2. The method according to claim 1, characterized in that selecting, according to the recorded path
instances and the page features of each crawled page, the path instances that reach pages similar to
the preset target page comprises:
classifying the crawled pages according to the page features of each crawled page;
determining, according to the page classification result, the category to which the preset target page
belongs;
selecting, from the recorded path instances, the path instances that reach the pages corresponding to
the determined category.
3. The method according to claim 2, characterized in that performing path planning according to the
selected path instances and the page features of each page in the selected path instances, and
generating a path planning result, comprises:
determining, according to the page classification result, the category to which each page in the
selected path instances belongs;
generating, according to the selected path instances and the determined categories, valid paths whose
nodes are categories;
mining, from the obtained valid paths and according to a process mining algorithm and the page
classification result, a crawl path graph that meets preset rules and a description file of the crawl
path graph, wherein the description file includes the relationships between the category nodes in the
crawl path graph;
generating, according to the description file and the page features of each category node, the
extraction relationships between the category nodes in the crawl path graph, wherein the page features
of each category node are obtained from the page features of the pages corresponding to that category
node;
generating a path planning result according to the extraction relationships, wherein the path planning
result includes the extraction relationships described using syntax rules.
4. The method according to claim 2, characterized in that, in the case where the page features include
the page link and the page source code structure, classifying the crawled pages according to the page
features of each crawled page comprises:
calculating, for every two pages among the crawled pages, a first similarity of the page links and a
second similarity of the page source code structures;
summing the first similarity and the second similarity according to preset weights to obtain a
comprehensive similarity;
classifying the crawled pages according to the obtained comprehensive similarity and a preset
similarity measurement value.
5. The method according to any one of claims 1-4, characterized in that the preset crawling strategy is
specifically a breadth-first crawling strategy.
6. A crawling path planning device, characterized by comprising:
a crawling module, configured to crawl, according to a preset crawling strategy and starting from a
preset entry page, the pages of the website to which the preset entry page belongs;
a processing module, configured to collect the page features of each crawled page and to record the
path instances that reach each crawled page from the preset entry page;
a selection module, configured to select, according to the recorded path instances and the page
features of each crawled page, the path instances that reach pages similar to a preset target page;
a planning module, configured to perform path planning according to the selected path instances and the
page features of each page in the selected path instances, and to generate a path planning result.
7. The device according to claim 6, characterized in that the selection module comprises:
a classification unit, configured to classify the crawled pages according to the page features of each
crawled page;
a first determination unit, configured to determine, according to the page classification result, the
category to which the preset target page belongs;
a selection unit, configured to select, from the recorded path instances, the path instances that reach
the pages corresponding to the determined category.
8. The device according to claim 7, characterized in that the planning module comprises:
a second determination unit, configured to determine, according to the page classification result, the
category to which each page in the selected path instances belongs;
a first generation unit, configured to generate, according to the selected path instances and the
determined categories, valid paths whose nodes are categories;
a mining unit, configured to mine, from the obtained valid paths and according to a process mining
algorithm and the page classification result, a crawl path graph that meets preset rules and a
description file of the crawl path graph, wherein the description file includes the relationships
between the category nodes in the crawl path graph;
a second generation unit, configured to generate, according to the description file and the page
features of each category node, the extraction relationships between the category nodes in the crawl
path graph, wherein the page features of each category node are obtained from the page features of the
pages corresponding to that category node;
a third generation unit, configured to generate a path planning result according to the extraction
relationships, wherein the path planning result includes the extraction relationships described using
syntax rules.
9. The device according to claim 7, characterized in that the page features include the page link and
the page source code structure, and the classification unit comprises:
a calculation subunit, configured to calculate, for every two pages among the crawled pages, a first
similarity of the page links and a second similarity of the page source code structures;
an obtaining subunit, configured to sum the first similarity and the second similarity according to
preset weights to obtain a comprehensive similarity;
a classification subunit, configured to classify the crawled pages according to the obtained
comprehensive similarity and a preset similarity measurement value.
10. The device according to any one of claims 6-9, characterized in that the preset crawling strategy
is specifically a breadth-first crawling strategy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610867888.8A CN106547824B (en) | 2016-09-29 | 2016-09-29 | One kind crawling paths planning method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106547824A (en) | 2017-03-29 |
CN106547824B (en) | 2019-11-15 |
Family
ID=58368487
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610867888.8A (Active, CN106547824B) | One kind crawling paths planning method and device | 2016-09-29 | 2016-09-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106547824B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||