CN110413861A - Link extracting method, device, equipment and storage medium based on web crawlers - Google Patents

Link extracting method, device, equipment and storage medium based on web crawlers


Publication number
CN110413861A
Authority
CN
China
Prior art keywords
url
link
url link
subject information
crawled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910670515.5A
Other languages
Chinese (zh)
Other versions
CN110413861B (en)
Inventor
郑禄
王锦群
雷建云
毛腾跃
刘晶
马尧
刘越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities
Priority to CN201910670515.5A
Publication of CN110413861A
Application granted
Publication of CN110413861B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to the field of Internet technology and discloses a web-crawler-based link extraction method, apparatus, device and storage medium. Using a path-aggregation-based anchor multi-attribute integration approach, the invention processes the first URL link of the platform to be visited and the second URL links in the URL queue to be crawled, and obtains multi-attribute topic information in rich text format for each second URL link. It then compares the multi-attribute topic information of each second URL link in the queue with the agricultural product topic information and extracts the second URL links whose multi-attribute topic information has a similarity to the agricultural product topic information that meets a preset threshold. This effectively ensures the accuracy with which specific URL links are extracted and avoids the waste of resources caused by crawling irrelevant links. Crawler performance improves markedly as a result, the crawler can obtain the information people need quickly and accurately, and the user experience improves.

Description

Link extracting method, device, equipment and storage medium based on web crawlers
Technical field
The present invention relates to the field of Internet technology, and in particular to a web-crawler-based link extraction method, apparatus, device and storage medium.
Background art
As the form and content of web pages become more diverse and complex, the quality of the information that web pages present varies widely. A web crawler is not meant to process every link on a page; instead, it selects the links relevant to a topic and crawls those. The crawler should therefore judge whether a link is relevant to the topic before extracting it from the page. The topic link extraction methods in common use at present are: rule-based related-link extraction; related-link extraction based on web page layout and the Document Object Model (DOM) tree; and related-link extraction based on machine learning.
Rule-based related-link extraction can effectively remove noise links, but in practice websites are written by different authors and their links do not always conform to the extraction rules that were formulated. The approach therefore generalizes poorly and cannot maintain high accuracy, which seriously affects crawler performance.
Related-link extraction based on web page layout and the DOM tree can extract topic-related links, but it requires parsing the page's DOM tree into blocks according to the layout features of the website, assigning a different importance weight to the page block at each position, and combining the page title, the page topic and the link anchor text to judge whether a link is relevant. The implementation is therefore relatively complex, and its accuracy depends on the importance weights assigned, which seriously affects crawler performance.
Related-link extraction based on machine learning can extract the related links of a page more accurately, but preparing the training data set up front is costly and the resulting extraction model scales poorly, which also seriously affects crawler performance.
There is therefore an urgent need for a crawler-based way of extracting topic-related links that improves crawler performance, so that the crawler can obtain the information people need quickly and accurately and the user experience improves.
The above content is intended only to aid understanding of the technical solution of the present invention and is not an admission that it constitutes prior art.
Summary of the invention
The main purpose of the present invention is to provide a web-crawler-based link extraction method, apparatus, device and storage medium that improve crawler performance by optimizing the way links are extracted, so that the crawler can obtain the information people need quickly and accurately and the user experience improves.
To achieve the above purpose, the present invention provides a web-crawler-based link extraction method comprising the following steps:
Upon receiving a data crawl request for an agricultural product to be analyzed, extracting from the request the first Uniform Resource Locator (URL) link of the platform to be visited and the agricultural product topic information relevant to the agricultural product to be analyzed;
Sending an access request to the platform to be visited according to the first URL link;
After receiving the response that the platform to be visited makes to the access request, crawling the data information in the page corresponding to the first URL link;
Parsing the data information to obtain the second URL links embedded in the page, and adding the second URL links to the URL queue to be crawled;
Processing the first URL link and the second URL links in the URL queue to be crawled using the path-aggregation-based anchor multi-attribute integration approach, to obtain multi-attribute topic information in rich text format for each second URL link;
Comparing the multi-attribute topic information of each second URL link in the URL queue to be crawled with the agricultural product topic information, and extracting the second URL links whose multi-attribute topic information has a similarity to the agricultural product topic information that meets a preset threshold.
Preferably, the step of parsing the data information to obtain the second URL links embedded in the page and adding the second URL links to the URL queue to be crawled comprises:
Parsing the data information to obtain the second URL links embedded in the page;
Parsing the second URL links to obtain the normalized tags corresponding to the second URL links;
Generating the abstract tree corresponding to the second URL links from the normalized tags;
Matching the node content of the abstract tree against the agricultural product topic information using a DOM tree matching method, and removing the node content that does not match, to obtain the second URL links that match the agricultural product topic information;
Adding the second URL links that match the agricultural product topic information to the URL queue to be crawled.
Preferably, the step of processing the first URL link and the second URL links in the URL queue to be crawled using the path-aggregation-based anchor multi-attribute integration approach, to obtain multi-attribute topic information in rich text format for each second URL link, comprises:
Generating the path access directed graph corresponding to the agricultural product to be analyzed from the first URL link and the second URL links in the URL queue to be crawled;
Determining the shortest access paths in the path access directed graph using the path-aggregation-based anchor multi-attribute integration approach, to obtain a shortest access path set;
Determining the anchor text corresponding to each shortest access path in the shortest access path set to obtain the access path anchor text set corresponding to the shortest access path set, and assigning a weight to each element in the access path anchor text set;
Normalizing the weight of each element in the access path anchor text set according to a preset weight normalization formula;
Sorting the normalized weights in descending order to obtain the multi-attribute topic information in rich text format for the second URL link.
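The steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: the text does not give the weight assignment or the "preset weight normalization formula", so the sketch assumes that anchors closer to the target page receive larger raw weights and that normalization divides each weight by the sum. The function names and the graph layout are hypothetical.

```python
from collections import deque

def shortest_paths(graph, start):
    """Breadth-first search over an unweighted directed graph.

    graph maps a URL to a list of (anchor_text, target_url) edges.
    Returns {url: [anchor texts along one shortest path from start]}.
    """
    paths = {start: []}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for anchor, target in graph.get(current, []):
            if target not in paths:  # the first visit in BFS is a shortest path
                paths[target] = paths[current] + [anchor]
                queue.append(target)
    return paths

def weighted_anchor_terms(anchor_path):
    """Weight the anchor texts of one path, normalize, and sort descending.

    Assumption: the raw weight is the 1-based position on the path, so
    anchors nearer the target page count more; normalization divides by
    the sum of the raw weights.
    """
    if not anchor_path:
        return []
    raw = {}
    for position, anchor in enumerate(anchor_path, start=1):
        raw[anchor] = raw.get(anchor, 0) + position
    total = sum(raw.values())
    normalized = {anchor: weight / total for anchor, weight in raw.items()}
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)

# Hypothetical store: the home page links to two categories, both of
# which link to the same green tea page.
graph = {
    "home": [("Tea", "tea"), ("Fruit", "fruit")],
    "tea": [("Green tea", "green-tea")],
    "fruit": [("Green tea", "green-tea")],
}
paths = shortest_paths(graph, "home")
terms = weighted_anchor_terms(paths["green-tea"])
```

Here `terms` is the sorted list of weighted anchor texts that stands in for the multi-attribute topic information of the `green-tea` link.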
Preferably, the step of comparing the multi-attribute topic information of each second URL link in the URL queue to be crawled with the agricultural product topic information, and extracting the second URL links whose multi-attribute topic information has a similarity to the agricultural product topic information that meets a preset threshold, comprises:
Extracting multi-attribute topic feature words from the multi-attribute topic information and hashing them to obtain first hash values, where the multi-attribute topic feature words are elements of the access path anchor text set corresponding to the multi-attribute topic information;
Obtaining the weights corresponding to the multi-attribute topic feature words from the access path anchor text set and, using those weights, quantizing the first hash values into a first vector;
Extracting agricultural product topic feature words from the agricultural product topic information and hashing them to obtain second hash values;
Quantizing the second hash values into a second vector according to the weights preset for the agricultural product topic feature words;
Comparing the first vector with the second vector, and extracting the second URL links whose multi-attribute topic information has a similarity to the agricultural product topic information that meets the preset threshold.
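Hashing weighted feature words into vectors and comparing them resembles the well-known SimHash technique, so the following Python sketch uses SimHash as one possible reading of these steps. The text names neither the hash function nor the similarity measure; MD5, the 64-bit width, the Hamming-based similarity and the 0.8 threshold are all assumptions made for illustration.

```python
import hashlib

def simhash(weighted_terms, bits=64):
    """Fold weighted feature words into one fingerprint of `bits` bits.

    weighted_terms maps a feature word to its weight. `bits` must be a
    multiple of 8 and at most 128, because an MD5 digest is 16 bytes.
    """
    totals = [0.0] * bits
    for term, weight in weighted_terms.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set bit adds the weight, a clear bit subtracts it.
            totals[i] += weight if (value >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if totals[i] > 0)

def similarity(fp_a, fp_b, bits=64):
    """Return 1 minus the normalized Hamming distance of two fingerprints."""
    distance = bin(fp_a ^ fp_b).count("1")
    return 1.0 - distance / bits

def is_relevant(link_terms, topic_terms, threshold=0.8):
    """Keep a link when its fingerprint is close enough to the topic's."""
    return similarity(simhash(link_terms), simhash(topic_terms)) >= threshold

# Hypothetical weights for a link and for the tea topic.
topic = {"green tea": 0.6, "tea": 0.4}
link = {"green tea": 0.6, "tea": 0.4}
```

A link whose weighted feature words equal those of the topic produces the same fingerprint and a similarity of 1.0; less similar word sets flip more bits and lower the score.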
Preferably, before the step of performing feature extraction on the second URL links in the URL queue to be crawled using the path-aggregation-based anchor multi-attribute integration approach, the method further comprises:
Performing joint deduplication on the second URL links in the URL queue to be crawled using a chain-feature counting Bloom filter combined with multiple hashing, so that any two second URL links in the URL queue to be crawled are different.
Preferably, before the step of performing joint deduplication on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter combined with multiple hashing, the method further comprises:
Traversing the URL queue to be crawled, performing feature analysis on the current second URL link, and extracting the protocol type part, the path part and the query part of the current second URL link;
Obtaining the global-feature URL link corresponding to the current second URL link from the protocol type part, the path part and the query part;
Establishing a correspondence between the current second URL link and the global-feature URL link, and updating the correspondence into the URL queue to be crawled.
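As a rough illustration of the feature analysis, the Python sketch below splits a URL with the standard library and rebuilds a normalized form. How the global-feature URL link is actually composed is not specified in the text; lowercasing the scheme and host, dropping the fragment and sorting the query parameters are assumptions, and the function names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

def split_features(url):
    """Split a URL into the parts used for feature analysis."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme.lower(),   # protocol type part
        "host": parts.netloc.lower(),
        "path": parts.path or "/",        # path part
        "query": parts.query,             # query part
    }

def global_feature_url(url):
    """Rebuild a normalized URL from its parts (assumed normalization)."""
    features = split_features(url)
    # Sorting the parameters makes ?b=2&a=1 and ?a=1&b=2 compare equal.
    pairs = sorted(parse_qsl(features["query"], keep_blank_values=True))
    base = f"{features['scheme']}://{features['host']}{features['path']}"
    return f"{base}?{urlencode(pairs)}" if pairs else base

# Correspondence between each queued link and its global-feature form.
queue = ["HTTP://Shop.Example.com/tea?b=2&a=1#top", "http://shop.example.com"]
correspondence = {url: global_feature_url(url) for url in queue}
```

Two links that differ only in letter case, parameter order or fragment then map to the same global-feature URL, which is what the later duplicate check operates on.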
Preferably, the step of performing joint deduplication on the second URL links in the URL queue to be crawled using the chain-feature counting Bloom filter combined with multiple hashing comprises:
Traversing the URL queue to be crawled and obtaining the global-feature URL link corresponding to the current second URL link;
Performing a whole-link duplicate check on the global-feature URL link using the chain-feature counting Bloom filter, to obtain the duplicate-check flag corresponding to the global-feature URL link;
Performing feature identification on the global-feature URL link according to the duplicate-check flag, to obtain multiple feature fragments;
Recombining the multiple feature fragments according to a preset URL link recombination rule to obtain N recombined URL link fragments, where N is an integer greater than or equal to 1;
Performing a multiple-hash duplicate check on the N recombined URL link fragments, to obtain the duplicate-check result corresponding to the current second URL link;
Retaining or discarding the second URL link in the URL queue to be crawled according to the duplicate-check result.
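The two-stage check can be sketched in Python as follows. This is a simplified illustration, not the claimed procedure: it uses a plain counting Bloom filter for the first pass and an exact SHA-256 digest set as the confirming second pass, and it omits the fragment recombination rule, which the text does not spell out. The class name, filter size and number of hash functions are assumptions.

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter with counters instead of bits, so items can be removed."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, item):
        # Derive several hash positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for index in self._indexes(item):
            self.counters[index] += 1

    def remove(self, item):
        if item in self:
            for index in self._indexes(item):
                self.counters[index] -= 1

    def __contains__(self, item):
        return all(self.counters[index] > 0 for index in self._indexes(item))

def deduplicate(urls):
    """Keep the first occurrence of each URL, preserving order."""
    bloom = CountingBloomFilter()
    seen_digests = set()
    kept = []
    for url in urls:
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        # The Bloom filter can report false positives, so a hit is
        # confirmed against the exact digest before the URL is discarded.
        if url in bloom and digest in seen_digests:
            continue
        bloom.add(url)
        seen_digests.add(digest)
        kept.append(url)
    return kept
```

Counters rather than bits are what allow a URL to be removed from the filter later, for example when a link is dropped from the queue.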
In addition, to achieve the above purpose, the present invention also proposes a web-crawler-based link extraction apparatus, the apparatus comprising:
An extraction module, configured to, upon receiving a data crawl request for an agricultural product to be analyzed, extract from the request the first Uniform Resource Locator (URL) link of the platform to be visited and the topic information relevant to the agricultural product to be analyzed;
A sending module, configured to send an access request to the platform to be visited according to the first URL link;
A crawling module, configured to crawl the data information in the page corresponding to the first URL link after receiving the response that the platform to be visited makes to the access request;
A parsing module, configured to parse the data information to obtain the second URL links embedded in the page, and to add the second URL links to the URL queue to be crawled;
A processing module, configured to process the first URL link and the second URL links in the URL queue to be crawled using the path-aggregation-based anchor multi-attribute integration approach, to obtain multi-attribute topic information in rich text format for each second URL link;
An extraction module, configured to compare the multi-attribute topic information of each second URL link in the URL queue to be crawled with the agricultural product topic information, and to extract the second URL links whose multi-attribute topic information has a similarity to the agricultural product topic information that meets a preset threshold.
In addition, to achieve the above purpose, the present invention also proposes a web-crawler-based link extraction device, the device comprising: a memory, a processor, and a web-crawler-based link extraction program that is stored on the memory and can run on the processor, the program being configured to implement the steps of the web-crawler-based link extraction method described above.
In addition, to achieve the above purpose, the present invention also proposes a computer-readable storage medium on which a web-crawler-based link extraction program is stored; when executed by a processor, the program implements the steps of the web-crawler-based link extraction method described above.
The web-crawler-based link extraction scheme provided by the present invention uses a path-aggregation-based anchor multi-attribute integration approach to process the first URL link of the platform to be visited and the second URL links in the URL queue to be crawled, and obtains multi-attribute topic information in rich text format for each second URL link. It compares the multi-attribute topic information of each second URL link in the queue with the agricultural product topic information and extracts the second URL links whose multi-attribute topic information has a similarity to the agricultural product topic information that meets a preset threshold. This effectively ensures the accuracy with which specific URL links are extracted and avoids the waste of resources caused by crawling irrelevant links, which markedly improves crawler performance, enables the crawler to obtain the information people need quickly and accurately, and improves the user experience.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of the web-crawler-based link extraction device in the hardware operating environment involved in the embodiments of the present invention;
Fig. 2 is a schematic flowchart of the first embodiment of the web-crawler-based link extraction method of the present invention;
Fig. 3 is a schematic diagram of the path access directed graph in the first embodiment of the web-crawler-based link extraction method of the present invention;
Fig. 4 is a schematic flowchart of the second embodiment of the web-crawler-based link extraction method of the present invention;
Fig. 5 is a structural block diagram of the first embodiment of the web-crawler-based link extraction apparatus of the present invention.
The realization of the purpose, the functional features and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description of the embodiments
It should be understood that the specific embodiments described herein are intended only to explain the present invention and not to limit it.
Referring to Fig. 1, Fig. 1 is a schematic structural diagram of the web-crawler-based link extraction device in the hardware operating environment involved in the embodiments of the present invention.
As shown in Fig. 1, the web-crawler-based link extraction device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004 and a memory 1005. The communication bus 1002 provides the connections and communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be high-speed random access memory (RAM) or stable non-volatile memory (NVM), such as disk storage. The memory 1005 may optionally also be a storage device independent of the processor 1001.
Those skilled in the art will understand that the structure shown in Fig. 1 does not limit the web-crawler-based link extraction device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in Fig. 1, the memory 1005, as a storage medium, may contain an operating system, a network communication module, a user interface module and a web-crawler-based link extraction program.
In the web-crawler-based link extraction device shown in Fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with the user. The processor 1001 and the memory 1005 may be arranged in the web-crawler-based link extraction device, which uses the processor 1001 to call the web-crawler-based link extraction program stored in the memory 1005 and executes the web-crawler-based link extraction method provided by the embodiments of the present invention.
An embodiment of the present invention provides a web-crawler-based link extraction method. Referring to Fig. 2, Fig. 2 is a schematic flowchart of the first embodiment of the web-crawler-based link extraction method of the present invention.
In this embodiment, the web-crawler-based link extraction method comprises the following steps:
Step S10: upon receiving a data crawl request for an agricultural product to be analyzed, extract from the request the first Uniform Resource Locator (URL) link of the platform to be visited and the agricultural product topic information relevant to the agricultural product to be analyzed.
Specifically, the execution subject of this embodiment is any terminal device on which a web crawler system is deployed or installed.
It is worth noting that, to make operations such as crawling and parsing the data for the agricultural product to be analyzed as fast as possible, the web crawler system in this embodiment is preferably a distributed web crawler system.
It should be understood that, in practice, the terminal device may be a client device or a server-side device; no limitation is imposed here.
In addition, in practice the platform to be visited may be an online store that displays the agricultural product to be analyzed.
Correspondingly, the Uniform Resource Locator (URL) is the network address needed to access that online store.
It should also be understood that "agricultural product to be analyzed" is a general term for the various agricultural products common at present. In practice, it may be a tea product, a fruit or vegetable food, a cereal product and so on; these are not enumerated here and no limitation is imposed.
For ease of understanding, this embodiment takes a tea product as the agricultural product to be analyzed.
Correspondingly, the agricultural product topic information is then tea product topic information. In practice, tea product topic information may include characteristic information about the tea product to be analyzed, for example that its type is green tea, that it was produced before the Qingming Festival, and that its price is between 500/kg and 1000/kg; further examples are not enumerated here and no limitation is imposed.
Step S20: send an access request to the platform to be visited according to the first URL link.
Specifically, in practice, the web crawler may send the access request to the platform to be visited (in essence, the platform's server) using the HyperText Transfer Protocol (HTTP), which transfers data over the Transmission Control Protocol/Internet Protocol (TCP/IP).
It should be understood that the above is only one specific way of sending an access request to the platform to be visited and does not limit the technical solution of the present invention in any way; in practice, those skilled in the art may configure this as needed, and no limitation is imposed here.
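A minimal Python sketch of such an HTTP access request, using only the standard library, might look like this. The user agent string, the timeout and the function names are hypothetical and are not taken from the patent.

```python
import urllib.request

def build_request(url, user_agent="example-crawler/0.1"):
    """Build an HTTP GET request for the first URL link."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="GET"
    )

def fetch(url, timeout=10):
    """Send the request and return the page body as text.

    Raises urllib.error.URLError (or HTTPError) when the platform cannot
    be reached or answers with an error status.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

# Building the request does not touch the network; fetch() does.
request = build_request("http://shop.example.com/")
```

Separating `build_request` from `fetch` keeps the request construction testable without network access.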
Step S30: after receiving the response that the platform to be visited makes to the access request, crawl the data information in the page corresponding to the first URL link.
It should be understood that, in practice, if the access request is sent successfully and the platform to be visited successfully verifies the first URL link carried in the access request, the platform returns a success response and feeds back the data information in the page corresponding to the first URL link. The web crawler can then crawl the data information that the platform has fed back for that page.
Step S40: parse the data information to obtain the second URL links embedded in the page, and add the second URL links to the URL queue to be crawled.
It should be understood that, in practice, the page corresponding to the first URL link may display, besides the data information of the agricultural product to be analyzed, multiple URL links related to that data information. To distinguish them, they are referred to here as second URL links.
For example, suppose the page corresponding to the first URL link is the home page of an online store that carries the agricultural product to be analyzed. The home page mainly displays four major categories of agricultural product information: agricultural product A, agricultural product B, agricultural product C and agricultural product D. Each major category has its own second URL link, and the page corresponding to that second URL link mainly displays the subcategories that the category contains.
For example, the page for the second URL link of agricultural product A mainly displays agricultural products A-1, A-2 and A-3; the page for the second URL link of agricultural product B mainly displays agricultural products B-1 and B-2; the page for the second URL link of agricultural product C mainly displays agricultural products C-1, C-2, C-3 and C-4; and the page for the second URL link of agricultural product D mainly displays agricultural products D-1 and D-2.
It should be understood that the above is only an example and does not limit the technical solution of the present invention in any way; in practice, those skilled in the art may configure this as needed, and no limitation is imposed here.
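Step S40 can be sketched in Python with the standard-library HTML parser. This is an illustration only: the class and function names are hypothetical, and resolving relative links against the first URL link is an assumption about how embedded links would be made crawlable.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative links are resolved against the first URL link.
                self.links.append(urljoin(self.base_url, href))

def enqueue_links(first_url, html, queue):
    """Parse the page and append its second URL links to the queue."""
    collector = LinkCollector(first_url)
    collector.feed(html)
    queue.extend(collector.links)
    return queue

page = '<a href="/tea">Tea</a><a href="fruit.html">Fruit</a><a>No link</a>'
to_crawl = enqueue_links("http://shop.example/", page, deque())
```

A `deque` is used because a crawl queue is consumed from the front while newly found links are appended at the back.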
In addition, in the present embodiment, why the second URL link embedded in the page is added to wait crawl URL queue is data the second URL link number that is more, thus parsing because web crawlers crawls in practical applications It measures relatively bulky.And it often crawls, parse second URL link and can consume many times, thus a large amount of second URL link It tends not to access in the short time, therefore needs for the second URL link got every time to be added in URL queue to be crawled.
" second " in addition, " first " in above-mentioned described " the first URL link ", and in " the second URL link " is only It is only for distinguishing the URL link embedded in the corresponding URL link of the platform to be visited page corresponding with the URL link, not URL link itself is caused to limit.In practical applications, any one " second URL link " is relative in its corresponding page Embedded URL link can be regarded as one " the first URL link ".
In addition, it is noted that due in practical applications, in addition to meeting in the corresponding page of second URL link Relevant information including agricultural product to be analyzed, will also include some interference informations, for example, various formats advertisement (picture, audio, Video etc.) information.Therefore, in order to simplify the structure of the corresponding page of the second URL link as far as possible, while facilitating web crawlers golden Belong to data to crawl, it, can be right when obtaining the second URL link, and the second URL link is added to URL queue to be crawled Second URL link carries out denoising.
For ease of understanding, this embodiment provides a specific denoising approach, roughly as follows:
(1) Parse the data information to obtain the second URL links embedded in the page.
(2) Parse the second URL link to obtain the standardized tags corresponding to the second URL link.
Specifically, in practical applications, the standardized tags referred to here are essentially tags in the page corresponding to the second URL link.
This is because current web pages are generally written in HyperText Markup Language (HTML).
Furthermore, in practical applications noise links usually reside in image tags, in hyperlinks defined by tags, and in URLs that specify hyperlink targets, so only tags of these kinds need to be standardized.
(3) Generate the abstract tree corresponding to the second URL link from the standardized tags.
(4) Based on a DOM tree matching method, match the node content of the abstract tree against the agricultural product subject information, remove the unmatched node content, and obtain the second URL links that match the agricultural product subject information.
Specifically, because the abstract tree is generated from the standardized tags, each node of the abstract tree is essentially one standardized tag. Therefore, when the node content of the abstract tree is matched against the agricultural product subject information, keywords are extracted from the node and from the agricultural product subject information, the two sets of keywords are compared, and it is then determined whether the node needs to be removed. Once the content of every node in the abstract tree has been matched against the agricultural product subject in this way, the noise links have been removed, and the second URL links matching the agricultural product subject information are obtained.
(5) Add the second URL links that match the agricultural product subject information to the URL queue to be crawled.
It should be understood that the above is only one specific denoising approach and does not limit the technical solution of the present invention in any way. In practical applications, those skilled in the art may configure it as needed, and no limitation is imposed here.
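The denoising steps above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the embodiment's implementation: the class `AnchorCollector`, the function `denoise_links`, the sample page, and the keyword list are all hypothetical, and plain substring matching on anchor text stands in for the DOM tree matching method.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags; other tags are ignored."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text pieces seen inside the open <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def denoise_links(html, topic_keywords):
    """Keep only links whose anchor text contains at least one topic keyword."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [href for href, text in parser.links
            if any(keyword in text for keyword in topic_keywords)]


# Hypothetical page: two topic links and one image-only advertisement link.
page = (
    '<a href="/tea/type">tea type</a>'
    '<a href="/ads/1"><img src="banner.png"></a>'
    '<a href="/tea/price">tea price</a>'
)
queue_to_crawl = denoise_links(page, ["tea"])
```

In this sketch the advertisement link has no anchor text, so it matches no keyword and is dropped before the remaining links enter the queue.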
Step S50: based on the anchor multi-attribute integration mode using path aggregation, process the first URL link and the second URL links in the URL queue to be crawled to obtain the multi-attribute subject information in rich text format corresponding to each second URL link.
To make it easier to understand how the multi-attribute subject information in rich text format corresponding to each second URL link is obtained, a specific implementation is given below, roughly as follows:
(1) Generate the path access directed graph corresponding to the agricultural product to be analyzed from the first URL link and the second URL links in the URL queue to be crawled.
Specifically, each vertex of the path access directed graph is the page corresponding to one URL link. Fig. 3 is used as an example below.
As shown in Fig. 3, the source page u is essentially the page corresponding to the first URL link; "tea type" and "tea price" are the pages corresponding to two embedded second URL links parsed from the data information of that page; and target page v1, target page v2, and target page v3 are the next-level pages whose second URL links are parsed from the pages corresponding to those two second URL links.
(2) Based on the anchor multi-attribute integration mode using path aggregation, determine the shortest access paths in the path access directed graph to obtain a shortest access path set.
Specifically, in practical applications a path access directed graph may contain multiple paths, and the shortest paths among them may be acyclic (open) or cyclic (closed).
For ease of distinction, in practical applications different sets can be used to indicate whether a shortest path is cyclic or acyclic.
For ease of understanding, a specific representation of the set of shortest acyclic paths is given below.
For example, the M shortest acyclic paths from the source page to a target page can be represented by the following set: [set expression not reproduced in this text]
It should be understood that the above is only an example and does not limit the technical solution of the present invention in any way. In practical applications, those skilled in the art may configure it as needed, and no limitation is imposed here.
(3) Determine the anchor text corresponding to each shortest access path in the shortest access path set to obtain the access path anchor text set corresponding to the shortest access path set, and assign a weight to each element in the access path anchor text set.
When a weight is assigned to each element in the access path anchor text set, the following conventions may be used:
First, P_m is a shortest acyclic path, where m satisfies: [condition not reproduced in this text]
Next, w(P_m) ≤ w(P_{m+1}), where m satisfies: [condition not reproduced in this text]
Next, w(P_m) ≤ w(P), where P satisfies: [condition not reproduced in this text]
Finally, P_m is determined before P_{m+1}, where m satisfies: [condition not reproduced in this text]
Here, w denotes the weight and M is a positive integer.
Initially, the weight of each element (that is, each edge) is defined as 1; therefore, if a path P passes through m directed edges, then w(P) = m.
It should be understood that the above is only one specific way of assigning weights and does not limit the technical solution of the present invention in any way. In practical applications, those skilled in the art may configure it as needed, and no limitation is imposed here.
(4) Normalize the weight corresponding to each element in the access path anchor text set according to a preset weight normalization formula.
Specifically, the weight normalization formula used in this embodiment is as follows: [formula not reproduced in this text]
Here, the formula gives the weight of element e in the corresponding anchor text set.
Still taking the path access directed graph shown in Fig. 3 as an example, the weight normalization formula can be used to revise the weights of the elements in the anchor texts of the original access paths from source page u to target page v1, target page v2, and target page v3.
(5) Sort the normalized weights in descending order to obtain the multi-attribute subject information in rich text format corresponding to the second URL link.
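Steps (1) to (5) above can be sketched in Python as follows. This is an illustration under explicit assumptions rather than the embodiment's formula: the graph, its page names, and its anchor texts are hypothetical; every edge has weight 1 as stated above; and, because the normalization formula is not available in this text, the sketch assumes a simple normalization in which each anchor text's occurrence count over all shortest paths is divided by the total count.

```python
from collections import Counter, deque


def shortest_paths(graph, source, target):
    """Return every shortest path from source to target as a list of edges.

    graph maps a page to a list of (next_page, anchor_text) pairs. Each edge
    has weight 1, so the weight of a path equals its number of edges.
    """
    dist = {source: 0}
    parents = {source: []}   # page -> list of (previous_page, anchor_text)
    pending = deque([source])
    while pending:
        u = pending.popleft()
        for v, anchor in graph.get(u, []):
            if v not in dist:                 # first visit fixes the distance
                dist[v] = dist[u] + 1
                parents[v] = [(u, anchor)]
                pending.append(v)
            elif dist[v] == dist[u] + 1:      # another equally short way in
                parents[v].append((u, anchor))
    if target not in dist:
        return []

    def build(page):
        if page == source:
            return [[]]
        return [path + [(prev, page, anchor)]
                for prev, anchor in parents[page]
                for path in build(prev)]

    return build(target)


def ranked_anchor_weights(paths):
    """Assumed normalization: occurrence count / total count, sorted descending."""
    counts = Counter(anchor for path in paths for _, _, anchor in path)
    total = sum(counts.values())
    weights = {anchor: count / total for anchor, count in counts.items()}
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Hypothetical graph loosely modelled on Fig. 3: two equally short routes to v1.
graph = {
    "u": [("tea_type", "tea type"), ("tea_price", "tea price")],
    "tea_type": [("v1", "green tea")],
    "tea_price": [("v1", "green tea price")],
}
paths = shortest_paths(graph, "u", "v1")
ranked = ranked_anchor_weights(paths)
```

The ranked list of (anchor text, normalized weight) pairs plays the role of the multi-attribute subject information for the target page in this sketch.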
Step S60: compare the multi-attribute subject information corresponding to each second URL link in the URL queue to be crawled with the agricultural product subject information, and extract the second URL links whose multi-attribute subject information has a similarity to the agricultural product subject information that meets a preset threshold.
To make the web-crawler-based link extraction operation easier to understand, a specific implementation is given below, roughly as follows:
(1) Extract a multi-attribute subject feature word from the multi-attribute subject information and hash it to obtain a first hash value, where the multi-attribute subject feature word is an element of the access path anchor text set corresponding to the multi-attribute subject information.
(2) Obtain the weight corresponding to the multi-attribute subject feature word from the access path anchor text set, and quantize the first hash value into a first vector in combination with that weight.
(3) Extract an agricultural product subject feature word from the agricultural product subject information and hash it to obtain a second hash value.
(4) Quantize the second hash value into a second vector according to the weight preset for the agricultural product subject feature word.
(5) Compare the first vector with the second vector, and extract the second URL links whose multi-attribute subject information has a similarity to the agricultural product subject information that meets the preset threshold.
Specifically, this embodiment converts the comparison between the multi-attribute subject information and the agricultural product subject information into a comparison between two vectors, so that the comparison result is obtained more intuitively, which both facilitates link extraction and ensures accuracy.
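One possible reading of steps (1) to (5) is a SimHash-style weighted vector comparison, sketched below in Python. The text does not specify the hash function, the quantization rule, or the similarity measure, so the use of MD5, a 64-dimensional signed vector, and cosine similarity are all assumptions, as are the function names and sample data.

```python
import hashlib
import math


def weighted_hash_vector(weighted_words, bits=64):
    """Quantize hashed feature words into one vector (assumed SimHash-style rule).

    Each word is hashed; bit i of the hash adds +weight to component i when the
    bit is 1 and -weight when it is 0.
    """
    vector = [0.0] * bits
    for word, weight in weighted_words.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            vector[i] += weight if (value >> i) & 1 else -weight
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def extract_links(candidates, topic_words, threshold):
    """Return the URLs whose feature-word vector is close enough to the topic vector."""
    topic_vector = weighted_hash_vector(topic_words)
    return [url for url, words in candidates.items()
            if cosine_similarity(weighted_hash_vector(words), topic_vector) >= threshold]


# Hypothetical feature words and weights.
topic = {"tea": 0.6, "price": 0.4}
candidates = {
    "/tea/price": {"tea": 0.6, "price": 0.4},
    "/ads/1": {"banner": 1.0},
}
selected = extract_links(candidates, topic, threshold=0.9)
```

A candidate whose weighted feature words equal the topic's feature words yields the same vector and therefore a similarity of 1, so it always passes any threshold up to 1.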
From the above description it is readily apparent that the web-crawler-based link extraction method provided in this embodiment processes the first URL link of the platform to be visited and the second URL links in the URL queue to be crawled using the anchor multi-attribute integration mode based on path aggregation, obtains the multi-attribute subject information in rich text format corresponding to each second URL link, compares that information with the agricultural product subject information, and extracts the second URL links whose multi-attribute subject information has a similarity to the agricultural product subject information that meets the preset threshold. This effectively ensures the accuracy of extracting specific URL links and avoids the resources a web crawler would otherwise waste crawling irrelevant links, which significantly improves the crawler's performance, enables it to obtain the information people need quickly and accurately, and improves the user experience.
Referring to Fig. 4, Fig. 4 is a schematic flowchart of a second embodiment of the web-crawler-based link extraction method of the present invention.
Based on the first embodiment above, the web-crawler-based link extraction method of this embodiment further includes, before step S50:
Step S00: use a link-feature counting Bloom filter, in combination with multiple hashing, to jointly deduplicate the second URL links in the URL queue to be crawled.
Specifically, the joint deduplication of the second URL links in the URL queue to be crawled, performed with the link-feature counting Bloom filter in combination with multiple hashing, is mainly divided into deduplication of the overall feature URL link corresponding to each URL link and deduplication of URL link fragments.
Because the URL link fragments are derived from the overall feature URL link, the correspondence between each second URL link and its overall feature URL link must first be determined so that the joint deduplication can proceed smoothly.
For ease of understanding, this embodiment provides a specific way of determining the correspondence between a second URL link and its overall feature URL link, roughly as follows:
(1) Traverse the URL queue to be crawled, perform feature analysis on the current second URL link, and extract the protocol type part, the path part, and the query part of the current second URL link.
Specifically, in practical applications a URL link uniquely identifies a resource on the network. In general, a URL link includes the following five components: a protocol type part (usually denoted Protocol), a server address part (usually denoted Host), a port number part (usually denoted Port), a path part (usually denoted Path), and a query part (usually denoted Query).
Of these, the protocol type part, the path part, and the query part can usually reflect the features of a URL link.
Therefore, this embodiment traverses the URL queue to be crawled, performs feature analysis on the current second URL link, and extracts its protocol type part (denoted p1 below for ease of explanation), path part (denoted p2 below), and query part (denoted p3 below).
(2) Obtain the overall feature URL link corresponding to the current second URL link from the protocol type part, the path part, and the query part.
Specifically, because the three parts p1, p2, and p3 can reflect all the features of the current second URL link, combining p1, p2, and p3 yields the overall feature URL link corresponding to the current second URL link. Below, p1p2p3 denotes the overall feature URL link corresponding to each second URL link.
(3) Establish the correspondence between the current second URL link and the overall feature URL link, and update the correspondence into the URL queue to be crawled.
Specifically, this embodiment establishes the correspondence between the current second URL link and the overall feature URL link and updates it into the URL queue to be crawled so that, during subsequent deduplication of the second URL links, the overall feature URL link corresponding to the current second URL link can be found quickly from the correspondence, and the URL link fragments corresponding to the current second URL link can then be obtained from the overall feature URL link.
In addition, in practical applications the correspondence need not be updated into the URL queue to be crawled and may instead be stored separately. When the second URL links in the URL queue to be crawled are jointly deduplicated, the overall feature URL link corresponding to the current second URL link is simply looked up in the separately stored correspondence table.
It should be understood that the above is only an example and does not limit the technical solution of the present invention in any way. In practical applications, those skilled in the art may configure it as needed, and no limitation is imposed here.
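The correspondence between a second URL link and its overall feature URL link can be sketched in Python with the standard library's `urllib.parse.urlsplit`. The function names and sample URLs below are hypothetical, and the sketch takes the separately stored correspondence table to be an ordinary dictionary.

```python
from urllib.parse import urlsplit


def overall_feature(url):
    """Return (p1, p2, p3): protocol type part, path part, and query part."""
    parts = urlsplit(url)
    return parts.scheme, parts.path, parts.query


def build_feature_map(queue):
    """Separately stored correspondence table: second URL link -> (p1, p2, p3)."""
    return {url: overall_feature(url) for url in queue}


# Hypothetical queue to be crawled.
queue = [
    "https://example.com:8080/tea/price?year=2019",
    "http://example.com/tea/type",
]
feature_map = build_feature_map(queue)
p1, p2, p3 = feature_map[queue[0]]
overall_feature_url = p1 + p2 + p3   # the p1p2p3 combination described above
```

Because the server address and port number are not among the three extracted parts, two links on different hosts that share the same protocol, path, and query map to the same overall feature in this sketch.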
Further, after the above correspondence and the overall feature URL link corresponding to each second URL link have been obtained, the joint deduplication of the second URL links in the URL queue to be crawled, using the link-feature counting Bloom filter in combination with multiple hashing, may specifically proceed as follows:
(1) Traverse the URL queue to be crawled and obtain the overall feature URL link corresponding to the current second URL link.
Specifically, the overall feature URL link corresponding to the current second URL link is obtained from the correspondence described above.
(2) Use the link-feature counting Bloom filter to perform an overall duplicate check on the overall feature URL link and obtain the duplicate-check identifier corresponding to the overall feature URL link.
Specifically, the counting Bloom filter used in this embodiment is not the existing counting Bloom filter used for link deduplication, but a counting Bloom filter based on the link features of URL links.
That is, when deduplicating links, the counting Bloom filter of this embodiment performs feature recognition on the overall feature URL link corresponding to each second URL link in the URL queue to be crawled and then performs the overall duplicate check according to the recognized features; in other words, the features of each second URL link to be deduplicated are compared, thereby achieving the overall duplicate check.
In addition, to make it easier to identify the URL link fragments subsequently recombined from the feature fragments, a corresponding duplicate-check identifier may also be assigned to the overall feature URL link.
(3) According to the duplicate-check identifier, perform feature recognition on the overall feature URL link to obtain multiple feature fragments.
Specifically, again taking the overall feature URL link p1p2p3 as an example, the multiple feature fragments obtained by performing feature recognition on the overall feature URL link may be the fragments containing the protocol type part, the path part, and the query part respectively, that is, feature fragment p1, feature fragment p2, and feature fragment p3.
(4) Recombine the multiple feature fragments according to a preset URL link recombination rule to obtain N recombined URL link fragments.
It should be understood that, because an overall feature URL link consists of three parts (the protocol type part, the path part, and the query part), at least one recombined URL link fragment can be obtained; N in this embodiment is therefore an integer greater than or equal to 1.
In addition, in practical applications the URL link recombination rule may be set by those skilled in the art as needed. For example, the rule may require that a recombined URL link fragment must include feature fragment p1, or that a recombined URL link fragment must not include feature fragment p3. The possibilities are not enumerated one by one here, and no limitation is imposed.
Correspondingly, if the URL link recombination rule requires that a recombined URL link fragment must include feature fragment p1, the recombined URL link fragments obtained generally include a fragment containing only feature fragment p1, a fragment containing only feature fragments p1 and p2, and a fragment containing only feature fragments p1 and p3.
If the URL link recombination rule requires that a recombined URL link fragment must not include feature fragment p3, the recombined URL link fragments obtained generally include a fragment containing only feature fragment p1 and a fragment containing only feature fragments p1 and p2.
It should be understood that the above is only an example and does not limit the technical solution of the present invention in any way. In practical applications, those skilled in the art may configure it according to actual needs, and no limitation is imposed here.
(5) Perform a multiple-hash duplicate check on the N recombined URL link fragments to obtain the duplicate-check result corresponding to the current second URL link.
It is worth noting that, in practical applications, a large number of second URL links may be cached in the URL queue to be crawled, so the number of URL link fragments obtained after recombination is even larger. Therefore, in this embodiment, in order to reduce as far as possible the storage space occupied by the second URL links cached in the URL queue to be crawled, after the multiple feature fragments have been recombined according to the preset URL link recombination rule and the N recombined URL link fragments have been obtained, each of the N recombined URL link fragments may first be compressed using the MD5 algorithm to obtain the string ciphertext corresponding to each recombined URL link fragment, and the content of each recombined URL link fragment is finally replaced with its string ciphertext.
It should be understood that the above is only one specific compression approach and does not limit the technical solution of the present invention in any way. In practical applications, those skilled in the art may choose a suitable compression method according to actual needs, and no limitation is imposed here.
Correspondingly, the operation of performing a multiple-hash duplicate check on the N recombined URL link fragments to obtain the duplicate-check result corresponding to the current second URL link is specifically as follows:
(5-1) Extract the string ciphertexts corresponding to the N recombined URL link fragments, select any one of the N string ciphertexts, and hash it K times to obtain K hash values.
It should be understood that, because the link deduplication operation provided in this embodiment combines multiple hashing when jointly deduplicating links, a string ciphertext must be hashed at least twice; K is therefore an integer greater than or equal to 2.
(5-2) Hash the K hash values into a pre-constructed bit vector space as reference hash values, and set an initial count value on the space-variable counter corresponding to each reference hash value.
Specifically, in this embodiment the initial count value shown on the space-variable counter corresponding to each reference hash value is represented by "0".
(5-3) Hash each of the remaining N-1 string ciphertexts K times to obtain the K hash values corresponding to each remaining string ciphertext.
(5-4) Randomly hash the K hash values corresponding to each remaining string ciphertext into the bit vector space so that each is adjacent to one of the reference hash values.
Specifically, to make it easy to determine which reference hash value a newly hashed value in the bit vector space is adjacent to, a criterion may be preset. For example, when a new hash value is inserted between two adjacent reference hash values, the reference hash value nearest to the newly inserted hash value may be selected as the adjacent reference hash value.
It should be understood that the above is only an example and does not limit the technical solution of the present invention in any way. In practical applications, those skilled in the art may configure it according to actual needs, and no limitation is imposed here.
(5-5) Using the head-insertion method, insert one preset character in front of the initial count value of the adjacent reference hash value for each hash value newly hashed into the bit vector space.
Specifically, in this embodiment the preset character is "1".
For example, for a given reference hash value, the initial count value shown on the corresponding space-variable counter is "0". When one new hash value is hashed to a position adjacent to it, one preset character "1" is inserted in front of the "0" using the head-insertion method, and the count value shown on the space-variable counter becomes "10".
Correspondingly, if two new hash values are hashed to positions adjacent to this reference hash value, two preset characters "1" are inserted in front of the "0" using the head-insertion method, and the count value shown on the space-variable counter becomes "110".
(5-6) Count the number of preset characters in front of the initial count value corresponding to each reference hash value, and determine the duplicate-check result corresponding to the current second URL link according to that number.
Specifically, the duplicate-check result may be determined as follows:
If the number of preset characters "1" in front of the initial count value "0" is greater than 1, the recombined URL fragment is determined to be a duplicate and needs to be discarded;
Otherwise, the recombined URL fragment is determined not to be a duplicate and can be retained.
(6) According to the duplicate-check result, retain or discard the second URL link in the URL queue to be crawled.
It should be understood that the above is only one specific way of implementing joint deduplication and does not limit the technical solution of the present invention in any way. In practical applications, those skilled in the art may make reasonable adjustments as needed, and no limitation is imposed here.
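The joint deduplication above can be approximated in Python with a conventional counting Bloom filter. This sketch is a deliberate simplification and an assumption throughout: it uses integer counters rather than the space-variable counter with head insertion described above, derives its K hash functions by salting MD5, and checks duplicates only on the MD5 digest of the overall feature URL link. The function `recombine` merely illustrates the example rule that every recombined fragment must contain p1. All names and sizes are hypothetical.

```python
import hashlib


def md5_hex(text):
    """Return the 32-character MD5 hex digest of text (the 'string ciphertext')."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def recombine(p1, p2, p3):
    """Example rule from the text: every recombined fragment must contain p1."""
    return [p1, p1 + p2, p1 + p3]


class CountingBloomFilter:
    """Conventional counting Bloom filter with k salted MD5 hash functions."""

    def __init__(self, size=1024, k=3):
        self.size = size
        self.k = k                       # the text requires K >= 2
        self.counters = [0] * size

    def _positions(self, item):
        return [int(md5_hex(f"{i}:{item}"), 16) % self.size for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def count(self, item):
        """Upper bound on how many times item has been added."""
        return min(self.counters[pos] for pos in self._positions(item))


def deduplicate(feature_urls):
    """Keep the first occurrence of each overall feature URL link (p1, p2, p3)."""
    bloom = CountingBloomFilter()
    kept = []
    for p1, p2, p3 in feature_urls:
        key = md5_hex(p1 + p2 + p3)      # compressed overall feature URL link
        if bloom.count(key) > 0:         # probably seen before: discard
            continue
        bloom.add(key)
        kept.append((p1, p2, p3))
    return kept


# Hypothetical input: the second entry repeats the first.
features = [("https", "/tea", "x=1"), ("https", "/tea", "x=1")]
kept = deduplicate(features)
fragment_digests = [md5_hex(f) for f in recombine("https", "/tea", "x=1")]
```

A Bloom filter never misses an item that was added, so a repeated link is always discarded; it can, however, report a false positive, which is the misjudgment rate the joint overall and fragment check is intended to reduce.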
In addition, in practical applications, in order to further reduce the occupation of storage space, after the second URL links in the URL queue to be crawled have been jointly deduplicated using the link-feature counting Bloom filter in combination with multiple hashing, each second URL link remaining in the URL queue to be crawled may also be compressed using the MD5 algorithm to obtain the string ciphertext corresponding to each second URL link. The content of each second URL link is finally replaced with its string ciphertext, so that the second URL links in the URL queue to be crawled are compressed and the occupation of storage space is reduced as far as possible.
From the above description it is readily apparent that the web-crawler-based link extraction method provided in this embodiment deduplicates the second URL links in the URL queue to be crawled before the extraction operation is performed on them, which further reduces unnecessary interference during link extraction and improves the extraction efficiency of the web crawler.
In addition, this embodiment uses the link-feature counting Bloom filter in combination with multiple hashing to perform joint overall and partial deduplication on the second URL links cached in the URL queue to be crawled, which reduces the false positive rate of the counting Bloom filter as far as possible, effectively improves the performance of the web crawler, enables it to obtain the information people need quickly and accurately, and improves the user experience as far as possible.
In addition, during deduplication, the URL links are compressed with a compression algorithm such as the MD5 algorithm, which reduces the occupation of storage space as far as possible.
In addition, an embodiment of the present invention also provides a computer-readable storage medium on which a web-crawler-based link extraction program is stored; when executed by a processor, the web-crawler-based link extraction program implements the steps of the web-crawler-based link extraction method described above.
Referring to Fig. 5, Fig. 5 is a structural block diagram of a first embodiment of the web-crawler-based link extraction device of the present invention.
As shown in Fig. 5, the web-crawler-based link extraction device provided by the embodiment of the present invention includes: an extraction module 5001, a sending module 5002, a grabbing module 5003, a parsing module 5004, a processing module 5005, and an extraction module 5006.
The extraction module 5001 is configured to, upon receiving a data grabbing request for the agricultural product to be analyzed, extract from the data grabbing request the first uniform resource locator (URL) link of the platform to be visited and the subject information related to the agricultural product to be analyzed; the sending module 5002 is configured to send an access request to the platform to be visited according to the first URL link; the grabbing module 5003 is configured to, after receiving the response made by the platform to be visited according to the access request, grab the data information in the page corresponding to the first URL link; the parsing module 5004 is configured to parse the data information, obtain the second URL links embedded in the page, and add the second URL links to the URL queue to be crawled; the processing module 5005 is configured to process the first URL link and the second URL links in the URL queue to be crawled based on the anchor multi-attribute integration mode using path aggregation, and obtain the multi-attribute subject information in rich text format corresponding to each second URL link; and the extraction module 5006 is configured to compare the multi-attribute subject information corresponding to each second URL link in the URL queue to be crawled with the agricultural product subject information, and extract the second URL links whose multi-attribute subject information has a similarity to the agricultural product subject information that meets the preset threshold.
It should be understood that each module involved in this embodiment is a logical module. In practical applications, a logical unit may be one physical unit, part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, units that are not closely related to solving the technical problem addressed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
In addition, to make it easier to understand the specific processing flow of each functional module of the web-crawler-based link extraction device provided in this embodiment in practical applications, the processing of the parsing module 5004, the processing module 5005, and the extraction module 5006 is described in detail below.
Specifically, the operation performed by the parsing module 5004 of parsing the data information, obtaining the second URL links embedded in the page, and adding the second URL links to the URL queue to be crawled is implemented in a specific application roughly as follows:
First, the data information is parsed to obtain the second URL links embedded in the page;
Next, the second URL link is parsed to obtain the standardized tags corresponding to the second URL link;
Next, the abstract tree corresponding to the second URL link is generated from the standardized tags;
Next, based on a DOM tree matching method, the node content of the abstract tree is matched against the agricultural product subject information, the unmatched node content is removed, and the second URL links that match the agricultural product subject information are obtained;
Finally, the second URL links that match the agricultural product subject information are added to the URL queue to be crawled.
It should be understood that the above is only one specific way of denoising the second URL links in the URL queue to be crawled and does not limit the technical solution of the present invention in any way. In a specific application, those skilled in the art may configure it as needed, and the present invention imposes no limitation on this.
In addition, the operation performed by the processing module 5005 of processing the first URL link and the second URL links in the URL queue to be crawled based on the anchor multi-attribute integration mode using path aggregation, and obtaining the multi-attribute subject information in rich text format corresponding to each second URL link, is implemented in a specific application roughly as follows:
First, the path access directed graph corresponding to the agricultural product to be analyzed is generated from the first URL link and the second URL links in the URL queue to be crawled;
Next, based on the anchor multi-attribute integration mode using path aggregation, the shortest access paths in the path access directed graph are determined to obtain a shortest access path set;
Next, the anchor text corresponding to each shortest access path in the shortest access path set is determined to obtain the access path anchor text set corresponding to the shortest access path set, and a weight is assigned to each element in the access path anchor text set;
Next, the weight corresponding to each element in the access path anchor text set is normalized according to a preset weight normalization formula;
Finally, the normalized weights are sorted in descending order to obtain the multi-attribute subject information in rich text format corresponding to the second URL link.
It should be understood that the above is only one specific way of obtaining the multi-attribute subject information in rich text format corresponding to each second URL link in the URL queue to be crawled and does not limit the technical solution of the present invention in any way. In a specific application, those skilled in the art may configure it as needed, and the present invention imposes no limitation on this.
In addition, the extraction module 5006 execute respectively by each second URL link in the URL queue to be crawled The step of corresponding multiple attributes subject information and the agricultural product subject information compare is extracted and the agricultural product theme Information similarity meets the corresponding second URL link operation of multiple attributes subject information of preset threshold, real in a particular application Existing process approximately as:
Firstly, multiple attributes theme feature word is extracted from the multiple attributes subject information, to the multiple attributes master It inscribes Feature Words and carries out Hash processing, obtain the first cryptographic Hash, the multiple attributes theme feature word is the multiple attributes theme An element in the corresponding access path Anchor Text set of information;
Then, the corresponding weight of the multiple attributes theme feature word is obtained from the access path Anchor Text set, And first cryptographic Hash is quantified as primary vector in conjunction with the weight;
Then, agricultural product theme feature word is extracted from the agricultural product subject information, and special to the agricultural product theme It levies word and carries out Hash processing, obtain the second cryptographic Hash;
Then, according to for the preset weight of agricultural product theme feature word by second cryptographic Hash be quantified as second to Amount;
Finally, the primary vector and the secondary vector are compared, extract and the agricultural product subject information phase Meet corresponding second URL link of multiple attributes subject information of preset threshold like degree.
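The comparison described above resembles a SimHash-style weighted fingerprint, and can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the text does not name the hash function, the vector length, the similarity measure, or the threshold value, so the use of MD5, 64-bit vectors, cosine similarity, and the default threshold of 0.5, as well as all function names, are hypothetical choices.

```python
import hashlib
import math

BITS = 64  # assumed vector length


def word_vector(word, weight):
    """Hash a feature word and quantize the hash into a weighted vector.

    Each bit of the hash contributes +weight if set and -weight otherwise.
    """
    digest = hashlib.md5(word.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return [weight if (value >> i) & 1 else -weight for i in range(BITS)]


def topic_vector(weighted_words):
    """Sum the weighted vectors of all (word, weight) pairs."""
    total = [0.0] * BITS
    for word, weight in weighted_words:
        for i, component in enumerate(word_vector(word, weight)):
            total[i] += component
    return total


def cosine(u, v):
    """Cosine similarity, used here as the assumed comparison measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_links(candidates, topic_words, threshold=0.5):
    """Keep the links whose feature words are similar enough to the topic.

    candidates maps a second URL link to its [(feature_word, weight), ...].
    """
    topic = topic_vector(topic_words)
    return [
        url
        for url, words in candidates.items()
        if cosine(topic_vector(words), topic) >= threshold
    ]
```

A link whose feature words coincide with the topic feature words yields a similarity of 1.0 and is always retained; how unrelated words score depends on the hash values, so the threshold would need tuning on real data.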
It should be understood that the above is only one specific implementation of extracting specific links from the URL queue to be crawled, and does not limit the technical solution of the present invention in any way. In a specific application, those skilled in the art may configure this as needed, and the present invention places no restriction on it.
From the above description it is not difficult to see that the link extraction apparatus based on a web crawler provided in this embodiment processes the first URL link of the platform to be accessed and the second URL links in the URL queue to be crawled based on the anchor multi-attribute integration method based on path aggregation, obtains the multi-attribute subject information in rich text format corresponding to each second URL link, compares the multi-attribute subject information corresponding to each second URL link in the URL queue to be crawled with the agricultural product subject information, and extracts the second URL links whose multi-attribute subject information has a similarity to the agricultural product subject information that meets the preset threshold. This effectively ensures the accuracy of extracting specific URL links and avoids the waste of resources caused by the web crawler crawling irrelevant links, thereby significantly improving the performance of the web crawler, enabling it to obtain the information people need quickly and accurately, and improving the user experience.
It should be noted that the workflow described above is only illustrative and does not limit the protection scope of the present invention. In practical applications, those skilled in the art may select some or all of it as needed to achieve the purpose of the solution of this embodiment, and no restriction is imposed here.
In addition, for technical details not described in detail in this embodiment, reference may be made to the link deduplication method provided by any embodiment of the present invention, and details are not repeated here.
Based on the first embodiment of the above link extraction apparatus based on a web crawler, a second embodiment of the link extraction apparatus based on a web crawler of the present invention is proposed.
In this embodiment, the link extraction apparatus based on a web crawler further includes a deduplication module.
The deduplication module is configured to perform joint deduplication on the second URL links in the URL queue to be crawled, using a counting Bloom filter based on link features in combination with multiple hashing.
It should be noted that each module involved in this embodiment is a logical module. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, to highlight the innovative part of the present invention, units less closely related to solving the technical problem proposed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
In addition, it is worth noting that in this embodiment, when the deduplication module performs joint deduplication on the second URL links in the URL queue to be crawled using the counting Bloom filter based on link features in combination with multiple hashing, the operation is divided into deduplication of the global feature URL link corresponding to each URL link and deduplication of URL link fragments.
Because the URL link fragments are obtained from the global feature URL link, the correspondence between each second URL link and its global feature URL link must be determined first, so that the deduplication module can perform the above operations successfully.
The correspondence between a second URL link and a global feature URL link can be determined roughly as follows:
First, traverse the URL queue to be crawled, perform feature analysis on the current second URL link being traversed, and extract the protocol type part, the path part, and the query part of the current second URL link;
Then, obtain the global feature URL link corresponding to the current second URL link according to the protocol type part, the path part, and the query part;
Finally, establish the correspondence between the current second URL link and the global feature URL link, and update the correspondence into the URL queue to be crawled.
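The feature analysis above can be sketched with Python's standard `urllib.parse` module. The text does not define exactly how the global feature URL link is assembled from the three parts, so the normalization below (lowercasing the scheme and host, keeping the host with the path, sorting query parameters, and dropping the fragment) is an assumption, and the function names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def global_feature_url(url):
    """Build a normalized link from the protocol, path, and query parts.

    Assumption: two links that differ only in letter case of the scheme or
    host, in query parameter order, or in the fragment are the same resource.
    """
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, query, "")
    )


def build_correspondence(queue):
    """Map every second URL link in the queue to its global feature link."""
    return {url: global_feature_url(url) for url in queue}


# Hypothetical queue content
mapping = build_correspondence(
    [
        "HTTP://Example.com/a/b?y=2&x=1#top",
        "http://example.com/a/b?x=1&y=2",
    ]
)
```

Both entries of the example queue map to `http://example.com/a/b?x=1&y=2`, which is what lets a later deduplication step treat them as the same link.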
Correspondingly, after the above correspondence is obtained, the operation performed by the deduplication module is specifically as follows:
First, traverse the URL queue to be crawled and obtain the global feature URL link corresponding to the current second URL link being traversed;
Then, perform a whole-link duplicate check on the global feature URL link using the counting Bloom filter based on link features, and obtain the duplicate check identifier corresponding to the global feature URL link;
Next, perform feature recognition on the global feature URL link according to the duplicate check identifier to obtain multiple feature fragments;
Then, recombine the multiple feature fragments according to a preset URL link recombination rule to obtain N recombined URL link fragments;
Then, perform a multiple-hash duplicate check on the N recombined URL link fragments to obtain the duplicate check result corresponding to the current second URL link;
Finally, retain or discard the second URL link in the URL queue to be crawled according to the duplicate check result.
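For readers unfamiliar with counting Bloom filters, the following Python sketch shows the generic data structure and how it can drive the whole-link stage of the duplicate check. It is a textbook counting Bloom filter, not the link-feature variant of the invention, and it does not cover the fragment recombination or the multiple-hash stage; the class name, the default sizes, and the SHA-256 based double hashing are assumptions made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Generic counting Bloom filter: counters instead of single bits."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # Double hashing: derive num_hashes positions from one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item):
        # Counters make deletion possible, unlike a plain Bloom filter.
        if item in self:
            for pos in self._positions(item):
                self.counters[pos] -= 1


def deduplicate(global_feature_urls, bloom=None):
    """Keep the first occurrence of each link, discard later repeats."""
    if bloom is None:
        bloom = CountingBloomFilter()
    kept = []
    for url in global_feature_urls:
        if url in bloom:
            continue  # treated as a duplicate and discarded
        bloom.add(url)
        kept.append(url)
    return kept
```

Because a Bloom filter can report a link as seen when it was not, a second and finer check (the fragment-level multiple hashing in the text) is a plausible way to lower the misjudgment rate.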
It should be noted that in this embodiment, N is an integer greater than or equal to 1.
In addition, it should be understood that the above is only one specific implementation of determining the correspondence between the second URL link and the global feature URL link, and of performing joint deduplication on the second URL links in the URL queue to be crawled using the counting Bloom filter based on link features in combination with multiple hashing, and does not limit the technical solution of the present invention in any way. In a specific application, those skilled in the art may configure this as needed, and the present invention places no restriction on it.
Further, in practical applications, in order to reduce as far as possible the storage space occupied by the second URL links cached in the URL queue to be crawled, after the multiple feature fragments have been recombined according to the preset URL link recombination rule to obtain the N recombined URL link fragments, each of the N recombined URL link fragments may first be compressed based on the MD5 algorithm to obtain the character string ciphertext corresponding to each recombined URL link fragment, and the character string ciphertext then replaces the content of the corresponding recombined URL link fragment.
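The MD5 step above can be sketched with Python's standard `hashlib` module. The function name `compress_fragments` is hypothetical. Note that MD5 is a one-way hash rather than a reversible compression: every fragment becomes a fixed-length 32-character hexadecimal string, which is sufficient for duplicate checking but does not allow the original fragment to be restored.

```python
import hashlib


def compress_fragments(fragments):
    """Replace each recombined URL link fragment with its MD5 hex digest."""
    return [hashlib.md5(f.encode("utf-8")).hexdigest() for f in fragments]


# Hypothetical fragments of one global feature URL link
ciphertexts = compress_fragments(["http://example.com/a", "/a/b?x=1"])
```

Whatever the length of the input fragment, each stored value occupies 32 characters, which bounds the space used per fragment.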
Correspondingly, the operation of performing a multiple-hash duplicate check on the N recombined URL link fragments to obtain the duplicate check result corresponding to the current second URL link is specifically as follows:
First, extract the character string ciphertexts corresponding to the N recombined URL link fragments, select any one of the N character string ciphertexts, and hash it K times to obtain K hash values;
Then, hash the K hash values into a bit vector space constructed in advance as reference hash values, and set an initial count value for the variable-space counter corresponding to each reference hash value;
Next, hash each of the remaining N-1 character string ciphertexts K times to obtain the K hash values corresponding to each remaining character string ciphertext;
Then, randomly hash the K hash values corresponding to each remaining character string ciphertext into the bit vector space, each adjacent to any one of the reference hash values;
Then, for each hash value newly hashed into the bit vector space, insert one preset character before the initial count value corresponding to the adjacent reference hash value, using head insertion;
Finally, count the number of preset characters before the initial count value corresponding to each reference hash value, and determine the duplicate check result corresponding to the current second URL link according to the number of preset characters.
It should be noted that in this embodiment, K is an integer greater than or equal to 2.
In addition, it should be understood that the above is only one specific implementation of obtaining the duplicate check result corresponding to the current second URL link, and does not limit the technical solution of the present invention in any way. In a specific application, those skilled in the art may configure this as needed, and the present invention places no restriction on it.
In addition, in practical applications, in order to further reduce the storage space occupied, after joint deduplication has been performed on the second URL links in the URL queue to be crawled, each second URL link remaining in the deduplicated URL queue to be crawled may also be compressed based on the MD5 algorithm to obtain the character string ciphertext corresponding to each second URL link. The character string ciphertext then replaces the content of the corresponding second URL link, so that the second URL links in the URL queue to be crawled are compressed as far as possible and occupy less storage space.
From the above description it is not difficult to see that, before performing the extraction operation on the second URL links in the URL queue to be crawled, the link extraction apparatus based on a web crawler provided in this embodiment deduplicates the second URL links in the URL queue to be crawled, which further reduces unnecessary interference during link extraction and improves the extraction efficiency of the web crawler.
In addition, this embodiment uses the counting Bloom filter based on link features in combination with multiple hashing to perform joint whole-link and fragment-level deduplication on the second URL links cached in the URL queue to be crawled. This reduces the misjudgment rate of the counting Bloom filter as far as possible, effectively improves the performance of the web crawler, enables it to obtain the information people need quickly and accurately, and improves the user experience as far as possible.
In addition, during deduplication, the URL links are compressed based on a compression algorithm such as the MD5 algorithm, which reduces the storage space occupied as far as possible.
It should be noted that the workflow described above is only illustrative and does not limit the protection scope of the present invention. In practical applications, those skilled in the art may select some or all of it as needed to achieve the purpose of the solution of this embodiment, and no restriction is imposed here.
In addition, for technical details not described in detail in this embodiment, reference may be made to the link deduplication method provided by any embodiment of the present invention, and details are not repeated here.
In addition, it should be noted that, herein, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes the element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory (ROM)/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A link extraction method based on a web crawler, characterized in that the method comprises the following steps:
upon receiving a data crawling request for an agricultural product to be analyzed, extracting from the data crawling request a first uniform resource locator (URL) link of a platform to be accessed and agricultural product subject information related to the agricultural product to be analyzed;
sending an access request to the platform to be accessed according to the first URL link;
after receiving a response made by the platform to be accessed according to the access request, crawling the data information in the page corresponding to the first URL link;
parsing the data information to obtain the second URL links embedded in the page, and adding the second URL links to a URL queue to be crawled;
processing the first URL link and the second URL links in the URL queue to be crawled based on an anchor multi-attribute integration method based on path aggregation, to obtain the multi-attribute subject information in rich text format corresponding to each second URL link;
comparing the multi-attribute subject information corresponding to each second URL link in the URL queue to be crawled with the agricultural product subject information, and extracting the second URL links whose multi-attribute subject information has a similarity to the agricultural product subject information that meets a preset threshold.
2. The method according to claim 1, characterized in that the step of parsing the data information to obtain the second URL links embedded in the page and adding the second URL links to the URL queue to be crawled comprises:
parsing the data information to obtain the second URL links embedded in the page;
parsing the second URL links to obtain the standardized tags corresponding to the second URL links;
generating the abstract tree corresponding to the second URL links according to the standardized tags;
matching the node content of the abstract tree against the agricultural product subject information based on a DOM tree matching method, and removing unmatched node content to obtain the second URL links that match the agricultural product subject information;
adding the second URL links that match the agricultural product subject information to the URL queue to be crawled.
3. The method according to claim 2, characterized in that the step of processing the first URL link and the second URL links in the URL queue to be crawled based on the anchor multi-attribute integration method based on path aggregation, to obtain the multi-attribute subject information in rich text format corresponding to each second URL link, comprises:
generating the path access directed graph corresponding to the agricultural product to be analyzed according to the first URL link and the second URL links in the URL queue to be crawled;
determining the shortest access paths in the path access directed graph based on the anchor multi-attribute integration method based on path aggregation, to obtain a shortest access path set;
determining the anchor text corresponding to each shortest access path in the shortest access path set to obtain the access path anchor text set corresponding to the shortest access path set, and assigning a weight to each element in the access path anchor text set;
standardizing the weight corresponding to each element in the access path anchor text set according to a preset weight standardization formula;
sorting the standardized weights in descending order to obtain the multi-attribute subject information in rich text format corresponding to the second URL link.
4. The method according to claim 3, characterized in that the step of comparing the multi-attribute subject information corresponding to each second URL link in the URL queue to be crawled with the agricultural product subject information, and extracting the second URL links whose multi-attribute subject information has a similarity to the agricultural product subject information that meets the preset threshold, comprises:
extracting a multi-attribute theme feature word from the multi-attribute subject information and hashing the multi-attribute theme feature word to obtain a first hash value, the multi-attribute theme feature word being an element of the access path anchor text set corresponding to the multi-attribute subject information;
obtaining the weight corresponding to the multi-attribute theme feature word from the access path anchor text set, and quantizing the first hash value into a first vector in combination with the weight;
extracting an agricultural product theme feature word from the agricultural product subject information, and hashing the agricultural product theme feature word to obtain a second hash value;
quantizing the second hash value into a second vector according to the weight preset for the agricultural product theme feature word;
comparing the first vector with the second vector, and extracting the second URL links whose multi-attribute subject information has a similarity to the agricultural product subject information that meets the preset threshold.
5. The method according to any one of claims 1 to 4, characterized in that, before the step of performing feature extraction on the second URL links in the URL queue to be crawled based on the anchor multi-attribute integration method based on path aggregation, the method further comprises:
performing joint deduplication on the second URL links in the URL queue to be crawled using a counting Bloom filter based on link features in combination with multiple hashing, so that any two second URL links in the URL queue to be crawled are different from each other.
6. The method according to claim 5, characterized in that, before the step of performing joint deduplication on the second URL links in the URL queue to be crawled using the counting Bloom filter based on link features in combination with multiple hashing, the method further comprises:
traversing the URL queue to be crawled, performing feature analysis on the current second URL link being traversed, and extracting the protocol type part, the path part, and the query part of the current second URL link;
obtaining the global feature URL link corresponding to the current second URL link according to the protocol type part, the path part, and the query part;
establishing the correspondence between the current second URL link and the global feature URL link, and updating the correspondence into the URL queue to be crawled.
7. The method according to claim 6, characterized in that the step of performing joint deduplication on the second URL links in the URL queue to be crawled using the counting Bloom filter based on link features in combination with multiple hashing comprises:
traversing the URL queue to be crawled and obtaining the global feature URL link corresponding to the current second URL link being traversed;
performing a whole-link duplicate check on the global feature URL link using the counting Bloom filter based on link features, to obtain the duplicate check identifier corresponding to the global feature URL link;
performing feature recognition on the global feature URL link according to the duplicate check identifier to obtain multiple feature fragments;
recombining the multiple feature fragments according to a preset URL link recombination rule to obtain N recombined URL link fragments, N being an integer greater than or equal to 1;
performing a multiple-hash duplicate check on the N recombined URL link fragments to obtain the duplicate check result corresponding to the current second URL link;
retaining or discarding the second URL link in the URL queue to be crawled according to the duplicate check result.
8. A link extraction apparatus based on a web crawler, characterized in that the apparatus comprises:
an extraction module, configured to, upon receiving a data crawling request for an agricultural product to be analyzed, extract from the data crawling request a first uniform resource locator (URL) link of a platform to be accessed and subject information related to the agricultural product to be analyzed;
a sending module, configured to send an access request to the platform to be accessed according to the first URL link;
a crawling module, configured to crawl the data information in the page corresponding to the first URL link after receiving a response made by the platform to be accessed according to the access request;
a parsing module, configured to parse the data information to obtain the second URL links embedded in the page, and to add the second URL links to a URL queue to be crawled;
a processing module, configured to process the first URL link and the second URL links in the URL queue to be crawled based on an anchor multi-attribute integration method based on path aggregation, to obtain the multi-attribute subject information in rich text format corresponding to each second URL link;
an extraction module, configured to compare the multi-attribute subject information corresponding to each second URL link in the URL queue to be crawled with the agricultural product subject information, and to extract the second URL links whose multi-attribute subject information has a similarity to the agricultural product subject information that meets a preset threshold.
9. A link extraction device based on a web crawler, characterized in that the device comprises: a memory, a processor, and a link extraction program based on a web crawler that is stored in the memory and executable on the processor, the link extraction program based on a web crawler being configured to implement the steps of the link extraction method based on a web crawler according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a link extraction program based on a web crawler is stored on the computer-readable storage medium, and when the link extraction program based on a web crawler is executed by a processor, the steps of the link extraction method based on a web crawler according to any one of claims 1 to 7 are implemented.
CN201910670515.5A 2019-07-23 2019-07-23 Link extraction method, device, equipment and storage medium based on web crawler Active CN110413861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910670515.5A CN110413861B (en) 2019-07-23 2019-07-23 Link extraction method, device, equipment and storage medium based on web crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910670515.5A CN110413861B (en) 2019-07-23 2019-07-23 Link extraction method, device, equipment and storage medium based on web crawler

Publications (2)

Publication Number Publication Date
CN110413861A true CN110413861A (en) 2019-11-05
CN110413861B CN110413861B (en) 2021-10-22

Family

ID=68362839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910670515.5A Active CN110413861B (en) 2019-07-23 2019-07-23 Link extraction method, device, equipment and storage medium based on web crawler

Country Status (1)

Country Link
CN (1) CN110413861B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753162A (en) * 2020-06-29 2020-10-09 平安国际智慧城市科技股份有限公司 Data crawling method, device, server and storage medium
CN113065051A (en) * 2021-04-02 2021-07-02 西南石油大学 Visual agricultural big data analysis interactive system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030052918A1 (en) * 2001-09-20 2003-03-20 International Business Machines Corporation Method and apparatus for allowing one bookmark to replace another
CN106202467A (en) * 2016-07-18 2016-12-07 浪潮集团有限公司 Peer-to-peer network-oriented web crawler method capable of defining search key points
CN108334591A (en) * 2018-01-30 2018-07-27 天津中科智能识别产业技术研究院有限公司 Industry analysis method and system based on focused crawler technology
CN108628871A (en) * 2017-03-16 2018-10-09 哈尔滨英赛克信息技术有限公司 A kind of link De-weight method based on chain feature
CN109561163A (en) * 2017-09-27 2019-04-02 阿里巴巴集团控股有限公司 The generation method and device of uniform resource locator rewriting rule

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王琦等: ""基于DOM的网页主题信息自动提取"", 《计算机研究与发展》 *
马艳红等: ""基于链接路径搜索的URL属性集成方法"", 《计算机工程》 *

Also Published As

Publication number Publication date
CN110413861B (en) 2021-10-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20191105

Assignee: Xiangyang Goode Cultural Technology Co.,Ltd.

Assignor: SOUTH CENTRAL University FOR NATIONALITIES

Contract record no.: X2023980041350

Denomination of invention: Link extraction method, device, equipment and storage medium based on web crawler

Granted publication date: 20211022

License type: Common License

Record date: 20230908

Application publication date: 20191105

Assignee: Hubei Fengyun Technology Co.,Ltd.

Assignor: SOUTH CENTRAL University FOR NATIONALITIES

Contract record no.: X2023980041308

Denomination of invention: Link extraction method, device, equipment and storage medium based on web crawler

Granted publication date: 20211022

License type: Common License

Record date: 20230908