CN110399546A - Web-crawler-based link deduplication method, device, equipment and storage medium - Google Patents
- Publication number
- CN110399546A CN110399546A CN201910670803.0A CN201910670803A CN110399546A CN 110399546 A CN110399546 A CN 110399546A CN 201910670803 A CN201910670803 A CN 201910670803A CN 110399546 A CN110399546 A CN 110399546A
- Authority
- CN
- China
- Prior art keywords
- url
- link
- url link
- crawled
- queue
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention relates to the field of Internet technology and discloses a web-crawler-based link deduplication method, device, equipment and storage medium. The method comprises: upon receiving a data-grabbing request for agricultural products to be analyzed, extracting a first uniform resource locator (URL) link of a platform to be visited from the data-grabbing request; sending an access request to the platform to be visited according to the first URL link; after receiving the response made by the platform to the access request, grabbing the data information in the page corresponding to the first URL link; parsing the data information to obtain the second URL links embedded in the page, and adding the second URL links to a to-be-crawled URL queue; and performing joint deduplication on the second URL links in the to-be-crawled URL queue using a counting Bloom filter based on link features, combined with multiple hashing. By optimizing the link deduplication scheme, the present invention improves the performance of the web crawler, ensuring that it can quickly obtain the information people need and improving user experience.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a web-crawler-based link deduplication method, device, equipment and storage medium.
Background technique
When crawling web pages, a web crawler inevitably encounters repeated downloads of the same page. To prevent the efficiency loss and wasted server resources caused by repeated crawling, uniform resource locators (Uniform Resource Locator, URL) must be filtered for duplicates. Common link deduplication approaches currently include: link compression deduplication based on the MD5 message digest algorithm (message-digest algorithm 5, MD5), storage deduplication based on hash algorithms, and link deduplication based on Bloom filters.
Although MD5-based link compression deduplication alleviates the large storage footprint of raw uniform resource locators (Uniform Resource Locator, URL), memory usage still grows as the number of URLs increases, and even with a low collision probability, collisions reduce the accuracy of duplicate checking, which can seriously affect the performance of the web crawler.
Although hash-based storage deduplication offers fast duplicate checking and high accuracy, it requires a well-designed hash function and a hash table to maintain. Moreover, memory consumption grows excessively as the scale of crawled pages increases, which likewise seriously affects the performance of the web crawler.
Bloom-filter-based link deduplication solves the space-complexity problem, but it has a certain false-positive rate and cannot delete existing elements. The more elements it holds, the higher the false-positive rate, which again seriously affects the performance of the web crawler.
Therefore, a web-crawler-based link deduplication approach is urgently needed to improve the performance of the web crawler, so that it can quickly obtain the information people need and thereby improve user experience.
The above content is only intended to assist understanding of the technical solution of the present invention, and does not constitute an admission that it is prior art.
Summary of the invention
The main purpose of the present invention is to provide a web-crawler-based link deduplication method, device, equipment and storage medium, intended to improve the performance of the web crawler by optimizing the link deduplication scheme, ensuring that the crawler can quickly obtain the information people need and improving user experience.
To achieve the above object, the present invention provides a web-crawler-based link deduplication method comprising the following steps:
upon receiving a data-grabbing request for agricultural products to be analyzed, extracting a first uniform resource locator (URL) link of a platform to be visited from the data-grabbing request;
sending an access request to the platform to be visited according to the first URL link;
after receiving the response made by the platform to be visited to the access request, grabbing the data information in the page corresponding to the first URL link;
parsing the data information to obtain the second URL links embedded in the page, and adding the second URL links to a to-be-crawled URL queue;
and performing joint deduplication on the second URL links in the to-be-crawled URL queue using a counting Bloom filter based on link features, combined with multiple hashing.
Preferably, before the step of performing joint deduplication on the second URL links in the to-be-crawled URL queue using the counting Bloom filter based on link features, combined with multiple hashing, the method further comprises:
traversing the to-be-crawled URL queue, performing feature analysis on the currently traversed second URL link, and extracting the protocol type part, path part and query part of the current second URL link;
obtaining, from the protocol type part, the path part and the query part, the whole-feature URL link corresponding to the current second URL link;
and establishing the correspondence between the current second URL link and the whole-feature URL link, and updating the correspondence into the to-be-crawled URL queue.
Preferably, the step of performing joint deduplication on the second URL links in the to-be-crawled URL queue using the counting Bloom filter based on link features, combined with multiple hashing, comprises:
traversing the to-be-crawled URL queue and obtaining the whole-feature URL link corresponding to the currently traversed second URL link;
performing whole-link duplicate checking on the whole-feature URL link using the counting Bloom filter based on link features, and obtaining the duplicate-check identifier corresponding to the whole-feature URL link;
performing feature recognition on the whole-feature URL link according to the duplicate-check identifier, and obtaining multiple feature fragments;
recombining the multiple feature fragments according to a preset URL link recombination rule to obtain N recombined URL link fragments, where N is an integer greater than or equal to 1;
performing multiple-hash duplicate checking on the N recombined URL link fragments to obtain the duplicate-check result corresponding to the current second URL link;
and retaining or discarding the second URL link in the to-be-crawled URL queue according to the duplicate-check result.
Preferably, after the step of recombining the multiple feature fragments according to the preset URL link recombination rule to obtain N recombined URL link fragments, the method further comprises:
compressing, based on the MD5 algorithm, each of the obtained N recombined URL link fragments to obtain the string ciphertext corresponding to each recombined URL link fragment;
and replacing the content of each recombined URL link fragment with its string ciphertext.
Preferably, the step of performing multiple-hash duplicate checking on the N recombined URL link fragments to obtain the duplicate-check result corresponding to the current second URL link comprises:
extracting the string ciphertexts corresponding to the N recombined URL link fragments, selecting any one of the N string ciphertexts and hashing it K times to obtain K hash values, where K is an integer greater than or equal to 2;
hashing the K hash values into a pre-constructed bit-vector space as reference hash values, and setting an initial count value for the variable space counter corresponding to each reference hash value;
hashing each of the remaining N-1 string ciphertexts K times to obtain the K hash values corresponding to each remaining string ciphertext;
randomly hashing the K hash values corresponding to each remaining string ciphertext into the bit-vector space, adjacent to any reference hash value;
using head insertion, inserting one preset character before the initial count value of the adjacent reference hash value for each hash value newly hashed into the bit-vector space;
and counting the number of preset characters before the initial value corresponding to each reference hash value, and determining the duplicate-check result corresponding to the current second URL link according to the number of preset characters.
Preferably, after the step of performing joint deduplication on the second URL links in the to-be-crawled URL queue using the counting Bloom filter based on link features, combined with multiple hashing, the method further comprises:
compressing, based on the MD5 algorithm, each second URL link in the deduplicated to-be-crawled URL queue to obtain the string ciphertext corresponding to each second URL link;
and replacing the content of each second URL link with its string ciphertext.
Preferably, after the step of performing joint deduplication on the second URL links in the to-be-crawled URL queue using the counting Bloom filter based on link features, combined with multiple hashing, the method further comprises:
judging whether any already-visited second URL links exist in the deduplicated to-be-crawled URL queue;
and if already-visited second URL links exist in the to-be-crawled URL queue, deleting the already-visited second URL links from the to-be-crawled URL queue.
In addition, to achieve the above object, the present invention also proposes a web-crawler-based link deduplication device comprising:
an extraction module, configured to, upon receiving a data-grabbing request for agricultural products to be analyzed, extract a first uniform resource locator (URL) link of a platform to be visited from the data-grabbing request;
a sending module, configured to send an access request to the platform to be visited according to the first URL link;
a grabbing module, configured to, after receiving the response made by the platform to be visited to the access request, grab the data information in the page corresponding to the first URL link;
a parsing module, configured to parse the data information, obtain the second URL links embedded in the page, and add the second URL links to a to-be-crawled URL queue;
and a deduplication module, configured to perform joint deduplication on the second URL links in the to-be-crawled URL queue using a counting Bloom filter based on link features, combined with multiple hashing.
In addition, to achieve the above object, the present invention also proposes web-crawler-based link deduplication equipment comprising: a memory, a processor, and a web-crawler-based link deduplication program stored on the memory and runnable on the processor, the program being configured to carry out the steps of the web-crawler-based link deduplication method described above.
In addition, to achieve the above object, the present invention also proposes a computer-readable storage medium on which a web-crawler-based link deduplication program is stored, the program, when executed by a processor, implementing the steps of the web-crawler-based link deduplication method described above.
The web-crawler-based link deduplication scheme provided by the present invention performs joint deduplication on the second URL links cached in the to-be-crawled URL queue using a counting Bloom filter based on link features, combined with multiple hashing, reducing the false-positive rate of the counting Bloom filter as much as possible and significantly improving the performance of the web crawler, thereby ensuring that the crawler can quickly obtain the information people need and improving user experience.
Detailed description of the invention
Fig. 1 is a structural schematic diagram of the web-crawler-based link deduplication equipment for the hardware running environment involved in the embodiments of the present invention;
Fig. 2 is a flow diagram of the first embodiment of the web-crawler-based link deduplication method of the present invention;
Fig. 3 is a flow diagram of the second embodiment of the web-crawler-based link deduplication method of the present invention;
Fig. 4 is a structural block diagram of the first embodiment of the web-crawler-based link deduplication device of the present invention.
The realization of the objects, functional characteristics and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
It should be appreciated that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.
Referring to Fig. 1, Fig. 1 is a structural schematic diagram of the web-crawler-based link deduplication equipment for the hardware running environment involved in the embodiments of the present invention.
As shown in Fig. 1, the web-crawler-based link deduplication equipment may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU); a communication bus 1002; a user interface 1003; a network interface 1004; and a memory 1005. The communication bus 1002 realizes the connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a Wireless Fidelity (WIreless-FIdelity, WI-FI) interface). The memory 1005 may be high-speed random access memory (Random Access Memory, RAM) or stable non-volatile memory (Non-Volatile Memory, NVM), such as disk storage; the memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.
Those skilled in the art will understand that the structure shown in Fig. 1 does not constitute a limitation on the web-crawler-based link deduplication equipment, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
As shown in Fig. 1, the memory 1005, as a kind of storage medium, may include an operating system, a network communication module, a user interface module and a web-crawler-based link deduplication program.
In the web-crawler-based link deduplication equipment shown in Fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 of the present invention may be set in the web-crawler-based link deduplication equipment, which calls, through the processor 1001, the web-crawler-based link deduplication program stored in the memory 1005 and executes the web-crawler-based link deduplication method provided by the embodiments of the present invention.
An embodiment of the present invention provides a web-crawler-based link deduplication method. Referring to Fig. 2, Fig. 2 is a flow diagram of the first embodiment of the web-crawler-based link deduplication method of the present invention.
In this embodiment, the web-crawler-based link deduplication method comprises the following steps:
Step S10: upon receiving a data-grabbing request for agricultural products to be analyzed, extract a first uniform resource locator (URL) link of a platform to be visited from the data-grabbing request.
Specifically, the executing subject of this embodiment is any terminal device on which a web crawler system is deployed or installed. It should be noted that, in order to improve the speed of operations such as grabbing and parsing the data corresponding to the agricultural products to be analyzed, the web crawler system in this embodiment is preferably a distributed web crawler system.
However, it should be understood that, in practical applications, the terminal device may be a client device or a server device, without restriction here.
In addition, in practical applications, the platform to be visited may be an online store that displays the agricultural products to be analyzed.
Correspondingly, the uniform resource locator (Uniform Resource Locator, URL) is the network address needed to access that online store.
However, it should be understood that "agricultural products to be analyzed" is a general term for the various agricultural products common today; in practical applications, the agricultural products to be analyzed may be tea products, fruits and vegetables, cereal products and so on, which will not be enumerated here, and no restriction is placed on this.
Step S20: send an access request to the platform to be visited according to the first URL link.
Specifically, in practical applications, the web crawler may send the access request to the platform to be visited (essentially a server) using the HyperText Transfer Protocol (HTTP), which transmits data over the Transmission Control Protocol/Internet Protocol (TCP/IP).
It should be understood that the above is only one specific implementation of sending an access request to the platform to be visited and does not constitute any restriction on the technical solution of the present invention; in practical applications, those skilled in the art may configure it as needed, without restriction here.
Step S30: after receiving the response made by the platform to be visited to the access request, grab the data information in the page corresponding to the first URL link.
It should be understood that, in practical applications, if the access request sent to the platform to be visited succeeds, and the platform successfully verifies the first URL link carried in the access request, the platform makes a successful response and feeds back the data information in the page corresponding to the first URL link. At this point, the web crawler can grab the data information fed back by the platform for the page corresponding to the first URL link.
Step S40: parse the data information, obtain the second URL links embedded in the page, and add the second URL links to a to-be-crawled URL queue.
It should be understood that, in practical applications, in addition to displaying data information related to the agricultural products to be analyzed, the page corresponding to the first URL link may also display multiple URL links relevant to that data information; for ease of distinction, these are referred to here as second URL links.
For example, the page corresponding to the first URL link displays the homepage of an online store that includes the agricultural products to be analyzed. The homepage mainly displays information on four broad categories of agricultural products, such as product A, product B, product C and product D, and each broad category corresponds in turn to a second URL link whose page mainly displays the sub-products that the corresponding category includes.
For instance, the page corresponding to product A's second URL link mainly displays products A-1, A-2 and A-3; the page corresponding to product B's second URL link mainly displays products B-1 and B-2; the page corresponding to product C's second URL link mainly displays products C-1, C-2, C-3 and C-4; and the page corresponding to product D's second URL link mainly displays products D-1 and D-2.
It should be understood that the above is only an example and does not constitute any restriction on the technical solution of the present invention; in practical applications, those skilled in the art may configure it as needed, without restriction here.
In addition, in this embodiment, the second URL links embedded in the page are added to a to-be-crawled URL queue because, in practical applications, the web crawler crawls a large amount of data and therefore parses a relatively large number of second URL links. Crawling and parsing these second URL links consumes considerable time, so a large number of second URL links cannot be visited within a short time; the second URL links obtained each time therefore need to be added to the to-be-crawled URL queue.
" second " in addition, " first " in above-mentioned described " the first URL link ", and in " the second URL link " is only
It is only for distinguishing the URL link embedded in the corresponding URL link of the platform to be visited page corresponding with the URL link, not
URL link itself is caused to limit.In practical applications, any one " second URL link " is relative in its corresponding page
Embedded URL link can be regarded as one " the first URL link ".
Step S50: perform joint deduplication on the second URL links in the to-be-crawled URL queue using a counting Bloom filter based on link features, combined with multiple hashing.
Specifically, the joint deduplication performed on the second URL links in the to-be-crawled URL queue using the counting Bloom filter based on link features, combined with multiple hashing, is broadly divided into deduplication of the whole-feature URL link corresponding to each URL link and deduplication of URL link fragments.
Since the URL link fragments are obtained from the whole-feature URL link, in order to ensure that the joint deduplication operation proceeds smoothly, the correspondence between each second URL link and its whole-feature URL link must first be determined. For ease of understanding, this embodiment provides a specific implementation for determining that correspondence, roughly as follows:
(1) Traverse the to-be-crawled URL queue, perform feature analysis on the currently traversed second URL link, and extract the protocol type part, path part and query part of the current second URL link.
Specifically, a URL link uniquely identifies a resource on the network. Generally, a URL link comprises the following five components: the protocol type part (usually denoted Protocol), the server address part (usually denoted Host), the port number part (usually denoted Port), the path part (usually denoted Path) and the query part (usually denoted Query). Among these, the protocol type part, the path part and the query part can usually embody the features of a URL link.
Thus, this embodiment traverses the to-be-crawled URL queue, performs feature analysis on the currently traversed second URL link, and extracts the protocol type part (denoted p1 in the following description), path part (denoted p2 in the following description) and query part (denoted p3 in the following description) of the current second URL link.
(2) Obtain, from the protocol type part, the path part and the query part, the whole-feature URL link corresponding to the current second URL link.
Specifically, since the three parts p1, p2 and p3 embody the whole features of the current second URL link, combining p1, p2 and p3 yields the whole-feature URL link corresponding to the current second URL link; in the following description, p1p2p3 denotes the whole-feature URL link corresponding to each second URL link.
(3) Establish the correspondence between the current second URL link and the whole-feature URL link, and update the correspondence into the to-be-crawled URL queue.
Specifically, the reason this embodiment establishes the correspondence between the current second URL link and the whole-feature URL link, and updates the correspondence into the to-be-crawled URL queue, is so that, during subsequent deduplication of the second URL links, the whole-feature URL link corresponding to the current second URL link can be quickly found through the correspondence, and the URL link fragments corresponding to the current second URL link can then be obtained from the whole-feature URL link.
In addition, in practical applications, the correspondence may also be stored separately rather than updated into the to-be-crawled URL queue. When performing joint deduplication on the second URL links in the queue, the whole-feature URL link corresponding to the currently traversed second URL link is simply looked up in the separately stored mapping table.
It should be understood that the above is only an example and does not constitute any restriction on the technical solution of the present invention; in practical applications, those skilled in the art may configure it as needed, without restriction here.
Further, after the above correspondence and the whole-feature URL link corresponding to each second URL link are obtained, the operation of performing joint deduplication on the second URL links in the to-be-crawled URL queue using the counting Bloom filter based on link features, combined with multiple hashing, may specifically be as follows:
(1) Traverse the to-be-crawled URL queue and obtain the whole-feature URL link corresponding to the currently traversed second URL link.
Specifically, the whole-feature URL link corresponding to the currently traversed second URL link is obtained according to the correspondence described above.
(2) Perform whole-link duplicate checking on the whole-feature URL link using the counting Bloom filter based on link features, and obtain the duplicate-check identifier corresponding to the whole-feature URL link.
Specifically, the counting Bloom filter employed in this embodiment is not the ordinary counting Bloom filter currently used for link deduplication, but a counting Bloom filter based on the link features of URL links. That is, when deduplicating links, the counting Bloom filter of this embodiment performs feature recognition on the whole-feature URL link corresponding to each second URL link in the to-be-crawled URL queue, and then performs whole-link duplicate checking according to the recognized features, i.e., it compares the features of each second URL link, thereby realizing whole-link deduplication. Furthermore, to facilitate the subsequent recognition of the URL link fragments recombined from the feature fragments, a corresponding duplicate-check identifier may also be assigned to the whole-feature URL link.
(3) Perform feature recognition on the whole-feature URL link according to the duplicate-check identifier, and obtain multiple feature fragments.
It specifically, with global feature URL link is still p1p2p3For, by being carried out to the global feature URL link
After feature identification, obtained multiple characteristic fragments, which specifically can be, respectively includes protocol type part, path sections and asking portion
The segment divided, i.e., to characteristic fragment p1, characteristic fragment p2With characteristic fragment p3。
(4) Recombine the multiple feature fragments according to a preset URL link recombination rule to obtain N recombined URL link segments.
It should be understood that since a whole-feature URL link is composed of three parts (the protocol-type part, the path part, and the query part), at least 1 recombined URL link segment can be obtained; therefore, in this embodiment, N is an integer greater than or equal to 1.
In addition, in practical applications, the URL link recombination rule may be set by those skilled in the art as needed. For example, the rule may require that every recombined URL link segment include feature fragment p1, or that no recombined URL link segment include feature fragment p3; the possibilities are not enumerated here, and no restriction is imposed on this.
Correspondingly, if the URL link recombination rule requires that every recombined URL link segment include feature fragment p1, the resulting recombined URL link segments generally comprise the segment containing only feature fragment p1, the segment containing feature fragments p1 and p2, and the segment containing feature fragments p1 and p3.
If the URL link recombination rule requires that no recombined URL link segment include feature fragment p3, the resulting recombined URL link segments generally comprise the segment containing only feature fragment p1 and the segment containing feature fragments p1 and p2.
It should be understood that the above is merely an example and does not constitute any restriction on the technical solution of the present invention; in practical applications, those skilled in the art may configure it according to actual needs, and no restriction is imposed here.
(5) Perform a multiple-hash duplicate check on the N recombined URL link segments to obtain the duplicate-check result corresponding to the current second URL link.
It is worth noting that, in practical applications, a large number of second URL links may be cached in the URL queue to be crawled, so the number of URL link segments obtained after recombination can be even larger. Therefore, in this embodiment, in order to minimize the storage space occupied by the second URL links cached in the URL queue to be crawled, after the multiple feature fragments are recombined according to the preset URL link recombination rule to obtain the N recombined URL link segments, the N recombined URL link segments may first be compressed respectively based on the MD5 algorithm to obtain the string ciphertext corresponding to each recombined URL link segment, and the string ciphertext then replaces the content of the corresponding recombined URL link segment.
It should be understood that the above is merely one specific compression method and does not constitute any restriction on the technical solution of the present invention; in practical applications, those skilled in the art may choose a suitable compression method according to actual needs, and no restriction is imposed here.
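The MD5 compression step above can be sketched directly with the standard library. The function name is illustrative; any fixed-length digest would serve the same space-saving purpose.

```python
# Illustrative sketch: replace each recombined URL link segment by its
# MD5 string ciphertext, so that every segment occupies a fixed 32
# hexadecimal characters regardless of its original length.
import hashlib

def compress_segment(segment: str) -> str:
    """Return the MD5 string ciphertext of one recombined URL link segment."""
    return hashlib.md5(segment.encode("utf-8")).hexdigest()

segments = ["https", "https|example.com/list", "https|page=2"]
ciphertexts = [compress_segment(s) for s in segments]
```

The digest is deterministic, so identical segments always compress to identical ciphertexts, which is what makes the subsequent hash-based duplicate check on the ciphertexts sound.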
Correspondingly, the above-described operation of performing a multiple-hash duplicate check on the N recombined URL link segments to obtain the duplicate-check result corresponding to the current second URL link is specifically as follows:
(5-1) Extract the string ciphertexts corresponding to the N recombined URL link segments, select any one of the N string ciphertexts, and perform K hash operations on it to obtain K hash values.
It should be understood that since the link deduplication scheme based on a web crawler provided in this embodiment combines multiple hashing when jointly deduplicating links, at least 2 hash operations must be performed on each string ciphertext; therefore, the K described above is an integer greater than or equal to 2.
(5-2) Hash the K hash values into a pre-built bit-vector space as reference hash values, and set an initial count value on the variable-space counter corresponding to each reference hash value.
Specifically, in this embodiment, the initial count value shown on the variable-space counter corresponding to each reference hash value is denoted by "0".
(5-3) Perform K hash operations on each of the remaining N-1 string ciphertexts to obtain the K hash values corresponding to each remaining string ciphertext.
(5-4) Randomly hash the K hash values corresponding to each remaining string ciphertext into the bit-vector space, adjacent to one of the reference hash values.
Specifically, to make it convenient to determine which reference hash value a newly hashed value in the bit-vector space is actually adjacent to, a criterion may be preset. For example, when a new hash value is inserted between two neighboring reference hash values, the reference hash value closest to the newly inserted hash value may be taken as the adjacent reference hash value.
It should be understood that the above is merely an example and does not constitute any restriction on the technical solution of the present invention; in practical applications, those skilled in the art may configure it according to actual needs, and no restriction is imposed here.
(5-5) Using the head-insertion method, insert one preset character before the initial count value corresponding to the adjacent reference hash value for each hash value newly hashed into the bit-vector space.
Specifically, in this embodiment, the preset character is denoted by "1".
For example, for one reference hash value, the initial count value shown on the corresponding variable-space counter is "0". When one new hash value is hashed to an adjacent position, one preset character "1" is inserted before the "0" using the head-insertion method, and the count value shown on the variable-space counter becomes "10".
Correspondingly, if two new hash values are hashed to positions adjacent to this reference hash value, two preset characters "1" are inserted before the "0" using the head-insertion method, and the count value shown on the variable-space counter becomes "110".
(5-6) Count the number of preset characters before the initial count value corresponding to each reference hash value, and determine the duplicate-check result corresponding to the current second URL link according to the number of preset characters.
Specifically, the duplicate-check result may be determined as follows:
If the number of preset characters "1" before the initial count value "0" is greater than 1, the recombined URL segment is determined to be a repeat and needs to be discarded;
otherwise, the recombined URL segment is determined not to be a repeat and can be retained.
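A loose Python sketch of steps (5-1) through (5-6) follows. It simplifies the described scheme: instead of literally tracking adjacency between reference hash values, each string ciphertext is mapped to K positions in the bit-vector space, and a segment is judged a repeat when every one of its positions already carries at least one preset character "1". The space size, the salted-MD5 hash family, and K are all assumptions for the example.

```python
# Illustrative, simplified sketch of the head-insertion duplicate check.
# M, K, and the hash construction are assumptions, not values from the patent.
import hashlib

M = 64   # size of the pre-built bit-vector space (assumed)
K = 3    # number of hash operations per string ciphertext, K >= 2

def k_hashes(ciphertext: str, k: int = K, m: int = M):
    """K salted-MD5 positions in the bit-vector space for one ciphertext."""
    return [int(hashlib.md5(f"{i}:{ciphertext}".encode()).hexdigest(), 16) % m
            for i in range(k)]

# Each position holds a variable-space counter string; initial count value "0".
space = ["0"] * M

def check_and_insert(ciphertext: str) -> bool:
    """Return True if the segment is judged a repeat, then record it by
    head-inserting one preset character '1' at each hashed position."""
    positions = k_hashes(ciphertext)
    repeat = all(space[p].startswith("1") for p in positions)
    for p in positions:
        space[p] = "1" + space[p]   # head-insertion before the initial "0"
    return repeat
```

As with any Bloom-filter variant, false positives remain possible (distinct ciphertexts can share all K positions), which is why the text stresses minimizing the false-positive rate rather than eliminating it.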
(6) Retain or discard the second URL links in the URL queue to be crawled according to the duplicate-check result.
It should be understood that the above is merely one specific implementation of joint deduplication and does not constitute any restriction on the technical solution of the present invention; in practical applications, those skilled in the art may adjust it reasonably as needed, and no restriction is imposed here.
In addition, in practical applications, in order to further reduce the storage space occupied, after the counting Bloom filter based on link features is used, in combination with multiple hashing, to jointly deduplicate the second URL links in the URL queue to be crawled, each second URL link remaining in the URL queue to be crawled after deduplication may also be compressed based on the MD5 algorithm to obtain the string ciphertext corresponding to each second URL link; finally, the string ciphertext replaces the content of the corresponding second URL link, thereby compressing the second URL links in the URL queue to be crawled and minimizing the storage space they occupy.
It is not difficult to see from the foregoing description that the link deduplication method based on a web crawler provided in this embodiment uses a counting Bloom filter based on link features, combined with multiple hashing, to perform joint whole-link and partial deduplication on the second URL links cached in the URL queue to be crawled, thereby minimizing the false-positive rate of the counting Bloom filter, effectively improving the performance of the web crawler, enabling the web crawler to quickly obtain the information people need, and improving the user experience as much as possible.
In addition, during deduplication, URL links are compressed based on a compression algorithm such as the MD5 algorithm, thereby minimizing the storage space occupied.
Referring to Fig. 3, Fig. 3 is a schematic flowchart of a second embodiment of the link deduplication method based on a web crawler of the present invention.
Based on the first embodiment described above, after step S50, the link deduplication method based on a web crawler of this embodiment further includes:
Step S60: judge whether a visited second URL link exists in the URL queue to be crawled after deduplication.
Specifically, if it is determined by the judgment that a visited second URL link exists in the URL queue to be crawled after deduplication, i.e. the web crawler has accessed the page corresponding to that second URL link according to the second URL link and grabbed the data information in the page, then in order to prevent the web crawler from accessing that second URL link again, repeatedly grabbing identical data, and wasting crawler resources, the operation of step S70 needs to be executed; otherwise, step S60 continues to be executed.
Step S70: delete the visited second URL link from the URL queue to be crawled.
Specifically, in practical applications, the delete operation may be executed only when it is detected that the second URL link is about to be accessed again; alternatively, each visited second URL link may first be marked, and when the number of marked visited second URL links reaches a predetermined quantity, or a predetermined deletion time arrives, all currently marked second URL links are deleted together.
It should be understood that the above is merely an example and does not constitute any restriction on the technical solution of the present invention; in practical applications, those skilled in the art may configure it as needed, and no restriction is imposed here.
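The mark-then-batch-delete variant of step S70 can be sketched as follows. The class, its method names, and the batch threshold are illustrative assumptions; the patent only describes the behavior.

```python
# Illustrative sketch of steps S60/S70: visited second URL links are first
# marked, then deleted together once a predetermined quantity is reached.
# The threshold value is an assumption for the example.
class CrawlQueue:
    def __init__(self, urls, batch_size=2):
        self.pending = list(urls)        # URL queue to be crawled
        self.visited_marks = set()       # marked visited second URL links
        self.batch_size = batch_size     # predetermined quantity

    def mark_visited(self, url):
        if url in self.pending:
            self.visited_marks.add(url)
        if len(self.visited_marks) >= self.batch_size:
            self.flush()                 # delete all marked links together

    def flush(self):
        self.pending = [u for u in self.pending if u not in self.visited_marks]
        self.visited_marks.clear()

q = CrawlQueue(["u1", "u2", "u3"], batch_size=2)
q.mark_visited("u1")   # marked, not yet deleted
q.mark_visited("u2")   # batch reached: both deleted together
```

A time-based trigger (the "predetermined deletion time" in the text) could call `flush` from a timer instead; the batching idea is the same.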
It is not difficult to see from the foregoing description that the link deduplication method based on a web crawler provided in this embodiment periodically or in real time detects the access status of the second URL links in the URL queue to be crawled and, upon detecting a visited second URL link in the URL queue to be crawled, deletes the visited second URL link from the URL queue to be crawled. This ensures that the second URL links cached in the URL queue to be crawled are unvisited second URL links, prevents the web crawler from repeatedly crawling identical data according to the same second URL link, and further improves the performance of the web crawler.
In addition, an embodiment of the present invention further proposes a computer-readable storage medium on which a link deduplication program based on a web crawler is stored; when executed by a processor, the link deduplication program based on a web crawler implements the steps of the link deduplication method based on a web crawler as described above.
Referring to Fig. 4, Fig. 4 is a structural block diagram of a first embodiment of the link deduplication device based on a web crawler of the present invention.
As shown in Fig. 4, the link deduplication device based on a web crawler proposed by the embodiment of the present invention includes: an extraction module 4001, a sending module 4002, a grabbing module 4003, a parsing module 4004, and a deduplication module 4005.
The extraction module 4001 is configured to, upon receiving a data-grab request for agricultural products to be analyzed, extract a first uniform resource locator (URL) link of a platform to be visited from the data-grab request; the sending module 4002 is configured to send an access request to the platform to be visited according to the first URL link; the grabbing module 4003 is configured to grab the data information in the page corresponding to the first URL link after receiving the response made by the platform to be visited according to the access request; the parsing module 4004 is configured to parse the data information, obtain the second URL links embedded in the page, and add the second URL links to a URL queue to be crawled; the deduplication module 4005 is configured to use a counting Bloom filter based on link features, in combination with multiple hashing, to jointly deduplicate the second URL links in the URL queue to be crawled.
It should be noted that each module involved in this embodiment is a logical module; in practical applications, a logical unit may be one physical unit, part of one physical unit, or a combination of multiple physical units. In addition, to highlight the innovative part of the present invention, units less closely related to solving the technical problem proposed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
In addition, it should be noted that, in this embodiment, when the deduplication module 4005 uses the counting Bloom filter based on link features, in combination with multiple hashing, to jointly deduplicate the second URL links in the URL queue to be crawled, the process is specifically divided into deduplicating the whole-feature URL link corresponding to each URL link and deduplicating the URL link segments.
Since the URL link segments are obtained according to the whole-feature URL link, in order to guarantee that the deduplication module 4005 can smoothly execute the above operations, the correspondence between the second URL links and the whole-feature URL links needs to be determined first.
The manner of determining the correspondence between a second URL link and its whole-feature URL link is substantially as follows:
First, traverse the URL queue to be crawled, perform feature analysis on the second URL link currently traversed, and extract the protocol-type part, the path part, and the query part of the current second URL link;
then, obtain the whole-feature URL link corresponding to the current second URL link according to the protocol-type part, the path part, and the query part;
finally, establish the correspondence between the current second URL link and the whole-feature URL link, and update the correspondence into the URL queue to be crawled.
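The three-step correspondence construction can be sketched as a simple mapping over the queue. The function names and the "|" separator used to assemble the whole-feature URL link are assumptions for the example.

```python
# Illustrative sketch: build the correspondence between each second URL
# link in the queue and its whole-feature URL link, assembled from the
# protocol-type, path, and query parts. The separator is an assumption.
from urllib.parse import urlparse

def whole_feature_link(url: str) -> str:
    """Assemble the whole-feature URL link for one second URL link."""
    p = urlparse(url)
    return "|".join([p.scheme, p.netloc + p.path, p.query])

def build_correspondence(queue):
    """Map each second URL link in the queue to its whole-feature URL link."""
    return {url: whole_feature_link(url) for url in queue}

mapping = build_correspondence(["https://example.com/a?x=1"])
```

In practice this mapping would be stored alongside the queue (the text says it is "updated into" the queue), so the deduplication module can look up the whole-feature URL link of the currently traversed second URL link directly.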
Correspondingly, after the above correspondence is obtained, the operation executed by the deduplication module 4005 is specifically as follows:
First, traverse the URL queue to be crawled and obtain the whole-feature URL link corresponding to the second URL link currently traversed;
then, perform a whole-link duplicate check on the whole-feature URL link using the counting Bloom filter based on link features, and obtain the duplicate-check identifier corresponding to the whole-feature URL link;
then, perform feature identification on the whole-feature URL link according to the duplicate-check identifier to obtain multiple feature fragments;
then, recombine the multiple feature fragments according to the preset URL link recombination rule to obtain N recombined URL link segments;
then, perform a multiple-hash duplicate check on the N recombined URL link segments to obtain the duplicate-check result corresponding to the current second URL link;
finally, retain or discard the second URL links in the URL queue to be crawled according to the duplicate-check result.
It should be noted that, in this embodiment, the N described above is an integer greater than or equal to 1.
However, it should be understood that the above is merely one specific implementation of determining the correspondence between the second URL links and the whole-feature URL links and of using the counting Bloom filter based on link features, in combination with multiple hashing, to jointly deduplicate the second URL links in the URL queue to be crawled; it does not constitute any restriction on the technical solution of the present invention. In specific applications, those skilled in the art may configure it as needed, and the present invention imposes no restriction on this.
Further, in practical applications, in order to minimize the storage space occupied by the second URL links cached in the URL queue to be crawled, after the multiple feature fragments are recombined according to the preset URL link recombination rule to obtain the N recombined URL link segments, the N recombined URL link segments may first be compressed respectively based on the MD5 algorithm to obtain the string ciphertext corresponding to each recombined URL link segment, and the string ciphertext then replaces the content of the corresponding recombined URL link segment.
Correspondingly, the operation of performing a multiple-hash duplicate check on the N recombined URL link segments to obtain the duplicate-check result corresponding to the current second URL link is specifically as follows:
First, extract the string ciphertexts corresponding to the N recombined URL link segments, select any one of the N string ciphertexts, and perform K hash operations on it to obtain K hash values;
then, hash the K hash values into the pre-built bit-vector space as reference hash values, and set an initial count value on the variable-space counter corresponding to each reference hash value;
then, perform K hash operations on each of the remaining N-1 string ciphertexts to obtain the K hash values corresponding to each remaining string ciphertext;
then, randomly hash the K hash values corresponding to each remaining string ciphertext into the bit-vector space, adjacent to one of the reference hash values;
then, using the head-insertion method, insert one preset character before the initial count value corresponding to the adjacent reference hash value for each hash value newly hashed into the bit-vector space;
finally, count the number of preset characters before the initial count value corresponding to each reference hash value, and determine the duplicate-check result corresponding to the current second URL link according to the number of preset characters.
It should be noted that, in this embodiment, the K described above is an integer greater than or equal to 2.
However, it should be understood that the above is merely one specific implementation of obtaining the duplicate-check result corresponding to the current second URL link and does not constitute any restriction on the technical solution of the present invention; in specific applications, those skilled in the art may configure it as needed, and the present invention imposes no restriction on this.
In addition, in practical applications, in order to further reduce the storage space occupied, after the second URL links in the URL queue to be crawled are jointly deduplicated, each second URL link in the URL queue to be crawled after deduplication may also be compressed based on the MD5 algorithm to obtain the string ciphertext corresponding to each second URL link; finally, the string ciphertext replaces the content of the corresponding second URL link, thereby compressing the second URL links in the URL queue to be crawled and reducing the storage space they occupy.
It is not difficult to see from the foregoing description that the link deduplication device based on a web crawler provided in this embodiment uses a counting Bloom filter based on link features, combined with multiple hashing, to perform joint whole-link and partial deduplication on the second URL links cached in the URL queue to be crawled, thereby minimizing the false-positive rate of the counting Bloom filter, effectively improving the performance of the web crawler, enabling the web crawler to quickly obtain the information people need, and improving the user experience as much as possible.
In addition, during deduplication, URL links are compressed based on a compression algorithm such as the MD5 algorithm, thereby minimizing the storage space occupied.
It should be noted that the workflow described above is merely illustrative and does not limit the protection scope of the present invention; in practical applications, those skilled in the art may select part or all of it according to actual needs to achieve the purpose of the solution of this embodiment, and no restriction is imposed here.
In addition, for technical details not described in detail in this embodiment, reference may be made to the link deduplication method based on a web crawler provided by any embodiment of the present invention; details are not repeated here.
Based on the first embodiment of the link deduplication device based on a web crawler described above, a second embodiment of the link deduplication device based on a web crawler of the present invention is proposed.
In this embodiment, the link deduplication device based on a web crawler further includes a deletion module.
Specifically, the deletion module is configured to judge whether a visited second URL link exists in the URL queue to be crawled after deduplication.
Correspondingly, if a visited second URL link exists in the URL queue to be crawled, the visited second URL link is deleted from the URL queue to be crawled; otherwise, the second URL links in the URL queue to be crawled continue to be monitored, and whether a visited second URL link exists continues to be judged.
It should be noted that each module involved in this embodiment is a logical module; in practical applications, a logical unit may be one physical unit, part of one physical unit, or a combination of multiple physical units. In addition, to highlight the innovative part of the present invention, units less closely related to solving the technical problem proposed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
However, it should be understood that the above is merely an example and does not constitute any restriction on the technical solution of the present invention; in specific applications, those skilled in the art may configure it as needed, and the present invention imposes no restriction on this.
It is not difficult to see from the foregoing description that the link deduplication device based on a web crawler provided in this embodiment periodically or in real time detects the access status of the second URL links in the URL queue to be crawled and, upon detecting a visited second URL link in the URL queue to be crawled, deletes the visited second URL link from the URL queue to be crawled. This ensures that the second URL links cached in the URL queue to be crawled are unvisited second URL links, prevents the web crawler from repeatedly crawling identical data according to the same second URL link, and further improves the performance of the web crawler.
It should be noted that the workflow described above is merely illustrative and does not limit the protection scope of the present invention; in practical applications, those skilled in the art may select part or all of it according to actual needs to achieve the purpose of the solution of this embodiment, and no restriction is imposed here.
In addition, for technical details not described in detail in this embodiment, reference may be made to the link deduplication method based on a web crawler provided by any embodiment of the present invention; details are not repeated here.
In addition, it should be noted that, herein, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system including a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements intrinsic to such a process, method, article, or system. In the absence of further restrictions, an element limited by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes the element.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and certainly also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part thereof contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as read-only memory (Read Only Memory, ROM)/RAM, a magnetic disk, or an optical disc), including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above is only a preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the protection scope of the present invention.
Claims (10)
1. A link deduplication method based on a web crawler, characterized in that the method comprises the following steps:
upon receiving a data-grab request for agricultural products to be analyzed, extracting a first uniform resource locator (URL) link of a platform to be visited from the data-grab request;
sending an access request to the platform to be visited according to the first URL link;
after receiving a response made by the platform to be visited according to the access request, grabbing data information in a page corresponding to the first URL link;
parsing the data information, obtaining second URL links embedded in the page, and adding the second URL links to a URL queue to be crawled;
using a counting Bloom filter based on link features, in combination with multiple hashing, to jointly deduplicate the second URL links in the URL queue to be crawled.
2. The method according to claim 1, characterized in that before the step of using the counting Bloom filter based on link features, in combination with multiple hashing, to jointly deduplicate the second URL links in the URL queue to be crawled, the method further comprises:
traversing the URL queue to be crawled, performing feature analysis on the second URL link currently traversed, and extracting a protocol-type part, a path part, and a query part of the current second URL link;
obtaining a whole-feature URL link corresponding to the current second URL link according to the protocol-type part, the path part, and the query part;
establishing a correspondence between the current second URL link and the whole-feature URL link, and updating the correspondence into the URL queue to be crawled.
3. The method according to claim 2, characterized in that the step of using the counting Bloom filter based on link features, in combination with multiple hashing, to jointly deduplicate the second URL links in the URL queue to be crawled comprises:
traversing the URL queue to be crawled and obtaining the whole-feature URL link corresponding to the second URL link currently traversed;
performing a whole-link duplicate check on the whole-feature URL link using the counting Bloom filter based on link features, and obtaining a duplicate-check identifier corresponding to the whole-feature URL link;
performing feature identification on the whole-feature URL link according to the duplicate-check identifier to obtain multiple feature fragments;
recombining the multiple feature fragments according to a preset URL link recombination rule to obtain N recombined URL link segments, wherein N is an integer greater than or equal to 1;
performing a multiple-hash duplicate check on the N recombined URL link segments to obtain a duplicate-check result corresponding to the current second URL link;
retaining or discarding the second URL links in the URL queue to be crawled according to the duplicate-check result.
4. The method according to claim 3, characterized in that after the step of recombining the multiple feature fragments according to the preset URL link recombination rule to obtain the N recombined URL link segments, the method further comprises:
compressing the N recombined URL link segments respectively based on an MD5 algorithm to obtain a string ciphertext corresponding to each of the N recombined URL link segments;
replacing content of the corresponding recombined URL link segment with the string ciphertext.
5. The method as claimed in claim 4, characterized in that the step of performing multiple-hash duplicate checking on the N recombined URL link fragments to obtain the duplicate-checking result corresponding to the current second URL link comprises:
extracting the character-string ciphertexts corresponding to the N recombined URL link fragments, selecting any one character-string ciphertext from the N character-string ciphertexts, and performing K rounds of hash processing on it to obtain K hash values, where K is an integer greater than or equal to 2;
hashing the K hash values into a pre-constructed bit vector space as reference hash values, and setting an initial count value for the counter corresponding to each reference hash value in the space;
performing K rounds of hash processing on each of the remaining N-1 character-string ciphertexts, to obtain the K hash values corresponding to each remaining character-string ciphertext;
randomly hashing the K hash values corresponding to each remaining character-string ciphertext into the bit vector space, adjacent to any one of the reference hash values;
by head insertion, inserting one preset character before the initial count value of the adjacent reference hash value for each hash value newly hashed into the bit vector space;
counting the number of preset characters before the initial count value corresponding to each reference hash value, and determining, according to the number of preset characters, the duplicate-checking result corresponding to the current second URL link.
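The multi-hash procedure of claim 5 can be approximated with K independent salted hashes over one counter array: incrementing a slot plays the role of head-inserting a preset character, and reading the counters replaces counting those characters. The values of K, the space size, and the `duplicate_check` helper below are assumptions, not the patent's parameters.

```python
import hashlib

K = 4            # rounds of hash processing per ciphertext (claim requires K >= 2)
SPACE = 512      # size of the pre-constructed bit vector space
counters = [0] * SPACE

def k_positions(ciphertext, k=K, space=SPACE):
    """Map one character-string ciphertext to k slots via salted MD5."""
    return [int(hashlib.md5(f"{i}|{ciphertext}".encode("utf-8")).hexdigest(), 16) % space
            for i in range(k)]

def duplicate_check(ciphertexts):
    """Hash each ciphertext K times into the counter space; a ciphertext
    whose K slots are all already occupied is reported as a duplicate."""
    results = []
    for ct in ciphertexts:
        pos = k_positions(ct)
        results.append(all(counters[p] > 0 for p in pos))
        for p in pos:
            counters[p] += 1   # analogous to inserting one preset character
    return results

res = duplicate_check(["abc123", "def456", "abc123"])
print(res[0], res[2])  # False True: first sight vs. exact repeat
```

Because every ciphertext must hit K occupied slots to be flagged, a single colliding hash value is not enough to produce a false duplicate, which is the point of requiring K >= 2.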
6. The method as claimed in any one of claims 1 to 5, characterized in that after the step of performing joint deduplication on the second URL link in the to-be-crawled URL queue by using the link-feature-based counting Bloom filter in combination with multiple hashing, the method further comprises:
compressing each second URL link in the deduplicated to-be-crawled URL queue based on the MD5 algorithm, to obtain the character-string ciphertext corresponding to each second URL link;
replacing the content of the corresponding second URL link with the character-string ciphertext.
7. The method as claimed in any one of claims 1 to 5, characterized in that after the step of performing joint deduplication on the second URL link in the to-be-crawled URL queue by using the link-feature-based counting Bloom filter in combination with multiple hashing, the method further comprises:
determining whether the deduplicated to-be-crawled URL queue contains a second URL link that has already been accessed;
if the to-be-crawled URL queue contains a second URL link that has already been accessed, deleting the accessed second URL link from the to-be-crawled URL queue.
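Claim 7's post-deduplication filtering can be sketched as checking the queue against a set of already-visited links. The `visited` set and the deque-based queue below are illustrative assumptions about how the to-be-crawled URL queue might be represented.

```python
from collections import deque

def drop_visited(queue, visited):
    """Remove any second URL link that has already been accessed
    from the to-be-crawled URL queue."""
    return deque(url for url in queue if url not in visited)

to_crawl = deque(["http://a.example/1", "http://a.example/2", "http://a.example/3"])
visited = {"http://a.example/2"}   # hypothetical set of accessed links
to_crawl = drop_visited(to_crawl, visited)
print(list(to_crawl))  # ['http://a.example/1', 'http://a.example/3']
```

A set gives O(1) membership tests, so this pass stays linear in the queue length even for large crawls.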
8. A link deduplication apparatus based on a web crawler, characterized in that the apparatus comprises:
an extraction module, configured to extract, upon receiving a data-crawling request for agricultural products to be analyzed, a first uniform resource locator (URL) link of a platform to be visited from the data-crawling request;
a sending module, configured to send an access request to the platform to be visited according to the first URL link;
a crawling module, configured to crawl, after receiving the response made by the platform to be visited according to the access request, the data information in the page corresponding to the first URL link;
a parsing module, configured to parse the data information to obtain a second URL link embedded in the page, and to add the second URL link to a to-be-crawled URL queue;
a deduplication module, configured to perform joint deduplication on the second URL link in the to-be-crawled URL queue by using a link-feature-based counting Bloom filter in combination with multiple hashing.
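The module division of claim 8 maps naturally onto a small class whose methods mirror the extraction, parsing, and deduplication modules. Everything here is illustrative: the method names, the single-queue design, the naive regex parser, and the plain set standing in for the counting Bloom filter are assumptions, not the patent's code.

```python
import re
from collections import deque

class CrawlerDeduplicator:
    """Skeleton mirroring the claimed modules; network calls are omitted."""

    def __init__(self):
        self.to_crawl = deque()   # the to-be-crawled URL queue
        self.seen = set()         # stand-in for the counting Bloom filter

    def extract(self, request):
        # extraction module: pull the first URL link from the crawl request
        return request["url"]

    def parse(self, page_html):
        # parsing module: naive href extraction (illustrative only)
        return re.findall(r'href="([^"]+)"', page_html)

    def dedup_and_enqueue(self, links):
        # deduplication module: joint dedup before queueing second URL links
        for link in links:
            if link not in self.seen:
                self.seen.add(link)
                self.to_crawl.append(link)

c = CrawlerDeduplicator()
c.dedup_and_enqueue(c.parse('<a href="/x">x</a><a href="/x">x</a><a href="/y">y</a>'))
print(list(c.to_crawl))  # ['/x', '/y']
```

Separating dedup from parsing, as the claim does, means the queue never holds a link twice regardless of how many pages reference it.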
9. A link deduplication device based on a web crawler, characterized in that the device comprises: a memory, a processor, and a web-crawler-based link deduplication program that is stored on the memory and executable on the processor, the web-crawler-based link deduplication program being configured to implement the steps of the web-crawler-based link deduplication method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a web-crawler-based link deduplication program is stored on the computer-readable storage medium, and when the web-crawler-based link deduplication program is executed by a processor, the steps of the web-crawler-based link deduplication method according to any one of claims 1 to 7 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910670803.0A CN110399546B (en) | 2019-07-23 | 2019-07-23 | Link duplicate removal method, device, equipment and storage medium based on web crawler |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110399546A true CN110399546A (en) | 2019-11-01 |
CN110399546B CN110399546B (en) | 2022-02-08 |
Family
ID=68325974
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910670803.0A Active CN110399546B (en) | 2019-07-23 | 2019-07-23 | Link duplicate removal method, device, equipment and storage medium based on web crawler |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110399546B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753162A (en) * | 2020-06-29 | 2020-10-09 | 平安国际智慧城市科技股份有限公司 | Data crawling method, device, server and storage medium |
CN111930924A (en) * | 2020-07-02 | 2020-11-13 | 上海微亿智造科技有限公司 | Data duplicate checking system and method based on bloom filter |
CN112287201A (en) * | 2020-12-31 | 2021-01-29 | 北京精准沟通传媒科技股份有限公司 | Method, device, medium and electronic equipment for removing duplicate of crawler request |
CN112417240A (en) * | 2020-02-21 | 2021-02-26 | 上海哔哩哔哩科技有限公司 | Website link detection method and device and computer equipment |
CN112948654A (en) * | 2019-11-26 | 2021-06-11 | 上海哔哩哔哩科技有限公司 | Webpage crawling method and device and computer equipment |
CN113051498A (en) * | 2021-03-22 | 2021-06-29 | 全球能源互联网研究院有限公司 | URL duplicate removal method and system based on multiple bloom filtering |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106407485A (en) * | 2016-12-20 | 2017-02-15 | 福建六壬网安股份有限公司 | URL de-repetition method and system based on similarity comparison |
CN107798106A (en) * | 2017-10-31 | 2018-03-13 | 广东思域信息科技有限公司 | A kind of URL De-weight methods in distributed reptile system |
CN107885777A (en) * | 2017-10-11 | 2018-04-06 | 北京智慧星光信息技术有限公司 | A kind of control method and system of the crawl web data based on collaborative reptile |
CN108121706A (en) * | 2016-11-28 | 2018-06-05 | 央视国际网络无锡有限公司 | A kind of optimization method of distributed reptile |
CN108628871A (en) * | 2017-03-16 | 2018-10-09 | 哈尔滨英赛克信息技术有限公司 | A kind of link De-weight method based on chain feature |
CN109561163A (en) * | 2017-09-27 | 2019-04-02 | 阿里巴巴集团控股有限公司 | The generation method and device of uniform resource locator rewriting rule |
CN110008419A (en) * | 2019-03-11 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Removing duplicate webpages method, device and equipment |
Non-Patent Citations (2)
Title |
---|
WEIPENG ZHOU et al.: "An Improved Bloom Filter in Distributed Crawler", 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation * |
YU, Chen: "Research on Key Technologies of a High-Performance Web Crawler System", China Master's Theses Full-text Database, Information Science and Technology * |
Also Published As
Publication number | Publication date |
---|---|
CN110399546B (en) | 2022-02-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110399546A (en) | Link De-weight method, device, equipment and storage medium based on web crawlers | |
CN102801697B (en) | Malicious code detection method and system based on plurality of URLs (Uniform Resource Locator) | |
CN105243159B (en) | A kind of distributed network crawler system based on visualization script editing machine | |
CN102857493B (en) | Content filtering method and device | |
CN100562873C (en) | Obtain the system and method for web page element in the webpage | |
CN104714965B (en) | Static resource De-weight method, static resource management method and device | |
CN103118007B (en) | A kind of acquisition methods of user access activity and system | |
CN106708841B (en) | The polymerization and device of website visitation path | |
CN106549974A (en) | Prediction the social network account whether equipment of malice, method and system | |
CN102436564A (en) | Method and device for identifying falsified webpage | |
CN107343031A (en) | A kind of method, apparatus for automatically updating file, electronic equipment and storage medium | |
CN107896219A (en) | A kind of detection method, system and the relevant apparatus of website fragility | |
CN103838728B (en) | The processing method and browser of info web | |
CN107239701A (en) | Recognize the method and device of malicious websites | |
CN107577423A (en) | A kind of method and system for optimizing memory space | |
CN109359263A (en) | A kind of user behavior characteristics extracting method and system | |
CN104123311B (en) | A kind of data traffic reminding method and device | |
CN103246675B (en) | A kind of method and apparatus for being used to capture website data | |
JP5364012B2 (en) | Data extraction apparatus, data extraction method, and data extraction program | |
CN104424188B (en) | The system and method that the web data of acquisition is updated | |
KR101481910B1 (en) | Apparatus and method for monitoring suspicious information in web page | |
CN110413861A (en) | Link extracting method, device, equipment and storage medium based on web crawlers | |
CN104219271B (en) | Based on the asynchronous multiserver synchronous method for downloading the page of multithreading | |
CN110020297A (en) | A kind of loading method of web page contents, apparatus and system | |
CN104881453B (en) | A kind of method and apparatus identifying type of webpage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||