Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a web crawler-based link extraction device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the web crawler-based link extraction device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM), or may be a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the architecture shown in fig. 1 does not constitute a limitation on the web crawler-based link extraction device, which may include more or fewer components than shown, a combination of some components, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, a user interface module, and a web crawler-based link extraction program.
In the web crawler-based link extraction apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a web server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 of the web crawler-based link extraction device of the present invention may be disposed in the web crawler-based link extraction device, and the web crawler-based link extraction device calls the web crawler-based link extraction program stored in the memory 1005 through the processor 1001 and executes the web crawler-based link extraction method provided by the embodiment of the present invention.
An embodiment of the present invention provides a link extraction method based on a web crawler, and referring to fig. 2, fig. 2 is a schematic flow diagram of a first embodiment of a link extraction method based on a web crawler according to the present invention.
In this embodiment, the link extraction method based on web crawlers includes the following steps:
Step S10, when a data grabbing request for an agricultural product to be analyzed is received, extracting, from the data grabbing request, a first Uniform Resource Locator (URL) link of a platform to be accessed and agricultural product topic information related to the agricultural product to be analyzed.
Specifically, the execution subject of this embodiment is any terminal device on which a web crawler system is deployed or installed.
It should be noted that, in order to maximize the speed of operations such as capturing and analyzing the data corresponding to the agricultural product to be analyzed, the web crawler system described in this embodiment is preferably a distributed web crawler system.
In addition, it should be understood that, in practical applications, the terminal device may be a client device or a server device, and is not limited herein.
In addition, in practical applications, the platform to be accessed may be an online mall that displays the agricultural product to be analyzed.
Accordingly, the Uniform Resource Locator (URL) is the network address required for accessing the online mall.
In addition, it should be understood that the agricultural products to be analyzed are only a general term for various common agricultural products at present, and in practical applications, the agricultural products to be analyzed may be tea products, fruit and vegetable products, food products, and the like, which are not listed here, and no limitation is made thereto.
For ease of understanding, this example uses tea product as the agricultural product to be analyzed.
Correspondingly, the agricultural product topic information is the main information of the tea product. In practical applications, it may specifically include characteristic information related to the tea product to be analyzed, for example, that the type of the tea product is green tea, that its season is before the Qingming festival, and that its price is 500/kg to 1000/kg. Such characteristics are not listed exhaustively here, and no limitation is imposed.
Step S20, sending an access request to the platform to be accessed according to the first URL link.
Specifically, in practical applications, the web crawler may send an access request to the platform to be accessed (substantially, a server of the platform) by using a HyperText Transfer Protocol (HTTP) that transmits data based on a Transmission Control Protocol/Internet Protocol (TCP/IP).
It should be understood that the above is only a specific implementation manner of sending the access request to the platform to be accessed, and the technical solution of the present invention is not limited at all, and in practical applications, those skilled in the art may set the implementation manner as needed, and the implementation manner is not limited herein.
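For ease of understanding, a minimal sketch of sending such an access request over HTTP is given below in Python. It is an illustration only and not the implementation of this embodiment: the function names `build_access_request` and `fetch_page`, the User-Agent string and the 10-second timeout are assumptions introduced here, and only the Python standard library is used.

```python
from typing import Optional
from urllib.error import URLError
from urllib.request import Request, urlopen


def build_access_request(first_url: str) -> Request:
    """Build an HTTP GET request for the platform to be accessed."""
    # The User-Agent value is a hypothetical placeholder.
    headers = {"User-Agent": "example-crawler/0.1"}
    return Request(first_url, headers=headers, method="GET")


def fetch_page(first_url: str, timeout: float = 10.0) -> Optional[str]:
    """Send the access request; return the page body, or None on failure."""
    try:
        with urlopen(build_access_request(first_url), timeout=timeout) as response:
            if response.status != 200:
                return None
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (URLError, TimeoutError, ValueError):
        # Covers unreachable hosts, HTTP errors, timeouts and malformed URLs.
        return None
```

Returning `None` on any failure keeps the caller simple: a page is parsed only when the platform has responded successfully, which corresponds to the successful response described for step S30.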
Step S30, after receiving a response from the platform to be accessed according to the access request, capturing data information in a page corresponding to the first URL link.
It should be understood that, in practical applications, if the access request sent to the platform to be accessed is successful, and the platform to be accessed successfully verifies the first URL link carried in the access request, a successful response is made, and the data information in the page corresponding to the first URL link is fed back. At this time, the web crawler may capture the data information in the page corresponding to the first URL link, which is fed back by the platform to be accessed.
Step S40, parsing the data information to obtain a second URL link embedded in the page, and adding the second URL link to a URL queue to be crawled.
It should be understood that, in practical applications, besides displaying data information related to the agricultural product to be analyzed, the page corresponding to the first URL link may also display a plurality of URL links related to that data information; for ease of distinction, these are referred to herein as second URL links.
For example, the page corresponding to the first URL link displays the homepage of an online mall that includes the agricultural product to be analyzed. The homepage mainly displays four categories of agricultural product information: agricultural product A, agricultural product B, agricultural product C and agricultural product D. Each category corresponds to a second URL link, and the page corresponding to that second URL link mainly displays the subcategories of the corresponding agricultural product.
For example, agricultural products A-1, A-2 and A-3 are mainly displayed in the page corresponding to the second URL link of agricultural product A; agricultural products B-1 and B-2 are mainly displayed in the page corresponding to the second URL link of agricultural product B; agricultural products C-1, C-2, C-3 and C-4 are mainly displayed in the page corresponding to the second URL link of agricultural product C; and agricultural products D-1 and D-2 are mainly displayed in the page corresponding to the second URL link of agricultural product D.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can make settings according to needs, and the present invention is not limited herein.
In addition, in this embodiment, the second URL link embedded in the page is added to the URL queue to be crawled because, in practical applications, the web crawler crawls a large amount of data, so the number of second URL links obtained by parsing is relatively large. Moreover, crawling and parsing each second URL link takes considerable time, so a large number of second URL links cannot be accessed in a short time; the second URL links obtained each time therefore need to be added to the URL queue to be crawled.
In addition, the "first" of the "first URL link" and the "second" of the "second URL link" are only used for distinguishing the URL link corresponding to the platform to be visited from the URL link embedded in the page corresponding to the URL link, and do not limit the URL link itself. In practical applications, any "second URL link" may be regarded as a "first URL link" with respect to the URL link embedded in the corresponding page.
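The parsing and queuing described for step S40 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of this embodiment: the class `LinkCollector` and the function `enqueue_second_links` are hypothetical names, the queue is modeled as a `collections.deque`, and only links carried by `href` attributes of `<a>` tags are treated as second URL links.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the URL links embedded in a page (hypothetical helper)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the first URL link.
                self.links.append(urljoin(self.base_url, value))


def enqueue_second_links(first_url: str, page_html: str, queue: deque) -> None:
    """Parse the page of the first URL link and queue each second URL link."""
    collector = LinkCollector(first_url)
    collector.feed(page_html)
    queue.extend(collector.links)
```

Because any second URL link can later play the role of a first URL link, the same function can be applied again to each page fetched from the queue.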
In addition, it is worth mentioning that, in practical applications, a page corresponding to a second URL link may include some interference information, such as advertisement information in various formats (picture, audio, video, etc.), in addition to the information related to the agricultural product to be analyzed. Therefore, in order to simplify the structure of the page corresponding to the second URL link as much as possible and to make it easier for the web crawler to crawl metadata, denoising processing can be performed on the second URL link when it is obtained and before it is added to the URL queue to be crawled.
For ease of understanding, the present embodiment provides a specific denoising method, which is roughly as follows:
(1) Parsing the data information to obtain a second URL link embedded in the page.
(2) Parsing the second URL link to obtain normalized tags corresponding to the second URL link.
Specifically, in practical applications, what is normalized here is essentially the tags in the page corresponding to the second URL link, since current web pages are typically written in HyperText Markup Language (HTML).
In addition, since in practical applications noise links usually appear in picture tags, in the tags that define hyperlinks, and in the URLs that specify hyperlink targets, only such tags need to be normalized.
(3) Generating an abstract tree corresponding to the second URL link according to the normalized tags.
(4) Matching the node content of the abstract tree against the agricultural product topic information based on a DOM tree matching method, removing unmatched node content, and obtaining the second URL link that matches the agricultural product topic information.
Specifically, since the abstract tree is generated from the normalized tags, each node of the abstract tree is essentially a normalized tag. Therefore, when the node content of the abstract tree is matched against the agricultural product topic information, the keywords in the node and the keywords in the agricultural product topic information are extracted, and the two sets of keywords are then compared to determine whether the node needs to be removed. In this way, after the content of every node in the abstract tree has been matched against the agricultural product topic information, the noise links are removed, and the second URL link that matches the agricultural product topic information is obtained.
(5) Adding the second URL link that matches the agricultural product topic information to the URL queue to be crawled.
It should be understood that the embodiment is only a specific denoising method, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art may set the denoising method according to needs, and the present invention is not limited herein.
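The keyword comparison at the core of the denoising steps above can be sketched as follows. This is a deliberately simplified illustration, not the DOM tree matching method of this embodiment: a flat list of hyperlink nodes stands in for the abstract tree, keywords are taken to be whitespace-separated lower-case words, and the names `AnchorNodeParser` and `denoise_links` are hypothetical.

```python
from html.parser import HTMLParser


class AnchorNodeParser(HTMLParser):
    """Collect (href, text) pairs for hyperlink tags.

    HTMLParser lower-cases tag and attribute names, which serves here
    as a very simple form of tag normalization.
    """

    def __init__(self) -> None:
        super().__init__()
        self.nodes = []
        self._open = None  # [href, accumulated text] of the open <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = [href, ""]

    def handle_data(self, data):
        if self._open is not None:
            self._open[1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self.nodes.append((self._open[0], self._open[1].strip()))
            self._open = None


def denoise_links(page_html: str, topic_keywords: list) -> list:
    """Keep only the links whose node keywords overlap the topic keywords."""
    parser = AnchorNodeParser()
    parser.feed(page_html)
    wanted = {keyword.lower() for keyword in topic_keywords}
    kept = []
    for href, text in parser.nodes:
        node_keywords = set(text.lower().split())
        if node_keywords & wanted:  # unmatched nodes are removed
            kept.append(href)
    return kept
```

For instance, with the topic keyword "tea", a link labelled "Green tea" is kept while an advertisement link labelled "Win a phone" is dropped.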
Step S50, processing the first URL link and the second URL links in the URL queue to be crawled based on an anchor multiple attribute integration mode of path aggregation, to obtain multiple attribute theme information in a rich text format corresponding to each second URL link.
To facilitate understanding of the above operation of obtaining the multiple attribute theme information in the rich text format corresponding to each second URL link, a specific implementation is given below, roughly as follows:
(1) Generating a path access directed graph corresponding to the agricultural product to be analyzed according to the first URL link and the second URL links in the URL queue to be crawled.
Specifically, each vertex of the path access directed graph is a page corresponding to a URL link, which is specifically described by taking fig. 3 as an example.
As shown in fig. 3, the source web page u is substantially a page corresponding to the first URL link, the tea type and the tea price are pages corresponding to two embedded second URL links parsed from the data information of the page, and the destination pages v1, v2 and v3 are second URL links of a next page parsed from the pages corresponding to the two second URL links.
(2) Determining the shortest access paths in the path access directed graph based on the anchor multiple attribute integration mode of path aggregation, to obtain a shortest access path set.
Specifically, in practical applications, multiple paths may exist in a path access directed graph, and a shortest path among them may be either a loop-free (open) path or a looped (closed) path.
For ease of distinction, in practical applications, different sets may be used to represent the looped shortest paths and the loop-free shortest paths.
For ease of understanding, a specific representation of the set of shortest loop-free paths is given below.
For example, M shortest loop-free paths from the source web page to the target page may be represented by the following set:
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can make settings according to needs, and the present invention is not limited herein.
(3) Determining an anchor text corresponding to each shortest access path in the shortest access path set to obtain an access path anchor text set corresponding to the shortest access path set, and allocating a weight to each element in the access path anchor text set.
When a weight is assigned to each element in the access path anchor text set, the following specific steps may be performed:
First, it is stipulated that $P_m$ is a shortest loop-free path, where the value range of m satisfies $1 \le m \le M$.
Then, it is stipulated that $w(P_m) \le w(P_{m+1})$, where the value range of m satisfies $1 \le m \le M-1$.
Next, it is stipulated that $w(P_m) \le w(P)$, where P ranges over the loop-free paths from the source web page to the target page that do not belong to the set.
Finally, it is stipulated that $P_m$ is determined before $P_{m+1}$, where the value range of m satisfies $1 \le m \le M-1$.
Here, w denotes the weight, and M is a positive integer.
Initially, a weight of 1 is assigned to each element (i.e., each edge), so if a path P traverses m directed edges, then w(P) = m.
It should be understood that the above is only a specific implementation manner for assigning weights, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can set the implementation manner as needed, and the implementation manner is not limited herein.
(4) Normalizing the weight corresponding to each element in the access path anchor text set according to a preset weight normalization formula.
Specifically, this embodiment adopts a weight normalization formula that gives the normalized weight of an element e in the access path anchor text set.
Still taking the path access directed graph shown in fig. 3 as an example, the weights of the elements in the anchor texts from the source web page u to the target pages v1, v2 and v3 in the original access paths can be revised by the weight normalization formula.
(5) Sorting the normalized weights in descending order to obtain the multiple attribute theme information in the rich text format corresponding to the second URL link.
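For ease of understanding, steps (1) to (5) above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the implementation of this embodiment. The graph is a dictionary that maps each page to its outgoing (anchor text, linked page) edges, and every edge has weight 1, so w(P) is the number of edges of P. The normalization rule used here (each element's weight divided by w(P), keeping the largest value when an anchor text occurs on several paths) is an assumption made for illustration only, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Path access directed graph: page -> list of (anchor text, linked page).
Graph = Dict[str, List[Tuple[str, str]]]


def loop_free_paths(graph: Graph, source: str, target: str) -> List[List[str]]:
    """Enumerate loop-free paths as lists of anchor texts, one per edge."""
    paths: List[List[str]] = []

    def walk(page: str, visited: set, anchors: List[str]) -> None:
        if page == target:
            paths.append(list(anchors))
            return
        for anchor, nxt in graph.get(page, []):
            if nxt not in visited:  # never revisit a page, so no loops
                walk(nxt, visited | {nxt}, anchors + [anchor])

    walk(source, {source}, [])
    return paths


def shortest_paths(graph: Graph, source: str, target: str, m: int) -> List[List[str]]:
    """Return up to M loop-free paths ordered so that w(P_m) <= w(P_{m+1})."""
    # Each edge has weight 1, so the path weight is simply its edge count.
    return sorted(loop_free_paths(graph, source, target), key=len)[:m]


def normalized_anchor_weights(paths: List[List[str]]) -> List[Tuple[str, float]]:
    """Normalize element weights (assumed rule: 1 / w(P)) and sort descending."""
    weights: Dict[str, float] = {}
    for path in paths:
        for anchor in path:
            share = 1.0 / len(path)
            weights[anchor] = max(weights.get(anchor, 0.0), share)
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)
```

Under this assumed rule, anchor texts on shorter access paths rank first, which matches the intuition that a page reached in fewer clicks from the source web page is more closely tied to it. Enumerating every loop-free path is only practical for small graphs; a larger crawl would need a dedicated k-shortest-paths algorithm.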
Step S60, comparing the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the topic information of the agricultural product, and extracting the second URL link corresponding to the multiple attribute topic information of which the similarity with the topic information of the agricultural product meets a preset threshold value.
To facilitate understanding the operation of web crawler-based link extraction, a specific implementation is given below, roughly as follows:
(1) Extracting a multiple attribute theme feature word from the multiple attribute theme information, and performing hash processing on it to obtain a first hash value, wherein the multiple attribute theme feature word is one element in the access path anchor text set corresponding to the multiple attribute theme information.
(2) Acquiring the weight corresponding to the multiple attribute theme feature word from the access path anchor text set, and quantizing the first hash value into a first vector in combination with the weight.
(3) Extracting an agricultural product topic feature word from the agricultural product topic information, and performing hash processing on it to obtain a second hash value.
(4) Quantizing the second hash value into a second vector according to a preset weight of the agricultural product topic feature word.
(5) Comparing the first vector with the second vector, and extracting the second URL link corresponding to the multiple attribute theme information whose similarity with the agricultural product topic information meets a preset threshold.
Specifically, in this embodiment, the comparison between the multiple attribute theme information and the agricultural product topic information is converted into a comparison between two vectors, so that the comparison result can be obtained more intuitively, which facilitates link extraction and ensures accuracy.
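The hashing, quantization and vector comparison described above can be sketched as follows. The embodiment does not fix the hash function, the vector width or the similarity measure, so this sketch assumes an MD5-based 64-bit hash, a signed weighting of each bit (plus the weight for a 1 bit, minus the weight for a 0 bit) and cosine similarity with a threshold of 0.8. All function names are hypothetical.

```python
import hashlib
import math
from typing import List


def hash_bits(word: str, width: int = 64) -> List[int]:
    """Hash a feature word into `width` bits (assumed: truncated MD5)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    value = int.from_bytes(digest[: width // 8], "big")
    return [(value >> i) & 1 for i in range(width)]


def weighted_vector(word: str, weight: float, width: int = 64) -> List[float]:
    """Quantize the hash value into a vector in combination with the weight."""
    return [weight if bit else -weight for bit in hash_bits(word, width)]


def cosine_similarity(first: List[float], second: List[float]) -> float:
    dot = sum(x * y for x, y in zip(first, second))
    norm = math.sqrt(sum(x * x for x in first)) * math.sqrt(sum(y * y for y in second))
    return dot / norm if norm else 0.0


def matches_topic(theme_word: str, theme_weight: float,
                  topic_word: str, topic_weight: float,
                  threshold: float = 0.8) -> bool:
    """Compare the first vector with the second vector against a threshold."""
    first = weighted_vector(theme_word, theme_weight)
    second = weighted_vector(topic_word, topic_weight)
    return cosine_similarity(first, second) >= threshold
```

With a cryptographic hash such as MD5, two different words produce essentially unrelated bit patterns, so this sketch in effect tests whether the feature words coincide. Capturing graded similarity would require summing the weighted vectors of several feature words per link, which is left out here for brevity.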
As can be seen from the above description, the link extraction method based on a web crawler provided in this embodiment processes the first URL link of the platform to be accessed and the second URL links in the URL queue to be crawled, based on an anchor multiple attribute integration mode of path aggregation, to obtain multiple attribute theme information in a rich text format corresponding to each second URL link. It then compares the multiple attribute theme information corresponding to each second URL link in the URL queue to be crawled with the agricultural product topic information, and extracts the second URL links corresponding to the multiple attribute theme information whose similarity with the agricultural product topic information meets a preset threshold. This effectively ensures the accuracy of extracting specific URL links and avoids the resource waste caused by the web crawler crawling irrelevant links, which improves the performance of the web crawler, enables it to acquire the required information quickly and accurately, and improves the user experience.
Referring to fig. 4, fig. 4 is a flowchart illustrating a web crawler-based link extraction method according to a second embodiment of the present invention.
Based on the first embodiment, before the step S50, the web crawler-based link extraction method of this embodiment further includes:
Step S00, performing joint deduplication on the second URL links in the URL queue to be crawled by adopting a counting Bloom filter based on link characteristics in combination with multiple hashes.
Specifically, this joint deduplication is mainly divided into deduplication of the overall feature URL link corresponding to each second URL link and deduplication of URL link segments.
Since the URL link segments are obtained from the overall feature URL link, the correspondence between each second URL link and its overall feature URL link needs to be determined first, so that the joint deduplication operation can proceed smoothly.
For ease of understanding, this embodiment provides a specific implementation for determining the correspondence between the second URL link and the overall feature URL link, roughly as follows:
(1) Traversing the URL queue to be crawled, performing feature analysis on the currently traversed second URL link, and extracting the protocol type part, the path part and the query part of the current second URL link.
Specifically, in practical applications, a URL link is used to uniquely identify a resource on the network, and a URL link typically contains the following five components: a protocol type part (usually denoted by Protocol), a server address part (usually denoted by Host), a port number part (usually denoted by Port), a path part (usually denoted by Path), and a query part (usually denoted by Fragment).
Among these, the protocol type part, the path part and the query part usually embody the characteristics of a URL link.
Therefore, in this embodiment, the URL queue to be crawled is traversed and feature analysis is performed on the currently traversed second URL link, so as to extract its protocol type part (denoted $p_1$ below for convenience of description), its path part (denoted $p_2$) and its query part (denoted $p_3$).
(2) Obtaining the overall feature URL link corresponding to the current second URL link according to the protocol type part, the path part and the query part.
Specifically, since $p_1$, $p_2$ and $p_3$ together embody all the characteristics of the current second URL link, they are combined to obtain the overall feature URL link corresponding to the current second URL link; hereinafter, $p_1p_2p_3$ denotes the overall feature URL link corresponding to each second URL link.
(3) Establishing a correspondence between the current second URL link and the overall feature URL link, and updating the correspondence into the URL queue to be crawled.
Specifically, in this embodiment, the correspondence between the current second URL link and the overall feature URL link is established and updated into the URL queue to be crawled so that, in the subsequent deduplication of the second URL links, the overall feature URL link corresponding to the current second URL link can be found quickly from the correspondence, and the URL link segments corresponding to the current second URL link can then be obtained from the overall feature URL link.
In addition, in practical applications, the correspondence need not be updated into the URL queue to be crawled and may instead be stored separately. In that case, when joint deduplication is performed on the second URL links in the URL queue to be crawled, the overall feature URL link corresponding to the currently traversed second URL link is looked up in the separately stored correspondence table.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can make settings according to needs, and the present invention is not limited herein.
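The feature analysis and correspondence described above can be sketched with the Python standard library as follows. This is a minimal illustration, not the implementation of this embodiment: `overall_feature_url` and `build_correspondence` are hypothetical names, the overall feature URL link is represented as a tuple of the protocol type part, the path part and the query part, and the correspondence is kept in a separately stored dictionary.

```python
from typing import Dict, Iterable, Tuple
from urllib.parse import urlsplit

OverallFeature = Tuple[str, str, str]  # (protocol type, path, query)


def overall_feature_url(second_url: str) -> OverallFeature:
    """Extract the three feature parts of a second URL link."""
    parts = urlsplit(second_url)
    # The server address, port and fragment are deliberately left out.
    return (parts.scheme, parts.path, parts.query)


def build_correspondence(queue: Iterable[str]) -> Dict[str, OverallFeature]:
    """Map each second URL link in the queue to its overall feature URL link."""
    return {url: overall_feature_url(url) for url in queue}
```

For a hypothetical link such as `https://mall.example.com:8080/tea/green?season=spring#top`, the three parts are `https`, `/tea/green` and `season=spring`.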
Further, after the correspondence and the overall feature URL link corresponding to each second URL link have been obtained, the operation of performing joint deduplication on the second URL links in the URL queue to be crawled, by adopting the counting Bloom filter based on link characteristics in combination with multiple hashes, may specifically be as follows:
(1) Traversing the URL queue to be crawled, and acquiring the overall feature URL link corresponding to the currently traversed second URL link.
Specifically, the overall feature URL link corresponding to the currently traversed second URL link is obtained according to the above correspondence.
(2) Performing an overall duplicate check on the overall feature URL link by adopting the counting Bloom filter based on link characteristics, to obtain a duplicate check mark corresponding to the overall feature URL link.
Specifically, the counting Bloom filter used in this embodiment is not the counting Bloom filter used in existing link deduplication, but a counting Bloom filter based on the link characteristics of URL links.
That is to say, when the counting Bloom filter of this embodiment deduplicates links, it first performs feature recognition on the overall feature URL link corresponding to each second URL link in the URL queue to be crawled, and then performs the overall duplicate check according to the recognized features; in other words, each incoming second URL link is compared by feature during deduplication, thereby achieving the overall duplicate check.
In addition, in order to conveniently identify the URL link segments that are subsequently recombined from the feature segments, a corresponding duplicate check mark is assigned to the overall feature URL link.
(3) Performing feature recognition on the overall feature URL link according to the duplicate check mark, to obtain a plurality of feature segments.
Specifically, still taking the overall feature URL link $p_1p_2p_3$ as an example, after feature recognition is performed on it, the obtained feature segments may specifically be the segments respectively containing the protocol type part, the path part and the query part, that is, feature segment $p_1$, feature segment $p_2$ and feature segment $p_3$.
(4) Recombining the plurality of feature segments according to a preset URL link recombination rule, to obtain N recombined URL link segments.
It should be understood that, since an overall feature URL link is composed of three parts (a protocol type part, a path part and a query part), at least one recombined URL link segment is obtained; that is, N is an integer greater than or equal to 1 in this embodiment.
In addition, in practical applications, the URL link recombination rule may be set by those skilled in the art as required, for example, that a recombined URL link segment must include feature segment $p_1$, or that a recombined URL link segment cannot include feature segment $p_3$; such rules are not listed exhaustively here, and no limitation is imposed.
Accordingly, if the URL link recombination rule is that a recombined URL link segment must include feature segment $p_1$, the resulting recombined URL link segments are essentially: a URL link segment including only feature segment $p_1$, a URL link segment including only feature segments $p_1$ and $p_2$, and a URL link segment including only feature segments $p_1$ and $p_3$.
If the URL link recombination rule is that a recombined URL link segment cannot include feature segment $p_3$, the resulting recombined URL link segments are essentially: a URL link segment including only feature segment $p_1$, and a URL link segment including only feature segments $p_1$ and $p_2$.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can set the technical solution according to actual needs, and the technical solution is not limited herein.
(5) Performing a multiple hash duplicate check on the N recombined URL link segments, to obtain a duplicate check result corresponding to the current second URL link.
It should be noted that, in practical applications, a large number of second URL links may be cached in the URL queue to be crawled, so the number of URL link segments obtained after recombination is even larger. Therefore, in this embodiment, in order to minimize the storage space occupied by the second URL links cached in the URL queue to be crawled, after the plurality of feature segments are recombined according to the preset URL link recombination rule to obtain N recombined URL link segments, the N recombined URL link segments may each be compressed based on the MD5 algorithm to obtain the corresponding character string ciphertexts, and each character string ciphertext then replaces the content of the corresponding recombined URL link segment.
It should be understood that the above is only a specific compression method, and the technical solution of the present invention is not limited in any way, and in practical applications, a person skilled in the art can select a suitable compression method according to actual needs, and is not limited herein.
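The recombination of feature segments and the MD5-based compression described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of this embodiment: the recombination rule is modeled as a predicate over segment names, only combinations of one or two feature segments are produced (consistent with the example of the rule requiring feature segment p1), and `recombine` and `md5_ciphertexts` are hypothetical names.

```python
import hashlib
from itertools import combinations
from typing import Callable, List, Tuple


def recombine(p1: str, p2: str, p3: str,
              rule: Callable[[Tuple[str, ...]], bool]) -> List[str]:
    """Recombine feature segments into URL link segments allowed by `rule`."""
    segments = {"p1": p1, "p2": p2, "p3": p3}
    fragments = []
    # Assumption: the full combination p1p2p3 is the overall feature URL
    # link itself, so only combinations of size 1 and 2 are generated.
    for size in (1, 2):
        for names in combinations(("p1", "p2", "p3"), size):
            if rule(names):
                fragments.append("".join(segments[name] for name in names))
    return fragments


def md5_ciphertexts(fragments: List[str]) -> List[str]:
    """Replace each recombined segment with its 32-character MD5 hex digest."""
    return [hashlib.md5(f.encode("utf-8")).hexdigest() for f in fragments]
```

For example, the rule "must include feature segment p1" can be passed as `lambda names: "p1" in names`, which yields the three segments p1, p1p2 and p1p3. Since every MD5 digest has the same fixed length, the storage needed per segment no longer depends on the length of the URL link.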
Correspondingly, the operation of performing multiple hash duplicate checking on the N recombined URL link segments to obtain the duplicate checking result corresponding to the current second URL link specifically includes:
(5-1) Extracting the character string ciphertexts corresponding to the N recombined URL link segments, and selecting any one of the N character string ciphertexts for K rounds of hash processing to obtain K hash values.
It should be understood that, since the link deduplication operation provided in this embodiment combines multiple hashes when performing joint deduplication on links, that is, at least two rounds of hash processing are performed on a character string ciphertext, K is an integer greater than or equal to 2.
(5-2) Hashing the K hash values into a pre-constructed bit vector space as reference hash values, and setting an initial count value for the space variable counter corresponding to each reference hash value.
Specifically, in this embodiment, the initial count value displayed on the space variable counter corresponding to each reference hash value is represented by "0".
(5-3) Performing K rounds of hash processing on each of the remaining N-1 character string ciphertexts, to obtain K hash values corresponding to each remaining character string ciphertext.
(5-4) Randomly hashing the K hash values corresponding to each remaining character string ciphertext into the bit vector space, such that each of these hash values is adjacent to some reference hash value.
Specifically, in order to determine whether the hash value newly hashed into the bit vector space is adjacent to the reference hash value, a determination criterion may be preset, for example, when a new hash value is inserted between two adjacent reference hash values, the reference hash value closest to the newly inserted hash value may be selected as the adjacent reference hash value.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can set the technical solution according to actual needs, and the technical solution is not limited herein.
(5-5) for each hash value newly hashed into the bit vector space, inserting a preset character in front of the initial count value corresponding to the adjacent reference hash value by using a head insertion method.
Specifically, in this embodiment, the preset character is represented by "1".
For example, for a given reference hash value, the initial count value of the corresponding space variable counter is "0". When a new hash value is hashed to a position adjacent to that reference hash value, a preset character "1" is inserted in front of "0" by the head insertion method, and the count value of the space variable counter becomes "10".
Accordingly, if two new hash values are hashed to positions adjacent to the reference hash value, two preset characters "1" are inserted in front of "0" by the head insertion method, and the count value of the space variable counter becomes "110".
(5-6) counting the number of preset characters in front of the initial count value corresponding to each reference hash value, and determining the duplicate checking result corresponding to the current second URL link according to the number of preset characters.
Specifically, the determined duplicate checking result may be:
if the number of preset characters "1" in front of the initial count value "0" is greater than 1, determining that the recombined URL link segment is repeated and needs to be discarded;
otherwise, determining that the recombined URL link segment is not repeated and can be retained.
(6) retaining or discarding the second URL link in the URL queue to be crawled according to the duplicate checking result.
It should be understood that the above is only a specific implementation manner of joint deduplication, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art may reasonably adjust the implementation manner according to needs, and the implementation manner is not limited herein.
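Steps (5-1) to (5-6) can be sketched as follows. This is only an illustrative reading of the steps under stated assumptions: the K hash functions are simulated by salting MD5 with the hash index, the bit vector space is the integer range 0 to m-1, the first ciphertext is the one selected in step (5-1), and "adjacent" is taken to mean the reference hash value at the smallest absolute distance. The names k_hashes, head_insert, and multi_hash_check are hypothetical.

```python
import hashlib


def k_hashes(text, k, m):
    """Return k positions in a bit vector space of size m (salted MD5, an assumption)."""
    return [
        int(hashlib.md5(f"{i}:{text}".encode("utf-8")).hexdigest(), 16) % m
        for i in range(k)
    ]


def head_insert(counter, preset="1"):
    """Insert the preset character in front of the current count value."""
    return preset + counter


def multi_hash_check(ciphertexts, k=2, m=1024):
    """Return, per reference hash value, the number of preset characters collected."""
    # (5-1), (5-2): the first ciphertext supplies the reference hash values,
    # each with a space variable counter initialised to "0".
    counters = {ref: "0" for ref in k_hashes(ciphertexts[0], k, m)}
    # (5-3) to (5-5): every hash of the remaining ciphertexts is attributed to
    # the nearest reference hash value, whose counter receives one more "1".
    for text in ciphertexts[1:]:
        for h in k_hashes(text, k, m):
            nearest = min(counters, key=lambda ref: abs(ref - h))
            counters[nearest] = head_insert(counters[nearest])
    # (5-6): count the preset characters in front of the initial "0".
    return {ref: value.count("1") for ref, value in counters.items()}


counts = multi_hash_check(["c1", "c2", "c3"], k=2)
# Reference hash values whose count is greater than 1, per the rule above.
flagged = [ref for ref, n in counts.items() if n > 1]
```

The threshold comparison is kept outside multi_hash_check because the total number of preset characters is always (N-1)×K, so the counts grow with both N and K.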
In addition, in practical application, in order to further reduce the occupation of a storage space, after a counting bloom filter of link characteristics is adopted and multiple hashes are combined to perform joint deduplication on the second URL links in the URL queue to be crawled, each second URL link in the URL queue to be crawled after deduplication is performed can be compressed based on an MD5 algorithm, and then a character string ciphertext corresponding to each second URL link is obtained; and finally, replacing the content in the corresponding second URL link with the character string ciphertext, so that the second URL link in the URL queue to be crawled is compressed as much as possible, and the occupation of a storage space is reduced.
As can be seen from the above description, the link extraction method based on the web crawler provided in this embodiment performs deduplication operation on the second URL link in the URL queue to be crawled before performing extraction operation on the second URL link in the URL queue to be crawled, thereby further reducing unnecessary interference in the link extraction process and improving the extraction efficiency of the web crawler.
In addition, this embodiment uses a counting Bloom filter of link characteristics, combined with multiple hashes, to perform joint deduplication on both the whole and the parts of each second URL link cached in the URL queue to be crawled, thereby reducing the misjudgment rate of the counting Bloom filter as far as possible, effectively improving the performance of the web crawler, enabling the web crawler to acquire the information people need quickly and accurately, and improving the user experience as much as possible.
In addition, in the deduplication process, the URL links are compressed based on a compression algorithm such as the MD5 algorithm, so that the occupation of storage space is reduced as much as possible.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a web crawler-based link extraction program is stored on the computer-readable storage medium, and when executed by a processor, the web crawler-based link extraction program implements the steps of the web crawler-based link extraction method as described above.
Referring to fig. 5, fig. 5 is a block diagram illustrating a first embodiment of a web crawler-based link extracting apparatus according to the present invention.
As shown in fig. 5, the web crawler-based link extraction apparatus according to the embodiment of the present invention includes: an extraction module 5001, a sending module 5002, a fetching module 5003, a parsing module 5004, a processing module 5005, and an extraction module 5006.
The extraction module 5001 is configured to, when a data fetch request for an agricultural product to be analyzed is received, extract from the data fetch request a first Uniform Resource Locator (URL) link of a platform to be accessed and the agricultural product topic information related to the agricultural product to be analyzed; the sending module 5002 is configured to send an access request to the platform to be accessed according to the first URL link; the fetching module 5003 is configured to fetch, after receiving a response made by the platform to be accessed according to the access request, the data information in the page corresponding to the first URL link; the parsing module 5004 is configured to parse the data information to obtain a second URL link embedded in the page, and add the second URL link to a URL queue to be crawled; the processing module 5005 is configured to process the first URL link and the second URL link in the URL queue to be crawled based on an anchor multiple attribute integration manner of path aggregation, to obtain multiple attribute topic information in a rich text format corresponding to the second URL link; and the extraction module 5006 is configured to compare the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the agricultural product topic information, and extract the second URL link corresponding to the multiple attribute topic information whose similarity to the agricultural product topic information meets a preset threshold.
It should be understood that each module referred to in this embodiment is a logical module. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, units that are not closely related to solving the technical problem proposed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
In addition, in order to facilitate understanding of a specific processing flow of each functional module in an actual application of the web crawler-based link extraction apparatus provided in this embodiment, the following specifically describes processing of the parsing module 5004, the processing module 5005, and the extraction module 5006.
Specifically, the operation performed by the parsing module 5004 of parsing the data information to obtain a second URL link embedded in the page and adding the second URL link to the URL queue to be crawled is implemented, in a specific application, substantially as follows:
firstly, parsing the data information to obtain a second URL link embedded in the page;
then, parsing the second URL link to obtain a normalized tag corresponding to the second URL link;
then, generating an abstract tree corresponding to the second URL link according to the normalized tag;
then, based on a DOM tree matching method, matching the node content of the abstract tree against the agricultural product topic information and removing unmatched node content, to obtain a second URL link matching the agricultural product topic information;
and finally, adding the second URL link matching the agricultural product topic information to the URL queue to be crawled.
It should be understood that the above is only a specific implementation manner for denoising the second URL link in the URL queue to be crawled, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the denoising method according to needs, and the present invention is not limited to this.
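The parsing and denoising flow above can be approximated with a much simpler sketch, given here only to make the data flow concrete. It does not build the abstract tree or apply DOM tree matching as described; instead it assumes that a link matches the topic when its anchor text contains one of the topic terms. The class LinkCollector, the function topic_links, the sample page, and the example.com URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from the <a> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def topic_links(html, base_url, topic_terms):
    """Keep only the links whose anchor text mentions a topic term (simplified matching)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    terms = [t.lower() for t in topic_terms]
    return [url for url, text in collector.links
            if any(t in text.lower() for t in terms)]


page = ('<ul><li><a href="/apple/price">Apple price trends</a></li>'
        '<li><a href="/login">Sign in</a></li></ul>')
queue = topic_links(page, "https://example.com/market", ["apple", "price"])
```

In this sketch the "Sign in" link is dropped as noise and only the topic-related link is added to the queue.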
In addition, the operation performed by the processing module 5005 of processing the first URL link and the second URL link in the URL queue to be crawled based on an anchor multiple attribute integration manner of path aggregation, to obtain multiple attribute topic information in a rich text format corresponding to the second URL link, is implemented, in a specific application, substantially as follows:
firstly, generating a path access directed graph corresponding to the agricultural product to be analyzed according to the first URL link and the second URL link in the URL queue to be crawled;
then, based on the anchor multiple attribute integration manner of path aggregation, determining the shortest access paths in the path access directed graph to obtain a shortest access path set;
then, determining the anchor text corresponding to each shortest access path in the shortest access path set to obtain an access path anchor text set corresponding to the shortest access path set, and assigning a weight to each element in the access path anchor text set;
next, normalizing the weight corresponding to each element in the access path anchor text set according to a preset weight normalization formula;
and finally, sorting the normalized weights in descending order to obtain the multiple attribute topic information in the rich text format corresponding to the second URL link.
It should be understood that, the above only is a specific implementation manner for obtaining the multiple attribute topic information in the rich text format corresponding to each second URL link in the URL queue to be crawled, and the technical solution of the present invention is not limited at all.
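The flow above can be sketched as follows, under several assumptions that are not stated in the embodiment: the path access directed graph is a dictionary of outgoing edges labelled with anchor text, a single shortest access path is found by breadth-first search, anchor texts closer to the target link receive larger raw weights, and the preset weight normalization formula is taken to be division by the sum of the weights. All names and URLs are hypothetical.

```python
from collections import deque


def shortest_path_anchors(graph, start, target):
    """Breadth-first search; return the anchor texts along one shortest path."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for dst, anchor in graph.get(node, []):
            if dst not in parent:
                parent[dst] = (node, anchor)
                queue.append(dst)
    if target not in parent:
        return []
    anchors = []
    node = target
    while parent[node] is not None:
        previous, anchor = parent[node]
        anchors.append(anchor)
        node = previous
    return anchors[::-1]  # ordered from the first URL link to the target


def topic_info(anchors):
    """Weight, normalize, and sort anchor texts in descending order of weight."""
    raw = {anchor: position + 1 for position, anchor in enumerate(anchors)}
    total = sum(raw.values())
    normalized = {anchor: weight / total for anchor, weight in raw.items()}
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)


root = "https://example.com/"
target = "https://example.com/fruit/apple"
graph = {
    root: [("https://example.com/fruit", "Fruit"),
           ("https://example.com/news", "News")],
    "https://example.com/fruit": [(target, "Apple prices")],
    "https://example.com/news": [("https://example.com/news/1", "Daily")],
    "https://example.com/news/1": [(target, "Apple report")],
}
info = topic_info(shortest_path_anchors(graph, root, target))
```

Here the two-hop path through "Fruit" is preferred over the three-hop path through "News", and the sorted list of (anchor text, normalized weight) pairs stands in for the multiple attribute topic information.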
In addition, the operation performed by the extraction module 5006 of comparing the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the agricultural product topic information, and extracting the second URL link corresponding to the multiple attribute topic information whose similarity to the agricultural product topic information meets a preset threshold, is implemented, in a specific application, substantially as follows:
firstly, extracting a multiple attribute topic feature word from the multiple attribute topic information, and performing hash processing on the multiple attribute topic feature word to obtain a first hash value, wherein the multiple attribute topic feature word is one element in the access path anchor text set corresponding to the multiple attribute topic information;
then, acquiring the weight corresponding to the multiple attribute topic feature word from the access path anchor text set, and quantizing the first hash value into a first vector in combination with the weight;
then, extracting an agricultural product topic feature word from the agricultural product topic information, and performing hash processing on the agricultural product topic feature word to obtain a second hash value;
then, quantizing the second hash value into a second vector according to a weight preset for the agricultural product topic feature word;
and finally, comparing the first vector with the second vector, and extracting the second URL link corresponding to the multiple attribute topic information whose similarity to the agricultural product topic information meets the preset threshold.
It should be understood that the above is only a specific implementation manner of extracting a specific link from a URL queue to be crawled, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the method according to needs, and the present invention is not limited to this.
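The hashing, weighting, and comparison steps above resemble a SimHash-style comparison, and the following sketch treats them that way. This is an assumption: the embodiment does not name the hash function, the vector length, or the similarity measure. Here MD5 truncated to 64 bits is the hash, each bit contributes plus or minus the word weight, and similarity is one minus the normalized Hamming distance between the resulting fingerprints. All names, weights, and URLs are hypothetical.

```python
import hashlib

BITS = 64


def word_hash(word):
    """64-bit hash of a feature word (truncated MD5, an assumption)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) & ((1 << BITS) - 1)


def weighted_vector(weighted_words):
    """Quantize hashed feature words into one vector, signed by bit and scaled by weight."""
    vector = [0.0] * BITS
    for word, weight in weighted_words.items():
        h = word_hash(word)
        for bit in range(BITS):
            vector[bit] += weight if (h >> bit) & 1 else -weight
    return vector


def fingerprint(vector):
    """Collapse a weighted vector into a bit fingerprint."""
    return sum(1 << bit for bit, value in enumerate(vector) if value > 0)


def similarity(words_a, words_b):
    """One minus the normalized Hamming distance between the two fingerprints."""
    fa = fingerprint(weighted_vector(words_a))
    fb = fingerprint(weighted_vector(words_b))
    return 1 - bin(fa ^ fb).count("1") / BITS


def extract_links(candidates, topic_words, threshold):
    """Keep the URL links whose topic information is similar enough to the topic."""
    return [url for url, words in candidates.items()
            if similarity(words, topic_words) >= threshold]


topic = {"apple": 0.6, "price": 0.4}
candidates = {
    "https://example.com/fruit/apple": {"apple": 0.6, "price": 0.4},
    "https://example.com/login": {"account": 1.0},
}
selected = extract_links(candidates, topic, threshold=0.9)
```

Identical weighted feature words always yield a similarity of 1.0, so the first candidate is retained; how the threshold should be set for partially overlapping topics is left open by the embodiment.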
As can be seen from the above description, the web crawler-based link extraction device provided in this embodiment processes the first URL link of the platform to be accessed and the second URL link in the URL queue to be crawled based on an anchor multiple attribute integration manner of path aggregation, to obtain multiple attribute topic information in a rich text format corresponding to the second URL link. It then compares the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the agricultural product topic information, and extracts the second URL link corresponding to the multiple attribute topic information whose similarity to the agricultural product topic information meets a preset threshold. This effectively ensures the accuracy of extracting specific URL links and avoids the waste of resources caused by the web crawler crawling irrelevant links, thereby significantly improving the performance of the web crawler, enabling it to acquire the information people need quickly and accurately, and improving the user experience.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to the link deduplication method provided in any embodiment of the present invention, and are not described herein again.
Based on the first embodiment of the web crawler-based link extraction apparatus, a second embodiment of the web crawler-based link extraction apparatus of the present invention is provided.
In this embodiment, the web crawler-based link extraction apparatus further includes a deduplication module.
The deduplication module is configured to perform joint deduplication on the second URL links in the URL queue to be crawled by using a counting Bloom filter of link characteristics in combination with multiple hashes.
It should be noted that all the modules involved in this embodiment are logical modules. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, units that are not closely related to solving the technical problem proposed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
In addition, it is worth mentioning that, in this embodiment, when the deduplication module uses a counting Bloom filter of link characteristics in combination with multiple hashes to perform joint deduplication on the second URL links in the URL queue to be crawled, the joint deduplication specifically includes deduplication of the integral characteristic URL link corresponding to each URL link and deduplication of the URL link segments.
Since the URL link segments are obtained from the integral characteristic URL link, in order to ensure that the deduplication module can perform the above operations smoothly, the correspondence between the second URL link and the integral characteristic URL link needs to be determined first.
The correspondence between the second URL link and the integral characteristic URL link may be determined substantially as follows:
firstly, traversing the URL queue to be crawled, performing feature analysis on the traversed current second URL link, and extracting the protocol type part, the path part, and the query part of the current second URL link;
then, obtaining the integral characteristic URL link corresponding to the current second URL link according to the protocol type part, the path part, and the query part;
and finally, establishing the correspondence between the current second URL link and the integral characteristic URL link, and updating the correspondence into the URL queue to be crawled.
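The feature analysis above can be sketched with the standard URL parser. How the three parts are combined into the integral characteristic URL link is not specified in the embodiment, so joining them with "|" is an assumption, as are the function names and the sample URLs.

```python
from urllib.parse import urlsplit


def integral_characteristic_url(url):
    """Combine the protocol type, path, and query parts of a URL link."""
    parts = urlsplit(url)
    return "|".join((parts.scheme, parts.path, parts.query))


def build_correspondence(url_queue):
    """Map each second URL link in the queue to its integral characteristic URL link."""
    return {url: integral_characteristic_url(url) for url in url_queue}


url_queue = [
    "https://example.com/fruit/apple?variety=fuji#top",
    "https://example.com/fruit/pear",
]
correspondence = build_correspondence(url_queue)
```

Only the three parts named in the text are used here, so the host name and the fragment do not appear in the result; whether the host should also be included is left open by the embodiment.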
Accordingly, after the correspondence is obtained, the operations performed by the deduplication module are specifically as follows:
firstly, traversing the URL queue to be crawled, and acquiring the integral characteristic URL link corresponding to the traversed current second URL link;
then, performing an integral duplicate check on the integral characteristic URL link by using the counting Bloom filter of link characteristics, to obtain a duplicate checking mark corresponding to the integral characteristic URL link;
then, performing feature identification on the integral characteristic URL link according to the duplicate checking mark, to obtain a plurality of feature segments;
next, recombining the plurality of feature segments according to a preset URL link recombination rule to obtain N recombined URL link segments;
then, performing multiple hash duplicate checking on the N recombined URL link segments to obtain the duplicate checking result corresponding to the current second URL link;
and finally, retaining or discarding the second URL link in the URL queue to be crawled according to the duplicate checking result.
In this embodiment, N is an integer of 1 or more.
In addition, it should be understood that the above is only one specific implementation manner of determining the correspondence between a second URL link and an integral characteristic URL link, and of using a counting Bloom filter of link characteristics combined with multiple hashes to perform joint deduplication on the second URL links in the URL queue to be crawled, and does not limit the technical solution of the present invention in any way.
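For the integral duplicate check, a generic counting Bloom filter can be sketched as follows. This is a textbook structure and not necessarily the "counting Bloom filter of link characteristics" of the embodiment; the table size, the number of hash functions, the salted-MD5 hash family, and the boolean form of the duplicate checking mark are all assumptions.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose slots are counters, so that items can also be removed."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        return [
            int(hashlib.md5(f"{i}:{item}".encode("utf-8")).hexdigest(), 16) % self.size
            for i in range(self.num_hashes)
        ]

    def __contains__(self, item):
        return all(self.counters[p] > 0 for p in self._positions(item))

    def add(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def remove(self, item):
        # Only decrement when the item appears to be present.
        if item in self:
            for p in self._positions(item):
                self.counters[p] -= 1


def duplicate_marks(integral_urls):
    """Return one duplicate checking mark per integral characteristic URL link."""
    cbf = CountingBloomFilter()
    marks = []
    for url in integral_urls:
        seen = url in cbf
        marks.append(seen)
        if not seen:
            cbf.add(url)
    return marks


marks = duplicate_marks([
    "https|/fruit/apple|variety=fuji",
    "https|/fruit/pear|",
    "https|/fruit/apple|variety=fuji",
])
```

A counting Bloom filter never misses an item that was added, but it can report an unseen item as present; the segment-level multiple hash check described in this embodiment is the part intended to reduce that misjudgment rate.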
Further, in practical applications, in order to minimize the storage space occupied by the second URL links cached in the URL queue to be crawled, after the plurality of feature segments are recombined according to the preset URL link recombination rule to obtain N recombined URL link segments, the N recombined URL link segments may each be compressed based on the MD5 algorithm to obtain the character string ciphertexts corresponding to the N recombined URL link segments, and the character string ciphertexts then replace the content of the corresponding recombined URL link segments.
Correspondingly, the operation of performing multiple hash duplicate checking on the N recombined URL link segments to obtain the duplicate checking result corresponding to the current second URL link specifically includes:
firstly, extracting the character string ciphertexts corresponding to the N recombined URL link segments, selecting any one of the N character string ciphertexts, and performing hash processing on it K times to obtain K hash values;
then, hashing the K hash values into a pre-constructed bit vector space to serve as reference hash values, and setting an initial count value for the space variable counter corresponding to each reference hash value;
then, performing hash processing K times on each of the remaining N-1 character string ciphertexts to obtain K hash values corresponding to each remaining character string ciphertext;
then, randomly hashing the K hash values corresponding to each remaining character string ciphertext into the bit vector space, such that each of these hash values is adjacent to one of the reference hash values;
then, for each hash value newly hashed into the bit vector space, inserting a preset character in front of the initial count value corresponding to the adjacent reference hash value by using a head insertion method;
and finally, counting the number of preset characters in front of the initial count value corresponding to each reference hash value, and determining the duplicate checking result corresponding to the current second URL link according to the number of preset characters.
In this embodiment, K is an integer of 2 or more.
In addition, it should be understood that the above is only a specific implementation manner for obtaining the duplicate checking result corresponding to the current second URL link, and the technical solution of the present invention is not limited at all, and in a specific application, a person skilled in the art may set the duplicate checking result as needed, and the present invention is not limited to this.
In addition, in practical application, in order to further reduce the occupation of a storage space, after the second URL links in the URL queue to be crawled are subjected to joint duplicate removal, each second URL link in the URL queue to be crawled after the duplicate removal can be compressed based on an MD5 algorithm, so as to obtain a character string ciphertext corresponding to each second URL link; and finally, replacing the content in the corresponding second URL link with the character string ciphertext, so that the second URL link in the URL queue to be crawled is compressed as much as possible, and the occupation of a storage space is reduced.
It can be easily seen from the above description that the link extracting device based on web crawlers provided by this embodiment performs deduplication operation on the second URL link in the URL queue to be crawled before performing extraction operation on the second URL link in the URL queue to be crawled, thereby further reducing unnecessary interference in the link extraction process and improving the extraction efficiency of the web crawlers.
In addition, this embodiment uses a counting Bloom filter of link characteristics, combined with multiple hashes, to perform joint deduplication on both the whole and the parts of each second URL link cached in the URL queue to be crawled, thereby reducing the misjudgment rate of the counting Bloom filter as far as possible, effectively improving the performance of the web crawler, enabling the web crawler to acquire the information people need quickly and accurately, and improving the user experience as much as possible.
In addition, in the deduplication process, the URL links are compressed based on a compression algorithm such as the MD5 algorithm, so that the occupation of storage space is reduced as much as possible.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to the link deduplication method provided in any embodiment of the present invention, and are not described herein again.
Further, it is to be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.