CN110413861B - Link extraction method, device, equipment and storage medium based on web crawler - Google Patents

Info

Publication number
CN110413861B
Authority
CN
China
Prior art keywords
url
link
url link
crawled
queue
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910670515.5A
Other languages
Chinese (zh)
Other versions
CN110413861A (en)
Inventor
郑禄
王锦群
雷建云
毛腾跃
刘晶
马尧
刘越
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Application filed by South Central University for Nationalities
Priority to CN201910670515.5A
Publication of CN110413861A
Application granted
Publication of CN110413861B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to the technical field of the internet and discloses a link extraction method, device, equipment and storage medium based on a web crawler. In the method, a first URL link of a platform to be accessed and the second URL links in a URL queue to be crawled are processed in an anchor multiple attribute integration mode based on path aggregation, to obtain multiple attribute topic information in a rich text format corresponding to each second URL link. The multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled is then compared with the agricultural product topic information, and the second URL links whose multiple attribute topic information has a similarity to the agricultural product topic information that meets a preset threshold are extracted. This safeguards the accuracy with which specific URL links are extracted and avoids the resources a web crawler would otherwise waste crawling irrelevant links, so that the performance of the web crawler improves, the web crawler can obtain the information people need quickly and accurately, and user experience is improved.

Description

Link extraction method, device, equipment and storage medium based on web crawler
Technical Field
The invention relates to the technical field of internet, in particular to a link extraction method, a device, equipment and a storage medium based on web crawlers.
Background
As web pages become more diverse and complex in form and content, the amount of information they present varies widely. Rather than processing all page links, the web crawler selects only the links relevant to the topic for crawling. The web crawler should therefore judge in advance whether a link is related to the topic before extracting page links. Common topic-related link extraction methods currently include: rule-based related link extraction, related link extraction based on web page blocks and the Document Object Model (DOM) tree, and machine learning-based related link extraction.
The rule-based related link extraction method can effectively remove noise links. In practical applications, however, each website is written by different authors, so web page links often do not conform to the established extraction rules. The method therefore generalizes poorly and cannot maintain high accuracy, which seriously affects the performance of the web crawler.
The related link extraction method based on web page blocks and the DOM tree can extract topic-related links. However, it has to parse the web page into blocks through the DOM tree according to the layout characteristics of the website, assign different importance weights to web page blocks at different positions, and judge whether a link is relevant by combining the page title, the page topic and the anchor text of the link. The implementation is therefore relatively complex and its accuracy depends on the assigned importance weights, which seriously affects the performance of the web crawler.
The machine learning-based related link extraction method can extract the related links of a page more accurately. However, collecting the training data set in the early stage is too costly and the extraction approach does not extend well, which seriously affects the performance of the web crawler.
Therefore, a topic-related link extraction method based on a web crawler is urgently needed to improve the performance of the web crawler, so that the web crawler can quickly and accurately acquire the information people need and thereby improve user experience.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a link extraction method, a link extraction device, link extraction equipment and a storage medium based on a web crawler, aiming at improving the performance of the web crawler by optimizing a link extraction mode, so that the web crawler can be ensured to quickly and accurately acquire information required by people, and the user experience is improved.
In order to achieve the above object, the present invention provides a web crawler-based link extraction method, including the following steps:
when a data grabbing request of an agricultural product to be analyzed is received, extracting a first Uniform Resource Locator (URL) link of a platform to be accessed and agricultural product subject information related to the agricultural product to be analyzed from the data grabbing request;
according to the first URL link, sending an access request to the platform to be accessed;
after receiving a response made by the platform to be accessed according to the access request, capturing data information in a page corresponding to the first URL link;
analyzing the data information to obtain a second URL link embedded in the page, and adding the second URL link to a URL queue to be crawled;
processing the first URL link and a second URL link in the URL queue to be crawled based on an anchor multiple attribute integration mode of path aggregation to obtain multiple attribute theme information in a rich text format corresponding to the second URL link;
and respectively comparing the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the topic information of the agricultural product, and extracting the second URL link corresponding to the multiple attribute topic information of which the similarity with the topic information of the agricultural product meets a preset threshold value.
Preferably, the step of analyzing the data information to obtain a second URL link embedded in the page, and adding the second URL link to a URL queue to be crawled includes:
analyzing the data information to obtain a second URL link embedded in the page;
analyzing the second URL link to obtain a normalized tag corresponding to the second URL link;
generating an abstract tree corresponding to the second URL link according to the normalized tag;
matching the node content of the abstract tree with the topic information of the agricultural product based on a DOM tree matching method, removing unmatched node content, and obtaining a second URL link matched with the topic information of the agricultural product;
and adding a second URL link matched with the agricultural product topic information to a URL queue to be crawled.
Preferably, the step of processing the first URL link and the second URL link in the URL queue to be crawled to obtain multiple attribute topic information in a rich text format corresponding to the second URL link in the anchor multiple attribute integration manner based on path aggregation includes:
generating a path access directed graph corresponding to the agricultural product to be analyzed according to the first URL link and a second URL link in the URL queue to be crawled;
determining the shortest access path in the path access directed graph based on an anchor multiple attribute integration mode of path aggregation to obtain a shortest access path set;
determining an anchor text corresponding to each shortest access path in the shortest access path set to obtain an access path anchor text set corresponding to the shortest access path set, and allocating a weight to each element in the access path anchor text set;
standardizing the weight corresponding to each element in the access path anchor text set according to a preset weight standardization formula;
and sorting the normalized weights in a descending order to obtain multiple attribute theme information in the rich text format corresponding to the second URL link.
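For ease of understanding, the steps above can be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions, not the patented implementation: the graph layout (`{page: {next_page: anchor_text}}`), the function names `shortest_paths` and `anchor_topic_info`, the weight assignment (one count per occurrence of an anchor text on a shortest path) and the normalization (dividing each weight by the sum of all weights) are all hypothetical, since the preset weight standardization formula is not reproduced here.

```python
from collections import defaultdict, deque


def shortest_paths(graph, source, target):
    """Return every shortest path from source to target, found with BFS.

    graph maps a page to {linked_page: anchor_text}. Loops in the graph are
    harmless because a page is only expanded at its first (shortest) distance.
    """
    dist = {source: 0}
    parents = defaultdict(list)
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, {}):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                parents[nxt].append(node)  # node lies on a shortest path to nxt
    if target not in dist:
        return []

    paths = []

    def build(node, suffix):
        # Walk back from the target to the source along shortest-path parents.
        if node == source:
            paths.append([source] + suffix)
            return
        for parent in parents[node]:
            build(parent, [node] + suffix)

    build(target, [])
    return paths


def anchor_topic_info(graph, source, target):
    """Collect anchor texts on all shortest paths, weight, normalize and sort them."""
    counts = defaultdict(float)
    for path in shortest_paths(graph, source, target):
        for page, nxt in zip(path, path[1:]):
            counts[graph[page][nxt]] += 1.0  # assumed weight: occurrence count
    total = sum(counts.values())
    weights = {text: c / total for text, c in counts.items()} if total else {}
    # Descending order of normalized weight.
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


# Hypothetical graph: source page u, two category pages, three destination pages.
graph = {
    "u": {"tea_type": "tea type", "tea_price": "tea price"},
    "tea_type": {"v1": "green tea", "v2": "black tea"},
    "tea_price": {"v1": "green tea", "v3": "premium tea"},
    "v1": {"u": "home"},  # a link back to the source forms a loop
}
topic_info = anchor_topic_info(graph, "u", "v1")
```

In this hypothetical graph, two shortest paths lead from `u` to `v1`, so the anchor text "green tea" occurs twice and receives the largest normalized weight.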
Preferably, the step of comparing the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the topic information of the agricultural product respectively, and the step of extracting the second URL link corresponding to the multiple attribute topic information whose similarity to the topic information of the agricultural product meets a preset threshold include:
extracting multiple attribute theme characteristic words from the multiple attribute theme information, and performing hash processing on the multiple attribute theme characteristic words to obtain a first hash value, wherein the multiple attribute theme characteristic words are one element in an access path anchor text set corresponding to the multiple attribute theme information;
acquiring weights corresponding to the multiple attribute topic feature words from the access path anchor text set, and quantizing the first hash value into a first vector by combining the weights;
extracting an agricultural product subject characteristic word from the agricultural product subject information, and carrying out hash processing on the agricultural product subject characteristic word to obtain a second hash value;
quantizing the second hash value into a second vector according to a preset weight for the agricultural product topic feature word;
and comparing the first vector with the second vector, and extracting a second URL link corresponding to the multiple attribute topic information of which the similarity with the topic information of the agricultural product meets a preset threshold value.
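For ease of understanding, the hashing, weighting and comparison steps above resemble a SimHash-style scheme and can be sketched in Python roughly as follows. This is a sketch under assumptions rather than the method fixed by the invention: the 64-bit hash width, the use of SHA-256, the way a hash value is quantized into a vector (plus the weight for a 1 bit, minus the weight for a 0 bit), the cosine similarity measure, the threshold of 0.8 and all function names are hypothetical.

```python
import hashlib
import math

BITS = 64  # assumed hash width; not specified by the source


def hash_word(word):
    """Map a feature word to a stable 64-bit integer."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def weighted_vector(weighted_words):
    """Quantize hashed feature words into one vector.

    weighted_words maps feature word -> weight. For every bit position the
    weight is added when the bit is 1 and subtracted when the bit is 0.
    """
    vector = [0.0] * BITS
    for word, weight in weighted_words.items():
        value = hash_word(word)
        for i in range(BITS):
            vector[i] += weight if (value >> i) & 1 else -weight
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_relevant(link_words, topic_words, threshold=0.8):
    """Keep a link when its vector is similar enough to the topic vector."""
    first = weighted_vector(link_words)    # from the anchor text set weights
    second = weighted_vector(topic_words)  # from the preset topic weights
    return cosine_similarity(first, second) >= threshold
```

A link whose feature words and weights are, for example, `{"green tea": 0.5, "tea type": 0.25}` would be compared against topic words such as `{"green tea": 1.0}` by calling `is_relevant` on the two dictionaries.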
Preferably, before the step of performing feature extraction on the second URL link in the URL queue to be crawled in the anchor multiple attribute integration manner based on path aggregation, the method further includes:
and adopting a counting bloom filter with link characteristics and combining multiple Hash to perform combined duplicate removal on the second URL links in the URL queue to be crawled, so that any two second URL links in the URL queue to be crawled are different.
Preferably, before the step of performing joint deduplication on the second URL link in the URL queue to be crawled by using a counting bloom filter of a link feature and combining multiple hashes, the method further includes:
traversing the URL queue to be crawled, performing feature analysis on the currently traversed second URL link, and extracting a protocol type part, a path part and a query part of the current second URL link;
obtaining an integral characteristic URL link corresponding to the current second URL link according to the protocol type part, the path part and the query part;
and establishing a corresponding relation between the current second URL link and the integral characteristic URL link, and updating the corresponding relation into the URL queue to be crawled.
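For ease of understanding, the feature analysis above can be sketched with the Python standard library roughly as follows. The sketch rests on assumptions that the source does not confirm: the host name is treated as belonging to the path part, the query parameters are sorted so that equivalent links yield the same integral characteristic URL link, and the function names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def url_features(url):
    """Split a URL into (protocol type part, path part, query part)."""
    parts = urlsplit(url)
    protocol = parts.scheme.lower()
    # Assumption: host and path together form the "path part".
    path = parts.netloc.lower() + (parts.path or "/")
    # Assumption: sorting the parameters makes equivalent queries identical.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return protocol, path, query


def feature_url(url):
    """Reassemble the three parts into one normalized feature URL."""
    protocol, path, query = url_features(url)
    base = f"{protocol}://{path}"
    return f"{base}?{query}" if query else base


def attach_feature_urls(queue):
    """Record the correspondence between each queued link and its feature URL."""
    return {url: feature_url(url) for url in queue}
```

With these assumptions, `HTTP://Shop.Example.com/tea?b=2&a=1` and `http://shop.example.com/tea?a=1&b=2` map to the same feature URL, which is what makes the later duplicate check possible.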
Preferably, the step of performing joint deduplication on the second URL link in the URL queue to be crawled by using a counting bloom filter of a link feature and combining multiple hashes includes:
traversing the URL queue to be crawled, and acquiring an integral characteristic URL link corresponding to a traversed current second URL link;
carrying out integral duplicate checking on the integral characteristic URL link by adopting a counting bloom filter of the link characteristic to obtain a duplicate checking mark corresponding to the integral characteristic URL link;
according to the duplicate checking mark, carrying out feature identification on the integral feature URL link to obtain a plurality of feature segments;
recombining the plurality of characteristic segments according to a preset URL link recombination rule to obtain N recombined URL link segments, wherein N is an integer greater than or equal to 1;
performing multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link;
and according to the duplicate checking result, reserving or discarding the second URL link in the URL queue to be crawled.
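For ease of understanding, the two-stage duplicate check above can be sketched in Python roughly as follows. This is a simplified sketch and not the scheme fixed by the invention: the preset URL link recombination rule is not given here, so the second stage simply hashes the whole feature URL with two different hash functions as a stand-in for hashing N recombined segments. The class name, the filter size, the number of hash positions and the choice of SHA-256 and BLAKE2b are all assumptions.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with integer counters, so that items can also be removed."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only decrement when the item appears to be present.
        if item in self:
            for pos in self._positions(item):
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


def multi_hash(feature_url):
    """Second-stage fingerprint built from two independent hash functions."""
    data = feature_url.encode("utf-8")
    return (
        hashlib.sha256(data).hexdigest(),
        hashlib.blake2b(data, digest_size=16).hexdigest(),
    )


def deduplicate(feature_urls):
    """Keep the first occurrence of every feature URL, in order."""
    bloom = CountingBloomFilter()
    confirmed = set()
    kept = []
    for url in feature_urls:
        possibly_seen = url in bloom  # stage 1: may give false positives
        if possibly_seen and multi_hash(url) in confirmed:  # stage 2: confirm
            continue  # duplicate: discard
        bloom.add(url)
        confirmed.add(multi_hash(url))
        kept.append(url)
    return kept
```

The counting filter answers "definitely new" cheaply for most links, and the multiple-hash stage is only consulted when the filter reports a possible duplicate, which guards against the filter's false positives.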
In addition, in order to achieve the above object, the present invention further provides a web crawler-based link extraction apparatus, including:
the extraction module is used for extracting, when a data grabbing request of an agricultural product to be analyzed is received, a first Uniform Resource Locator (URL) link of a platform to be accessed and topic information related to the agricultural product to be analyzed from the data grabbing request;
the sending module is used for sending an access request to the platform to be accessed according to the first URL link;
the grabbing module is used for grabbing data information in a page corresponding to the first URL link after receiving a response made by the platform to be accessed according to the access request;
the analysis module is used for analyzing the data information to obtain a second URL link embedded in the page and adding the second URL link to a URL queue to be crawled;
the processing module is used for processing the first URL link and a second URL link in the URL queue to be crawled based on an anchor multiple attribute integration mode of path aggregation to obtain multiple attribute theme information in a rich text format corresponding to the second URL link;
and the extraction module is used for respectively comparing the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the topic information of the agricultural product and extracting the second URL link corresponding to the multiple attribute topic information of which the similarity with the topic information of the agricultural product meets a preset threshold value.
In addition, in order to achieve the above object, the present invention further provides a web crawler-based link extraction device, including: a memory, a processor, and a web crawler-based link extraction program stored on the memory and operable on the processor, the web crawler-based link extraction program being configured to implement the steps of the web crawler-based link extraction method as described above.
In addition, to achieve the above object, the present invention further provides a computer readable storage medium, which stores a web crawler-based link extraction program, and when the web crawler-based link extraction program is executed by a processor, the computer readable storage medium implements the steps of the web crawler-based link extraction method as described above.
The web crawler-based link extraction scheme provided by the invention processes the first URL link of the platform to be accessed and the second URL links in the URL queue to be crawled in an anchor multiple attribute integration mode based on path aggregation, to obtain multiple attribute topic information in a rich text format corresponding to each second URL link. It then compares the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the agricultural product topic information, and extracts the second URL links whose multiple attribute topic information has a similarity to the agricultural product topic information that meets a preset threshold. This safeguards the accuracy with which specific URL links are extracted and avoids the resources a web crawler would otherwise waste crawling irrelevant links, so that the performance of the web crawler improves, the web crawler can acquire the information people need quickly and accurately, and user experience is improved.
Drawings
FIG. 1 is a schematic structural diagram of a web crawler-based link extraction device of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a web crawler-based link extraction method according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of a path access directed graph in a first embodiment of a web crawler-based link extraction method according to the present invention;
FIG. 4 is a flowchart illustrating a web crawler-based link extraction method according to a second embodiment of the present invention;
FIG. 5 is a block diagram of a first embodiment of a web crawler-based link extracting apparatus according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a web crawler-based link extraction device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the web crawler-based link extraction device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the architecture shown in FIG. 1 does not constitute a limitation of web crawler-based link extraction devices, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, a user interface module, and a web crawler-based link extraction program.
In the web crawler-based link extraction device shown in fig. 1, the network interface 1004 is mainly used for data communication with a web server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 may be disposed in the web crawler-based link extraction device, which calls the web crawler-based link extraction program stored in the memory 1005 through the processor 1001 and executes the web crawler-based link extraction method provided by the embodiment of the present invention.
An embodiment of the present invention provides a link extraction method based on a web crawler, and referring to fig. 2, fig. 2 is a schematic flow diagram of a first embodiment of a link extraction method based on a web crawler according to the present invention.
In this embodiment, the link extraction method based on web crawlers includes the following steps:
Step S10, when a data grabbing request of an agricultural product to be analyzed is received, extracting a first Uniform Resource Locator (URL) link of a platform to be accessed and agricultural product topic information related to the agricultural product to be analyzed from the data grabbing request.
Specifically, the execution subject of this embodiment is any terminal device on which a web crawler system is deployed or installed.
It should be noted that, in order to maximize the speed of operations such as capturing and analyzing the data corresponding to the agricultural product to be analyzed, the web crawler system described in this embodiment is preferably a distributed web crawler system.
In addition, it should be understood that, in practical applications, the terminal device may be a client device or a server device, and is not limited herein.
In addition, the platform to be accessed can be a network mall displaying agricultural products to be analyzed in practical application.
Accordingly, the Uniform Resource Locator (URL) is a network address required for accessing the network mall.
In addition, it should be understood that the agricultural products to be analyzed are only a general term for various common agricultural products at present, and in practical applications, the agricultural products to be analyzed may be tea products, fruit and vegetable products, food products, and the like, which are not listed here, and no limitation is made thereto.
For ease of understanding, this example uses tea product as the agricultural product to be analyzed.
Correspondingly, the agricultural product topic information is the main information of the tea product. In practical applications, it may specifically include characteristic information related to the tea product to be analyzed, for example, restricting the kind of tea product to green tea, the season of the tea product to before Qingming, and the price of the tea product to 500/kg to 1000/kg. The possibilities are not listed one by one here, and no limitation is made thereto.
Step S20, sending an access request to the platform to be accessed according to the first URL link.
Specifically, in practical applications, the web crawler may send an access request to the platform to be accessed (substantially, a server of the platform) by using a HyperText Transfer Protocol (HTTP) that transmits data based on a Transmission Control Protocol/Internet Protocol (TCP/IP).
It should be understood that the above is only a specific implementation manner of sending the access request to the platform to be accessed, and the technical solution of the present invention is not limited at all, and in practical applications, those skilled in the art may set the implementation manner as needed, and the implementation manner is not limited herein.
Step S30, after receiving a response from the platform to be accessed according to the access request, capturing data information in a page corresponding to the first URL link.
It should be understood that, in practical applications, if the access request sent to the platform to be accessed is successful, and the platform to be accessed successfully verifies the first URL link carried in the access request, a successful response is made, and the data information in the page corresponding to the first URL link is fed back. At this time, the web crawler may capture the data information in the page corresponding to the first URL link, which is fed back by the platform to be accessed.
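For ease of understanding, steps S20 and S30 can be sketched with the Python standard library roughly as follows. This is only one possible way of sending the access request and capturing the returned page data; the function name `fetch_page`, the `User-Agent` value and the timeout are assumptions, and a failed request is simply reported as `None`.

```python
from urllib.request import Request, urlopen


def fetch_page(url, timeout=10.0):
    """Send an access request for url and return the page text, or None on failure."""
    # The User-Agent string is a placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    try:
        with urlopen(request, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

A caller would pass the first URL link to `fetch_page` and hand the returned text to the analysis in the next step, skipping the link when the result is `None`.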
Step S40, analyzing the data information to obtain a second URL link embedded in the page, and adding the second URL link to a URL queue to be crawled.
It should be understood that, in practical applications, besides the data information of the agricultural product to be analyzed, the page corresponding to the first URL link may also display a plurality of URL links related to that data information; for ease of distinction, these are referred to herein as second URL links.
For example, the page corresponding to the first URL link displays the homepage of a network mall that includes the agricultural product to be analyzed. The homepage mainly displays information on four major categories of agricultural products, namely agricultural product A, agricultural product B, agricultural product C and agricultural product D. Each category corresponds to a second URL link, and the page corresponding to that second URL link mainly displays the subcategories of the corresponding agricultural product.
For example, agricultural products A-1, A-2 and A-3 are mainly displayed in the page corresponding to the second URL link of agricultural product A; agricultural products B-1 and B-2 are mainly displayed in the page corresponding to the second URL link of agricultural product B; agricultural products C-1, C-2, C-3 and C-4 are mainly displayed in the page corresponding to the second URL link of agricultural product C; and agricultural products D-1 and D-2 are mainly displayed in the page corresponding to the second URL link of agricultural product D.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can make settings according to needs, and the present invention is not limited herein.
In addition, in this embodiment, the second URL links embedded in the page are added to the URL queue to be crawled because, in practical applications, the web crawler crawls a large amount of data and the number of second URL links obtained through analysis is correspondingly large. Crawling and analyzing each second URL link takes considerable time, so a large number of second URL links cannot be visited in a short period, and the second URL links obtained each time need to be added to the URL queue to be crawled.
In addition, the "first" of the "first URL link" and the "second" of the "second URL link" are only used for distinguishing the URL link corresponding to the platform to be visited from the URL link embedded in the page corresponding to the URL link, and do not limit the URL link itself. In practical applications, any "second URL link" may be regarded as a "first URL link" with respect to the URL link embedded in the corresponding page.
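For ease of understanding, obtaining the second URL links embedded in a page and adding them to the URL queue to be crawled can be sketched in Python roughly as follows. The class and function names and the example addresses are hypothetical, and the sketch only looks at the `href` attribute of `a` tags; relative links are resolved against the first URL link.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute target of every hyperlink in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def enqueue_links(first_url, html, queue):
    """Parse html fetched from first_url and append its links to queue."""
    collector = LinkCollector(first_url)
    collector.feed(html)
    collector.close()
    queue.extend(collector.links)
    return queue


page = '<a href="/tea/green">Green tea</a><a href="https://other.example/x">x</a><a>no link</a>'
to_crawl = enqueue_links("https://shop.example/", page, deque())
```

Because any second URL link can later play the role of a first URL link, the same function can be applied again to each page taken from the queue.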
In addition, it is worth mentioning that in practical applications, the page corresponding to a second URL link may contain interference information, such as advertisements in various formats (pictures, audio, video, etc.), in addition to information related to the agricultural product to be analyzed. Therefore, to simplify the structure of the page corresponding to the second URL link as much as possible and make it easier for the web crawler to crawl data, the second URL links can be denoised when they are obtained and added to the URL queue to be crawled.
For ease of understanding, the present embodiment provides a specific denoising method, which is roughly as follows:
(1) Analyzing the data information to obtain a second URL link embedded in the page.
(2) Analyzing the second URL link to obtain a normalized tag corresponding to the second URL link.
Specifically, in practical applications, what is normalized here is essentially the tags in the page corresponding to the second URL link, since current web pages are typically written in HyperText Markup Language (HTML).
In addition, since noise links in practical applications usually appear in picture tags, in the tags that define hyperlinks, and in the URLs that specify hyperlink targets, only such tags need to be normalized.
(3) Generating an abstract tree corresponding to the second URL link according to the normalized tag.
(4) Matching the node content of the abstract tree with the agricultural product topic information based on a DOM tree matching method, removing unmatched node content, and obtaining a second URL link matched with the agricultural product topic information.
Specifically, since the abstract tree is generated from the normalized tags, each node of the abstract tree is essentially a normalized tag. Therefore, when the node content of the abstract tree is matched with the agricultural product topic information, the keywords in the node and the keywords in the agricultural product topic information are extracted, and the two sets of keywords are compared to determine whether the node needs to be removed. After the content of every node in the abstract tree has been matched against the agricultural product topic in this way, the noise links are removed and the second URL links matched with the agricultural product topic information are obtained.
(5) Adding the second URL link matched with the agricultural product topic information to the URL queue to be crawled.
It should be understood that the embodiment is only a specific denoising method, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art may set the denoising method according to needs, and the present invention is not limited herein.
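The denoising steps above can be illustrated with a minimal sketch. It is not the embodiment's DOM tree matching method: it assumes that only hyperlink tags are of interest, flattens the abstract tree to a list of anchor nodes, and treats a node as matching when its anchor text shares at least one word with the topic keywords. The class name `AnchorCollector`, the function `denoise_links`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> tags; other tags are ignored."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def denoise_links(html, topic_keywords):
    """Keep only links whose anchor text contains at least one topic keyword."""
    collector = AnchorCollector()
    collector.feed(html)
    topic = {keyword.lower() for keyword in topic_keywords}
    kept = []
    for href, text in collector.links:
        if set(text.lower().split()) & topic:   # keyword overlap means "matched"
            kept.append(href)
    return kept


# Hypothetical page: two topic links and one advertisement link
page = (
    '<div><a href="/tea/types">Tea types</a>'
    '<a href="/ads/1"><img src="banner.png">Win a prize</a>'
    '<a href="/tea/price">Tea price today</a></div>'
)
queue_to_crawl = denoise_links(page, ["tea"])   # ["/tea/types", "/tea/price"]
```

A fuller implementation would keep the tree structure and compare extracted keywords node by node, as the embodiment describes; the flat list is used here only to keep the example short.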
Step S50, processing the first URL link and the second URL link in the URL queue to be crawled based on an anchor multiple attribute integration mode of path aggregation to obtain multiple attribute topic information in a rich text format corresponding to the second URL link.
In order to facilitate understanding of the above operation of obtaining multiple attribute topic information in the rich text format corresponding to each of the second URL links, a specific implementation manner is given below, which is approximately as follows:
(1) Generating a path access directed graph corresponding to the agricultural product to be analyzed according to the first URL link and the second URL link in the URL queue to be crawled.
Specifically, each vertex of the path access directed graph is a page corresponding to a URL link, which is specifically described by taking fig. 3 as an example.
As shown in fig. 3, the source web page u is substantially a page corresponding to the first URL link, the tea type and the tea price are pages corresponding to two embedded second URL links parsed from the data information of the page, and the destination pages v1, v2 and v3 are second URL links of a next page parsed from the pages corresponding to the two second URL links.
(2) Determining the shortest access paths in the path access directed graph based on an anchor multiple attribute integration mode of path aggregation to obtain a shortest access path set.
Specifically, in practical applications, multiple paths may exist in a path access directed graph, and a shortest path among them may be either a loop-free (non-closed) path or a looped (closed) path.
For ease of distinction, in practical applications, different sets may be used to represent looped and loop-free shortest paths.
For ease of understanding, a specific representation of the set of shortest loop-free paths is given below.
For example, the M shortest loop-free paths from the source web page to the target page may be represented by the following set:
Figure BDA0002140259220000111
It should be understood that the above is only an example and does not limit the technical solution of the present invention in any way; in practical applications, those skilled in the art can make settings as needed, and the present invention is not limited herein.
(3) Determining an anchor text corresponding to each shortest access path in the shortest access path set to obtain an access path anchor text set corresponding to the shortest access path set, and allocating a weight to each element in the access path anchor text set.
When a weight is assigned to each element in the access path anchor text set, the following specific steps may be performed:
First, it is stipulated that P_m is a shortest loop-free path, where the value range of m satisfies:
Figure BDA0002140259220000121
Then, it is stipulated that w(P_m) ≤ w(P_{m+1}), where the value range of m satisfies:
Figure BDA0002140259220000122
Next, it is stipulated that w(P_m) ≤ w(P), where the value range of P satisfies:
Figure BDA0002140259220000123
Finally, it is stipulated that P_m is determined before P_{m+1}, where the value range of m satisfies:
Figure BDA0002140259220000124
where w denotes the weight and M is a positive integer.
Initially, a weight of 1 is assigned to each element (i.e., each edge), so if a path P passes through m directed edges, then w(P) = m.
It should be understood that the above is only a specific implementation manner for assigning weights, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can set the implementation manner as needed, and the implementation manner is not limited herein.
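The path access directed graph and the ordering of loop-free paths by weight can be sketched as follows. The sketch assumes the graph is stored as an adjacency dictionary and simply enumerates every loop-free path before sorting, which is only practical for small graphs; for larger graphs a dedicated K-shortest-paths algorithm such as Yen's algorithm would be a more suitable choice. The page names are hypothetical and only loosely follow the description of fig. 3; all function names are illustrative.

```python
def loop_free_paths(graph, source, target):
    """Enumerate all loop-free (simple) paths from source to target by DFS."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:   # never revisit a page, so the path stays loop-free
                stack.append((nxt, path + [nxt]))


def path_weight(path):
    """w(P): every directed edge has weight 1, so the weight is the edge count."""
    return len(path) - 1


def m_shortest_paths(graph, source, target, m):
    """Return up to m loop-free paths ordered so that w(P_1) <= w(P_2) <= ..."""
    return sorted(loop_free_paths(graph, source, target), key=path_weight)[:m]


# Hypothetical graph: source page u, two intermediate pages, three target pages
graph = {
    "u": ["tea_type", "tea_price"],
    "tea_type": ["v1", "v2"],
    "tea_price": ["v2", "v3"],
}
paths_to_v2 = m_shortest_paths(graph, "u", "v2", 2)
weights = [path_weight(p) for p in paths_to_v2]   # both paths pass 2 edges
```

Because every edge carries weight 1, sorting by `path_weight` is the same as sorting by the number of directed edges, which matches w(P) = m in the text.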
(4) Normalizing the weight corresponding to each element in the access path anchor text set according to a preset weight normalization formula.
Specifically, the weight normalization formula adopted in this embodiment is as follows:
Figure BDA0002140259220000125
where the quantity shown in Figure BDA0002140259220000126 is the weight of the element e in the quantity shown in Figure BDA0002140259220000127.
Still taking the path access directed graph shown in fig. 3 as an example, the weights of the elements in the anchor texts along the original access paths from the source web page u to the target pages v1, v2 and v3 can be adjusted by the weight normalization formula.
(5) Sorting the normalized weights in descending order to obtain multiple attribute topic information in the rich text format corresponding to the second URL link.
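Steps (4) and (5) can be illustrated with a short sketch. The embodiment's weight normalization formula is given only as a figure, so the sketch below does not reproduce it; it assumes, purely for illustration, that each weight is divided by the sum of all weights. The function names and the sample anchor-text elements are hypothetical.

```python
def normalize_weights(weights):
    """Scale raw weights so that they sum to 1 (an assumed normalization)."""
    total = sum(weights.values())
    if total == 0:
        return {element: 0.0 for element in weights}
    return {element: w / total for element, w in weights.items()}


def rank_elements(weights):
    """Return (element, normalized weight) pairs in descending weight order."""
    normalized = normalize_weights(weights)
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)


# Hypothetical anchor-text elements with raw weights
raw = {"tea type": 1, "tea price": 3, "green tea": 4}
ranked = rank_elements(raw)
# [("green tea", 0.5), ("tea price", 0.375), ("tea type", 0.125)]
```

Any other normalization that preserves the ordering of the weights would give the same descending ranking; only the numeric values would differ.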
Step S60, comparing the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the topic information of the agricultural product, and extracting the second URL link corresponding to the multiple attribute topic information of which the similarity with the topic information of the agricultural product meets a preset threshold value.
To facilitate understanding the operation of web crawler-based link extraction, a specific implementation is given below, roughly as follows:
(1) Extracting a multiple attribute topic feature word from the multiple attribute topic information, and performing hash processing on the multiple attribute topic feature word to obtain a first hash value, wherein the multiple attribute topic feature word is one element of the access path anchor text set corresponding to the multiple attribute topic information.
(2) Acquiring the weight corresponding to the multiple attribute topic feature word from the access path anchor text set, and quantizing the first hash value into a first vector in combination with the weight.
(3) Extracting an agricultural product topic feature word from the agricultural product topic information, and performing hash processing on the agricultural product topic feature word to obtain a second hash value.
(4) Quantizing the second hash value into a second vector according to a preset weight of the agricultural product topic feature word.
(5) Comparing the first vector with the second vector, and extracting the second URL link corresponding to the multiple attribute topic information whose similarity to the agricultural product topic information meets a preset threshold.
Specifically, in this embodiment, the comparison between the multiple attribute topic information and the agricultural product topic information is converted into a comparison between two vectors, so that the comparison result can be obtained more intuitively, which facilitates link extraction and ensures accuracy.
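The hash-and-quantize comparison above resembles a SimHash-style fingerprint, and can be sketched as follows. The embodiment does not specify the hash function, the vector length, or the similarity measure, so the sketch assumes MD5, 64 components, cosine similarity, and a threshold of 0.8; all of these, as well as the function names, are illustrative assumptions.

```python
import hashlib
import math

BITS = 64  # assumed vector length


def word_vector(word, weight, bits=BITS):
    """Hash a feature word and expand it into a vector of +weight / -weight entries."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    value = int.from_bytes(digest[: bits // 8], "big")
    return [weight if (value >> i) & 1 else -weight for i in range(bits)]


def topic_vector(weighted_words, bits=BITS):
    """Sum the per-word vectors of a {feature word: weight} mapping."""
    total = [0.0] * bits
    for word, weight in weighted_words.items():
        for i, component in enumerate(word_vector(word, weight, bits)):
            total[i] += component
    return total


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_relevant(link_words, product_words, threshold=0.8):
    """Compare the first vector (link) with the second vector (product topic)."""
    first = topic_vector(link_words)
    second = topic_vector(product_words)
    return cosine_similarity(first, second) >= threshold


# Hypothetical feature words and weights
same_topic = is_relevant({"tea": 0.5, "price": 0.5}, {"tea": 0.5, "price": 0.5})
```

A second URL link would be extracted from the URL queue to be crawled only when `is_relevant` returns true for its multiple attribute topic feature words.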
As can be seen from the above description, the web crawler-based link extraction method provided in this embodiment processes the first URL link of the platform to be accessed and the second URL links in the URL queue to be crawled using an anchor multiple attribute integration mode based on path aggregation, obtaining multiple attribute topic information in a rich text format corresponding to each second URL link. It then compares the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the agricultural product topic information, and extracts the second URL links whose multiple attribute topic information has a similarity to the agricultural product topic information that meets a preset threshold. This effectively ensures the accuracy of extracting specific URL links and avoids the resource waste caused by the web crawler crawling irrelevant links, thereby significantly improving the performance of the web crawler, enabling it to quickly and accurately acquire the information people need, and improving the user experience.
Referring to fig. 4, fig. 4 is a flowchart illustrating a web crawler-based link extraction method according to a second embodiment of the present invention.
Based on the first embodiment, before the step S50, the web crawler-based link extraction method of this embodiment further includes:
Step S00, performing joint deduplication on the second URL links in the URL queue to be crawled by using a counting Bloom filter based on link features in combination with multiple hashes.
Specifically, this joint deduplication is mainly divided into deduplication of the overall feature URL link corresponding to each URL link and deduplication of URL link segments.
Since the URL link segments are obtained from the overall feature URL link, in order to ensure that the joint deduplication operation can proceed smoothly, the correspondence between each second URL link and its overall feature URL link needs to be determined first.
For ease of understanding, this embodiment provides a specific implementation for determining the correspondence between a second URL link and its overall feature URL link, which is roughly as follows:
(1) Traversing the URL queue to be crawled, performing feature analysis on the traversed current second URL link, and extracting the protocol type part, the path part and the query part of the current second URL link.
Specifically, in practical applications a URL link is used to uniquely identify a resource on the network. In general, a URL link contains the following five components: a protocol type part (usually denoted by Protocol), a server address part (usually denoted by Host), a port number part (usually denoted by Port), a path part (usually denoted by Path), and a query part (usually denoted by Fragment).
Of these, the protocol type part, the path part and the query part can usually embody the features of a URL link.
Therefore, in this embodiment, the URL queue to be crawled is traversed and feature analysis is performed on the traversed current second URL link, so as to extract its protocol type part (denoted p_1 below for convenience of description), path part (denoted p_2 below), and query part (denoted p_3 below).
(2) Obtaining the overall feature URL link corresponding to the current second URL link according to the protocol type part, the path part and the query part.
Specifically, since the three parts p_1, p_2 and p_3 can embody the overall features of the current second URL link, combining p_1, p_2 and p_3 yields the overall feature URL link corresponding to the current second URL link. Hereinafter, p_1p_2p_3 denotes the overall feature URL link corresponding to each second URL link.
(3) Establishing a correspondence between the current second URL link and the overall feature URL link, and updating the correspondence into the URL queue to be crawled.
Specifically, in this embodiment, the correspondence between the current second URL link and the overall feature URL link is established and updated into the URL queue to be crawled, so that in the subsequent deduplication of second URL links, the correspondence can be used to quickly find the overall feature URL link corresponding to the current second URL link, from which the URL link segments corresponding to the current second URL link are then obtained.
In addition, in practical application, the corresponding relation may not be updated to the URL queue to be crawled, but may be stored separately. And when the second URL link in the URL queue to be crawled is subjected to joint duplicate removal, searching the integral characteristic URL link corresponding to the current second URL link from the separately stored corresponding relation table according to the traversed current second URL link.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can make settings according to needs, and the present invention is not limited herein.
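The extraction of p_1, p_2 and p_3 and the correspondence table can be sketched with the Python standard library. The sketch assumes that the query part corresponds to the query string returned by `urllib.parse.urlsplit`, that the overall feature URL link is the plain concatenation of the three parts, and that the correspondence is stored separately as a dictionary; the function names and sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def overall_feature(url):
    """Return (p1, p2, p3) = (protocol type, path, query) of a URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.path, parts.query


def build_correspondence(queue):
    """Map each queued second URL link to its overall feature link p1p2p3."""
    return {url: "".join(overall_feature(url)) for url in queue}


# Hypothetical queue: the two links differ only in the server address part
queue = [
    "http://example.com/tea/list?page=1",
    "http://mirror.example.com/tea/list?page=1",
]
table = build_correspondence(queue)
# Both links map to "http/tea/listpage=1", so they share one overall feature link.
```

Because the server address and port are left out of p_1p_2p_3, links that point to the same path and query on different hosts collapse to the same overall feature link, which is what makes them candidates for deduplication.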
Further, after the correspondence and the overall feature URL link corresponding to each second URL link have been obtained, the operation of performing joint deduplication on the second URL links in the URL queue to be crawled by using a counting Bloom filter based on link features in combination with multiple hashes may specifically be as follows:
(1) Traversing the URL queue to be crawled, and acquiring the overall feature URL link corresponding to the traversed current second URL link.
Specifically, the whole characteristic URL link corresponding to the traversed current second URL link is obtained according to the above correspondence.
(2) Performing overall duplicate checking on the overall feature URL link by using the counting Bloom filter based on link features, to obtain a duplicate checking mark corresponding to the overall feature URL link.
Specifically, the counting Bloom filter used in this embodiment is not the counting Bloom filter used in existing link deduplication, but a counting Bloom filter based on the link features of URL links.
That is to say, when the counting Bloom filter of this embodiment deduplicates links, feature recognition is first performed on the overall feature URL link corresponding to each second URL link in the URL queue to be crawled, and overall duplicate checking is then performed according to the recognized features. In other words, the features of each incoming second URL link are compared during deduplication, so that overall duplicate checking is achieved.
In addition, in order to conveniently identify the URL link segment which is subsequently recombined according to the feature segment, a corresponding duplication checking mark is distributed to the whole feature URL link.
(3) Performing feature recognition on the overall feature URL link according to the duplicate checking mark to obtain a plurality of feature segments.
Specifically, still taking the overall feature URL link p_1p_2p_3 as an example, after feature recognition is performed on the overall feature URL link, the obtained feature segments may specifically be segments respectively containing the protocol type part, the path part and the query part, that is, feature segment p_1, feature segment p_2 and feature segment p_3.
(4) Recombining the plurality of feature segments according to a preset URL link recombination rule to obtain N recombined URL link segments.
It should be understood that, since an overall feature URL link is composed of three parts (a protocol type part, a path part and a query part), at least one recombined URL link segment is obtained; that is, N in this embodiment is an integer greater than or equal to 1.
In addition, in practical applications, the URL link recombination rule may be set by those skilled in the art as required. For example, the recombined URL link segment must include the feature segment p_1, or the recombined URL link segment must not include the feature segment p_3, and so on; the possibilities are not listed one by one here, and no limitation is imposed.
Accordingly, if the URL link recombination rule is that a recombined URL link segment must include the feature segment p_1, the resulting recombined URL link segments are essentially: a URL link segment including only the feature segment p_1, a URL link segment including only the feature segments p_1 and p_2, and a URL link segment including only the feature segments p_1 and p_3.
If the URL link recombination rule is that a recombined URL link segment must not include the feature segment p_3, the resulting recombined URL link segments are essentially: a URL link segment including only the feature segment p_1, and a URL link segment including only the feature segments p_1 and p_2.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can set the technical solution according to actual needs, and the technical solution is not limited herein.
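The "must include p_1" recombination rule can be sketched as below. The sketch assumes that a recombined segment is a concatenation of a proper subset of the feature segments in their original order, and that the full combination p_1p_2p_3 is skipped because it equals the overall feature URL link already checked in the overall duplicate check. The function names and sample values are hypothetical.

```python
from itertools import combinations


def recombine(segments, rule):
    """Return every proper, non-empty combination of feature segments allowed by rule.

    The full combination is skipped on the assumption that it equals the
    overall feature link already handled by the overall duplicate check.
    """
    names = sorted(segments)               # "p1", "p2", "p3" in order
    result = []
    for size in range(1, len(names)):      # sizes 1 .. len - 1 only
        for combo in combinations(names, size):
            if rule(combo):
                result.append("".join(segments[name] for name in combo))
    return result


def must_include_p1(combo):
    """Recombination rule: the segment must contain feature segment p1."""
    return "p1" in combo


segments = {"p1": "http", "p2": "/tea/list", "p3": "page=1"}
link_segments = recombine(segments, must_include_p1)
# ["http", "http/tea/list", "httppage=1"], i.e. p1, p1p2 and p1p3, so N = 3
```

Other rules can be expressed by passing a different predicate as `rule`.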
(5) Performing multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link.
It should be noted that, in practical applications, a large number of second URL links may be cached in the URL queue to be crawled, so the number of URL link segments obtained after recombination is even larger. Therefore, in this embodiment, in order to reduce as much as possible the storage space occupied by the second URL links cached in the URL queue to be crawled, after the plurality of feature segments are recombined according to the preset URL link recombination rule to obtain N recombined URL link segments, the N recombined URL link segments may each be compressed based on the MD5 algorithm to obtain the corresponding character string ciphertexts, and each character string ciphertext then replaces the content of the corresponding recombined URL link segment.
It should be understood that the above is only a specific compression method, and the technical solution of the present invention is not limited in any way, and in practical applications, a person skilled in the art can select a suitable compression method according to actual needs, and is not limited herein.
Correspondingly, the operation of performing multiple hash duplicate checking on the N recombined URL link segments to obtain the duplicate checking result corresponding to the current second URL link specifically includes:
(5-1) Extracting the character string ciphertexts corresponding to the N recombined URL link segments, and selecting any one of the N character string ciphertexts to perform hash processing K times to obtain K hash values.
It should be understood that, since the link deduplication operation provided in this embodiment combines multiple hashes when performing joint deduplication on links, that is, a character string ciphertext must be hashed at least twice, K is an integer greater than or equal to 2.
(5-2) Hashing the K hash values into a pre-constructed bit vector space as reference hash values, and setting an initial count value for the space-variable counter corresponding to each reference hash value.
Specifically, in this embodiment, the initial count value displayed on the space-variable counter corresponding to each reference hash value is represented by "0".
(5-3) Performing hash processing K times on each of the remaining N-1 character string ciphertexts to obtain K hash values corresponding to each remaining character string ciphertext.
(5-4) Randomly hashing the K hash values corresponding to each remaining character string ciphertext into the bit vector space, such that each of them is adjacent to one of the reference hash values.
Specifically, in order to determine whether the hash value newly hashed into the bit vector space is adjacent to the reference hash value, a determination criterion may be preset, for example, when a new hash value is inserted between two adjacent reference hash values, the reference hash value closest to the newly inserted hash value may be selected as the adjacent reference hash value.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art can set the technical solution according to actual needs, and the technical solution is not limited herein.
(5-5) For each hash value newly hashed into the bit vector space, inserting a preset character, by head insertion, in front of the initial count value corresponding to the adjacent reference hash value.
Specifically, in this embodiment, the preset character is represented by "1".
For example, for a reference hash value, the initial count value displayed on the corresponding space-variable counter is "0". When a new hash value is hashed to a position adjacent to that reference hash value, a preset character "1" is inserted in front of "0" by head insertion, and the count value displayed on the space-variable counter becomes "10".
Accordingly, if two new hash values are hashed to positions adjacent to the reference hash value, two preset characters "1" are inserted in front of "0" by head insertion, and the count value displayed on the space-variable counter becomes "110".
(5-6) Counting the number of preset characters in front of the initial count value corresponding to each reference hash value, and determining the duplicate checking result corresponding to the current second URL link according to the number of preset characters.
Specifically, the determined duplicate checking result may be:
if the number of preset characters "1" in front of the initial count value "0" is greater than 1, it is determined that the recombined URL link segment is a duplicate and needs to be discarded;
otherwise, it is determined that the recombined URL link segment is not a duplicate and can be retained.
(6) Retaining or discarding the second URL link in the URL queue to be crawled according to the duplicate checking result.
It should be understood that the above is only a specific implementation manner of joint deduplication, and the technical solution of the present invention is not limited in any way, and in practical applications, those skilled in the art may reasonably adjust the implementation manner according to needs, and the implementation manner is not limited herein.
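The counter representation described above ("0", "10", "110") can be illustrated with a conventional counting Bloom filter. This sketch is not the embodiment's scheme: it does not model reference hash values or the adjacency criterion, and it assumes that the K hash positions are derived by salting an MD5 digest with the hash index. It only shows K-fold hashing combined with head insertion of the preset character "1" in front of the initial count value "0". The class and function names are hypothetical.

```python
import hashlib


class UnaryCountingBloomFilter:
    """Counting Bloom filter whose counters are strings such as "0", "10", "110"."""

    def __init__(self, size=1024, k=3):
        if k < 2:
            raise ValueError("k must be at least 2 (multiple hashing)")
        self.size = size
        self.k = k
        self.counters = ["0"] * size   # initial count value "0" everywhere

    def _positions(self, item):
        """Derive k positions by salting an MD5 digest with the hash index."""
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def count(self, position):
        """Number of preset characters "1" in front of the initial "0"."""
        return len(self.counters[position]) - 1

    def seen(self, item):
        """True if every position of the item was already hit (possible duplicate)."""
        return all(self.count(p) > 0 for p in self._positions(item))

    def add(self, item):
        """Head-insert one "1" into the counter at each of the item's positions."""
        for p in self._positions(item):
            self.counters[p] = "1" + self.counters[p]


def deduplicate(segments, bloom):
    """Keep the first occurrence of each segment and drop later repeats."""
    kept = []
    for segment in segments:
        if not bloom.seen(segment):
            kept.append(segment)
        bloom.add(segment)
    return kept


# Hypothetical recombined URL link segments, one of them repeated
unique = deduplicate(
    ["http/tea/list", "http/tea/price", "http/tea/list"],
    UnaryCountingBloomFilter(),
)
```

As with any Bloom filter, `seen` can report a false positive when unrelated items happen to hit the same positions; a larger `size` or a different `k` changes that rate.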
In addition, in practical application, in order to further reduce the occupation of a storage space, after a counting bloom filter of link characteristics is adopted and multiple hashes are combined to perform joint deduplication on the second URL links in the URL queue to be crawled, each second URL link in the URL queue to be crawled after deduplication is performed can be compressed based on an MD5 algorithm, and then a character string ciphertext corresponding to each second URL link is obtained; and finally, replacing the content in the corresponding second URL link with the character string ciphertext, so that the second URL link in the URL queue to be crawled is compressed as much as possible, and the occupation of a storage space is reduced.
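The MD5-based compression of queued links can be sketched as follows, assuming that the character string ciphertext is the hexadecimal MD5 digest of the UTF-8 encoded link. The function names are hypothetical.

```python
import hashlib


def compress_link(url):
    """Replace a URL link with its 32-character hexadecimal MD5 digest."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def compress_queue(queue):
    """Return the queue with every link replaced by its MD5 digest."""
    return [compress_link(url) for url in queue]


# Hypothetical deduplicated queue
compressed = compress_queue([
    "http://example.com/tea/list?page=1",
    "http://example.com/tea/price",
])
```

Every digest has the same fixed length regardless of the length of the original link. Note that an MD5 digest cannot be converted back into the original link, so a crawler that still needs to fetch the page would have to keep a mapping from digest to link elsewhere.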
As can be seen from the above description, the link extraction method based on the web crawler provided in this embodiment performs deduplication operation on the second URL link in the URL queue to be crawled before performing extraction operation on the second URL link in the URL queue to be crawled, thereby further reducing unnecessary interference in the link extraction process and improving the extraction efficiency of the web crawler.
In addition, this embodiment uses a counting Bloom filter based on link features, combined with multiple hashes, to perform joint overall and partial deduplication on the second URL links cached in the URL queue to be crawled. This reduces the misjudgment rate of the counting Bloom filter as far as possible and effectively improves the performance of the web crawler, so that the web crawler can quickly and accurately acquire the information people need, improving the user experience as much as possible.
In addition, in the deduplication process, the URL link is compressed based on a compression algorithm, such as the MD5 algorithm, so that the occupation of the storage space is reduced as much as possible.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a web crawler-based link extraction program is stored on the computer-readable storage medium, and when executed by a processor, the web crawler-based link extraction program implements the steps of the web crawler-based link extraction method as described above.
Referring to fig. 5, fig. 5 is a block diagram illustrating a first embodiment of a web crawler-based link extracting apparatus according to the present invention.
As shown in fig. 5, the link extracting apparatus based on web crawlers according to the embodiment of the present invention includes: an extraction module 5001, a sending module 5002, a fetching module 5003, a parsing module 5004, a processing module 5005, and an extraction module 5006.
The extraction module 5001 is configured to, when a data fetch request of an agricultural product to be analyzed is received, extract a first uniform resource locator URL link of a platform to be accessed and topic information related to the agricultural product to be analyzed from the data fetch request; a sending module 5002, configured to send an access request to the platform to be accessed according to the first URL link; a fetching module 5003, configured to fetch, after receiving a response made by the platform to be accessed according to the access request, data information in a page corresponding to the first URL link; the analyzing module 5004 is configured to analyze the data information to obtain a second URL link embedded in the page, and add the second URL link to a URL queue to be crawled; a processing module 5005, configured to process the first URL link and the second URL link in the URL queue to be crawled based on an anchor multiple attribute integration manner of path aggregation, to obtain multiple attribute topic information in a rich text format corresponding to the second URL link; the extracting module 5006 is configured to compare the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the agricultural product topic information, and extract a second URL link corresponding to the multiple attribute topic information whose similarity to the agricultural product topic information meets a preset threshold.
It should be understood that each module referred to in this embodiment is a logical module, and in practical applications, one logical unit may be one physical unit, may be a part of one physical unit, and may also be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, a unit which is not so closely related to solve the technical problem proposed by the present invention is not introduced in the present embodiment, but it does not indicate that there is no other unit in the present embodiment.
In addition, in order to facilitate understanding of a specific processing flow of each functional module in an actual application of the web crawler-based link extraction apparatus provided in this embodiment, the following specifically describes processing of the parsing module 5004, the processing module 5005, and the extraction module 5006.
Specifically, the parsing module 5004 performs the operation of analyzing the data information to obtain a second URL link embedded in the page and adding the second URL link to the URL queue to be crawled, and its implementation flow in a specific application is substantially as follows:
firstly, analyzing the data information to obtain a second URL link embedded in the page;
then, analyzing the second URL link to obtain a normalized label corresponding to the second URL link;
then, generating an abstract tree corresponding to the second URL link according to the normalized tag;
then, based on a DOM tree matching method, matching the node content of the abstract tree with the topic information of the agricultural product, removing unmatched node content, and obtaining a second URL link matched with the topic information of the agricultural product;
and finally, adding a second URL link matched with the agricultural product topic information into a URL queue to be crawled.
It should be understood that the above is only a specific implementation manner for denoising the second URL link in the URL queue to be crawled, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the denoising method according to needs, and the present invention is not limited to this.
In addition, the processing module 5005 performs the operation of processing the first URL link and the second URL link in the URL queue to be crawled based on an anchor multiple attribute integration mode of path aggregation to obtain multiple attribute topic information in a rich text format corresponding to the second URL link, and its implementation flow in a specific application is substantially as follows:
firstly, generating a path access directed graph corresponding to the agricultural product to be analyzed according to the first URL link and a second URL link in the URL queue to be crawled;
then, based on an anchor multiple attribute integration mode of path aggregation, determining the shortest access path in the path access directed graph to obtain a shortest access path set;
then, determining an anchor text corresponding to each shortest access path in the shortest access path set to obtain an access path anchor text set corresponding to the shortest access path set, and assigning a weight to each element in the access path anchor text set;
next, normalizing the weight corresponding to each element in the access path anchor text set according to a preset weight normalization formula;
and finally, sorting the normalized weights in a descending order to obtain multiple attribute theme information in the rich text format corresponding to the second URL link.
It should be understood that, the above only is a specific implementation manner for obtaining the multiple attribute topic information in the rich text format corresponding to each second URL link in the URL queue to be crawled, and the technical solution of the present invention is not limited at all.
In addition, the extraction module 5006 performs the operation of comparing the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the agricultural product topic information and extracting the second URL link corresponding to the multiple attribute topic information whose similarity to the agricultural product topic information meets a preset threshold, and its implementation flow in a specific application is substantially as follows:
firstly, extracting a multiple attribute topic feature word from the multiple attribute topic information, and hashing the multiple attribute topic feature word to obtain a first hash value, wherein the multiple attribute topic feature word is one element in the access path anchor text set corresponding to the multiple attribute topic information;
then, acquiring the weight corresponding to the multiple attribute topic feature word from the access path anchor text set, and quantizing the first hash value into a first vector by combining the weight;
then, extracting an agricultural product topic feature word from the agricultural product topic information, and hashing the agricultural product topic feature word to obtain a second hash value;
then, quantizing the second hash value into a second vector according to a weight preset for the agricultural product topic feature word;
and finally, comparing the first vector with the second vector, and extracting the second URL link corresponding to the multiple attribute topic information whose similarity to the agricultural product topic information meets a preset threshold.
It should be understood that the above is only one specific implementation of extracting specific links from the URL queue to be crawled and does not limit the technical solution of the present invention in any way; in a specific application, a person skilled in the art may configure the extraction as needed.
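The hash, weight and compare flow above resembles a SimHash-style weighted fingerprint. The following is a minimal sketch under that assumption: the choice of MD5 as the word hash, the 64-bit width, the cosine similarity measure and the threshold value are all assumptions, and the function names and feature words are hypothetical.

```python
import hashlib
import math

BITS = 64  # assumed fingerprint width


def weighted_hash_vector(weighted_words):
    """Hash each feature word and spread its weight over a BITS-dimensional
    vector: +weight where the hash bit is 1, -weight where it is 0."""
    vec = [0.0] * BITS
    for word, weight in weighted_words.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            vec[i] += weight if (value >> i) & 1 else -weight
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_relevant(link_words, topic_words, threshold=0.8):
    """Keep a second URL link when its vector is close enough to the topic vector."""
    first = weighted_hash_vector(link_words)
    second = weighted_hash_vector(topic_words)
    return cosine(first, second) >= threshold


# Hypothetical feature words with normalized weights.
topic = {"apple": 0.6, "price": 0.4}
same_topic_link = {"apple": 0.6, "price": 0.4}
```

A link whose feature words and weights match the topic exactly yields a similarity of 1; how low the threshold can go before unrelated links slip through would have to be tuned on real data.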
As the above description shows, the web crawler-based link extraction device provided in this embodiment processes the first URL link of the platform to be accessed and the second URL links in the URL queue to be crawled through an anchor multiple attribute integration mode based on path aggregation, obtaining multiple attribute topic information in a rich text format corresponding to each second URL link. It then compares the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the agricultural product topic information, and extracts the second URL links whose multiple attribute topic information has a similarity to the agricultural product topic information that meets a preset threshold. This effectively ensures the accuracy of extracting specific URL links and avoids the resource waste caused by the web crawler crawling unrelated links, thereby improving the performance of the web crawler, enabling it to acquire the required information quickly and accurately, and improving the user experience.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, for technical details that are not described in detail in this embodiment, reference may be made to the link deduplication method provided in any embodiment of the present invention; they are not described herein again.
Based on the first embodiment of the web crawler-based link extraction apparatus, a second embodiment of the web crawler-based link extraction apparatus of the present invention is provided.
In this embodiment, the web crawler-based link extraction apparatus further includes a deduplication module.
The deduplication module is configured to perform joint deduplication on the second URL links in the URL queue to be crawled by using a counting Bloom filter based on link characteristics in combination with multiple hashing.
It should be noted that all the modules involved in this embodiment are logical modules. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, units that are not closely related to solving the technical problem proposed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.
In addition, it is worth mentioning that, in this embodiment, the joint deduplication that the deduplication module performs on the second URL links in the URL queue to be crawled, using a counting Bloom filter based on link characteristics in combination with multiple hashing, specifically includes deduplication of the integral characteristic URL link corresponding to each URL link and deduplication of URL link segments.
Since the URL link segments are obtained from the integral characteristic URL link, the correspondence between each second URL link and its integral characteristic URL link needs to be determined first so that the deduplication module can perform the above operations smoothly.
The correspondence between the second URL link and the integral characteristic URL link may be determined roughly as follows:
firstly, traversing the URL queue to be crawled, performing characteristic analysis on the traversed current second URL link, and extracting a protocol type part, a path part and a query part of the current second URL link;
then, obtaining the integral characteristic URL link corresponding to the current second URL link according to the protocol type part, the path part and the query part;
and finally, establishing a corresponding relation between the current second URL link and the integral characteristic URL link, and updating the corresponding relation into the URL queue to be crawled.
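The three steps above can be sketched with the Python standard library's `urllib.parse`. How the integral characteristic URL link is assembled from the protocol type, path and query parts is not specified here, so the normalization below (lowercased scheme and host, sorted query parameters, fragment dropped) is an assumption, and `integral_feature_url` and the example URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def integral_feature_url(url):
    """Build a normalized URL from the protocol type, path and query parts.

    Assumed normalization: lowercase scheme and host, default path "/",
    query parameters sorted by name, fragment discarded.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    path = parts.path or "/"
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, host, path, query, ""))


# Hypothetical URL queue to be crawled.
queue = [
    "HTTP://Example.com/fruit?b=2&a=1#top",
    "http://example.com/fruit?a=1&b=2",
]

# Correspondence between each second URL link and its integral characteristic URL link.
correspondence = {url: integral_feature_url(url) for url in queue}
```

Both queue entries map to the same integral characteristic URL link, which is what lets a later duplicate check treat them as one page.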
Accordingly, after obtaining the corresponding relationship, the operation executed by the deduplication module is specifically:
firstly, traversing the URL queue to be crawled, and acquiring an integral characteristic URL link corresponding to a traversed current second URL link;
then, carrying out integral duplicate checking on the integral characteristic URL link by adopting a counting bloom filter of the link characteristic to obtain a duplicate checking mark corresponding to the integral characteristic URL link;
then, according to the duplicate checking mark, carrying out feature identification on the integral characteristic URL link to obtain a plurality of feature segments;
next, recombining the plurality of feature segments according to a preset URL link recombination rule to obtain N recombined URL link segments;
then, performing multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link;
and finally, according to the duplicate checking result, reserving or discarding the second URL link in the URL queue to be crawled.
In this embodiment, N is an integer of 1 or more.
In addition, it should be understood that the above is only one specific implementation of determining the correspondence between a second URL link and its integral characteristic URL link and of performing joint deduplication on the second URL links in the URL queue to be crawled by using a counting Bloom filter based on link characteristics in combination with multiple hashing, and it does not limit the technical solution of the present invention in any way.
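The whole-link duplicate check can be illustrated with a generic counting Bloom filter. This is a minimal sketch of the textbook data structure, not the specific link-characteristic filter of this embodiment: the counter array size, the number of hash functions, the double-hashing scheme over SHA-256 and the names `CountingBloomFilter` and `deduplicate` are all assumptions.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter whose cells are counters, so entries can also be removed."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # Double hashing: derive num_hashes positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only decrement when the item appears to be present.
        if item in self:
            for pos in self._positions(item):
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


def deduplicate(queue):
    """Keep the first occurrence of each URL; later occurrences are dropped.

    A Bloom filter can report false positives, so a new URL may
    occasionally be dropped as well; it never reports false negatives.
    """
    seen = CountingBloomFilter()
    kept = []
    for url in queue:
        if url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
unique = deduplicate(urls)
```

In the flow described above, the strings passed to the filter would be the integral characteristic URL links rather than the raw second URL links.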
Further, in practical applications, in order to reduce as much as possible the storage space occupied by the second URL links cached in the URL queue to be crawled, after the plurality of feature segments are recombined according to the preset URL link recombination rule to obtain N recombined URL link segments, the N recombined URL link segments may each be compressed based on the MD5 algorithm to obtain the character string ciphertexts corresponding to the N recombined URL link segments, and finally the content of each recombined URL link segment is replaced with its corresponding character string ciphertext.
Correspondingly, the operation of performing multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link specifically includes:
firstly, extracting the character string ciphertexts corresponding to the N recombined URL link segments, selecting any one character string ciphertext from the N character string ciphertexts, and hashing it K times to obtain K hash values;
then, hashing the K hash values into a pre-constructed bit vector space to serve as reference hash values, and setting an initial count value for the space variable counter corresponding to each reference hash value;
then, hashing each of the remaining N-1 character string ciphertexts K times to obtain K hash values corresponding to each remaining character string ciphertext;
then, randomly hashing K hash values corresponding to each residual character string ciphertext to the bit vector space, wherein the K hash values are adjacent to any one reference hash value;
then, inserting a preset character for each hash value newly hashed to the bit vector space before the initial count value corresponding to the adjacent reference hash value by adopting a head insertion method;
and finally, counting the number of preset characters before the initial value corresponding to each reference hash value, and determining the duplicate checking result corresponding to the current second URL link according to the number of the preset characters.
In this embodiment, K is an integer of 2 or more.
In addition, it should be understood that the above is only one specific implementation for obtaining the duplicate checking result corresponding to the current second URL link and does not limit the technical solution of the present invention in any way; in a specific application, a person skilled in the art may configure the duplicate checking as needed.
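As a rough illustration of the segment-level check, the sketch below reduces each recombined URL link segment to an MD5 hex digest, derives K positions in a bit vector space from it, and reports a duplicate only when every segment has been seen before. It is a deliberately simplified stand-in: it does not reproduce the reference hash values, head insertion of preset characters or character counting described above, and the salted SHA-256 hashing, the sizes and all names are assumptions.

```python
import hashlib


def md5_ciphertext(segment):
    """Reduce a URL link segment to a fixed-length 32-character MD5 hex digest."""
    return hashlib.md5(segment.encode("utf-8")).hexdigest()


def k_positions(ciphertext, k, space_size):
    """Derive k positions in the bit vector space by salting one hash function."""
    positions = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{ciphertext}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % space_size)
    return positions


class SegmentDuplicateChecker:
    """Simplified multiple-hash duplicate check over N URL link segments."""

    def __init__(self, k=3, space_size=4096):
        self.k = k
        self.space_size = space_size
        self.bits = [False] * space_size

    def check_and_record(self, segments):
        """Return True if every segment was already recorded, then record them."""
        all_seen = True
        for segment in segments:
            positions = k_positions(md5_ciphertext(segment), self.k, self.space_size)
            if not all(self.bits[p] for p in positions):
                all_seen = False
            for p in positions:
                self.bits[p] = True
        return all_seen


checker = SegmentDuplicateChecker()
first_visit = checker.check_and_record(["http", "/fruit", "id=1"])   # hypothetical segments
second_visit = checker.check_and_record(["http", "/fruit", "id=1"])
```

The second URL link would be kept when the check returns False and discarded when it returns True.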
In addition, in practical application, in order to further reduce the occupation of a storage space, after the second URL links in the URL queue to be crawled are subjected to joint duplicate removal, each second URL link in the URL queue to be crawled after the duplicate removal can be compressed based on an MD5 algorithm, so as to obtain a character string ciphertext corresponding to each second URL link; and finally, replacing the content in the corresponding second URL link with the character string ciphertext, so that the second URL link in the URL queue to be crawled is compressed as much as possible, and the occupation of a storage space is reduced.
It can be easily seen from the above description that the link extracting device based on web crawlers provided by this embodiment performs deduplication operation on the second URL link in the URL queue to be crawled before performing extraction operation on the second URL link in the URL queue to be crawled, thereby further reducing unnecessary interference in the link extraction process and improving the extraction efficiency of the web crawlers.
In addition, in this embodiment, a counting Bloom filter based on link characteristics is combined with multiple hashing to perform joint whole and partial deduplication on the second URL links cached in the URL queue to be crawled, which reduces the false positive rate of the counting Bloom filter as far as possible, effectively improves the performance of the web crawler, enables the web crawler to acquire the required information quickly and accurately, and improves the user experience.
In addition, in the deduplication process, the URL link is compressed based on a compression algorithm, such as the MD5 algorithm, so that the occupation of the storage space is reduced as much as possible.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, for technical details that are not described in detail in this embodiment, reference may be made to the link deduplication method provided in any embodiment of the present invention; they are not described herein again.
Further, it is to be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (7)

1. A link extraction method based on web crawlers is characterized by comprising the following steps:
when a data grabbing request of an agricultural product to be analyzed is received, extracting a first Uniform Resource Locator (URL) link of a platform to be accessed and agricultural product subject information related to the agricultural product to be analyzed from the data grabbing request;
according to the first URL link, sending an access request to the platform to be accessed;
after receiving a response made by the platform to be accessed according to the access request, capturing data information in a page corresponding to the first URL link;
analyzing the data information to obtain a second URL link embedded in the page, and adding the second URL link to a URL queue to be crawled;
processing the first URL link and a second URL link in the URL queue to be crawled based on an anchor multiple attribute integration mode of path aggregation to obtain multiple attribute theme information in a rich text format corresponding to the second URL link;
respectively comparing the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the topic information of the agricultural product, and extracting the second URL link corresponding to the multiple attribute topic information of which the similarity with the topic information of the agricultural product meets a preset threshold;
wherein, before the step of processing the first URL link and the second URL link in the URL queue to be crawled based on the anchor multiple attribute integration mode of path aggregation to obtain multiple attribute topic information in a rich text format corresponding to the second URL link, the method further includes:
performing combined duplicate removal on the second URL links in the URL queue to be crawled by adopting a counting bloom filter with link characteristics and combining multiple hashes, so that any two second URL links in the URL queue to be crawled are different;
before the step of performing joint deduplication on the second URL link in the URL queue to be crawled by using the counting bloom filter with link characteristics and combining multiple hashes, the method further includes:
traversing the URL queue to be crawled, performing characteristic analysis on a traversed current second URL link, and extracting a protocol type part, a path part and a query part of the current second URL link;
obtaining an integral characteristic URL link corresponding to the current second URL link according to the protocol type part, the path part and the query part;
establishing a corresponding relation between the current second URL link and the integral characteristic URL link, and updating the corresponding relation into the URL queue to be crawled;
wherein the step of performing joint deduplication on the second URL link in the URL queue to be crawled by adopting a counting bloom filter with link characteristics and combining multiple hashes comprises:
traversing the URL queue to be crawled, and acquiring an integral characteristic URL link corresponding to a traversed current second URL link;
carrying out integral duplicate checking on the integral characteristic URL link by adopting a counting bloom filter of the link characteristic to obtain a duplicate checking mark corresponding to the integral characteristic URL link;
according to the duplicate checking mark, carrying out feature identification on the integral feature URL link to obtain a plurality of feature segments;
recombining the plurality of characteristic segments according to a preset URL link recombination rule to obtain N recombined URL link segments, wherein N is an integer greater than or equal to 1;
performing multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link;
and according to the duplicate checking result, reserving or discarding the second URL link in the URL queue to be crawled.
2. The method of claim 1, wherein the step of parsing the data information to obtain a second URL link embedded in the page and adding the second URL link to a URL queue to be crawled comprises:
analyzing the data information to obtain a second URL link embedded in the page;
analyzing the second URL link to obtain a normalized tag corresponding to the second URL link;
generating an abstract tree corresponding to the second URL link according to the normalized tag;
matching the node content of the abstract tree with the topic information of the agricultural product based on a DOM tree matching method, removing unmatched node content, and obtaining a second URL link matched with the topic information of the agricultural product;
and adding a second URL link matched with the agricultural product topic information to a URL queue to be crawled.
3. The method according to claim 2, wherein the step of processing the first URL link and the second URL link in the URL queue to be crawled based on the anchor multiple attribute integration manner of path aggregation to obtain multiple attribute topic information in rich text format corresponding to the second URL link includes:
generating a path access directed graph corresponding to the agricultural product to be analyzed according to the first URL link and a second URL link in the URL queue to be crawled;
determining the shortest access path in the path access directed graph based on an anchor multiple attribute integration mode of path aggregation to obtain a shortest access path set;
determining an anchor text corresponding to each shortest access path in the shortest access path set to obtain an access path anchor text set corresponding to the shortest access path set, and allocating a weight to each element in the access path anchor text set;
standardizing the weight corresponding to each element in the access path anchor text set according to a preset weight standardization formula;
and sorting the normalized weights in a descending order to obtain multiple attribute theme information in the rich text format corresponding to the second URL link.
4. The method of claim 3, wherein the step of respectively comparing the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the topic information of the agricultural product, and extracting the second URL link corresponding to the multiple attribute topic information of which the similarity with the topic information of the agricultural product meets a preset threshold, comprises:
extracting multiple attribute theme characteristic words from the multiple attribute theme information, and performing hash processing on the multiple attribute theme characteristic words to obtain a first hash value, wherein the multiple attribute theme characteristic words are one element in an access path anchor text set corresponding to the multiple attribute theme information;
acquiring weights corresponding to the multiple attribute topic feature words from the access path anchor text set, and quantizing the first hash value into a first vector by combining the weights;
extracting an agricultural product subject characteristic word from the agricultural product subject information, and carrying out hash processing on the agricultural product subject characteristic word to obtain a second hash value;
quantizing the second hash value into a second vector according to a preset weight for the agricultural product topic feature word;
and comparing the first vector with the second vector, and extracting a second URL link corresponding to the multiple attribute topic information of which the similarity with the topic information of the agricultural product meets a preset threshold value.
5. A web crawler-based link extraction apparatus, the apparatus comprising:
an extraction module, configured to extract, when a data grabbing request of an agricultural product to be analyzed is received, a first Uniform Resource Locator (URL) link of a platform to be accessed and agricultural product topic information related to the agricultural product to be analyzed from the data grabbing request;
the sending module is used for sending an access request to the platform to be accessed according to the first URL link;
the grabbing module is used for grabbing data information in a page corresponding to the first URL link after receiving a response made by the platform to be visited according to the visit request;
the analysis module is used for analyzing the data information to obtain a second URL link embedded in the page and adding the second URL link to a URL queue to be crawled;
the processing module is used for processing the first URL link and a second URL link in the URL queue to be crawled based on an anchor multiple attribute integration mode of path aggregation to obtain multiple attribute theme information in a rich text format corresponding to the second URL link;
the extraction module is used for respectively comparing the multiple attribute topic information corresponding to each second URL link in the URL queue to be crawled with the topic information of the agricultural product and extracting the second URL link corresponding to the multiple attribute topic information of which the similarity with the topic information of the agricultural product meets a preset threshold;
the web crawler-based link extraction apparatus is further configured to: perform joint deduplication on the second URL links in the URL queue to be crawled by adopting a counting bloom filter with link characteristics and combining multiple hashes, so that any two second URL links in the URL queue to be crawled are different;
the web crawler-based link extraction apparatus is further configured to: traverse the URL queue to be crawled, perform characteristic analysis on a traversed current second URL link, and extract a protocol type part, a path part and a query part of the current second URL link; obtain an integral characteristic URL link corresponding to the current second URL link according to the protocol type part, the path part and the query part; establish a corresponding relation between the current second URL link and the integral characteristic URL link, and update the corresponding relation into the URL queue to be crawled;
the web crawler-based link extraction apparatus is further configured to: traverse the URL queue to be crawled, and acquire an integral characteristic URL link corresponding to a traversed current second URL link; carry out integral duplicate checking on the integral characteristic URL link by adopting a counting bloom filter of the link characteristic to obtain a duplicate checking mark corresponding to the integral characteristic URL link; according to the duplicate checking mark, carry out feature identification on the integral characteristic URL link to obtain a plurality of feature segments; recombine the plurality of feature segments according to a preset URL link recombination rule to obtain N recombined URL link segments, wherein N is an integer greater than or equal to 1; perform multiple hash duplicate checking on the N recombined URL link segments to obtain a duplicate checking result corresponding to the current second URL link; and according to the duplicate checking result, reserve or discard the second URL link in the URL queue to be crawled.
6. A web crawler-based link extraction device, the device comprising: a memory, a processor, and a web crawler-based link extraction program stored on the memory and executable on the processor, the web crawler-based link extraction program configured to implement the steps of the web crawler-based link extraction method of any one of claims 1-4.
7. A computer-readable storage medium, wherein the computer-readable storage medium has stored thereon a web crawler-based link extraction program, which when executed by a processor, implements the steps of the web crawler-based link extraction method according to any one of claims 1 to 4.
CN201910670515.5A 2019-07-23 2019-07-23 Link extraction method, device, equipment and storage medium based on web crawler Active CN110413861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910670515.5A CN110413861B (en) 2019-07-23 2019-07-23 Link extraction method, device, equipment and storage medium based on web crawler

Publications (2)

Publication Number Publication Date
CN110413861A CN110413861A (en) 2019-11-05
CN110413861B true CN110413861B (en) 2021-10-22

Family

ID=68362839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910670515.5A Active CN110413861B (en) 2019-07-23 2019-07-23 Link extraction method, device, equipment and storage medium based on web crawler

Country Status (1)

Country Link
CN (1) CN110413861B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753162A (en) * 2020-06-29 2020-10-09 平安国际智慧城市科技股份有限公司 Data crawling method, device, server and storage medium
CN113065051B (en) * 2021-04-02 2022-04-15 西南石油大学 Visual agricultural big data analysis interactive system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202467A (en) * 2016-07-18 2016-12-07 浪潮集团有限公司 Peer-to-peer network-oriented web crawler method capable of defining search key points
CN108334591A (en) * 2018-01-30 2018-07-27 天津中科智能识别产业技术研究院有限公司 Industry analysis method and system based on focused crawler technology
CN108628871A (en) * 2017-03-16 2018-10-09 哈尔滨英赛克信息技术有限公司 A kind of link De-weight method based on chain feature
CN109561163A (en) * 2017-09-27 2019-04-02 阿里巴巴集团控股有限公司 The generation method and device of uniform resource locator rewriting rule

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6928616B2 (en) * 2001-09-20 2005-08-09 International Business Machines Corporation Method and apparatus for allowing one bookmark to replace another

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202467A (en) * 2016-07-18 2016-12-07 浪潮集团有限公司 Peer-to-peer network-oriented web crawler method capable of defining search key points
CN108628871A (en) * 2017-03-16 2018-10-09 哈尔滨英赛克信息技术有限公司 A kind of link De-weight method based on chain feature
CN109561163A (en) * 2017-09-27 2019-04-02 阿里巴巴集团控股有限公司 The generation method and device of uniform resource locator rewriting rule
CN108334591A (en) * 2018-01-30 2018-07-27 天津中科智能识别产业技术研究院有限公司 Industry analysis method and system based on focused crawler technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
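Title
"DOM-based automatic extraction of topical information from web pages" (基于DOM的网页主题信息自动提取); Wang Qi et al.; Journal of Computer Research and Development (计算机研究与发展); 20041031; pp. 1786-1792 *
"URL attribute integration method based on link path search" (基于链接路径搜索的URL属性集成方法); Ma Yanhong et al.; Computer Engineering (计算机工程); 20130131; pp. 76-79 *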

Also Published As

Publication number Publication date
CN110413861A (en) 2019-11-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20191105

Assignee: Xiangyang Goode Cultural Technology Co.,Ltd.

Assignor: South Central University for Nationalities

Contract record no.: X2023980041350

Denomination of invention: Link extraction method, device, equipment and storage medium based on web crawler

Granted publication date: 20211022

License type: Common License

Record date: 20230908

Application publication date: 20191105

Assignee: Hubei Fengyun Technology Co.,Ltd.

Assignor: South Central University for Nationalities

Contract record no.: X2023980041308

Denomination of invention: Link extraction method, device, equipment and storage medium based on web crawler

Granted publication date: 20211022

License type: Common License

Record date: 20230908