CN111143720A - URL duplicate removal method, device and storage medium - Google Patents

URL duplicate removal method, device and storage medium

Info

Publication number
CN111143720A
Authority
CN
China
Prior art keywords
url
crawled
processing result
data list
bloom filter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811312516.4A
Other languages
Chinese (zh)
Inventor
曾庆维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SF Technology Co Ltd
SF Tech Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SF Technology Co Ltd filed Critical SF Technology Co Ltd
Priority to CN201811312516.4A priority Critical patent/CN111143720A/en
Publication of CN111143720A publication Critical patent/CN111143720A/en
Pending legal-status Critical Current

Abstract

The application discloses a URL duplicate removal method, device and storage medium, wherein the method comprises the following steps: acquiring a URL to be crawled corresponding to a webpage to be crawled; performing hash processing on a feature of the URL to be crawled to obtain a processing result of the URL to be crawled; judging whether the processing result is in a bloom filter, and if the processing result is in the bloom filter, judging whether the feature of the URL is in a pre-established data list; and if the feature is in the data list, discarding the URL to be crawled. According to the URL duplicate removal method, the pre-established data list is used to confirm again whether the URL to be crawled has already been crawled, which compensates for the misjudgment of the bloom filter, avoids rejecting URLs that have not actually been crawled because of such misjudgment, and improves the accuracy of URL deduplication.

Description

URL duplicate removal method, device and storage medium
Technical Field
The present application relates generally to the field of computer technologies, and in particular, to a URL deduplication method, apparatus, and storage medium.
Background
In the process of acquiring information with a search engine, a web crawler is a program or script that actively grabs information from the internet; it downloads webpages from the internet to local storage to form a mirror backup of internet content and provides a data source for users. In order to obtain as much network information as possible, web crawlers are typically distributed across multiple machine clusters for crawling.
In order to avoid repeated crawling of the crawled web pages, a Uniform Resource Locator (URL) corresponding to the crawled web pages needs to be deduplicated. The currently commonly used deduplication methods are database-based deduplication, memory-based deduplication, disk path-based deduplication, and bloom filter-based deduplication.
When a bloom filter is used for URL deduplication, whether the hash function value of the URL to be crawled is in the bloom filter is used to determine whether the URL has been crawled. However, when the hash function value of a crawled URL is inserted into the bloom filter, bits at other positions may also be set to 1, so that an un-crawled URL can be misjudged as crawled even though it has not actually been crawled. The judgment result is therefore inaccurate, which affects deduplication efficiency.
Disclosure of Invention
In view of the foregoing drawbacks and deficiencies of the prior art, it is desirable to provide a URL deduplication method, apparatus and storage medium, so as to improve the accuracy of URL deduplication.
In a first aspect, an embodiment of the present application provides a method for removing duplicate URLs, where the method includes:
acquiring a URL to be crawled corresponding to a webpage to be crawled;
performing hash processing on the characteristics of the URL to be crawled to obtain a processing result of the URL to be crawled;
judging whether the processing result is in a bloom filter or not, and if the processing result is in the bloom filter, judging whether the characteristic is in a pre-established data list or not, wherein the data list comprises at least one characteristic of a crawled URL;
if the feature of the URL to be crawled is in the data list, the URL to be crawled is discarded.
In a second aspect, an embodiment of the present application provides an apparatus for removing duplicate URLs, the apparatus including:
the acquisition module is used for acquiring the URL to be crawled corresponding to the webpage to be crawled;
the processing module is used for carrying out Hash processing on the characteristics of the URL to be crawled to obtain a processing result of the URL to be crawled;
the first judgment module is used for judging whether the processing result is in the bloom filter or not;
a second judging module, configured to, when the processing result is in the bloom filter, judge whether the feature of the URL to be crawled is in a pre-established data list, where the data list includes at least one feature of a crawled URL;
and the abandoning module is used for abandoning the URL to be crawled when the processing result is in the bloom filter and the characteristic of the URL to be crawled is in the data list.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, the computer program being configured to implement the URL deduplication method according to the first aspect.
To sum up, according to the URL deduplication method, apparatus, and storage medium provided in the embodiments of the present application, hash processing is performed on a feature of the URL corresponding to an acquired webpage to be crawled, and it is then determined whether the processing result is in a bloom filter. When the processing result is in the bloom filter, it is further determined whether the feature is in a pre-established data list; only when the feature is also in the data list is the URL to be crawled judged to have been crawled and discarded.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flowchart illustrating a URL deduplication method provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a URL deduplication method according to yet another embodiment of the present application;
FIG. 3 is a schematic structural diagram of a URL deduplication apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a computer system of a server according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
It can be understood that, when a web crawler (a program or script) crawls the internet, a portion of seed uniform resource locators (URLs), such as URLs conforming to a predetermined format, may be selected first. A Uniform Resource Locator (URL) is a representation of the location and access method of a resource obtained from the internet; each file on the internet has a unique URL, which contains information indicating where the file is located and how the browser should handle it. As information acquisition resources, these URLs are put into a queue of URLs to be crawled to obtain crawling objects. A crawling object is analyzed to obtain the webpage content corresponding to its URL; this webpage content is stored in a corresponding storage device or piece of equipment, and the URL is then put into a crawled queue, so as to avoid repeatedly crawling the same webpage and increasing the load on the server.
It can be understood that the URL deduplication method provided in the embodiments of the present application may be applied to a server or a device that performs URL deduplication.
For convenience of understanding and explanation, the URL deduplication method and apparatus provided by the embodiments of the present application are explained in detail below with reference to fig. 1 to 4.
Fig. 1 is a schematic flowchart of a URL deduplication method provided in an embodiment of the present application, and as shown in fig. 1, the method may include:
and S1, acquiring the URL to be crawled corresponding to the webpage to be crawled.
And S2, performing hash processing on the characteristics of the URL to be crawled to obtain the processing result of the URL to be crawled.
S3, it is judged whether or not the processing result is in the bloom filter.
S4, judging whether the characteristic is in a pre-established data list, wherein the data list comprises at least one characteristic of the crawled URL.
S5, abandoning the URL to be crawled.
In the embodiments of the application, URLs can be obtained from webpages through Java code, or automatically extracted from webpages through crawler tools such as JSON-handle, User-Agent Switcher, and the like. In the process of capturing webpages, new URLs are continuously extracted from the current page and put into the queue until certain stop conditions of the system are met. The URL of the next webpage to be captured is selected from the queue according to a certain search strategy, and the process is repeated until a certain condition of the system is reached. All crawled webpages are stored by the system, then analyzed, filtered, and indexed for later query and retrieval.
And acquiring a URL corresponding to a webpage to be crawled, wherein the URL to be crawled can be a link address corresponding to the webpage. After the URL to be crawled is acquired, hash processing may be performed on the acquired features of the URL to be crawled to obtain a processing result. That is, the hash value corresponding to the feature of the URL to be crawled may be calculated by a hash algorithm, for example, all or part of the character string of the URL to be crawled is converted into a binary hash value.
The feature of the URL may be, for example, the complete character string of its link address, or a specific string, such as the numeric identifier string in the URL. For example, all the characters of the URL https://item.jd.com/5461975.html may be hashed.
The hash processing may be, for example, additive hashing, shift hashing, multiplicative hashing, division hashing, table-lookup hashing, hybrid hashing, or the like. Shift hashing traverses the elements in the data and shifts the running value at each step, as in the sketch below.
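The following is a minimal, illustrative shift-hash sketch in Python, not an algorithm mandated by this application; the function name, shift amounts, and bit width are assumptions made for the example.

```python
def shift_hash(text: str, num_bits: int = 32) -> int:
    """Illustrative shift hash: traverse the characters and fold each one
    into the running value with shifts and an XOR."""
    mask = (1 << num_bits) - 1
    value = 0
    for ch in text:
        # shift the current value, then mix in the next character code
        value = ((value << 5) ^ (value >> 2) ^ ord(ch)) & mask
    return value

# For example, shift_hash("https://item.jd.com/5461975.html") yields a 32-bit integer;
# format(value, "b") gives the corresponding binary character string.
```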
In the embodiments of the present application, one hash algorithm or a combination of several hash algorithms may be used to hash the feature of the URL to be crawled. It is then judged whether the calculated processing result of the URL to be crawled, i.e., the hash value, is in the bloom filter, and according to that judgment it is decided whether to further judge whether the feature of the URL to be crawled is in the pre-established data list. That is, when the processing result is in the bloom filter, it is further determined whether the feature of the URL to be crawled is in the pre-established data list.
The bloom filter is a bit array with m bits. To query whether the processing result of the URL to be crawled is in the bloom filter, the processing result is used as the input of k hash functions to obtain k array positions. As long as any of these positions is 0, the element is definitely not in the set; when an element is inserted, all of its k positions are set to 1. If all k positions are 1, the processing result may be in the set, but these positions may also have been set to 1 accidentally during the insertion of other processing results, which results in a "false positive".
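A minimal sketch of such a bloom filter, assuming an m-bit array and k positions derived by salting one base hash (an illustrative choice, not one specified by this application):

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m = m              # number of bits in the bit array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # every position starts at 0

    def _positions(self, item: str):
        # derive k array positions; salting MD5 stands in for k independent hash functions
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1  # set all k positions to 1 when inserting

    def might_contain(self, item: str) -> bool:
        # any position still 0 -> definitely not in the set;
        # all positions 1 -> possibly in the set (false positives are possible)
        return all(self.bits[pos] == 1 for pos in self._positions(item))
```

In practice, m and k would be sized from the expected number of URLs and the tolerable false-positive rate.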
Therefore, when the processing result is judged to be in the bloom filter, it cannot yet be accurately determined that the URL to be crawled corresponding to the processing result has already been crawled. In that case, it is further judged whether the feature of the URL to be crawled is in the pre-established data list; only when the feature of the URL to be crawled is in the data list has the URL indeed been crawled, at which point the URL to be crawled can be discarded, completing its deduplication. The pre-established data list stores the features of crawled URLs, and further verifying whether the URL to be crawled has been crawled through feature comparison improves the accuracy of URL deduplication.
The URL duplicate removal method provided by the embodiments of the application judges whether the hash processing result of the URL to be crawled is in the bloom filter and, when it is, further judges whether the feature of the URL to be crawled is in the pre-established data list. This accurately determines whether the URL to be crawled has already been crawled, compensates for the misjudgment of the bloom filter, improves the accuracy of URL deduplication, and thereby improves the crawling efficiency of the web crawler.
In another embodiment provided by the present application, in order to improve the URL deduplication efficiency, a preliminary judgment may be performed on the obtained URL to be crawled. The URL deduplication method provided in another embodiment of the present application is explained in detail by fig. 2.
Fig. 2 is a flowchart illustrating a URL deduplication method according to another embodiment of the present application, as shown in fig. 2, the method may include:
and S10, acquiring the URL to be crawled corresponding to the webpage to be crawled.
And S20, judging whether the URL to be crawled conforms to a preset format.
Specifically, after the URL to be crawled is acquired, preliminary judgment can be performed on the URL to be crawled. It can be understood that, when the webpage is crawled, the basic format that the URL corresponding to the webpage to be crawled conforms to can be preset. Then in the determination, if the obtained URL to be crawled does not conform to the predetermined format, the URL to be crawled may be directly discarded, and the process returns to S10 to continue to obtain the next URL. If the predetermined format is met, S30 may be performed.
The predetermined format may specify, for example, each portion that the URL character string needs to include, such as the protocol, server name, path, and file name. That is, each portion of the URL to be crawled is compared with the corresponding portion of the predetermined format to see whether they are consistent.
For example, in one application scenario, in the express delivery industry there is a need to classify the types of consigned articles reasonably. The classification can follow existing classification schemes so as to suit users' needs. For example, the commodity classification used on existing shopping platforms can be applied well to the classification of consigned articles. Therefore, to obtain the classification of a commodity, the webpage of that commodity on a shopping platform needs to be crawled. For example, on the Jingdong shopping platform, a large amount of Jingdong webpage information can be crawled to obtain the three-level classification of each commodity.
It will be appreciated that the URL of the webpage for the three-level category of a commodity will typically have the following format: https://item.jd.com/XXXX.html, such as https://item.jd.com/5461975.html and https://item.jd.com/4143668.html.
Therefore, the format of the URL corresponding to the webpage to be crawled can be preset to https://item.jd.com/XXXX.html. Then, if a URL such as https://item.jd.hk/4806715.html is obtained, the webpage in this format does not have the three-level classification. That is, the obtained URL does not conform to the predetermined format, so the URL may be directly discarded and the next URL obtained. A minimal sketch of such a format check is given below.
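The sketch assumes the three-level-classification pages follow the https://item.jd.com/XXXX.html pattern described above; the regular expression and function name are illustrative assumptions, not part of the application.

```python
import re

# Illustrative predetermined format: protocol, server name, and a numeric file name
URL_FORMAT = re.compile(r"^https://item\.jd\.com/\d+\.html$")

def matches_predetermined_format(url: str) -> bool:
    return URL_FORMAT.match(url) is not None

# matches_predetermined_format("https://item.jd.com/5461975.html")  -> True
# matches_predetermined_format("https://item.jd.hk/4806715.html")   -> False (discard, fetch next URL)
```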
And S30, judging whether the characteristics of the URL to be crawled include specified characteristics, wherein the specified characteristics are ID identifiers of the URL to be crawled.
Specifically, the URL to be crawled that satisfies the predetermined format may be further determined to determine whether it includes the specified feature. If the specified feature is included, S40 can be performed, otherwise, the URL can be discarded directly, returning to S10 to continue to fetch the next URL.
For example, it may be determined whether the obtained URL to be crawled includes an ID identifier. If the obtained URL is https://item.jd.com/5461975.html, it is determined that the URL to be crawled comprises the ID identifier 5461975.
It is to be understood that the execution sequence of S20 and S30 is not limited in this embodiment. For example, in another embodiment, a determination may be made as to whether the obtained URL includes the specified feature, and after determining that the specified feature is included, a determination may be made as to whether the predetermined format is met.
It is also understood that in another embodiment, it may be determined only whether the obtained URL to be crawled conforms to the predetermined format, and after conforming to the predetermined format, S30 may be skipped and S50 may be directly performed.
Or after the to-be-crawled URL corresponding to the webpage to be crawled is obtained, only whether the obtained to-be-crawled URL comprises the specified features or not is judged, and whether the obtained to-be-crawled URL accords with the preset format or not is not judged. That is, S30 may be directly performed after S10, skipping S20.
And S40, extracting the specified characteristics of the URL to be crawled.
For example, the ID identifier 5461975 of https://item.jd.com/5461975.html is extracted, as in the sketch below.
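A minimal sketch of the specified-feature check and extraction in S30/S40, assuming the ID identifier is the numeric part of the file name; the pattern and function name are illustrative.

```python
import re

ID_PATTERN = re.compile(r"/(\d+)\.html$")

def extract_id(url: str):
    """Return the numeric ID identifier if the URL contains one, otherwise None."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None

# extract_id("https://item.jd.com/5461975.html") -> "5461975"
# extract_id("https://item.jd.com/index.html")   -> None (no specified feature, discard)
```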
According to the URL deduplication method provided by the application, judging whether the URL conforms to the predetermined format and/or includes the specified feature improves the processing efficiency of URLs, and extracting the specified feature included in the URL and performing hash processing on the extracted specified feature improves the efficiency of URL deduplication.
And S50, performing hash processing on the characteristics of the URL to be crawled to obtain the processing result of the URL to be crawled.
Specifically, after S40 is executed, the extracted specified feature of the URL to be crawled may be hashed in this step. For example, the ID identifier 5461975 of https://item.jd.com/5461975.html is hashed to obtain the processing result, i.e., the ID identifier 5461975 is converted into a binary character string.
If S40 is not executed, the entire character string of the acquired URL to be crawled may be hashed; for example, hashing https://item.jd.com/5461975.html as a whole yields the processing result.
S60, it is judged whether or not the processing result is in the bloom filter.
Specifically, it may be determined whether the processed ID or URL is in the bloom filter. If the processing result is not in the bloom filter, indicating that the URL to be crawled has definitely not been crawled, S61 may be performed.
If the processing result is in the bloom filter, the URL to be crawled may or may not have been crawled, owing to the small probability of misjudgment by the bloom filter; at this time, S70 may be executed.
It can be understood that, in the embodiment of the present application, when determining whether the processed ID or URL is in the bloom filter, the bloom filter involved is a bit array of m bits, with all bits initially 0, and k different hash functions conforming to a uniform random distribution are defined. For each element to be added to the bloom filter, the k hash functions generate k positions, and the corresponding positions of the bit array are all set to 1. That is, each function maps a set element to one of the m bits of the bit array.
Then, for a new element to be added, it is first necessary to determine whether the element has already been added to the bloom filter. In this embodiment, it is determined whether the obtained hash value of the URL is in the bloom filter; that is, the calculated hash value of the URL is used as input to the k hash functions to obtain k array positions. If any of the corresponding positions is 0, the hash value corresponding to the URL is definitely not in the bloom filter; at this time, the hash value of the new element, i.e., of the URL, may be added to the bloom filter, i.e., S61. If all the positions are 1, this indicates that the element appears to be in the previously inserted set.
It will be appreciated that when all positions are 1, the hash value corresponding to the URL may or may not be in the bloom filter, since the element at the array position may be accidentally set to 1 during the insertion of other elements. At this time, S70 needs to be executed.
S61, the URL to be crawled is placed in a pre-established queue to be crawled; after the URL to be crawled has been crawled, the processing result is added into the bloom filter, and the feature of the URL to be crawled is added into the data list.
For example, the URL to be crawled is placed in the queue to be crawled, and after it has been crawled, its hash value may be put into the bloom filter, while the full character string https://item.jd.com/5461975.html of the URL, or its ID identifier 5461975, is put into the pre-established data list.
And S70, judging whether the characteristics of the URL to be crawled are in the data list.
And S71, putting the URL to be crawled into a queue to be crawled, and putting the characteristics of the URL to be crawled into the data list.
S80, abandoning the URL to be crawled.
Specifically, if it is determined that the processing result is in the bloom filter, in order to further determine whether the URL to be crawled is crawled, it may be determined whether the feature of the URL to be crawled is in the data list.
If the feature of the URL to be crawled is not found in the data list, which indicates that the URL to be crawled has not been crawled, S71 may be executed to place the URL to be crawled into a queue to be crawled, and to place the feature of the URL to be crawled into the data list. It may then return to S10 to continue to obtain the next URL.
If the feature of the URL to be crawled is found in the data list, which indicates that the URL to be crawled has already been crawled, S80 may be executed to discard the URL to be crawled, and the flow for that URL ends; the method may then return to S10 to continue to obtain the next URL. The sketch below ties these steps together.
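The following sketch combines S60 to S80, reusing the shift_hash, BloomFilter, and extract_id sketches above and a Python set as the data list. The function name and return values are illustrative, and for brevity the bloom filter and data list are updated as soon as the URL is queued, whereas the embodiment records them after the crawl completes.

```python
def handle_url(url: str, bloom: BloomFilter, data_list: set, crawl_queue: list) -> str:
    """Decide whether a URL still needs crawling and keep both structures in sync."""
    feature = extract_id(url) or url            # specified feature, or the full string
    result = str(shift_hash(feature))           # processing result of the feature

    if not bloom.might_contain(result):
        # S61: definitely not crawled yet -> queue it and record it
        crawl_queue.append(url)
        bloom.add(result)
        data_list.add(feature)
        return "queued"

    if feature not in data_list:
        # S70/S71: the bloom filter hit was a false positive -> still crawl it
        crawl_queue.append(url)
        data_list.add(feature)
        return "queued"

    # S80: confirmed as already crawled -> discard
    return "discarded"
```

For example, calling handle_url("https://item.jd.com/5461975.html", BloomFilter(m=1 << 20, k=7), set(), []) returns "queued" the first time and "discarded" when the same structures see the URL again.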
It will be appreciated that the data list contains at least one feature of the crawled URLs. The data list is established in advance as an empty list; when a URL is crawled, the feature of that crawled URL is put into the list, so that all crawled URLs are recorded and it can subsequently be judged whether a URL to be crawled has already been crawled. In other words, an additional data structure is maintained for subsequent verification.
In the embodiments of the application, when the processing result of the URL to be crawled is not in the bloom filter, or when the processing result is in the bloom filter but the feature of the URL is not in the data list, the feature is added to the data list. The record of crawled URL features in the data list is thereby kept up to date, which improves URL deduplication efficiency.
Optionally, the data list may be an array or a hash table, or other data storage manner that facilitates subsequent lookup and verification. This is not limited by the present application.
Fig. 3 shows a URL deduplication apparatus provided in an embodiment of the present application, where the apparatus 300 may include:
the obtaining module 310 is configured to obtain a URL to be crawled corresponding to a webpage to be crawled.
The processing module 320 is configured to perform hash processing on the feature of the URL to be crawled to obtain a processing result of the URL to be crawled.
The first determining module 330 is configured to determine whether the processing result is in a bloom filter.
A second determining module 340, configured to determine whether the feature of the URL to be crawled is in a pre-established data list when the processing result is in the bloom filter, where the data list includes at least one feature of a crawled URL.
And a discarding module 350, configured to discard the URL to be crawled when the processing result is in the bloom filter and the feature of the URL to be crawled is in the data list.
Preferably, the URL deduplication apparatus provided in this application further includes:
and the crawling adding module 360 is configured to, when the first determination result indicates that the processing result is not in the bloom filter, place the feature of the URL to be crawled into a pre-established queue to be crawled, add the processing result to the bloom filter after crawling the URL to be crawled is completed, and add the feature of the URL to be crawled to the data list.
Preferably, the URL deduplication apparatus provided in this application further includes:
a feature adding module 370, configured to, when the second determination result indicates that the feature of the URL to be crawled is not in the data list, place the feature of the URL to be crawled into the data list.
Preferably, the URL deduplication apparatus provided in this application may further include:
the third determining module 380 is configured to determine whether the URL to be crawled conforms to a predetermined format and/or determine whether the features of the URL include a specific feature, where the specific feature is an ID identifier of the URL to be crawled.
Preferably, in the URL deduplication apparatus provided in this application, the processing module is specifically configured to:
and extracting the specified characteristics of the URL to be crawled.
And carrying out Hash processing on the specified characteristics to obtain a processing result of the URL to be crawled.
Preferably, in the URL deduplication apparatus provided by this application, the data list may be an array or a hash table.
It can be understood that an embodiment of the present application also provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the program, the processor implements the above URL deduplication method: acquiring a URL to be crawled corresponding to a webpage to be crawled; performing hash processing on the feature of the URL to be crawled to obtain a processing result of the URL to be crawled; judging whether the processing result is in a bloom filter and, if the processing result is in the bloom filter, judging whether the feature is in a pre-established data list; and if the feature of the URL to be crawled is in the data list, discarding the URL to be crawled.
Referring now to FIG. 4, a block diagram of a computer system 400 suitable for use in implementing a server according to embodiments of the present application is shown.
As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU) 401 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card or a modem. The communication section 409 performs communication processing via a network such as the internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 410 as necessary, so that a computer program read out therefrom is installed into the storage section 408 as necessary.
In particular, according to the URL deduplication method embodiments of the present application, the processes described above with reference to fig. 1 and 2 may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the methods of fig. 1 and 2. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor, which may be described as: a processor comprising an acquisition module, a processing module, a first judgment module and an abandoning module. The names of these units or modules do not in some cases limit the units or modules themselves; for example, the obtaining module may also be described as "a module for obtaining a URL to be crawled corresponding to a webpage to be crawled".
As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the foregoing device in the foregoing embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described herein for URL deduplication.
To sum up, in the URL deduplication method, apparatus, and storage medium provided in the embodiments of the present application, hash processing is performed on a feature of the URL to be crawled corresponding to an acquired webpage to be crawled, and it is then determined whether the processing result is in a bloom filter. When the processing result is in the bloom filter, it is further determined whether the feature is in a pre-established data list; when the processing result is in the bloom filter and the feature is also in the data list, it is determined that the URL has already been crawled, and the URL is discarded.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A method for URL deduplication, the method comprising:
acquiring a URL to be crawled corresponding to a webpage to be crawled;
performing Hash processing on the characteristics of the URL to be crawled to obtain a processing result of the URL to be crawled;
judging whether the processing result is in a bloom filter or not, and if the processing result is in the bloom filter, judging whether the characteristic is in a pre-established data list or not, wherein the data list comprises at least one characteristic of a crawled URL;
and if the characteristics of the URL to be crawled are in the data list, discarding the URL to be crawled.
2. The URL deduplication method of claim 1, further comprising:
if the processing result is not in the bloom filter, putting the URL to be crawled into a pre-established queue to be crawled, adding the processing result into the bloom filter after the URL to be crawled is crawled, and adding the characteristics of the URL to be crawled into the data list;
if the characteristics of the URL to be crawled are not in the data list, the characteristics of the URL to be crawled are put into the data list.
3. The URL deduplication method according to claim 1 or 2, wherein after acquiring the URL to be crawled corresponding to the webpage to be crawled, before hashing the features of the URL to be crawled, the method further comprises:
and judging whether the URL to be crawled accords with a preset format and/or judging whether the characteristics of the URL to be crawled comprise specified characteristics, wherein the specified characteristics are the ID identifiers of the URL to be crawled.
4. The URL deduplication method according to claim 3, wherein when it is determined that the specified feature is included in the features of the URL to be crawled, the hashing the features of the URL to be crawled to obtain the processing result of the URL to be crawled comprises:
extracting the specified features of the URL to be crawled;
and carrying out Hash processing on the specified characteristics to obtain a processing result of the URL to be crawled.
5. An apparatus for URL deduplication, the apparatus comprising:
the acquisition module is used for acquiring the URL to be crawled corresponding to the webpage to be crawled;
the processing module is used for carrying out Hash processing on the characteristics of the URL to be crawled to obtain a processing result of the URL to be crawled;
the first judgment module is used for judging whether the processing result is in the bloom filter or not;
a second determining module, configured to determine whether the feature of the URL to be crawled is in a pre-established data list when the processing result is in the bloom filter, where the data list includes at least one feature of a crawled URL;
and the abandoning module is used for abandoning the URL to be crawled when the processing result is in the bloom filter and the characteristics of the URL to be crawled are in the data list.
6. The URL deduplication apparatus of claim 5, wherein the apparatus further comprises:
the crawling adding module is used for putting the characteristics of the URL to be crawled into a pre-established queue to be crawled when the processing result is not in the bloom filter, adding the processing result into the bloom filter after the URL to be crawled is crawled, and adding the characteristics of the URL to be crawled into the data list;
and the characteristic adding module is used for putting the characteristics of the URL to be crawled into the data list when the characteristics of the URL to be crawled are not in the data list.
7. The URL deduplication apparatus of claim 5 or 6, wherein the apparatus further comprises:
and the third judging module is used for judging whether the URL to be crawled accords with a preset format and/or judging whether the characteristics of the URL to be crawled comprise specified characteristics, and the specified characteristics are the ID identifiers of the URL to be crawled.
8. The URL deduplication apparatus of claim 7, wherein the processing module is specifically configured to:
extracting the specified features of the URL to be crawled;
and carrying out Hash processing on the specified characteristics to obtain a processing result of the URL to be crawled.
9. The URL deduplication apparatus of claim 5, wherein the data list is an array or a hash table.
10. A computer-readable storage medium, having stored thereon a computer program for implementing the URL deduplication method as recited in any one of claims 1-4.
CN201811312516.4A 2018-11-06 2018-11-06 URL duplicate removal method, device and storage medium Pending CN111143720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811312516.4A CN111143720A (en) 2018-11-06 2018-11-06 URL duplicate removal method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811312516.4A CN111143720A (en) 2018-11-06 2018-11-06 URL duplicate removal method, device and storage medium

Publications (1)

Publication Number Publication Date
CN111143720A true CN111143720A (en) 2020-05-12

Family

ID=70515935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811312516.4A Pending CN111143720A (en) 2018-11-06 2018-11-06 URL duplicate removal method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111143720A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663058A (en) * 2012-03-30 2012-09-12 华中科技大学 URL duplication removing method in distributed network crawler system
CN104899323A (en) * 2015-06-19 2015-09-09 成都国腾实业集团有限公司 Crawler system used for IDC harmful information monitoring platform
CN106649346A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Data repeatability check method and apparatus
CN106886602A (en) * 2017-03-02 2017-06-23 上海斐讯数据通信技术有限公司 A kind of application crawler method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112436943A (en) * 2020-10-29 2021-03-02 南阳理工学院 Request deduplication method, device, equipment and storage medium based on big data
CN113377812A (en) * 2021-01-08 2021-09-10 北京数衍科技有限公司 Order duplication eliminating method and device for big data

Similar Documents

Publication Publication Date Title
US10250526B2 (en) Method and apparatus for increasing subresource loading speed
EP3251031B1 (en) Techniques for compact data storage of network traffic and efficient search thereof
US8515935B1 (en) Identifying related queries
CN108228799B (en) Object index information storage method and device
US8041893B1 (en) System and method for managing large filesystem-based caches
EP1713010A2 (en) Using attribute inheritance to identify crawl paths
CN105302815B (en) The filter method and device of the uniform resource position mark URL of webpage
CN108959359B (en) Uniform Resource Locator (URL) semantic deduplication method, device, equipment and medium
CN109241003B (en) File management method and device
KR102018445B1 (en) Compression of cascading style sheet files
CN111143720A (en) URL duplicate removal method, device and storage medium
CN106547803B (en) Method and device for crawling incremental resources of website
CN114911830A (en) Index caching method, device, equipment and storage medium based on time sequence database
CN111368227A (en) URL processing method and device
CN109040346B (en) Method, device and equipment for screening effective domain names in extensive domain name resolution
CN107301186B (en) Invalid data identification method and device
CN105468412B (en) Dynamic packaging method and device
CN106339372B (en) Method and device for optimizing search engine
CN110825947B (en) URL deduplication method, device, equipment and computer readable storage medium
CN112287201A (en) Method, device, medium and electronic equipment for removing duplicate of crawler request
CN106126670B (en) Operation data sorting processing method and device
CN105653540B (en) Method and device for processing file attribute information
CN108595453B (en) URL (Uniform resource locator) identifier mapping obtaining method and device
CN104636384B (en) A kind of method and device handling document
CN104899320A (en) Webpage repair method, terminal, server and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination