CN110147473B

CN110147473B - Crawling method and device for crawler

Info

Publication number: CN110147473B
Application number: CN201710749195.3A
Authority: CN
Inventors: 何熠皓
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2017-08-28
Filing date: 2017-08-28
Publication date: 2022-03-01
Anticipated expiration: 2037-08-28
Also published as: CN110147473A

Abstract

The invention discloses a crawling method and a crawling device for crawlers, relates to the technical field of computers, and mainly aims to enable data crawled by the crawlers to cover more page levels, wherein the main technical scheme of the crawling method is as follows: acquiring all URL links corresponding to each page under the same level to be crawled; extracting a preset number of URL links from all URL links corresponding to each page, and putting the URL links into a queue to be crawled; and crawling the page content in the page corresponding to the URL link by taking the URL link in the queue to be crawled as an entrance. The method and the device are mainly used for crawling of the URL links in the page.

Description

Crawling method and device for crawler

Technical Field

The invention relates to the technical field of computers, in particular to a crawling method and device for crawlers.

Background

With the continued development of cloud computing and big data technology, techniques for searching and mining the large volume of structured and unstructured information on web pages have become an active research topic. Collecting data for analysis often consumes considerable time and effort, and in the big data era crawler technology has become an important way to acquire network data.

Crawler technology locates web pages through their link addresses: starting from one page of a website, the crawler reads the page content, finds other link addresses in it, and then follows those addresses to the next pages, repeating this cycle continuously. When crawling all pages of a website, a crawler generally adopts a breadth-first algorithm: starting from an entry page, it captures all links on the entry page as the first layer; then, for each of those links, it acquires all links on the corresponding page as the second layer; and so on, crawling down layer by layer until no new links are generated.
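As a point of reference, the breadth-first behaviour described above can be sketched as follows. This is only a minimal illustration over an in-memory link graph: the `LINKS` mapping, the function name `bfs_crawl` and the `max_pages` limit are assumptions made for the example and do not come from the original text.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
LINKS = {
    "entry": ["A", "B", "C"],
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2", "C3"],
}

def bfs_crawl(entry, links, max_pages):
    """Visit pages layer by layer, stopping once max_pages pages were visited."""
    queue = deque([entry])
    seen = {entry}
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        # Every link found on the page goes straight into the queue.
        for url in links.get(page, []):
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return visited

visited = bfs_crawl("entry", LINKS, max_pages=6)
```

With a budget of 6 pages, the crawl spends its budget on the entry page, the whole first layer and the first two children of page A, so the children of pages B and C are never reached.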

In many cases, because there are too many pages, the number of pages or the number of page levels to be crawled is limited in order to control the crawler's crawling time, for example by crawling only the first two levels or only the first 1,000,000 pages. With such a limit, however, the crawling method described above cannot capture data from deeper levels, so the data crawled within a given time is restricted, does not cover every component of the real website, and cannot satisfy the data requirements of many application scenarios, resulting in missing information in those scenarios.

Disclosure of Invention

In view of this, the present invention provides a crawling method and apparatus for crawlers, and mainly aims to enable data crawled by crawlers to cover more page levels.

In order to solve the above problems, the present invention mainly provides the following technical solutions:

in one aspect, an embodiment of the present invention provides a crawling method for crawlers, including:

acquiring all URL links corresponding to each page under the same level to be crawled;

extracting a preset number of URL links from all URL links corresponding to each page, and putting the URL links into a queue to be crawled;

and crawling the page content in the page corresponding to the URL link by taking the URL link in the queue to be crawled as an entrance.

Further, the extracting a preset number of URL links from all URL links corresponding to each page and placing the URL links into a queue to be crawled includes:

respectively extracting URL links with the same preset number from all URL links corresponding to each page, and putting the URL links into a queue to be crawled; or

setting weight values for different types of pages, extracting, from all URL links corresponding to each page, a preset number of URL links matching the weight value of the page type, and putting the URL links into a queue to be crawled.

Further, after the obtaining all URL links corresponding to each page at the same level to be crawled, the method further includes:

acquiring identification information of a page corresponding to the URL link;

establishing crawling directories corresponding to different pages according to the identification information;

putting all URL links corresponding to each page into a crawling directory corresponding to the page, and

the extracting a preset number of URL links from all URL links corresponding to each page and putting the URL links into a queue to be crawled includes: extracting a preset number of URL links from the crawling directory corresponding to each page, and putting the URL links into a queue to be crawled.

Further, before the extracting a preset number of URL links from the crawling directory corresponding to each page, the method further includes:

traversing and inquiring the crawling directory corresponding to each page according to a preset time interval;

and acquiring the number of URL links stored in the crawling directory corresponding to the page.

Further, the extracting a preset number of URL links from the crawling directory corresponding to each page includes:

when the number of the URL links in the crawling directory is smaller than the preset number, extracting all URL links from the crawling directory corresponding to the page;

and when the number of the URL links in the crawling directory is greater than or equal to the preset number, extracting the URL links in the preset number from the crawling directory corresponding to the page.

Further, after the crawling is performed on the page content in the page corresponding to the URL link by using the URL link in the queue to be crawled as an entry, the method further includes:

setting an end condition for the crawler's crawling of URL links in pages, and ending the crawling operation of the crawler when the end condition is reached.

In order to achieve the above object, according to another aspect of the present invention, there is provided a storage medium including a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the crawling method for crawlers.

In order to achieve the above object, according to another aspect of the present invention, there is provided a processor for executing a program, wherein the program executes the crawling method for crawlers described above.

On the other hand, the embodiment of the invention also provides a crawling device for crawlers, which comprises:

the first acquisition unit is used for acquiring all URL links corresponding to each page under the same level to be crawled;

the extraction unit is used for extracting a preset number of URL links from all URL links corresponding to each page and putting the URL links into a queue to be crawled;

and the crawling unit is used for crawling the page content in the page corresponding to the URL link by taking the URL link in the queue to be crawled as an entrance.

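Further, the extracting unit is specifically configured to extract the same preset number of URL links from all URL links corresponding to each page, and place the URL links into a queue to be crawled; or

the extracting unit is specifically configured to set weight values for different types of pages, extract, from all URL links corresponding to each page, a preset number of URL links matching the weight value of the page type, and place the URL links into a queue to be crawled.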

Further, the apparatus further comprises:

the second acquisition unit is used for acquiring the identification information of the page corresponding to the URL link;

the establishing unit is used for establishing crawling directories corresponding to different pages according to the identification information;

a storage unit for putting all URL links corresponding to each page into the crawling directory corresponding to the page, and

the extracting unit is specifically configured to extract a preset number of URL links from the crawling directory corresponding to each page, and place the URL links into a queue to be crawled.

Further, the apparatus further comprises:

the query unit is used for traversing and querying the crawling directory corresponding to each page according to a preset time interval;

and the third acquisition unit is used for acquiring the number of URL links stored in the crawling directory corresponding to the page.

Further, the extracting unit is specifically configured to extract all URL links from the crawl directory corresponding to the page when the number of URL links in the crawl directory is smaller than a preset number;

the extracting unit is specifically further configured to extract a preset number of URL links from the crawling directory corresponding to the page when the number of URL links in the crawling directory is greater than or equal to the preset number.

Further, the apparatus further comprises:

and the ending unit is used for setting an end condition for the crawler's crawling of URL links in pages, and ending the crawling operation of the crawler when the end condition is reached.

By the technical scheme, the technical scheme provided by the embodiment of the invention at least has the following advantages:

According to the crawling method and device for crawlers provided by the embodiment of the invention, all URL links corresponding to each page at the same level to be crawled are acquired; a preset number of URL links are then extracted from all URL links corresponding to each page and placed into a queue to be crawled, so that the URL links in the queue can cover more page levels; finally, the page content of the pages corresponding to the URL links is crawled by taking the URL links in the queue to be crawled as entries. Compared with the prior-art method of crawling pages with a breadth-first algorithm, in the embodiment of the invention the URL links that the crawler finds in a page are not placed directly into the queue to be crawled. Instead, a preset number of URL links are extracted from the URL links crawled from different pages and then placed into the queue, so that the queue contains URL links from more levels and the URLs in the queue cover a wider range of page data, including each component of the website. In addition, new crawling directories can be established continuously throughout the crawling process, and URL links are extracted from the new crawling directories and placed into the queue to be crawled, so that more data is crawled for the application scenarios and data of higher reference value is provided for subsequent website analysis.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a flowchart of a crawling method for crawlers according to an embodiment of the present invention;

FIG. 2 is a flowchart of another crawling method for crawlers according to an embodiment of the present invention;

FIG. 3 is a block diagram of a crawler crawling apparatus according to an embodiment of the present invention;

FIG. 4 is a block diagram of another crawling apparatus for crawlers according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

The embodiment of the invention provides a crawling method for crawlers. As shown in FIG. 1, by extracting a preset number of URL links from all URL links corresponding to each page and putting them into a queue to be crawled, the method can crawl URL links from more levels of pages. The specific steps are as follows:

101. Acquiring all URL links corresponding to each page under the same level to be crawled.

The web crawler is a program or script for automatically capturing world wide web information according to a certain rule, and generally uses the URL of a page as an entrance, crawls the URL link in the page, continuously extracts new URL links from the current page and puts the new URL links into a queue to be crawled until the stop condition set by the system is met.

For the embodiment of the invention, in the process of crawling URL links in pages, the crawler may obtain different numbers of URL links from the pages corresponding to different URL links, and all URL links corresponding to each page at the same level to be crawled are then acquired. For example, taking a given URL link as the entry, 3 URL links are crawled from the original page, corresponding to page A, page B and page C; these pages form the first level. Crawling page A yields 3 URL links, corresponding to page A1, page A2 and page A3; crawling page B yields page B1 and page B2; and crawling page C yields page C1, page C2 and page C3. These pages all belong to the second level.
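The example above can be written down as a small data structure. The sketch below is illustrative only: the `LINKS` mapping and the helper `links_per_page` are hypothetical names, and a real crawler would obtain the links by fetching and parsing each page rather than reading them from a dictionary.

```python
# Hypothetical link graph mirroring the example: page -> links found on that page.
LINKS = {
    "entry": ["A", "B", "C"],
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2", "C3"],
}

def links_per_page(level_pages, links):
    """Step 101: for every page of one level, collect all links found on it."""
    return {page: list(links.get(page, [])) for page in level_pages}

first_level = LINKS["entry"]  # pages A, B and C
second_level_links = links_per_page(first_level, LINKS)
```

Keeping the links grouped by the page they came from, instead of merging them into one list, is what later allows a fixed number of links to be taken from each page.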

102. Extracting a preset number of URL links from all URL links corresponding to each page, and putting the URL links into a queue to be crawled.

Each page has its own set of URL links crawled from it, and different pages may sit at different levels. To crawl page data from more levels, the embodiment of the invention extracts a preset number of URL links from all URL links corresponding to each page and puts them into the queue to be crawled, so that when crawling URL links in pages, the crawler can reach more URL links in the sub-level pages of the current level.

For the embodiment of the invention, when the crawler crawls the URL links of a page, the number of URL links obtained is not fixed but is updated continuously as crawling proceeds, until all URL links in the page have been crawled. Typically, after the crawler has crawled all URL links of a page, a preset number of URL links are selected from the crawling directory and placed into the queue to be crawled. The selection may also be performed before all URL links of the page have been crawled; in that case the preset number is usually set to a small value, such as 1 or 2, to ensure that the same number of URL links can be selected from different crawling directories each time. If the preset number is set too large, so that the number of URL links in a crawling directory is less than the preset number, all URL links in that crawling directory are selected.

When extracting the preset number of URL links from all URL links corresponding to each page, the same preset number of URL links may be extracted for every page, for example 3 URL links from each page. The preset number should not be set too large, so that the same number of URL links can be extracted from every page at every level. Alternatively, weight values may be set for different types of pages, and for each page a preset number of URL links matching the weight value of its page type is extracted. In this case the number of URL links extracted differs by page type, because different types of pages tend to contain different numbers of URL links. Pages that may contain a large number of URL links, such as Taobao pages and store home pages, are generally given a large weight, so that more URL links are extracted from them; pages that may contain few URL links, such as picture pages and reading pages, are generally given a small weight, so that fewer URL links are extracted from them.
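The two extraction options can be sketched as below. This is a hedged illustration, not the patented implementation: the function names, the page-type labels, the integer weights and the choice of taking the first links of each page are all assumptions, since the text does not specify which links are picked or how a weight maps to a count.

```python
def extract_uniform(links_by_page, preset):
    """Take the same number of links (at most `preset`) from every page."""
    queue = []
    for urls in links_by_page.values():
        queue.extend(urls[:preset])
    return queue

def extract_weighted(links_by_page, page_types, weights, base=1):
    """Take base * weight links from each page; the weight depends on page type."""
    queue = []
    for page, urls in links_by_page.items():
        quota = base * weights[page_types[page]]
        queue.extend(urls[:quota])
    return queue

# Illustrative data only.
links_by_page = {
    "store_home": ["s1", "s2", "s3", "s4", "s5"],
    "picture": ["p1", "p2"],
}
page_types = {"store_home": "store", "picture": "image"}
weights = {"store": 3, "image": 1}  # hypothetical weight per page type

uniform_queue = extract_uniform(links_by_page, preset=2)
weighted_queue = extract_weighted(links_by_page, page_types, weights)
```

In the weighted variant the link-rich store page contributes three links and the picture page one, while the uniform variant takes two from each.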

103. Crawling the page content in the page corresponding to the URL link by taking the URL link in the queue to be crawled as an entry.

It should be noted that the queue to be crawled is equivalent to a priority ordering: the crawler crawls the URL links one by one in the order in which they appear in the queue. This differs from the prior art, in which the crawler places the URL links found in crawled pages directly into the queue to be crawled and crawls the pages layer by layer until a preset condition is met.

According to the crawling method for crawlers provided by the embodiment of the invention, all URL links corresponding to each page at the same level to be crawled are acquired; a preset number of URL links are then extracted from all URL links corresponding to each page and placed into a queue to be crawled, so that the URL links in the queue can cover more page levels; finally, the page content of the pages corresponding to the URL links is crawled by taking the URL links in the queue to be crawled as entries. Compared with the prior-art method of crawling pages with a breadth-first algorithm, in the embodiment of the invention the URL links that the crawler finds in a page are not placed directly into the queue to be crawled. Instead, a preset number of URL links are extracted from the URL links crawled from different pages and then placed into the queue, so that the queue contains URL links from more levels and the URLs in the queue cover a wider range of page data, including each component of the website. In addition, new crawling directories can be established continuously throughout the crawling process, and URL links are extracted from the new crawling directories and placed into the queue to be crawled, so that more data is crawled for the application scenarios and data of higher reference value is provided for subsequent website analysis.

To describe the crawling method for crawlers in more detail, in particular the step of extracting a preset number of URL links from all URL links corresponding to each page and putting them into a queue to be crawled, the embodiment of the invention further provides another crawling method for crawlers. As shown in FIG. 2, the method includes the following specific steps:

201. Acquiring all URL links corresponding to each page under the same level to be crawled.

The specific implementation manner in this step is the same as that in step 101, and is not described herein again.

202. Acquiring the identification information of the page corresponding to the URL link.

For the embodiment of the invention, after a URL link has been crawled, the hyperlinks and HTML document content of the page can be identified, and the page data, such as pictures, text and URL links, is parsed from the HTML document content to obtain the identification information of the page corresponding to the URL link. The identification information is usually the page name, the page number or other identifying information, which is not limited in the embodiment of the invention. For example, when the page name is selected as the identification information, the page name corresponding to the URL link of page A is the Taobao login page, the page name corresponding to the URL link of page B is the Taobao "My Points" page, and the page name corresponding to the URL link of page C is the activity page of a certain Taobao store.

By acquiring the identification information of the pages corresponding to the URL links, the different crawled pages can be distinguished, so that crawling directories for the different pages can later be established according to the identification information and the web page information of the different pages can be crawled.

203. Establishing crawling directories corresponding to different pages according to the identification information.

The different URL links crawled by the crawler correspond to different pages, and different pages have different identification information, so a crawling directory is established for each page according to its identification information. For example, when the page name is selected as the identification information: the page name corresponding to the URL link of page A is the Taobao login page, so a crawling directory for the Taobao login page is established; the page name corresponding to the URL link of page B is the Taobao "My Points" page, so a crawling directory for the Taobao "My Points" page is established; and the page name corresponding to the URL link of page C is a Taobao activity page, so a crawling directory for the Taobao activity page is established.

For the embodiment of the invention, the crawl directory stores URL links in pages crawled by crawlers, for example, the crawl directory corresponding to the A page stores URL links crawled from the A page, the crawl directory corresponding to the B page stores URL links crawled from the B page, and the crawl directory corresponding to the C page stores URL links crawled from the C page.

204. Putting all URL links corresponding to each page into the crawling directory corresponding to the page.

According to the embodiment of the invention, all URL links corresponding to each page can be put into the crawling directories of the corresponding pages by establishing the crawling directories corresponding to different pages, so that the crawled URL links are managed conveniently, and the process of crawling the URL links in the pages by the whole crawler is more organized.

For example, taking a given URL link as the entry, 3 URL links are crawled from the original page, corresponding to page A, page B and page C, which are the first-level pages. For these 3 pages, crawling directory A corresponding to page A, crawling directory B corresponding to page B and crawling directory C corresponding to page C are established, and the URL links crawled from page A, page B and page C are placed into the respective crawling directories.

It should be noted that, in the process of crawling the URL link in the page by the crawler, a new URL link and a page corresponding to the new URL link are crawled continuously, so that a new crawling directory is continuously established, and the URL link crawled from the page corresponding to the new URL link is further stored in the crawling directory corresponding to the page.

As another example, starting from page A, page B and page C of the first level, the crawler continues to crawl the URL links in these pages: 2 URL links are crawled from page A, corresponding to page A1 and page A2; 4 URL links are crawled from page B, corresponding to page B1, page B2, page B3 and page B4; and 3 URL links are crawled from page C, corresponding to page C1, page C2 and page C3. These are the second-level pages.
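A crawling directory can be pictured as a per-page container keyed by the page's identification information. The following sketch assumes a plain dictionary of lists; the name `store_links`, the use of page letters as identification information, and the duplicate check are assumptions for illustration rather than details from the text.

```python
def store_links(crawl_dirs, page_id, urls):
    """Put the links crawled from a page into that page's crawling directory."""
    directory = crawl_dirs.setdefault(page_id, [])  # create the directory on first use
    for url in urls:
        if url not in directory:  # skipping duplicates is an assumption of this sketch
            directory.append(url)

crawl_dirs = {}
store_links(crawl_dirs, "A", ["A1", "A2"])
store_links(crawl_dirs, "B", ["B1", "B2", "B3", "B4"])
store_links(crawl_dirs, "C", ["C1", "C2", "C3"])

# Number of links currently held by each directory.
counts = {page_id: len(urls) for page_id, urls in crawl_dirs.items()}
```

Calling `store_links` for a page that has not been seen before creates its directory, which mirrors the way new crawling directories keep being established as new pages are crawled.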

205. Traversing and querying the crawling directory corresponding to each page according to a preset time interval.

It should be noted that the crawler does not obtain new URL links from pages at every moment, and URL links can be extracted evenly from the pages of each level only after a certain number of URL links have accumulated in the crawling directories. In addition, each traversal of the crawling directories incurs a certain overhead. The crawling directories are therefore traversed and queried at a preset time interval rather than continuously.

The embodiment of the invention does not limit the preset time interval. If the number of URL links in the queue to be crawled is small, the preset time interval can be set shorter; if the number is sufficient, the preset time interval can be set longer, so that a certain number of URL links is maintained in the queue to be crawled.
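One simple way to express this adjustment is a function that picks the next polling interval from the current queue length. The threshold and the two interval values below are purely hypothetical; the text deliberately leaves the interval unspecified.

```python
def next_interval(queue_length, low_water_mark=10, short_s=1.0, long_s=5.0):
    """Return a shorter polling interval when the queue is running low.

    All numeric defaults are illustrative assumptions, not values from the text.
    """
    return short_s if queue_length < low_water_mark else long_s

interval_when_low = next_interval(queue_length=3)
interval_when_full = next_interval(queue_length=50)
```

A scheduler could call `next_interval` after each traversal of the crawling directories and wait that long before the next traversal.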

206. Acquiring the number of URL links stored in the crawling directory corresponding to the page.

In the process of crawling the URL links of the page by the crawler, the number of URL links that different pages may crawl is different, and therefore, the number of URL links in the corresponding crawling directory of the page is also different, for example, the crawler crawls 5 URL links from the page a and crawls 15 URL links from the page B.

To make it easy to know the current crawling status of the URL links in different pages, the number of URL links in the crawling directory corresponding to each page is acquired. It should be noted that, as the crawler continues to crawl, the number of URL links in a crawling directory may stay the same or increase, and it stops changing once all URL links in the page have been crawled. To learn the total number of URL links in each page, the number of URL links in the corresponding crawling directory can be acquired again after that number has stopped changing.

207a. When the number of URL links in the crawling directory is smaller than the preset number, extracting all URL links from the crawling directory corresponding to the page, and putting the URL links into a queue to be crawled.

Because the number of URL links in different pages cannot be known in advance, the number of crawled URL links stored in a crawling directory is not fixed. For the embodiment of the invention, when the number of URL links in the crawling directory corresponding to a page is less than the preset number, either the crawler has not yet crawled enough URL links from that page or the page itself does not contain enough URL links. In this case, in order to extract as many URL links as possible from the crawling directory, all URL links in that directory are extracted and placed into the queue to be crawled.

It should be noted that, when the crawler crawls for the first time, the queue to be crawled is empty, and the URL links crawled are usually placed directly into the queue to be crawled without passing through the crawling directory of a page. During crawling, if the queue to be crawled is empty or contains few URL links, the preset time interval can be set shorter, so that the queue is kept supplied with sufficient URL links for the crawler to crawl.

Corresponding to step 207a, there is step 207b: when the number of URL links in the crawling directory is greater than or equal to the preset number, extracting the preset number of URL links from the crawling directory corresponding to the page, and putting the URL links into a queue to be crawled.

For the embodiment of the invention, when the number of URL links in the crawling directory corresponding to a page is sufficient, that is, greater than or equal to the preset number, the preset number of URL links are extracted from that crawling directory and placed into the queue to be crawled. The preset number may be the integer part of the ratio of the number of all URL links in the queue to be crawled to the number of crawling directories, so that URL links are extracted from the crawling directory of each page in a balanced manner and the extracted URL links cover each page level evenly.

It should be noted that, when the crawling operation has just started, the crawled pages are at shallow levels, there are few pages, and few crawling directories have been established. To ensure that the same number of URL links can be extracted from different crawling directories, the preset number can be set correspondingly low; the specific value of the preset number is not limited here.
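Steps 207a and 207b, together with the ratio-based choice of the preset number, can be sketched as follows. The function names are hypothetical, the link count of the queue is treated here as a target size (one possible reading of the ratio), and removing extracted links from the directory so that they are not extracted twice is an assumption of the sketch.

```python
def preset_from_ratio(queue_link_count, directory_count):
    """Integer part of (links in the queue) / (number of crawling directories)."""
    if directory_count == 0:
        return 0
    return queue_link_count // directory_count

def extract_from_directory(directory, preset):
    """Take all links if fewer than `preset` are stored, otherwise take `preset`."""
    if len(directory) < preset:
        taken = list(directory)      # step 207a: not enough links, take them all
    else:
        taken = directory[:preset]   # step 207b: enough links, take the preset number
    del directory[:len(taken)]       # assumption: extracted links leave the directory
    return taken

crawl_dirs = {
    "A": ["A1", "A2", "A3"],
    "B": ["B1"],
    "C": ["C1", "C2", "C3", "C4", "C5"],
}
preset = preset_from_ratio(queue_link_count=6, directory_count=len(crawl_dirs))

queue = []
for directory in crawl_dirs.values():
    queue.extend(extract_from_directory(directory, preset))
```

Directory B holds fewer links than the preset number, so it is emptied, while directories A and C each contribute exactly the preset number and keep the rest for a later traversal.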

208. Crawling the page content in the page corresponding to the URL link by taking the URL link in the queue to be crawled as an entry.

It can be understood that, in the process of crawling URL links in pages, the crawler continuously takes new URL links out of the queue to be crawled and, using each new URL link as an entry, crawls the URL links in the corresponding page. At the same time, a new crawling directory is established for the page corresponding to each newly crawled URL link, and URL links from the new crawling directories are added to the queue to be crawled for the crawler to crawl.
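Putting the pieces together, the loop described here might look like the sketch below. It is a simplified reading under stated assumptions: pages are looked up in a hypothetical in-memory `LINKS` graph instead of being fetched, every page (including the entry page) goes through a crawling directory, a directory traversal runs whenever the queue has been drained rather than on a timer, and `max_pages` stands in for the end condition.

```python
from collections import deque

# Hypothetical site: page -> links found on that page.
LINKS = {
    "entry": ["A", "B", "C"],
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2"],
    "C": ["C1", "C2", "C3"],
    "A1": ["A1x", "A1y"],
    "B1": ["B1x"],
}

def sampled_crawl(entry, links, preset, max_pages):
    """Crawl with a per-directory quota instead of queueing every link found."""
    crawl_dirs = {}          # page -> links found on it that are not yet queued
    queue = deque([entry])
    seen = {entry}
    visited = []
    while len(visited) < max_pages and (queue or any(crawl_dirs.values())):
        # Crawl everything currently queued; each page gets its own directory.
        while queue and len(visited) < max_pages:
            page = queue.popleft()
            visited.append(page)
            crawl_dirs[page] = list(links.get(page, []))
        # Traverse the directories and move at most `preset` links from each.
        for urls in crawl_dirs.values():
            batch = urls[:preset]
            del urls[:preset]
            for url in batch:
                if url not in seen:
                    seen.add(url)
                    queue.append(url)
    return visited

visited = sampled_crawl("entry", LINKS, preset=1, max_pages=8)
```

With the same budget of 8 pages, a plain breadth-first crawl of this graph would stop inside the second level, whereas this loop also reaches the third-level page `A1x` while still touching the branches under pages A and B.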

209. Setting an end condition for the crawler's crawling of URL links in pages, and ending the crawling operation of the crawler when the end condition is reached.

Because there are usually too many pages to crawl, an end condition is generally set for the crawler's crawling of URL links in pages, and the crawling operation ends when the end condition is reached. For example, the crawling time can be set with a timer so that the crawling operation stops once that time has elapsed: if the crawling time is set to 20 minutes, the crawling operation ends after 20 minutes regardless of which page level has been reached or how many URL links have been crawled. The number of URL links to be crawled can also be set so that the crawling operation stops when that number is reached: if the crawler is set to crawl 10000 URL links, the crawling operation ends when the number of crawled URL links reaches 10000. The page level to which the crawler crawls can also be set, and so on.
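The three kinds of end condition mentioned above (elapsed time, number of URL links, page level) can be combined in one small helper. The class name `EndCondition` and its parameters are hypothetical, and treating a depth beyond `max_depth` as the stopping point is an assumption of this sketch.

```python
import time

class EndCondition:
    """Stops the crawl on elapsed time, link count or page level, whichever is set."""

    def __init__(self, max_seconds=None, max_links=None, max_depth=None):
        self.max_seconds = max_seconds
        self.max_links = max_links
        self.max_depth = max_depth
        self.started = time.monotonic()  # the timer starts when the condition is created

    def reached(self, links_crawled, depth):
        if self.max_seconds is not None and time.monotonic() - self.started >= self.max_seconds:
            return True
        if self.max_links is not None and links_crawled >= self.max_links:
            return True
        if self.max_depth is not None and depth > self.max_depth:
            return True
        return False

by_count = EndCondition(max_links=10000)     # stop after 10000 URL links
by_time = EndCondition(max_seconds=20 * 60)  # stop after 20 minutes
```

A crawl loop would call `reached(...)` before taking the next URL link from the queue and stop as soon as it returns `True`.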

For the embodiment of the present invention, specific application scenarios may include, but are not limited to, the following implementation. One URL link is provided, and crawling starts from that URL link. In the first crawl, 3 URL links are crawled from the page corresponding to the provided URL link; the corresponding pages are page A, page B, and page C. These 3 URL links are placed directly into the queue to be crawled and constitute the first-level pages. Crawling directories corresponding to the different pages are then created, namely crawling directory A, crawling directory B, and crawling directory C. Crawling continues with the 3 URL links in the queue to be crawled as entries: the URL links crawled from page A are placed into crawling directory A, the URL links crawled from page B are placed into crawling directory B, and the URL links crawled from page C are placed into crawling directory C.

Traversing crawling directory A, crawling directory B, and crawling directory C respectively shows that crawling directory A holds 3 URL links, crawling directory B holds 5 URL links, and crawling directory C holds 10 URL links. With the number of URL links extracted from a crawling directory each time set to 3, 3 URL links are extracted from each of the three crawling directories, 9 URL links in total, and these 9 URL links are placed into the queue to be crawled as the second-level pages. Crawling continues with the 9 URL links in the queue to be crawled as entries, the crawling operation and the creation of crawling directories are repeated, and 3 URL links are again extracted from each crawling directory. If a crawling directory holds only 1 or 2 URL links, all of its URL links are selected, while 3 URL links are selected from each crawling directory that holds enough URL links. The selected URL links are again placed into the queue to be crawled as the third-level pages, and so on, until the timer reaches the crawling deadline and the crawler task ends.
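The arithmetic of this scenario can be checked with a short sketch. The directory contents below are made-up placeholder paths, and the helper name `extract` is hypothetical; the example also assumes that extracted links are removed from their directory, which the text does not state explicitly.

```python
def extract(directory, preset_number):
    """Take up to `preset_number` links from the front of a crawling directory.

    If the directory holds fewer links than `preset_number`, all of them are taken.
    """
    taken = directory[:preset_number]
    del directory[:preset_number]
    return taken


# Crawling directories A, B, and C holding 3, 5, and 10 URL links.
directories = {
    "A": [f"/a/{i}" for i in range(3)],
    "B": [f"/b/{i}" for i in range(5)],
    "C": [f"/c/{i}" for i in range(10)],
}

# Extract 3 links from each directory into the queue to be crawled.
queue = []
for links in directories.values():
    queue.extend(extract(links, 3))

# Number of links left in each directory after the first extraction.
remaining = {name: len(links) for name, links in directories.items()}
```

After this pass the queue holds 9 URL links. Under the removal assumption, a second pass over the same directories would take nothing from A, both remaining links from B, and 3 links from C.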

According to another crawling method for crawlers provided by the embodiment of the invention, while the crawler crawls the URL links in pages, crawling directories corresponding to different pages are established according to the identification information of the pages corresponding to the crawled URL links, and all URL links corresponding to each page are placed in that page's crawling directory. The more page levels are crawled, the more page levels the established crawling directories span. A preset number of URL links is then extracted from the crawling directory corresponding to each page and placed in the queue to be crawled, so that the URL links in the queue to be crawled cover more page levels. Finally, the URL links in the queue to be crawled are used as entries to crawl the URL links in the corresponding pages. Compared with the prior-art method of crawling pages with a breadth-first algorithm, the embodiment of the invention does not place the URL links found in a crawled page directly into the queue to be crawled. Instead, it establishes crawling directories corresponding to different pages according to all the URL links corresponding to each page, extracts a preset number of URL links from each of these crawling directories, and only then places them into the queue to be crawled. As a result, the queue to be crawled contains URL links from more levels, and the page data covered by the URLs in the queue is broader and covers all components of a website. New crawling directories can be established continuously throughout the crawling process, and URL links are extracted from them and placed into the queue to be crawled, so that data information under more application scenarios can be crawled and data with more reference value is provided for subsequent website analysis.

In addition, because only a small number of URL links is taken from the crawling directories corresponding to pages of different levels for further crawling, web page data from more page levels can be crawled within the same crawling time, and the URL links in newly found crawling directories can be crawled preferentially. Compared with the prior-art breadth-first order, when the overall condition of a website is analyzed, data information under more application scenarios can be crawled, which ensures wider coverage of the crawled web pages.

In order to achieve the above object, according to another aspect of the present invention, an embodiment of the present invention further provides a storage medium, where the storage medium includes a stored program, where the apparatus on which the storage medium is located is controlled to execute the above crawling method for crawlers when the program runs.

In order to achieve the above object, according to another aspect of the present invention, an embodiment of the present invention further provides a processor, configured to execute a program, where the program executes to perform the above crawling method for crawlers.

Further, as an implementation of the method shown in fig. 1 and fig. 2, another embodiment of the present invention further provides a crawling apparatus for crawlers. The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method. The apparatus enables the data crawled by the crawler to cover more page levels. Specifically, as shown in fig. 3, the apparatus includes:

the first obtaining unit 31 may be configured to obtain all URL links corresponding to each page of the same hierarchy to be crawled;

the extracting unit 32 may be configured to extract a preset number of URL links from all URL links corresponding to each page, and place the URL links into a queue to be crawled;

the crawling unit 33 may be configured to crawl the page content in the page corresponding to the URL link by using the URL link in the queue to be crawled as an entry.

The embodiment of the invention provides a crawling device for crawlers, which obtains all URL links corresponding to each page of the same level to be crawled, extracts a preset number of URL links from all URL links corresponding to each page, and places them in a queue to be crawled, so that the URL links in the queue to be crawled can cover more page levels. Finally, the URL links in the queue to be crawled are used as entries to crawl the page content in the corresponding pages. Compared with the prior-art method of crawling pages with a breadth-first algorithm, the embodiment does not place the URL links that the crawler finds in a page directly into the queue to be crawled while crawling. Instead, a preset number of URL links is extracted from the URL links crawled from each page and only then placed into the queue to be crawled. As a result, the queue to be crawled contains URL links from more levels, and the page data covered by the URLs in the queue is broader and covers each component of a website. New crawling directories can be established continuously throughout the crawling process, and URL links are extracted from them and placed into the queue to be crawled, so that data information under more application scenarios is crawled and data with higher reference value is provided for subsequent website analysis.

Further, as shown in fig. 4, the apparatus further includes:

a second obtaining unit 34, configured to obtain identification information of a page corresponding to the URL link;

the establishing unit 35 may be configured to establish crawling directories corresponding to different pages according to the identification information;

a storage unit 36, which may be configured to put all URL links corresponding to each page into the crawling directory corresponding to the page;

the extracting unit 32 may be further configured to extract a preset number of URL links from the crawling directory corresponding to each page, and place the URL links into a queue to be crawled.

The query unit 37 may be configured to query the crawling directory corresponding to each page in a traversal manner according to a preset time interval;

the third obtaining unit 38 may be configured to obtain the number of URL links stored in the crawling directory corresponding to the page.

an ending unit 39, which may be configured to set an ending condition for the crawler's crawling of URL links in pages and, when the ending condition is reached, terminate the crawling operation of the crawler.

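Further, the extracting unit 32 may be specifically configured to extract the same preset number of URL links from all URL links corresponding to each page, and place the URL links into a queue to be crawled; or to set weight values for different types of pages, extract from all URL links corresponding to each page a preset number of URL links matching the weight value corresponding to the page type, and place the URL links into a queue to be crawled.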

Further, the extracting unit 32 may be specifically configured to extract all URL links from the crawling directory corresponding to the page when the number of URL links in the crawling directory is smaller than a preset number;

the extracting unit 32 may be further specifically configured to extract a preset number of URL links from the crawling directory corresponding to the page when the number of URL links in the crawling directory is greater than or equal to the preset number.

According to another crawling device for crawlers provided by the embodiment of the invention, while the crawler crawls the URL links in pages, crawling directories corresponding to different pages are established according to the identification information of the pages corresponding to the crawled URL links, and all URL links corresponding to each page are placed in that page's crawling directory. The more page levels are crawled, the more page levels the established crawling directories span. A preset number of URL links is then extracted from the crawling directory corresponding to each page and placed in the queue to be crawled, so that the URL links in the queue to be crawled cover more page levels. Finally, the URL links in the queue to be crawled are used as entries to crawl the URL links in the corresponding pages. Compared with the prior-art method of crawling pages with a breadth-first algorithm, the embodiment of the invention does not place the URL links found in a crawled page directly into the queue to be crawled. Instead, it establishes crawling directories corresponding to different pages according to all the URL links corresponding to each page, extracts a preset number of URL links from each of these crawling directories, and only then places them into the queue to be crawled. As a result, the queue to be crawled contains URL links from more levels, and the page data covered by the URLs in the queue is broader and covers all components of a website. New crawling directories can be established continuously throughout the crawling process, and URL links are extracted from them and placed into the queue to be crawled, so that data information under more application scenarios can be crawled and data with more reference value is provided for subsequent website analysis.

The crawling device of the crawler comprises a processor and a memory, wherein the first acquiring unit 31, the extracting unit 32, the crawling unit 33 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. The kernel can be set to be one or more, and the data crawled by the crawler can cover more page levels by adjusting the kernel parameters.

The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the crawling method of the crawler when being executed by a processor.

The embodiment of the invention provides a processor, which is used for running a program, wherein the crawler crawling method is executed when the program runs.

The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps:

a crawling method of crawlers, comprising: acquiring all URL links corresponding to each page under the same level to be crawled; extracting a preset number of URL links from all URL links corresponding to each page, and putting the URL links into a queue to be crawled; and crawling the page content in the page corresponding to the URL link by taking the URL link in the queue to be crawled as an entrance.

Further, the extracting a preset number of URL links from all URL links corresponding to each page and placing the URL links into a queue to be crawled includes: respectively extracting URL links with the same preset number from all URL links corresponding to each page, and putting the URL links into a queue to be crawled; or setting weight values for different types of pages, respectively extracting a preset number of URL links matched with the weight values corresponding to the page types from all URL links corresponding to each page, and putting the URL links into a queue to be crawled.
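The weight-based alternative is not spelled out in detail here, so the following is only one possible reading: each page type has a weight, and the number of URL links taken from a page of that type is a base number scaled by the weight. The function names, the scaling rule, the rounding with `math.ceil`, and the default weight of 1.0 are all assumptions of this sketch.

```python
import math


def quota_for(page_type, weights, base_number):
    """Return how many links to take from a page of `page_type`.

    Assumed rule: base_number scaled by the type's weight, rounded up,
    and never less than 1. Unknown page types get a weight of 1.0.
    """
    weight = weights.get(page_type, 1.0)
    return max(1, math.ceil(base_number * weight))


def extract_weighted(pages, weights, base_number):
    """Build the queue to be crawled from (page_type, links) pairs."""
    queue = []
    for page_type, links in pages:
        queue.extend(links[:quota_for(page_type, weights, base_number)])
    return queue
```

For instance, with hypothetical weights `{"list": 2.0, "detail": 0.5}` and a base number of 3, a list page would contribute up to 6 URL links and a detail page up to 2.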

Further, after the obtaining all URL links corresponding to each page at the same level to be crawled, the method further includes: acquiring identification information of a page corresponding to the URL link; establishing crawling directories corresponding to different pages according to the identification information; and putting all URL links corresponding to each page into the crawling directory corresponding to the page. The extracting a preset number of URL links from all URL links corresponding to each page and placing the URL links into a queue to be crawled includes: extracting a preset number of URL links from the crawling directory corresponding to each page, and putting the URL links into a queue to be crawled.

Further, before the extracting a preset number of URL links from the crawling directory corresponding to each page, the method further includes: traversing and inquiring the crawling directory corresponding to each page according to a preset time interval; and acquiring the number of URL links stored in the crawling directory corresponding to the page.
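A rough sketch of the traversal query follows. It assumes the crawling directories are held in a dictionary mapping a page identifier to its list of URL links; the function names and the injectable `sleep` parameter (used so the example can run without waiting) are hypothetical.

```python
import time


def count_links(directories):
    """Traverse every crawling directory and return its number of URL links."""
    return {page: len(links) for page, links in directories.items()}


def poll_counts(directories, interval_seconds, polls, sleep=time.sleep):
    """Query the directories `polls` times, waiting `interval_seconds` between queries."""
    snapshots = []
    for i in range(polls):
        snapshots.append(count_links(directories))
        if i < polls - 1:
            sleep(interval_seconds)
    return snapshots
```

The counts returned by each query are what the extraction step would compare against the preset number.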

Further, the extracting a preset number of URL links from the crawling directory corresponding to each page includes: when the number of the URL links in the crawling directory is smaller than the preset number, extracting all URL links from the crawling directory corresponding to the page; and when the number of the URL links in the crawling directory is greater than or equal to the preset number, extracting the URL links in the preset number from the crawling directory corresponding to the page.

Further, after the crawling is performed on the page content in the page corresponding to the URL link by using the URL link in the queue to be crawled as an entry, the method further includes: and setting an ending condition of the URL link in the crawler crawling page, and ending the crawling operation of the crawler when the ending condition is reached.

The device herein may be a server, a PC, a PAD, a mobile phone, etc.

The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: acquiring all URL links corresponding to each page under the same level to be crawled; extracting a preset number of URL links from all URL links corresponding to each page, and putting the URL links into a queue to be crawled; and crawling the page content in the page corresponding to the URL link by taking the URL link in the queue to be crawled as an entrance.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal or a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A crawling method for crawlers, comprising:

crawling page content in a page corresponding to the URL link by taking the URL link in the queue to be crawled as an entrance;

the method for extracting the URL links with the preset number from all the URL links corresponding to each page and putting the URL links into a queue to be crawled comprises the following steps:

setting weight values for different types of pages, extracting a preset number of URL links matched with the weight values corresponding to the page types from all URL links corresponding to each page respectively, and putting the URL links into a queue to be crawled;

after the obtaining all URL links corresponding to each page at the same level to be crawled, the method further includes:

acquiring identification information of a page corresponding to the URL link;

2. The method of claim 1, wherein the extracting a preset number of URL links from all URL links corresponding to each page and placing the URL links into a queue to be crawled comprises:

and respectively extracting URL links with the same preset number from all URL links corresponding to each page, and putting the URL links into a queue to be crawled.

3. The method of claim 1, wherein prior to said extracting a preset number of URL links from the crawl directory for each of the pages, the method further comprises:

4. The method of claim 3, wherein extracting a preset number of URL links from the crawl directory corresponding to each page comprises:

5. The method according to any one of claims 1-4, wherein after the crawling of the page content in the page corresponding to the URL link using the URL link in the queue to be crawled as an entry, the method further comprises:

6. A crawler crawling apparatus, comprising:

the crawling unit is used for crawling the page content in the page corresponding to the URL link by taking the URL link in the queue to be crawled as an entrance;

the extraction unit is specifically configured to set weight values for different types of pages, extract a preset number of URL links matching the weight value corresponding to the page type from all URL links corresponding to each page, and place the URL links into a queue to be crawled;

the device further comprises:

7. The apparatus of claim 6,

the extraction unit is specifically configured to extract the same preset number of URL links from all URL links corresponding to each page, and place the URL links into a queue to be crawled.

8. A storage medium comprising a stored program, wherein the program, when executed, controls an apparatus on which the storage medium is located to perform the crawler crawling method according to any one of claims 1 to 5.

9. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the crawler crawling method of any of claims 1 to 5.