CN113419781A

CN113419781A - Crawler method and device based on Chrome plug-in, computer equipment and storage medium

Info

Publication number: CN113419781A
Application number: CN202110813985.XA
Authority: CN
Inventors: 林鹏; 蔡权; 黄九鸣; 张圣栋; 曾琰
Original assignee: Hunan Sifang Tianjian Information Technology Co Ltd
Current assignee: Hunan Sifang Tianjian Information Technology Co Ltd
Priority date: 2021-07-19
Filing date: 2021-07-19
Publication date: 2021-09-21

Abstract

The invention relates to the technical field of web crawlers and provides a crawler method and device based on a Chrome plug-in, computer equipment, and a storage medium. The method comprises the following steps: requesting a crawler task from a task scheduling center; requesting a corresponding crawler labeling template according to the crawler task, the crawler labeling template being obtained by performing template labeling on a target webpage in advance; and starting a Chrome plug-in, which performs the crawler operation according to the webpage elements labeled in the crawler labeling template. The method can improve the efficiency of crawler collection work.

Description

Crawler method and device based on Chrome plug-in, computer equipment and storage medium

Technical Field

The invention belongs to the technical field of web crawlers, and particularly relates to a crawler method and device based on Chrome plug-in, computer equipment and a storage medium.

Background

With the rapid development of networks, the World Wide Web has become a carrier of a vast amount of information, and extracting and using that information effectively has become a major challenge. General-purpose search engines such as Google and Yahoo have traditionally served as users' portal and guide to the World Wide Web, acting as tools that help people retrieve information. However, as data formats have multiplied, these general-purpose search engines generally cannot support information-dense, structured data well, and they often have difficulty supporting queries based on semantic information. To address these shortcomings, web crawler technology for targeted crawling of relevant web page resources was developed.

However, when existing web crawlers face a large number of websites that load content asynchronously via Ajax, they rely on writing separate collection logic for each website up front. Selenium (a browser automation testing framework) is used to simulate browser requests, each website is adapted individually, extraction rules are written for each website after its HTML (HyperText Markup Language) source code is obtained, and the required data is finally extracted using those rules. If a website changes, the crawler must be recoded and released again, so the development cycle is long and crawler collection work is inefficient.

Disclosure of Invention

In view of the foregoing, it is necessary to provide a crawler method, a crawler apparatus, a computer device, and a storage medium based on Chrome plug-in, which can improve the crawler collection efficiency.

The invention provides a crawler method based on Chrome plug-in, which comprises the following steps:

requesting a crawler task from a task scheduling center;

requesting a corresponding crawler labeling template according to the crawler task, the crawler labeling template being obtained by performing template labeling on a target webpage in advance;

and starting a Chrome plug-in, the Chrome plug-in performing the crawler operation according to the webpage elements labeled in the crawler labeling template.

In one embodiment, starting a Chrome plug-in, and performing a crawler operation by the Chrome plug-in according to a webpage element labeled in the crawler labeling template, includes:

acquiring a link of a list page corresponding to the crawler task, and a list page success selector field and a detail page link selector field which are marked in the crawler marking template;

loading the link of the list page, and judging whether the list page is loaded successfully according to the list page success selector field;

when the list page is not loaded successfully, refreshing the link of the list page for reloading;

when the list page is loaded successfully, acquiring a link of the detail page included in the list page according to the link selector field of the detail page and clicking to load;

exporting the hypertext markup language of the loaded detail pages as a crawler result, and finishing the crawler after the links of the detail pages in the list page are loaded completely.

In one embodiment, after acquiring a link of a detail page included in the list page according to the detail page link selector field and clicking to load, the method includes:

acquiring a detail page loading completion selector field marked in the crawler marking template, and judging whether the detail page is loaded successfully or not according to the detail page loading completion selector field;

when the detail page is not loaded successfully, clicking a link for loading the detail page again;

and when the detail page is loaded successfully, the step of exporting the hypertext markup language of the loaded detail page as a crawler result is entered.

In one embodiment, before ending the crawler, the method further comprises:

judging whether the currently loaded list page is an end page or not;

and if the currently loaded list page is not the end page, completing page turning operation according to the webpage elements marked in the crawler marking template until the currently loaded list page is the end page.

In one embodiment, determining whether the web page is successfully loaded according to the selector field includes:

acquiring a field value corresponding to the selector field, and searching in a webpage source code of the webpage according to the field value; wherein the selector field comprises the listing page success selector field and/or the detail page load complete selector field, and the web page comprises the listing page corresponding to the listing page success selector field and/or the detail page corresponding to the detail page load complete selector field;

and when the webpage element corresponding to the field value is found in the webpage source code, determining that the webpage corresponding to the selector field is loaded successfully.

In one embodiment, the completing a page turning operation according to the webpage elements labeled in the crawler labeling template until the currently loaded list page is an end page includes:

acquiring a page turning button selector field and a page turning button text field which are marked in the crawler marking template;

searching whether a page turning button exists in the currently loaded list page or not according to the page turning button selector field and the page turning button text field;

when the page turning button does not exist, page number page turning is carried out on the list page;

when a page turning button exists, clicking the page turning button to turn pages;

and after page turning, returning to the step of judging whether the list page is loaded successfully according to the list page success selector field until the currently loaded list page is an end page.

In one embodiment, the step of performing template tagging on the target webpage to obtain the crawler tagging template includes:

marking items to be captured in the target webpage to obtain at least two cascading style sheet expressions of the marked items;

generating a cascading style sheet general formula of the labeled item according to each cascading style sheet expression;

and generating a crawler labeling template of the target webpage according to the cascading style sheet formula and the webpage information of the target webpage.

A Chrome plug-in based crawler device comprising:

the task request module is used for requesting a crawler task to the task scheduling center;

the template request module is used for requesting a corresponding crawler marking template according to the crawler task, and the crawler marking template is obtained by carrying out template marking on a target webpage in advance;

and the crawler module is used for starting the Chrome plug-in, and the Chrome plug-in carries out crawler operation according to the webpage elements marked in the crawler marking template.

The invention also provides computer equipment which comprises a processor and a memory, wherein the memory stores a computer program, and the processor realizes the steps of the crawler method based on the Chrome plug-in when executing the computer program.

The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the Chrome plug-in based crawler method described above.

According to the crawler method and device based on the Chrome plug-in, the computer equipment, and the storage medium, a crawler task is requested from the task scheduling center, and a corresponding crawler labeling template, obtained by performing template labeling on a target webpage in advance, is requested according to the crawler task. The Chrome plug-in is then started and performs the crawler operation according to the webpage elements labeled in the crawler labeling template. With this method, collection logic does not need to be written manually for each website before crawling. The target webpage only needs to be labeled in advance to obtain the corresponding labeling template, after which the Chrome plug-in crawls directly according to the webpage elements labeled in the template to collect the required website, which greatly improves the efficiency of crawler collection work.

Drawings

FIG. 1 is a diagram of an application environment of a crawler method based on Chrome plug-ins in one embodiment.

FIG. 2 is a flowchart illustrating a crawler method based on a Chrome plug-in in one embodiment.

FIG. 3 is a flowchart illustrating a crawler method based on a Chrome plug-in in another embodiment.

FIG. 4 is a block diagram of the structure of a crawler system based on a Chrome plug-in in one embodiment.

FIG. 5 is a flowchart illustrating steps of performing template tagging on a target webpage to obtain a crawler tagging template in one embodiment.

FIG. 6 is a block diagram of a crawler based on a Chrome plug-in in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The crawler method based on Chrome plug-in provided by the present application can be applied to the application environment shown in fig. 1, where the application environment relates to the terminal 102 and the server 104. Wherein the terminal 102 communicates with the server 104 via a network. The terminal 102 may be, but is not limited to, various personal computers, laptops, smartphones, tablets and portable wearable devices, and the server 104 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.

Specifically, the crawler program may be deployed in the terminal 102 and/or the server 104, so the crawler method based on the Chrome plug-in may be implemented by the crawler program deployed in the terminal 102 or by the crawler program deployed in the server 104. Taking implementation on the server 104 as an example: the crawler program in the server 104 requests a crawler task from the task scheduling center; the crawler program requests a corresponding crawler labeling template according to the crawler task, the crawler labeling template being obtained by performing template labeling on a target webpage in advance; and the crawler program starts a Chrome plug-in, which performs the crawler operation according to the webpage elements labeled in the crawler labeling template.

In one embodiment, as shown in fig. 2, a crawler method based on Chrome plug-in is provided, which is described by taking the method as an example of being applied to a server, and includes the following steps:

Step S201, requesting a crawler task from the task scheduling center.

The task scheduling center is a program deployed in the server and used for executing task scheduling, and the task scheduling center in this embodiment may be understood as a program for scheduling a crawler task.

Specifically, the crawler program in the server may request a user-created crawler task from the task scheduling center upon receiving a crawler execution instruction issued by the user through the terminal. Alternatively, the crawler program in the server requests crawler tasks from the task scheduling center periodically.

Step S202, requesting a corresponding crawler labeling template according to the crawler task, the crawler labeling template being obtained by performing template labeling on a target webpage in advance.

The crawler labeling template is obtained by performing template labeling on a target webpage in advance and is used for the crawler to acquire data. The crawler labeling template comprises information which can assist a crawler program in crawler collection of the webpage.

Specifically, after a crawler task is periodically requested from the task scheduling center, the task request is determined to be successful if a crawler task fed back by the task scheduling center is received. The crawler then starts running and requests the crawler labeling template that matches the successfully requested crawler task. In this embodiment, all crawler labeling templates may be managed uniformly by a template management center. When the crawler program requests a crawler labeling template, it sends the data source (for example, the ID of the crawler labeling template) of the required template to the template management center, which feeds back the matching crawler labeling template according to the data source. However, if no crawler labeling template is received for a long time, or request failure information is fed back directly, the template request is determined to have failed. Because the corresponding crawler labeling template cannot be obtained, the crawler task fails and the crawler ends. In most cases the template request fails because the template management center does not hold the requested crawler labeling template. The server can therefore start a template labeling task for the webpage corresponding to the failed crawler task, so that crawler collection of the webpage can be performed again once its crawler labeling template has been labeled.
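
The template request and its failure handling can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: `TemplateCenter`, `request_template`, the `link_selector` key, and the in-memory dictionary are hypothetical stand-ins for the template management center, and the data source is treated as a plain string ID.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TemplateCenter:
    """Hypothetical in-memory stand-in for the template management center."""
    templates: Dict[str, dict] = field(default_factory=dict)

    def lookup(self, data_source: str) -> Optional[dict]:
        # Return the template whose ID equals the data source, if one exists.
        return self.templates.get(data_source)


def request_template(center: TemplateCenter, data_source: str,
                     labeling_queue: List[str]) -> Optional[dict]:
    """Request a template; on failure, queue the page for template labeling."""
    template = center.lookup(data_source)
    if template is None:
        # No template yet: the crawler task fails for now, and a labeling
        # task is queued so the page can be crawled again later.
        labeling_queue.append(data_source)
    return template


# Assumed sample data: one labeled site and one site without a template.
center = TemplateCenter({"news-site-1": {"link_selector": "ul > li > a"}})
labeling_queue: List[str] = []
found = request_template(center, "news-site-1", labeling_queue)
missing = request_template(center, "unknown-site", labeling_queue)
```

Here `found` holds the matching template, while `missing` is `None` and `unknown-site` is placed in the labeling queue.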

Step S203, starting the Chrome plug-in, and performing crawler operation by the Chrome plug-in according to the webpage elements marked in the crawler marking template.

Specifically, after the crawler labeling template is requested successfully, the Chrome (Google's browser) plug-in is started, and the Chrome plug-in performs crawler collection by carrying out operations such as click loading and page turning according to the webpage elements included in the crawler labeling template.

According to the crawler method based on the Chrome plug-in, a crawler task is requested from the task scheduling center, and a corresponding crawler labeling template, obtained by performing template labeling on a target webpage in advance, is requested according to the crawler task. The Chrome plug-in is then started and performs the crawler operation according to the webpage elements labeled in the crawler labeling template. With this method, collection logic does not need to be written manually for each website before crawling. The target webpage only needs to be labeled in advance to obtain the corresponding labeling template, after which the Chrome plug-in crawls directly according to the webpage elements labeled in the template to collect the required website, which greatly improves the efficiency of crawler collection work.

In one embodiment, step S203 includes: acquiring a link of a list page corresponding to a crawler task, and a list page success selector field and a detail page link selector field which are marked in a crawler marking template; loading the link of the list page, and judging whether the list page is loaded successfully according to the successful selector field of the list page; when the list page is not loaded successfully, refreshing the link of the list page for reloading; when the list page is loaded successfully, acquiring a link of the detail page included in the list page according to the link selector field of the detail page and clicking to load; exporting the hypertext markup language of the loaded detail pages as a crawler result, and finishing the crawler after the links of the detail pages in the list page are loaded.

The list page success selector field and the detail page link selector field are web page elements obtained by labeling the target website.

Specifically, the Chrome plug-in first closes all tab pages and obtains the link of the list page set in the crawler task. It loads the list page through this link and judges whether the list page is loaded successfully according to the list page success selector (success Selector) field in the crawler labeling template. When the list page is loaded successfully, the plug-in obtains the links of all detail pages in the list page according to the detail page link selector (linkSelector) field labeled in the crawler labeling template, and clicks each link to load it. The HyperText Markup Language (HTML) of each loaded detail page is returned to the crawler program as a crawler result, and the crawler program exports the result. Once all detail pages have been clicked and loaded and their HTML exported, the crawl is complete and the crawler ends. If the list page is not loaded successfully, the link is refreshed to load the list page again; if the list page still fails to load after a preset number of reloads, the task fails and ends directly.
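
The list page flow described above can be sketched as follows. This is an illustrative sketch only: the real work is done by a Chrome plug-in, whereas here the browser is replaced by plain callables, and `crawl_list_page`, `fake_load`, `max_attempts`, and the in-memory `pages` dictionary are all hypothetical names and data.

```python
import re
from typing import Callable, List


def crawl_list_page(load: Callable[[str], str],
                    is_loaded: Callable[[str], bool],
                    extract_links: Callable[[str], List[str]],
                    list_url: str,
                    max_attempts: int = 3) -> List[str]:
    """Load a list page (reloading on failure), then export each detail page's HTML."""
    for _ in range(max_attempts):
        list_html = load(list_url)
        if is_loaded(list_html):
            break
    else:
        # The preset number of loads was exhausted: the task fails and ends.
        raise RuntimeError("list page failed to load")
    return [load(link) for link in extract_links(list_html)]


# Hypothetical in-memory "site"; the first load of the list page fails.
pages = {
    "list": "<ul class='ok'><a href='d1'>1</a><a href='d2'>2</a></ul>",
    "d1": "<p>one</p>",
    "d2": "<p>two</p>",
}
list_loads: List[str] = []


def fake_load(url: str) -> str:
    if url == "list":
        list_loads.append(url)
        if len(list_loads) == 1:
            return ""  # simulate a list page that did not load
    return pages[url]


results = crawl_list_page(
    fake_load,
    lambda html: "class='ok'" in html,                 # stands in for the success selector
    lambda html: re.findall(r"href='([^']+)'", html),  # stands in for the link selector
    "list",
)
```

In this run the list page is loaded twice (one failure, one success), and `results` holds the HTML of the two detail pages in link order.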

In this embodiment, because the field in the crawler labeling template is obtained by labeling the target webpage in advance, the crawler is performed by using the webpage elements labeled in the crawler labeling template, and whether the list page is loaded is determined by the webpage elements, so that the efficiency is improved without manually writing acquisition logic, and the accuracy of the crawler can be ensured.

In one embodiment, after acquiring a link of a detail page included in a list page according to a detail page link selector field and clicking to load, the method includes: acquiring a detail page loading completion selector field marked in a crawler marking template and judging whether the detail page is loaded successfully or not according to the detail page loading completion selector field; when the detail page is not loaded successfully, clicking the link for loading the detail page again; and when the detail page is loaded successfully, the step of exporting the hypertext markup language of the loaded detail page as a crawler result is entered.

Specifically, after a detail page link is obtained according to the detail page link selector field and clicked to load, and before the HTML of the loaded detail page is exported, it must be confirmed that the detail page has loaded correctly. The detail page loading completion selector (detailPageLoadFinishedSelector) field is obtained from the crawler labeling template, and whether the detail page is loaded successfully is judged according to this field, which is likewise a webpage element obtained by labeling. When the detail page is determined to be loaded successfully, the HTML of the loaded detail page is exported. When the detail page is determined not to be loaded successfully, the link of the detail page is clicked again to reload it; if the detail page still fails to load after n retries, the crawler task is determined to have failed and the task ends directly. Here, n is a preset retry threshold that may be set according to actual needs and is not limited herein.
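
The retry threshold n can be illustrated with a short sketch. It assumes that "n retries" means one initial click followed by up to n further clicks; `load_detail_with_retries` and the callables passed to it are hypothetical names, and the browser click is simulated by a function that returns HTML.

```python
from typing import Callable, Optional


def load_detail_with_retries(click_link: Callable[[], str],
                             is_loaded: Callable[[str], bool],
                             n: int) -> Optional[str]:
    """Click once, then retry up to n more times; return HTML, or None on failure."""
    for _ in range(1 + n):
        html = click_link()
        if is_loaded(html):
            return html
    return None  # the caller treats this as a failed crawler task


def has_article(html: str) -> bool:
    # Stands in for the detail page loading completion selector check.
    return "id='article'" in html


# Simulated clicks: two empty responses, then a fully loaded detail page.
responses = iter(["", "", "<div id='article'>text</div>"])
loaded_html = load_detail_with_retries(lambda: next(responses), has_article, n=2)

# A page that never loads exhausts the retries.
never_loaded = load_detail_with_retries(lambda: "", has_article, n=1)
```

With n=2 the third click succeeds and its HTML is returned; the page that never loads yields `None`.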

In this embodiment, the loading state of the detail page is judged after the loading is completed, and the judgment is performed through the marked webpage elements, so that not only can the successful loading of the detail page be ensured, but also the judgment accuracy can be improved.

In one embodiment, prior to ending the crawler, the method further comprises: judging whether the currently loaded list page is an end page or not; and if the currently loaded list page is not the end page, completing page turning operation according to the webpage elements marked in the crawler marking template until the currently loaded list page is the end page.

Specifically, after all detail pages corresponding to one list page have been clicked and loaded successfully, and before the crawler ends, it is also necessary to judge whether the currently loaded list page is the end page, since further list pages may remain. When the currently loaded list page is determined to be the end page, the crawler task is complete: the Chrome plug-in is initialized, all tab pages are closed, and the crawler ends. When the currently loaded list page is determined not to be the end page, the crawler task is not yet complete, and there may be pages not yet turned to that still need to be crawled. The page turning operation is then performed according to the webpage elements labeled in the crawler labeling template, and the crawler does not end until the currently loaded list page is determined to be the end page.

In one embodiment, completing a page turning operation according to the webpage elements labeled in the crawler labeling template until the currently loaded list page is the end page includes: acquiring a page turning button selector field and a page turning button text field labeled in the crawler labeling template; searching, according to the page turning button selector field and the page turning button text field, whether a page turning button exists in the currently loaded list page; when no page turning button exists, turning the list page by page number; when a page turning button exists, clicking the page turning button to turn the page; and after page turning, returning to the step of judging whether the list page is loaded successfully according to the list page success selector field, until the currently loaded list page is the end page.

The page turning button selector (nextpageSelectorGlobal) field is a labeled webpage element used for turning pages in the webpage. The page turning button text (nextPageText) field is the labeled text content of the page turning button, such as "next page", "page turning", or ">". Combining the page turning button text with the page turning button selector allows the page turning button in the webpage to be located accurately. The end page is the last list page in the crawler task; for example, it may be indicated by a corresponding mark.

Specifically, when the currently loaded list page is determined not to be the end page, the crawler task is not yet complete, and there may be pages not yet turned to that still need to be crawled. The page turning button selector field and the page turning button text field are then obtained from the crawler labeling template, and the page turning button is located in the currently loaded list page according to these two fields. That is, the parsing tool jsoup is used to search the source code of the currently loaded page for the values of the page turning button selector field and the page turning button text field. If a webpage element corresponding to both field values is found in the source code, that element is the page turning button. If no such element is found, the webpage has no page turning button. After the page turning button is found, it is clicked to load the next list page, the flow returns to the step of judging whether the list page is loaded successfully according to the list page success selector field, and the same crawler operation continues; the crawler task ends only once the end page has been loaded.

When it is determined, according to the page turning button selector field and the page turning button text field, that no page turning button exists in the currently loaded list page, the list page is turned by page number.

Specifically, some websites have no page turning button and instead turn pages through page numbers. Therefore, when the search based on the page turning button selector field and the page turning button text field determines that the webpage has no page turning button, the list page is turned by page number.
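
The choice between clicking a page turning button and turning by page number can be sketched as follows. The sketch simplifies heavily: a page is modeled as a list of hypothetical `Element` records rather than real HTML, selector matching is reduced to string equality, and the returned action strings (`click:...`, `goto-page:...`) are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical simplified page element: the selector it matches and its text."""
    selector: str
    text: str


def find_next_button(elements: List[Element], button_selector: str,
                     button_text: str) -> Optional[Element]:
    """Return the element matching BOTH the selector and the text, if any."""
    for element in elements:
        if element.selector == button_selector and element.text.strip() == button_text:
            return element
    return None


def choose_page_turn(elements: List[Element], button_selector: str,
                     button_text: str, current_page: int) -> str:
    """Describe the page-turn action: click the button, else turn by page number."""
    button = find_next_button(elements, button_selector, button_text)
    if button is not None:
        return f"click:{button.selector}"
    return f"goto-page:{current_page + 1}"


# Two links share one selector; only the text tells "next" from "previous".
page_with_button = [Element("a.pager", "previous page"), Element("a.pager", "next page")]
with_button = choose_page_turn(page_with_button, "a.pager", "next page", current_page=1)

# A page with only numbered links falls back to page-number turning.
page_without_button = [Element("a.num", "2"), Element("a.num", "3")]
without_button = choose_page_turn(page_without_button, "a.pager", "next page", current_page=1)
```

The first page shows why the button text is combined with the selector: the selector alone would also match the "previous page" link.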

In this embodiment, to avoid incomplete collection when further pages remain to be turned, whether the current page is the end page is judged before deciding whether to end the crawler, which improves the precision of crawler collection.

In one embodiment, determining whether the web page was loaded successfully according to the selector field includes: acquiring a field value corresponding to a selector field, and searching in a webpage source code of a webpage according to the field value; and when the webpage element corresponding to the field value is found in the webpage source code, determining that the webpage loading is successful.

The selector field in this embodiment includes the list page success selector field and the detail page loading completion selector field. The webpage is determined by the selector field: it is the list page for the list page success selector field, and the detail page for the detail page loading completion selector field.

Specifically, whether the list page is loaded successfully is judged according to the list page success selector field, and whether the detail page is loaded successfully is judged according to the detail page loading completion selector field. Because each field in the crawler labeling template is a webpage element labeled from the webpage, a correctly loaded webpage should contain an element corresponding to the field value. Therefore, jsoup is used to search the source code of the webpage according to the field value of the selector field. If a matching webpage element is found in the source code, the webpage is loaded successfully; otherwise, loading has failed. In this embodiment, whether the webpage is loaded successfully is determined from the labeled webpage elements, which improves the accuracy of the loading judgment.
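
The load check can be illustrated with a small sketch. The text names a parsing tool for this lookup; the sketch below swaps in Python's standard `html.parser` module instead and supports only the simplest selector forms (`tag`, `#id`, `.class`), which is an assumption made to keep the example self-contained. `page_loaded` and `_ElementFinder` are hypothetical names.

```python
from html.parser import HTMLParser


class _ElementFinder(HTMLParser):
    """Records whether any start tag matches a simple `tag`, `#id` or `.class` selector."""

    def __init__(self, selector: str) -> None:
        super().__init__()
        self.selector = selector
        self.found = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.selector.startswith("#"):
            match = attributes.get("id") == self.selector[1:]
        elif self.selector.startswith("."):
            match = self.selector[1:] in classes
        else:
            match = tag == self.selector.lower()
        self.found = self.found or match


def page_loaded(html: str, selector: str) -> bool:
    """Treat the page as loaded only if the labeled element is present in its source."""
    finder = _ElementFinder(selector)
    finder.feed(html)
    return finder.found


# Assumed sample pages: one fully loaded, one still showing a spinner.
loaded = page_loaded('<div id="content"><ul class="news list"></ul></div>', ".list")
still_loading = page_loaded('<div class="spinner"></div>', "#content")
```

A full CSS selector engine would be needed for real templates; the point here is only that the presence of the labeled element is what signals a successful load.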

In one embodiment, a flowchart of another crawler method based on a Chrome plug-in is provided as shown in fig. 3, and a crawler system based on a Chrome plug-in is provided as shown in fig. 4. The crawler method based on the Chrome plug-in is explained below with reference to fig. 3, fig. 4, and Table 1.

TABLE 1 crawler labeling templates

Specifically, as shown in fig. 4, the crawler system based on the Chrome plug-in includes a template labeling tool, the Chrome plug-in, and a distributed crawler subsystem including a template management center, a task scheduling center, and a proxy scheduling server that provides a large number of different proxies for the crawler.

First, a website template is labeled with the template labeling tool; that is, template labeling is performed on the target webpage to obtain the corresponding crawler labeling template. A corresponding crawler task is created and placed in the task scheduling center, and the crawler program requests crawler tasks from the task scheduling center periodically. When the crawler task request fails, the flow returns to requesting the task scheduling center again. If the crawler task request succeeds, the crawler program then requests the labeled crawler labeling template from the template management center. If the template request fails, the task ends directly. If the template request succeeds, the Chrome plug-in is started and accesses the list page set in the crawler task by loading the link of the list page.

Next, whether the list page is loaded successfully is judged according to the success CSSelector field in the template; if not, the webpage is refreshed. If the list page is loaded successfully, the Chrome plug-in obtains the links of the detail pages according to the linkSelector field in the template and clicks them to load. Whether each detail page is loaded successfully is then judged through the detailPageLoadFinishedSelector field in the template. If the detail page has not finished loading, the flow returns to the step of clicking to load the detail page again. If the detail page is loaded successfully, its HTML is returned to the crawler, and the crawler exports the result.

Finally, whether all detail pages in the list page have been clicked is judged; if not, the flow returns to continue clicking. If all have been clicked, whether the current list page is the end page is judged. If it is not the end page, a page turning button is located according to the nextpageSelectorGlobal field and the nextpageText field in the template and used to turn the page, and crawling resumes after the page is turned. If it is the end page, the Chrome plug-in is initialized, all tab pages are closed, and the crawler ends.

In addition, the datasource field is the ID of the template and is used when looking up the template. The urlModel field is also a list page link, but in the template it is used to verify that the template is valid. The isfeignsite field indicates whether the webpage is an overseas website, and thus determines whether the crawler uses a domestic proxy or a foreign proxy. The publish_time field is used to extract the time at which the article was published. Since some websites require manual intervention, the bdedhumanintervent field indicates whether a manual click operation is required. The bAjaxLoad field indicates whether the crawler must wait for asynchronous requests to finish loading after the detail page has loaded; the crawler makes this check after opening the detail page. The bTurnByNumber field indicates whether pages can be turned by page number.
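
How a crawler might consume a few of these auxiliary fields can be sketched as follows. The attribute names below are hypothetical Python renderings chosen for readability and are not the template's actual field names; `CrawlerTemplate`, `choose_proxy_pool`, and the pool labels are likewise assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CrawlerTemplate:
    """Hypothetical Python view of a labeling template; names are illustrative."""
    data_source: str        # template ID, used when looking up the template
    url_model: str          # list page link, used to verify the template
    is_foreign_site: bool   # overseas website, so a foreign proxy is used
    ajax_load: bool         # wait for asynchronous requests on detail pages
    turn_by_number: bool    # pages can be turned by page number


def choose_proxy_pool(template: CrawlerTemplate) -> str:
    """Pick a proxy pool label from the overseas-site flag."""
    return "foreign" if template.is_foreign_site else "domestic"


# Assumed sample templates for one domestic and one overseas site.
domestic_site = CrawlerTemplate("site-a", "https://example.com/list", False, True, False)
overseas_site = CrawlerTemplate("site-b", "https://example.org/list", True, False, True)
pools = [choose_proxy_pool(domestic_site), choose_proxy_pool(overseas_site)]
```

Keeping these flags in the template rather than in code is what lets one crawler program serve many websites without per-site logic.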

In one embodiment, a method for performing template tagging on a target webpage to obtain a crawler tagging template includes: marking items to be captured in a target webpage, and obtaining at least two cascading style sheet expressions of the marked items; generating a cascading style sheet general formula of the marked item according to the cascading style sheet expressions; and generating a crawler labeling template of the target webpage according to the cascading style sheet formula and the webpage information of the target webpage.

The target webpage is the webpage on which template labeling is performed, the cascading style sheet expression is a CSS (Cascading Style Sheets) expression, and the cascading style sheet general formula is a CSS general formula. The marked items are the items that need to be captured, such as the title, author, publication time and other information. Each of these items is marked to obtain a CSS general formula, so that the crawler can automatically extract the important information after collecting the list pages.

Specifically, referring to the flowchart shown in fig. 5, the detailed steps of template labeling are as follows:

Step 1, a user marks the items to be captured in the target webpage (such as the titles in a list page) by using the template marking tool in the Chrome browser. After marking is completed, the template marking tool obtains the CSS expressions of two marked items, denoted sample1 and sample2. Assume sample1 in this embodiment is "HTML > BODY > DIV:nth-of-type(5) > DIV:nth-of-type(2) > DIV:nth-of-type(1) > DIV:nth-of-type(1) > DIV:nth-of-type(1) > UL:nth-of-type(1) > LI:nth-of-type(1) > A:nth-of-type(1)", and assume sample2 in this embodiment is "HTML > BODY > DIV:nth-of-type(5) > DIV:nth-of-type(2) > DIV:nth-of-type(1) > DIV:nth-of-type(1) > DIV:nth-of-type(1) > UL:nth-of-type(1) > LI:nth-of-type(9) > A:nth-of-type(1)". These two parameters are required for finally generating the CSS general formula.

Step 2, judge whether sample1 and sample2 are empty strings; if so, jump directly to the end. If not, compute the shorter of the lengths of the sample1 and sample2 strings and denote it as length. The purpose is to ensure that the subsequent traversal of sample1 does not go out of bounds, which improves the robustness of the code.

Step 3, traverse sample1 with the subscript running from 0 to length. Whenever a "(" character is encountered, record its subscript as start; stop when the characters of sample1 and sample2 at the same subscript differ. Observing sample1 and sample2, the two CSS expressions uniquely identify different HTML tags in the same list page, so their structures are very similar; they differ only in the index specified at the CSS level where the marked item is located. For example, in the "LI:nth-of-type(9)" part of sample2, the index 9 specifies the 9th marked item at the <LI> tag level.

Step 4, continue traversing the characters of sample1 from subscript start until the character ")" is found, and record its subscript as end. The purpose is to record the end position of the part in which sample1 and sample2 differ, which facilitates the subsequent deletion to obtain the CSS general formula.

Step 5, judge whether start and end satisfy either of the following conditions: start is 0, or end < start. If at least one condition is satisfied, no valid parenthesis pair was located, so jump to step 5.2. Otherwise jump to step 5.1. The purpose of this step is to verify the correctness of the above steps.

Step 5.1, create a new variable sample whose value is the shorter of sample1 and sample2. Go to step 6. The purpose of this step is to make the final CSS general formula shorter.

Step 5.2, the CSS general formula cannot be extracted, and error information is output. Go to step 11.

Step 6, the level at which subscripts start to end are located is now known to be the CSS level of the required marked information, so the character string between subscripts start and end in sample (in this example, "9") is deleted. For example, the expressions "LI:nth-child(x)" and "LI:nth-of-type(x)" each refer to one specific tag among the <LI> tags; deleting this specific index yields the CSS general formula that selects all of the required list information. Then, the remaining "nth-child()" or "nth-of-type()" symbol sequence at the level where start and end are located in sample is replaced with an empty string.

Step 7, return sample as the final CSS general formula. In this example, the final CSS general formula is "HTML > BODY > DIV:nth-of-type(5) > DIV:nth-of-type(2) > DIV:nth-of-type(1) > DIV:nth-of-type(1) > DIV:nth-of-type(1) > UL:nth-of-type(1) > LI > A:nth-of-type(1)". Since this embodiment takes the title as an example, all titles in the list page can be selected with the obtained CSS general formula. Fields such as UrlModel and DataSource in the template can then be generated from webpage information such as the URL link, and the remaining required crawling information is entered manually, so that the template is finally obtained and stored.

Step 8, template marking ends.
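
Steps 2 to 7 can be sketched as a single function. This is a minimal sketch under stated assumptions rather than the exact procedure of the embodiment: the function name `css_general_formula` is hypothetical, the samples are assumed to contain no spaces inside the pseudo-class (as in "LI:nth-of-type(9)"), start equal to 0 or end less than start is treated as the failure branch, identical samples are rejected, and the closing ")" is located in the chosen shorter sample so that multi-digit indexes are handled.

```python
from typing import Optional

# Pseudo-class shells left behind once the index has been deleted (step 6).
EMPTY_PSEUDO_CLASSES = (":nth-of-type()", ":nth-child()")


def css_general_formula(sample1: str, sample2: str) -> Optional[str]:
    # Step 2: reject empty input and bound the traversal by the shorter string.
    if not sample1 or not sample2:
        return None
    length = min(len(sample1), len(sample2))

    # Step 3: remember the last "(" seen before the two samples first differ.
    start = 0
    diff = -1
    for i in range(length):
        if sample1[i] != sample2[i]:
            diff = i
            break
        if sample1[i] == "(":
            start = i

    # Step 5.1: work on the shorter sample (both share the prefix up to `diff`).
    sample = sample1 if len(sample1) <= len(sample2) else sample2
    # Step 4: find the ")" that closes the differing index.
    end = sample.find(")", start)

    # Step 5: without a difference or a valid "(...)" pair nothing can be generalized.
    if diff == -1 or start == 0 or end < start:
        return None

    # Step 6: delete the index, then drop the now-empty pseudo-class.
    emptied = sample[:start + 1] + sample[end:]
    for pseudo in EMPTY_PSEUDO_CLASSES:
        if emptied.endswith(pseudo, 0, start + 2):
            cut = start + 2 - len(pseudo)
            # Step 7: the remaining string is the general formula.
            return emptied[:cut] + emptied[start + 2:]
    return None


first = "UL:nth-of-type(1) > LI:nth-of-type(1) > A:nth-of-type(1)"
ninth = "UL:nth-of-type(1) > LI:nth-of-type(9) > A:nth-of-type(1)"
print(css_general_formula(first, ninth))
```

Applied to sample1 and sample2 from step 1, the same function produces a formula whose tail is "UL:nth-of-type(1) > LI > A:nth-of-type(1)", which selects every title in the list. The traversal is a single pass over the shorter string.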

In this embodiment, the template marking process only requires marking the CSS general formula of the detail page links in the list page, the CSS general formula indicating that the list page is displayed correctly, the publication time, the author, the CSS general formula of the page turning button, and the like, so that the crawler can extract data accurately and efficiently. The time complexity of the whole template marking process is only O(n), so it is very fast, which further improves the efficiency of the crawler.

It should be understood that although the steps in the flowcharts of FIGS. 2-4 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-4 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 6, there is provided a crawler apparatus based on Chrome plug-in, including a task request module 601, a template request module 602 and a crawler module 603.

The task request module 601 is configured to request a crawler task from the task scheduling center.

The template request module 602 is configured to request the crawler labeling template corresponding to the crawler task, the crawler labeling template being obtained by performing template labeling on the target webpage in advance.

The crawler module 603 is configured to start the Chrome plug-in, which performs the crawler operation according to the webpage elements labeled in the crawler labeling template.

In one embodiment, the crawler module 603 is further configured to obtain the link of the list page corresponding to the crawler task, and the list page success selector field and the detail page link selector field marked in the crawler marking template; load the link of the list page, and judge whether the list page is loaded successfully according to the list page success selector field; when the list page is not loaded successfully, refresh the link of the list page to reload it; when the list page is loaded successfully, obtain the links of the detail pages included in the list page according to the detail page link selector field and click them to load; and export the hypertext markup language of each loaded detail page as a crawler result, and end the crawler after the links of all the detail pages in the list page have been loaded.

In one embodiment, the crawler module 603 is further configured to obtain the detail page load complete selector field marked in the crawler marking template, and judge whether the detail page is loaded successfully according to the detail page load complete selector field; when the detail page is not loaded successfully, click the link again to load the detail page; and when the detail page is loaded successfully, proceed to the step of exporting the hypertext markup language of the loaded detail page as a crawler result.

In one embodiment, the crawler module 603 is further configured to determine whether the currently loaded list page is an end page; and if the currently loaded list page is not the end page, completing page turning operation according to the webpage elements marked in the crawler marking template until the currently loaded list page is the end page.

In one embodiment, the crawler module 603 is further configured to obtain a field value corresponding to the selector field, and search for a web page source code of the web page according to the field value; wherein the selector field comprises a list page success selector field and/or a detail page load complete selector field, and the web page comprises a list page corresponding to the list page success selector field and/or the detail page corresponding to the detail page load complete selector field; and when the webpage element corresponding to the field value is found in the webpage source code, determining that the webpage corresponding to the selector field is loaded successfully.
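
The load check described above amounts to looking for an element that matches the selector field value in the page source. Inside a Chrome plug-in this would normally be a DOM query such as `document.querySelector`; the sketch below is a simplified stand-in that uses only the Python standard library and supports only the selector forms "tag", "#id", ".class" and "tag.class". The `page_loaded` and `parse_simple` names and the restricted selector grammar are assumptions of this sketch.

```python
from html.parser import HTMLParser
from typing import Optional, Tuple


def parse_simple(selector: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Split a simple selector into (tag, id, class); unsupported forms are not handled."""
    tag, elem_id, cls = selector, None, None
    if "#" in tag:
        tag, elem_id = tag.split("#", 1)
    elif "." in tag:
        tag, cls = tag.split(".", 1)
    return (tag.lower() or None, elem_id, cls)


class _Finder(HTMLParser):
    """Scans start tags and records whether any of them matches the selector parts."""

    def __init__(self, tag: Optional[str], elem_id: Optional[str], cls: Optional[str]):
        super().__init__()
        self.tag, self.elem_id, self.cls = tag, elem_id, cls
        self.found = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if ((self.tag is None or tag == self.tag)
                and (self.elem_id is None or attributes.get("id") == self.elem_id)
                and (self.cls is None or self.cls in classes)):
            self.found = True


def page_loaded(page_source: str, selector_value: str) -> bool:
    """Return True when an element matching the selector exists in the page source."""
    finder = _Finder(*parse_simple(selector_value))
    finder.feed(page_source)
    return finder.found


source = '<div id="main"><ul class="news list"><li><a href="/a">A</a></li></ul></div>'
print(page_loaded(source, "ul.list"), page_loaded("<div>loading</div>", "ul.list"))
```

The same function serves both the list page success selector field and the detail page load complete selector field, since both checks reduce to finding a matching element.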

In one embodiment, the crawler module 603 is further configured to obtain a page turning button selector field and a page turning button text field marked in the crawler marking template; searching whether a page turning button exists in the currently loaded list page or not according to the page turning button selector field and the page turning button text field; when the page turning button does not exist, page turning is carried out on the list page; and when the page turning button exists, clicking the page turning button to turn pages, returning to the step of judging whether the list page is successfully loaded according to the list page success selector field until the currently loaded list page is the end page.
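
Locating the page turning button combines the two template fields: the selector narrows the candidates and the text picks the button among them. The following is a minimal sketch in which `query` is a hypothetical stand-in for a DOM query returning the elements that match a selector, and `Element` is a hypothetical record holding only the visible text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Element:
    text: str  # visible text of a candidate element


def find_page_turn_button(query: Callable[[str], List[Element]],
                          button_selector: str,
                          button_text: str) -> Optional[Element]:
    """Return the first element matched by the selector whose text equals the button text."""
    for element in query(button_selector):
        if element.text.strip() == button_text:
            return element
    return None


# Hypothetical pager: three links share the selector, only one is the page turning button.
pager = {"div.pager a": [Element("1"), Element(" Next "), Element("Last")]}
button = find_page_turn_button(lambda selector: pager.get(selector, []), "div.pager a", "Next")
print(button)
```

Matching on both fields matters when the pager renders several links with the same selector, as in the example data.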

In one embodiment, the system further comprises a template marking module, which is used for marking the items to be captured in the target webpage and obtaining at least two cascading style sheet expressions of the marked items; generating a cascading style sheet general formula of the marked item according to the cascading style sheet expressions; and generating a crawler labeling template of the target webpage according to the cascading style sheet formula and the webpage information of the target webpage.

For the specific definition of the crawler device based on the Chrome plug-in, reference may be made to the above definition of the crawler method based on the Chrome plug-in, and details are not repeated here. The modules in the above crawler device based on the Chrome plug-in may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, it may implement the steps of the above embodiments of the crawler method based on the Chrome plug-in. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like.

In one embodiment, a computer device, which may be a server, is provided that includes a processor, a memory, and a network interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a Chrome plug-in based crawler method. Illustratively, a computer program may be partitioned into one or more modules, which are stored in a memory and executed by a processor to implement the present invention. One or more of the modules may be a sequence of computer program instruction segments for describing the execution of a computer program in a computer device that is capable of performing certain functions.

The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.

The memory may be used to store the computer programs and/or modules, and the processor implements various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the computer device, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

It will be understood by those skilled in the art that the computer device structure shown in the embodiment is only a partial structure related to the solution of the present invention, and does not constitute a limitation to the computer device to which the present invention is applied, and a specific computer device may include more or less components, or combine some components, or have different component arrangements.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:

requesting a crawler task from a task scheduling center;

requesting the crawler labeling template corresponding to the crawler task, the crawler labeling template being obtained by performing template labeling on a target webpage in advance;

and starting the Chrome plug-in, and performing crawler operation by the Chrome plug-in according to the webpage elements marked in the crawler marking template.

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring the link of the list page corresponding to the crawler task, and the list page success selector field and the detail page link selector field marked in the crawler marking template; loading the link of the list page, and judging whether the list page is loaded successfully according to the list page success selector field; when the list page is not loaded successfully, refreshing the link of the list page to reload it; when the list page is loaded successfully, acquiring the links of the detail pages included in the list page according to the detail page link selector field and clicking them to load; and exporting the hypertext markup language of each loaded detail page as a crawler result, and ending the crawler after the links of all the detail pages in the list page have been loaded.

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a detail page loading completion selector field marked in a crawler marking template, and judging whether the detail page is loaded successfully according to the detail page loading completion selector field; when the detail page is not loaded successfully, clicking the link for loading the detail page again; and when the detail page is loaded successfully, the step of exporting the hypertext markup language of the loaded detail page as a crawler result is entered.

In one embodiment, the processor, when executing the computer program, further performs the steps of: judging whether the currently loaded list page is an end page or not; and if the currently loaded list page is not the end page, completing page turning operation according to the webpage elements marked in the crawler marking template until the currently loaded list page is the end page.

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a field value corresponding to a selector field, and searching in a webpage source code of a webpage according to the field value; wherein the selector field comprises a list page success selector field and/or a detail page load complete selector field, and the web page comprises a list page corresponding to the list page success selector field and/or the detail page corresponding to the detail page load complete selector field; and when the webpage element corresponding to the field value is found in the webpage source code, determining that the webpage corresponding to the selector field is loaded successfully.

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a page turning button selector field and a page turning button text field which are marked in a crawler marking template; searching whether a page turning button exists in the currently loaded list page or not according to the page turning button selector field and the page turning button text field; when the page turning button does not exist, page turning is carried out on the list page; and when the page turning button exists, clicking the page turning button to turn pages, returning to the step of judging whether the list page is successfully loaded according to the list page success selector field until the currently loaded list page is the end page.

In one embodiment, the processor, when executing the computer program, further performs the steps of: marking items to be captured in a target webpage, and obtaining at least two cascading style sheet expressions of the marked items; generating a cascading style sheet general formula of the marked item according to the cascading style sheet expressions; and generating a crawler labeling template of the target webpage according to the cascading style sheet formula and the webpage information of the target webpage.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

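requesting a crawler task from a task scheduling center;

requesting the crawler labeling template corresponding to the crawler task, the crawler labeling template being obtained by performing template labeling on a target webpage in advance;

and starting the Chrome plug-in, and performing crawler operation by the Chrome plug-in according to the webpage elements marked in the crawler marking template.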

In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring the link of the list page corresponding to the crawler task, and the list page success selector field and the detail page link selector field marked in the crawler marking template; loading the link of the list page, and judging whether the list page is loaded successfully according to the list page success selector field; when the list page is not loaded successfully, refreshing the link of the list page to reload it; when the list page is loaded successfully, acquiring the links of the detail pages included in the list page according to the detail page link selector field and clicking them to load; and exporting the hypertext markup language of each loaded detail page as a crawler result, and ending the crawler after the links of all the detail pages in the list page have been loaded.

In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a detail page loading completion selector field marked in a crawler marking template, and judging whether the detail page is loaded successfully according to the detail page loading completion selector field; when the detail page is not loaded successfully, clicking the link for loading the detail page again; and when the detail page is loaded successfully, the step of exporting the hypertext markup language of the loaded detail page as a crawler result is entered.

In one embodiment, the computer program when executed by the processor further performs the steps of: judging whether the currently loaded list page is an end page or not; and if the currently loaded list page is not the end page, completing page turning operation according to the webpage elements marked in the crawler marking template until the currently loaded list page is the end page.

In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a field value corresponding to a selector field, and searching in a webpage source code of a webpage according to the field value; wherein the selector field comprises a list page success selector field and/or a detail page load complete selector field, and the web page comprises a list page corresponding to the list page success selector field and/or the detail page corresponding to the detail page load complete selector field; and when the webpage element corresponding to the field value is found in the webpage source code, determining that the webpage corresponding to the selector field is loaded successfully.

In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a page turning button selector field and a page turning button text field which are marked in a crawler marking template; searching whether a page turning button exists in the currently loaded list page or not according to the page turning button selector field and the page turning button text field; when the page turning button does not exist, page turning is carried out on the list page; and when the page turning button exists, clicking the page turning button to turn pages, returning to the step of judging whether the list page is successfully loaded according to the list page success selector field until the currently loaded list page is the end page.

In one embodiment, the computer program when executed by the processor further performs the steps of: marking items to be captured in a target webpage, and obtaining at least two cascading style sheet expressions of the marked items; generating a cascading style sheet general formula of the marked item according to the cascading style sheet expressions; and generating a crawler labeling template of the target webpage according to the cascading style sheet formula and the webpage information of the target webpage.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A crawler method based on Chrome plug-in is characterized by comprising the following steps:

requesting a crawler task from a task scheduling center;

2. The method of claim 1, wherein a Chrome plug-in is started, and the crawler plug-in performs crawler operations according to the webpage elements marked in the crawler marking template, and the method comprises the following steps:

3. The method according to claim 2, wherein after acquiring the link of the detail page included in the list page according to the detail page link selector field and clicking to load, the method comprises:

4. The method of claim 2, wherein prior to ending the crawler, the method further comprises:

judging whether the currently loaded list page is an end page or not;

5. The method of claim 3, wherein determining whether the web page was loaded successfully based on the selector field comprises:

6. The method according to claim 4, wherein the completing page turning operations according to the webpage elements labeled in the crawler labeling template until the currently loaded list page is an end page comprises:

7. The method of claim 1, wherein the step of template tagging the target webpage to obtain the crawler tagging template comprises:

8. A crawler device based on Chrome plug-in, comprising:

9. A computer device comprising a processor and a memory, said memory storing a computer program, wherein said processor is configured to implement the Chrome plug-in based crawler method of any one of claims 1-7 when executing said computer program.

10. A computer-readable storage medium, having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the Chrome plug-in based crawler method of any one of claims 1-7.