CN109783728B

CN109783728B - Page crawler rule updating method and system

Info

Publication number: CN109783728B
Application number: CN201811637755.7A
Authority: CN
Inventors: 韩建民; 王玮; 苏文畅; 王兆育; 孙志豪
Original assignee: Anhui Tingjian Technology Co ltd
Current assignee: Anhui Tingjian Technology Co ltd
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2021-10-19
Anticipated expiration: 2038-12-29
Also published as: CN109783728A

Abstract

The invention discloses a method and a system for updating page crawler rules, wherein the method mainly comprises the following steps: crawling content entities of the page by using an initial crawler rule; judging whether rule updating is needed or not according to the consistent relation between the entity sample and the content entity; and if so, updating the initial crawler rule according to the entity sample and the page information of the changed content entity. According to the crawler monitoring method and the crawler monitoring system, updated crawler rules can be output in a self-adaptive mode according to the crawling result, manual frequent monitoring and maintenance of crawler programs are not needed, manual workload is greatly reduced, and meanwhile accuracy and stability of crawler effects can be effectively guaranteed.

Description

Page crawler rule updating method and system

Technical Field

The invention relates to the field of crawlers, in particular to a method and a system for updating page crawler rules.

Background

Crawler technology is a well-known means of acquiring web page information. For example, with the explosive growth of information on the internet and the continued rise of TV dramas, live streaming and short videos, more and more people watch videos online, but some of those videos may contain pornographic or violent content. Crawler technology can be used to automatically acquire page information from a video website, locate the source videos from that page information, and then, in combination with automatic AI recognition technology, accurately judge whether the video content is pornographic or violent.

Existing crawler schemes mainly consist of manually inspecting the page code, working out the crawling rules that correspond to it, and writing a corresponding crawler program. However, when a change in website style makes the original rule inapplicable, a new rule has to be established manually. Because the number of web pages is large and changes to a website cannot be predicted, the state of the crawler must be monitored regularly to judge whether a crawling rule has become invalid. By the time a rule is found to be out of date, a batch of inaccurate data has often already been produced; that data must be deleted and a new crawling rule drawn up. Otherwise the crawling rules must be maintained manually and frequently, which is costly in labor and inefficient. For the individual maintainer the process is also tedious and time-consuming, and the accuracy and stability of the crawler's results are difficult to guarantee.

Disclosure of Invention

In order to solve the above problems, the invention provides a page crawler rule updating method and system, which can automatically judge whether a crawling rule needs updating, and update it, based on the existing crawling result and the changed content, thereby avoiding the drawbacks of maintaining the crawler manually.

The technical scheme adopted by the invention is as follows:

a page crawler rule updating method comprises the following steps:

crawling content entities of the page by using an initial crawler rule;

judging whether rule updating is needed or not according to the consistent relation between the entity sample and the content entity;

and if so, updating the initial crawler rule according to the entity sample and the page information of the changed content entity.

Optionally, the updating the initial crawler rule according to the entity sample and the page information of the content entity that changes includes:

acquiring a target entity consistent with the entity sample in a page corresponding to the entity sample;

determining the position information of the target entity in the page;

determining target position information from the respective position information of different pages;

and updating the initial crawler rule by using the target position information.

Optionally, the determining the target location information from the respective location information of the different pages includes:

summarizing the position information aiming at the same entity sample in one page into a set;

finding the intersection of the sets of all pages;

and taking the position information in the intersection as the target position information.

Optionally, the determining the location information of the target entity in the page includes:

taking the upper layer element of the target entity as a positioning condition, and acquiring a result capable of matching the positioning condition in the page;

determining the location information based on current location conditions until the target entity is the only result obtained.

Optionally, the taking an upper layer element of the target entity as a positioning condition includes:

acquiring a first layer element above the target entity;

judging whether the first layer element contains specific attribute information or not;

if so, using the first layer element as a positioning condition;

if not, acquiring a second layer element above the target entity, and judging whether the second layer element contains the specific attribute information, and in this way, taking all the acquired upper layer elements as positioning conditions until any one upper layer element containing the specific attribute information is acquired.

Optionally, the determining the location information of the target entity in the page further includes:

judging whether the obtained results based on the current layer elements are multiple or not;

if yes, acquiring an element on the upper layer of the current layer element to re-determine the positioning condition, and re-acquiring a result which can be matched with the positioning condition in the page;

if not, determining that the target entity is the only acquired result.

Optionally, the determining whether rule updating is required according to a consistent relationship between the entity sample and the content entity includes:

monitoring whether the content entity is consistent with the entity sample in real time;

and when the number of the pages with the inconsistency is monitored to exceed a preset threshold value, judging that the rule needs to be updated.

A page crawler rule updating system comprising:

the crawler module is used for crawling content entities of the page by utilizing an initial crawler rule;

the updating judging module is used for judging whether rule updating is needed or not according to the consistent relation between the entity sample and the content entity;

and the rule updating module is used for updating the initial crawler rule according to the entity sample and the page information of the changed content entity when the output of the updating judging module is yes.

Optionally, the rule updating module specifically includes:

the target entity positioning unit is used for acquiring a target entity consistent with the entity sample in a page corresponding to the entity sample;

the position information determining unit is used for determining the position information of the target entity in the page;

a target position information determining unit configured to determine target position information from the respective position information of different pages;

and the rule updating unit is used for updating the initial crawler rule by using the target position information.

Optionally, the update determining module specifically includes:

the change monitoring unit is used for monitoring whether the content entity is consistent with the entity sample in real time;

and the updating judgment unit is used for judging that the rule updating is required when the number of the pages with the inconsistency exceeds a preset threshold value.

The method of the invention updates the crawler rule automatically by associating it with the current web page content and the original rule. Specifically, it crawls the content entities of the page using the initial crawler rule; judges whether rule updating is needed according to the consistency relation between the entity sample and the content entity; and, if so, updates the initial crawler rule according to the entity sample and the page information of the changed content entity. With the method and system of the invention, updated crawler rules can be output adaptively according to the crawling result, frequent manual monitoring and maintenance of crawler programs is not needed, manual workload is greatly reduced, and the accuracy and stability of the crawler's results can be effectively guaranteed.

Drawings

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:

FIG. 1 is a flowchart of an embodiment of a method for updating a page crawler rule provided by the present invention;

FIG. 2 is a flow chart of an embodiment of a change monitoring method provided by the present invention;

FIG. 3 is a flowchart of a rule updating method according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for determining location information according to an embodiment of the present invention;

FIG. 5 is a flowchart of a positioning condition determining method according to an embodiment of the present invention;

FIG. 6 is a flowchart of a positioning result determining method according to an embodiment of the present invention;

FIG. 7 is a flowchart of a method for determining location information of a target according to an embodiment of the present invention;

FIG. 8 is a block diagram illustrating an embodiment of a system for updating page crawler rules according to the present invention.

Description of reference numerals:

1 crawler module, 2 updating judgment module, 3 rule updating module

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.

The invention provides an embodiment of a method for updating a page crawler rule, which comprises the following steps as shown in fig. 1:

step S1, crawling the content entity of the page by using the initial crawler rule;

The initial crawler rule is set for a certain category of network resources. The web page categories may be defined in advance, for example web pages of a video website, web pages of a shopping website, web pages of a news website, and the like. Because their functions differ, different page categories are highly likely to have different page layouts, while web pages of the same category have content layouts with high similarity. The initial crawler rule referred to in this embodiment is therefore specific to one category of web pages; the same applies to other categories and is not repeated here.

Specifically, the initial crawler rule may be obtained by setting a rule for at least one specific target in the page. The specific targets may follow the actual content sections or subdivided categories of the page; for a video web page, for example, the specific targets may include, but are not limited to, a title, a brief introduction, an author, and the like. Different targets have their own corresponding rules, so the initial crawler rule may be understood as an aggregate of at least one "target rule".

Regarding the crawling manner, a preset crawler program may crawl the content entities corresponding to each specific target from the page according to the initial crawler rule at a predetermined period, for example every 30 minutes, every 1 hour, or every 2 hours. Three points need to be noted here: first, the number of pages to be crawled is not limited; second, the number of specific targets under one page is not limited; third, the invention does not limit the form of the content entity, the entity sample mentioned below, and the like, which may be any form of content in a web page, such as characters, symbols, or code.
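
As an illustration of an initial crawler rule built from several "target rules", the following minimal sketch applies one rule per specific target to a page. The rule format (an ElementTree path expression per target), the names `INITIAL_RULES` and `crawl_page`, and the sample markup are assumptions made for this example and are not specified by the invention; the sketch also assumes well-formed markup, whereas real HTML would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical rule format: one ElementTree path expression per specific target.
INITIAL_RULES: Dict[str, str] = {
    "title": ".//h1[@class='video-title']",
    "author": ".//span[@class='uploader']",
}


def crawl_page(page_source: str, rules: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply every target rule to one page and return the content entities."""
    root = ET.fromstring(page_source)  # assumes well-formed markup
    entities: Dict[str, Optional[str]] = {}
    for target, path in rules.items():
        node = root.find(path)
        text = node.text if node is not None else None
        # A rule that no longer matches yields None for that target.
        entities[target] = text.strip() if text else None
    return entities


SAMPLE_PAGE = (
    "<html><body>"
    "<h1 class='video-title'>Sample Title</h1>"
    "<span class='uploader'>Sample Author</span>"
    "</body></html>"
)
entities = crawl_page(SAMPLE_PAGE, INITIAL_RULES)
```

In a periodic crawl, `crawl_page` would be called for every page of the category at each interval, and its output compared with the entity samples.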

Step S2, judging whether rule updating is needed according to the consistent relation between the entity sample and the content entity;

The entity sample referred to here may be preset, before crawling, for the content to be crawled under a specific target, or the content entity crawled under a specific target the first time the initial crawler rule is applied may be taken as the entity sample. Whichever manner is adopted, an entity sample corresponds to at least one specific target in each web page. For example, if the title, brief introduction and author are to be crawled from each of five video web pages, then with these three specific targets fifteen entity samples need to be prepared; this may also be understood as one sample per web page, each sample containing three specific-target samples. In practical operation, this step may further include monitoring the content entity under a specific target, and accordingly the invention further provides an embodiment of a preferred monitoring method which, as shown in fig. 2, may include the following steps:

step S201, monitoring whether the content entity is consistent with the entity sample in real time;

Whether a crawled content entity is consistent with the entity sample may be judged by matching along different dimensions, for example by comparing the position of the content entity, or by comparing the characters of the content entity one by one. For example, each content entity crawled according to the initial crawler rule after the first crawl is matched against the entity sample in one or more dimensions to determine whether the two are consistent.

Step S202, when the number of pages in which the two are inconsistent is found to exceed a preset threshold, judging that the rule needs to be updated.

In this step, the invention emphasizes that a change in the content entity is the trigger condition for deciding whether to update the rule; that is, the subsequent rule updating step is performed when it is determined that the content entity has changed. However, an inconsistency between a content entity and its entity sample (i.e. a change in the crawled content entity) does not mean that the subsequent steps must be performed as soon as any single content entity changes. Those skilled in the art will understand that the process includes a limiting condition: only when the crawled content entity of a specific target is inconsistent with the corresponding entity sample, and the number (or proportion) of pages in which that specific target is inconsistent exceeds a preset standard, is it determined that "the content entity has changed". For example, if the content entity of the specific target "title" is inconsistent in more than 50% of the crawled pages, it is determined that the content entity is inconsistent with the entity sample.
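
The threshold check of steps S201 and S202 can be sketched as follows for a single specific target. The function name `needs_rule_update`, the page identifiers, and the use of plain string equality as the consistency test are assumptions for illustration; the 50% default follows the example in the text.

```python
from typing import Dict, Optional


def needs_rule_update(
    samples: Dict[str, str],
    crawled: Dict[str, Optional[str]],
    threshold: float = 0.5,
) -> bool:
    """Return True when the share of inconsistent pages exceeds the threshold.

    Both dicts map a page identifier to the entity of ONE specific target.
    """
    if not samples:
        return False
    inconsistent = sum(
        1 for page, sample in samples.items() if crawled.get(page) != sample
    )
    return inconsistent / len(samples) > threshold


title_samples = {f"page{i}": f"Title {i}" for i in range(1, 6)}
# After a hypothetical site redesign the old rule returns nothing on three pages.
title_crawled = {
    "page1": "Title 1",
    "page2": "Title 2",
    "page3": None,
    "page4": None,
    "page5": None,
}
update_needed = needs_rule_update(title_samples, title_crawled)
```

Here three of five pages are inconsistent, so the proportion (60%) exceeds the threshold and an update is triggered; a single changed page would not trigger one.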

If it is determined in the foregoing that the rule needs to be updated, step S3 is executed: updating the initial crawler rule according to the entity sample and the page information of the changed content entity.

In this step, the present invention proposes that updating the rule does not depend on human monitoring, but performs the updating operation of the rule based on the aforementioned entity sample and the page information of the changed content entity currently crawled, for example, as shown in fig. 3, the following preferred specific scheme may be included:

step S31, acquiring a target entity consistent with the entity sample in a page corresponding to the entity sample;

In this step, after all content information (for example, the source code) of a page in which the content entity was found to be inconsistent with the entity sample has been obtained, the entity sample is used as the basis for searching the current page for identical content, which is called the target entity. For example, if the entity sample for the title of a certain page is "new person with voice", this step searches for "new person with voice" in the page whose content entity is inconsistent with the entity sample. In actual operation, the current web page may yield one occurrence of "new person with voice", or several occurrences located at different positions in the page, and the method continues with the following steps. If "new person with voice" is not found in the current web page, that page does not take part in the subsequent rule updating operation.
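
Step S31 can be sketched as a search of the page source for elements whose text equals the entity sample. The function name `find_target_entities` and the sample markup are hypothetical, and the sketch assumes well-formed markup that `xml.etree.ElementTree` can parse; an empty result means the page is left out of the rule update.

```python
import xml.etree.ElementTree as ET
from typing import List


def find_target_entities(page_source: str, entity_sample: str) -> List[ET.Element]:
    """Return every element whose own text equals the entity sample."""
    root = ET.fromstring(page_source)  # assumes well-formed markup
    return [el for el in root.iter() if (el.text or "").strip() == entity_sample]


CHANGED_PAGE = (
    "<html><body>"
    "<div id='main'><h1><a>new person with voice</a></h1></div>"
    "<div id='side'><ul><li><a>new person with voice</a></li></ul></div>"
    "</body></html>"
)
targets = find_target_entities(CHANGED_PAGE, "new person with voice")
```

Each returned element is the first-layer element directly enclosing one occurrence of the target entity, which is the starting point for determining position information.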

Step S32, determining the position information of the target entity in the page;

After the target entity is obtained, the invention proposes to determine a representation of the target entity's position in the web page; that is, in the subsequent steps the position information of the target entity serves as the basis for formulating the new rule. To ensure the validity and specificity of the new rule, the unique position representation of the target entity in the page is preferably used as that basis; in other words, when a non-unique result is obtained in the page, the position information may be considered ambiguous. For how to obtain the position information, refer to the embodiment shown in fig. 4, which includes the following steps:

step S321, taking the upper layer element of the target entity as a positioning condition, and acquiring a result capable of matching the positioning condition in the page;

step S322, determining the position information based on the current positioning condition until the target entity is the only acquired result.

For example, the target entity "new person with voice" is found in the current page as follows:

…

<a>

new person with voice

</a>

…

Here, elements such as <a></a> in the example are upper-layer elements of "new person with voice", and other entities may be enclosed by the same one or more upper-layer elements. This embodiment therefore uses the upper-layer elements of the target entity to check whether the target entity alone can be located in the page: if the final result of locating with the one or more upper-layer elements is only "new person with voice", then those upper-layer elements can be used to determine the position information.

In practice, reference may be further made to the following preferred embodiment, as illustrated in fig. 5:

step S3211, obtaining a first layer element on the target entity;

step S3212, judging whether the first layer element contains specific attribute information;

if yes, executing step S3213, using the first layer element as a positioning condition;

if not, step S3214 is executed to obtain the second layer element on the target entity, and determine whether the second layer element includes the specific attribute information, in this way, until any upper layer element including the specific attribute information is obtained, all the obtained upper layer elements are used as the positioning conditions.

In this step, the method starts from the layer immediately above the content entity; that is, the <a></a> in the example is obtained first, and it is then judged whether <a></a> carries specific attribute information that can describe a rule, such as id (unique identifier), class (style class), name (label name), a custom attribute, and the like. Those skilled in the art will understand that when an element carries such information, a corresponding rule can be determined from it. Therefore, when the specific attribute information is present, <a></a> together with its specific attribute information can be used as the positioning condition to search the whole page for matching results. If <a></a> does not satisfy the positioning condition of this embodiment, the second-layer element, i.e. the element one layer further above the content entity, is examined for specific attribute information; in the example, this means checking the attribute information of the element that encloses <a>. If that element contains the specific attribute information, it is used together with <a> to search for matching results in the page information of the page whose content entity has changed. If it does not, the same query is applied to the next element up, and so on, until an element containing the specific attribute information is obtained; that element and all layer elements below it are then combined, and the page is searched for matching results.
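
The upward walk of steps S3211 to S3214 can be sketched as below. The attribute list `SPECIFIC_ATTRS`, the function name `positioning_condition`, the `(tag, attributes)` layer representation and the sample markup (including the `div` and `h1` wrappers) are assumptions for illustration; the text does not prescribe a concrete data structure. `ElementTree` elements carry no parent pointer, so a parent map is built first.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

# Attributes assumed able to describe a rule (custom attributes could be added).
SPECIFIC_ATTRS = ("id", "class", "name")

Layer = Tuple[str, Dict[str, str]]


def positioning_condition(root: ET.Element, first_layer: ET.Element) -> List[Layer]:
    """Collect upper-layer elements until one carries a specific attribute.

    Returns the collected layers ordered from outermost to innermost.
    """
    parent_of = {child: parent for parent in root.iter() for child in parent}
    layers: List[Layer] = []
    node: Optional[ET.Element] = first_layer
    while node is not None:
        attrs = {k: v for k, v in node.attrib.items() if k in SPECIFIC_ATTRS}
        layers.append((node.tag, attrs))
        if attrs:  # stop at the first layer that has id/class/name
            break
        node = parent_of.get(node)
    layers.reverse()
    return layers


root = ET.fromstring(
    "<html><body><div class='video-info'><h1>"
    "<a>new person with voice</a>"
    "</h1></div></body></html>"
)
condition = positioning_condition(root, root.find(".//a"))
```

In this hypothetical page neither `<a>` nor `<h1>` has a specific attribute, so the walk continues to the `<div>` and the condition combines all three layers.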

In addition, on the basis of the above embodiment, a processing scheme when multiple results are obtained may be further considered, as shown in fig. 6:

step S3221, determining whether there are a plurality of results obtained based on the current layer element;

if yes, executing step S3222, obtaining an element on the previous layer of the current layer element, re-determining the positioning condition, and re-obtaining a result capable of matching the positioning condition in the page;

if not, step S3223 is executed to determine that the target entity is the only acquired result.

If multiple results are obtained with the positioning condition, the element condition is considered unsuitable. Following this embodiment, a new positioning condition is then obtained from the layer above the current layer element, according to the step of determining the positioning condition, and the page is searched again for matching results based on the new positioning condition, until the only result obtained is the target entity.
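
The uniqueness loop of steps S3221 to S3223 can be sketched as follows. For brevity this simplified sketch starts from the first-layer element alone and adds one upper layer each time more than one element matches; it compares tag names and the assumed specific attributes (`id`, `class`, `name`) exactly. The function name `unique_condition` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

SPECIFIC_ATTRS = ("id", "class", "name")  # assumed rule-describing attributes
Layer = Tuple[str, Tuple[Tuple[str, str], ...]]


def unique_condition(
    root: ET.Element, first_layer: ET.Element
) -> Optional[Tuple[Layer, ...]]:
    """Grow the positioning condition upward until it matches one element only."""
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def chain(node: Optional[ET.Element], depth: int) -> Tuple[Layer, ...]:
        # Describe `node` and up to depth-1 ancestors, outermost first.
        layers = []
        while node is not None and len(layers) < depth:
            attrs = tuple(
                sorted((k, v) for k, v in node.attrib.items() if k in SPECIFIC_ATTRS)
            )
            layers.append((node.tag, attrs))
            node = parent_of.get(node)
        return tuple(reversed(layers))

    depth = 1
    while True:
        condition = chain(first_layer, depth)
        if len(condition) < depth:
            return None  # ran out of upper layers without a unique match
        matches = [el for el in root.iter() if chain(el, depth) == condition]
        if len(matches) == 1:  # the target is the only result
            return condition
        depth += 1  # several results: take one more upper layer


page = ET.fromstring(
    "<html><body>"
    "<div id='main'><h1><a>new person with voice</a></h1></div>"
    "<div id='side'><ul><li><a>new person with voice</a></li></ul></div>"
    "</body></html>"
)
position = unique_condition(page, page.find(".//a"))
```

With this markup a bare `<a>` matches two elements, so the condition is extended to the enclosing `<h1>`, after which only the target matches. A `None` result corresponds to position information that stays ambiguous.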

Following step S32, step S33, determining target position information from the respective position information of the different pages;

As mentioned above, content entities usually need to be crawled from multiple pages, so when a rule needs to be updated there is usually a certain number or proportion of web pages whose content entities have changed. The foregoing steps can therefore obtain the respective position information of multiple different web pages, from which the target position information used for the subsequent rule update can be determined. Specifically, in practical operation, as shown in fig. 7, the method includes the following steps:

step S331, summarizing the position information aiming at the same entity sample in one page into a set;

step S332, solving the intersection of the sets of all the pages;

step S333 sets the position information in the intersection as the target position information.

As noted above, an entity sample may yield multiple target entities in one web page, and in the foregoing steps each target entity corresponding to the entity sample obtains a unique position representation; one page may therefore end up with one or more pieces of position information, which form that page's set. Sets are obtained for the other pages in the same manner, so it is only necessary to take the intersection of the sets, that is, the position representations that exist in every page. Note that the position representations in the intersection depend on the entity sample, so each intersection corresponds to one entity sample, and the intersection may contain one or more pieces of position information. In actual operation, any one piece of position information in the intersection may be used as the target position information, or some criterion may be set for selecting among them, so as to determine the target position information for the entity sample.
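
Steps S331 to S333 amount to a set intersection across pages. In the sketch below, position information is represented as a plain string per position, and the tie-breaking criterion in `pick_target_position` (shortest string, then alphabetical order) is only one possible choice; the function names, the string format and the sample data are all assumptions for illustration.

```python
from typing import Iterable, Mapping, Optional, Set


def common_positions(positions_by_page: Mapping[str, Iterable[str]]) -> Set[str]:
    """Intersect the per-page sets of position information for one target."""
    sets = [set(positions) for positions in positions_by_page.values()]
    return set.intersection(*sets) if sets else set()


def pick_target_position(common: Set[str]) -> Optional[str]:
    """Choose one position from the intersection (illustrative criterion only)."""
    return min(common, key=lambda p: (len(p), p)) if common else None


# Hypothetical position strings for the specific target "title" on three pages.
title_positions = {
    "page1": {"div.video-info/h1/a", "ul.related/li/a"},
    "page2": {"div.video-info/h1/a"},
    "page3": {"div.video-info/h1/a", "div.banner/a"},
}
common = common_positions(title_positions)
target_position = pick_target_position(common)
```

Only the position shared by every page survives the intersection, which filters out incidental occurrences of the sample text such as a related-videos list on a single page.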

Finally, step S34 is executed to update the initial crawler rule with the target position information.

Once the target position information is determined, it can replace the original rule corresponding to the entity sample in the initial crawler rule, and the initial crawler rule is updated in this way. Those skilled in the art will understand that the foregoing method applies to multiple specific targets in multiple web pages of the same category; there are therefore multiple pieces of target position information, all original rules that need updating are replaced when the rules are updated, and the overall crawler program is reconfigured accordingly.
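
A minimal sketch of step S34, assuming the crawler rule is stored as a mapping from specific target to locator as in a dictionary; the function name `apply_rule_update` and the locator strings are hypothetical. Only targets that already have a rule are replaced, and the initial rule is left unmodified so that it can still be compared or restored.

```python
from typing import Dict


def apply_rule_update(
    initial_rules: Dict[str, str], target_positions: Dict[str, str]
) -> Dict[str, str]:
    """Return a copy of the rules with updated targets replaced."""
    updated = dict(initial_rules)
    for target, position in target_positions.items():
        if target in updated:  # replace only rules that already exist
            updated[target] = position
    return updated


old_rules = {"title": "h1.video-title", "author": "span.uploader"}
new_rules = apply_rule_update(old_rules, {"title": "div.video-info/h1/a"})
```

Targets whose content entities did not change, such as "author" here, keep their original rule.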

Corresponding to the foregoing embodiments and preferred solutions, the present invention further provides an embodiment of a page crawler rule updating system. As shown in fig. 8, the system may include at least one memory for storing relevant instructions and at least one processor connected to the memory and configured to execute the following modules (in other embodiments, one or more processors may perform the corresponding steps directly rather than through the following modules, for example the processor directly performs page crawling, change monitoring, rule updating, and the like):

the crawler module 1 is used for crawling content entities of the page by utilizing an initial crawler rule;

the updating judging module 2 is used for judging whether rule updating is needed or not according to the consistent relation between the entity sample and the content entity;

and the rule updating module 3 is used for updating the initial crawler rule according to the entity sample and the page information of the changed content entity when the output of the updating judging module is yes.
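
One way to picture how the three modules cooperate is the sketch below, in which each module is injected as a callable. The class name `RuleUpdatingSystem`, the method `run_once`, the callable signatures and the stub modules are assumptions for illustration and are not a structure prescribed by the invention.

```python
from typing import Callable, Dict

Rules = Dict[str, str]
Entities = Dict[str, Dict[str, str]]  # page -> specific target -> entity


class RuleUpdatingSystem:
    """Wires the crawler, update judging and rule updating modules together."""

    def __init__(
        self,
        crawler: Callable[[Rules], Entities],
        update_judge: Callable[[Entities, Entities], bool],
        rule_updater: Callable[[Rules, Entities, Entities], Rules],
    ) -> None:
        self.crawler = crawler
        self.update_judge = update_judge
        self.rule_updater = rule_updater

    def run_once(self, rules: Rules, samples: Entities) -> Rules:
        crawled = self.crawler(rules)                 # module 1
        if self.update_judge(samples, crawled):       # module 2
            return self.rule_updater(rules, samples, crawled)  # module 3
        return rules


# Stub modules standing in for real crawling and rule derivation.
system = RuleUpdatingSystem(
    crawler=lambda rules: {"page1": {"title": ""}},
    update_judge=lambda samples, crawled: samples != crawled,
    rule_updater=lambda rules, samples, crawled: {**rules, "title": "new-position"},
)
result_rules = system.run_once(
    {"title": "old-position"}, {"page1": {"title": "Sample Title"}}
)
```

Calling `run_once` at each crawl interval with the latest rules gives the adaptive behavior described: the rules are returned unchanged while the samples still match.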

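Further, the rule updating module specifically includes:

the target entity positioning unit is used for acquiring a target entity consistent with the entity sample in a page corresponding to the entity sample;

the position information determining unit is used for determining the position information of the target entity in the page;

the target position information determining unit is used for determining target position information from the respective position information of different pages;

and the rule updating unit is used for updating the initial crawler rule by using the target position information.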

Further, the target location information determining unit specifically includes:

the summarizing subunit is used for summarizing the position information aiming at the same entity sample in one page into a set;

the intersection acquisition subunit is used for solving the intersection of the sets of all the pages;

and the target position information determining subunit is used for taking the position information in the intersection as the target position information.

Further, the location information determining unit specifically includes:

a positioning subunit, configured to use an upper element of the target entity as a positioning condition, and obtain, in the page, a result that can match the positioning condition;

and the position information determining subunit is used for determining the position information based on the current positioning condition until the target entity is the only acquired result.

Further, the positioning subunit specifically includes:

an element acquisition first component to acquire a first layer of elements above the target entity;

the attribute information judging component is used for judging whether the first layer element contains specific attribute information or not;

a positioning condition determining component for utilizing the first layer element as a positioning condition if the output of the attribute information judging component is yes;

and the element acquisition second component is used for acquiring a second layer element above the target entity and judging whether the second layer element contains the specific attribute information if the output of the attribute information judgment component is negative, and in this way, all the acquired upper layer elements are used as positioning conditions until any one upper layer element contains the specific attribute information.

Further, the position information determination unit further includes:

a positioning result quantity monitoring subunit, configured to determine whether a plurality of results are obtained based on the current layer element;

a repositioning subunit, configured to, if the output of the positioning result quantity monitoring subunit is yes, obtain a previous-layer element of the current-layer element to re-determine the positioning condition, and re-obtain a result that can match the positioning condition in the page;

and the positioning result determining unit is used for determining that the target entity is the only obtained result if the output of the positioning result quantity monitoring subunit is negative.

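Further, the update determination module specifically includes:

the change monitoring unit is used for monitoring whether the content entity is consistent with the entity sample in real time;

and the updating judgment unit is used for judging that the rule updating is required when the number of the pages with the inconsistency exceeds a preset threshold value.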

In summary, the method of the present invention updates the crawler rule automatically by associating it with the current web page content and the original rule. Specifically, it crawls the content entities of the page using the initial crawler rule; judges whether rule updating is needed according to the consistency relation between the entity sample and the content entity; and, if so, updates the initial crawler rule according to the entity sample and the page information of the changed content entity. With the method and system of the invention, updated crawler rules can be output adaptively according to the crawling result, frequent manual monitoring and maintenance of crawler programs is not needed, manual workload is greatly reduced, and the accuracy and stability of the crawler's results can be effectively guaranteed.

While the above system embodiments and preferred modes of operation and technical principles are described in the foregoing, it should be noted that the various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. The modules or units or components in the embodiments may be combined into one module or unit or component, or may be divided into a plurality of sub-modules or sub-units or sub-components to be implemented.

In addition, the embodiments in the present specification are all described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, they are described in a relatively simple manner, and reference may be made to some descriptions of method embodiments for relevant points. The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The structure, features and effects of the present invention have been described in detail with reference to the embodiments shown in the drawings. The above embodiments are, however, merely preferred embodiments of the present invention, and it should be understood that those skilled in the art can reasonably combine and configure the technical features of the above embodiments and their preferred modes into various equivalent schemes without departing from or changing the design idea and technical effects of the present invention. Therefore, the invention is not limited to the embodiments shown in the drawings, and all modifications and equivalent embodiments made according to the idea of the invention fall within the scope of the invention as long as they do not go beyond the spirit of the description and the drawings.

Claims

1. A method for updating page crawler rules, characterized by comprising the following steps:

crawling content entities of the page by using an initial crawler rule;

judging whether rule updating is needed according to the consistency relation between the entity sample and the content entity;

if so, updating the initial crawler rule according to the entity sample and the page information of the changed content entity, wherein the updating comprises the following steps: first finding, from the page information in which the content entity has changed, a target entity consistent with the entity sample; then taking the upper layer element of the target entity as a positioning condition, and acquiring from the page the results that match the positioning condition; and determining the position information for updating the crawler rule based on the current positioning condition until the target entity is the only result matched with the current positioning condition.
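The positioning step recited above (adding upper layer elements to the positioning condition until the target entity is the only match) can be illustrated with a minimal sketch. It rests on several assumptions that are not part of the claim: the page is well-formed XML parsed with Python's `xml.etree.ElementTree`, the target entity is found by exact text match, a positioning condition is written as an ElementTree path made of tag names and optional `class` attributes, and `find_unique_locator` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def find_unique_locator(root, sample_text):
    """Return an ElementTree path that matches only the element whose text
    equals sample_text, or None if no such path is found."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Step 1: find the target entity consistent with the entity sample.
    target = next((el for el in root.iter()
                   if (el.text or "").strip() == sample_text), None)
    if target is None:
        return None

    def describe(el):
        # One path step: tag name, narrowed by class when present.
        # (Assumes class values contain no single quotes.)
        cls = el.get("class")
        return f"{el.tag}[@class='{cls}']" if cls else el.tag

    path, node = describe(target), target
    while True:
        # Step 2: acquire every result matching the current condition.
        matches = root.findall(f".//{path}")
        # Step 3: stop once the target is the only match.
        if len(matches) == 1 and matches[0] is target:
            return f".//{path}"
        parent = parent_of.get(node)
        # ".//" never matches the root itself, so give up at the root.
        if parent is None or parent is root:
            return None
        # Otherwise add one more upper layer element to the condition.
        path, node = f"{describe(parent)}/{path}", parent


PAGE = """<html><body>
<div class="side"><span>Related</span></div>
<div class="main"><span>Movie A</span></div>
</body></html>"""

print(find_unique_locator(ET.fromstring(PAGE), "Movie A"))
# prints .//div[@class='main']/span
```

Here the bare `span` condition matches two elements, so one upper layer element is added, after which the target is the only result and the walk stops.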

2. The method for updating page crawler rules according to claim 1, wherein said updating the initial crawler rules according to the entity samples and the changed page information of the content entities further comprises:

and updating the initial crawler rule by using the target position information.

3. The method for updating page crawler rules according to claim 2, wherein said determining target location information from said respective location information of different pages comprises:

finding the intersection of the sets of all pages;
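As a rough illustration of the intersection step, the sketch below assumes that each page yields a set of candidate position strings and keeps only those shared by every page. The function name `common_locators` and the string representation of position information are assumptions; how the target position information is then chosen from the intersection is not shown here.

```python
from typing import Iterable, List


def common_locators(locators_per_page: Iterable[Iterable[str]]) -> List[str]:
    """Return, in sorted order, the candidate locators present on every page."""
    sets = [set(locators) for locators in locators_per_page]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


pages = [
    {".//div[@class='main']/span", ".//span"},
    {".//div[@class='main']/span", ".//p"},
]
print(common_locators(pages))
# prints [".//div[@class='main']/span"]
```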

4. The method for updating page crawler rules according to claim 1, wherein said taking upper layer elements of the target entity as positioning conditions comprises:

acquiring a first layer element above the target entity;

if so, using the first layer element as a positioning condition;

5. The method for updating page crawler rules according to claim 4, wherein determining said location information further comprises:

if not, determining that the target entity is the only acquired result.

6. The method for updating the rules of the page crawler according to any one of claims 1 to 5, wherein the determining whether the rule update is required according to the consistent relationship between the entity sample and the content entity comprises:

7. A system for updating page crawler rules, comprising:

a rule updating module, configured to update the initial crawler rule according to the entity sample and the page information of the changed content entity when the update determination module outputs yes, wherein the rule updating module is specifically configured for: first finding, from the page information in which the content entity has changed, a target entity consistent with the entity sample; then taking the upper layer element of the target entity as a positioning condition, and acquiring from the page the results that match the positioning condition; and determining the position information for updating the crawler rule based on the current positioning condition until the target entity is the only result matched with the current positioning condition.

8. The system for updating page crawler rules according to claim 7, wherein the rule updating module specifically comprises:

9. The system according to claim 7 or 8, wherein the update determining module specifically includes: