CN109783728A

CN109783728A - Page crawler rule update method and system

Info

Publication number: CN109783728A
Application number: CN201811637755.7A
Authority: CN
Inventors: 韩建民; 王玮; 苏文畅; 王兆育; 孙志豪
Original assignee: Anhui Hear Technology Co Ltd
Current assignee: Anhui Hear Technology Co Ltd
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2019-05-21
Anticipated expiration: 2038-12-29
Also published as: CN109783728B

Abstract

The invention discloses a page crawler rule update method and system. The method mainly comprises: crawling content entities of pages using an initial crawler rule; determining, according to the consistency relationship between entity samples and the content entities, whether a rule update is needed; and if so, updating the initial crawler rule according to the entity samples and the page information in which the content entities have changed. The invention adaptively outputs updated crawler rules based on the crawl results, so crawler programs need not be frequently monitored and maintained by hand; this greatly reduces manual workload while effectively ensuring the stability and accuracy of the crawl results.

Description

Page crawler rule update method and system

Technical field

The present invention relates to the field of web crawlers, and more particularly to a page crawler rule update method and system.

Background

Crawler technology has become an established way of obtaining web page information. For example, with the explosive growth of information on the Internet and the continuing rise of web dramas, live streaming, and short videos, more and more people watch video online. Some of these videos may contain pornographic or violent content. Crawler technology can therefore be used to automatically obtain the page information of video websites and then locate the source videos; combined with AI-based automatic recognition, it can subsequently be determined accurately whether the video content is pornographic or violent.

Existing crawler schemes mainly rely on manually inspecting the page code, working out the corresponding crawl rules, and writing a corresponding crawler program. When a website's layout changes and the original rules no longer apply, new rules usually have to be drawn up by hand. Because there are many web pages and it is impossible to predict how a website will change, crawler status is monitored periodically to judge whether the crawl rules have failed; by the time a failure is found, some inaccurate data has often already been generated, which must be deleted before new crawl rules are formulated. Alternatively, the crawl rules must be maintained by hand frequently. Either way the human cost is high and the efficiency is low. For the individual doing the work, the process is also tedious and extremely time-consuming, and it is difficult to guarantee the stability and accuracy of the crawl results.

Summary of the invention

In view of the above problems, the present invention proposes a page crawler rule update method and system that automatically determines and updates crawl rules based on existing crawl results and the content that has changed, thereby avoiding the above drawbacks of manually maintaining crawlers.

The technical solution adopted by the invention is as follows:

A page crawler rule update method, comprising:

crawling content entities of pages using an initial crawler rule;

determining, according to the consistency relationship between an entity sample and the content entity, whether a rule update is needed;

if so, updating the initial crawler rule according to the entity sample and the page information in which the content entity has changed.

Optionally, updating the initial crawler rule according to the entity sample and the page information in which the content entity has changed includes:

in the page corresponding to the entity sample, obtaining a target entity consistent with the entity sample;

determining the location information of the target entity in this page;

determining target position information from the respective location information of different pages;

updating the initial crawler rule using the target position information.

Optionally, determining target position information from the respective location information of different pages includes:

collecting the location information obtained for the same entity sample in one page into a set;

computing the intersection of the sets of all pages;

using the location information in the intersection as the target position information.

Optionally, determining the location information of the target entity in this page includes:

using an upper-layer element of the target entity as a location condition, and obtaining the results in this page that match the location condition;

repeating until the target entity is the only result obtained, and then determining the location information based on the current location condition.

Optionally, using an upper-layer element of the target entity as a location condition includes:

obtaining the first-layer element above the target entity;

judging whether the first-layer element contains specific attribute information;

if so, using the first-layer element as the location condition;

if not, obtaining the second-layer element above the target entity and judging whether the second-layer element contains the specific attribute information, continuing in this way until an upper-layer element containing the specific attribute information is found, and then using all the upper-layer elements obtained as the location condition.

Optionally, determining the location information of the target entity in this page further includes:

judging whether multiple results are obtained based on the current-layer element;

if so, obtaining the element one layer above the current-layer element to redetermine the location condition, and again obtaining the results in this page that match the location condition;

if not, determining that the target entity is the only result obtained.

Optionally, determining, according to the consistency relationship between the entity sample and the content entity, whether a rule update is needed includes:

monitoring in real time whether the content entity and the entity sample are consistent;

when the number of pages in which the two are inconsistent is found to exceed a preset threshold, determining that a rule update is needed.

A page crawler rule update system, comprising:

a crawler module, for crawling content entities of pages using an initial crawler rule;

an update determination module, for determining, according to the consistency relationship between an entity sample and the content entity, whether a rule update is needed;

a rule update module, for updating the initial crawler rule according to the entity sample and the page information in which the content entity has changed when the output of the update determination module is yes.

Optionally, the rule update module specifically includes:

a target entity positioning unit, for obtaining, in the page corresponding to the entity sample, a target entity consistent with the entity sample;

a location information determination unit, for determining the location information of the target entity in this page;

a target position information determination unit, for determining target position information from the respective location information of different pages;

a rule update unit, for updating the initial crawler rule using the target position information.

Optionally, the update determination module specifically includes:

a change monitoring unit, for monitoring in real time whether the content entity and the entity sample are consistent;

an update judging unit, for determining that a rule update is needed when the number of pages in which the two are inconsistent is found to exceed a preset threshold.

The invention ties the automatic updating of crawler rules to the current web page content and the original rules. Specifically, content entities of pages are crawled using an initial crawler rule; whether a rule update is needed is determined according to the consistency relationship between entity samples and the content entities; and if so, the initial crawler rule is updated according to the entity samples and the page information in which the content entities have changed. The invention adaptively outputs updated crawler rules based on the crawl results, so crawler programs need not be frequently monitored and maintained by hand; this greatly reduces manual workload while effectively ensuring the stability and accuracy of the crawl results.

Brief description of the drawings

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawings, in which:

Fig. 1 is a flowchart of an embodiment of the page crawler rule update method provided by the invention;

Fig. 2 is a flowchart of an embodiment of the change monitoring method provided by the invention;

Fig. 3 is a flowchart of a specific embodiment of the rule update method provided by the invention;

Fig. 4 is a flowchart of a specific embodiment of the location information determination method provided by the invention;

Fig. 5 is a flowchart of a specific embodiment of the location condition determination method provided by the invention;

Fig. 6 is a flowchart of a specific embodiment of the positioning result determination method provided by the invention;

Fig. 7 is a flowchart of a specific embodiment of the target position information determination method provided by the invention;

Fig. 8 is a block diagram of an embodiment of the page crawler rule update system provided by the invention.

Description of reference numerals:

1 crawler module; 2 update determination module; 3 rule update module

Detailed description of the embodiments

Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, where the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it.

The present invention provides an embodiment of a page crawler rule update method which, as shown in Fig. 1, may include the following steps:

Step S1, crawl the content entities of pages using an initial crawler rule;

The initial crawler rule referred to here is a rule set specifically for a certain category of network resources. The categories of web pages can be defined in advance, for example web pages in video websites, in shopping websites, in news websites, and so on. Because their functions differ, pages of different categories very probably have different layouts, whereas pages of the same type have highly similar content layouts. The initial crawler rule in this embodiment therefore targets one category of web page; pages of other categories are handled by analogy and are not described further.

Specifically, the initial crawler rule may be obtained by setting, for at least one specific target in the page, an initial crawler rule for that target. The specific targets may be chosen according to the actual content sections or subdivided types of the page; for a video web page, for example, the specific targets may include but are not limited to the title, brief introduction, and author. Different targets naturally have their own rules, so the initial crawler rule can be understood as a collection of at least one "target rule".

As for the crawling itself, a preset crawler program may crawl the pages according to the initial crawler rule at a predetermined period, for example every 30 minutes, 1 hour, or 2 hours, to obtain the content entity corresponding to each specific target. Three points should be noted here: first, this embodiment does not limit the number of pages to be crawled; second, it does not limit the number of specific targets in a page; third, the invention does not restrict the form of the content entity or of the entity sample mentioned below, which may be any form of web page content such as text, symbols, or code.
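As a rough illustration of the paragraphs above, the sketch below models the initial crawler rule as a collection of per-target rules and applies it to one page. The rule format (ElementTree path expressions), the names INITIAL_RULES and crawl_page, the sample markup, and the author value are all assumptions made for this example; the text does not prescribe a rule syntax, and the page is assumed to be well-formed markup.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical rule format: one ElementTree path expression per specific target.
INITIAL_RULES = {
    "title": ".//div[@class='video-title']/a",
    "author": ".//span[@id='author']",
}

def crawl_page(page_source: str, rules: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply every target rule to one page and return the crawled content entities."""
    root = ET.fromstring(page_source)  # assumption: the page is well-formed markup
    entities = {}
    for target, path in rules.items():
        node = root.find(path)
        # A rule that no longer matches yields None for that target.
        entities[target] = node.text.strip() if node is not None and node.text else None
    return entities

# Illustrative page; the markup and the author value are made up for the sketch.
PAGE = """<html><body>
<div class="video-title"><a>cross-talk has new person</a></div>
<span id="author">some author</span>
</body></html>"""

print(crawl_page(PAGE, INITIAL_RULES))
```

A periodic crawl would simply call crawl_page for every monitored page on each cycle; scheduling is left out of the sketch.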

Step S2, determine, according to the consistency relationship between the entity sample and the content entity, whether a rule update is needed;

The entity sample referred to here may be one set in advance, before crawling, as the content to be crawled for a specific target, or it may be the content entity crawled for that target the first time the initial crawler rule is applied. Whichever way it is obtained, each entity sample corresponds to at least one specific target in each web page. For example, when three specific targets (title, brief introduction, and author) are crawled from each of five video web pages, 15 entity samples need to be prepared; this can equally be understood as one sample per web page, each containing three specific-target samples. In practice, this step may also include monitoring the content entity under each specific target, and the invention accordingly provides a preferred embodiment of a monitoring method which, as shown in Fig. 2, may include the following steps:

Step S201, monitor in real time whether the content entity and the entity sample are consistent;

Whether the crawled content entity and the entity sample are consistent can be judged by matching them one by one along different dimensions, such as the position of the content entity or the characters of the content entity itself. For example, after the first crawl, each content entity subsequently crawled according to the initial crawler rule may be matched against the entity sample in one or more dimensions to determine whether the two are consistent.

Step S202, when the number of pages in which the two are inconsistent is found to exceed a preset threshold, determine that a rule update is needed.

What this step emphasizes is that a change in the content entity is the trigger condition for deciding on a rule update: once it is determined that the content entity has changed, the subsequent rule update step is entered. An inconsistency between the content entity and the entity sample (that is, a change in the crawled content entity) does not mean that any change to any content entity forces entry into the subsequent step. Those skilled in the art will appreciate that the process is subject to a qualification: "the content entity has changed" is determined only when the content entity crawled for some specific target is inconsistent with its entity sample and, at the same time, the number (or proportion) of pages showing this inconsistency for that target exceeds a preset standard. For example, if in the current crawl more than 50% of web pages have an inconsistent content entity for the specific target "title", the content entity and the entity sample are determined to be inconsistent in the sense of the invention.
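The threshold check described above can be sketched as follows. This is only an illustration under stated assumptions: the function name needs_rule_update, the nested-dictionary layout (page id, then target, then text), and the sample values are hypothetical, and consistency is reduced to plain string equality although the text allows matching along other dimensions as well.

```python
def needs_rule_update(crawled, samples, target, threshold=0.5):
    """Return True when the share of pages whose crawled entity for `target`
    differs from its entity sample exceeds `threshold`."""
    pages = [page for page, sample in samples.items() if target in sample]
    if not pages:
        return False
    inconsistent = sum(
        1 for page in pages
        if crawled.get(page, {}).get(target) != samples[page][target]
    )
    return inconsistent / len(pages) > threshold

# Hypothetical entity samples for the target "title" on four pages.
samples = {f"page{i}": {"title": f"title {i}"} for i in range(1, 5)}

# Current crawl: three of the four pages no longer yield the sampled title.
crawled = {
    "page1": {"title": "title 1"},
    "page2": {"title": None},
    "page3": {"title": "advert"},
    "page4": {"title": None},
}

print(needs_rule_update(crawled, samples, "title"))
```

With the default threshold of 0.5, exactly half of the pages being inconsistent does not trigger an update, which matches the "more than 50%" wording of the example in the text.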

Continuing from the above, if it is determined that a rule update is needed, step S3 is executed: the initial crawler rule is updated according to the entity sample and the page information in which the content entity has changed.

In this step, the invention proposes that rule updating should not rely on manual monitoring but should be performed on the basis of the entity sample and the currently crawled page information in which the content entity has changed. One preferred scheme, shown in Fig. 3, may include the following steps:

Step S31, in the page corresponding to the entity sample, obtain a target entity consistent with the entity sample;

In this step, the full content information (for example the source code) of each page determined by the foregoing steps to have "inconsistent content entity and entity sample" is obtained, and the entity sample is then used as the basis for searching the current page for identical content, which serves as the target entity. For example, if the entity sample for the title of a page is "cross-talk has new person", this step looks for "cross-talk has new person" in the page whose content entity is inconsistent with the entity sample. In practice, the current page may yield a single "cross-talk has new person" or several located at different positions; in either case the process continues with the following steps. If "cross-talk has new person" is not found in the current page, that page does not take part in the subsequent rule update.
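A minimal sketch of this search, assuming the page source is well-formed markup that Python's xml.etree.ElementTree can parse. The function name find_target_entities and the sample page are hypothetical, and only the text directly inside an element is compared with the entity sample.

```python
import xml.etree.ElementTree as ET

def find_target_entities(page_source, sample_text):
    """Return every element whose own text equals the entity sample.
    An empty list means the page takes no part in the rule update."""
    root = ET.fromstring(page_source)  # assumption: well-formed markup
    # Only el.text (text before the first child) is compared in this sketch.
    return [el for el in root.iter() if (el.text or "").strip() == sample_text]

# Illustrative page in which the sample occurs at two different positions.
PAGE = """<html><body>
<div class="headline"><a>cross-talk has new person</a></div>
<ul><li>cross-talk has new person</li><li>other video</li></ul>
</body></html>"""

hits = find_target_entities(PAGE, "cross-talk has new person")
print([el.tag for el in hits])
```

Each returned element corresponds to what the following steps call the first-layer element above one occurrence of the target entity.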

Step S32, determine the location information of the target entity in this page;

After the target entity is obtained, the invention proposes determining a representation of its position in the web page; that is, in subsequent steps the location information of the target entity serves as the basis for formulating the new rule. To ensure that the new rule is valid and specific, the representation of a unique position of the target entity in the page is preferably used as that basis; in other words, when a non-unique result is obtained in the page, the location information is considered ambiguous. For how the location information may be obtained, see the embodiment shown in Fig. 4, which includes the following steps:

Step S321, use an upper-layer element of the target entity as the location condition, and obtain the results in this page that match the location condition;

Step S322, repeat until the target entity is the only result obtained, then determine the location information based on the current location condition.

For example, suppose the target entity obtained in the current page is "cross-talk has new person":

…

<a>

Cross-talk has new person

</a>

…

Here, elements such as <a></a> in the example are all upper-layer elements of "cross-talk has new person", and other entities may share one or more of the same upper-layer elements. This embodiment therefore uses the upper-layer elements of the target entity to verify whether only the target entity can be located in the page: if the result finally located with one or more upper-layer elements is only "cross-talk has new person", the location information can be determined from those one or more upper-layer elements.

In a specific implementation, reference may further be made to the following preferred embodiment, shown in Fig. 5:

Step S3211, obtain the first-layer element above the target entity;

Step S3212, judge whether the first-layer element contains specific attribute information;

If so, execute step S3213: use the first-layer element as the location condition;

If not, execute step S3214: obtain the second-layer element above the target entity and judge whether it contains specific attribute information, continuing in this way until an upper-layer element containing specific attribute information is found, and then use all the upper-layer elements obtained as the location condition.

This step starts with the layer immediately above the content entity; that is, <a></a> in the example is obtained first, and it is then judged whether <a></a> has specific attribute information that can describe a rule, such as id (unique identifier), class (text style type), name (text label name), or a custom attribute. Those skilled in the art will appreciate that when an element has such information, a corresponding rule can be formulated from it. Thus, when <a></a> contains specific attribute information, <a></a> and its specific attribute information are used as the location condition to search the whole page for matching results. If <a></a> does not qualify as a location condition in the sense of this embodiment, the method moves up one more layer to the second-layer element above the content entity and judges whether that element contains specific attribute information; in the example, this means checking the attribute information of the element that encloses <a></a>, and so on. If that element contains specific attribute information, it is used together with <a></a> to search the page information in which the content entity has changed for matching results. If it does not contain specific attribute information either, the same check is applied to the next element up, until an element containing specific attribute information is found; that element and all the layers of elements beneath it are then combined to search the whole page for matching results.
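The upward walk of steps S3211 to S3214 might look like the sketch below. It is an illustration only: the attribute set in SPECIFIC_ATTRS, the function names, and the sample markup (including the enclosing p and div elements) are assumptions, and custom attributes are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Assumed set of "specific attributes"; custom attributes could be added here.
SPECIFIC_ATTRS = ("id", "class", "name")

def has_specific_attr(el):
    return any(attr in el.attrib for attr in SPECIFIC_ATTRS)

def location_condition(root, first_layer):
    """Collect upper-layer elements, innermost first, stopping at the first
    one that carries a specific attribute (steps S3211 to S3214)."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    layers = []
    node = first_layer
    while node is not None:
        layers.append(node)
        if has_specific_attr(node):
            break
        node = parent_of.get(node)  # None once the document root is passed
    return layers

# Illustrative markup: neither <a> nor <p> has a specific attribute, <div> does.
root = ET.fromstring(
    "<html><body><div class='video-title'><p><a>cross-talk has new person</a></p></div></body></html>"
)
first_layer = next(root.iter("a"))
print([el.tag for el in location_condition(root, first_layer)])
```

All collected layers together form the location condition, mirroring the statement that the attribute-bearing element and every layer beneath it are combined for the search.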

In addition, on the basis of the above embodiment, the handling of the case where multiple results are obtained may be further considered, as shown in Fig. 6:

Step S3221, judge whether multiple results are obtained based on the current-layer element;

If so, execute step S3222: obtain the element one layer above the current-layer element to redetermine the location condition, and again obtain the results in this page that match the location condition;

If not, execute step S3223: determine that the target entity is the only result obtained.

If multiple results are obtained with the above location condition, the element condition is considered unsuitable. Following the embodiment above, a new location condition is obtained by starting from the layer above the current element and repeating the steps for determining the location condition; the whole page is then searched again for matching results based on the new location condition, until the only result obtained is the target entity.
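The widening loop of steps S3221 to S3223 can be sketched as below, again assuming well-formed markup and ElementTree path expressions as the representation of a location condition. The names SPECIFIC_ATTRS, step, and unique_location, as well as the sample page, are hypothetical.

```python
import xml.etree.ElementTree as ET

SPECIFIC_ATTRS = ("id", "class", "name")  # assumed attribute set

def step(el):
    """Render one element as a path step, using its first specific attribute."""
    for attr in SPECIFIC_ATTRS:
        if attr in el.attrib:
            return f"{el.tag}[@{attr}='{el.attrib[attr]}']"
    return el.tag

def unique_location(root, target):
    """Climb from `target`; at each attribute-bearing ancestor, test whether the
    resulting condition matches only `target`, and keep climbing otherwise."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    chain = []  # innermost element first
    node = target
    while node is not None and node is not root:
        chain.append(node)
        if any(attr in node.attrib for attr in SPECIFIC_ATTRS):
            path = ".//" + "/".join(step(el) for el in reversed(chain))
            if root.findall(path) == [target]:
                return path  # the target entity is the only result
        node = parent_of.get(node)
    return None  # no unique condition found below the document root

# Illustrative page: the same title appears under two identically classed blocks.
root = ET.fromstring("""<html><body>
<div id="main"><div class="title"><a>cross-talk has new person</a></div></div>
<div id="side"><div class="title"><a>cross-talk has new person</a></div></div>
</body></html>""")
target = next(root.iter("a"))
print(unique_location(root, target))
```

In this example the condition built from the class="title" layer matches two elements, so the loop climbs one more layer and succeeds once the id="main" layer is included.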

Continuing from step S32, step S33: determine the target position information from the respective location information of different pages;

As mentioned above, content entities usually need to be crawled from multiple pages, so when a rule update is needed, a certain number or proportion of web pages have usually undergone a content entity change. The foregoing steps therefore yield location information for each of several different web pages, from which the target position information for the subsequent rule update can be determined. In practice, as shown in Fig. 7, this may include the following steps:

Step S331, collect the location information obtained for the same entity sample in one page into a set;

Step S332, compute the intersection of the sets of all pages;

Step S333, use the location information in the intersection as the target position information.

As mentioned above, the entity sample may match multiple target entities in a web page, and the foregoing steps yield one unique positional representation for each target entity corresponding to the entity sample; a page may therefore end up with one or more pieces of location information. The invention accordingly proposes collecting the one or more pieces of location information in a page into a set. Sets are obtained for the other pages in the same way, so it remains only to determine the intersection of the sets, that is, the positional representations common to every page. Note that the positional representations in the intersection depend on the entity sample, so each intersection corresponds to one entity sample, and an intersection may contain one or more pieces of location information. In practice, any piece of location information in the intersection may be used as the target position information, or some criterion may be set for selecting among them, thereby determining the target position information for that entity sample.
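Steps S331 to S333 amount to a set intersection across pages, which might be sketched as follows. The function name, the path strings, and the "shortest expression" selection criterion are assumptions for illustration; the text leaves the selection criterion open.

```python
def target_position_info(location_sets):
    """Intersect the per-page sets of location information for one entity sample."""
    if not location_sets:
        return set()
    return set.intersection(*location_sets)

# Hypothetical location information found for the same target on three pages.
per_page = [
    {".//div[@class='title']/a", ".//ul[@id='hot']/li"},
    {".//div[@class='title']/a"},
    {".//div[@class='title']/a", ".//span[@name='t']"},
]

common = target_position_info(per_page)
# Assumed selection criterion: prefer the shortest expression in the intersection.
chosen = min(common, key=len) if common else None
print(chosen)
```

An empty intersection would mean that no single positional representation holds on every changed page; how to proceed in that case is not covered by the text.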

Continuing from the above, step S34 is finally executed: update the initial crawler rule using the target position information.

Once the target position information is determined, it can replace the original rule in the initial crawler rule that corresponds to the entity sample concerned, and the initial crawler rule is updated in this way. Those skilled in the art will understand that the method described above applies to multiple specific targets in multiple web pages of the same type, so there may be multiple pieces of target position information; in a rule update, every original rule that needs updating is replaced, and a comprehensive crawler program is reassembled from the result.
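The replacement itself is simple if the crawler rule is kept as a mapping from specific target to rule. The sketch below assumes that representation; the function name update_rules and the rule strings are hypothetical.

```python
def update_rules(initial_rules, target_positions):
    """Return a new rule collection in which every target that has new target
    position information gets its original rule replaced."""
    updated = dict(initial_rules)  # leave the original collection untouched
    for target, position in target_positions.items():
        updated[target] = position
    return updated

# Hypothetical rules before the update and the newly determined position.
initial_rules = {
    "title": ".//h1[@class='old-title']",
    "author": ".//span[@id='author']",
}
new_positions = {"title": ".//div[@class='title']/a"}

print(update_rules(initial_rules, new_positions))
```

Targets without new position information keep their original rules, so the updated collection can be handed back to the crawler program as a whole.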

Corresponding to the foregoing embodiments and preferred schemes, the invention also provides an embodiment of a page crawler rule update system. As shown in Fig. 8, the system may include at least one memory for storing the relevant instructions and at least one processor connected to the memory and used to execute the following modules (in other embodiments, one or more processors may instead directly perform the corresponding steps without going through the modules below, for example directly performing page crawling, change monitoring, and rule updating):

Crawler module 1, for crawling content entities of pages using an initial crawler rule;

Update determination module 2, for determining, according to the consistency relationship between the entity sample and the content entity, whether a rule update is needed;

Rule update module 3, for updating the initial crawler rule according to the entity sample and the page information in which the content entity has changed when the output of the update determination module is yes.

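Further, the rule update module specifically includes:

a target entity positioning unit, for obtaining, in the page corresponding to the entity sample, a target entity consistent with the entity sample;

a location information determination unit, for determining the location information of the target entity in this page;

a target position information determination unit, for determining target position information from the respective location information of different pages;

a rule update unit, for updating the initial crawler rule using the target position information.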

Further, the target position information determination unit specifically includes:

a summarizing subunit, for collecting the location information obtained for the same entity sample in one page into a set;

an intersection acquisition subunit, for computing the intersection of the sets of all pages;

a target position information determination subunit, for using the location information in the intersection as the target position information.

Further, the location information determination unit specifically includes:

a positioning subunit, for using an upper-layer element of the target entity as the location condition and obtaining the results in this page that match the location condition;

a location information determination subunit, for determining the location information based on the current location condition once the target entity is the only result obtained.

Further, the positioning subunit specifically includes:

a first element acquisition component, for obtaining the first-layer element above the target entity;

an attribute information determination component, for judging whether the first-layer element contains specific attribute information;

a location condition determination component, for using the first-layer element as the location condition if the output of the attribute information determination component is yes;

a second element acquisition component, for obtaining, if the output of the attribute information determination component is no, the second-layer element above the target entity and judging whether the second-layer element contains the specific attribute information, continuing in this way until an upper-layer element containing the specific attribute information is found, and then using all the upper-layer elements obtained as the location condition.

Further, the location information determination unit also includes:

a positioning result quantity monitoring subunit, for judging whether multiple results are obtained based on the current-layer element;

a repositioning subunit, for obtaining, if the output of the positioning result quantity monitoring subunit is yes, the element one layer above the current-layer element to redetermine the location condition, and again obtaining the results in this page that match the location condition;

a positioning result determination subunit, for determining that the target entity is the only result obtained if the output of the positioning result quantity monitoring subunit is no.

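Further, the update determination module specifically includes:

a change monitoring unit, for monitoring in real time whether the content entity and the entity sample are consistent;

an update judging unit, for determining that a rule update is needed when the number of pages in which the two are inconsistent is found to exceed a preset threshold.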

In summary, the invention ties the automatic updating of crawler rules to the current web page content and the original rules. Specifically, content entities of pages are crawled using an initial crawler rule; whether a rule update is needed is determined according to the consistency relationship between entity samples and the content entities; and if so, the initial crawler rule is updated according to the entity samples and the page information in which the content entities have changed. The invention adaptively outputs updated crawler rules based on the crawl results, so crawler programs need not be frequently monitored and maintained by hand; this greatly reduces manual workload while effectively ensuring the stability and accuracy of the crawl results.

Although the working methods and technical principles of the above system embodiment and preferred schemes have been described above, it should still be noted that the various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. The modules, units, or components in an embodiment may be combined into one module, unit, or component, or divided into multiple submodules, subunits, or subcomponents.

The embodiments in this specification are described in a progressive manner; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. The system embodiment in particular is described relatively briefly because it is substantially similar to the method embodiment; for the relevant details, see the description of the method embodiment. The system embodiment described above is merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

The structure, features, and effects of the invention have been described in detail above on the basis of the embodiments shown in the drawings, but the above are only preferred embodiments of the invention. It should be noted that those skilled in the art may reasonably combine the technical features of the above embodiments and their preferred schemes into various equivalent schemes without departing from or changing the design concept and technical effect of the invention. The invention therefore is not limited in scope to what is shown in the drawings; any change made according to the concept of the invention, or any modification into an equivalent embodiment, that does not go beyond the spirit covered by the description and drawings shall fall within the scope of protection of the invention.

Claims

1. A page crawler rule update method, characterized by comprising:

crawling a content entity of a page using an initial crawler rule;

if so, updating the initial crawler rule according to the entity sample and the page information of pages in which the content entity has changed.

2. The page crawler rule update method according to claim 1, characterized in that updating the initial crawler rule according to the entity sample and the page information of pages in which the content entity has changed comprises:

determining location information of the target entity in that page;

updating the initial crawler rule using the target location information.

3. The page crawler rule update method according to claim 2, characterized in that determining the target location information from the respective location information of different pages comprises:

computing the intersection of the sets of all pages;

4. The page crawler rule update method according to claim 2, characterized in that determining the location information of the target entity in that page comprises:

using an upper-layer element of the target entity as a positioning condition, and obtaining from that page the results that match the positioning condition;

until the target entity is the only result obtained, at which point the location information is determined based on the current positioning condition.

5. The page crawler rule update method according to claim 4, characterized in that using the upper-layer element of the target entity as the positioning condition comprises:

obtaining the first-layer element above the target entity;

if so, using the first-layer element as the positioning condition;

if not, obtaining the second-layer element above the target entity and judging whether the second-layer element contains the specific attribute information, and so on until an upper-layer element containing the specific attribute information is obtained, and then using all of the upper-layer elements obtained as the positioning condition.

6. The page crawler rule update method according to claim 5, characterized in that determining the location information of the target entity in that page further comprises:

judging whether multiple results are obtained based on the current-layer element;

if so, obtaining the element one layer above the current-layer element to redefine the positioning condition, and again obtaining from that page the results that match the positioning condition;

if not, determining that the target entity is the only result obtained.

7. The page crawler rule update method according to any one of claims 1 to 6, characterized in that determining whether a rule update is needed according to the consistency between the entity sample and the content entity comprises:

when it is detected that the number of pages on which the two are inconsistent exceeds a preset threshold, determining that a rule update is needed.

8. A page crawler rule update system, characterized by comprising:

an update determination module, configured to determine whether a rule update is needed according to the consistency between an entity sample and the content entity;

a rule update module, configured to update the initial crawler rule according to the entity sample and the page information of pages in which the content entity has changed when the output of the update determination module is yes.

9. The page crawler rule update system according to claim 8, characterized in that the rule update module specifically comprises:

a target entity positioning unit, configured to obtain, in the page corresponding to the entity sample, a target entity consistent with the entity sample;

a target location information determination unit, configured to determine target location information from the respective location information of different pages;

10. The page crawler rule update system according to claim 8 or 9, characterized in that the update determination module specifically comprises:

an update judging unit, configured to determine that a rule update is needed when it is detected that the number of pages on which the two are inconsistent exceeds a preset threshold.
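
For illustration only, the threshold check of claim 7, the upper-layer positioning of claims 4 to 6, and the set intersection of claim 3 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names, the sample pages, the choice of the `class` attribute as the "specific attribute information", and the use of an ElementTree path string as the "location information" are all hypothetical, and the pages are assumed to be well-formed markup whose attribute values contain no quotes.

```python
import xml.etree.ElementTree as ET


def needs_update(pairs, threshold):
    """Claim 7 (sketch): update when more than `threshold` pages disagree.

    `pairs` is a list of (entity_sample, crawled_entity) tuples, one per page.
    """
    mismatches = sum(1 for sample, crawled in pairs if sample != crawled)
    return mismatches > threshold


def locate(root, target, attr="class"):
    """Claims 4-6 (sketch): climb upper-layer elements until the target is unique.

    `attr` stands in for the "specific attribute information" (an assumption).
    Returns an ElementTree path string, or None if no unique path is found.
    """
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = [target.tag]
    anchored = False  # becomes True once an upper layer carries `attr`
    node = parents.get(target)
    while node is not None and node is not root:
        value = node.get(attr)
        steps.insert(0, f"{node.tag}[@{attr}='{value}']" if value else node.tag)
        anchored = anchored or bool(value)
        if anchored:
            path = ".//" + "/".join(steps)
            matches = root.findall(path)
            # Stop only when the target is the single result obtained.
            if len(matches) == 1 and matches[0] is target:
                return path
        node = parents.get(node)  # otherwise add one more upper layer
    return None


def candidate_paths(page_source, sample):
    """Location info for every element whose text equals the entity sample."""
    root = ET.fromstring(page_source)
    targets = [el for el in root.iter() if (el.text or "").strip() == sample]
    return {p for p in (locate(root, t) for t in targets) if p}


def target_location(pages):
    """Claim 3 (sketch): intersect the per-page sets of location info.

    `pages` is a list of (page_source, entity_sample) tuples.
    """
    sets = [candidate_paths(source, sample) for source, sample in pages]
    return set.intersection(*sets) if sets else set()


# Hypothetical pages sharing one template; PAGE_A repeats the sample in a sidebar.
PAGE_A = (
    "<html><body>"
    "<div class='main'><div class='row'><p><span>Movie A</span></p></div></div>"
    "<div class='side'><div class='row'><p><span>Movie A</span></p></div></div>"
    "</body></html>"
)
PAGE_B = (
    "<html><body>"
    "<div class='main'><div class='row'><p><span>Movie B</span></p></div></div>"
    "<div class='side'><div class='row'><p><span>Ad</span></p></div></div>"
    "</body></html>"
)

if needs_update([("Movie A", "Old A"), ("Movie B", "Old B")], threshold=1):
    print(target_location([(PAGE_A, "Movie A"), (PAGE_B, "Movie B")]))
```

In this sketch the sidebar path found on the first page is discarded by the intersection because the second page does not contain the sample at that position, leaving only the path shared by both pages as the candidate for the updated rule.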