CN109783728A - Page crawler rule update method and system
- Publication number: CN109783728A
- Application number: CN201811637755.7A
- Authority: CN (China)
- Prior art keywords: page, entity, crawler, rule, sample
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a page crawler rule update method and system. The method mainly comprises: crawling the content entities of pages using an initial crawler rule; determining, according to whether the entity samples and the content entities are consistent, whether a rule update is needed; and, if so, updating the initial crawler rule according to the entity samples and the page information of the pages whose content entities have changed. The invention adaptively outputs an updated crawler rule from the crawl results, so the crawler program need not be frequently monitored and maintained by hand; manual workload is greatly reduced while the accuracy and stability of the crawl results are maintained.
Description
Technical field
The present invention relates to the field of web crawlers, and in particular to a page crawler rule update method and system.
Background
Crawler technology has become a common way of obtaining web page information. For example, with the explosive growth of information on the internet and the continued rise of web dramas, live streaming, and short videos, more and more people watch video online, but some of these videos may contain pornographic or violent content. Crawler technology can therefore be used to obtain the page information of video websites automatically and to locate the source videos from it; combined with automatic AI recognition, it can then be judged accurately whether the video content is pornographic or violent.
Existing crawler schemes mainly rely on manually inspecting the page code, working out the corresponding crawl rules, and writing a corresponding crawler program. When a website is redesigned and the original rules no longer apply, new rules usually have to be drawn up by hand. Because there are many web pages and it cannot be predicted how a website will change, the crawler state is monitored periodically to judge whether the crawl rules have failed. By the time a failure is found, some inaccurate data has usually already been produced; this data must be deleted and new crawl rules formulated. Alternatively, the crawl rules must be maintained by hand at frequent intervals. Either way, the labor cost is high and the efficiency low. For the individual doing the work, the process is also tedious and very time-consuming, and it is difficult to guarantee the accuracy and stability of the crawl results.
Summary of the invention
In view of the above problems, the invention proposes a page crawler rule update method and system that automatically determines and updates the crawl rules based on the existing crawl results and the content that has changed, thereby avoiding the above drawbacks of maintaining a crawler by hand.
The technical solution adopted by the invention is as follows:
A page crawler rule update method, comprising:
crawling the content entities of pages using an initial crawler rule;
determining, according to whether an entity sample and the content entity are consistent, whether a rule update is needed;
if so, updating the initial crawler rule according to the entity sample and the page information of the pages whose content entity has changed.
Optionally, updating the initial crawler rule according to the entity sample and the page information of the pages whose content entity has changed comprises:
obtaining, in the page corresponding to the entity sample, a target entity consistent with the entity sample;
determining the location information of the target entity in that page;
determining target location information from the respective location information of different pages;
updating the initial crawler rule using the target location information.
Optionally, determining target location information from the respective location information of different pages comprises:
collecting the location information for the same entity sample in one page into a set;
finding the intersection of the sets of all pages;
using the location information in the intersection as the target location information.
Optionally, determining the location information of the target entity in that page comprises:
using the upper-layer elements of the target entity as a location condition, and obtaining the results in the page that match the location condition;
once the target entity is the only result obtained, determining the location information based on the current location condition.
Optionally, using the upper-layer elements of the target entity as a location condition comprises:
obtaining the first-layer element above the target entity;
judging whether the first-layer element contains specific attribute information;
if so, using the first-layer element as the location condition;
if not, obtaining the second-layer element above the target entity and judging whether the second-layer element contains the specific attribute information, and so on until an upper-layer element containing the specific attribute information is reached, then using all the upper-layer elements obtained as the location condition.
Optionally, determining the location information of the target entity in that page further comprises:
judging whether multiple results are obtained based on the current-layer element;
if so, obtaining the element one layer above the current-layer element, redefining the location condition, and again obtaining the results in the page that match the location condition;
if not, determining that the target entity is the only result obtained.
Optionally, determining whether a rule update is needed according to whether the entity sample and the content entity are consistent comprises:
monitoring in real time whether the content entity and the entity sample are consistent;
when the number of pages on which the two are inconsistent exceeds a preset threshold, determining that a rule update is needed.
A page crawler rule update system, comprising:
a crawler module, configured to crawl the content entities of pages using an initial crawler rule;
an update determination module, configured to determine, according to whether an entity sample and the content entity are consistent, whether a rule update is needed;
a rule update module, configured to update the initial crawler rule according to the entity sample and the page information of the pages whose content entity has changed when the output of the update determination module is yes.
Optionally, the rule update module specifically comprises:
a target entity locating unit, configured to obtain, in the page corresponding to the entity sample, a target entity consistent with the entity sample;
a location information determination unit, configured to determine the location information of the target entity in that page;
a target location information determination unit, configured to determine target location information from the respective location information of different pages;
a rule update unit, configured to update the initial crawler rule using the target location information.
Optionally, the update determination module specifically comprises:
a change monitoring unit, configured to monitor in real time whether the content entity and the entity sample are consistent;
an update judging unit, configured to determine that a rule update is needed when the number of pages on which the two are inconsistent exceeds a preset threshold.
The invention ties the automatic update of the crawler rule to the current web page content and the original rule. Specifically, the content entities of pages are crawled using an initial crawler rule; whether a rule update is needed is determined according to whether the entity samples and the content entities are consistent; and, if so, the initial crawler rule is updated according to the entity samples and the page information of the pages whose content entities have changed. The invention adaptively outputs an updated crawler rule from the crawl results, so the crawler program need not be frequently monitored and maintained by hand; manual workload is greatly reduced while the accuracy and stability of the crawl results are maintained.
Brief description of the drawings
To make the objects, technical solutions, and advantages of the invention clearer, the invention is further described below with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of an embodiment of the page crawler rule update method provided by the invention;
Fig. 2 is a flowchart of an embodiment of the change monitoring method provided by the invention;
Fig. 3 is a flowchart of a specific embodiment of the rule update method provided by the invention;
Fig. 4 is a flowchart of a specific embodiment of the location information determination method provided by the invention;
Fig. 5 is a flowchart of a specific embodiment of the location condition determination method provided by the invention;
Fig. 6 is a flowchart of a specific embodiment of the locating result determination method provided by the invention;
Fig. 7 is a flowchart of a specific embodiment of the target location information determination method provided by the invention;
Fig. 8 is a block diagram of an embodiment of the page crawler rule update system provided by the invention.
Description of reference numerals:
1 crawler module; 2 update determination module; 3 rule update module
Detailed description
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting the claims.
The invention provides an embodiment of a page crawler rule update method which, as shown in Fig. 1, may include the following steps:
Step S1: crawl the content entities of pages using an initial crawler rule;
The initial crawler rule referred to here is a rule set up specifically for one category of network resource. The categories of web page can be defined in advance, for example pages on video websites, pages on shopping websites, pages on news websites, and so on. Because their functions differ, pages of different categories will most likely have different layouts, whereas pages of the same type have highly similar content layouts. The initial crawler rule in this embodiment is therefore for one category of web page; other categories are handled analogously and are not described further.
Specifically, the initial crawler rule can be obtained by setting, for at least one specific target in the page, an initial crawler rule for that target. The specific targets can be chosen according to the actual content sections or subdivided types of the page; for a video page, for example, the specific targets may include, but are not limited to, the title, the synopsis, and the author. Different targets have their own rules, so the initial crawler rule can be understood as a collection of at least one "target rule".
As for the manner of crawling, a preset crawler program can crawl, at a predetermined period such as every 30 minutes, 1 hour, or 2 hours, the content entity under each specific target from the pages according to the initial crawler rule. Three points need to be noted here. First, this embodiment does not limit the number of pages to be crawled. Second, the number of specific targets on a page is not limited. Third, the invention does not limit the form of the content entity or of the entity sample described below; each can be any form of content found in a web page, such as text, symbols, or code.
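As an illustration only, a collection of "target rules" and one crawl pass might be sketched as follows in Python. The rule format (a specific target name mapped to an ElementTree path), the names INITIAL_RULE and crawl_page, and the sample page are assumptions made for this sketch and do not come from the patent; the page is assumed to be well-formed markup so that the standard-library XML parser can read it.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical initial crawler rule: one "target rule" per specific target,
# written here as an ElementTree path (an assumption of this sketch).
INITIAL_RULE: Dict[str, str] = {
    "title": ".//h1[@class='title']",
    "author": ".//span[@id='author']",
}


def crawl_page(page_source: str, rule: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return the content entity found under each specific target, or None."""
    root = ET.fromstring(page_source)
    entities: Dict[str, Optional[str]] = {}
    for target, path in rule.items():
        node = root.find(path)
        text = (node.text or "").strip() if node is not None else ""
        entities[target] = text or None
    return entities


page = (
    "<html><body>"
    "<h1 class='title'>Cross-talk has new person</h1>"
    "<span id='author'>uploader</span>"
    "</body></html>"
)
print(crawl_page(page, INITIAL_RULE))
```

A real crawler would fetch pages on the predetermined period and would likely need a parser that tolerates malformed HTML; both are left out of the sketch.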
Step S2: determine, according to whether the entity samples and the content entities are consistent, whether a rule update is needed;
The entity sample referred to here may be preset before crawling as the content to be crawled under a specific target, or it may be the content entity under the specific target obtained the first time the page is crawled with the initial crawler rule. Whichever way it is obtained, an entity sample is for at least one specific target in each web page. For example, when the three specific targets title, synopsis, and author are crawled from each of five video pages, 15 entity samples are needed; equivalently, each web page can be regarded as having one sample that contains three specific-target samples. In practice this step may also include monitoring the content entity under each specific target, and the invention accordingly provides an embodiment of a preferred monitoring method which, as shown in Fig. 2, may include the following steps:
Step S201: monitor in real time whether the content entity and the entity sample are consistent;
Whether a crawled content entity and the entity sample are consistent can be judged by matching them one by one along different dimensions, such as the position of the content entity or the characters of the content entity itself. For example, after the first crawl, the content entity obtained by each subsequent crawl with the initial crawler rule can be matched against the entity sample along one or more dimensions to determine whether the two are consistent.
Step S202: when the number of pages on which the two are inconsistent exceeds a preset threshold, determine that a rule update is needed.
What this step emphasizes is that whether the content entity has changed is the trigger condition for a rule update: when it is determined that the content entity has changed, the subsequent rule update steps are entered. An inconsistency between content entity and entity sample (that is, a change in the crawled content entity) does not mean that the subsequent steps must be entered as soon as any content entity changes in any way. As those skilled in the art will understand, the determination carries a qualification: the crawled content entity of some specific target must be inconsistent with its corresponding entity sample, and the number (or proportion) of pages on which that specific target is inconsistent must exceed a preset standard, before "the content entity has changed" is determined. For example, if in the current crawl the content entity of the specific target "title" is inconsistent on more than 50% of the pages, the content entity and the entity sample are determined to be inconsistent in the sense used by the invention.
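The threshold decision of steps S201 and S202 can be sketched as follows. This is a minimal sketch under stated assumptions: crawl results and entity samples are held as dictionaries keyed by page URL, consistency is checked along a single dimension (exact string equality), and the function names are hypothetical. The 0.5 default mirrors the 50% example in the text.

```python
from typing import Dict, List

# page URL -> {specific target -> text}; this layout is an assumption.
PageEntities = Dict[str, Dict[str, str]]


def inconsistent_pages(crawled: PageEntities, samples: PageEntities, target: str) -> List[str]:
    """Pages whose crawled content entity differs from the entity sample."""
    return [
        url
        for url, sample in samples.items()
        if target in sample and crawled.get(url, {}).get(target) != sample[target]
    ]


def needs_rule_update(
    crawled: PageEntities, samples: PageEntities, target: str, threshold: float = 0.5
) -> bool:
    """True when the proportion of inconsistent pages exceeds the threshold."""
    total = sum(1 for sample in samples.values() if target in sample)
    if total == 0:
        return False
    return len(inconsistent_pages(crawled, samples, target)) / total > threshold
```

A single changed page therefore does not trigger an update; only a change seen on more than the preset share of pages does.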
Continuing from the above, if it is determined that a rule update is needed, step S3 is executed: update the initial crawler rule according to the entity samples and the page information of the pages whose content entities have changed.
In this step the rule update does not rely on manual monitoring; instead, the rule is updated based on the entity samples and the page information of the pages whose currently crawled content entities have changed. One preferred scheme, as shown in Fig. 3, may include the following steps:
Step S31: in the page corresponding to the entity sample, obtain the target entity consistent with the entity sample;
In this step, after the full content (for example the source code) of a page determined by the preceding steps to have "inconsistent content entity and entity sample" is obtained, the entity sample is used as the basis for searching the current page for identical content, which serves as the target entity. For example, if the entity sample of the title on some page is "Cross-talk has new person", this step looks for "Cross-talk has new person" in the page whose content entity is inconsistent with the entity sample. In practice, the current page may contain a single "Cross-talk has new person" or several located at different positions in the page; both cases continue with the steps below. If "Cross-talk has new person" is not found in the current page, that page takes no part in the subsequent rule update.
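Step S31 can be pictured with the short sketch below. It assumes the page source is well-formed markup readable by the standard-library XML parser and that "consistent with the entity sample" means the element's own text equals the sample after trimming whitespace; the function name find_target_entities is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def find_target_entities(page_source: str, entity_sample: str) -> List[ET.Element]:
    """Return every element whose own text equals the entity sample.

    An empty list means the page takes no part in the rule update.
    """
    root = ET.fromstring(page_source)
    return [el for el in root.iter() if (el.text or "").strip() == entity_sample]
```

Only the text directly inside an element is compared here; text that follows a child element (the ElementTree "tail") is ignored in this sketch.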
Step S32: determine the location information of the target entity in that page;
After the target entity is obtained, the invention proposes determining a representation of the target entity's position in the web page; that is, the location information of the target entity serves in the subsequent steps as the basis for formulating the new rule. To ensure that the new rule is valid and specific, the unique position representation of the target entity in the page is preferably used as that basis. In other words, when the results obtained in the page are not unique, the location information is considered ambiguous. How the location information is obtained is shown in the embodiment of Fig. 4, which includes the following steps:
Step S321: use the upper-layer elements of the target entity as a location condition, and obtain the results in the page that match the location condition;
Step S322: once the target entity is the only result obtained, determine the location information based on the current location condition.
For example, suppose the target entity obtained in the current page is "Cross-talk has new person":
…
<span>
<a>
Cross-talk has new person
</a>
</span>
…
In this example, <span></span> and <a></a> are both upper-layer elements of "Cross-talk has new person", and other entities may have one or more of the same upper-layer elements. This embodiment therefore uses the upper-layer elements of the target entity to check whether only the target entity can be located in the page. If the result finally located with one or more upper-layer elements is only "Cross-talk has new person", the location information can be determined from those upper-layer elements.
In a specific implementation, reference may further be made to the following preferred embodiment, shown in Fig. 5:
Step S3211: obtain the first-layer element above the target entity;
Step S3212: judge whether the first-layer element contains specific attribute information;
If so, execute step S3213: use the first-layer element as the location condition;
If not, execute step S3214: obtain the second-layer element above the target entity and judge whether it contains specific attribute information, and so on until an upper-layer element containing specific attribute information is reached, then use all the upper-layer elements obtained as the location condition.
This step starts with the layer immediately above the content entity; in the example, <a></a> is obtained first. It is then judged whether <a></a> has specific attribute information that can describe a rule, such as id (a unique identifier), class (a text style class), name (a text label name), or a custom attribute. As those skilled in the art will understand, when an element has such information, the corresponding rule can be determined from it. So when <a></a> contains the specific attribute information, <a></a> and its specific attribute information are used as the location condition, and the whole page is searched for matching results. If <a></a> does not qualify as a location condition in this sense, the next layer up, that is, the second-layer element above the content entity, is examined for specific attribute information; in the example, the attribute information of <span></span> is checked. If <span></span> contains the specific attribute information, <span></span> and <a></a> together are used to search for matching results in the information of the current page whose content entity has changed. If <span></span> does not contain specific attribute information either, the same check is applied to the next higher element, until an element containing specific attribute information is reached; that element and all the layers of elements beneath it are then combined to search the whole page for matching results.
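The climb described in steps S3211 to S3214 might look like the following sketch. Several things here are assumptions rather than the patent's own choices: the location condition is written as an ElementTree path, the list of "specific attributes" is limited to id, class, and name, only the first such attribute found on an element is used, attribute values are assumed not to contain a single quote, and the root element is never included in the path. The names describe and location_condition are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attributes treated as "specific attribute information" (assumed list).
SPECIFIC_ATTRS = ("id", "class", "name")


def describe(el: ET.Element):
    """Return (path step, whether the element carries a specific attribute)."""
    for attr in SPECIFIC_ATTRS:
        if attr in el.attrib:
            return f"{el.tag}[@{attr}='{el.attrib[attr]}']", True
    return el.tag, False


def location_condition(root: ET.Element, target: ET.Element) -> str:
    """Climb from the element holding the target entity until one layer
    carries a specific attribute, then join all collected layers."""
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    el = target
    while el in parents:  # the root has no parent and is left out of the path
        step, has_attr = describe(el)
        steps.insert(0, step)
        if has_attr:
            break
        el = parents[el]
    return ".//" + "/".join(steps) if steps else "."


root = ET.fromstring(
    "<div><span class='video-title'><a>Cross-talk has new person</a></span>"
    "<span><a>Other</a></span></div>"
)
target = next(el for el in root.iter("a") if el.text == "Cross-talk has new person")
condition = location_condition(root, target)
print(condition)
print(root.findall(condition) == [target])
```

In the sample page, <a> has no specific attribute, so the climb continues to <span class='video-title'> and the condition combines both layers.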
In addition, on the basis of the above embodiment, the handling of the case where multiple results are obtained can be further considered, as shown in Fig. 6:
Step S3221: judge whether multiple results are obtained based on the current-layer element;
If so, execute step S3222: obtain the element one layer above the current-layer element, redefine the location condition, and again obtain the results in the page that match the location condition;
If not, execute step S3223: determine that the target entity is the only result obtained.
If multiple results are obtained with the above location condition, the element condition is considered unsuitable. Following the above embodiment, a new location condition is then obtained by applying the location condition steps again starting from the layer above the current-layer element, and the whole page is searched again for matching results based on the new location condition, until the only result obtained is the target entity.
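The retry loop of steps S3221 to S3223 can be sketched as below. This is a simplified reading, not the patent's exact procedure: the sketch adds one upper layer at a time and tests for uniqueness after each layer, rather than first climbing to the next element with a specific attribute. The ElementTree path format, the attribute list, and the names describe and unique_location are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Optional

SPECIFIC_ATTRS = ("id", "class", "name")  # assumed attribute list


def describe(el: ET.Element) -> str:
    """Path step for one layer: the tag, plus a specific attribute if present."""
    for attr in SPECIFIC_ATTRS:
        if attr in el.attrib:
            return f"{el.tag}[@{attr}='{el.attrib[attr]}']"
    return el.tag


def unique_location(root: ET.Element, target: ET.Element) -> Optional[str]:
    """Add one upper layer at a time until the condition matches only the target."""
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    el = target
    while el in parents:  # stop below the root element
        steps.insert(0, describe(el))
        condition = ".//" + "/".join(steps)
        if root.findall(condition) == [target]:
            return condition
        el = parents[el]  # multiple results: move up one layer and retry
    return None  # no unique condition found; the location stays ambiguous
```

Returning None for an ambiguous location is a choice of this sketch; the text only says that non-unique results leave the location information ambiguous.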
Following step S32 above, step S33: determine target location information from the respective location information of different pages;
As mentioned above, content entities usually need to be crawled from multiple pages, so when the rule needs updating, the content entity has usually changed on a certain number or proportion of the web pages. The preceding steps therefore yield the respective location information of several different web pages, from which the target location information for the subsequent rule update can be determined. In practice this can be done as shown in Fig. 7, including the following steps:
Step S331: collect the location information for the same entity sample in one page into a set;
Step S332: find the intersection of the sets of all pages;
Step S333: use the location information in the intersection as the target location information.
As mentioned above, searching a web page with an entity sample may yield multiple target entities. Through the preceding steps, each target entity corresponding to the entity sample yields one unique position representation, so a page may end up with one or more pieces of location information; the invention therefore proposes collecting the one or more pieces of location information in a page into a set. The different pages yield their own sets in the same way, so it is only necessary to find the intersection of the sets, that is, the position representations that are identical in every page. Note that the position representations in the intersection depend on the entity sample, so an intersection corresponds to one entity sample, and it may contain one or more pieces of location information. In practice, any piece of location information in the intersection can be used as the target location information, or certain criteria can be set for selecting among them, thereby determining the target location information for that entity sample.
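Steps S331 to S333 amount to a set intersection, sketched below. The representation of location information as path strings, the function names, and the selection criterion in pick_target_location (shortest path, ties broken alphabetically) are assumptions; the text leaves the selection criterion open.

```python
from typing import Dict, Optional, Set


def common_locations(per_page: Dict[str, Set[str]]) -> Set[str]:
    """Intersect the per-page sets of location information for one entity sample."""
    sets = list(per_page.values())
    return set.intersection(*sets) if sets else set()


def pick_target_location(common: Set[str]) -> Optional[str]:
    """Assumed selection criterion: shortest path, ties broken alphabetically."""
    return min(common, key=lambda path: (len(path), path)) if common else None
```

For example, if every page's set contains ".//h1[@class='name']" while the other paths differ from page to page, the intersection holds only that path and it becomes the target location information.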
Continuing from the above, step S34 is finally executed: update the initial crawler rule using the target location information.
Once the target location information is determined, it can replace the original rule in the initial crawler rule that corresponds to the entity sample, and the initial crawler rule is updated in this way. As those skilled in the art will understand, the method above applies to multiple specific targets across multiple web pages of the same type, so there may be several pieces of target location information; in the rule update, every original rule that needs updating is replaced, and a complete crawler program is reconfigured accordingly.
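The replacement in step S34 can be sketched as a dictionary update, assuming (as an illustration only) that the crawler rule maps each specific target to a path string. The function name update_rule is hypothetical, and the sketch returns a new rule rather than modifying the original so that the old rule remains available for comparison.

```python
from typing import Dict


def update_rule(initial_rule: Dict[str, str], new_locations: Dict[str, str]) -> Dict[str, str]:
    """Replace only the target rules for which target location information
    was determined; every other target rule is kept unchanged."""
    updated = dict(initial_rule)
    for target, location in new_locations.items():
        if target in updated:
            updated[target] = location
    return updated
```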
Corresponding to the above embodiments and preferred schemes, the invention also provides an embodiment of a page crawler rule update system. As shown in Fig. 8, the system may include at least one memory for storing the relevant instructions and at least one processor connected to the memory and used to execute the following modules (in other embodiments, one or more processors may execute the corresponding steps directly rather than through the following modules, for example directly performing operations such as crawling pages, monitoring changes, and updating rules):
Crawler module 1, configured to crawl the content entities of pages using an initial crawler rule;
Update determination module 2, configured to determine, according to whether an entity sample and the content entity are consistent, whether a rule update is needed;
Rule update module 3, configured to update the initial crawler rule according to the entity sample and the page information of the pages whose content entity has changed when the output of the update determination module is yes.
Further, the rule update module specifically comprises:
a target entity locating unit, configured to obtain, in the page corresponding to the entity sample, a target entity consistent with the entity sample;
a location information determination unit, configured to determine the location information of the target entity in that page;
a target location information determination unit, configured to determine target location information from the respective location information of different pages;
a rule update unit, configured to update the initial crawler rule using the target location information.
Further, the target location information determination unit specifically comprises:
a collection subunit, configured to collect the location information for the same entity sample in one page into a set;
an intersection subunit, configured to find the intersection of the sets of all pages;
a target location information determination subunit, configured to use the location information in the intersection as the target location information.
Further, the location information determination unit specifically comprises:
a locating subunit, configured to use the upper-layer elements of the target entity as a location condition and obtain the results in the page that match the location condition;
a location information determination subunit, configured to determine the location information based on the current location condition once the target entity is the only result obtained.
Further, the locating subunit specifically comprises:
a first element obtaining component, configured to obtain the first-layer element above the target entity;
an attribute information judging component, configured to judge whether the first-layer element contains specific attribute information;
a location condition determining component, configured to use the first-layer element as the location condition if the output of the attribute information judging component is yes;
a second element obtaining component, configured, if the output of the attribute information judging component is no, to obtain the second-layer element above the target entity and judge whether the second-layer element contains the specific attribute information, and so on until an upper-layer element containing the specific attribute information is reached, then to use all the upper-layer elements obtained as the location condition.
Further, the location information determination unit further comprises:
a locating result count monitoring subunit, configured to judge whether multiple results are obtained based on the current-layer element;
a relocating subunit, configured, if the output of the locating result count monitoring subunit is yes, to obtain the element one layer above the current-layer element, redefine the location condition, and again obtain the results in the page that match the location condition;
a locating result determination subunit, configured to determine that the target entity is the only result obtained if the output of the locating result count monitoring subunit is no.
Further, the update determination module specifically comprises:
a change monitoring unit, configured to monitor in real time whether the content entity and the entity sample are consistent;
an update judging unit, configured to determine that a rule update is needed when the number of pages on which the two are inconsistent exceeds a preset threshold.
In conclusion, the invention ties the automatic update of the crawler rule to the current web page content and the original rule. Specifically, the content entities of pages are crawled using an initial crawler rule; whether a rule update is needed is determined according to whether the entity samples and the content entities are consistent; and, if so, the initial crawler rule is updated according to the entity samples and the page information of the pages whose content entities have changed. The invention adaptively outputs an updated crawler rule from the crawl results, so the crawler program need not be frequently monitored and maintained by hand; manual workload is greatly reduced while the accuracy and stability of the crawl results are maintained.
Although the working principles of the above system embodiment and preferred schemes have been described above, it should still be noted that the component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of these. The modules, units, or components in the embodiments may be combined into one module, unit, or component, or divided into multiple sub-modules, sub-units, or sub-components.
And all the embodiments in this specification are described in a progressive manner, identical phase between each embodiment
As partially may refer to each other, each embodiment focuses on the differences from other embodiments.Especially for
For system embodiment, since it is substantially similar to the method embodiment, so describing fairly simple, related place is referring to method
The part of embodiment illustrates.System embodiment described above is only schematical, wherein saying as separation unit
Bright unit may or may not be physically separated, and component shown as a unit can be or can not also
It is physical unit, it can it is in one place, or may be distributed over multiple network units.It can be according to actual need
Some or all of the modules therein is selected to achieve the purpose of the solution of this embodiment.Those of ordinary skill in the art are not paying
Out in the case where creative work, it can understand and implement.
The structure, features and effects of the present invention have been described in detail above based on the embodiments shown in the drawings. The above are only preferred embodiments of the present invention. It should be noted that those skilled in the art may reasonably combine the technical features of the above embodiments and their preferred variants into a variety of equivalent schemes without departing from or changing the design concept and technical effects of the present invention. Therefore, the scope of implementation of the present invention is not limited to what is shown in the drawings; any change made according to the conception of the present invention, or any equivalent embodiment modified into an equivalent variation, that does not go beyond the spirit covered by the description and the drawings shall fall within the protection scope of the present invention.
Claims (10)
1. A page crawler rule update method, characterized by comprising:
Crawling the content entity of a page using an initial crawler rule;
Determining whether a rule update is needed according to the consistency between an entity sample and the content entity;
If so, updating the initial crawler rule according to the entity sample and the page information in which the content entity has changed.
2. The page crawler rule update method according to claim 1, characterized in that updating the initial crawler rule according to the entity sample and the page information in which the content entity has changed comprises:
In the page corresponding to the entity sample, obtaining a target entity consistent with the entity sample;
Determining location information of the target entity in the page;
Determining target position information from the respective location information of different pages;
Updating the initial crawler rule using the target position information.
3. The page crawler rule update method according to claim 2, characterized in that determining the target position information from the respective location information of different pages comprises:
Collecting the location information for the same entity sample in one page into a set;
Taking the intersection of the sets of all pages;
Using the location information in the intersection as the target position information.
4. The page crawler rule update method according to claim 2, characterized in that determining the location information of the target entity in the page comprises:
Using an upper-layer element of the target entity as a location condition, and obtaining the results in the page that match the location condition;
Continuing until the target entity is the unique result obtained, and then determining the location information based on the current location condition.
5. The page crawler rule update method according to claim 4, characterized in that using the upper-layer element of the target entity as the location condition comprises:
Obtaining the first-layer element above the target entity;
Judging whether the first-layer element contains specific attribute information;
If so, using the first-layer element as the location condition;
If not, obtaining the second-layer element above the target entity and judging whether the second-layer element contains the specific attribute information, and so on until an upper-layer element containing the specific attribute information is obtained, whereupon all upper-layer elements obtained are used as the location condition.
6. The page crawler rule update method according to claim 5, characterized in that determining the location information of the target entity in the page further comprises:
Judging whether the current layer element yields multiple results;
If so, obtaining the element one layer above the current layer element, redefining the location condition, and again obtaining the results in the page that match the location condition;
If not, determining that the target entity is the unique result obtained.
7. The page crawler rule update method according to any one of claims 1 to 6, characterized in that determining whether a rule update is needed according to the consistency between the entity sample and the content entity comprises:
Monitoring in real time whether the content entity is consistent with the entity sample;
When the number of pages on which the two are found to be inconsistent exceeds a preset threshold, determining that a rule update is needed.
8. A page crawler rule update system, characterized by comprising:
A crawler module, configured to crawl the content entity of a page using an initial crawler rule;
An update determination module, configured to determine whether a rule update is needed according to the consistency between an entity sample and the content entity;
A rule update module, configured to, when the update determination module outputs yes, update the initial crawler rule according to the entity sample and the page information in which the content entity has changed.
9. The page crawler rule update system according to claim 8, characterized in that the rule update module specifically includes:
A target entity positioning unit, configured to obtain, in the page corresponding to the entity sample, a target entity consistent with the entity sample;
A location information determination unit, configured to determine location information of the target entity in the page;
A target position information determination unit, configured to determine target position information from the respective location information of different pages;
A rule update unit, configured to update the initial crawler rule using the target position information.
10. The page crawler rule update system according to claim 8 or 9, characterized in that the update determination module specifically includes:
A change monitoring unit, configured to monitor in real time whether the content entity is consistent with the entity sample;
An update judging unit, configured to determine that a rule update is needed when the number of pages on which the two are found to be inconsistent exceeds a preset threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811637755.7A CN109783728B (en) | 2018-12-29 | 2018-12-29 | Page crawler rule updating method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109783728A (en) | 2019-05-21 |
CN109783728B (en) | 2021-10-19 |
Family
ID=66498997
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811637755.7A Active CN109783728B (en) | 2018-12-29 | 2018-12-29 | Page crawler rule updating method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109783728B (en) |
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101101600A (en) * | 2007-07-10 | 2008-01-09 | 北京大学 | Metadata automatic extraction method based on multiple rule in network search |
US20090019354A1 (en) * | 2007-07-10 | 2009-01-15 | Yahoo! Inc. | Automatically fetching web content with user assistance |
CN101599089A (en) * | 2009-07-17 | 2009-12-09 | 中国科学技术大学 | The automatic search of update information on content of video service website and extraction system and method |
US20110145218A1 (en) * | 2009-12-11 | 2011-06-16 | Microsoft Corporation | Search service administration web service protocol |
CN101968819A (en) * | 2010-11-05 | 2011-02-09 | 中国传媒大学 | Audio/video intelligent catalog information acquisition method facing to wide area network |
CN102760150A (en) * | 2012-04-05 | 2012-10-31 | 中国人民解放军国防科学技术大学 | Webpage extraction method based on attribute reproduction and labeled path |
CN103714116A (en) * | 2013-10-31 | 2014-04-09 | 北京奇虎科技有限公司 | Webpage information extracting method and webpage information extracting equipment |
CN104899219A (en) * | 2014-03-06 | 2015-09-09 | 携程计算机技术(上海)有限公司 | Screening method and system of pseudo-static URL (Uniform Resource Locator) and webpage crawling method and system |
CN104376063A (en) * | 2014-11-11 | 2015-02-25 | 南京邮电大学 | Multithreading web crawler method based on sort management and real-time information updating system |
CN105512285A (en) * | 2015-12-07 | 2016-04-20 | 南京大学 | Self-adaption web crawler method based on machine learning |
CN105912547A (en) * | 2015-12-15 | 2016-08-31 | 乐视网信息技术(北京)股份有限公司 | Method and device for realizing data rapid processing based on web spider |
CN105824965A (en) * | 2016-04-01 | 2016-08-03 | 无锡中科富农物联科技有限公司 | Data source finding method based on dynamic crawler technology |
CN108073608A (en) * | 2016-11-09 | 2018-05-25 | 北京国双科技有限公司 | The update method and device of data message |
Non-Patent Citations (2)
Title |
---|
Z. GUOJUN et al.: "Design and application of intelligent dynamic crawler for web data mining", 《2017 32ND YOUTH ACADEMIC ANNUAL CONFERENCE OF CHINESE ASSOCIATION OF AUTOMATION (YAC)》 * |
Tang Lin et al.: "Exploration of key issues in Python-based web crawler technology" (基于Python的网络爬虫技术的关键性问题探索), 《电子世界》 (Electronics World) * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112417252A (en) * | 2020-12-04 | 2021-02-26 | 天津开心生活科技有限公司 | Crawler path determination method and device, storage medium and electronic equipment |
CN112417252B (en) * | 2020-12-04 | 2023-05-09 | 天津开心生活科技有限公司 | Crawler path determination method and device, storage medium and electronic equipment |
CN113626673A (en) * | 2021-07-30 | 2021-11-09 | 彩讯科技股份有限公司 | Page data acquisition method, system, terminal and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109783728B (en) | 2021-10-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11768865B2 (en) | Tag weighting engine using past context and active context | |
JP5436665B2 (en) | Classification of simultaneously selected images | |
JP6303023B2 (en) | Temporary eventing system and method | |
EP1862916A1 (en) | Indexing Documents for Information Retrieval based on additional feedback fields | |
CN104750737B (en) | A kind of photograph album management method and device | |
US20130251217A1 (en) | Method and Apparatus to Incorporate Automatic Face Recognition in Digital Image Collections | |
CN104424244B (en) | A kind of method, apparatus and equipment obtaining search result | |
CN103488681A (en) | Slash label | |
CN110019823B (en) | Method and device for updating knowledge graph | |
US20050116966A1 (en) | Web imaging serving technology | |
CA2647725C (en) | Contextual search of a collaborative environment | |
CN108170731A (en) | Data processing method, device, computer storage media and server | |
CN107885873A (en) | Method and apparatus for output information | |
CN112989157A (en) | Method and device for detecting crawler request | |
CN109783728A (en) | Page crawler rule update method and system | |
CN106897432B (en) | System and method for crawling landmark information in electronic map | |
CN111191065B (en) | Homologous image determining method and device | |
CN105653533B (en) | A kind of method and apparatus updating classification associated set of words | |
US20200125680A1 (en) | Systems and methods for producing search results based on user preferences | |
CN110096658A (en) | A kind of data bury point methods and device | |
US20140244319A1 (en) | System and method for storing and finding activities | |
CN113590277A (en) | Task state switching method and device and electronic system | |
CN108021641B (en) | The method and apparatus that the association keyword of application is expanded | |
JP2005148846A (en) | Content classifying system and method, computer program, and recording medium | |
CN106415482A (en) | Providing aggregated metadata for programming content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||