CN110968756A - Webpage crawling method and device - Google Patents


Info

Publication number
CN110968756A
Authority
CN
China
Prior art keywords
crawling
framework
rule
rules
domain name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811145540.3A
Other languages
Chinese (zh)
Other versions
CN110968756B (en)
Inventor
何熠皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201811145540.3A priority Critical patent/CN110968756B/en
Publication of CN110968756A publication Critical patent/CN110968756A/en
Application granted granted Critical
Publication of CN110968756B publication Critical patent/CN110968756B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage crawling method and device, and relates to the technical field of crawling. It mainly addresses the problem in the prior art that a new crawling framework cannot be formed from a pre-created crawling framework to crawl a webpage. The method comprises the following steps: acquiring a domain name of a target webpage, and determining the rules matched with the domain name; judging whether a pre-created first crawling framework contains at least part of the rules matched with the domain name; if so, inheriting the at least part of the rules from the first crawling framework; and creating a second crawling framework according to the at least part of the rules, and crawling the target webpage through the second crawling framework. The method can be widely applied to webpage crawling scenarios.

Description

Webpage crawling method and device
Technical Field
The invention relates to the technical field of crawling, in particular to a webpage crawling method and device.
Background
With the rapid development of network technology, the World Wide Web carries more and more information. Searching for information with conventional search engine technology takes a lot of time, and it is difficult to find exactly the information a user needs. How to quickly and effectively acquire the information a user needs from a large amount of network information has become an urgent problem.
To solve this problem, web crawlers have emerged. A web crawler is a program or script that automatically captures web information according to certain rules, and can quickly and accurately acquire the information a user needs from a webpage. To crawl a webpage with a crawler, a user first needs to create a crawling framework (Schema). Because creating a complete crawling framework costs the user a large amount of time and energy, the prior art allows crawling frameworks created in advance by other users to be shared through a sharing mechanism, so that a user can directly reference a pre-created crawling framework to crawl a webpage when needed. However, the user cannot form, based on the pre-created crawling framework, a new crawling framework that meets the crawling requirements of the current webpage.
Disclosure of Invention
In view of this, the web page crawling method and apparatus provided by the invention mainly aim to solve the problem that a new crawling architecture cannot be formed based on a pre-created crawling architecture in the prior art to crawl a web page.
In order to solve the above problems, the present invention mainly provides the following technical solutions:
in a first aspect, the present invention provides a method for crawling a web page, including:
acquiring a domain name of a target webpage, and determining a rule matched with the domain name;
judging whether a first crawling framework created in advance contains at least part of rules matched with the domain name;
if the first crawling framework comprises at least part of rules matched with the domain name, inheriting the at least part of rules from the first crawling framework;
and creating a second crawling framework according to at least part of the rules, and crawling the target webpage through the second crawling framework.
Optionally, the method further includes:
if the first crawling framework does not contain at least part of rules matched with the domain name, creating a third crawling framework with rules matched with the domain name;
and crawling the target webpage through the third crawling architecture.
Optionally, the first crawling framework containing at least part of the rules matched with the domain name includes:
the first crawling framework contains all rules matched with the domain name; or
the first crawling framework contains part of the rules matched with the domain name;
the first crawling framework not containing at least part of the rules matched with the domain name includes:
the first crawling framework does not contain any rule matched with the domain name.
Optionally, if the first crawling architecture includes a part of rules matching with the domain name, the method further includes:
determining a rule to be created according to the rule matched with the domain name and the partial rule matched with the domain name;
creating the rule to be created;
the creating a second crawling architecture according to the at least part of the rule comprises:
and creating the second crawling framework according to the created rule to be created and the partial rule matched with the domain name.
Optionally, the first crawling framework and the second crawling framework include field attributes, which are used to determine the fields and field types to be crawled by the crawling framework, and each rule has a corresponding field attribute.
Optionally, the method further includes:
judging whether the number of rules matched with the same domain name in the rules corresponding to each field attribute is greater than 1;
if yes, determining a rule for crawling from the matched rules according to preset matching conditions;
the crawling of the target webpage through the second crawling framework or the third crawling framework comprises:
and the second crawling framework or the third crawling framework crawls the target webpage according to the determined rule.
Optionally, the method further includes:
when it is monitored that a rule in the first crawling framework is deleted, and/or the field attribute corresponding to a rule in the first crawling framework is deleted, or the first crawling framework is deleted, adding the rule and/or the field attribute corresponding to the rule to the second crawling framework;
and releasing the inheritance relationship of the first crawling framework and the second crawling framework.
In a second aspect, the present invention further provides a web page crawling apparatus, including:
the acquisition unit is used for acquiring the domain name of the target webpage;
the determining unit is used for determining a rule matched with the domain name;
the judgment unit is used for judging whether a first crawling framework which is created in advance contains at least part of rules matched with the domain name;
the inheritance unit is used for inheriting at least part of rules from the first crawling framework when the first crawling framework contains at least part of rules matched with the domain name;
a creating unit, configured to create a second crawling framework according to the at least part of the rule;
and the crawling unit is used for crawling the target webpage through the second crawling framework.
Optionally, the creating unit is further configured to create a third crawling framework having rules matching the domain name when at least part of the rules matching the domain name is not included in the first crawling framework;
the crawling unit is further used for crawling the target webpage through the third crawling framework.
Optionally, the first crawling framework containing at least part of the rules matched with the domain name includes:
the first crawling framework contains all rules matched with the domain name; or
the first crawling framework contains part of the rules matched with the domain name;
the first crawling framework not containing at least part of the rules matched with the domain name includes:
the first crawling framework does not contain any rule matched with the domain name.
Optionally, the determining unit is further configured to determine, when the first crawling architecture includes a partial rule matched with the domain name, a rule to be created according to the rule matched with the domain name and the partial rule matched with the domain name;
the creating unit is further configured to create the rule to be created;
the creating unit is further configured to create the second crawling architecture according to the created rule to be created and the partial rule matched with the domain name.
Optionally, the first crawling framework and the second crawling framework include field attributes, which are used to determine the fields and field types to be crawled by the crawling framework, and each rule has a corresponding field attribute.
Optionally, the determining unit is further configured to determine whether the number of rules matched with the same domain name in the rules corresponding to each field attribute is greater than 1;
the determining unit is further used for determining a rule for crawling from the matched rules according to preset matching conditions when the judgment result is yes;
and the crawling unit is further used for crawling the target webpage by the second crawling framework or the third crawling framework according to the determined rule.
Optionally, the apparatus further comprises:
the adding unit is used for adding the rule and/or the field attribute corresponding to the rule into the second crawling framework when the condition that the rule in the first crawling framework is deleted, and/or the field attribute corresponding to the rule in the first crawling framework is deleted, or the first crawling framework is deleted is monitored;
and the removing unit is used for removing the inheritance relationship of the first crawling framework and the second crawling framework.
In a third aspect, in order to achieve the above object, the present invention further provides a storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the web page crawling method according to the first aspect.
In a fourth aspect, to achieve the above object, the present invention further provides a processor, where the processor is configured to execute a program, where the program executes the method for crawling a web page according to the first aspect.
By the technical scheme, the technical scheme provided by the invention at least has the following advantages:
Compared with the prior art, in which a pre-created crawling framework can only be referenced in its entirety when crawling a webpage and a new crawling framework cannot be formed from it, the method and device provided by the invention determine the rules matched with the domain name of the target webpage and judge whether the pre-created first crawling framework contains at least part of those rules, inheriting them if so. After this determination, a second crawling framework is created according to the inherited rules, so that the second crawling framework can comprise both the rules inherited from other crawling frameworks and newly created rules. Because the second crawling framework partially inherits the rules of the previously created crawling framework, the time for creating a crawling framework is reduced, and the newly created crawling framework can meet the crawling requirement of the current webpage.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a method for crawling a web page according to an embodiment of the present invention;
FIG. 2 illustrates an organizational diagram of a first crawling framework provided by an embodiment of the invention;
FIG. 3 illustrates an organizational diagram of a first crawling framework provided by an embodiment of the invention;
FIG. 4 is a flowchart illustrating another web page crawling method according to an embodiment of the present invention;
FIG. 5 is a block diagram of a web page crawling apparatus according to an embodiment of the present invention;
FIG. 6 is a block diagram of another web page crawling apparatus provided by the embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Referring to fig. 1, an embodiment of the present invention provides a method for crawling a web page, where the method mainly includes:
101. Acquiring the domain name of the target webpage, and determining the rules matched with the domain name.
Before the content of a webpage is crawled, the whole target webpage to be crawled is first obtained and its domain name is acquired; the rules matched with the domain name are then determined, so that a crawling framework having the matched rules can accurately crawl the target webpage. Further, the target webpage may be saved locally so that subsequent crawling can be performed locally.
102. Judging whether at least part of the rules matched with the domain name are contained in the pre-created first crawling framework.
The pre-created first crawling framework is a crawling framework created by another user and shared publicly. As shown in FIG. 2, a crawling framework defines a crawling scheme for webpages. When a user crawls with a crawler, different field attributes (Properties) in the crawling framework identify the different fields of the webpage that the crawler needs to crawl, together with the types of those fields. Each field attribute in the crawling framework has a corresponding rule set; the rule set comprises a plurality of rules, each rule is the parsing rule configured for that field attribute on a certain webpage, and the scope of each rule is a domain name. For example, a crawling framework for crawling news pages includes a first field attribute and a second field attribute: the first field attribute identifies the title field to be crawled, whose type is text; the second field attribute identifies the publish time field to be crawled, whose type is time. The rule configured for the first field attribute on the page https://new.qq.com/omn/20180730A01I2200 is: #root > div > div > div.qq_content.clearfix > div.left > h1, and the scope of the rule is qq.com. To avoid spending time on repeated creation, after the rules corresponding to the target webpage are determined, the existing crawling framework is first searched for at least part of the rules matched with the domain name of the target webpage. The first crawling framework containing at least part of the rules matched with the domain name covers two cases: the first crawling framework contains all the rules matched with the domain name, or it contains part of the rules matched with the domain name.
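The organization described above (a crawling framework holding field attributes, each with a rule set of domain-scoped rules) can be sketched as a small data model. This is only an illustrative sketch: the class names, the `rules_for` helper and the `publish_time` key are assumptions introduced here, not part of the disclosed implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Rule:
    name: str
    selector: str  # parsing rule, here written as a CSS selector
    scope: str     # domain name the rule applies to


@dataclass
class Property:
    name: str        # field to crawl, e.g. "title"
    field_type: str  # type of the field, e.g. "text" or "time"
    rule_set: List[Rule] = field(default_factory=list)


@dataclass
class Schema:
    name: str
    properties: Dict[str, Property] = field(default_factory=dict)

    def rules_for(self, domain: str) -> List[Rule]:
        """Return every rule in the schema whose scope equals the domain."""
        return [
            rule
            for prop in self.properties.values()
            for rule in prop.rule_set
            if rule.scope == domain
        ]


# Hypothetical news schema mirroring the example in the text.
title = Property(
    "title",
    "text",
    [Rule("Rule1", "#root > div > div > div.qq_content.clearfix > div.left > h1", "qq.com")],
)
publish_time = Property("publish_time", "time")
news_schema = Schema("news", {"title": title, "publish_time": publish_time})
```

Calling `news_schema.rules_for("qq.com")` returns the single title rule, while a domain with no configured rule yields an empty list.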
103. If the first crawling framework comprises at least part of rules matched with the domain name, inheriting the at least part of rules from the first crawling framework.
If all rules or part of rules matched with the domain name are found from the existing first crawling framework, the all rules or part of rules are inherited from the first crawling framework. As shown in fig. 2 and fig. 3, the pre-created Schema1 includes partial rules Rule1 and Rule2 that match the domain name, that is, Rule1 and Rule2 can be inherited from Schema 1.
104. Creating a second crawling framework according to the at least part of the rules, and crawling the target webpage through the second crawling framework.
After part or all of the rules matched with the domain name of the target webpage are found in the first crawling framework, the rules that can be inherited and the rules to be created are determined, and the rules to be created are then created. A second crawling framework is then created from the inherited rules and the created rules, so that not all rules in the second crawling framework have to be created, which reduces the time consumed in creating it. After the second crawling framework inherits from the first crawling framework, crawling a webpage with the second crawling framework also uses the inherited rules in the first crawling framework.
Compared with the prior art, in which a pre-created crawling framework can only be referenced in its entirety when crawling a webpage and a new crawling framework cannot be formed from it, the webpage crawling method provided by the embodiment of the invention determines the rules matched with the domain name of the target webpage and judges whether the pre-created first crawling framework contains at least part of those rules, inheriting them if so. After this determination, a second crawling framework is created according to the inherited rules, so that the second crawling framework can comprise both the rules inherited from other crawling frameworks and newly created rules. Because the second crawling framework partially inherits the rules of the previously created crawling framework, the time for creating a crawling framework is reduced, and the newly created crawling framework can meet the crawling requirement of the current webpage.
Based on the web page crawling method shown in fig. 1, another embodiment of the present invention further provides another web page crawling method, which is shown with reference to fig. 4 and mainly includes:
201. Acquiring the domain name of the target webpage, and determining the rules matched with the domain name.
Assuming that the domain name of the target webpage is qq.com; as shown in fig. 2 and 3, the Rule matching the domain name qq.com is determined to be Rule 1. Further, the target web page may be a plurality of web pages with different domain names, for example, the first domain name of the first target web page is qq.com, and the second domain name of the second target web page is sina.com; then, Rule1 is determined to be matched with the first domain name, and Rule2 is determined to be matched with the second domain name.
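A minimal sketch of this step follows, assuming the domain name is taken as the last two labels of the URL host. That shortcut is an assumption made for illustration only (it is wrong for suffixes such as .com.cn), and the `RULE_SCOPES` table and function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical table of known rules and the domain name each one is scoped to.
RULE_SCOPES = {"Rule1": "qq.com", "Rule2": "sina.com"}


def domain_of(url: str) -> str:
    """Naively reduce a URL to its last two host labels, e.g. new.qq.com -> qq.com."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def rules_matching(domain: str) -> list:
    """Return the names of all rules whose scope is the given domain name."""
    return sorted(name for name, scope in RULE_SCOPES.items() if scope == domain)
```

With this table, the page https://new.qq.com/omn/20180730A01I2200 resolves to qq.com and matches Rule1, while a sina.com page matches Rule2.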
202. Judging whether a first crawling framework created in advance contains at least part of rules matched with the domain name; if yes, go to step 203; if not, go to step 205.
The first crawling framework containing at least part of the rules matched with the domain name covers the following two cases: the first crawling framework contains all rules matched with the domain name; or the first crawling framework contains part of the rules matched with the domain name. The first crawling framework not containing at least part of the rules matched with the domain name covers the following case: the first crawling framework does not contain any rule matched with the domain name. Whether the first crawling framework contains at least part of the rules matched with the domain name is judged according to these cases.
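The three cases can be expressed as a simple set comparison. The sketch below is illustrative only; the `Coverage` enum and the `coverage` function are names assumed here, and rules are represented by their names.

```python
from enum import Enum


class Coverage(Enum):
    ALL = "all"          # first framework contains every matching rule
    PARTIAL = "partial"  # first framework contains some matching rules
    NONE = "none"        # first framework contains no matching rule


def coverage(required: set, available: set) -> Coverage:
    """Classify how many of the required rules the first framework already has."""
    shared = required & available
    if not shared:
        return Coverage.NONE
    return Coverage.ALL if shared == required else Coverage.PARTIAL
```

In this sketch, ALL and PARTIAL both lead to the inheritance branch, and NONE leads to creating a third crawling framework from scratch.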
203. The at least partial rule is inherited from the first crawling framework.
And after determining that the pre-created crawling framework contains the partial rule matched with the domain name, inheriting the partial rule. Further, when a rule is inherited, the field attribute and the rule set corresponding to the rule are inherited at the same time. As shown in fig. 2 and fig. 3, since the inherited Rule1 and Rule2 correspond to a Rule set (RuleSet1) and a field attribute (Property1), the Property1, RuleSet1, Rule1 and Rule2 are inherited from the Schema1 according to the correspondence, so as to further reduce the data required to be newly created.
204. Determining the rule to be created according to the rules matched with the domain name and the partial rules matched with the domain name.
The rules matched with the domain name of the target webpage are determined in advance according to requirements. After part of the rules are inherited from another crawling framework, the remaining rules that cannot be inherited are determined as the rules to be created, so that the inherited rules and the rules to be created together form all the rules required for crawling the target webpage.
205. Creating the rule to be created.
After the rule to be created is determined, it is created. Further, when the currently inherited field attributes and rule sets do not include the field attribute and rule set corresponding to the rule to be created, the corresponding field attribute and rule set are created together with the rule. As shown in FIG. 3, Rule5, Rule6 and Rule7 are created, and Property3 and RuleSet3 corresponding to Rule6 and Rule7 are created.
206. Creating a second crawling framework according to the created rule and the partial rules matched with the domain name.
As shown in FIG. 3, the inherited Rule1, Rule2, Property1 and RuleSet1 and the created Rule5, Rule6, Rule7, Property3 and RuleSet3 are obtained to create Schema2. Specifically, according to the correspondence, the created Rule5 is associated with the inherited Property1 and RuleSet1, and Schema2 is then created from the associated Rule1, Rule2, Rule5, Property1, RuleSet1, Rule6, Rule7, Property3 and RuleSet3.
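Steps 204 to 206 can be sketched as follows, modelling each framework as a mapping from field attribute to the rule names in its rule set. The Property2, Rule3 and Rule4 entries in `schema1` are assumptions added to show rules that are not inherited; the function names are likewise hypothetical.

```python
# Pre-created first framework (Property2/Rule3/Rule4 are illustrative assumptions).
schema1 = {
    "Property1": ["Rule1", "Rule2"],
    "Property2": ["Rule3", "Rule4"],
}

# Rules required for the target webpage, grouped by field attribute.
required = {
    "Property1": ["Rule1", "Rule2", "Rule5"],
    "Property3": ["Rule6", "Rule7"],
}


def plan(first: dict, needed: dict):
    """Split the needed rules into those inherited and those to be created."""
    inherited, to_create = {}, {}
    for prop, rules in needed.items():
        existing = first.get(prop, [])
        keep = [r for r in rules if r in existing]
        new = [r for r in rules if r not in existing]
        if keep:
            inherited[prop] = keep
        if new:
            to_create[prop] = new
    return inherited, to_create


def build_second(inherited: dict, to_create: dict) -> dict:
    """Merge inherited and newly created rules into the second framework."""
    second = {prop: list(rules) for prop, rules in inherited.items()}
    for prop, rules in to_create.items():
        # A missing field attribute (and its rule set) is created with the rule.
        second.setdefault(prop, []).extend(rules)
    return second


inherited, to_create = plan(schema1, required)
schema2 = build_second(inherited, to_create)
```

Here Rule5 is attached to the inherited Property1, while Property3 is created together with Rule6 and Rule7, matching the association described for Schema2.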
207. A third crawling framework is created that has rules that match the domain name.
When rules cannot be inherited from the pre-created crawling framework, all rules matched with the domain name need to be newly created, and a third crawling framework is then created according to the newly created rules.
208. Judging whether the number of rules matched with the same domain name in the rules corresponding to each field attribute is greater than 1; if yes, go to step 209; if no, go to step 210.
Before a crawling framework is used to crawl the target webpage, it is further judged whether the rules in the crawling framework that match the domain name of the target webpage include a plurality of rules that have the same scope and correspond to the same field attribute. If not, crawling is performed directly through the crawling framework; if so, one rule is selected from them before crawling. This avoids the confusion in crawled data that arises when several rules match while the same field is crawled from webpages under the same domain name.
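One way to perform this check is to group rules by field attribute and scope and keep the groups holding more than one rule. The sketch below assumes each rule is given as a (name, field attribute, scope) tuple; the scopes assigned to Rule6 and Rule7 in the test data are hypothetical.

```python
from collections import defaultdict


def conflicting_groups(rules):
    """Return {(field_attribute, scope): [rule names]} for groups with more than one rule."""
    groups = defaultdict(list)
    for name, prop, scope in rules:
        groups[(prop, scope)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Rules of Schema2; the scopes of Rule6 and Rule7 are assumed for illustration.
schema2_rules = [
    ("Rule1", "Property1", "qq.com"),
    ("Rule2", "Property1", "sina.com"),
    ("Rule5", "Property1", "qq.com"),
    ("Rule6", "Property3", "qq.com"),
    ("Rule7", "Property3", "sina.com"),
]
conflicts = conflicting_groups(schema2_rules)
```

An empty result means the framework can crawl directly; a non-empty result lists the groups from which a single rule still has to be chosen.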
209. Determining the rule for crawling from the matched rules according to preset matching conditions.
The matching condition may be that one rule with the latest creation time is extracted from rules corresponding to the same field attribute and having the same scope, or that one rule stored locally is extracted from a plurality of rules corresponding to the same field attribute and having the same scope.
As shown in fig. 3, in Schema2, Rule1 and Rule5 corresponding to Property1 are both rules configured on the same domain name, and it is necessary to further determine rules finally used for crawling from Rule1 and Rule5 according to preset matching conditions. For example, when the matching condition is a Rule whose creation time is the latest, since the Schema1 modifies Rule1 on day 5 and Rule5 is created on day 1, Rule1 is determined as a Rule for crawling. When the matching condition is to extract a Rule stored locally, Rule5 is determined as a Rule for crawling because Rule5 is created locally by a user and Rule1 is shared by other users in a server.
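The two matching conditions in this example can be sketched as a small selection function. The `changed_day` and `local` fields, the condition strings and the fallback used when no rule is stored locally are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    scope: str        # domain name the rule applies to
    changed_day: int  # hypothetical day of creation or last modification
    local: bool       # True if created locally, False if shared on a server


def pick_rule(candidates: List[Rule], condition: str) -> Optional[Rule]:
    """Choose the single rule used for crawling among conflicting candidates."""
    if not candidates:
        return None
    if condition == "latest":
        return max(candidates, key=lambda r: r.changed_day)
    if condition == "local":
        # Assumption: fall back to the first candidate if none is stored locally.
        return next((r for r in candidates if r.local), candidates[0])
    raise ValueError(f"unknown matching condition: {condition}")


rule1 = Rule("Rule1", "qq.com", changed_day=5, local=False)
rule5 = Rule("Rule5", "qq.com", changed_day=1, local=True)
```

With these values, the "latest" condition selects Rule1 and the "local" condition selects Rule5, as in the example above.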
210. Crawling the target webpage with the created crawling framework according to the determined rule.
After the second crawling framework is created from the inherited rules and the rules to be created, or the third crawling framework is created from the rules matched with the domain name, and it is determined either that the rules in the crawling framework are used directly or that a rule for crawling is selected from them according to the matching condition, the target webpage can be crawled through the created crawling framework and the determined rule.
211. When it is monitored that a rule in the first crawling framework is deleted, and/or the field attribute corresponding to a rule in the first crawling framework is deleted, or the first crawling framework is deleted, adding the rule and/or the field attribute corresponding to the rule to the second crawling framework.
When it is monitored that an inherited rule is deleted from the first crawling framework, that an inherited rule and its corresponding field attribute are deleted from the first crawling framework, that the field attribute corresponding to an inherited rule is deleted from the first crawling framework, or that the whole first crawling framework is deleted, the deleted rule and/or field attribute is added to the second crawling framework, and the second crawling framework is then formed from the added rule and/or field attribute together with the rules and/or field attributes it already contains. For example, when it is monitored that Schema1 is deleted, or that Property1 in Schema1 is deleted, or that Rule1 is deleted, the deleted Property1 or Rule1 is added to Schema2, so that Property1 or Rule1 becomes data belonging to Schema2. In this way, the corresponding locally configured rules remain valid when a webpage is crawled.
212. Releasing the inheritance relationship between the first crawling framework and the second crawling framework.
When the first crawling framework is deleted, or the rules that the second crawling framework needs to call no longer exist in the first crawling framework, the inheritance relationship between the first crawling framework and the second crawling framework is released. The first crawling framework then no longer needs to be called when the second crawling framework is subsequently used to crawl a webpage, which avoids data calling errors and reduces unnecessary time consumption.
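Steps 211 and 212 can be sketched as a child framework that copies inherited rules into its own storage when they are deleted from the parent, and drops the parent reference once nothing is inherited any more. The class, its method names and the notification mechanism (the caller passes the deleted rules in) are assumptions; rules are modelled as a name-to-selector mapping with hypothetical selectors.

```python
class ChildSchema:
    """Second framework that inherits some rules from a first (parent) framework."""

    def __init__(self, parent: dict, inherited_names, own_rules: dict):
        self.parent = parent                  # parent rules: name -> selector
        self.inherited = set(inherited_names)
        self.own = dict(own_rules)

    def rules(self) -> dict:
        """All rules usable for crawling: own rules plus still-inherited ones."""
        merged = dict(self.own)
        if self.parent is not None:
            for name in self.inherited:
                if name in self.parent:
                    merged[name] = self.parent[name]
        return merged

    def on_parent_delete(self, deleted: dict) -> None:
        """Handle deletion of rules (or of the whole parent) in the first framework."""
        for name, selector in deleted.items():
            if name in self.inherited:
                self.own[name] = selector     # deleted rule becomes local data
                self.inherited.discard(name)
        if not self.inherited:
            self.parent = None                # release the inheritance relationship


schema1 = {"Rule1": "h1", "Rule2": "h2"}
schema2 = ChildSchema(schema1, ["Rule1", "Rule2"], {"Rule5": "h1.title"})
```

A typical sequence is to call `schema2.on_parent_delete({"Rule1": "h1"})` just before Rule1 is removed from `schema1`; the child then keeps crawling with its own copy of Rule1.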
According to the webpage crawling method provided by the embodiment of the invention, when at least part of the rules can be inherited, a new crawling framework can be created from the inherited rules and the remaining rules to be created; when it is determined that no rules can be inherited from the pre-created crawling framework, a crawling framework is created from all the required rules. To prevent crawled data from being confused when several rules corresponding to the same field attribute in the crawling framework match the same domain name, the rule used for crawling is determined according to a preset matching condition before the target webpage is crawled. When it is monitored that a rule in the first crawling framework is deleted, and/or the field attribute corresponding to a rule in the first crawling framework is deleted, or the first crawling framework is deleted, the deleted rule and/or field attribute is added to the second crawling framework and converted into local data, so that the configured rules remain valid. The inheritance relationship is then released, which avoids data calling errors and reduces unnecessary time consumption.
Further, as an implementation of the method in the above embodiment, another embodiment of the present invention further provides a device for crawling a web page. The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method.
Referring to fig. 5, the web page crawling apparatus includes an acquiring unit 31, a determining unit 32, a judging unit 33, an inheriting unit 34, a creating unit 35, and a crawling unit 36.
The acquiring unit 31 is configured to acquire a domain name of a target web page.
A determining unit 32 for determining rules matching the domain name.
Before crawling a target webpage, the acquiring unit 31 needs to acquire the whole webpage to be crawled, store the webpage to be crawled locally, and acquire the domain name of the target webpage. The rule matching the domain name is further determined by the determining unit 32, so that the crawling framework with the matching rule can accurately crawl the target web page.
A judging unit 33, configured to judge whether at least part of rules matching the domain name is included in the pre-created first crawling framework.
To avoid the time cost of repeated creation, after the rules corresponding to the target web page are determined, the judging unit 33 first judges whether an existing crawling framework already contains at least part of the rules matching the domain name of the target web page.
An inheriting unit 34, configured to inherit the at least part of the rules from the first crawling framework when the first crawling framework contains at least part of the rules matching the domain name.
If all or part of the rules matching the domain name are found from the existing first crawling framework, the inheriting unit 34 inherits all or part of the rules from the first crawling framework.
A creating unit 35, configured to create a second crawling framework according to the at least part of the rules.
A crawling unit 36, configured to crawl the target webpage through the second crawling framework.
After part or all of the rules matching the domain name of the target webpage are found in the first crawling framework, the rules that can be inherited and the rules still to be created are determined. The creating unit 35 then creates the second crawling framework from the inherited rules and the rules to be created, so that not every rule in the second crawling framework has to be created from scratch, which reduces the time needed to create the second crawling framework.
Optionally, the creating unit 35 is further configured to create a third crawling framework having rules matching the domain name when at least part of the rules matching the domain name is not included in the first crawling framework.
The crawling unit 36 is further configured to crawl the target webpage through the third crawling framework.
When no rule can be inherited from the pre-created crawling framework, all rules matching the domain name need to be newly created, and the creating unit 35 then creates the third crawling framework from the newly created rules. The crawling unit 36 then crawls the target webpage through the created third crawling framework.
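The two creation paths (a second framework that inherits, or a third framework built entirely from new rules) can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the `Rule` and `Framework` classes, their attributes, and the function `build_framework` are hypothetical and are not defined by the patent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    name: str
    scope: str       # domain name the rule applies to
    field_attr: str  # field attribute the rule extracts


@dataclass
class Framework:
    name: str
    # Rules stored in this framework itself.
    local_rules: dict[str, Rule] = field(default_factory=dict)
    # Rule name -> name of the framework the rule is inherited from.
    inherited: dict[str, str] = field(default_factory=dict)


def build_framework(name: str, required: list[Rule], first: Framework) -> Framework:
    """Reference rules already present in `first`; create the rest locally."""
    new_fw = Framework(name)
    for rule in required:
        if first.local_rules.get(rule.name) == rule:
            new_fw.inherited[rule.name] = first.name  # inherit, do not copy
        else:
            new_fw.local_rules[rule.name] = rule      # newly created rule
    return new_fw


title = Rule("title", "example.com", "title")
author = Rule("author", "example.com", "author")
price = Rule("price", "shop.example.org", "price")
first = Framework("first", {"title": title})

# Partial inheritance: corresponds to the second crawling framework.
second = build_framework("second", [title, author], first)
# Nothing can be inherited: corresponds to the third crawling framework.
third = build_framework("third", [price], first)
print(second.inherited, list(second.local_rules), third.inherited)
```

In this sketch a single function covers both cases; whether the result plays the role of the second or the third framework depends only on whether any rule could be inherited.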
Optionally, the case in which the first crawling framework contains at least part of the rules matching the domain name includes: the first crawling framework contains all rules matching the domain name; or the first crawling framework contains part of the rules matching the domain name.
The case in which the first crawling framework does not contain at least part of the rules matching the domain name includes: the first crawling framework does not contain any rule matching the domain name.
Optionally, the determining unit 32 is further configured to, when the first crawling framework contains part of the rules matching the domain name, determine the rules to be created according to the rules matching the domain name and the part of the rules that the first crawling framework contains.
After some rules have been inherited from other crawling frameworks, the determining unit 32 determines the rules to be created as the remaining rules that could not be inherited, so that the inherited rules and the rules to be created together form all the rules required for crawling the target webpage.
The creating unit 35 is further configured to create the rules to be created.
After the rules to be created are determined, the creating unit 35 creates them. When it is determined that the currently inherited field attributes and rule sets do not include the field attribute and rule set corresponding to a rule to be created, the corresponding field attribute and rule set are created together with that rule.
The creating unit 35 is further configured to create the second crawling framework according to the created rules and the part of the rules matching the domain name.
After the rules to be created have been created, the creating unit 35 creates the second crawling framework according to the created rules and the inherited part of the rules matching the domain name.
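Determining the rules to be created amounts to a difference between the required rules and the inherited ones, plus a check for field attributes that are not yet available. A minimal Python sketch, assuming rules are represented as a mapping from rule name to field attribute name (the function `plan_creation` and the sample names are hypothetical):

```python
def plan_creation(
    required: dict[str, str], inherited: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Return the rules to create and the field attributes to create with them.

    Both arguments map a rule name to the name of its field attribute.
    """
    # Rules that are required but could not be inherited.
    to_create = {
        name: attr for name, attr in required.items() if name not in inherited
    }
    # Field attributes that no inherited rule already brings along.
    known_attrs = set(inherited.values())
    new_attrs = sorted({a for a in to_create.values() if a not in known_attrs})
    return to_create, new_attrs


required = {"title_rule": "title", "author_rule": "author", "title_rule_v2": "title"}
inherited = {"title_rule": "title"}

to_create, new_attrs = plan_creation(required, inherited)
print(to_create)   # rules that still have to be created
print(new_attrs)   # field attributes created together with them
```

In this example `title_rule_v2` has to be created but reuses the inherited `title` field attribute, whereas `author_rule` also requires a new `author` field attribute.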
Optionally, the first crawling framework and the second crawling framework each include field attributes; a field attribute is used for determining the fields and field types that the crawling framework needs to crawl, and each rule has a corresponding field attribute.
Optionally, the judging unit 33 is further configured to judge whether, among the rules corresponding to each field attribute, the number of rules matching the same domain name is greater than 1.
To avoid confusing the crawled data when several rules match while the same field is crawled from webpages under the same domain name, before the target webpage is crawled with the crawling framework, the judging unit 33 further judges whether the rules in the crawling framework that match the domain name of the target webpage include several rules that have the same scope and correspond to the same field attribute.
The determining unit 32 is further configured to, when the judgment is yes, determine the rule used for crawling from the matched rules according to preset matching conditions.
When the judging unit 33 judges that, among the rules corresponding to a field attribute, the number of rules matching the same domain name is greater than 1, the determining unit 32 may select, from the rules that correspond to the same field attribute and have the same scope, the rule with the most recent creation time, or the rule that is stored locally, and determine it as the rule used for crawling.
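The two matching conditions named above (most recent creation time, or locally stored rule) can be sketched in Python. The `Rule` fields, the `prefer` parameter, and the fallback to the newest rule when several local rules compete are assumptions of this sketch, not details given in the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    scope: str        # domain name the rule applies to
    field_attr: str   # field attribute the rule extracts
    created_at: int   # creation timestamp; larger means newer
    is_local: bool    # True if stored in the framework itself, not inherited


def select_rules(
    matched: list[Rule], prefer: str = "newest"
) -> dict[tuple[str, str], Rule]:
    """Pick exactly one rule per (field attribute, scope) pair."""
    groups: dict[tuple[str, str], list[Rule]] = defaultdict(list)
    for rule in matched:
        groups[(rule.field_attr, rule.scope)].append(rule)

    chosen = {}
    for key, group in groups.items():
        candidates = group
        if prefer == "local" and len(group) > 1:
            # Prefer locally stored rules; fall back to all if none is local.
            candidates = [r for r in group if r.is_local] or group
        # Among the remaining candidates, take the most recently created.
        chosen[key] = max(candidates, key=lambda r: r.created_at)
    return chosen


matched = [
    Rule("title_old", "example.com", "title", 100, False),
    Rule("title_new", "example.com", "title", 200, False),
    Rule("title_local", "example.com", "title", 150, True),
    Rule("author", "example.com", "author", 50, False),
]

newest = select_rules(matched)
local = select_rules(matched, prefer="local")
print(newest[("title", "example.com")].name, local[("title", "example.com")].name)
```

With the sample data, the "newest" condition selects `title_new` and the "local" condition selects `title_local`; the single `author` rule is used directly in both cases.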
The crawling unit 36 is further configured to crawl the target webpage according to the determined rule by using the second crawling framework or the third crawling framework.
After the second crawling framework has been created from the inherited rules and the rules to be created, or the third crawling framework has been created from the rules matching the domain name, and it has been determined either that the rules in the crawling framework can be used directly or which rule is selected for crawling according to the matching conditions, the crawling unit 36 crawls the target webpage through the created crawling framework and the determined rule.
Optionally, referring to fig. 6, the apparatus further includes:
An adding unit 37, configured to add the rule and/or the field attribute corresponding to the rule to the second crawling framework when it is monitored that a rule in the first crawling framework is deleted, and/or a field attribute corresponding to a rule in the first crawling framework is deleted, or the first crawling framework is deleted.
When it is monitored that an inherited rule is deleted from the first crawling framework, or that an inherited rule and its corresponding field attribute are deleted from the first crawling framework, or that the field attribute corresponding to an inherited rule is deleted from the first crawling framework, or that the whole first crawling framework is deleted, the adding unit 37 adds the deleted rule and/or field attribute to the second crawling framework, converting it into data belonging to the second crawling framework, so that the locally configured rule remains valid when the webpage is crawled.
A releasing unit 38, configured to release the inheritance relationship between the first crawling framework and the second crawling framework.
When the first crawling framework is deleted, or the first crawling framework no longer contains any rule that the second crawling framework needs to call, the releasing unit 38 releases the inheritance relationship between the first crawling framework and the second crawling framework, so that the first crawling framework no longer needs to be called when the second crawling framework is later used to crawl webpages, which avoids data calling errors and reduces unnecessary time consumption.
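The deletion handling can be sketched in Python as follows. This is a simplified model under stated assumptions: the `Framework` class, the `delete_rule` function, and the XPath strings used as rule definitions are hypothetical, and field attributes are left out to keep the example short.

```python
class Framework:
    def __init__(self, name, rules=None):
        self.name = name
        self.rules = dict(rules or {})  # locally stored rules: name -> definition
        self.parent = None              # framework this one inherits from
        self.inherited = set()          # names of rules borrowed from the parent

    def inherit(self, parent, rule_names):
        self.parent = parent
        self.inherited = set(rule_names)

    def resolve(self, rule_name):
        """Look up a rule locally first, then in the parent framework."""
        if rule_name in self.rules:
            return self.rules[rule_name]
        if self.parent is not None and rule_name in self.inherited:
            return self.parent.rules[rule_name]
        raise KeyError(rule_name)


def delete_rule(parent, rule_name, children):
    """Delete a rule from `parent`, first localizing it in inheriting children."""
    deleted = parent.rules.pop(rule_name)
    for child in children:
        if child.parent is parent and rule_name in child.inherited:
            child.rules[rule_name] = deleted      # becomes local data
            child.inherited.discard(rule_name)
            if not child.inherited:
                child.parent = None               # release the inheritance


first = Framework("first", {"title": "//h1/text()", "author": "//span[@class='by']/text()"})
second = Framework("second", {"body": "//article//p/text()"})
second.inherit(first, ["title", "author"])

delete_rule(first, "title", [second])
still_linked = second.parent is first   # "author" is still inherited
delete_rule(first, "author", [second])
print(still_linked, second.parent, sorted(second.rules))
```

After both deletions the second framework holds all three rules locally and no longer refers to the first framework, so later lookups never touch it.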
In the prior art, when a webpage is crawled, a pre-created crawling framework can only be referenced as a whole, and a new crawling framework cannot be formed from it. By contrast, the webpage crawling device provided by the embodiment of the invention first judges, through the judging unit 33, whether the pre-created first crawling framework contains at least part of the rules matching the domain name. After this judgment, a second crawling framework is created through the inheriting unit 34 and the creating unit 35 according to the inherited rules, so that the second crawling framework contains both rules inherited from other crawling frameworks and newly created rules. Because the second crawling framework inherits part of its rules from the previously created crawling framework, the time for creating the crawling framework is reduced, and the newly created crawling framework can still meet the crawling requirements of the current webpage. In addition, to avoid confusing the crawled data when several rules corresponding to the same field attribute in the crawling framework match the same domain name, the judging unit 33 and the determining unit 32 determine the rule used for crawling according to preset matching conditions before the target webpage is crawled with the crawling framework. Moreover, when it is monitored that a rule in the first crawling framework is deleted, and/or a field attribute corresponding to a rule in the first crawling framework is deleted, or the first crawling framework is deleted, the adding unit 37 adds the deleted rule and/or field attribute to the second crawling framework, converting it into local data, so that the configured rule remains valid. The releasing unit 38 then releases the inheritance relationship, which avoids data calling errors and reduces unnecessary time consumption.
The web page crawling device comprises a processor and a memory, wherein the determining unit, the judging unit, the creating unit, the inheriting unit, the crawling unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor includes a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, the prior-art problem that a new crawling framework cannot be formed from a pre-created crawling framework to crawl a webpage is solved.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium, on which a program is stored, where the program, when executed by a processor, implements the following web page crawling method:
acquiring a domain name of a target webpage, and determining rules matching the domain name;
judging whether a pre-created first crawling framework contains at least part of the rules matching the domain name;
if the first crawling framework contains at least part of the rules matching the domain name, inheriting the at least part of the rules from the first crawling framework;
creating a second crawling framework according to the at least part of the rules, and crawling the target webpage through the second crawling framework.
The embodiment of the invention provides a processor, which is used for running a program, wherein the program executes the following webpage crawling method during running:
acquiring a domain name of a target webpage, and determining rules matching the domain name;
judging whether a pre-created first crawling framework contains at least part of the rules matching the domain name;
if the first crawling framework contains at least part of the rules matching the domain name, inheriting the at least part of the rules from the first crawling framework;
creating a second crawling framework according to the at least part of the rules, and crawling the target webpage through the second crawling framework.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps:
acquiring a domain name of a target webpage, and determining a rule matched with the domain name;
judging whether a first crawling framework created in advance contains at least part of rules matched with the domain name;
if the first crawling framework comprises at least part of rules matched with the domain name, inheriting the at least part of rules from the first crawling framework;
and creating a second crawling framework according to at least part of the rules, and crawling the target webpage through the second crawling framework.
Optionally, the method further includes:
if the first crawling framework does not contain at least part of rules matched with the domain name, creating a third crawling framework with rules matched with the domain name;
and crawling the target webpage through the third crawling framework.
Optionally, the case in which the first crawling framework contains at least part of the rules matching the domain name includes:
the first crawling framework contains all rules matching the domain name; or
the first crawling framework contains part of the rules matching the domain name;
the case in which the first crawling framework does not contain at least part of the rules matching the domain name includes:
the first crawling framework does not contain any rule matching the domain name.
Optionally, if the first crawling framework contains part of the rules matching the domain name, the method further includes:
determining rules to be created according to the rules matching the domain name and the part of the rules matching the domain name;
creating the rules to be created;
the creating a second crawling framework according to the at least part of the rules includes:
creating the second crawling framework according to the created rules and the part of the rules matching the domain name.
Optionally, the first crawling framework and the second crawling framework each include field attributes; a field attribute is used for determining the fields and field types that the crawling framework needs to crawl, and each rule has a corresponding field attribute.
Optionally, the method further includes: judging whether, among the rules corresponding to each field attribute, the number of rules matching the same domain name is greater than 1;
if yes, determining the rule used for crawling from the matched rules according to preset matching conditions;
the crawling of the target webpage through the second crawling framework or the third crawling framework includes:
crawling, by the second crawling framework or the third crawling framework, the target webpage according to the determined rule.
Optionally, the method further includes:
when the rule in the first crawling framework is monitored to be deleted, and/or the field attribute corresponding to the rule in the first crawling framework is monitored to be deleted, or the first crawling framework is monitored to be deleted, the rule and/or the field attribute corresponding to the rule are added to the second crawling framework;
and releasing the inheritance relationship of the first crawling framework and the second crawling framework.
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:
acquiring a domain name of a target webpage, and determining rules matching the domain name;
judging whether a pre-created first crawling framework contains at least part of the rules matching the domain name;
if the first crawling framework contains at least part of the rules matching the domain name, inheriting the at least part of the rules from the first crawling framework;
creating a second crawling framework according to the at least part of the rules, and crawling the target webpage through the second crawling framework.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for crawling a web page, the method comprising:
acquiring a domain name of a target webpage, and determining a rule matched with the domain name;
judging whether a first crawling framework created in advance contains at least part of rules matched with the domain name;
if the first crawling framework comprises at least part of rules matched with the domain name, inheriting the at least part of rules from the first crawling framework;
and creating a second crawling framework according to at least part of the rules, and crawling the target webpage through the second crawling framework.
2. The method of claim 1, further comprising:
if the first crawling framework does not contain at least part of rules matched with the domain name, creating a third crawling framework with rules matched with the domain name;
and crawling the target webpage through the third crawling framework.
3. The method of claim 2, wherein:
the first crawling framework containing at least part of the rules matching the domain name comprises:
the first crawling framework contains all rules matching the domain name; or
the first crawling framework contains part of the rules matching the domain name;
the first crawling framework not containing at least part of the rules matching the domain name comprises:
the first crawling framework does not contain any rule matching the domain name.
4. The method of claim 3, wherein if the first crawling framework contains part of the rules matching the domain name, the method further comprises:
determining rules to be created according to the rules matching the domain name and the part of the rules matching the domain name;
creating the rules to be created;
the creating a second crawling framework according to the at least part of the rules comprises:
creating the second crawling framework according to the created rules and the part of the rules matching the domain name.
5. The method of claim 2, wherein the first crawling framework and the second crawling framework each comprise field attributes, a field attribute is used for determining the fields and field types that the crawling framework needs to crawl, and each rule has a corresponding field attribute.
6. The method of claim 5, further comprising:
judging whether the number of rules matched with the same domain name in the rules corresponding to each field attribute is greater than 1;
if yes, determining a rule for crawling from the matched rules according to preset matching conditions;
the crawling of the target webpage through the second crawling framework or the third crawling framework comprises:
and the second crawling framework or the third crawling framework crawls the target webpage according to the determined rule.
7. The method of claim 1, further comprising:
when the rule in the first crawling framework is monitored to be deleted, and/or the field attribute corresponding to the rule in the first crawling framework is monitored to be deleted, or the first crawling framework is monitored to be deleted, the rule and/or the field attribute corresponding to the rule are added to the second crawling framework;
and releasing the inheritance relationship of the first crawling framework and the second crawling framework.
8. An apparatus for crawling web pages, the apparatus comprising:
the acquisition unit is used for acquiring the domain name of the target webpage;
the determining unit is used for determining a rule matched with the domain name;
the judgment unit is used for judging whether a first crawling framework which is created in advance contains at least part of rules matched with the domain name;
the inheritance unit is used for inheriting at least part of rules from the first crawling framework when the first crawling framework contains at least part of rules matched with the domain name;
a creating unit, configured to create a second crawling framework according to the at least part of the rule;
and the crawling unit is used for crawling the target webpage through the second crawling framework.
9. A storage medium, comprising a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the web page crawling method according to any one of claims 1 to 7.
10. A processor, configured to execute a program, wherein the program executes the web page crawling method according to any one of claims 1 to 7.
CN201811145540.3A 2018-09-29 2018-09-29 Webpage crawling method and device Active CN110968756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811145540.3A CN110968756B (en) 2018-09-29 2018-09-29 Webpage crawling method and device

Publications (2)

Publication Number Publication Date
CN110968756A 2020-04-07
CN110968756B 2023-05-12

Family

ID=70027138

Country Status (1)

Country Link
CN (1) CN110968756B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant