CN110968756B - Webpage crawling method and device - Google Patents

Info

Publication number
CN110968756B
Authority
CN
China
Prior art keywords
crawling
rules
domain name
rule
architecture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811145540.3A
Other languages
Chinese (zh)
Other versions
CN110968756A (en)
Inventor
何熠皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201811145540.3A
Publication of CN110968756A
Application granted
Publication of CN110968756B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage crawling method and device and relates to the technical field of crawling. It mainly addresses the problem that the prior art cannot form a new crawling architecture from a pre-created crawling architecture in order to crawl a web page. The method comprises: acquiring the domain name of a target webpage and determining the rules matched with the domain name; judging whether a pre-created first crawling framework contains at least part of the rules matched with the domain name; if so, inheriting the at least part of the rules from the first crawling framework; and creating a second crawling framework according to the at least part of the rules and crawling the target webpage through the second crawling framework. The invention is broadly applicable to web page crawling scenarios.

Description

Webpage crawling method and device
Technical Field
The invention relates to the technical field of crawling, in particular to a method and a device for crawling web pages.
Background
With the rapid development of network technology, more and more information is carried in the world wide web, and searching for information using conventional search engine technology consumes a lot of time, and it is difficult to accurately search for information required by users. How to quickly and effectively obtain information needed by a user from a large amount of network information becomes a problem to be solved.
To solve this problem, web crawlers were developed. A web crawler is a program or script that automatically captures web information according to certain rules and can quickly and accurately obtain the information a user needs from a web page. Before crawling a webpage, a user first needs to create a crawling architecture (Schema). Because creating a complete crawling architecture takes a great deal of time and effort, the prior art provides a sharing mechanism through which crawling architectures created in advance by other users can be shared. A user can then directly reference a pre-created crawling architecture to crawl a webpage when needed, but cannot use that pre-created crawling architecture as the basis for a new crawling architecture that meets the crawling requirements of the current webpage.
Disclosure of Invention
In view of this, the method and device for crawling web pages provided by the invention mainly aim to solve the problem that a new crawling architecture cannot be formed based on a pre-established crawling architecture to crawl web pages in the prior art.
In order to solve the problems, the invention mainly provides the following technical scheme:
In a first aspect, the present invention provides a method for crawling a web page, the method comprising:
acquiring a domain name of a target webpage, and determining a rule matched with the domain name;
judging whether a pre-created first crawling framework contains at least part of the rules matched with the domain name;
if the first crawling framework contains at least part of the rules matched with the domain name, inheriting the at least part of the rules from the first crawling framework;
and creating a second crawling framework according to the at least part of the rules, and crawling the target webpage through the second crawling framework.
Optionally, the method further comprises:
if the first crawling framework does not contain at least part of rules matched with the domain name, creating a third crawling framework with rules matched with the domain name;
and crawling the target webpage through the third crawling architecture.
Optionally, the first crawling framework containing at least part of the rules matched with the domain name comprises:
the first crawling architecture contains all of the rules matched with the domain name; or
the first crawling architecture contains part of the rules matched with the domain name;
and the first crawling framework not containing at least part of the rules matched with the domain name comprises:
the first crawling framework does not contain any rule matched with the domain name.
Optionally, if the first crawling architecture includes a part of rules matched with the domain name, the method further includes:
determining a rule to be created according to the rule matched with the domain name and the partial rule matched with the domain name;
creating the rule to be created;
the creating a second crawling architecture according to the at least part of the rules comprises:
creating the second crawling architecture according to the newly created rule and the partial rules matched with the domain name.
Optionally, the first crawling architecture and the second crawling architecture each include field attributes, which are used to determine the fields and the field types that the crawling architecture needs to crawl, and each rule has a corresponding field attribute.
Optionally, the method further comprises:
judging whether the number of rules matched with the same domain name in the rules corresponding to each field attribute is greater than 1;
if yes, determining a rule for crawling from the matched rules according to a preset matching condition;
the crawling the target web page through the second crawling framework or the third crawling framework includes:
the second crawling framework or the third crawling framework crawling the target webpage according to the determined rule.
Optionally, the method further comprises:
when the rule in the first crawling framework is deleted, and/or a field attribute corresponding to the rule in the first crawling framework is deleted, or the first crawling framework is deleted, adding the rule and/or the field attribute corresponding to the rule into the second crawling framework;
and releasing the inheritance relationship of the first crawling framework and the second crawling framework.
In a second aspect, the present invention also provides a web crawling apparatus, the apparatus comprising:
the acquisition unit is used for acquiring the domain name of the target webpage;
a determining unit, configured to determine a rule matched with the domain name;
the judging unit is used for judging whether the first pre-established crawling framework contains at least part of rules matched with the domain name or not;
an inheritance unit, configured to inherit at least a part of rules from the first crawling framework when the first crawling framework contains at least a part of rules matched with the domain name;
a creating unit, configured to create a second crawling architecture according to the at least part of the rules;
and a crawling unit, configured to crawl the target webpage through the second crawling framework.
Optionally, the creating unit is further configured to create a third crawling framework with a rule matching the domain name when at least a part of the rules matching the domain name are not included in the first crawling framework;
the crawling unit is further configured to crawl the target web page through the third crawling architecture.
Optionally, the first crawling framework containing at least part of the rules matched with the domain name comprises:
the first crawling architecture contains all of the rules matched with the domain name; or
the first crawling architecture contains part of the rules matched with the domain name;
and the first crawling framework not containing at least part of the rules matched with the domain name comprises:
the first crawling framework does not contain any rule matched with the domain name.
Optionally, the determining unit is further configured to determine, when the first crawling architecture contains part of the rules matched with the domain name, a rule to be created according to the rules matched with the domain name and the partial rules matched with the domain name;
the creating unit is further configured to create the rule to be created;
the creating unit is further configured to create the second crawling architecture according to the newly created rule and the partial rules matched with the domain name.
Optionally, the first crawling architecture and the second crawling architecture each include field attributes, which are used to determine the fields and the field types that the crawling architecture needs to crawl, and each rule has a corresponding field attribute.
Optionally, the judging unit is further configured to judge whether the number of rules matched with the same domain name in the rules corresponding to each field attribute is greater than 1;
the determining unit is further configured to determine, when the result of the judging unit is yes, a rule for crawling from the matched rules according to a preset matching condition;
the crawling unit is further configured to crawl the target web page by using the second crawling framework or the third crawling framework according to the determined rule.
Optionally, the apparatus further includes:
an adding unit, configured to add the rule and/or a field attribute corresponding to the rule to the second crawling framework when it is monitored that the rule in the first crawling framework is deleted, and/or a field attribute corresponding to the rule in the first crawling framework is deleted, or the first crawling framework is deleted;
and a releasing unit, configured to release the inheritance relationship between the first crawling framework and the second crawling framework.
In a third aspect, in order to achieve the above object, the present invention further provides a storage medium, where the storage medium includes a stored program, and when the program runs, the device on which the storage medium resides is controlled to execute the method for crawling a web page according to the first aspect.
In a fourth aspect, in order to achieve the above object, the present invention further provides a processor, where the processor is configured to run a program, where the program executes the method for crawling a web page according to the first aspect.
By means of the technical scheme, the technical scheme provided by the invention has at least the following advantages:
Compared with the prior art, in which a pre-created crawling architecture can only be referenced in its entirety and cannot be used to form a new crawling architecture, the method and device for crawling web pages provided by the invention first determine whether the pre-created first crawling framework contains at least part of the rules matched with the domain name of the target webpage and, if so, inherit those rules. After the determination, a second crawling architecture is created according to the inherited rules, so that the second crawling architecture contains both rules inherited from other crawling architectures and newly created rules. Because the second crawling framework inherits part of its rules from the previously created crawling framework, the time needed to create a crawling framework is reduced, and the newly created crawling framework can meet the crawling requirement of the current webpage.
The foregoing is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments are set out below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 shows a flowchart of a method for crawling web pages provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating the organization of a first crawling framework according to an embodiment of the present invention;
FIG. 3 illustrates a schematic organization of a first crawling framework provided by an embodiment of the present invention;
FIG. 4 is a flowchart illustrating another method for crawling web pages according to an embodiment of the present invention;
FIG. 5 shows a block diagram of a web crawling apparatus according to an embodiment of the present invention;
FIG. 6 shows a block diagram of another web crawling apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Referring to fig. 1, an embodiment of the present invention provides a method for crawling a web page, where the method mainly includes:
101. Acquire the domain name of the target webpage and determine a rule matched with the domain name.
Before the content of a web page is crawled, the whole target web page to be crawled is first acquired, the domain name of the target web page is obtained, and the rules matched with the domain name are then determined, so that a crawling framework containing the matched rules can subsequently crawl the target web page accurately. Further, the target web page may be saved locally so that subsequent crawling can be performed locally.
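Step 101 can be sketched as follows. The patent does not say how the domain name is derived from the page address, so the helper name `scope_of` and its "last two host labels" heuristic are assumptions made purely for illustration; the heuristic gives the wrong answer for multi-part suffixes such as `.com.cn`.

```python
from urllib.parse import urlsplit


def scope_of(url: str) -> str:
    """Return a naive domain scope (the last two host labels) for a URL.

    Hypothetical helper: a real crawler would likely consult a public
    suffix list instead of this heuristic.
    """
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


print(scope_of("https://new.qq.com/omn/20180730A01I2200"))  # qq.com
```

The returned string plays the role of the rule scope that later steps compare against.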
102. It is determined whether the pre-created first crawling framework contains at least a portion of rules that match the domain name.
The pre-created first crawling framework is a crawling framework that another user has created and shared publicly. As shown in FIG. 2, a crawling architecture defines a crawling scheme for web pages. When a user runs a crawling task with a crawler, the different field attributes (properties) in the crawling architecture identify the different fields of the web page that the crawler needs to crawl, together with the types of those fields. A corresponding rule set is configured for each field attribute in the crawling architecture. The rule set contains a plurality of rules, each rule is the parsing rule configured for that field attribute on a particular web page, and the scope of a rule is a domain name. For example, a crawling architecture for crawling news pages includes a first field attribute and a second field attribute: the first field attribute identifies the title field to be crawled, whose type is text; the second field attribute identifies the publication time field to be crawled, whose type is time. The rule configured for the first field attribute on the page https://new.qq.com/omn/20180730A01I2200 is: #root > div > div > div.qq_content.clearfix > div.LEFT > h1, and the scope of the rule is qq.com. To avoid the time cost of repeated creation, after the rules corresponding to the target webpage have been determined, the existing crawling framework is searched for at least part of the rules matched with the domain name of the target webpage. Containing at least part of the rules matched with the domain name covers two cases: the first crawling framework contains all of the rules matched with the domain name, or it contains only part of them.
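The structure described above (a schema holding field attributes, each with a rule set whose rules are scoped to a domain name) might be modelled as in the sketch below. The class and attribute names are assumptions, not names taken from the patent, and the selector is abbreviated from the news-page example.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    scope: str      # domain name the rule applies to, e.g. "qq.com"
    selector: str   # parsing rule, here a CSS selector


@dataclass
class Property:
    name: str        # field to crawl, e.g. "title"
    field_type: str  # type of the field, e.g. "text" or "time"
    rule_set: list[Rule] = field(default_factory=list)


@dataclass
class Schema:
    name: str
    properties: list[Property] = field(default_factory=list)

    def rules_for(self, domain: str) -> list[Rule]:
        """Collect every rule, across all field attributes, scoped to `domain`."""
        return [r for p in self.properties for r in p.rule_set if r.scope == domain]


title = Property("title", "text", [Rule("Rule1", "qq.com", "div.LEFT > h1")])
news = Schema("Schema1", [title, Property("publish_time", "time")])
print([r.name for r in news.rules_for("qq.com")])  # ['Rule1']
```

Keeping the scope on the rule rather than on the schema is what lets one schema serve several domains.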
103. If the first crawling framework contains at least part of the rules matched with the domain name, inheriting the at least part of the rules from the first crawling framework.
If all or part of rules matched with the domain name are found from the existing first crawling framework, the all or part of rules are inherited from the first crawling framework. As shown in fig. 2 and 3, the pre-created Schema1 includes partial rules Rule1 and Rule2 matching the domain name, i.e. Rule1 and Rule2 can be inherited from Schema 1.
104. And creating a second crawling framework according to the at least part of rules, and crawling the target webpage through the second crawling framework.
After determining some or all rules matching the domain name of the target web page from the first crawling framework, determining rules which can be inherited and rules to be created, and then creating the rules to be created. The second crawling framework is then created based on the inherited rules and the created rules without taking time to create all rules in the second crawling framework to reduce the time taken to create the second crawling framework. When the second crawling framework inherits the first crawling framework and the second crawling framework is utilized to crawl the webpage, the rules in the first crawling framework are utilized to crawl the webpage at the same time.
Compared with the prior art, in which a pre-created crawling architecture can only be referenced in its entirety and cannot be used to form a new crawling architecture, the method for crawling web pages provided by the embodiment of the invention first determines whether the pre-created first crawling framework contains at least part of the rules matched with the domain name of the target webpage and, if so, inherits those rules. After the determination, a second crawling architecture is created according to the inherited rules, so that the second crawling architecture contains both rules inherited from other crawling architectures and newly created rules. Because the second crawling framework inherits part of its rules from the previously created crawling framework, the time needed to create a crawling framework is reduced, and the newly created crawling framework can meet the crawling requirement of the current webpage.
Based on the method for crawling web pages shown in fig. 1, another embodiment of the present invention further provides another method for crawling web pages, referring to fig. 4, the method mainly includes:
201. Acquire the domain name of the target webpage and determine a rule matched with the domain name.
Assume that the domain name of the target webpage is qq.com. As shown in FIG. 2 and FIG. 3, the rule determined to match the domain name qq.com is Rule1. Further, the target web page may be a plurality of web pages having different domain names; for example, the first domain name of a first target web page is qq.com and the second domain name of a second target web page is sina.com. In that case, Rule1 is determined as the rule matching the first domain name, and Rule2 is determined as the rule matching the second domain name.
202. Judge whether the pre-created first crawling framework contains at least part of the rules matched with the domain name; if yes, go to step 203; if not, go to step 207.
The first crawling framework containing at least part of the rules matched with the domain name covers the following two cases: the first crawling architecture contains all of the rules matched with the domain name; or the first crawling architecture contains part of the rules matched with the domain name. The first crawling framework not containing at least part of the rules matched with the domain name means that the first crawling framework does not contain any rule matched with the domain name. Whether the first crawling framework contains at least part of the rules matched with the domain name is judged according to these cases.
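The three cases in step 202 amount to comparing two sets of rules. The following minimal sketch assumes rules can be identified by name; the function name `classify_match` and the example rule sets are hypothetical.

```python
def classify_match(required: set, available: set) -> str:
    """Classify how much of `required` the first framework already provides."""
    matched = required & available
    if not matched:
        return "none"      # nothing can be inherited
    if matched == required:
        return "all"       # every required rule can be inherited
    return "partial"       # some rules inherited, the rest must be created


schema1_rules = {"Rule1", "Rule2"}
print(classify_match({"Rule1", "Rule2"}, schema1_rules))           # all
print(classify_match({"Rule1", "Rule2", "Rule5"}, schema1_rules))  # partial
print(classify_match({"Rule6"}, schema1_rules))                    # none
```

Both "all" and "partial" correspond to the "yes" branch of the judgment; only "none" leads to creating a third crawling framework from scratch.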
203. The at least partial rule is inherited from the first crawling framework.
After it is determined that the pre-created crawling framework contains partial rules matched with the domain name, those partial rules are inherited. Further, when a rule is inherited, the field attribute and the rule set corresponding to the rule are inherited at the same time. As shown in FIG. 2 and FIG. 3, since the inherited Rule1 and Rule2 correspond to a rule set (Rule set1) and a field attribute (Property1), Property1, Rule set1, Rule1 and Rule2 are inherited from Schema1 according to this correspondence, which further reduces the data that must be newly created.
204. And determining the rule to be created according to the rule matched with the domain name and the partial rule matched with the domain name.
The rules matched with the domain name of the target webpage are determined in advance according to requirements. After part of the rules have been inherited from other crawling frameworks, the rules to be created are determined as the remaining rules that cannot be inherited, so that the inherited rules and the rules to be created together form all of the rules required for crawling the target webpage.
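Determining the rules to be created is a difference between what is required and what can be inherited. The sketch below is one way to express it; `plan_rules` is a hypothetical name, and the rule names loosely follow the FIG. 3 example.

```python
def plan_rules(required: list, available: list) -> tuple:
    """Split `required` rules into those inherited and those still to be created."""
    available_set = set(available)
    inherited = [r for r in required if r in available_set]
    to_create = [r for r in required if r not in available_set]
    return inherited, to_create


inherited, to_create = plan_rules(
    ["Rule1", "Rule2", "Rule5", "Rule6", "Rule7"],  # rules the target needs
    ["Rule1", "Rule2"],                             # rules found in Schema1
)
print(inherited)  # ['Rule1', 'Rule2']
print(to_create)  # ['Rule5', 'Rule6', 'Rule7']
```

Lists are used instead of sets so that the order of the required rules is preserved in both results.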
205. The rule to be created is created.
After the rule to be created has been determined, it is created. Further, if the field attribute and rule set corresponding to the rule to be created are not among the field attributes and rule sets already inherited, the corresponding field attribute and rule set are created together with the rule. As shown in FIG. 3, Rule5, Rule6 and Rule7 are created, and Property3 and Rule set3 corresponding to Rule6 and Rule7 are created.
206. And creating a second crawling architecture according to the created rule to be created and the partial rule matched with the domain name.
As shown in FIG. 3, the inherited Rule1, Rule2, Property1 and Rule set1 are obtained, and Rule5, Rule6, Rule7, Property3 and Rule set3 are created, in order to create Schema2. Specifically, according to the correspondence, the created Rule5 is associated with the inherited Property1 and Rule set1, and Schema2 is then created from the fully associated Rule1, Rule2, Rule5, Property1 and Rule set1, together with Rule6, Rule7, Property3 and Rule set3.
207. A third crawling framework is created having rules that match the domain name.
When no rule can be inherited from the pre-created crawling framework, all rules matched with the domain name need to be newly created, and a third crawling framework is then created according to the newly created rules.
208. Judging whether the number of rules matched with the same domain name in the rules corresponding to each field attribute is greater than 1; if yes, go to step 209; if not, step 210 is performed.
Before the target webpage is crawled with the crawling framework, it must also be determined whether the rules in the crawling framework that match the domain name of the target webpage include a plurality of rules that have the same scope and correspond to the same field attribute. If not, the crawling framework is used for crawling directly; if so, one rule must be selected from those rules before crawling is performed. This avoids the situation in which, when the same field is crawled from web pages under the same domain name, several rules match and the crawled data becomes confused.
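The check in step 208 can be sketched as grouping rules by field attribute and counting those that share the target domain. The tuple layout and the scopes given to Rule6 and Rule7 below are assumptions for illustration; the patent only states the scopes of Rule1, Rule2 and Rule5.

```python
from collections import defaultdict


def find_conflicts(rules, domain):
    """Return {field attribute: [rule names]} where more than one rule matches `domain`."""
    by_property = defaultdict(list)
    for prop, scope, name in rules:
        if scope == domain:
            by_property[prop].append(name)
    return {p: names for p, names in by_property.items() if len(names) > 1}


# (field attribute, scope, rule name); Rule6/Rule7 scopes are hypothetical.
schema2_rules = [
    ("Property1", "qq.com", "Rule1"),
    ("Property1", "sina.com", "Rule2"),
    ("Property1", "qq.com", "Rule5"),
    ("Property3", "qq.com", "Rule6"),
    ("Property3", "sina.com", "Rule7"),
]
print(find_conflicts(schema2_rules, "qq.com"))  # {'Property1': ['Rule1', 'Rule5']}
```

An empty result means the framework can be used for crawling directly; any entry signals that a matching condition must pick one rule.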
209. And determining the rule for crawling from the matched rules according to preset matching conditions.
The matching condition may be to extract one rule having the latest creation time from among rules corresponding to the same field attribute and having the same scope, or to extract one rule stored locally from among a plurality of rules corresponding to the same field attribute and having the same scope.
As shown in FIG. 3, in Schema2, Rule1 and Rule5 corresponding to Property1 are rules configured on the same domain name, so the rule finally used for crawling must be determined from Rule1 and Rule5 according to the preset matching condition. For example, when the matching condition selects the rule with the most recent creation time, Rule1 is determined as the rule for crawling, because Rule1 was modified in Schema1 on day 5 whereas Rule5 was created on day 1. When the matching condition selects a locally stored rule, Rule5 is determined as the rule for crawling, because Rule5 was created locally by the user whereas Rule1 is shared on the server by another user.
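The two matching conditions in this example can be sketched as a small selection function. The names `pick_rule`, `updated` and `is_local`, the concrete dates, and the fallback used when no local rule exists are all assumptions; the patent only names the two conditions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Rule:
    name: str
    scope: str
    updated: date    # creation or last-modification date
    is_local: bool   # True if the rule was created locally by the user


def pick_rule(candidates, condition):
    """Select the single rule to crawl with, according to `condition`."""
    if not candidates:
        return None
    if condition == "latest":
        return max(candidates, key=lambda r: r.updated)
    if condition == "local":
        local = [r for r in candidates if r.is_local]
        # Assumed fallback: with no local rule, keep the first candidate.
        return local[0] if local else candidates[0]
    raise ValueError(f"unknown matching condition: {condition}")


# Hypothetical dates standing in for "day 5" and "day 1".
rule1 = Rule("Rule1", "qq.com", date(2018, 1, 5), is_local=False)
rule5 = Rule("Rule5", "qq.com", date(2018, 1, 1), is_local=True)
print(pick_rule([rule1, rule5], "latest").name)  # Rule1
print(pick_rule([rule1, rule5], "local").name)   # Rule5
```

Both outcomes agree with the Rule1 and Rule5 example in the text.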
210. The built crawling architecture crawls the target web page according to the determined rules.
Once the second crawling framework has been created according to the inherited rules and the rules to be created, or the third crawling framework has been created according to the rules matched with the domain name, and it has been determined either that the rules in the crawling framework can be used directly or that one rule for crawling has been selected according to the matching condition, the target webpage can be crawled through the created crawling framework and the determined rules.
211. And when the rule in the first crawling framework is deleted and/or the field attribute corresponding to the rule in the first crawling framework is deleted or the first crawling framework is deleted, adding the rule and/or the field attribute corresponding to the rule into the second crawling framework.
When an inherited rule is deleted from the first crawling framework, or an inherited rule and its corresponding field attribute are deleted from the first crawling framework, or the field attribute corresponding to an inherited rule is deleted from the first crawling framework, or the whole first crawling framework is deleted, the deleted rule and/or field attribute is added to the second crawling framework, and the second crawling framework is formed from the added rule and/or field attribute together with the rules and/or field attributes originally in it. For example, when it is detected that Schema1 is deleted, or that Property1 in Schema1 is deleted, or that Rule1 is deleted, the deleted Property1 or Rule1 is added to Schema2, so that Property1 or Rule1 becomes data belonging to Schema2. In this way, the rules used by the local configuration remain valid when the web page is crawled.
212. And releasing the inheritance relationship of the first crawling framework and the second crawling framework.
When the first crawling framework is deleted or the rule that the second crawling framework needs to be called does not exist in the first crawling framework, the inheritance relation between the first crawling framework and the second crawling framework is released, so that when the second crawling framework is used for crawling the webpage subsequently, the first crawling framework does not need to be called any more, data calling errors are avoided, and unnecessary time consumption is reduced.
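Steps 211 and 212 can be sketched as copying a rule into the inheriting schema at the moment it is deleted from the parent, then dropping the parent link once nothing is inherited any more. The class layout and the function name `on_parent_rule_deleted` are assumptions, and the selectors are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Schema:
    name: str
    own_rules: dict = field(default_factory=dict)  # rule name -> selector
    parent: Optional["Schema"] = None
    inherited: set = field(default_factory=set)    # rule names taken from parent

    def resolve(self, rule_name: str) -> str:
        """Look a rule up locally first, then in the parent if inherited."""
        if rule_name in self.own_rules:
            return self.own_rules[rule_name]
        if self.parent is not None and rule_name in self.inherited:
            return self.parent.own_rules[rule_name]
        raise KeyError(rule_name)


def on_parent_rule_deleted(child: Schema, rule_name: str) -> None:
    """Move a rule being deleted from the parent into the child (step 211)."""
    parent = child.parent
    if parent is None or rule_name not in child.inherited:
        return
    child.own_rules[rule_name] = parent.own_rules.pop(rule_name)
    child.inherited.discard(rule_name)
    if not child.inherited:
        child.parent = None  # step 212: release the inheritance relationship


schema1 = Schema("Schema1", {"Rule1": "h1", "Rule2": "h2"})
schema2 = Schema("Schema2", {"Rule5": "h1.title"}, parent=schema1,
                 inherited={"Rule1", "Rule2"})
on_parent_rule_deleted(schema2, "Rule1")
on_parent_rule_deleted(schema2, "Rule2")
print(schema2.resolve("Rule1"), schema2.parent)  # h1 None
```

After the link is released, every lookup is served from the child's own rules, so the first crawling framework no longer needs to be called.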
According to the webpage crawling method provided by the embodiment of the invention, when it is determined that at least part of the rules can be inherited, a new crawling framework can be created according to the inherited rules and the remaining rules to be created; and when it is determined that no rule can be inherited from the pre-created crawling framework, a crawling framework is created according to all of the required rules. To avoid crawled data becoming confused because a plurality of rules corresponding to the same field attribute in the crawling framework match the same domain name, the rule used for crawling can be determined from those rules according to a preset matching condition before the crawling framework is used to crawl the target webpage. When a rule in the first crawling framework is deleted, and/or the field attribute corresponding to the rule in the first crawling framework is deleted, or the first crawling framework is deleted, the deleted rule and/or field attribute is added to the second crawling framework so that the configured rules remain valid, and the deleted rule and/or field attribute thereby becomes local data. The inheritance relationship is then released, which avoids data calling errors and reduces unnecessary time consumption.
Further, as an implementation of the method in the above embodiment, a further embodiment of the present invention further provides a web crawling apparatus. The embodiment of the device corresponds to the embodiment of the method, and for convenience of reading, details of the embodiment of the method are not repeated one by one, but it should be clear that the device in the embodiment can correspondingly realize all the details of the embodiment of the method.
Referring to fig. 5, the web crawling apparatus includes an acquisition unit 31, a determination unit 32, a judgment unit 33, an inheritance unit 34, a creation unit 35, and a crawling unit 36.
An acquisition unit 31, configured to acquire the domain name of the target web page.
A determining unit 32 for determining a rule matching the domain name.
Before crawling the target web page, the acquiring unit 31 is required to acquire the whole web page to be crawled, store the web page to be crawled locally, and acquire the domain name of the target web page. The rules matching the domain name are then further determined by the determining unit 32 in order to subsequently cause the crawling framework with the matching rules to crawl correctly on the target web page.
A judging unit 33, configured to judge whether the first pre-created crawling architecture includes at least part of rules matched with the domain name.
To avoid time-consuming repeated creation, after the rules corresponding to the target web page are determined, the judging unit 33 may first judge whether at least part of the rules matching the domain name of the target web page already exist in an existing crawling architecture.
An inheritance unit 34, configured to inherit at least a part of the rules from the first crawling framework when the first crawling framework contains at least a part of the rules matching the domain name.
If all or part of the rules matching the domain name are found from the existing first crawling framework, the inheritance unit 34 inherits all or part of the rules from the first crawling framework.
A creation unit 35 for creating a second crawling architecture according to at least part of the rules.
A crawling unit 36 for crawling the target web page by the second crawling framework.
After some or all of the rules matching the domain name of the target web page are found in the first crawling framework, the rules that can be inherited and the rules that still need to be created are determined. The creating unit 35 then creates the second crawling framework from the inherited rules and the rules to be created, so that not every rule in the second crawling framework has to be created from scratch, which reduces the time consumed in creating the second crawling framework.
Optionally, the creating unit 35 is further configured to create a third crawling framework with rules matching the domain name when at least part of the rules matching the domain name are not included in the first crawling framework.
The crawling unit 36 is further configured to crawl the target web page through a third crawling architecture.
When no rule can be inherited from a pre-created crawling framework, all rules matching the domain name must be newly created, and the creation unit 35 then creates a third crawling framework from the newly created rules. The crawling unit 36 then crawls the target web page through the created third crawling framework.
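The choice between the second and the third crawling framework described above can be sketched as follows. This is only an illustration under stated assumptions: the `Rule` and `Framework` classes, their attribute names, and the `build_framework` function are hypothetical and are not defined by the embodiment.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Rule:
    field_attr: str  # field the rule extracts, e.g. "title" (assumed representation)
    domain: str      # domain name the rule matches
    expr: str        # hypothetical extraction expression


@dataclass
class Framework:
    rules: list = field(default_factory=list)
    parent: Optional["Framework"] = None  # set only when rules are inherited


def build_framework(required: list, first: Framework) -> Framework:
    """Create the second framework if `first` holds any required rule, else the third."""
    inherited = [r for r in required if r in first.rules]
    if not inherited:
        # Nothing can be inherited: every rule is newly created (third framework).
        return Framework(rules=list(required))
    created = [r for r in required if r not in first.rules]
    # Second framework: inherited rules plus the newly created remainder.
    return Framework(rules=inherited + created, parent=first)
```

In this sketch the `parent` reference stands in for the inheritance relationship; a framework built without inheritance has no parent.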
Optionally, the first crawling framework containing at least part of the rules matched with the domain name means that: the first crawling framework contains all of the rules matched with the domain name; or the first crawling framework contains some of the rules matched with the domain name.
The first crawling framework not containing at least part of the rules matched with the domain name means that the first crawling framework does not contain any rule matching the domain name.
Optionally, the determining unit 32 is further configured to determine, when the first crawling architecture includes a part of the rules matched with the domain name, a rule to be created according to the rule matched with the domain name and the part of the rules matched with the domain name.
After inheriting some of the rules from other crawling frameworks, the determining unit 32 may determine the rules to be created according to the remaining partial rules that cannot be inherited, so that all rules required for crawling the target web page are formed together according to the inherited rules and the rules to be created.
The creating unit 35 is further configured to create the rule to be created.
After the rule to be created is determined, the creating unit 35 creates it. If the field attribute and rule set corresponding to the rule to be created are not among the field attributes and rule sets currently inherited, the corresponding field attribute and rule set are created together with the rule.
The creating unit 35 is further configured to create a second crawling architecture according to the created rule to be created and the partial rule matched with the domain name.
After the rule to be created has been created, the creating unit 35 is controlled to create the second crawling architecture according to the created rule and the partial rule matched with the domain name.
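Determining the rules to be created amounts to taking the rules matched with the domain name and removing those that are inherited. The sketch below illustrates this, together with the field attributes that must be created alongside the new rules. The dictionary layout, the rule identifiers, and the name `plan_creation` are assumptions made for illustration, and rule sets are left out for brevity.

```python
def plan_creation(required, inherited, inherited_fields):
    """Work out which rules and field attributes still have to be created.

    required:         dict mapping rule id -> field attribute, for every rule
                      matched with the domain name
    inherited:        set of rule ids that can be inherited from the first framework
    inherited_fields: set of field attributes that come with the inherited rules
    """
    # Rules to be created: matched rules that are not inherited.
    to_create = {rid: attr for rid, attr in required.items() if rid not in inherited}
    # Field attributes created together with those rules: only the ones not inherited.
    new_fields = {attr for attr in to_create.values() if attr not in inherited_fields}
    return to_create, new_fields


if __name__ == "__main__":
    required = {"r1": "title", "r2": "author", "r3": "title"}
    print(plan_creation(required, {"r1"}, {"title"}))
```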
Optionally, the first crawling architecture and the second crawling architecture include field attributes, which are used for determining the fields and field types that the crawling architecture needs to crawl, and each rule has a corresponding field attribute.
Optionally, the judging unit 33 is further configured to judge whether the number of rules matched with the same domain name in the rules corresponding to each field attribute is greater than 1.
To avoid several rules being matched when the same field is crawled from web pages under the same domain name, which would confuse the crawled data, the judging unit 33 must also judge, before the target web page is crawled with the crawling framework, whether the rules in the crawling framework that match the domain name of the target web page include several rules that have the same scope and correspond to the same field attribute.
The determining unit 32 is further configured to determine a rule for crawling from the matched rules according to a preset matching condition when the judgment result is yes.
When the judging unit 33 judges that the number of rules matching the same domain name is greater than 1, the determining unit 32 extracts one rule having the latest creation time from the rules corresponding to the same field attribute and having the same scope, or extracts one rule stored locally from the plurality of rules corresponding to the same field attribute and having the same scope, and determines it as a rule for crawling.
The crawling unit 36 is further configured to crawl the target web page by using the second crawling architecture or the third crawling architecture according to the determined rule.
After the second crawling framework is created from the inherited rules and the rules to be created, or the third crawling framework is created from the rules matched with the domain name, and after either the rules in the crawling framework are confirmed to be directly usable or one rule for crawling is selected from several rules according to the matching condition, the crawling unit 36 can crawl the target web page through the created crawling framework and the determined rule.
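The selection of a single rule per field attribute can be illustrated with the following sketch. It assumes that each rule records its creation time and whether it is stored locally; the `Rule` class, the `select_rules` function, and the `prefer_local` switch are hypothetical names. When local rules are preferred and several local rules remain, the sketch falls back to the latest creation time, which is an assumption rather than something the embodiment states.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Rule:
    field_attr: str    # field attribute the rule corresponds to
    domain: str        # domain name (scope) the rule matches
    created: datetime  # creation time of the rule
    local: bool        # True if stored in this framework, False if inherited


def select_rules(rules, domain, prefer_local=False):
    """Pick exactly one rule per field attribute for the given domain."""
    groups = defaultdict(list)
    for rule in rules:
        if rule.domain == domain:
            groups[rule.field_attr].append(rule)
    chosen = {}
    for attr, candidates in groups.items():
        if len(candidates) > 1 and prefer_local:
            # Keep only locally stored rules when at least one exists.
            candidates = [r for r in candidates if r.local] or candidates
        # Among the remaining candidates, the latest creation time wins.
        chosen[attr] = max(candidates, key=lambda r: r.created)
    return chosen
```

With one rule per field attribute, the crawling step never has to decide between conflicting extraction results for the same field.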
Optionally, referring to fig. 6, the apparatus further includes:
an adding unit 37, configured to add a rule and/or a field attribute corresponding to the rule to the second crawling framework when it is monitored that the rule in the first crawling framework is deleted, and/or the field attribute corresponding to the rule in the first crawling framework is deleted, or the first crawling framework is deleted.
When it is monitored that an inherited rule is deleted from the first crawling framework, or that the inherited rule and its corresponding field attribute are deleted from the first crawling framework, or that the field attribute corresponding to the inherited rule is deleted from the first crawling framework, or that the entire first crawling framework is deleted, the adding unit 37 is controlled to add the deleted rule and/or field attribute to the second crawling framework, so that it becomes data belonging to the second crawling framework and the locally configured rule is still valid when the web page is crawled.
A releasing unit 38 for releasing the inheritance relationship of the first crawling framework and the second crawling framework.
When the first crawling framework has been deleted, or the first crawling framework no longer contains any rule that the second crawling framework needs to call, the releasing unit 38 is controlled to release the inheritance relationship between the first crawling framework and the second crawling framework. When the second crawling framework is subsequently used to crawl a web page, the first crawling framework no longer needs to be called, which avoids data call errors and reduces unnecessary time consumption.
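The handling of deletions and the release of the inheritance relationship can be sketched as below. The `Framework` class, its attributes, and the callback `on_parent_rule_deleted` are hypothetical; the sketch assumes that the definition of a deleted rule is still available at the moment the deletion is observed, and it omits field attributes for brevity.

```python
class Framework:
    def __init__(self, rules=None, parent=None):
        self.rules = dict(rules or {})  # rule id -> definition stored locally
        self.parent = parent            # first framework, or None
        self.inherited = set()          # rule ids resolved through the parent

    def get(self, rule_id):
        """Look a rule up locally first, then through the inheritance link."""
        if rule_id in self.rules:
            return self.rules[rule_id]
        if rule_id in self.inherited and self.parent is not None:
            return self.parent.rules[rule_id]
        raise KeyError(rule_id)


def on_parent_rule_deleted(child, rule_id, definition):
    """Copy a deleted inherited rule into the child, then release an unused link."""
    if rule_id in child.inherited:
        child.rules[rule_id] = definition  # the rule becomes local data
        child.inherited.discard(rule_id)
    if not child.inherited:
        child.parent = None                # release the inheritance relationship


if __name__ == "__main__":
    first = Framework({"r1": "//h1"})
    second = Framework({"r2": "//p"}, parent=first)
    second.inherited.add("r1")
    on_parent_rule_deleted(second, "r1", first.rules.pop("r1"))
    print(second.get("r1"), second.parent)
```

Once the link is released, every lookup in the second framework is served from its own rules, so a deleted first framework can no longer cause a failed call.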
Compared with the prior art, which can only reference a pre-built crawling architecture in its entirety and cannot form a new crawling architecture from a pre-built one, the web crawling device provided by the embodiment of the invention first determines, through the determining unit 32 and the judging unit 33, the rules matched with the domain name of the target web page and whether at least part of those rules can be inherited from the pre-created first crawling architecture. After the determination, a second crawling architecture is created by the inheritance unit 34 and the creation unit 35 according to the inherited rules, so that the second crawling architecture contains both rules inherited from other crawling architectures and newly created rules. Because the second crawling architecture inherits part of its rules from the previously created crawling architecture, the time for creating the crawling architecture is reduced, and the newly created crawling architecture can still meet the crawling requirement of the current web page. To prevent several rules that correspond to the same field attribute in the crawling architecture from matching the same domain name, which would confuse the crawled data, the judging unit 33 and the determining unit 32 further determine the rule for crawling according to preset matching conditions before the target web page is crawled with the crawling architecture. Moreover, when it is monitored that a rule in the first crawling architecture is deleted, and/or a field attribute corresponding to a rule in the first crawling architecture is deleted, or the first crawling architecture is deleted, the adding unit 37 adds the deleted rule and/or field attribute to the second crawling architecture so that the configured rule remains valid; the deleted rule and/or field attribute thereby becomes local data of the second crawling architecture. The release unit 38 then releases the inheritance relationship, which avoids data call errors and reduces unnecessary time consumption.
The web crawling device comprises a processor and a memory, wherein the determining unit, the judging unit, the creating unit, the inheritance unit, the crawling unit and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor includes a kernel, which fetches the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting kernel parameters the prior-art problem that a new crawling architecture cannot be formed on the basis of a pre-established crawling architecture to crawl web pages is solved.
The memory may include forms of computer-readable media such as volatile memory, for example random access memory (RAM), and/or nonvolatile memory, for example read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the invention provides a storage medium, on which a program is stored, which when executed by a processor, implements the following web crawling method:
acquiring the domain name of the target webpage, and determining a rule matched with the domain name;
judging whether the first pre-created crawling framework contains at least part of the rules matched with the domain name;
if the first crawling framework contains at least part of the rules matched with the domain name, inheriting the at least part of the rules from the first crawling framework;
creating a second crawling framework according to the at least part of the rules, and crawling the target webpage through the second crawling framework.
The embodiment of the invention provides a processor, which is used for running a program, wherein the program executes the following webpage crawling method:
acquiring the domain name of the target webpage, and determining a rule matched with the domain name;
judging whether the first pre-created crawling framework contains at least part of the rules matched with the domain name;
if the first crawling framework contains at least part of the rules matched with the domain name, inheriting the at least part of the rules from the first crawling framework;
creating a second crawling framework according to the at least part of the rules, and crawling the target webpage through the second crawling framework.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program stored in the memory and capable of running on the processor, wherein the processor realizes the following steps when executing the program:
Acquiring a domain name of a target webpage, and determining a rule matched with the domain name;
judging whether a first pre-established crawling framework contains at least part of rules matched with the domain name or not;
if the first crawling framework contains at least part of rules matched with the domain name, inheriting the at least part of rules from the first crawling framework;
and creating a second crawling framework according to the at least partial rule, and crawling the target webpage through the second crawling framework.
Optionally, the method further comprises:
if the first crawling framework does not contain at least part of rules matched with the domain name, creating a third crawling framework with rules matched with the domain name;
and crawling the target webpage through the third crawling architecture.
Optionally, the first crawling framework containing at least part of the rules matched with the domain name includes:
the first crawling framework contains all of the rules matched with the domain name; or
the first crawling framework contains some of the rules matched with the domain name;
the first crawling framework not containing at least part of the rules matched with the domain name includes:
the first crawling framework does not contain any rule matching the domain name.
Optionally, if the first crawling architecture includes a part of rules matched with the domain name, the method further includes:
determining a rule to be created according to the rule matched with the domain name and the partial rule matched with the domain name;
creating the rule to be created;
said creating a second crawling architecture according to said at least part of the rules, comprising:
and creating the second crawling architecture according to the created rule to be created and the partial rule matched with the domain name.
Optionally, the first crawling architecture and the second crawling architecture include field attributes, which are used for determining the fields and field types that the crawling architecture needs to crawl, and each rule has a corresponding field attribute.
Optionally, judging whether the number of the rules matched with the same domain name in the rules corresponding to each field attribute is greater than 1;
if yes, determining a rule for crawling from the matched rules according to a preset matching condition;
said crawling the target web page through the second crawling framework or the third crawling framework includes:
the second crawling framework or the third crawling framework crawling the target webpage according to the determined rule.
Optionally, the method further comprises:
when the rule in the first crawling framework is deleted, and/or a field attribute corresponding to the rule in the first crawling framework is deleted, or the first crawling framework is deleted, adding the rule and/or the field attribute corresponding to the rule into the second crawling framework;
and releasing the inheritance relationship of the first crawling framework and the second crawling framework.
The device herein may be a server, PC, PAD, cell phone, etc.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps:
acquiring the domain name of the target webpage, and determining a rule matched with the domain name;
judging whether the first pre-created crawling framework contains at least part of the rules matched with the domain name;
if the first crawling framework contains at least part of the rules matched with the domain name, inheriting the at least part of the rules from the first crawling framework;
creating a second crawling framework according to the at least part of the rules, and crawling the target webpage through the second crawling framework.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (9)

1. A method of crawling web pages, the method comprising:
acquiring a domain name of a target webpage, and determining a rule matched with the domain name;
judging whether a first pre-established crawling architecture contains at least part of rules matched with the domain name or not;
if the first crawling architecture contains at least part of rules matched with the domain name, inheriting the at least part of rules from the first crawling architecture;
creating a second crawling architecture according to the at least partial rule, and crawling the target webpage through the second crawling architecture;
the method further comprises the steps of:
if the first crawling architecture does not contain at least part of rules matched with the domain name, creating a third crawling architecture with rules matched with the domain name;
and crawling the target webpage through the third crawling architecture.
2. The method according to claim 1, characterized in that:
the first crawling architecture containing at least part of the rules matched with the domain name includes:
the first crawling architecture contains all of the rules matched with the domain name; or
the first crawling architecture contains some of the rules matched with the domain name;
the first crawling architecture not containing at least part of the rules matched with the domain name includes:
the first crawling architecture does not contain any rule matching the domain name.
3. The method of claim 2, wherein if the first crawling architecture includes a portion of rules matching the domain name, the method further comprises:
determining a rule to be created according to the rule matched with the domain name and the partial rule matched with the domain name;
creating the rule to be created;
said creating a second crawling architecture according to said at least part of the rules, comprising:
and creating the second crawling architecture according to the created rule to be created and the partial rule matched with the domain name.
4. The method of claim 1, wherein the first and second crawling architectures comprise field attributes, which are used for determining the fields and field types that the crawling architecture needs to crawl, and each rule has a corresponding field attribute.
5. The method according to claim 4, wherein the method further comprises:
judging whether the number of rules matched with the same domain name in the rules corresponding to each field attribute is greater than 1;
if yes, determining a rule for crawling from the matched rules according to a preset matching condition;
the crawling the target web page through the second crawling architecture or the third crawling architecture includes:
and the second crawling architecture or the third crawling architecture crawls the target webpage according to the determined rule.
6. The method according to claim 1, wherein the method further comprises:
when the rule in the first crawling architecture is deleted, and/or a field attribute corresponding to the rule in the first crawling architecture is deleted, or the first crawling architecture is deleted, adding the rule and/or the field attribute corresponding to the rule into the second crawling architecture;
and releasing the inheritance relationship of the first crawling architecture and the second crawling architecture.
7. A web crawling apparatus, the apparatus comprising:
an acquisition unit, configured to acquire the domain name of the target webpage;
a determining unit, configured to determine a rule matched with the domain name;
a judging unit, configured to judge whether the first pre-established crawling architecture contains at least part of the rules matched with the domain name;
an inheritance unit, configured to inherit at least a part of the rules from the first crawling architecture when the first crawling architecture contains at least a part of the rules that match the domain name;
a creating unit, configured to create a second crawling architecture according to the at least part of the rules;
a crawling unit, configured to crawl the target webpage through the second crawling architecture;
wherein:
the creating unit is further configured to create a third crawling architecture with rules matched with the domain name when at least part of the rules matched with the domain name are not contained in the first crawling architecture;
and the crawling unit is further configured to crawl the target webpage through the third crawling architecture.
8. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the web crawling method of any one of claims 1-6.
9. A processor for running a program, wherein the program when run performs the method of crawling a web page of any one of claims 1 to 6.
CN201811145540.3A 2018-09-29 2018-09-29 Webpage crawling method and device Active CN110968756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811145540.3A CN110968756B (en) 2018-09-29 2018-09-29 Webpage crawling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811145540.3A CN110968756B (en) 2018-09-29 2018-09-29 Webpage crawling method and device

Publications (2)

Publication Number Publication Date
CN110968756A CN110968756A (en) 2020-04-07
CN110968756B true CN110968756B (en) 2023-05-12

Family

ID=70027138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811145540.3A Active CN110968756B (en) 2018-09-29 2018-09-29 Webpage crawling method and device

Country Status (1)

Country Link
CN (1) CN110968756B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045507A (en) * 2016-02-05 2017-08-15 北京国双科技有限公司 Web page crawl method and device
CN107885843A (en) * 2017-11-10 2018-04-06 天脉聚源(北京)传媒科技有限公司 A kind of method and device of intelligent reptile task
CN107943991A (en) * 2017-12-01 2018-04-20 成都嗨翻屋文化传播有限公司 A kind of distributed reptile frame and implementation method based on memory database

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8799262B2 (en) * 2011-04-11 2014-08-05 Vistaprint Schweiz Gmbh Configurable web crawler

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
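Research on a personalized topical web crawler based on a rule engine; Zhao Sijia et al.; Computer Technology and Development (《计算机技术与发展》); 2011-03-10; Vol. 21, No. 03; pp. 56-63 *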

Also Published As

Publication number Publication date
CN110968756A (en) 2020-04-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant