CN110020068B - Method and device for configuring page crawling rules - Google Patents

Info

Publication number
CN110020068B
Authority
CN
China
Prior art keywords
page
crawled
elements
crawling
regular expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710884074.XA
Other languages
Chinese (zh)
Other versions
CN110020068A (en)
Inventor
满悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201710884074.XA
Publication of CN110020068A
Application granted
Publication of CN110020068B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for configuring a page crawling rule, relates to the technical field of computers, and mainly aims to automatically generate the page crawling rule and improve the generation speed of the crawling rule, wherein the main technical scheme of the invention is as follows: selecting page elements to be crawled from pages needing to be configured with crawling rules; generating path information of the page elements according to the attribute information corresponding to the page elements to be crawled; generating a regular expression matched with the content of the page element to be crawled by setting a regular expression template matched with the page element to be crawled; and configuring the page crawling rule of the page to be crawled according to the display rule of the page element to be crawled in the page to be crawled and the position information of the page element to be crawled in the page to be crawled. The method is mainly used for configuring the page crawling rule.

Description

Method and device for configuring page crawling rules
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for configuring page crawling rules.
Background
With the development of cloud computing and big data technology, techniques for searching and mining the large volume of structured and unstructured information on web pages have become an active research topic. Collecting data for analysis often takes considerable time and effort, and in the big data era crawler technology has become an important way to acquire network data.
Generally, a crawling rule must be configured manually before a crawler crawls a page. The crawling rule contains the path information of the page elements to be crawled and a regular expression used to verify the crawled content. Because web page structures are complex and change frequently, the prior art configures a crawling rule as follows: first, the page elements are located manually; next, the page elements are analyzed and temporarily generated interference information is removed from their path information; finally, the crawling rule is produced by manually writing the path information of the page elements and the regular expression used to verify the crawled content.
However, as the number of pages to be crawled keeps growing, manually writing the paths of page elements and the regular expressions that verify the crawled content requires the writer to have a certain level of professional knowledge and analytical ability and consumes time and effort, so crawling rules are generated slowly.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for configuring a page crawling rule, and mainly aims to automatically generate a page crawling rule and increase the generation speed of the crawling rule.
In order to solve the above problems, the present invention mainly provides the following technical solutions:
in one aspect, an embodiment of the present invention provides a method for configuring a page crawling rule, including:
selecting page elements to be crawled from pages needing to be configured with crawling rules;
generating path information of page elements according to attribute information corresponding to the page elements to be crawled, wherein the path information of the page elements is used for recording position information of the page elements to be crawled in the page to be crawled;
generating a regular expression matched with the content of the page element to be crawled by setting a regular expression template matched with the page element to be crawled, wherein the regular expression is used for recording the display rule of the page element to be crawled in the page to be crawled;
and configuring the page crawling rule of the page to be crawled according to the display rule of the page element to be crawled in the page to be crawled and the position information of the page element to be crawled in the page to be crawled.
Further, before determining the page elements to be crawled from the page needing to configure the crawling rules, the method further comprises:
and starting a browser plug-in, and determining page content needing to be configured with crawling rules from the browser plug-in.
Further, the generating, according to the attribute information corresponding to the page element to be crawled, path information of the page element includes:
generating relative path information of the page elements in each upper-level element by inquiring attribute information corresponding to the page elements to be crawled;
and generating position information of the page elements in the page to be crawled according to the relative path information of the page elements in each superior element.
Further, before the generating the relative path information of the page element in each upper-level element by querying the attribute information corresponding to the page element to be crawled, the method further includes:
and filtering the temporarily generated attributes in the attribute information corresponding to the page element to be crawled.
Further, the generating of the regular expression matched with the content of the page element to be crawled by setting the regular expression template matched with the page element to be crawled comprises:
setting a regular expression template matched with the page element to be crawled according to the keyword corresponding to the page element to be crawled;
and generating a regular expression matched with the content of the page element to be crawled by taking the regular expression template as a matching rule.
In order to achieve the above object, according to another aspect of the present invention, a storage medium is provided, where the storage medium includes a stored program, and when the program runs, a device where the storage medium is located is controlled to execute the above configuration method for the page crawling rule.
In order to achieve the above object, according to another aspect of the present invention, there is provided a processor for executing a program, wherein the program executes the method for configuring the page crawling rule.
On the other hand, an embodiment of the present invention further provides a device for configuring a page crawling rule, including:
the selecting unit is used for selecting page elements to be crawled from the pages needing to be configured with the crawling rules;
the first generating unit is used for generating path information of page elements according to attribute information corresponding to the page elements to be crawled, wherein the path information of the page elements is used for recording position information of the page elements to be crawled in the page to be crawled;
the second generation unit is used for generating a regular expression matched with the content of the page element to be crawled by setting a regular expression template matched with the page element to be crawled, and the regular expression is used for recording the display rule of the page element to be crawled in the page to be crawled;
and the configuration unit is used for configuring the page crawling rule of the page to be crawled according to the display rule of the page element to be crawled in the page to be crawled and the position information of the page element to be crawled in the page to be crawled.
Further, the apparatus further comprises:
the determining unit is used for starting a browser plug-in before the page elements to be crawled are selected from the page needing to be configured with the crawling rule, and for determining, from the browser plug-in, the page content needing to be configured with the crawling rule.
Further, the first generation unit includes:
the first generation module is used for generating relative path information of the page elements in each upper-level element by inquiring the attribute information corresponding to the page elements to be crawled;
and the second generation module is used for generating the position information of the page elements in the page to be crawled according to the relative path information of the page elements in each superior element.
Further, the first generation unit further includes:
and the filtering module is used for filtering the temporarily generated attribute in the attribute information corresponding to the page element to be crawled before generating the relative path information of the page element in each superior element by inquiring the class attribute and the tag attribute corresponding to the page element to be crawled.
Further, the second generation unit includes:
the setting module is used for setting a regular expression template matched with the page element to be crawled according to the keyword corresponding to the page element to be crawled;
and the third generation module is used for generating a regular expression matched with the content of the page element to be crawled by taking the regular expression template as a matching rule.
By the technical scheme, the technical scheme provided by the embodiment of the invention at least has the following advantages:
According to the method and the device for configuring a page crawling rule provided by the embodiments of the present invention, page elements to be crawled are selected from a page that needs a crawling rule, the page crawling rule is generated automatically according to the display rule and the position information of those page elements in the page to be crawled, and the content of the page is then crawled according to that rule. In the prior art, a page crawling rule is configured by manually writing the path information of the page elements and the regular expression used to verify the crawled content. By contrast, the embodiments of the present invention generate the path information of the page elements by querying the attribute information of the page elements to be crawled, and generate a preset regular expression for the page elements from the keywords corresponding to them. The page crawling rule is then configured from this path information and this regular expression, so the user does not need to write any crawling rule code, and crawling rules are generated faster.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart of a method for configuring a page crawling rule according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for configuring page crawling rules according to an embodiment of the present invention;
FIG. 3 is a block diagram illustrating a configuration apparatus for page crawling rules according to an embodiment of the present invention;
FIG. 4 is a block diagram of another apparatus for configuring page crawling rules according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
An embodiment of the present invention provides a method for configuring a page crawling rule. As shown in FIG. 1, the method configures a page crawling rule file according to the generated display rule of a page element in the page to be crawled and the position information of that page element in the page, so that the page crawling rule can be generated automatically and crawling rules are generated faster. The specific steps are as follows:
101. Selecting page elements to be crawled from the page needing to be configured with the crawling rule.
In this embodiment, the page that needs a crawling rule can be opened from the browser plug-in. For example, the Sina Weibo home page or a Tencent News page is opened in a Chrome browser plug-in; that page is then the page for which the crawling rule is to be configured.
Different pages display different page elements. For example, a Sina Weibo page may display microblog-related elements such as microblog bloggers, microblog forums, and trending microblogs, while a Tencent News page may display news-related elements such as the news front page and tags for various fields. Page elements to be crawled are therefore selected from the opened page. For a Sina Weibo page, for example and without limitation, the number of comments, the author, and the publication time of a microblog can be selected as elements to be crawled. After the crawling rule has been configured, the set of elements to be crawled can still be updated, for example by adding or deleting one or more elements; the embodiment of the present invention places no limit on this.
In this embodiment, page elements to be crawled are selected from the page that needs a crawling rule, and the elements chosen are those relevant to the particular page, so the crawled content is more accurate and meets the actual needs of the user.
102. Generating path information of the page elements according to the attribute information corresponding to the page elements to be crawled.
The path information of a page element records the position of the page element to be crawled in the page to be crawled, for example the position of elements such as tags, pictures, and animations. This position information can be stored in the form of a Cascading Style Sheets (CSS) directory file corresponding to the website.
It should be noted that a page element in the page to be crawled does not exist in isolation; it is nested within the page or within other elements of the page. For example, if element A is inside element B, then B is the parent element and A is a child element of B; B may in turn have a parent element C, and A may have a child element D, and so on. The elements in which a page element is nested give its relative path at each nesting level, so the position of the element in the page to be crawled can be located from its relative path information in each of the elements that enclose it.
In this embodiment, the path information of a page element is generated by analyzing the attribute information corresponding to the page element to be crawled: the relative path information of the element in each upper-level nested element is generated first, and the path information of the page element is then generated from these relative paths.
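The nesting-based path generation described in step 102 can be illustrated with a minimal Python sketch. The patent does not specify a data model or a path syntax, so the `Element` class, the helper names, and the CSS-selector-style output below are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a page element; not defined by the patent."""
    tag: str
    id: Optional[str] = None
    classes: List[str] = field(default_factory=list)
    parent: Optional["Element"] = None


def relative_segment(el: Element) -> str:
    """Describe one element at its nesting level from its id/class attributes."""
    if el.id:
        return f"{el.tag}#{el.id}"
    if el.classes:
        return el.tag + "".join(f".{c}" for c in el.classes)
    return el.tag


def element_path(el: Element) -> str:
    """Walk up the nesting chain and join the per-level segments into one path."""
    segments = []
    node: Optional[Element] = el
    while node is not None:
        segments.append(relative_segment(node))
        node = node.parent
    return " > ".join(reversed(segments))


# Assumed example structure: body > feed container > card > author link.
body = Element("body")
feed = Element("div", id="feed", parent=body)
card = Element("div", classes=["card"], parent=feed)
author = Element("a", classes=["author", "name"], parent=card)

path = element_path(author)
print(path)  # body > div#feed > div.card > a.author.name
```

Joining the segments from the outermost element inward mirrors the idea that the position in the page follows from the relative path at each nesting level.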
103. Generating a regular expression matched with the content of the page element to be crawled by setting a regular expression template matched with the page element to be crawled.
The regular expression records the display rule of the page element to be crawled in the page to be crawled, for example the display rule of elements such as tags, pictures, and animations. In this embodiment, when the user opens the browser plug-in and selects the page that needs a crawling rule, the keywords describing each page element are displayed. For example, the title element of a microblog published by a blogger may display 'web promotion method consultation-web promotion case discussion'; the displayed content of the title element is the keyword describing that element and is also the content to be crawled. The keyword content to be crawled differs from page to page. If the page to be crawled displays an article and the article's author is to be crawled, then after the author element is selected, a regular expression template matching the author element is set and a regular expression matching the author element is generated from it; usually the regular expression targets the keywords that follow the author label. Likewise, if the page displays a news report and its source is to be crawled, a regular expression template matching the news-source element is set and a regular expression for the news source is generated from it, usually targeting the keywords that follow the news source label.
It should be noted that a regular expression is a pattern string built from predefined special characters and combinations of them, and it expresses a filtering logic for character strings. Crawling page content against a generated regular expression constrains what is accepted, which improves the accuracy of the crawled content.
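As a small illustration of how a regular expression can capture the keywords that follow a label, the sketch below uses Python's standard `re` module. The sample text and the "Author:" and "Comments:" labels are assumptions for the example, not strings taken from the patent.

```python
import re

# Hypothetical text extracted from a page element; labels are assumed.
text = "Author: Zhang San\nSource: example-news.com\nComments: 128"

# Capture whatever follows the label on the same line.
author_re = re.compile(r"^Author:\s*(.+)$", re.MULTILINE)
# Accept only digits after the label, so malformed content is rejected.
comments_re = re.compile(r"^Comments:\s*(\d+)$", re.MULTILINE)

author = author_re.search(text).group(1)
comments = int(comments_re.search(text).group(1))

print(author)    # Zhang San
print(comments)  # 128
```

Restricting the comment count to digits shows how the expression doubles as a check on the crawled content.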
104. Configuring the page crawling rule according to the display rule of the page element to be crawled in the page to be crawled and the position information of the page element to be crawled in the page to be crawled.
In this embodiment, the display rule of the page element to be crawled tells the crawler which keywords to crawl, and the position information of the page element tells the crawler where those keywords are located in the page. The page crawling rule is then generated from the display rule and the position information of the page element to be crawled in the page to be crawled.
It should be noted that the crawling rule is comparable to a style information file of a web page: for example, display rule information is set for page content such as text, pictures, hyperlinks, audio, video, and animation contained in the web page. Each website is designed differently, so its crawling rule differs as well, and the crawler crawls the page content according to the page crawling rule configured for that page.
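One way to picture the configured rule is as a small file that pairs each element's position information with its display rule. The following Python sketch serializes such pairs to JSON; the `FieldRule` structure, the field names, the JSON layout, and the example URL are all hypothetical, since the patent does not define a rule file format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class FieldRule:
    name: str     # label for the crawled element, e.g. "author" (assumed)
    path: str     # position information; a CSS-style path is assumed
    pattern: str  # display rule, expressed as a regular expression


def build_crawl_rule(page_url: str, fields: List[FieldRule]) -> str:
    """Combine per-element position and display rules into one rule document."""
    rule = {"page": page_url, "fields": [asdict(f) for f in fields]}
    return json.dumps(rule, ensure_ascii=False, indent=2)


rule_json = build_crawl_rule(
    "https://example.com/post/1",
    [
        FieldRule("author", "body > div#feed > a.author", r"^Author:\s*(.+)$"),
        FieldRule("comments", "body > div#feed > span.count", r"^Comments:\s*(\d+)$"),
    ],
)
print(rule_json)
```

Keeping the path and the pattern side by side per element lets a crawler first locate the element and then validate its content.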
According to the method for configuring a page crawling rule provided by this embodiment of the present invention, page elements to be crawled are selected from a page that needs a crawling rule, the page crawling rule is generated automatically according to the display rule and the position information of those page elements in the page to be crawled, and the content of the page is then crawled according to that rule. In the prior art, a page crawling rule is configured by manually writing the path information of the page elements and the regular expression used to verify the crawled content. By contrast, the embodiment of the present invention generates the path information of the page elements by querying the attribute information of the page elements to be crawled, and generates a preset regular expression for the page elements from the keywords corresponding to them. The page crawling rule is then configured from this path information and this regular expression, so the user does not need to write any crawling rule code, and crawling rules are generated faster.
To describe the method for configuring a page crawling rule in more detail, in particular the steps of generating the path information and the regular expression of the page element to be crawled, an embodiment of the present invention further provides another method for configuring a page crawling rule. As shown in FIG. 2, the method includes the following specific steps:
201. Starting a browser plug-in, and determining, from the browser plug-in, the page needing to be configured with a crawling rule.
After the browser plug-in is started, the page that needs a crawling rule is determined from the browser plug-in. A separate crawling rule has to be configured for each page because different pages display different content and call for different content to be crawled.
In this embodiment, the page to be crawled can be a Sina Weibo page or an encyclopedia page opened in a Chrome browser plug-in. The page is a plain-text file containing HyperText Markup Language (HTML) tags. When a page is opened in a browser tab, the browser (the client) sends the server a request for the HTML file to be accessed; the server stores all HTML files of the website. The browser and the web server exchange information over the HyperText Transfer Protocol (HTTP): the server responds to the request by returning the corresponding HTML file, and the browser loads the file and presents the page content in the tab.
202. Selecting page elements to be crawled from the page needing to be configured with the crawling rule.
In this embodiment, the content presented in the page to be crawled generally comprises areas such as a title, a header and a footer, the main content, a functional area, and an advertisement column, and each area has its own page elements. The main content is the most important area of the web page and generally contains page elements such as pictures and documents.
Because different elements have different display rules and display positions in a page, page elements to be crawled are selected from the opened page that needs a crawling rule. For a Baidu Tieba page, for example and without limitation, elements such as a bar's top product, comment count, bar information, and entrance board can be selected as elements to be crawled. After the crawling rule has been configured, the set of elements to be crawled can still be updated, for example by adding or deleting one or more elements; the embodiment of the present invention places no limit on this.
203. Filtering the temporarily generated attributes in the attribute information corresponding to the page element to be crawled.
The attribute information corresponding to a page element to be crawled mainly comprises a class attribute and a tag attribute. The class attribute assigns a class to an element: the same element or different elements in the page to be crawled can be given the same class attribute, so that elements sharing a class attribute follow the same display rule. The tag attribute corresponding to the page element is the id attribute, a unique identifier given to the page element; two page elements with the same id attribute are not allowed in the same page.
It should be noted that one page element may define several class attributes, and some page elements acquire many temporary class and id attributes while the page is being created. These temporary attributes interfere with the page content and prevent the crawler from crawling it accurately, so they are filtered out of the attribute information in this step.
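The patent does not say how a temporarily generated attribute is recognized. The Python sketch below therefore assumes a simple heuristic, namely that a value containing a run of four or more digits or eight or more hexadecimal characters was generated automatically; the heuristic, the function names, and the sample values are all illustrative assumptions.

```python
import re
from typing import Iterable, List, Optional

# Assumed heuristic (not from the patent): long digit runs or hash-like
# hexadecimal runs suggest an automatically generated attribute value.
_TEMPORARY = re.compile(r"\d{4,}|[0-9a-f]{8,}", re.IGNORECASE)


def is_temporary(value: str) -> bool:
    """Return True if the attribute value looks temporarily generated."""
    return bool(_TEMPORARY.search(value))


def filter_classes(classes: Iterable[str]) -> List[str]:
    """Keep only class values that look stable."""
    return [c for c in classes if not is_temporary(c)]


def filter_id(element_id: Optional[str]) -> Optional[str]:
    """Drop an id that looks temporarily generated."""
    if element_id is None or is_temporary(element_id):
        return None
    return element_id


stable_classes = filter_classes(["card", "css-1a2b3c4d", "author"])
print(stable_classes)                # ['card', 'author']
print(filter_id("item_1506412345"))  # None
print(filter_id("feed"))             # feed
```

A real implementation would likely need site-specific rules, because ordinary words made only of the letters a to f could be misclassified by this pattern.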
204. Generating relative path information of the page element in each upper-level element by querying the attribute information corresponding to the page element to be crawled.
A page element in the page to be crawled does not exist in isolation; it is nested within the page to be crawled or within other elements of that page. For example, if element A is inside element B, then B is the parent element and A is a child element of B; B may in turn have a parent element C, and A may have a child element D, and so on. Page elements in a nesting relationship share the same display rule, that is, the same class attribute. Each upper-level element that has the same class attribute as the element to be crawled is therefore queried through the id attribute of the element to be crawled, which yields the relative path information of the page element in each upper-level element.
For example, suppose element A is the upper-level element of element B, element B is the upper-level element of element C, and the three share the same class attribute. The id attribute of C is used to query the relative path information of C in B, and the id attribute of B is used to query the relative path information of B in A; together these give the relative path information of C in both A and B.
205. Generating position information of the page element in the page to be crawled according to the relative path information of the page element in each upper-level element.
The relative path of a page element in each upper-level element reflects the relative position of the element in the page, so the position information of the page element to be crawled is generated from its relative path information in each upper-level element. For example, in the layout of a microblog page, the categories of the microblog accounts the user follows are placed on the left side and include: friend circle, special attention, news hotspots, celebrities, film and television, and expansion. To allow varied styling, each category can be placed in its own box, that is, a child box embedded in a parent box, and the position of the child box in the microblog page is then located from the relative path information of the child box in the parent box.
206. Setting a regular expression template matched with the page element to be crawled according to the keyword corresponding to the page element to be crawled.
The keywords are generally the content the crawler is required to crawl. If the content to be crawled is the source website of a news item, the character string following the colon after the news source label is set as the regular expression template; if the content to be crawled is the author of an article, the character string following the colon after the author label is set as the regular expression template. A regular expression template is an expression containing several string variables, and these string variables define the content to be crawled.
207. Generating a regular expression matched with the content of the page element to be crawled by taking the regular expression template as a matching rule.
It should be noted that the string variables in the regular expression template may be expressed in various ways, such as an asterisk (*), a slash (/), or a colon (:), and different representations have different matching rules; the embodiment of the present invention is not limited in this respect.
In the embodiment of the invention, after the corresponding regular expression template is set, it is used as the matching rule to generate the display rule of the page elements in the page to be crawled, so that the crawler can crawl different page elements in the page to be crawled through the display rule, which ensures the accuracy of the crawled content.
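Steps 206 and 207 can be sketched as filling a template with a keyword and a string variable. The template text, the variable kinds (`text`, `number`), and the sample strings are assumptions made for this sketch; the source only states that a template contains string variables and is used as the matching rule.

```python
import re

# Hypothetical variable kinds; the source does not define a concrete notation.
VARIABLE_PATTERNS = {
    "text": r"\S+",    # a run of non-whitespace characters
    "number": r"\d+",  # a run of digits
}

# Template: the keyword, a colon (ASCII or full-width), then one variable.
TEMPLATE = "{keyword}\\s*[:：]\\s*(?P<value>{variable})"


def build_regex(keyword: str, kind: str) -> re.Pattern:
    """Fill the template with an escaped keyword and a variable pattern."""
    return re.compile(
        TEMPLATE.format(keyword=re.escape(keyword), variable=VARIABLE_PATTERNS[kind])
    )


source_rule = build_regex("Source", "text")
count_rule = build_regex("Messages", "number")

match = source_rule.search("Source: example-news.test  Author: Li")
print(match.group("value") if match else None)
```

Keeping the keyword and the variable kind separate from the template means one template can serve several page elements, which is one plausible reading of why the source distinguishes the template from the generated expression.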
208. Configuring the page crawling rule according to the display rule of the page element to be crawled in the page to be crawled and the position information of the page element to be crawled in the page to be crawled.
For the embodiment of the present invention, a specific application scenario may be, but is not limited to, the following. A browser plug-in is started, and a Sina Weibo (microblog) page is opened as the page for which a crawling rule needs to be configured. Page elements such as the microblog name, the message count, and the microblog source are selected from the page as the page elements to be crawled. Because many temporary class attributes and id attributes may be generated when these page elements are created, and such attributes can interfere with the page content, the temporarily generated attributes are first filtered out of the class attributes and id attributes corresponding to the page elements to be crawled. The relative path information of the page elements in each upper-level element, that is, of the microblog name, the message count, and the microblog source in each upper-level element, is then generated by querying the class attributes and id attributes corresponding to the page elements to be crawled. Next, string variables are set according to the keywords corresponding to the microblog name, the message count, and the microblog source: for the microblog name and the microblog source, the string variable corresponding to the keyword is set as a text string, and for the message count it is set as a numeric string. The display rules of the page elements in the page to be crawled are generated by using the string variables as matching rules. Finally, the page crawling rule is configured according to the display rules of the page elements in the page to be crawled and the position information of the page elements in the page to be crawled. The crawling rule is equivalent to a style information file of the page; for example, display rule information is set for page content such as text, pictures, hyperlinks, audio, video, and animation contained in the web page, so that the crawler crawls the content of the page to be crawled according to the page crawling rule during crawling.
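The filtering of temporarily generated class and id attributes in this scenario might look like the sketch below. The source does not say how temporary attributes are recognised, so the digit-run heuristic and the attribute values are assumptions introduced only for illustration.

```python
import re

# Assumption: a class or id token counts as "temporarily generated" when it
# contains a run of four or more digits. The source gives no such criterion.
TEMPORARY_TOKEN = re.compile(r"\d{4,}")


def filter_temporary(attrs: dict) -> dict:
    """Return a copy of attrs with temporary-looking class/id tokens removed."""
    cleaned = {}
    for name, value in attrs.items():
        if name not in ("class", "id"):
            cleaned[name] = value
            continue
        kept = [tok for tok in value.split() if not TEMPORARY_TOKEN.search(tok)]
        if kept:
            # Keep the attribute only if at least one stable token survives.
            cleaned[name] = " ".join(kept)
    return cleaned


# Hypothetical attribute values for one selected page element.
attrs = {"class": "post-title font14 tmp_1506394821", "id": "M_1506394821337", "href": "/u/1"}
print(filter_temporary(attrs))
```

Filtering before path generation matters because a path built from an attribute that changes on every page load would not match the element on the next crawl.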
According to this further method for configuring page crawling rules provided by the embodiment of the invention, page elements to be crawled are selected from the page for which crawling rules need to be configured, the page crawling rule of the page to be crawled is generated automatically from the display rule of the page elements to be crawled in the page to be crawled and the position information of the page elements to be crawled in the page to be crawled, and the content of the page to be crawled is then crawled according to the page crawling rule. In the prior art, page crawling rules are configured by manually writing the path information of page elements and the regular expressions used to verify the crawled content. By contrast, the embodiment of the invention generates the path information of the page elements by querying the attribute information corresponding to the page elements to be crawled, and generates the preset regular expression of the page elements by setting the keywords corresponding to the page elements to be crawled, so that the page crawling rule is configured from the path information and the preset regular expression. The user does not need to write crawling rule code while the page crawling rule is generated, which increases the speed at which crawling rules are generated.
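As a rough illustration of how position information and a regular expression could be combined into one crawling rule and then used, consider the sketch below. The JSON layout, the function names, the URL, and the `text_at_path` lookup are all hypothetical; the source does not define a rule file format.

```python
import json
import re


def configure_rule(page_url: str, fields: dict) -> str:
    """Serialise one crawl rule: for each field, a path and a regular expression."""
    rule = {
        "page": page_url,
        "fields": {
            name: {"path": path, "regex": regex}
            for name, (path, regex) in fields.items()
        },
    }
    return json.dumps(rule, ensure_ascii=False, indent=2)


def apply_rule(rule_json: str, text_at_path: dict) -> dict:
    """Extract each field from the text found at its path.

    text_at_path is a hypothetical stand-in for locating elements in a real page.
    """
    rule = json.loads(rule_json)
    result = {}
    for name, spec in rule["fields"].items():
        match = re.search(spec["regex"], text_at_path.get(spec["path"], ""))
        result[name] = match.group(1) if match else None
    return result


rule_json = configure_rule(
    "https://example.test/post/1",
    {
        "source": ("div.post > span.from", r"Source\s*:\s*(\S+)"),
        "messages": ("div.post > span.count", r"Messages\s*:\s*(\d+)"),
    },
)
page_text = {
    "div.post > span.from": "Source: example-news.test",
    "div.post > span.count": "Messages: 128",
}
print(apply_rule(rule_json, page_text))
```

The path answers where an element is and the regular expression answers what its content should look like, so a crawler reading such a rule needs no hand-written extraction code.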
In order to achieve the above object, according to another aspect of the present invention, an embodiment of the present invention further provides a storage medium, where the storage medium includes a stored program, and when the program runs, a device in which the storage medium is located is controlled to execute the above configuration method for the page crawling rule.
In order to achieve the above object, according to another aspect of the present invention, an embodiment of the present invention further provides a processor, where the processor is configured to execute a program, where the program executes the method for configuring the page crawling rule described above.
Further, as an implementation of the methods shown in FIG. 1 and FIG. 2, another embodiment of the present invention provides a device for configuring page crawling rules. The device embodiment corresponds to the foregoing method embodiment; for ease of reading, the details of the method embodiment are not repeated one by one here, but it should be clear that the device in this embodiment can implement all of the content of the method embodiment. The device can automatically generate page crawling rules, thereby increasing the speed at which crawling rules are generated. Specifically, as shown in FIG. 3, the device includes:
the selecting unit 31 may be configured to select a page element to be crawled from a page to which crawling rules need to be configured;
the first generating unit 32 may be configured to generate path information of a page element according to attribute information corresponding to the page element to be crawled, where the path information of the page element is used to record position information of the page element to be crawled in the page to be crawled;
the second generating unit 33 may be configured to generate a regular expression matched with the content of the page element to be crawled by setting a regular expression template matched with the page element to be crawled, where the regular expression is used to record a display rule of the page element to be crawled in the page to be crawled;
the configuration unit 34 may be configured to configure the page crawling rule of the page to be crawled according to the display rule of the page element to be crawled in the page to be crawled and the position information of the page element to be crawled in the page to be crawled.
The embodiment of the invention provides a device for configuring page crawling rules. Page elements to be crawled are selected from the page for which crawling rules need to be configured, and a page crawling rule file is configured according to the display rules of the page elements to be crawled in the page to be crawled and the position information of the page elements to be crawled in the page to be crawled, so that the page crawling rule of the page to be crawled is generated automatically and the content of the page to be crawled is then crawled according to it. In the prior art, page crawling rules are configured by manually writing the path information of page elements and the regular expressions used to verify the crawled content. By contrast, the embodiment of the invention generates the path information of the page elements by querying the attribute information corresponding to the page elements to be crawled, and generates the preset regular expression of the page elements by setting the keywords corresponding to the page elements to be crawled, so that the page crawling rule is configured from the path information and the preset regular expression. The user does not need to write crawling rule code while the page crawling rule is generated, which increases the speed at which crawling rules are generated.
Further, as shown in fig. 4, the apparatus further includes:
the determining unit 35 may be configured to start a browser plug-in before selecting a page element to be crawled from a page in which crawling rules need to be configured, and determine page content in which the crawling rules need to be configured from the browser plug-in.
Further, the first generation unit 32 includes:
the first generating module 321 may be configured to generate, by querying attribute information corresponding to the page element to be crawled, relative path information of the page element in each upper-level element;
the second generating module 322 may be configured to generate position information of the page element in the page to be crawled according to the relative path information of the page element in each upper-level element.
Further, the first generation unit 32 further includes:
the filtering module 323 may be configured to filter the attribute temporarily generated in the attribute information corresponding to the page element to be crawled before generating the relative path information of the page element in each upper-level element by querying the class attribute and the tag attribute corresponding to the page element to be crawled.
Further, the second generating unit 33 includes:
the setting module 331 may be configured to set a regular expression template matched with the page element to be crawled according to a keyword corresponding to the page element to be crawled;
the third generating module 332 may be configured to generate a regular expression matched with the content of the page element to be crawled by using the regular expression template as a matching rule.
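The division of the device into units can be pictured as a small composition of callables. This is only a sketch: the `RuleConfigurator` class, the `Element` type, and the lambda implementations are hypothetical stand-ins, and the source does not prescribe any particular software structure for units 31 to 34.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical element model: attribute name -> value, plus an optional "keyword".
Element = Dict[str, str]


@dataclass
class RuleConfigurator:
    """Wires together stand-ins for units 31 to 34."""
    select: Callable[[List[Element]], List[Element]]  # selecting unit 31
    make_path: Callable[[Element], str]               # first generating unit 32
    make_regex: Callable[[Element], str]              # second generating unit 33

    def configure(self, page: List[Element]) -> List[dict]:
        """Configuration unit 34: pair each element's path with its regular expression."""
        return [
            {"path": self.make_path(el), "regex": self.make_regex(el)}
            for el in self.select(page)
        ]


configurator = RuleConfigurator(
    select=lambda page: [el for el in page if "keyword" in el],
    make_path=lambda el: "div." + el["class"],
    make_regex=lambda el: el["keyword"] + r"\s*:\s*(\S+)",
)
page = [{"class": "from", "keyword": "Source"}, {"class": "banner"}]
print(configurator.configure(page))
```

Passing the units in as callables keeps each one replaceable, for example swapping in a path generator that first filters temporary attributes, as the filtering module 323 does.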
According to this further device for configuring page crawling rules provided by the embodiment of the invention, page elements to be crawled are selected from the page for which crawling rules need to be configured, the page crawling rule of the page to be crawled is generated automatically from the display rules of the page elements to be crawled in the page to be crawled and the position information of the page elements to be crawled in the page to be crawled, and the content of the page to be crawled is then crawled according to the page crawling rule. In the prior art, page crawling rules are configured by manually writing the path information of page elements and the regular expressions used to verify the crawled content. By contrast, the embodiment of the invention generates the path information of the page elements by querying the attribute information corresponding to the page elements to be crawled, and generates the preset regular expression of the page elements by setting the keywords corresponding to the page elements to be crawled, so that the page crawling rule is configured from the path information and the preset regular expression. The user does not need to write crawling rule code while the page crawling rule is generated, which increases the speed at which crawling rules are generated.
The configuration device of the page crawling rule comprises a processor and a memory, wherein the selecting unit 31, the first generating unit 32, the second generating unit 33, the configuration unit 34 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided; by adjusting kernel parameters, the page crawling rule is generated automatically and the speed at which crawling rules are generated is increased.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium on which a program is stored, where the program implements a configuration method of the page crawling rule when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the configuration method of the page crawling rule is executed when the program runs.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps:
a method for configuring page crawling rules comprises the following steps: selecting page elements to be crawled from pages needing to be configured with crawling rules; generating path information of page elements according to attribute information corresponding to the page elements to be crawled, wherein the path information of the page elements is used for recording position information of the page elements to be crawled in the page to be crawled; generating a regular expression matched with the content of the page element to be crawled by setting a regular expression template matched with the page element to be crawled, wherein the regular expression is used for recording the display rule of the page element to be crawled in the page to be crawled; and configuring the page crawling rule of the page to be crawled according to the display rule of the page element to be crawled in the page to be crawled and the position information of the page element to be crawled in the page to be crawled.
Further, before determining the page elements to be crawled from the page needing to configure the crawling rules, the method further comprises: and starting a browser plug-in, and determining page content needing to be configured with crawling rules from the browser plug-in.
Further, the generating, according to the attribute information corresponding to the page element to be crawled, path information of the page element includes: generating relative path information of the page elements in each upper-level element by inquiring attribute information corresponding to the page elements to be crawled; and generating position information of the page elements in the page to be crawled according to the relative path information of the page elements in each superior element.
Further, before the generating the relative path information of the page element in each upper-level element by querying the attribute information corresponding to the page element to be crawled, the method further includes: and filtering the temporarily generated attributes in the attribute information corresponding to the page element to be crawled.
Further, the generating of the regular expression matched with the content of the page element to be crawled by setting the regular expression template matched with the page element to be crawled comprises: setting a regular expression template matched with the page element to be crawled according to the keyword corresponding to the page element to be crawled; and generating a regular expression matched with the content of the page element to be crawled by taking the regular expression template as a matching rule.
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: selecting page elements to be crawled from pages needing to be configured with crawling rules; generating path information of page elements according to attribute information corresponding to the page elements to be crawled, wherein the path information of the page elements is used for recording position information of the page elements to be crawled in the page to be crawled; generating a regular expression matched with the content of the page element to be crawled by setting a regular expression template matched with the page element to be crawled, wherein the regular expression is used for recording the display rule of the page element to be crawled in the page to be crawled; and configuring the page crawling rule of the page to be crawled according to the display rule of the page element to be crawled in the page to be crawled and the position information of the page element to be crawled in the page to be crawled.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (7)

1. A method for configuring page crawling rules is characterized by comprising the following steps:
selecting page elements to be crawled from pages needing to be configured with crawling rules;
generating path information of page elements according to the attribute information corresponding to the page elements to be crawled, wherein the path information of the page elements is used for recording the position information of the page elements to be crawled in the page to be crawled, and specifically comprises the following steps: generating relative path information of the page elements in each upper-level element by inquiring attribute information corresponding to the page elements to be crawled; generating position information of the page elements in the page to be crawled according to the relative path information of the page elements in each superior element;
generating a regular expression matched with the content of the page element to be crawled by setting a regular expression template matched with the page element to be crawled, wherein the regular expression is used for recording the display rule of the page element to be crawled in the page to be crawled;
configuring a page crawling rule of the page to be crawled according to a display rule of the page element to be crawled in the page to be crawled and position information of the page element to be crawled in the page to be crawled;
wherein generating the regular expression matched with the content of the page element to be crawled by setting the regular expression template matched with the page element to be crawled comprises:
setting a regular expression template matched with the page element to be crawled according to the keyword corresponding to the page element to be crawled;
and generating a regular expression matched with the content of the page element to be crawled by taking the regular expression template as a matching rule.
2. The method according to claim 1, wherein before determining page elements to be crawled from the page requiring configuration of crawling rules, the method further comprises:
and starting a browser plug-in, and determining page content needing to be configured with crawling rules from the browser plug-in.
3. The method according to claim 1, wherein before generating the relative path information of the page elements in each upper-level element by querying the attribute information corresponding to the page elements to be crawled, the method further comprises:
and filtering the temporarily generated attributes in the attribute information corresponding to the page element to be crawled.
4. An apparatus for configuring page crawling rules, comprising:
the selecting unit is used for selecting page elements to be crawled from the pages needing to be configured with the crawling rules;
the first generating unit is used for generating path information of page elements according to attribute information corresponding to the page elements to be crawled, wherein the path information of the page elements is used for recording position information of the page elements to be crawled in the page to be crawled;
the second generation unit is used for generating a regular expression matched with the content of the page element to be crawled by setting a regular expression template matched with the page element to be crawled, and the regular expression is used for recording the display rule of the page element to be crawled in the page to be crawled;
the configuration unit is used for configuring page crawling rules of the page to be crawled according to the display rules of the page elements to be crawled in the page to be crawled and the position information of the page elements to be crawled in the page to be crawled;
the second generation unit includes:
the setting module is used for setting a regular expression template matched with the page element to be crawled according to the keyword corresponding to the page element to be crawled;
the third generation module is used for generating a regular expression matched with the content of the page element to be crawled by taking the regular expression template as a matching rule;
the first generation unit includes:
the first generation module is used for generating relative path information of the page elements in each upper-level element by inquiring the attribute information corresponding to the page elements to be crawled;
and the second generation module is used for generating the position information of the page elements in the page to be crawled according to the relative path information of the page elements in each superior element.
5. The apparatus of claim 4, further comprising:
a determining unit, configured to start a browser plug-in before a page element to be crawled is selected from a page needing to be configured with a crawling rule, and to determine the page content needing to be configured with the crawling rule from the browser plug-in.
6. A storage medium, characterized in that the storage medium comprises a stored program, wherein when the program runs, a device in which the storage medium is located is controlled to execute the method for configuring the page crawling rule according to any one of claims 1 to 3.
7. A processor, configured to execute a program, wherein the program executes the method for configuring the page crawling rule according to any one of claims 1 to 3.
CN201710884074.XA 2017-09-26 2017-09-26 Method and device for configuring page crawling rules Active CN110020068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710884074.XA CN110020068B (en) 2017-09-26 2017-09-26 Method and device for configuring page crawling rules

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710884074.XA CN110020068B (en) 2017-09-26 2017-09-26 Method and device for configuring page crawling rules

Publications (2)

Publication Number Publication Date
CN110020068A CN110020068A (en) 2019-07-16
CN110020068B CN110020068B (en) 2021-10-15

Family

ID=67186380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710884074.XA Active CN110020068B (en) 2017-09-26 2017-09-26 Method and device for configuring page crawling rules

Country Status (1)

Country Link
CN (1) CN110020068B (en)



Also Published As

Publication number Publication date
CN110020068A (en) 2019-07-16


Legal Events

Date Code Title Description
PB01 Publication
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

SE01 Entry into force of request for substantive examination
GR01 Patent grant