CN109271145B

CN109271145B - Quick rule customizing method based on pythonQT and intelligent algorithm

Info

Publication number: CN109271145B
Application number: CN201811019150.1A
Authority: CN
Inventors: 邢航; 李森; 汪明
Original assignee: Ustc Sinovate Software Co ltd
Current assignee: Ustc Sinovate Software Co ltd
Priority date: 2018-09-03
Filing date: 2018-09-03
Publication date: 2021-12-14
Anticipated expiration: 2038-09-03
Also published as: CN109271145A

Abstract

The invention discloses a quick rule customization method based on pythonQT and an intelligent algorithm, and relates to the technical field of webpage rule customization. The method comprises: inputting the URL of a page to be crawled, whereupon a client loads the page through the URL; extracting the navigation list items in the page using Selenium; extracting the text part of the detail page through an intelligent algorithm; acquiring page element rules from the page through JavaScript and returning the rules to the client; and uploading the rules to a server, whereupon a background crawler program crawls according to the rules. The navigation list items are extracted with Selenium by filtering out <a> tags whose ordinate is larger than the browser height and <a> tags whose abscissa is shared by fewer tags than a reference value; the text part of the detail page is then extracted through the intelligent algorithm. This solves the problem that customizing rules by manually analyzing websites is unsuitable for rule customization across numerous and complicated website pages; the method is applicable to webpage rule customization for different websites and improves the efficiency of webpage rule customization.

Description

Quick rule customizing method based on pythonQT and intelligent algorithm

Technical Field

The invention belongs to the technical field of webpage rule customization, and particularly relates to a quick rule customization method based on pythonQT and an intelligent algorithm.

Background

With the rapid development of big data technology, data, as its fundamental object of study, plays an increasingly important role. How to acquire data efficiently and quickly has become one of the important issues of current research. Since the crawler is the basic technical means of acquiring Internet data, acquiring data efficiently necessarily requires optimizing and improving the current technology. The basic procedure by which existing crawlers acquire webpage data is as follows:

(1) Initiating a request. Given a target webpage address, the crawler initiates a request for the page, i.e., sends a Request, which may contain additional information such as headers.

(2) Acquiring the response content. If the server responds normally, a Response is obtained, and the content of the Response is the content of the page to be acquired.

(3) Analyzing the content. After the webpage content is obtained, the crawler analyzes the page structure and crawls the specified content.

(4) Saving the data. The data can be saved in different forms: as text, in a database, or as a file in a specific format.
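The four steps above can be illustrated with a minimal sketch using only the Python standard library. It is not part of the disclosed method: the names `TitleParser`, `fetch`, and `crawl`, the `User-Agent` value, and the choice of the page title as the "specified content" are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Step 3: collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Steps 1-2: send a Request with an extra header and read the Response."""
    # The header value is a hypothetical example.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(url, fetch_page=fetch, store=None):
    """Run the four steps; `store` stands in for a file or a database."""
    store = {} if store is None else store
    html = fetch_page(url)              # steps 1-2: request and response
    parser = TitleParser()
    parser.feed(html)                   # step 3: analyze the content
    store[url] = parser.title.strip()   # step 4: save the data
    return store
```

The hard-coded parsing logic in `TitleParser` plays the role of the hand-written rule: it has to be rewritten for every site whose structure differs.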

For the content analysis in step (3), the commonly used approach at present is that a technician manually analyzes the page structure of a website in a browser and writes specific rules, after which a background crawler crawls the page data according to those rules. This is acceptable when a single website is crawled; for a large number of websites, however, extracting page rules one by one is clearly impractical: it is time-consuming and labor-intensive, and it creates an efficiency bottleneck. The root cause is that there is no unified tool for obtaining the element rules of the pages to be crawled.

The invention aims to develop a quick rule customization method based on pythonQT and an intelligent algorithm, to solve the problems that the existing approach of customizing webpage rules by manually analyzing websites is unsuitable for rule customization across numerous and complicated website pages, and is time-consuming, labor-intensive and inefficient.

Disclosure of Invention

The invention aims to provide a quick rule customization method based on pythonQT and an intelligent algorithm, in which the navigation list items in a page are extracted using Selenium and non-conforming tags are filtered out, while the text part of the detail page is extracted through an intelligent algorithm. This enables webpage rule customization for a variety of different websites and solves the problems that the existing approach of customizing webpage rules by manually analyzing websites is unsuitable for rule customization across numerous and complicated website pages, and is time-consuming, labor-intensive and inefficient.

In order to solve the technical problems, the invention is realized by the following technical scheme:

The invention is a quick rule customization method based on pythonQT and an intelligent algorithm, comprising the following steps:

S00: inputting the URL of a page to be crawled, the client loading the page through the URL;

S01: extracting the navigation list items in the page using Selenium;

S02: extracting the text part of the detail page through an intelligent algorithm;

S03: acquiring page element rules from the page through JavaScript, and returning the rules to the client;

S04: uploading the rules to a server, the background crawler program crawling according to the rules;

the specific process of extracting the navigation list items in the page in S01 is as follows:

A00: marking the visible <a> tags in the page using Selenium;

A01: filtering the marked <a> tags;

the specific process of extracting the text part of the detail page through the intelligent algorithm in S02 is as follows:

C00: removing the HTML tags in the detail page to obtain plain text;

C01: setting the line-block size M (in lines) and the line-block character-count threshold, and calculating the character count N of each line block;

C02: plotting a line-block curve with the line number M as the abscissa and the line-block character count N as the ordinate;

C03: finding the sudden-drop point and the sudden-rise point, and determining the text area.
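Steps C00 to C02 can be sketched as follows. This is an illustrative reading rather than the patented implementation: the regular expressions, the helper names `strip_tags` and `line_block_counts`, the decision to ignore whitespace when counting characters, and the use of a sliding window of M consecutive lines as the "line block" are assumptions. The curve of C02 is represented simply as the list of counts indexed by starting line.

```python
import re


def strip_tags(html):
    """C00: drop scripts, styles, comments and tags, keeping line breaks."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    text = re.sub(r"<[^>]*>", "", html)
    return [line.strip() for line in text.split("\n")]


def line_block_counts(lines, block_size):
    """C01: counts[i] is the character count N of the block of `block_size`
    lines starting at line i (whitespace is not counted, by assumption)."""
    sizes = [len("".join(line.split())) for line in lines]
    return [
        sum(sizes[i:i + block_size])
        for i in range(len(lines) - block_size + 1)
    ]


def line_block_curve(html, block_size):
    """C02: the curve as (line index, character count) pairs."""
    counts = line_block_counts(strip_tags(html), block_size)
    return list(enumerate(counts))
```

Navigation and footer lines tend to be short and isolated, so their blocks produce small counts, while consecutive body paragraphs produce a sustained plateau of large counts.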

Preferably, the filtering in A01 is performed as follows:

T00: traversing the <a> tags selected in A00, calculating the coordinate position of each tag, and storing the abscissa of each tag in a tag coordinate array;

T01: filtering out <a> tags whose ordinate is larger than the browser height;

T02: presetting a reference value representing the number of items in the page navigation list;

T03: judging whether the number of occurrences of an identical abscissa in the tag coordinate array is smaller than the reference value; if so, deleting the <a> tags corresponding to that abscissa.
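The filtering in T00 to T03 can be sketched as a pure function over tag coordinates. In a Selenium session the coordinates could be read from each element's `location` property after checking `is_displayed()`; here the tags are assumed to be already available as `(label, x, y)` tuples. The function name, the tuple layout, and the choice to count abscissas after the height filter of T01 are assumptions.

```python
from collections import Counter


def filter_nav_anchors(anchors, browser_height, reference_count):
    """Keep the <a> tags that look like navigation list items.

    anchors: list of (label, x, y) tuples, one per visible <a> tag.
    """
    # T01: drop tags whose ordinate is larger than the browser height.
    visible = [a for a in anchors if a[2] <= browser_height]
    # T00: the "tag coordinate array" is reduced to a count per abscissa.
    x_counts = Counter(x for _, x, _ in visible)
    # T03: drop tags whose abscissa occurs fewer than reference_count times.
    return [a for a in visible if x_counts[a[1]] >= reference_count]
```

For example, with a browser height of 768 and a reference value of 3, three links aligned at x = 10 are kept, while a lone link at x = 300 and a footer link at y = 900 are removed. The heuristic relies on navigation items being left-aligned in a column, so a horizontal navigation bar would need the roles of the two coordinates swapped.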

Preferably, the specific process of acquiring the page element rules from the page through JavaScript in S03 is as follows:

P00: the client starts a web server;

P01: when the page URL is loaded, a form is added to the page; a click event is set on clickable elements, so that clicking an element collects the attribute values of that element, which are then submitted to the web server through the form;

P02: the web server receives the data sent by the form and displays the data on the client; the client stores the rules and synchronizes them to the server.
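The server side of this exchange (P00 and P02) can be sketched with Python's standard `http.server` module. The sketch assumes the in-page form posts URL-encoded fields such as `tag`, `id` and `class`; these field names, the `RULES` list, and the port number are hypothetical, and the page-side script of P01 is not shown.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

RULES = []  # rules received so far; stands in for the client's rule store


def parse_rule_form(body):
    """Decode a URL-encoded form body into a flat {attribute: value} dict."""
    fields = parse_qs(body.decode("utf-8"), keep_blank_values=True)
    return {name: values[0] for name, values in fields.items()}


class RuleHandler(BaseHTTPRequestHandler):
    """P02: receive the element attributes posted by the in-page form."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        RULES.append(parse_rule_form(self.rfile.read(length)))
        self.send_response(204)  # nothing to send back to the page
        self.end_headers()


def serve(port=8000):
    """P00: start the local web server (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), RuleHandler).serve_forever()
```

Binding to 127.0.0.1 keeps the server reachable only from the client machine, which matches a setup in which the page is loaded inside the client's own embedded browser.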

Preferably, the line-block size M satisfies 1 < M < total number of text lines, and the line-block character count N is the total number of characters in the line block.

Preferably, the sudden-drop point is identified as the first point for which the ordinate of the next point is 0; the sudden-rise point is identified as a point for which the ordinate of the next point is larger than the line-block character-count threshold; and the text area is the text between the sudden-rise point and the sudden-drop point.
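The two criteria can be sketched as a single scan over the line-block character counts, given as a list indexed by line number. The function name `find_text_region`, the choice of the first qualifying point as the sudden-rise point, and the fallback of extending the region to the end of the list when no sudden-drop point follows are assumptions.

```python
def find_text_region(counts, threshold):
    """Return (rise, drop) indices into `counts`, or None if no rise is found.

    rise: first point whose next point has a count larger than `threshold`.
    drop: first point after `rise` whose next point has a count of 0.
    """
    rise = None
    for i in range(len(counts) - 1):
        if rise is None:
            if counts[i + 1] > threshold:
                rise = i
        elif counts[i + 1] == 0:
            return rise, i
    if rise is not None:
        # Assumption: without a drop point the text runs to the last block.
        return rise, len(counts) - 1
    return None
```

For the counts `[1, 1, 10, 13, 3, 0]` and a threshold of 5, the scan returns `(1, 4)`: the curve rises after line 1 and falls to zero after line 4, so the text area covers the lines in between.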

The invention has the following beneficial effects:

The method extracts the navigation list items in a page using Selenium, filtering out <a> tags whose ordinate is larger than the browser height and <a> tags whose abscissa is shared by fewer tags than a reference value; the text part of the detail page is then extracted through an intelligent algorithm. This solves the problem that customizing rules by manually analyzing websites is unsuitable for rule customization across numerous and complicated website pages; the method is applicable to webpage rule customization for different websites and improves the efficiency of webpage rule customization.

Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the quick rule customization method based on pythonQT and an intelligent algorithm of the present invention;

FIG. 2 is a flowchart of extracting the text part of the detail page through the intelligent algorithm in S02 of the present invention;

FIG. 3 is a flowchart of acquiring page element rules from a page through JavaScript in S03 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, the present invention is a quick rule customization method based on pythonQT and an intelligent algorithm, including the following steps:

S01: extracting the navigation list items in the page using Selenium;

A00: marking the visible <a> tags in the page using Selenium;

A01: filtering the marked <a> tags;

Referring to FIG. 2, the specific process of extracting the text part of the detail page through the intelligent algorithm in S02 is as follows:

C00: removing the HTML tags in the detail page to obtain plain text;

The filtering in A01 is performed as follows:

T01: filtering out <a> tags whose ordinate is larger than the browser height;

Referring to FIG. 3, the specific process of acquiring the page element rules from the page through JavaScript in S03 is as follows:

P00: the client starts a web server;

The line-block size M satisfies 1 < M < total number of text lines; the line-block character count N is the total number of characters in the line block.

The sudden-drop point is identified as the first point for which the ordinate of the next point is 0; the sudden-rise point is identified as a point for which the ordinate of the next point is larger than the line-block character-count threshold; the text area is the text between the sudden-rise point and the sudden-drop point.

It should be noted that, in the above system embodiment, each included unit is only divided according to functional logic, but is not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

In addition, those skilled in the art can understand that all or part of the steps in the method for implementing the embodiments described above can be implemented by a program to instruct the relevant hardware.

The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A quick rule customization method based on pythonQT and an intelligent algorithm, characterized by comprising the following steps:

S01: extracting the navigation list items in the page using Selenium;

A00: marking the visible <a> tags in the page using Selenium;

A01: filtering the marked <a> tags;

C00: removing the HTML tags in the detail page to obtain plain text;

C03: finding the sudden-drop point and the sudden-rise point, and determining the text area;

the filtering in A01 is performed as follows:

T01: filtering out <a> tags whose ordinate is larger than the browser height;

2. The quick rule customization method based on pythonQT and an intelligent algorithm according to claim 1, characterized in that the specific process of acquiring the page element rules from the page through JavaScript in S03 is as follows:

P00: the client starts a web server;

3. The quick rule customization method based on pythonQT and an intelligent algorithm according to claim 1, characterized in that the line-block size M satisfies 1 < M < total number of text lines, and the line-block character count N is the total number of characters in the line block.

4. The quick rule customization method based on pythonQT and an intelligent algorithm according to claim 1, characterized in that the sudden-drop point is identified as the first point for which the ordinate of the next point is 0; the sudden-rise point is identified as a point for which the ordinate of the next point is larger than the line-block character-count threshold; and the text area is the text between the sudden-rise point and the sudden-drop point.