CN109271145B - Quick rule customizing method based on pythonQT and intelligent algorithm - Google Patents
Quick rule customizing method based on pythonQT and intelligent algorithm
- Publication number: CN109271145B (application CN201811019150.1A)
- Authority: CN (China)
- Prior art keywords: page, rule, intelligent algorithm, point, client
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/20—Software design
- G06F8/24—Object-oriented
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a fast rule customization method based on pythonQT and an intelligent algorithm, and relates to the technical field of webpage rule customization. The method comprises: inputting the URL of a page to be crawled, whereupon a client loads the page through the URL; extracting the navigation list items in the page with Selenium; extracting the body text of the detail page through an intelligent algorithm; acquiring page element rules from the page through JavaScript and returning the rules to the client; and uploading the rules to a server, where a background crawler program crawls according to the rules. The navigation list items are extracted with Selenium by filtering out <a> tags whose ordinate is larger than the browser height and <a> tags whose abscissa is shared by fewer tags than a reference value; the body text of the detail page is then extracted through the intelligent algorithm. This addresses the problem that customizing rules by manually analyzing each website does not suit rule customization across many complex websites, so the method is applicable to webpage rule customization for different websites and improves the efficiency of webpage rule customization.
Description
Technical Field
The invention belongs to the technical field of webpage rule customization, and particularly relates to a quick rule customization method based on pythonQT and an intelligent algorithm.
Background
With the rapid development of big data technology, data, as its fundamental object of study, plays an increasingly important role. How to acquire data efficiently and quickly has become one of the important issues of current research. Crawlers are the basic technical means of acquiring internet data, so acquiring data efficiently requires optimizing and improving current crawler technology. The basic idea of an existing crawler for acquiring webpage data is as follows:
(1) Initiating a request. Given a target webpage address, the crawler initiates a request for the page, i.e., sends a Request, which may contain additional headers and other information.
(2) Acquiring the response content. If the server responds normally, a Response is obtained, and the content of the Response is the content of the page to be acquired.
(3) Analyzing the content. After the webpage content is obtained, the crawler analyzes the page structure and crawls the specified content.
(4) Saving the data. The data can be saved in different forms: as text, in a database, or as a file in a specific format.
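The four steps above can be sketched in Python as follows. This is only a minimal illustration using the standard library, not the crawler described by the invention; the function names (`fetch`, `parse_title`, `save`, `crawl`) are hypothetical, and extracting only the page title stands in for whatever content a real rule would specify.

```python
import json
import re
import urllib.request


def fetch(url, headers=None):
    """Steps (1)-(2): send a Request and return the Response body as text."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_title(html):
    """Step (3): analyze the content; here only the <title> is extracted."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return match.group(1).strip() if match else None


def save(record, path):
    """Step (4): save the data, here as a JSON text file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, ensure_ascii=False)


def crawl(url, path, fetcher=fetch):
    """Run the four steps; `fetcher` is injectable so the flow can be tried offline."""
    html = fetcher(url)
    record = {"url": url, "title": parse_title(html)}
    save(record, path)
    return record
```

The hand-written `parse_title` is exactly the kind of per-site rule that, according to the text that follows, has to be worked out manually for every website.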
For the content analysis in step (3), the currently common technical means is that a technician manually analyzes the page structure of a website through a browser and specifies concrete rules, after which a background crawler crawls the page data according to those rules. This may be acceptable when a single website is crawled; for a large number of websites, however, extracting page rules one by one is clearly impractical, being time-consuming and labor-intensive, and it creates an efficiency bottleneck. The root cause is that there is no unified tool for acquiring the element rules of the pages to be crawled.
The invention aims to develop a quick rule customization method based on pythonQT and an intelligent algorithm, to solve the problems that the existing approach of customizing webpage rules by manually analyzing each website is unsuitable for rule customization across many complex websites and is time-consuming, labor-intensive, and inefficient.
Disclosure of Invention
The invention aims to provide a quick rule customization method based on pythonQT and an intelligent algorithm, in which the navigation list items in a page are extracted with Selenium and non-conforming tags are filtered out, while the body text of the detail page is extracted through an intelligent algorithm. This realizes webpage rule customization for many different websites and solves the problems that the existing approach of customizing webpage rules by manually analyzing each website is unsuitable for rule customization across many complex websites and is time-consuming, labor-intensive, and inefficient.
In order to solve the technical problems, the invention is realized by the following technical scheme:
The invention relates to a quick rule customization method based on pythonQT and an intelligent algorithm, which comprises the following steps:
S00: inputting the URL of a page to be crawled, and loading the page through the URL by a client;
S01: extracting the navigation list items in the page with Selenium;
S02: extracting the body text of the detail page through an intelligent algorithm;
S03: acquiring page element rules from the page through JavaScript, and returning the rules to the client;
S04: uploading the rules to a server, and crawling by a background crawler program according to the rules.
The specific process of extracting the navigation list items in the page in S01 is as follows:
A00: marking the visible <a> tags in the page with Selenium;
A01: filtering the marked <a> tags.
The specific process of extracting the body text of the detail page through the intelligent algorithm in S02 is as follows:
C00: removing the HTML tags in the detail page to obtain plain text;
C01: setting the line-block size M (in lines) and the line-block character-count threshold, and calculating the number N of characters in each line block;
C02: drawing a line-block curve with the line number M as the abscissa and the character count N as the ordinate;
C03: acquiring the sudden-drop point and the sudden-rise point and confirming the body text area.
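Steps C00 and C01 can be illustrated with a short Python sketch. It is an assumption-laden reading of the text rather than the patented implementation: the helper names `strip_tags` and `block_counts` are hypothetical, `<script>` and `<style>` bodies are skipped as a design choice the source does not mention, and a line block is taken to be a sliding window of `m` consecutive lines.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_tags(html):
    """C00: remove HTML tags; return the plain text as a list of trimmed lines."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return [line.strip() for line in "".join(parser.parts).split("\n")]


def block_counts(lines, m):
    """C01: character count N of each block of m consecutive lines (assumed sliding window)."""
    return [sum(len(line) for line in lines[i:i + m])
            for i in range(len(lines) - m + 1)]
```

Plotting the list returned by `block_counts` against its index gives a curve of the kind step C02 describes: long runs of body text show up as a high plateau, while navigation and footer lines stay near zero.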
Preferably, the filtering in A01 is performed as follows:
T00: traversing the <a> tags selected in A00, calculating the coordinate position of each tag, and storing the tag's abscissa in a tag coordinate array;
T01: filtering out <a> tags whose ordinate is larger than the browser height;
T02: presetting a reference value representing the number of items in a page navigation list;
T03: judging whether the number of occurrences of the same abscissa in the tag coordinate array is smaller than the reference value; if so, deleting the <a> tags corresponding to that abscissa.
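The filtering steps T01 to T03 amount to a height cut followed by a frequency count on the abscissa. The Python sketch below shows one possible reading; the function name `filter_nav_links` and the `(x, y, payload)` tuple layout are hypothetical, and counting abscissas only after the height filter is an assumption, since the source does not fix the order.

```python
from collections import Counter


def filter_nav_links(links, browser_height, reference):
    """links: list of (x, y, payload) tuples for the visible <a> tags.

    T01: drop tags whose ordinate exceeds the browser height.
    T02/T03: drop tags whose abscissa occurs fewer than `reference` times.
    """
    on_screen = [link for link in links if link[1] <= browser_height]
    x_counts = Counter(x for x, _, _ in on_screen)
    return [link for link in on_screen if x_counts[link[0]] >= reference]
```

The intuition is that the entries of a vertical navigation list share the same left edge, so an abscissa that occurs many times is likely to belong to such a list, whereas an isolated link (a logo, a login button) is not.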
Preferably, the specific process of acquiring page element rules from the page through JavaScript in S03 is as follows:
P00: the client starts a web server;
P01: when the page URL is loaded, a form is added to the page; a click event is set on the clickable elements, and when an element is clicked, each of its attribute values is acquired and then submitted to the web server through the form;
P02: the web server receives the data sent by the form and displays it on the client; the client stores the rules and synchronizes them to the server.
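The server side of step P02 can be sketched as a function that turns the submitted form body into a rule. This is illustrative only: the field names `tag`, `id`, and `class`, the function name `rule_from_form`, and the choice of a CSS selector as the rule format are all assumptions, since the source says only that the attribute values of the clicked element are submitted.

```python
from urllib.parse import parse_qs


def rule_from_form(body):
    """Turn a form-encoded request body into a rule dictionary.

    The field names (tag, id, class) are hypothetical placeholders for
    the attribute values of the clicked element.
    """
    fields = {key: values[0] for key, values in parse_qs(body).items()}
    selector = fields.get("tag", "*")
    if fields.get("id"):
        selector += "#" + fields["id"]
    for name in fields.get("class", "").split():
        selector += "." + name
    return {"attributes": fields, "css_selector": selector}
```

For instance, a click on `<div id="content" class="article main">` submitted as `tag=div&id=content&class=article+main` would yield the selector `div#content.article.main`, which the client could then store and synchronize to the server.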
Preferably, the line-block size M satisfies 1 < M < total number of text lines, and the line-block character count N is the total number of characters in the line block.
Preferably, the sudden-drop point is the first point for which the ordinate of the next point is 0; the sudden-rise point is a point for which the ordinate of the next point is larger than the line-block character-count threshold; and the body text area is the text between the sudden-rise point and the sudden-drop point.
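The sudden-rise and sudden-drop criteria can be applied to a list of line-block character counts as in the following Python sketch. The function name `find_text_region` is hypothetical, and two details are assumptions: the region is taken to start at the block that follows the sudden-rise point, and it runs to the end of the list when no sudden-drop point is found.

```python
def find_text_region(counts, threshold):
    """Return (start, end) indices of the body text in `counts`, or None.

    start: the block that follows the sudden-rise point, i.e. the first
           block whose character count exceeds `threshold`.
    end:   the first sudden-drop point at or after start, i.e. the first
           block whose successor has a count of 0 (or the last block).
    """
    start = next((i for i, n in enumerate(counts) if n > threshold), None)
    if start is None:
        return None
    end = len(counts) - 1
    for i in range(start, len(counts) - 1):
        if counts[i + 1] == 0:
            end = i
            break
    return start, end
```

With counts `[3, 0, 0, 120, 200, 90, 0, 0, 5]` and a threshold of 50, the region spans indices 3 to 5: the curve rises above the threshold at index 3 and falls to zero after index 5.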
The invention has the following beneficial effects:
The method extracts the navigation list items in a page with Selenium by filtering out <a> tags whose ordinate is larger than the browser height and <a> tags whose abscissa is shared by fewer tags than a reference value; the body text of the detail page is then extracted through an intelligent algorithm. This addresses the problem that customizing rules by manually analyzing each website does not suit rule customization across many complex websites, so the method is applicable to webpage rule customization for different websites and improves the efficiency of webpage rule customization.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a fast rule customization method based on pythonQT and an intelligent algorithm of the present invention;
FIG. 2 is a flowchart of the extraction of the body part of the detail page by the intelligent algorithm in S02 according to the present invention;
FIG. 3 is a flowchart illustrating the process of obtaining page element rules from a page through js technique in S03 according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the present invention is a fast rule customization method based on pythonQT and an intelligent algorithm, including the following steps:
S00: inputting the URL of a page to be crawled, and loading the page through the URL by a client;
S01: extracting the navigation list items in the page with Selenium;
S02: extracting the body text of the detail page through an intelligent algorithm;
S03: acquiring page element rules from the page through JavaScript, and returning the rules to the client;
S04: uploading the rules to a server, and crawling by a background crawler program according to the rules.
The specific process of extracting the navigation list items in the page in S01 is as follows:
A00: marking the visible <a> tags in the page with Selenium;
A01: filtering the marked <a> tags.
Referring to FIG. 2, the specific process of extracting the body text of the detail page through the intelligent algorithm in S02 is as follows:
C00: removing the HTML tags in the detail page to obtain plain text;
C01: setting the line-block size M (in lines) and the line-block character-count threshold, and calculating the number N of characters in each line block;
C02: drawing a line-block curve with the line number M as the abscissa and the character count N as the ordinate;
C03: acquiring the sudden-drop point and the sudden-rise point and confirming the body text area.
The filtering in A01 is performed as follows:
T00: traversing the <a> tags selected in A00, calculating the coordinate position of each tag, and storing the tag's abscissa in a tag coordinate array;
T01: filtering out <a> tags whose ordinate is larger than the browser height;
T02: presetting a reference value representing the number of items in a page navigation list;
T03: judging whether the number of occurrences of the same abscissa in the tag coordinate array is smaller than the reference value; if so, deleting the <a> tags corresponding to that abscissa.
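How the coordinates in A00 and T00 might be collected is sketched below in Python. The code is written against the Selenium 4 WebDriver interface (`find_elements(by, value)`, `WebElement.is_displayed()`, `WebElement.location`, `get_window_size()`) but does not import Selenium itself, so any object exposing those methods will do; the function names are hypothetical and the source does not state which Selenium calls the invention actually uses.

```python
def visible_anchor_positions(driver):
    """A00/T00: return (x, y, element) for every visible <a> tag.

    `driver` is expected to behave like a Selenium 4 WebDriver.
    """
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    anchors = driver.find_elements("css selector", "a")
    positions = []
    for element in anchors:
        if element.is_displayed():
            location = element.location  # dict with "x" and "y" keys
            positions.append((location["x"], location["y"], element))
    return positions


def browser_height(driver):
    """Height of the browser window in pixels, as needed by step T01."""
    return driver.get_window_size()["height"]
```

With a real browser, `driver` would typically be created with `selenium.webdriver.Chrome()` (or another driver class) and pointed at the page with `driver.get(url)` before these helpers are called.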
Referring to FIG. 3, the specific process of acquiring page element rules from the page through JavaScript in S03 is as follows:
P00: the client starts a web server;
P01: when the page URL is loaded, a form is added to the page; a click event is set on the clickable elements, and when an element is clicked, each of its attribute values is acquired and then submitted to the web server through the form;
P02: the web server receives the data sent by the form and displays it on the client; the client stores the rules and synchronizes them to the server.
The line-block size M satisfies 1 < M < total number of text lines, and the line-block character count N is the total number of characters in the line block.
The sudden-drop point is the first point for which the ordinate of the next point is 0; the sudden-rise point is a point for which the ordinate of the next point is larger than the line-block character-count threshold; and the body text area is the text between the sudden-rise point and the sudden-drop point.
It should be noted that, in the above system embodiment, each included unit is only divided according to functional logic, but is not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
In addition, those skilled in the art can understand that all or part of the steps in the method for implementing the embodiments described above can be implemented by a program to instruct the relevant hardware.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.
Claims (4)
1. A quick rule customization method based on pythonQT and an intelligent algorithm, characterized by comprising the following steps:
S00: inputting the URL of a page to be crawled, and loading the page through the URL by a client;
S01: extracting the navigation list items in the page with Selenium;
S02: extracting the body text of the detail page through an intelligent algorithm;
S03: acquiring page element rules from the page through JavaScript, and returning the rules to the client;
S04: uploading the rules to a server, and crawling by a background crawler program according to the rules;
the specific process of extracting the navigation list items in the page in S01 is as follows:
A00: marking the visible <a> tags in the page with Selenium;
A01: filtering the marked <a> tags;
the specific process of extracting the body text of the detail page through the intelligent algorithm in S02 is as follows:
C00: removing the HTML tags in the detail page to obtain plain text;
C01: setting the line-block size M (in lines) and the line-block character-count threshold, and calculating the number N of characters in each line block;
C02: drawing a line-block curve with the line number M as the abscissa and the character count N as the ordinate;
C03: acquiring the sudden-drop point and the sudden-rise point and confirming the body text area;
the filtering in A01 is performed as follows:
T00: traversing the <a> tags selected in A00, calculating the coordinate position of each tag, and storing the tag's abscissa in a tag coordinate array;
T01: filtering out <a> tags whose ordinate is larger than the browser height;
T02: presetting a reference value representing the number of items in a page navigation list;
T03: judging whether the number of occurrences of the same abscissa in the tag coordinate array is smaller than the reference value; if so, deleting the <a> tags corresponding to that abscissa.
2. The quick rule customization method based on pythonQT and an intelligent algorithm according to claim 1, wherein the specific process of acquiring page element rules from the page through JavaScript in S03 is as follows:
P00: the client starts a web server;
P01: when the page URL is loaded, a form is added to the page; a click event is set on the clickable elements, and when an element is clicked, each of its attribute values is acquired and then submitted to the web server through the form;
P02: the web server receives the data sent by the form and displays it on the client; the client stores the rules and synchronizes them to the server.
3. The quick rule customization method based on pythonQT and an intelligent algorithm according to claim 1, characterized in that the line-block size M satisfies 1 < M < total number of text lines, and the line-block character count N is the total number of characters in the line block.
4. The quick rule customization method based on pythonQT and an intelligent algorithm according to claim 1, wherein the sudden-drop point is the first point for which the ordinate of the next point is 0; the sudden-rise point is a point for which the ordinate of the next point is larger than the line-block character-count threshold; and the body text area is the text between the sudden-rise point and the sudden-drop point.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811019150.1A CN109271145B (en) | 2018-09-03 | 2018-09-03 | Quick rule customizing method based on pythonQT and intelligent algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109271145A (en) | 2019-01-25 |
CN109271145B (en) | 2021-12-14 |
Family
ID=65187780
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201811019150.1A (CN109271145B) | Active | 2018-09-03 | 2018-09-03 | Quick rule customizing method based on pythonQT and intelligent algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109271145B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110297962B (en) * | 2019-06-28 | 2021-08-24 | 北京金山安全软件有限公司 | Website resource crawling method, device, system and computer equipment |
CN113505288B (en) * | 2021-06-28 | 2023-08-01 | 南京大学 | Quick detection and positioning method based on statistics and pile positioning vision |
CN113987320B (en) * | 2021-11-24 | 2024-06-04 | 宁波深擎信息科技有限公司 | Real-time information crawler method, device and equipment based on intelligent page analysis |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105335382A (en) * | 2014-06-27 | 2016-02-17 | 优视科技有限公司 | Webpage text extraction method and device |
CN105930385A (en) * | 2016-04-13 | 2016-09-07 | 珠海迈科智能科技股份有限公司 | Data crawling method and system |
CN107463696A (en) * | 2017-08-15 | 2017-12-12 | 中译语通科技(北京)有限公司 | A kind of method of Webpage largest block extraction |
CN108334508A (en) * | 2017-01-19 | 2018-07-27 | 阿里巴巴集团控股有限公司 | The extracting method and device of webpage information |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10108535B2 (en) * | 2016-07-08 | 2018-10-23 | Accenture Global Solutions Limited | Web application test script generation to test software functionality |
Also Published As
Publication number | Publication date |
---|---|
CN109271145A (en) | 2019-01-25 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN108804512B (en) | Text classification model generation device and method and computer readable storage medium | |
CN105677764B (en) | Information extraction method and device | |
CN109271145B (en) | Quick rule customizing method based on pythonQT and intelligent algorithm | |
US9910842B2 (en) | Interactively predicting fields in a form | |
CN107730389A (en) | Electronic installation, insurance products recommend method and computer-readable recording medium | |
JP6827116B2 (en) | Web page clustering method and equipment | |
CN113705554A (en) | Training method, device and equipment of image recognition model and storage medium | |
CN110020312B (en) | Method and device for extracting webpage text | |
CN111538931A (en) | Big data-based public opinion monitoring method and device, computer equipment and medium | |
CN110633594A (en) | Target detection method and device | |
CN112001406A (en) | Text region detection method and device | |
CN111914159A (en) | Information recommendation method and terminal | |
CN112445915A (en) | Document map extraction method and device based on machine learning and storage medium | |
CN112650910A (en) | Method, device, equipment and storage medium for determining website update information | |
US10846462B2 (en) | Web page output selection | |
CN114359533A (en) | Page number identification method based on page text and computer equipment | |
CN110874570A (en) | Face recognition method, device, equipment and computer readable storage medium | |
CN107368923B (en) | Scenic spot heat prediction method and device | |
US10963690B2 (en) | Method for identifying main picture in web page | |
CN111581478A (en) | Cross-website general news acquisition method for specific subject | |
CN109101973B (en) | Character recognition method, electronic device and storage medium | |
CN107368464B (en) | Method and device for acquiring bidding product information | |
CN109948015B (en) | Meta search list result extraction method and system | |
CN112766269B (en) | Picture text retrieval method, intelligent terminal and storage medium | |
CN115270711A (en) | Electronic signature method, electronic signature device, electronic apparatus, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20190125; Assignee: Kedaduochuang cloud Technology Co.,Ltd.; Assignor: USTC SINOVATE SOFTWARE CO.,LTD.; Contract record no.: X2023980034512; Denomination of invention: A Fast Rule Customization Method Based on Python QT and Intelligent Algorithms; Granted publication date: 20211214; License type: Common License; Record date: 20230407 |