CN109271145B - Quick rule customizing method based on pythonQT and intelligent algorithm - Google Patents

Publication number
CN109271145B
Authority
CN
China
Prior art keywords
page
rule
intelligent algorithm
point
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811019150.1A
Other languages
Chinese (zh)
Other versions
CN109271145A (en)
Inventor
邢航
李森
汪明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ustc Sinovate Software Co ltd
Original Assignee
Ustc Sinovate Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ustc Sinovate Software Co ltd
Priority to CN201811019150.1A
Publication of CN109271145A
Application granted
Publication of CN109271145B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/20 Software design
    • G06F8/24 Object-oriented

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a fast rule customization method based on pythonQT and an intelligent algorithm, and relates to the technical field of webpage rule customization. The method comprises: inputting the URL of a page to be crawled, the client loading the page from that URL; extracting the navigation list items in the page using Selenium; extracting the main text of the detail page with an intelligent algorithm; obtaining page element rules from the page using JavaScript and returning the rules to the client; and uploading the rules to a server, where a background crawler program crawls according to the rules. The navigation list items are extracted with Selenium by filtering out <a> tags whose ordinate is greater than the browser height and <a> tags whose abscissa is shared by fewer tags than a reference value; the main text of the detail page is then extracted with the intelligent algorithm. This addresses the problem that customizing rules by manually analyzing each website does not scale to many complex websites, makes the method applicable to webpage rule customization for different websites, and improves the efficiency of webpage rule customization.

Description

Quick rule customizing method based on pythonQT and intelligent algorithm
Technical Field
The invention belongs to the technical field of webpage rule customization, and particularly relates to a quick rule customization method based on pythonQT and an intelligent algorithm.
Background
With the rapid development of big data technology, data, as its fundamental object of study, plays an increasingly important role. How to acquire data efficiently and quickly has become an important research question. Crawlers are the basic technical means of acquiring internet data, so acquiring data efficiently requires optimizing and improving current crawler technology. The basic procedure by which existing crawlers acquire webpage data is as follows:
(1) Initiating a request. Given a target webpage address, the crawler sends a Request for the page; the request may carry additional information such as headers.
(2) Acquiring the response. If the server responds normally, a Response is obtained, and its content is the content of the page to be acquired.
(3) Parsing the content. After the webpage content is obtained, the crawler analyzes the page structure and extracts the specified content.
(4) Saving the data. The data can be saved in different forms: as text, in a database, or as a file in a specific format.
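The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the crawler described in the patent: the `crawl` function, the `TitleParser` class, and the choice of the `<title>` text as the "specified content" are assumptions made for the example, and the page fetch is passed in as a callable so that the sketch runs without network access.

```python
import json
from html.parser import HTMLParser
from typing import Callable, List


class TitleParser(HTMLParser):
    """Step (3): collects the text inside <title> as the 'specified content'."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.titles: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


def crawl(url: str, fetch: Callable[[str], str]) -> str:
    """Runs steps (1)-(4) for one page and returns the saved record as JSON."""
    html = fetch(url)            # steps (1) and (2): request and response
    parser = TitleParser()
    parser.feed(html)            # step (3): parse the page structure
    # step (4): serialise the result; a real crawler would write it to a
    # file or a database instead of returning it
    return json.dumps({"url": url, "titles": parser.titles})


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request."""
    return "<html><head><title>Hello</title></head><body></body></html>"


record = json.loads(crawl("http://example.com/", fake_fetch))
```

In a real run, `fetch` could be a thin wrapper around `urllib.request.urlopen`. The hand-written parsing rule in `TitleParser` is exactly the per-site manual work that the method sets out to avoid.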
For the content parsing in step (3), the common practice today is for a technician to analyze the page structure of a website manually in a browser and write specific rules, after which a background crawler crawls the page data according to those rules. This is acceptable when a single website is crawled; for a large number of websites, however, extracting page rules one by one is time-consuming and labor-intensive and becomes an efficiency bottleneck. The root cause is that there is no unified tool for obtaining the element rules of the pages to be crawled.
The invention aims to develop a quick rule customization method based on pythonQT and an intelligent algorithm, to solve the problems that customizing webpage rules by manually analyzing websites does not scale to many complex websites and is time-consuming, labor-intensive and inefficient.
Disclosure of Invention
The invention aims to provide a quick rule customization method based on pythonQT and an intelligent algorithm, in which the navigation list items in a page are extracted using Selenium and non-conforming tags are filtered out, while the main text of the detail page is extracted with an intelligent algorithm. This enables webpage rule customization for many different websites and solves the problems that customizing webpage rules by manually analyzing websites does not scale to many complex websites and is time-consuming, labor-intensive and inefficient.
In order to solve the above technical problems, the invention is realized by the following technical scheme:
The invention is a quick rule customization method based on pythonQT and an intelligent algorithm, comprising the following steps:
S00: inputting the URL of a page to be crawled, the client loading the page from the URL;
S01: extracting the navigation list items in the page using Selenium;
S02: extracting the main text of the detail page with an intelligent algorithm;
S03: obtaining page element rules from the page using JavaScript, and returning the rules to the client;
S04: uploading the rules to a server, a background crawler program crawling according to the rules.
The specific process of extracting the navigation list items in the page in S01 is as follows:
A00: marking the visible <a> tags in the page using Selenium;
A01: filtering the marked <a> tags.
The specific process of extracting the main text of the detail page with the intelligent algorithm in S02 is as follows:
C00: removing the HTML tags from the detail page to obtain plain text;
C01: setting the line-block size M (the number of lines per block) and the line-block character-count threshold, and computing the character count N of each line block;
C02: plotting the line-block curve with the line number M as the abscissa and the line-block character count N as the ordinate;
C03: finding the sudden-drop point and the sudden-rise point and determining the text area.
Preferably, the filtering in A01 proceeds as follows:
T00: traversing the <a> tags selected in A00, computing the coordinate position of each tag, and storing its abscissa in a tag-coordinate array;
T01: filtering out <a> tags whose ordinate is greater than the browser height;
T02: presetting a reference value representing the number of page navigation list items;
T03: judging whether the number of occurrences of a given abscissa in the tag-coordinate array is smaller than the reference value; if so, deleting the <a> tags with that abscissa.
Preferably, the specific process of obtaining page element rules from the page using JavaScript in S03 is as follows:
P00: the client starts a web server;
P01: when the page URL is loaded, a form is added to the page; a click event is set on clickable elements, and when an element is clicked its attribute values are read and submitted to the web server through the form;
P02: the web server receives the data sent by the form and displays it on the client; the client stores the rules and synchronizes them to the server.
Preferably, the line-block size M satisfies 1 < M < total number of text lines, and the line-block character count N is the total number of characters in the line block.
Preferably, the sudden-drop point is the first point for which the ordinate of the next point is 0; the sudden-rise point is a point for which the ordinate of the next point is greater than the line-block character-count threshold; and the text area is the text between the sudden-rise point and the sudden-drop point.
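The quantities above can be written compactly as follows. This is one possible reading of the definitions rather than notation taken from the patent: the symbols c_j, L, T, i_r and i_d are introduced here for illustration, and truncating the last blocks at the end of the text is an assumption.

```latex
% c_j : number of characters on plain-text line j (after step C00), j = 1, ..., L
% M   : line-block size, 1 < M < L
% T   : line-block character-count threshold
N(i) = \sum_{j=i}^{\min(i+M-1,\,L)} c_j , \qquad i = 1, \dots, L
% sudden-rise point: an index i_r with N(i_r + 1) > T
% sudden-drop point: i_d = \min \{\, i > i_r : N(i + 1) = 0 \,\}
% text area: the lines between i_r and i_d
```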
The invention has the following beneficial effects:
The method extracts the navigation list items in a page using Selenium, filtering out <a> tags whose ordinate is greater than the browser height and <a> tags whose abscissa is shared by fewer tags than a reference value; the main text of the detail page is then extracted with an intelligent algorithm. This addresses the problem that customizing rules by manually analyzing each website does not scale to many complex websites, makes the method applicable to webpage rule customization for different websites, and improves the efficiency of webpage rule customization.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart of a fast rule customization method based on pythonQT and an intelligent algorithm of the present invention;
FIG. 2 is a flowchart of the extraction of the body part of the detail page by the intelligent algorithm in S02 according to the present invention;
FIG. 3 is a flowchart illustrating the process of obtaining page element rules from a page through js technique in S03 according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the present invention is a fast rule customization method based on pythonQT and an intelligent algorithm, comprising the following steps:
S00: inputting the URL of a page to be crawled, the client loading the page from the URL;
S01: extracting the navigation list items in the page using Selenium;
S02: extracting the main text of the detail page with an intelligent algorithm;
S03: obtaining page element rules from the page using JavaScript, and returning the rules to the client;
S04: uploading the rules to a server, a background crawler program crawling according to the rules.
The specific process of extracting the navigation list items in the page in S01 is as follows:
A00: marking the visible <a> tags in the page using Selenium;
A01: filtering the marked <a> tags.
Referring to FIG. 2, the specific process of extracting the main text of the detail page with the intelligent algorithm in S02 is as follows:
C00: removing the HTML tags from the detail page to obtain plain text;
C01: setting the line-block size M (the number of lines per block) and the line-block character-count threshold, and computing the character count N of each line block;
C02: plotting the line-block curve with the line number M as the abscissa and the line-block character count N as the ordinate;
C03: finding the sudden-drop point and the sudden-rise point and determining the text area.
The filtering in A01 proceeds as follows:
T00: traversing the <a> tags selected in A00, computing the coordinate position of each tag, and storing its abscissa in a tag-coordinate array;
T01: filtering out <a> tags whose ordinate is greater than the browser height;
T02: presetting a reference value representing the number of page navigation list items;
T03: judging whether the number of occurrences of a given abscissa in the tag-coordinate array is smaller than the reference value; if so, deleting the <a> tags with that abscissa.
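The filtering steps T00 to T03 can be sketched as a pure function over tag coordinates. This is a minimal sketch under stated assumptions, not the patented implementation: the `Anchor` record and the `filter_nav_anchors` function are hypothetical names, and the abscissa counts are taken after the T01 height filter, an ordering the text does not spell out. With Selenium, the coordinates would typically come from each element's `location` property and visibility from `is_displayed()`; here they are supplied directly so the example runs without a browser.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Anchor:
    """Hypothetical record for one visible <a> tag and its page coordinates."""
    text: str
    x: int  # abscissa (distance from the left edge)
    y: int  # ordinate (distance from the top edge)


def filter_nav_anchors(anchors: List[Anchor], browser_height: int,
                       reference_value: int) -> List[Anchor]:
    # T01: drop tags whose ordinate is greater than the browser height
    on_screen = [a for a in anchors if a.y <= browser_height]
    # T00: the "tag-coordinate array" is represented as a count per abscissa
    counts = Counter(a.x for a in on_screen)
    # T03: drop tags whose abscissa occurs fewer times than the reference value
    return [a for a in on_screen if counts[a.x] >= reference_value]


anchors = [
    Anchor("News", 10, 100),
    Anchor("Sports", 10, 130),
    Anchor("Finance", 10, 160),
    Anchor("Logo", 300, 20),      # abscissa occurs only once
    Anchor("Footer", 10, 2000),   # below the visible browser area
]
kept = filter_nav_anchors(anchors, browser_height=800, reference_value=3)
```

The idea behind T03 is that items of a vertical navigation list share a left edge, so an abscissa that occurs at least `reference_value` times is likely to belong to such a list.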
Referring to FIG. 3, the specific process of obtaining page element rules from the page using JavaScript in S03 is as follows:
P00: the client starts a web server;
P01: when the page URL is loaded, a form is added to the page; a click event is set on clickable elements, and when an element is clicked its attribute values are read and submitted to the web server through the form;
P02: the web server receives the data sent by the form and displays it on the client; the client stores the rules and synchronizes them to the server.
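The server side of P02 can be illustrated by decoding a submitted form body into a rule record. This is a sketch only: the patent does not name the form fields or the rule format, so the field names `tag`, `id` and `class`, the two function names, and the derivation of a CSS selector from the attributes are all assumptions made for the example.

```python
from typing import Dict
from urllib.parse import parse_qs


def parse_rule_form(body: str) -> Dict[str, str]:
    """Decodes a URL-encoded form body into a flat attribute dictionary."""
    fields = parse_qs(body, keep_blank_values=True)
    # parse_qs returns a list per field; keep the first value of each
    return {name: values[0] for name, values in fields.items()}


def rule_to_selector(rule: Dict[str, str]) -> str:
    """Builds a CSS selector from the clicked element's attributes.

    Hypothetical rule format: tag name, then #id, then one .class per class.
    """
    selector = rule.get("tag") or "*"
    if rule.get("id"):
        selector += "#" + rule["id"]
    for cls in rule.get("class", "").split():
        selector += "." + cls
    return selector


# Form body as it might be posted after a click on <div id="content" class="article main">
rule = parse_rule_form("tag=div&id=content&class=article+main")
selector = rule_to_selector(rule)
```

In a full client, a request handler in the locally started web server would call `parse_rule_form` on each POST body, show the result to the user, and store it until the rules are synchronized to the crawl server.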
The line-block size M satisfies 1 < M < total number of text lines, and the line-block character count N is the total number of characters in the line block.
The sudden-drop point is the first point for which the ordinate of the next point is 0; the sudden-rise point is a point for which the ordinate of the next point is greater than the line-block character-count threshold; and the text area is the text between the sudden-rise point and the sudden-drop point.
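Steps C00 to C03 can be sketched as follows. This is a minimal interpretation under stated assumptions, not the patented implementation: the function names, the regex-based tag stripping, the default values `m=3` and `threshold=40`, and the choice of the first rise and the first drop after it are illustrative. The text starts at the first block whose character count exceeds the threshold (the point that follows the sudden-rise point) and ends at the first point whose next block is empty (the sudden-drop point).

```python
import re
from typing import List


def strip_tags(html: str) -> List[str]:
    """C00: drop script/style blocks and all tags, keeping the line structure."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    text = re.sub(r"(?s)<[^>]*>", "", text)
    return [line.strip() for line in text.split("\n")]


def block_counts(lines: List[str], m: int) -> List[int]:
    """C01: N[i] is the number of non-whitespace characters in lines i .. i+m-1."""
    lengths = [len(re.sub(r"\s", "", line)) for line in lines]
    return [sum(lengths[i:i + m]) for i in range(len(lines))]


def extract_body(html: str, m: int = 3, threshold: int = 40) -> str:
    """C02/C03: locate the text area on the line-block curve and return it."""
    lines = strip_tags(html)
    counts = block_counts(lines, m)
    # Point following the sudden rise: first block above the threshold
    start = next((i for i, n in enumerate(counts) if n > threshold), None)
    if start is None:
        return ""
    # Sudden-drop point: first point at or after start whose next block is empty
    end = len(lines) - 1
    for i in range(start, len(counts) - 1):
        if counts[i + 1] == 0:
            end = i
            break
    return "\n".join(line for line in lines[start:end + 1] if line)


page = "\n".join([
    "<html><head><title>t</title></head><body>",
    "<a href='/'>Home</a>",
    "", "", "",
    "<p>" + "A" * 30 + "</p>",
    "<p>" + "B" * 30 + "</p>",
    "<p>" + "C" * 30 + "</p>",
    "", "", "",
    "<a href='/about'>About</a>",
    "</body></html>",
])
body = extract_body(page)
```

In this sample page the short navigation links never push a block above the threshold, so only the three dense paragraph lines are returned. Real pages may contain several dense regions, and choosing among them is not covered by this sketch.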
It should be noted that, in the above system embodiment, each included unit is only divided according to functional logic, but is not limited to the above division as long as the corresponding function can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
In addition, those skilled in the art can understand that all or part of the steps in the method for implementing the embodiments described above can be implemented by a program to instruct the relevant hardware.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (4)

1. A quick rule customization method based on pythonQT and an intelligent algorithm, characterized by comprising the following steps:
S00: inputting the URL of a page to be crawled, the client loading the page from the URL;
S01: extracting the navigation list items in the page using Selenium;
S02: extracting the main text of the detail page with an intelligent algorithm;
S03: obtaining page element rules from the page using JavaScript, and returning the rules to the client;
S04: uploading the rules to a server, a background crawler program crawling according to the rules;
the specific process of extracting the navigation list items in the page in S01 is as follows:
A00: marking the visible <a> tags in the page using Selenium;
A01: filtering the marked <a> tags;
the specific process of extracting the main text of the detail page with the intelligent algorithm in S02 is as follows:
C00: removing the HTML tags from the detail page to obtain plain text;
C01: setting the line-block size M (the number of lines per block) and the line-block character-count threshold, and computing the character count N of each line block;
C02: plotting the line-block curve with the line number M as the abscissa and the line-block character count N as the ordinate;
C03: finding the sudden-drop point and the sudden-rise point and determining the text area;
the filtering in A01 proceeds as follows:
T00: traversing the <a> tags selected in A00, computing the coordinate position of each tag, and storing its abscissa in a tag-coordinate array;
T01: filtering out <a> tags whose ordinate is greater than the browser height;
T02: presetting a reference value representing the number of page navigation list items;
T03: judging whether the number of occurrences of a given abscissa in the tag-coordinate array is smaller than the reference value; if so, deleting the <a> tags with that abscissa.
2. The quick rule customization method based on pythonQT and an intelligent algorithm according to claim 1, characterized in that the specific process of obtaining page element rules from the page using JavaScript in S03 is as follows:
P00: the client starts a web server;
P01: when the page URL is loaded, a form is added to the page; a click event is set on clickable elements, and when an element is clicked its attribute values are read and submitted to the web server through the form;
P02: the web server receives the data sent by the form and displays it on the client; the client stores the rules and synchronizes them to the server.
3. The quick rule customization method based on pythonQT and an intelligent algorithm according to claim 1, characterized in that the line-block size M satisfies 1 < M < total number of text lines, and the line-block character count N is the total number of characters in the line block.
4. The quick rule customization method based on pythonQT and an intelligent algorithm according to claim 1, characterized in that the sudden-drop point is the first point for which the ordinate of the next point is 0; the sudden-rise point is a point for which the ordinate of the next point is greater than the line-block character-count threshold; and the text area is the text between the sudden-rise point and the sudden-drop point.
CN201811019150.1A 2018-09-03 2018-09-03 Quick rule customizing method based on pythonQT and intelligent algorithm Active CN109271145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811019150.1A CN109271145B (en) 2018-09-03 2018-09-03 Quick rule customizing method based on pythonQT and intelligent algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811019150.1A CN109271145B (en) 2018-09-03 2018-09-03 Quick rule customizing method based on pythonQT and intelligent algorithm

Publications (2)

Publication Number Publication Date
CN109271145A (en) 2019-01-25
CN109271145B (en) 2021-12-14

Family

ID=65187780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811019150.1A Active CN109271145B (en) 2018-09-03 2018-09-03 Quick rule customizing method based on pythonQT and intelligent algorithm

Country Status (1)

Country Link
CN (1) CN109271145B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297962B (en) * 2019-06-28 2021-08-24 北京金山安全软件有限公司 Website resource crawling method, device, system and computer equipment
CN113505288B (en) * 2021-06-28 2023-08-01 南京大学 Quick detection and positioning method based on statistics and pile positioning vision
CN113987320B (en) * 2021-11-24 2024-06-04 宁波深擎信息科技有限公司 Real-time information crawler method, device and equipment based on intelligent page analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335382A (en) * 2014-06-27 2016-02-17 优视科技有限公司 Webpage text extraction method and device
CN105930385A (en) * 2016-04-13 2016-09-07 珠海迈科智能科技股份有限公司 Data crawling method and system
CN107463696A (en) * 2017-08-15 2017-12-12 中译语通科技(北京)有限公司 A kind of method of Webpage largest block extraction
CN108334508A (en) * 2017-01-19 2018-07-27 阿里巴巴集团控股有限公司 The extracting method and device of webpage information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10108535B2 (en) * 2016-07-08 2018-10-23 Accenture Global Solutions Limited Web application test script generation to test software functionality

Also Published As

Publication number Publication date
CN109271145A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN108804512B (en) Text classification model generation device and method and computer readable storage medium
CN105677764B (en) Information extraction method and device
CN109271145B (en) Quick rule customizing method based on pythonQT and intelligent algorithm
US9910842B2 (en) Interactively predicting fields in a form
CN107730389A (en) Electronic installation, insurance products recommend method and computer-readable recording medium
JP6827116B2 (en) Web page clustering method and equipment
CN113705554A (en) Training method, device and equipment of image recognition model and storage medium
CN110020312B (en) Method and device for extracting webpage text
CN111538931A (en) Big data-based public opinion monitoring method and device, computer equipment and medium
CN110633594A (en) Target detection method and device
CN112001406A (en) Text region detection method and device
CN111914159A (en) Information recommendation method and terminal
CN112445915A (en) Document map extraction method and device based on machine learning and storage medium
CN112650910A (en) Method, device, equipment and storage medium for determining website update information
US10846462B2 (en) Web page output selection
CN114359533A (en) Page number identification method based on page text and computer equipment
CN110874570A (en) Face recognition method, device, equipment and computer readable storage medium
CN107368923B (en) Scenic spot heat prediction method and device
US10963690B2 (en) Method for identifying main picture in web page
CN111581478A (en) Cross-website general news acquisition method for specific subject
CN109101973B (en) Character recognition method, electronic device and storage medium
CN107368464B (en) Method and device for acquiring bidding product information
CN109948015B (en) Meta search list result extraction method and system
CN112766269B (en) Picture text retrieval method, intelligent terminal and storage medium
CN115270711A (en) Electronic signature method, electronic signature device, electronic apparatus, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190125

Assignee: Kedaduochuang cloud Technology Co.,Ltd.

Assignor: USTC SINOVATE SOFTWARE CO.,LTD.

Contract record no.: X2023980034512

Denomination of invention: A Fast Rule Customization Method Based on Python QT and Intelligent Algorithms

Granted publication date: 20211214

License type: Common License

Record date: 20230407
