CN108959539B - Rule-configurable webpage data analysis method - Google Patents

Info

Publication number
CN108959539B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810701727.0A
Other languages
Chinese (zh)
Other versions
CN108959539A (en)
Inventor
曹亮
罗山城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN201810701727.0A
Publication of CN108959539A
Application granted
Publication of CN108959539B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a rule-configurable webpage data parsing method comprising the following steps: S1, Web-side task creation: the Web application sends a data request to the server, and the configuration is submitted once the task configuration information has been filled in; S2, webpage collection: the collection settings from the Web task configuration are read, and the back end starts fetching webpages from the supplied URL; S3, webpage parsing: the parsing settings from the Web task configuration are read, and the list of collected pages is retrieved for data parsing; S4, data download: task results are viewed through the task list, where the collected webpage content can be downloaded and the parsed data can be viewed and downloaded. The invention uses a B/S (browser/server) architecture, is convenient to use, and requires few operations when configuring webpage collection and webpage data parsing. Dynamic data in a webpage can be obtained conveniently, and webpages can be collected quickly by using coroutines.

Description

Rule-configurable webpage data analysis method
Technical Field
The invention belongs to the field of webpage data processing, and particularly relates to a rule-configurable webpage data analysis method.
Background
In recent years, as China's big-data strategy has become increasingly clear, data capture and information collection products have met with huge development opportunities, and the number of such products has grown rapidly. Webpage parsing, in which a program automatically analyzes webpage content and extracts information for further processing, is an indispensable part of implementing a web crawler. However, current webpage data parsing methods are either complex to operate when configuring how webpage data is parsed, or slow when acquiring dynamic data in a webpage.
Disclosure of Invention
To solve the above problems, the present invention provides a rule-configurable webpage data parsing method, which includes the following steps:
S1, Web-side task creation: the Web application sends a data request to the server; the required starting URL, webpage collection rules, and webpage parsing rules are configured on a task configuration page, data is then extracted according to the HTML tag to which the target data belongs, and the configuration is submitted once the task configuration information has been filled in;
S2, webpage collection: the collection settings from the Web task configuration are read, the back end starts fetching webpages from the supplied URL, and the fetch mode is determined by the configured webpage collection rules. The fetch mode is either an enhanced mode or a common mode: the enhanced mode combines Selenium and ChromeDriver with a request header built using Python's UserAgent library to access the corresponding URL, and the common mode uses Python's aiohttp library with a request header built using the UserAgent library to access the corresponding URL; after each successful access, the webpage content, URL, page number, and page level are saved into a list; after all webpages have been accessed, the fetched webpage content is stored on the server as HTML files, and the corresponding information is stored in a database;
S3, webpage parsing: the parsing settings from the Web task configuration are read, the list of collected pages is retrieved for data parsing, and the webpages are parsed with Python's Beautiful Soup library; during parsing, data and related tags are extracted by tag type and value according to the HTML tags configured on the page; after parsing is finished, the data is stored in the database;
S4, data download: task results are viewed through the task list, where the collected webpage content can be downloaded and the parsed data can be viewed and downloaded.
Further, the web page collecting rule of step S1 includes whether to collect a sub page, whether to collect a next page, and whether to use the enhanced mode.
Further, the webpage parsing rules of step S1 comprise at most three rows; the rule in each row is used to parse the webpage separately, and the outputs are finally merged into one result, which is stored in the database.
Still further, each webpage parsing rule comprises four parameters: the first parameter selects the webpage parsing rule, the second and fourth parameters are configuration information corresponding to the selected rule, and the third parameter is the relationship between the configuration information of the second parameter and that of the fourth parameter, the relationship being one of "contains", "does not contain", and "contains only".
Further, when the enhanced mode is selected for webpage collection in step S2, if sub-pages need to be fetched, two ChromeDriver instances are opened, one for accessing first-level pages and the other for accessing sub-pages; after a first-level page is accessed, its sub-page URL links are obtained through the configured tag information, and the sub-pages are then accessed; if the next page needs to be fetched, the next-page link is obtained through the configured next-page tag and accessed.
Further, when the common mode is selected for webpage collection in step S2, if sub-pages need to be fetched, the first-level page is accessed first, the sub-page links are then obtained through the configured tag information and stored in a list, and the sub-pages are then accessed using coroutines; if the next page needs to be fetched, the next-page link is obtained through the configured next-page tag and accessed.
The invention has the beneficial effects that:
1) the method adopts a B/S architecture, which avoids the client download required by a C/S architecture and is convenient to use;
2) when the method is used to configure webpage collection and webpage data parsing, only knowledge of the HTML structure is needed, and the configuration does not require a large number of operations;
3) the method can conveniently obtain dynamic data in a webpage, and can collect webpages quickly by using coroutines.
Drawings
FIG. 1 is a flow chart of the rule-configurable webpage data parsing method.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
The invention provides a rule-configurable webpage data analysis method, as shown in fig. 1, which specifically comprises the following steps:
First, the server, running in a Windows 10 environment, is started; it listens on a designated port and waits for Socket connections.
Then, a Web-side task is created, and the Web application sends a data request to the server. In this step, the required starting URL, webpage collection rules, and webpage parsing rules are configured on a task configuration page, data is then extracted according to the HTML tag to which the target data belongs, and the configuration is submitted once the task configuration information has been filled in. The webpage collection rules specify whether to collect sub-pages, whether to collect the next page, and whether to use the enhanced mode, specifically:
1) when sub-page collection is selected, a "get sub-page tag" must be configured, in the form of an HTML tag: class = "xxx" >; the back end uses this tag to find all matching links in the first-level page and accesses them;
2) when next-page collection is selected, a "get next-page tag" must be configured, in the form of an HTML tag: class = 'next' > next page; the back end uses this tag to find the corresponding next-page link and accesses it;
3) the enhanced mode is intended for accurately acquiring dynamic webpages; when it is selected, webpages are accessed using Selenium in combination with ChromeDriver.
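The three collection options above can be pictured as a small configuration record with a validation step. The following is a minimal sketch, not the patent's implementation: the class name `CollectionRule`, its field names, and the error messages are all hypothetical, and the only behaviour taken from the text is that a tag must be configured whenever sub-page or next-page collection is selected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CollectionRule:
    """Collection options from the task configuration page (hypothetical field names)."""
    start_url: str
    collect_subpages: bool = False
    subpage_tag: Optional[str] = None     # tag used to locate sub-page links
    collect_next_page: bool = False
    next_page_tag: Optional[str] = None   # tag used to locate the next-page link
    enhanced_mode: bool = False

    def validate(self) -> List[str]:
        """Return a list of configuration errors; an empty list means the rule is usable."""
        errors = []
        if not self.start_url:
            errors.append("start URL is required")
        if self.collect_subpages and not self.subpage_tag:
            errors.append("sub-page tag is required when sub-pages are collected")
        if self.collect_next_page and not self.next_page_tag:
            errors.append("next-page tag is required when the next page is collected")
        return errors

    def fetch_mode(self) -> str:
        """Enhanced mode is chosen only when explicitly enabled."""
        return "enhanced" if self.enhanced_mode else "common"
```

Validating before submission keeps an incomplete task from reaching the back end, which is one plausible reading of "must be configured" in items 1) and 2).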
In addition, the webpage parsing rules comprise at most three rows, and the rule in each row is used to parse the webpage separately. Each row comprises four parameters: the first parameter selects the webpage parsing rule, the second and fourth parameters are configuration information corresponding to the selected rule, and the third parameter is the relationship between the configuration information of the second parameter and that of the fourth parameter, the relationship being one of "contains", "does not contain", and "contains only".
Regular-expression rules can also be added when configuring the rules.
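The description does not spell out how the four parameters interact, so the following is only one possible reading, sketched under explicit assumptions: the first parameter is taken to be a rule kind (`"tag"` or `"regex"`, hypothetical names), the second a tag name to look at, the fourth a literal string or regular-expression pattern, and the third the relation applied between an element's text and that value. The names `ParseRule`, `Relation`, and `apply_rules` are illustrative and do not appear in the patent.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple

MAX_RULE_ROWS = 3  # "at most three rows" in the description


class Relation(Enum):
    CONTAINS = "contains"
    NOT_CONTAINS = "does not contain"
    CONTAINS_ONLY = "contains only"


@dataclass
class ParseRule:
    kind: str           # first parameter: "tag" or "regex" (assumed rule kinds)
    target: str         # second parameter: tag name whose text is examined (assumption)
    relation: Relation  # third parameter
    value: str          # fourth parameter: literal text or regex pattern (assumption)


def rule_matches(rule: ParseRule, text: str) -> bool:
    """Apply the relation between an element's text and the rule's value."""
    if rule.kind == "regex":
        if rule.relation is Relation.CONTAINS_ONLY:
            return re.fullmatch(rule.value, text) is not None
        found = re.search(rule.value, text) is not None
    else:
        if rule.relation is Relation.CONTAINS_ONLY:
            return text.strip() == rule.value
        found = rule.value in text
    return found if rule.relation is Relation.CONTAINS else not found


def apply_rules(rules: List[ParseRule],
                elements: List[Tuple[str, str]]) -> Dict[int, List[str]]:
    """Run each rule row separately over (tag, text) pairs and merge into one result."""
    if len(rules) > MAX_RULE_ROWS:
        raise ValueError("at most three rule rows are allowed")
    merged: Dict[int, List[str]] = {}
    for row, rule in enumerate(rules, start=1):
        merged[row] = [text for tag, text in elements
                       if tag == rule.target and rule_matches(rule, text)]
    return merged
```

Keying the merged result by row number keeps each row's output distinguishable after merging; the patent only states that the rows are combined into one result, so this layout is a design choice of the sketch.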
Then, after the Web-side task has been created, webpage collection starts. The collection settings from the Web task configuration are read, the back end starts fetching webpages from the supplied URL, and the fetch mode is determined by the configured webpage collection rules. The fetch mode is either the enhanced mode or the common mode, specifically:
1) the enhanced mode combines Selenium and ChromeDriver with a request header built using Python's UserAgent library to access the corresponding URL. If sub-pages need to be fetched, two ChromeDriver instances are opened, one for accessing first-level pages and the other for accessing sub-pages. After a first-level page is accessed, its sub-page URL links are obtained through the configured tag information, and the sub-pages are then accessed; if the next page needs to be fetched, the next-page link is obtained through the configured next-page tag and accessed. In particular, at most 10 first-level pages are fetched;
2) the common mode uses Python's aiohttp library with a request header built using the UserAgent library to access the corresponding URL. If sub-pages need to be fetched, the first-level page is accessed first, the sub-page links are then obtained through the configured tag information and stored in a list, and the sub-pages are then accessed using coroutines; if the next page needs to be fetched, the next-page link is obtained through the configured next-page tag and accessed. In particular, at most 10 first-level pages are fetched.
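The common-mode flow (follow next-page links up to the 10-page cap, then fetch sub-pages concurrently with coroutines) can be sketched as follows. This is an illustration under assumptions rather than the patent's code: the function names are hypothetical, link discovery is passed in as a `find_next` callable instead of being derived from the configured tags, and the fetcher is injected so the control flow can be exercised without network access. `make_aiohttp_fetch` expects an already opened `aiohttp.ClientSession` and uses only `session.get(url)` and `response.text()`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

Fetch = Callable[[str], Awaitable[str]]
MAX_FIRST_LEVEL_PAGES = 10  # cap stated in the description


def make_aiohttp_fetch(session) -> Fetch:
    """Wrap an aiohttp.ClientSession (created by the caller) as a Fetch callable."""
    async def fetch(url: str) -> str:
        async with session.get(url) as response:
            return await response.text()
    return fetch


async def crawl_first_level(start_url: str, fetch: Fetch,
                            find_next: Callable[[str], Optional[str]]
                            ) -> List[Tuple[int, str, str]]:
    """Follow next-page links from start_url, stopping after the page cap."""
    pages: List[Tuple[int, str, str]] = []
    url: Optional[str] = start_url
    page_number = 1
    while url and page_number <= MAX_FIRST_LEVEL_PAGES:
        html = await fetch(url)
        pages.append((page_number, url, html))
        url = find_next(html)  # None when there is no next-page link
        page_number += 1
    return pages


async def crawl_subpages(urls: List[str], fetch: Fetch) -> Dict[str, str]:
    """Fetch all sub-pages concurrently; gather preserves the input order."""
    bodies = await asyncio.gather(*(fetch(url) for url in urls))
    return dict(zip(urls, bodies))
```

With aiohttp installed, a caller might open `aiohttp.ClientSession(headers={"User-Agent": ...})` in an `async with` block and pass `make_aiohttp_fetch(session)` as `fetch`; the User-Agent value would come from whatever user-agent library the deployment uses.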
After each successful access, the webpage content, URL, page number, and page level are saved into a list. After all webpages have been accessed, the fetched webpage content is stored on the server as HTML files, and the corresponding information is stored in the database.
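The storage step can be sketched with the standard library alone. The patent does not name the database, the table layout, or the file naming scheme, so SQLite, the `pages` table, and the `level…_page…` file names below are all assumptions chosen for illustration; only the four recorded fields (content, URL, page number, page level) come from the text.

```python
import sqlite3
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List, Tuple


@dataclass
class PageRecord:
    html: str
    url: str
    page_number: int
    page_level: int


def store_pages(records: List[PageRecord], out_dir: Path,
                db: sqlite3.Connection) -> List[Path]:
    """Write each page to an HTML file and record its metadata in the database."""
    out_dir.mkdir(parents=True, exist_ok=True)
    db.execute("CREATE TABLE IF NOT EXISTS pages "
               "(url TEXT, page_number INTEGER, page_level INTEGER, file_path TEXT)")
    paths = []
    for index, record in enumerate(records):
        # The index suffix keeps file names unique within one task.
        path = out_dir / f"level{record.page_level}_page{record.page_number}_{index}.html"
        path.write_text(record.html, encoding="utf-8")
        db.execute("INSERT INTO pages VALUES (?, ?, ?, ?)",
                   (record.url, record.page_number, record.page_level, str(path)))
        paths.append(path)
    db.commit()
    return paths


def demo() -> Tuple[list, List[str]]:
    """Store two example pages in a temporary directory and an in-memory database."""
    with tempfile.TemporaryDirectory() as tmp:
        db = sqlite3.connect(":memory:")
        records = [
            PageRecord("<html>list</html>", "http://example.com/list", 1, 1),
            PageRecord("<html>item</html>", "http://example.com/item/1", 1, 2),
        ]
        paths = store_pages(records, Path(tmp) / "task_1", db)
        rows = db.execute(
            "SELECT url, page_number, page_level FROM pages ORDER BY page_level"
        ).fetchall()
        saved = [p.read_text(encoding="utf-8") for p in paths]
        db.close()
        return rows, saved
```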
Then, the webpages are parsed. In this step, the parsing settings from the Web task configuration are read, the list of collected pages is retrieved for data parsing, and the webpages are parsed with Python's Beautiful Soup library; during parsing, data and related tags are extracted by tag type and value according to the HTML tags configured on the page; after parsing is finished, the data is stored in the database.
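Extraction "by tag type and value" can be illustrated as selecting elements by tag name and attribute value and collecting their text. The patent uses Beautiful Soup for this; the sketch below deliberately swaps in the standard library's `html.parser` so that it runs without third-party packages, and the class and function names are hypothetical. With Beautiful Soup, a roughly equivalent lookup would be `soup.find_all("li", class_="title")`.

```python
from html.parser import HTMLParser
from typing import List


class TagTextExtractor(HTMLParser):
    """Collect the text of every element with a given tag name and attribute value."""

    def __init__(self, tag: str, attr: str, value: str) -> None:
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0                  # > 0 while inside a matching element
        self.buffer: List[str] = []
        self.results: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1         # nested element with the same tag name
        elif tag == self.tag and self.value in (dict(attrs).get(self.attr) or "").split():
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract(html: str, tag: str, attr: str, value: str) -> List[str]:
    """Return the text of elements such as <li class="title"> found in html."""
    parser = TagTextExtractor(tag, attr, value)
    parser.feed(html)
    parser.close()
    return parser.results
```

The attribute value is matched against whitespace-separated tokens, which suits `class` attributes carrying several class names; other attributes with spaces in their values would need an exact comparison instead.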
Finally, the data is downloaded. Task results are viewed through the task list, where the collected webpage content can be downloaded and the parsed data can be viewed and downloaded.
In the description of the present invention, it should be noted that the terms "first", "second", "third", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
The above discloses only preferred embodiments of the present invention and is not intended to limit the scope of the claims; equivalent changes made in accordance with the claims of the invention remain within its scope.

Claims (4)

1. A rule-configurable webpage data parsing method, characterized by comprising the following steps:
S1, Web-side task creation: the Web application sends a data request to the server; the required starting URL, webpage collection rules, and webpage parsing rules are configured on a task configuration page, data is then extracted according to the HTML tag to which the target data belongs, and the configuration is submitted once the task configuration information has been filled in; the webpage collection rules specify whether to collect sub-pages, whether to collect the next page, and whether to use the enhanced mode;
S2, webpage collection: the collection settings from the Web task configuration are read, the back end starts fetching webpages from the supplied URL, and the fetch mode is determined by the configured webpage collection rules, wherein the fetch mode is either an enhanced mode or a common mode, the enhanced mode combines Selenium and ChromeDriver with a request header built using Python's UserAgent library to access the corresponding URL, and the common mode uses Python's aiohttp library with a request header built using the UserAgent library to access the corresponding URL; after each successful access, the webpage content, URL, page number, and page level are saved into a list; after all webpages have been accessed, the fetched webpage content is stored on the server as HTML files, and the corresponding information is stored in a database; when the enhanced mode is selected for webpage collection in step S2, if sub-pages need to be fetched, two ChromeDriver instances are opened, one for accessing first-level pages and the other for accessing sub-pages; after a first-level page is accessed, its sub-page URL links are obtained through the configured tag information, and the sub-pages are then accessed; if the next page needs to be fetched, the next-page link is obtained through the configured next-page tag and accessed;
S3, webpage parsing: the parsing settings from the Web task configuration are read, the list of collected pages is retrieved for data parsing, and the webpages are parsed with Python's Beautiful Soup library; during parsing, data and related tags are extracted by tag type and value according to the HTML tags configured on the page; after parsing is finished, the data is stored in the database;
S4, data download: task results are viewed through the task list, where the collected webpage content can be downloaded and the parsed data can be viewed and downloaded.
2. The rule-configurable webpage data parsing method according to claim 1, wherein the webpage parsing rules of step S1 comprise at most three rows, the rule in each row is used to parse the webpage separately, and the outputs are finally merged into one result, which is stored in the database.
3. The rule-configurable webpage data parsing method according to claim 2, wherein each webpage parsing rule comprises four parameters, the first parameter selects the webpage parsing rule, the second and fourth parameters are configuration information corresponding to the selected rule, the third parameter is the relationship between the configuration information of the second parameter and that of the fourth parameter, and the relationship is one of "contains", "does not contain", and "contains only".
4. The rule-configurable webpage data parsing method according to claim 1, wherein when the common mode is selected for webpage collection in step S2, if sub-pages need to be fetched, the first-level page is accessed first, the sub-page links are then obtained through the configured tag information and stored in a list, and the sub-pages are then accessed using coroutines; if the next page needs to be fetched, the next-page link is obtained through the configured next-page tag and accessed.
CN201810701727.0A 2018-06-30 2018-06-30 Rule-configurable webpage data analysis method Active CN108959539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810701727.0A CN108959539B (en) 2018-06-30 2018-06-30 Rule-configurable webpage data analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810701727.0A CN108959539B (en) 2018-06-30 2018-06-30 Rule-configurable webpage data analysis method

Publications (2)

Publication Number Publication Date
CN108959539A CN108959539A (en) 2018-12-07
CN108959539B true CN108959539B (en) 2021-09-21

Family

ID=64484169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810701727.0A Active CN108959539B (en) 2018-06-30 2018-06-30 Rule-configurable webpage data analysis method

Country Status (1)

Country Link
CN (1) CN108959539B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109656925A (en) * 2018-12-21 2019-04-19 北京金山安全软件有限公司 Application program data acquisition method and device and electronic equipment
CN110119423A (en) * 2019-05-17 2019-08-13 厦门商集网络科技有限责任公司 A kind of data analysis method and computer readable storage medium of configurableization
CN110297962B (en) * 2019-06-28 2021-08-24 北京金山安全软件有限公司 Website resource crawling method, device, system and computer equipment
CN111585956B (en) * 2020-03-31 2022-09-09 完美世界(北京)软件科技发展有限公司 Website anti-brushing verification method and device
CN117370635B (en) * 2023-12-08 2024-03-15 杭州实在智能科技有限公司 Method and system for extracting and processing RPA webpage content

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102890692A (en) * 2011-07-22 2013-01-23 阿里巴巴集团控股有限公司 Webpage information extraction method and webpage information extraction system
CN103399908A (en) * 2013-07-30 2013-11-20 北京北纬通信科技股份有限公司 Method and system for fetching business data
CN105468664A (en) * 2015-05-12 2016-04-06 北京众标网络科技有限公司 Information acquisition method and apparatus
CN106528769A (en) * 2016-11-04 2017-03-22 乐视控股(北京)有限公司 Data acquisition method and apparatus
CN106959995A (en) * 2016-12-21 2017-07-18 四川长虹电器股份有限公司 Compatible two-way automatic web page contents acquisition method
CN107025296A (en) * 2017-04-17 2017-08-08 山东辰华科技信息有限公司 Based on science service information intelligent grasping system method of data capture
CN107391757A (en) * 2017-08-23 2017-11-24 绵阳美菱软件技术有限公司 A kind of appliance data acquisition method and device
CN107885777A (en) * 2017-10-11 2018-04-06 北京智慧星光信息技术有限公司 A kind of control method and system of the crawl web data based on collaborative reptile

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9158855B2 (en) * 2005-06-16 2015-10-13 Buzzmetrics, Ltd Extracting structured data from weblogs
US20090182788A1 (en) * 2008-01-14 2009-07-16 Zenbe, Inc. Apparatus and method for customized email and data management
CN105868968A (en) * 2016-04-21 2016-08-17 广州爱拼信息科技有限公司 Recruitment information analysis system and method based on machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
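Design of an intelligent webpage data acquisition system (一种智能网页数据采集系统设计); Li Shizhong (李世忠); Electronic Technology & Software Engineering (《电子技术与软件工程》); 2018-03-27; p. 169 *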

Also Published As

Publication number Publication date
CN108959539A (en) 2018-12-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant