CN105630673A - Automated test method and apparatus for web crawler rate - Google Patents

Automated test method and apparatus for web crawler rate

Publication number
CN105630673A
Authority
CN
China
Prior art keywords
target link
crawler program
crawl completeness rate
rate
candidate link
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510957702.3A
Other languages
Chinese (zh)
Other versions
CN105630673B (en)
Inventor
徐香联
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Application filed by Beijing Ruian Technology Co Ltd
Priority to CN201510957702.3A
Publication of CN105630673A
Application granted
Publication of CN105630673B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3612 Software analysis for verifying properties of programs by runtime analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3688 Test management for test execution, e.g. scheduling of test suites

Abstract

Embodiments of the invention disclose an automated test method and apparatus for web crawl completeness rate. The method comprises: accessing a webpage associated with a seed URL read from the crawler seed library of a crawler program, and obtaining a set number of candidate links from the link attributes of the webpage; screening the candidate links to obtain target links, and importing the target links into a test tool; and determining the crawl completeness rate of the crawler program according to the target links imported into the test tool and the crawl result data of the crawler program. Compared with the prior-art practice of manually verifying the performance of a crawler program, the technical solution of the embodiments improves the efficiency of testing the crawler program.

Description

Automated test method and apparatus for web crawl completeness rate
Technical field
Embodiments of the present invention relate to the field of software testing, and in particular to an automated test method and apparatus for web crawl completeness rate.
Background
In recent years the number of Internet users in China has grown explosively, and websites of all kinds have sprung up like mushrooms after rain. Faced with such a mass of information, valuable data must be extracted and screened: governments can use it for public-opinion analysis and network security monitoring, and enterprises can use it for market research and media analysis.
With information growing explosively, web crawler technology is particularly important. Whether a crawler program can crawl the desired information in time, whether the crawled web data is complete, and whether the information is correct are important indicators of product performance. However, manually verifying whether the data of thousands of webpages has been crawled in a timely, comprehensive and correct manner, and whether all of it has been stored, is time-consuming and laborious. A test method that automatically tests the crawl completeness rate of a web crawler is therefore urgently needed, to improve the efficiency of testing crawler programs.
Summary of the invention
The present invention provides an automated test method and apparatus for web crawl completeness rate, to improve the efficiency of testing crawler programs.
In a first aspect, an embodiment of the present invention provides an automated test method for web crawl completeness rate, comprising:
Accessing a webpage associated with a seed URL read from the crawler seed library of a crawler program, and obtaining a set number of candidate links from the link attributes of the webpage;
Screening the candidate links to obtain target links, and importing the target links into a test tool;
Determining the crawl completeness rate of the crawler program according to the target links imported into the test tool and the crawl result data of the crawler program.
In a second aspect, an embodiment of the present invention provides an automated test apparatus for web crawl completeness rate, comprising:
A candidate link module, configured to access a webpage associated with a seed URL read from the crawler seed library of a crawler program, and to obtain a set number of candidate links from the link attributes of the webpage;
A target link module, configured to screen the candidate links to obtain target links, and to import the target links into a test tool;
A crawl completeness rate module, configured to determine the crawl completeness rate of the crawler program according to the target links imported into the test tool and the crawl result data of the crawler program.
In the technical solution provided by the embodiments of the present invention, the webpage associated with a seed URL in the crawler seed library of the crawler program is accessed, target links are screened out from the link attributes of the webpage, and the crawl completeness rate of the crawler program is determined from the target links and the crawl result data of the crawler program. Compared with the prior-art practice of manually verifying the performance of a crawler program, this improves the efficiency of testing the crawler program.
Brief description of the drawings
Fig. 1a is a flowchart of an automated test method for web crawl completeness rate in Embodiment one of the present invention;
Fig. 1b is a schematic diagram of seed URLs and regular expressions in Embodiment one of the present invention;
Fig. 1c is a schematic diagram of the matching results associated with the target links in Embodiment one of the present invention;
Fig. 1d is a schematic diagram of the target links in an Excel spreadsheet in Embodiment one of the present invention;
Fig. 2 is a flowchart of an automated test method for web crawl completeness rate in Embodiment two of the present invention;
Fig. 3 is a schematic structural diagram of an automated test apparatus for web crawl completeness rate in Embodiment three of the present invention.
Detailed description of the invention
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
Embodiment one
Fig. 1a is a flowchart of an automated test method for web crawl completeness rate in Embodiment one of the present invention. The method may be performed by an automated test apparatus for web crawl completeness rate. The apparatus may be implemented in hardware and/or software and is configured in a test machine on which an automated test tool (such as QuickTest Professional), a database (such as an Oracle database or a full-text database) and a browser are installed.
As shown in Fig. 1a, the method includes the following steps:
Step 11: access the webpage associated with a seed URL read from the crawler seed library of the crawler program, and obtain a set number of candidate links from the link attributes of the webpage.
In this embodiment, a crawler program is a program or script that automatically captures web information according to certain rules. A seed URL is the URL of a website from which the crawler program is to capture information. Referring to Fig. 1b, if the crawler program needs to crawl information such as news, blogs or forum posts from 100 news or forum websites, the seed URLs comprise the URLs of those 100 websites. Seed URLs are stored in the crawler seed library. The link attributes of a webpage are the links to the articles, news items, blog posts or forum posts that the webpage contains. For example, for the seed URL tieba.baidu.com, the associated webpage is the Baidu Tieba forum, and the links to the posts it contains constitute the link attributes.
Specifically, a seed URL is obtained from the crawler seed library of the crawler program, the webpage associated with the seed URL is accessed through a browser (such as Internet Explorer), and a set number of candidate links is obtained from the link attributes of the webpage. The set number can be chosen according to test requirements; for example, it may be 30.
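As an illustration of step 11, the sketch below collects a limited number of candidate links from a page. It is only a minimal sketch under stated assumptions: the patent drives a real browser, whereas this example parses a hard-coded HTML string with Python's standard library, and the names `LinkCollector`, `candidate_links` and `SAMPLE_HTML`, as well as the `/p/...` paths, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (title, absolute_url) pairs from <a href="..."> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (title, url) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def candidate_links(html, base_url, limit=30):
    """Return at most `limit` (title, url) pairs found in `html`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links[:limit]


# Hypothetical page content standing in for a fetched seed page.
SAMPLE_HTML = """
<ul>
  <li><a href="/p/1001">First post</a></li>
  <li><a href="/p/1002">Second post</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

links = candidate_links(SAMPLE_HTML, "http://tieba.baidu.com/", limit=2)
```

The `limit` parameter plays the role of the set number (30 in the text's example); here it is 2 so that the truncation is visible.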
Step 12: screen the candidate links to obtain target links, and import the target links into a test tool.
Preferably, screening the candidate links to obtain target links includes: screening the candidate links according to a regular expression obtained from the template library of the crawler program to obtain the target links. The template library stores the fields and regular expressions of the seed URL information; the regular expression is the filter ID (see Fig. 1b) and is used for title filtering. In this embodiment the candidate links are screened with the crawler program's own regular expression, yielding for example 8 target links, which improves the accuracy of the crawl completeness rate subsequently determined from the target links. Note that the candidate links may also be screened according to a screening rule other than a regular expression, and the screening rule can be set according to test requirements.
Further, the target links obtained are imported into the test tool, for example into the Query field of the data table (DataTable:Query) in the test tool QuickTest Professional.
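The regular-expression screening in step 12 can be sketched as follows. The text does not spell out whether the pattern is applied to the URL or to the title, so this sketch assumes the URL; the pattern `POST_URL`, the function `screen_links` and the sample data are all hypothetical.

```python
import re


def screen_links(candidates, pattern):
    """Keep the (title, url) pairs whose URL matches `pattern`."""
    compiled = re.compile(pattern)
    return [(title, url) for title, url in candidates if compiled.search(url)]


# Hypothetical candidate links, as (title, url) pairs.
candidates = [
    ("First post", "http://tieba.baidu.com/p/1001"),
    ("About", "http://tieba.baidu.com/about"),
    ("Second post", "http://tieba.baidu.com/p/1002"),
]

# Hypothetical pattern: keep only URLs that end in /p/<digits>.
POST_URL = r"/p/\d+$"

targets = screen_links(candidates, POST_URL)
```

Using the same pattern that the crawler program itself uses means the test only expects pages the crawler is actually configured to collect, which is the accuracy argument made in the text.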
Step 13: determine the crawl completeness rate of the crawler program according to the target links imported into the test tool and the crawl result data of the crawler program.
In this embodiment, crawl result data is the data captured by the crawler program. Specifically, the crawl completeness rate can be determined from the degree of matching between the target links and the crawl result data. Referring to Fig. 1c, for each target link, if there is crawl result data matching the title or the URL (Uniform Resource Locator) address of the target link, the matching result associated with that target link is 1; otherwise it is 0. The crawl completeness rate of the crawler program is determined from the sum of all matching results.
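The per-link matching result described above can be sketched as below. This is an illustration only: it assumes that "matching" means exact equality of title or URL and that crawl results are available as dictionaries with `title` and `url` keys; the patent reads them from a database, and the names here are hypothetical.

```python
def match_results(targets, crawl_results):
    """Return 1 for each target link found in the crawl results, else 0.

    A target link counts as found when its title OR its URL equals the
    title or URL of some crawl result (exact equality is an assumption).
    """
    crawled_titles = {row["title"] for row in crawl_results}
    crawled_urls = {row["url"] for row in crawl_results}
    return [
        1 if title in crawled_titles or url in crawled_urls else 0
        for title, url in targets
    ]


# Hypothetical target links and crawl result data.
targets = [
    ("First post", "http://tieba.baidu.com/p/1001"),
    ("Second post", "http://tieba.baidu.com/p/1002"),
    ("Third post", "http://tieba.baidu.com/p/1003"),
]
crawl_results = [
    {"title": "First post", "url": "http://tieba.baidu.com/p/1001"},
    {"title": "A renamed post", "url": "http://tieba.baidu.com/p/1003"},
]

results = match_results(targets, crawl_results)
matched_total = sum(results)
```

The third target is counted as matched through its URL even though the stored title differs, which is the effect of the "title or URL" condition.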
In the technical solution provided by this embodiment, the webpage associated with a seed URL in the crawler seed library of the crawler program is accessed, target links are screened out from the link attributes of the webpage, and the crawl completeness rate of the crawler program is determined from the target links and the crawl result data of the crawler program. Compared with the prior-art practice of manually verifying the performance of a crawler program, this improves the efficiency of testing the crawler program.
For example, referring to Fig. 1d, after the candidate links are screened to obtain the target links, the method may further include: storing the titles and/or URL addresses of the target links in an Excel spreadsheet. For instance, the titles of the target links are stored in column B of the spreadsheet and the URL addresses in column C. In this embodiment the titles and/or URL addresses of the target links are stored in an Excel spreadsheet on the test machine rather than only in the test tool, which avoids losing the target links in abnormal situations such as a power failure of the test tool and makes the target links safer.
Embodiment two
This embodiment provides a further automated test method for web crawl completeness rate on the basis of Embodiment one. Fig. 2 is a flowchart of an automated test method for web crawl completeness rate in Embodiment two of the present invention. The method may be performed by an automated test apparatus for web crawl completeness rate; the apparatus may be implemented in hardware and/or software and is configured in a test machine. As shown in Fig. 2, the method includes the following steps:
Step 21: access the webpage associated with a seed URL read from the crawler seed library of the crawler program, and obtain a set number of candidate links from the link attributes of the webpage.
Step 22: screen the candidate links to obtain target links, and import the target links into a test tool.
Step 23: for each target link imported into the test tool, determine whether there is crawl result data matching the title or the Uniform Resource Locator (URL) address of the target link; if so, add 1 to the match parameter, where the initial value of the match parameter is 0.
Specifically, because the match parameter starts at 0, its final value is the number of target links that were matched, so it accurately reflects the degree of matching between the target links and the crawl result data.
Step 24: determine the crawl completeness rate of the crawler program according to the total number of target links and the value of the match parameter.
For example, determining the crawl completeness rate of the crawler program according to the total number of target links and the value of the match parameter may include:
Calculating the crawl completeness rate of the crawler program according to the following formula:
k = n/m, where k is the crawl completeness rate, n is the value of the match parameter, and m is the total number of target links.
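The formula k = n/m can be written as a small function. This is a minimal sketch: the function name and the input validation are additions for illustration, and the figure of 6 matched links is hypothetical (the 8 target links echo the example count mentioned in Embodiment one).

```python
def crawl_completeness_rate(matched, total):
    """Return k = n / m, where n = matched links and m = total target links."""
    if total <= 0:
        raise ValueError("at least one target link is required")
    if not 0 <= matched <= total:
        raise ValueError("matched must be between 0 and total")
    return matched / total


# Hypothetical run: 8 target links, 6 of them found in the crawl result data.
rate = crawl_completeness_rate(6, 8)
```

With these inputs the rate is 0.75, meaning the crawler program captured three quarters of the links the test expected it to capture.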
In the technical solution provided by this embodiment, the webpage associated with a seed URL in the crawler seed library of the crawler program is accessed, target links are screened out from the link attributes of the webpage, the value of the match parameter is obtained from the matching relationship between the target links and the crawl result data of the crawler program, and the crawl completeness rate of the crawler program is determined from the value of the match parameter and the total number of target links. This improves the efficiency of testing the crawler program.
Embodiment three
Fig. 3 is a schematic structural diagram of an automated test apparatus for web crawl completeness rate in Embodiment three of the present invention. The apparatus is configured in a test machine. As shown in Fig. 3, the apparatus may include:
A candidate link module 31, configured to access a webpage associated with a seed URL read from the crawler seed library of a crawler program, and to obtain a set number of candidate links from the link attributes of the webpage;
A target link module 32, configured to screen the candidate links to obtain target links, and to import the target links into a test tool;
A crawl completeness rate module 33, configured to determine the crawl completeness rate of the crawler program according to the target links imported into the test tool and the crawl result data of the crawler program.
For example, the target link module 32 may be specifically configured to:
Screen the candidate links according to a regular expression obtained from the template library of the crawler program to obtain the target links.
For example, the crawl completeness rate module 33 may include:
A match parameter unit, configured to determine, for each target link imported into the test tool, whether there is crawl result data matching the title or the URL address of the target link and, if so, to add 1 to the match parameter, where the initial value of the match parameter is 0;
A crawl completeness rate unit, configured to determine the crawl completeness rate of the crawler program according to the total number of target links and the value of the match parameter.
For example, the crawl completeness rate unit may be specifically configured to:
Calculate the crawl completeness rate of the crawler program according to the following formula:
k = n/m, where k is the crawl completeness rate, n is the value of the match parameter, and m is the total number of target links.
For example, the above automated test apparatus for web crawl completeness rate may further include:
A target link storage module, configured to store the titles and URL addresses of the target links in an Excel spreadsheet after the candidate links are screened to obtain the target links.
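One way to picture how the three modules of the apparatus fit together is the composition sketch below. It is not the patent's implementation: the class `CrawlRateTester`, its field names and the stub callables are hypothetical, and each module is reduced to a plain function so that the data flow from seed URL to rate stays visible.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Link = Tuple[str, str]  # (title, url)


@dataclass
class CrawlRateTester:
    """Chains the candidate link, target link and rate modules in order."""

    candidate_link_module: Callable[[str], List[Link]]
    target_link_module: Callable[[List[Link]], List[Link]]
    rate_module: Callable[[List[Link]], float]

    def run(self, seed_url: str) -> float:
        candidates = self.candidate_link_module(seed_url)
        targets = self.target_link_module(candidates)
        return self.rate_module(targets)


# Hypothetical stubs: one of the two target URLs was crawled.
crawled_urls = {"http://example.com/p/1"}

tester = CrawlRateTester(
    candidate_link_module=lambda seed: [
        ("A", seed + "p/1"),
        ("B", seed + "p/2"),
        ("C", seed + "about"),
    ],
    target_link_module=lambda links: [l for l in links if "/p/" in l[1]],
    rate_module=lambda targets: sum(
        url in crawled_urls for _, url in targets
    ) / len(targets),
)

rate = tester.run("http://example.com/")
```

Keeping each module behind a callable makes it easy to swap the stubs for a real browser driver, a real template-library pattern and a real database lookup without changing `run`.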
The automated test apparatus for web crawl completeness rate provided by this embodiment belongs to the same inventive concept as the automated test method for web crawl completeness rate provided by any embodiment of the present invention. It can perform that method and has the corresponding functional modules and beneficial effects. For technical details not described in this embodiment, refer to the automated test method for web crawl completeness rate provided by any embodiment of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the scope of protection of the invention. Therefore, although the invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the inventive concept; the scope of the invention is determined by the appended claims.

Claims (10)

1. An automated test method for web crawl completeness rate, characterized by comprising:
Accessing a webpage associated with a seed URL read from the crawler seed library of a crawler program, and obtaining a set number of candidate links from the link attributes of the webpage;
Screening the candidate links to obtain target links, and importing the target links into a test tool;
Determining the crawl completeness rate of the crawler program according to the target links imported into the test tool and the crawl result data of the crawler program.
2. The method according to claim 1, characterized in that screening the candidate links to obtain target links comprises:
Screening the candidate links according to a regular expression obtained from the template library of the crawler program to obtain the target links.
3. The method according to claim 1, characterized in that determining the crawl completeness rate of the crawler program according to the target links imported into the test tool and the crawl result data of the crawler program comprises:
For each target link imported into the test tool, determining whether there is crawl result data matching the title or the URL address of the target link and, if so, adding 1 to a match parameter, where the initial value of the match parameter is 0;
Determining the crawl completeness rate of the crawler program according to the total number of target links and the value of the match parameter.
4. The method according to claim 3, characterized in that determining the crawl completeness rate of the crawler program according to the total number of target links and the value of the match parameter comprises:
Calculating the crawl completeness rate of the crawler program according to the following formula:
k = n/m, where k is the crawl completeness rate, n is the value of the match parameter, and m is the total number of target links.
5. The method according to claim 1, characterized in that, after screening the candidate links to obtain target links, the method further comprises:
Storing the titles and/or URL addresses of the target links in an Excel spreadsheet.
6. An automated test apparatus for web crawl completeness rate, characterized by comprising:
A candidate link module, configured to access a webpage associated with a seed URL read from the crawler seed library of a crawler program, and to obtain a set number of candidate links from the link attributes of the webpage;
A target link module, configured to screen the candidate links to obtain target links, and to import the target links into a test tool;
A crawl completeness rate module, configured to determine the crawl completeness rate of the crawler program according to the target links imported into the test tool and the crawl result data of the crawler program.
7. The apparatus according to claim 6, characterized in that the target link module is specifically configured to:
Screen the candidate links according to a regular expression obtained from the template library of the crawler program to obtain the target links.
8. The apparatus according to claim 6, characterized in that the crawl completeness rate module comprises:
A match parameter unit, configured to determine, for each target link imported into the test tool, whether there is crawl result data matching the title or the URL address of the target link and, if so, to add 1 to a match parameter, where the initial value of the match parameter is 0;
A crawl completeness rate unit, configured to determine the crawl completeness rate of the crawler program according to the total number of target links and the value of the match parameter.
9. The apparatus according to claim 8, characterized in that the crawl completeness rate unit is specifically configured to:
Calculate the crawl completeness rate of the crawler program according to the following formula:
k = n/m, where k is the crawl completeness rate, n is the value of the match parameter, and m is the total number of target links.
10. The apparatus according to claim 6, characterized by further comprising:
A target link storage module, configured to store the titles and URL addresses of the target links in an Excel spreadsheet after the candidate links are screened to obtain the target links.
CN201510957702.3A 2015-12-17 2015-12-17 A kind of automated testing method and device of web crawlers rate Active CN105630673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510957702.3A CN105630673B (en) 2015-12-17 2015-12-17 A kind of automated testing method and device of web crawlers rate

Publications (2)

Publication Number Publication Date
CN105630673A (en) 2016-06-01
CN105630673B (en) 2018-12-25

Family

ID=56045643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510957702.3A Active CN105630673B (en) 2015-12-17 2015-12-17 A kind of automated testing method and device of web crawlers rate

Country Status (1)

Country Link
CN (1) CN105630673B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949852A (en) * 2020-08-31 2020-11-17 东华理工大学 Macroscopic economy analysis method and system based on internet big data

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN202383681U (en) * 2011-12-23 2012-08-15 江苏省现代企业信息化应用支撑软件工程技术研发中心 Webpage acquiring device based on gathered crawlers
CN102662954A (en) * 2012-03-02 2012-09-12 杭州电子科技大学 Method for implementing topical crawler system based on learning URL string information
CN102929920A (en) * 2012-09-19 2013-02-13 北京奇虎科技有限公司 Web-information-extraction-based monitoring method and device for software updating information
CN102930059A (en) * 2012-11-26 2013-02-13 电子科技大学 Method for designing focused crawler
CN104462158A (en) * 2013-09-25 2015-03-25 北大方正集团有限公司 Data grabbing method and data grabbing system
CN103984749A (en) * 2014-05-27 2014-08-13 电子科技大学 Focused crawler method based on link analysis
CN104794193A (en) * 2015-04-17 2015-07-22 南京大学 Webpage increment capture method for valid link acquisition

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TAO PENG 等: "Focused crawling enhanced by CBP-SLC", 《KNOWLEDGE-BASED SYSTEMS》 *
朱庆生 等: "一种基于链接和内容分析的自适应主题爬虫算法", 《计算机与现代化》 *
邹海亮: "可定制的聚焦网络爬虫", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
CN105630673B (en) 2018-12-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant