CN110377515B - Method for testing data quality of crawler - Google Patents

Method for testing data quality of crawler

Publication number: CN110377515B
Authority: CN (China)
Prior art keywords: field, data, crawler, quality, dependency
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201910632404.5A
Other languages: Chinese (zh)
Other versions: CN110377515A (en)
Inventor: 陈双艳
Current Assignee: Beijing Haizhi Xingtu Technology Co ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Beijing Haizhi Xingtu Technology Co ltd
Priority date: 2019-07-13 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2019-07-13
Publication date: 2022-10-21
Application filed by Beijing Haizhi Xingtu Technology Co ltd
Priority to CN201910632404.5A
Publication of CN110377515A
Application granted
Publication of CN110377515B
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Preventing errors by testing or debugging software
    • G06F11/3668: Software testing
    • G06F11/3672: Test management

Abstract

The invention discloses a method for testing crawler data quality, comprising the following steps: configuring a table field rule base; configuring field dependency relationships; sampling the crawler data by randomly extracting a specified number of data samples from the data source under test; invoking the rules; determining whether a field dependency relationship exists, that is, whether the current field is subject to a dependency rule; checking field dependencies, in which a field dependency checking module verifies the value of field B according to the value of field A; comparing fields, in which a field comparison module verifies whether each field satisfies its corresponding rule in the rule base; and outputting quality results, in which a quality result output module outputs error samples for analysis.

Description

Method for testing data quality of crawler
Technical Field
The invention relates to the technical field of Internet technology, and in particular to a method for testing crawler data quality.
Background
The web crawler is a program for automatically extracting web page data, and due to the diversity and uncertainty of web pages, the accuracy of the obtained crawler data is also greatly uncertain.
Disclosure of Invention
The invention aims to provide a method for testing crawler data quality that solves the problem described in the Background section.
To achieve this aim, the invention provides the following technical solution: a method for testing crawler data quality, comprising:
step 1, configuring a table field rule base so as to form a mapping between tables, fields and regular expressions;
step 2, configuring field dependency relationships;
step 3, sampling the crawler data by randomly extracting a specified number of data samples from the data source under test;
step 4, invoking the rules: iterating over each sample record and determining whether the value of each field of the table configured in step 1 satisfies its rule;
step 5, determining whether a field dependency relationship exists, that is, whether the current field is contained in the dependency relationships configured in step 2;
step 6, field dependency checking, wherein a field dependency checking module verifies the value of field B according to the value of field A;
step 7, field comparison, wherein a field comparison module verifies whether each field satisfies its corresponding rule in the rule base;
step 8, outputting a quality result, wherein a quality result output module outputs error samples for analysis.
Preferably, when step 5 determines that a field dependency relationship exists, the field dependency check of step 6 is invoked; otherwise, the field comparison of step 7 is invoked.
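The branch between step 6 and step 7 can be sketched roughly as follows. This is a minimal illustration in Python, not the patented implementation: the names RULES, DEPENDENCIES, check_dependency, compare_field and check_field, the sample values, and the shape of the dependency configuration are all assumptions made for the example.

```python
import re

# Hypothetical rule base: field name -> regular expression (assumed values).
RULES = {"field1": r"[A-C]", "field2": r"B?"}

# Hypothetical dependency configuration: for a given value of the key field,
# the dependent field must equal the expected value (None means "must be empty").
DEPENDENCIES = {"field1": {"A": ("field2", "B"), "C": ("field2", None)}}


def check_dependency(record, field):
    """Step 6: verify the dependent field according to the value of `field`."""
    mapping = DEPENDENCIES[field]
    value = record.get(field)
    if value not in mapping:
        return True  # no constraint is configured for this value
    dependent, expected = mapping[value]
    actual = record.get(dependent)
    if expected is None:
        return actual in (None, "")
    return actual == expected


def compare_field(record, field):
    """Step 7: verify the field value against its regular expression."""
    value = record.get(field)
    return value is not None and re.fullmatch(RULES[field], value) is not None


def check_field(record, field):
    """Step 5: dispatch to step 6 if the field has a dependency, else step 7."""
    if field in DEPENDENCIES:
        return check_dependency(record, field)
    return compare_field(record, field)
```

In this sketch a field that appears in the dependency configuration is checked only by the dependency module, which follows the "otherwise" wording of the text; running both checks for such a field would be an equally plausible reading.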
Preferably, the quality result output module in step 8 outputs the missing rate, null rate, accuracy rate and error rate of each field as an Excel spreadsheet, and outputs the error samples to a txt file for analysis.
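The four per-field rates are not defined in the text, so the following Python sketch fixes one possible set of definitions as an assumption: every rate is a share of the total sample count, a missing field is one whose key is absent, and a null field is one whose value is None or an empty string. The names field_rates and write_report are hypothetical, and a CSV file (which Excel can open) stands in for the Excel output to keep the example within the standard library.

```python
import csv
import re


def field_rates(records, field, pattern):
    """Classify each record for one field and return the four rates."""
    total = len(records)
    counts = {"missing": 0, "null": 0, "accurate": 0, "error": 0}
    for record in records:
        if field not in record:
            counts["missing"] += 1      # key absent from the record
        elif record[field] in (None, ""):
            counts["null"] += 1         # key present but value empty
        elif re.fullmatch(pattern, str(record[field])):
            counts["accurate"] += 1     # value satisfies the rule
        else:
            counts["error"] += 1        # value violates the rule
    return {name: (count / total if total else 0.0)
            for name, count in counts.items()}


def write_report(rates_by_field, out):
    """Write one row per field to a file-like object in CSV form."""
    writer = csv.writer(out)
    writer.writerow(["field", "missing_rate", "null_rate",
                     "accuracy_rate", "error_rate"])
    for field, rates in rates_by_field.items():
        writer.writerow([field, rates["missing"], rates["null"],
                         rates["accurate"], rates["error"]])
```

With these definitions the four rates of a field always sum to 1, which makes the report easy to sanity-check; other denominators (for example, accuracy over non-empty values only) would be just as defensible.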
Preferably, the table field rules may be regular expressions configured according to the business meaning of the field itself.
Preferably, in step 1, the field to which a rule applies may be a crawled company name.
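As an illustration of a rule derived from the business meaning of a field, a company-name rule might be sketched in Python as below. The suffix list, the table name "A", the field name company_name and the helper satisfies_rule are placeholders chosen for the example and do not come from the patent.

```python
import re

# Hypothetical suffixes that a crawled company name is expected to end with.
COMPANY_SUFFIXES = ("Co., Ltd.", "LLC", "Factory")

# At least one leading character followed by one of the suffixes.
COMPANY_NAME_RULE = re.compile(
    r".+(?:" + "|".join(re.escape(s) for s in COMPANY_SUFFIXES) + r")"
)

# Rule base in the table -> field -> regular expression form.
RULE_BASE = {"A": {"company_name": COMPANY_NAME_RULE}}


def satisfies_rule(table, field, value):
    """Return True when the whole value matches the configured rule."""
    return RULE_BASE[table][field].fullmatch(value) is not None
```

For example, satisfies_rule("A", "company_name", "Example Trading Co., Ltd.") passes, while a value such as "Example Trading", which lacks a recognised suffix, is flagged.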
Preferably, before the crawler data is sampled, the user IP is used as a credential, and the browser software and a remote network server are used as a connection channel, to form a database; the database is compared with the sampled data to obtain the quality analysis.
Preferably, the database includes previously recorded crawler data and network classification data; the past crawler data is assembled by the browser software from search records, search results and website-ID browsing records, and is stored on a hard disk.
Preferably, the network classification data comprises website information held on record and classified field information, which are compared with the sampled crawler information to obtain the information coincidence rate.
Compared with the prior art, the invention has the following beneficial effects: through the steps of table field rule base configuration, field dependency configuration, crawler data sampling, rule invocation, determination of whether a field dependency exists, field dependency checking, field comparison and quality result output, the method tests crawler data quality efficiently, quickly and at low cost.
Drawings
FIG. 1 is a block diagram of a method for testing crawler data quality.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "upper", "lower", "inner", "outer", "front", "rear", "both ends", "one end", "the other end", and the like indicate orientations or positional relationships based on orientations or positional relationships shown in the drawings, and are only for convenience of description and simplification of description, but do not indicate or imply that the device or element referred to must have a specific orientation, be configured in a specific orientation, and operate, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it is to be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "disposed," "connected," and the like are to be construed broadly, such as "connected," which may be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
As shown in FIG. 1, the method for testing crawler data quality of this embodiment comprises the following steps:
step 1, configuring a table field rule base, where a rule is generally a regular expression configured according to the business meaning of the field itself; for example, a crawled company name generally ends with "limited company", "limited liability company", "factory" or the like. This forms a mapping between tables, fields and regular expressions, such as A = [field1: regEx1, field2: regEx2], where A is a table name, field1 and field2 are fields in table A, regEx1 is the regular expression that the value of field1 should satisfy, and regEx2 is the regular expression that the value of field2 should satisfy;
step 2, configuring field dependency relationships; the extracted fields are sometimes associated with one another: for example, when the value of field1 is A, the value of field2 must be B, and when the value of field1 is C, the value of field2 should be null. Such a configuration may be written as A = {field1: {'value': [a1, b1, c1], 'field1': [c1, d1, e1]}, field2: {'value': [a2, b2, c2], 'field3': [c2, d2, e2]}};
step 3, sampling the crawler data by randomly extracting a specified number of data samples from the data source under test;
step 4, invoking the rules: iterating over each sample record and determining whether the value of each field of the table configured in step 1 satisfies its rule;
step 5, determining whether a field dependency relationship exists, that is, whether the dependency relationships configured in step 2 contain the current field; if a field dependency relationship exists, the field dependency check of step 6 is invoked, otherwise the field comparison of step 7 is invoked;
step 6, field dependency checking, wherein a field dependency checking module verifies the value of field B according to the value of field A;
step 7, field comparison, wherein a field comparison module verifies whether each field satisfies its corresponding rule in the rule base;
step 8, outputting a quality result, wherein a quality result output module outputs the missing rate, null rate, accuracy rate and error rate of each field as an Excel spreadsheet, and outputs the error samples to a txt file for analysis.
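The sampling, rule-invocation and error-output steps above could be sketched in Python along the following lines. This is only an illustration under stated assumptions: the rule base contents, the function names sample_records, find_errors and write_error_samples, the in-memory list used as the data source, and the JSON-lines layout of the txt file are all hypothetical; the dependency branch of steps 5 and 6 is left out to keep the sketch short.

```python
import json
import random
import re

# Hypothetical rule base: table -> field -> regular expression.
RULE_BASE = {"A": {"field1": r"\d+", "field2": r"[a-z]+"}}


def sample_records(source, sample_size, seed=None):
    """Step 3: randomly draw up to `sample_size` records from the source."""
    rng = random.Random(seed)
    return rng.sample(source, min(sample_size, len(source)))


def find_errors(table, samples):
    """Step 4: loop over each sample and test every configured field."""
    errors = []
    for record in samples:
        for field, pattern in RULE_BASE[table].items():
            value = record.get(field)
            if value is None or not re.fullmatch(pattern, str(value)):
                errors.append({"field": field, "value": value,
                               "record": record})
    return errors


def write_error_samples(errors, path):
    """Step 8: write one JSON line per error sample to a txt file."""
    with open(path, "w", encoding="utf-8") as handle:
        for error in errors:
            handle.write(json.dumps(error, ensure_ascii=False) + "\n")
```

Passing a fixed seed makes a sampling run reproducible, which can help when an error sample needs to be re-examined; in this sketch a missing value is counted as an error, although it could just as well be tracked separately.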
In the above embodiment, before the crawler data is sampled, the user IP is used as a credential, and the browser software and a remote network server are used as a connection channel, to form a database; the database is compared with the sampled data to obtain the quality analysis.
Specifically, the database comprises previously recorded crawler data and network classification data; the past crawler data is assembled by the browser software from search records, search results and website-ID browsing records, and is stored on a hard disk.
Specifically, the network classification data comprises website information held on record and classified field information, which are compared with the sampled crawler information to obtain the information coincidence rate.
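The text does not define how the information coincidence rate is calculated. One possible reading, sketched below in Python as an assumption, is the share of sampled records whose selected fields agree with the reference record for the same website. The function name coincidence_rate and the keys site_id and category are hypothetical.

```python
def coincidence_rate(samples, reference, key="site_id", fields=("category",)):
    """Share of sampled records that agree with the reference database."""
    if not samples:
        return 0.0
    # Index the reference rows by website identifier for direct lookup.
    by_key = {row[key]: row for row in reference}
    matched = 0
    for record in samples:
        ref = by_key.get(record.get(key))
        # A record coincides only if a reference row exists and every
        # compared field carries the same value in both.
        if ref is not None and all(record.get(f) == ref.get(f) for f in fields):
            matched += 1
    return matched / len(samples)
```

Under this definition a sampled record with no counterpart in the reference database counts as a mismatch; excluding such records from the denominator would be another reasonable choice.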
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (8)

1. A method for testing data quality of a crawler, characterized by comprising the following steps:
step 1, configuring a table field rule base so as to form a mapping between tables, fields and regular expressions;
step 2, configuring field dependency relationships;
step 3, sampling the crawler data by randomly extracting a specified number of data samples from the data source under test;
step 4, invoking the rules: iterating over each sample record and determining whether the value of each field of the table configured in step 1 satisfies its rule;
step 5, determining whether a field dependency relationship exists, specifically, whether the dependency relationships configured in step 2 contain the current field;
step 6, field dependency checking, wherein a field dependency checking module verifies the value of field B according to the value of field A;
step 7, field comparison, wherein a field comparison module verifies whether each field satisfies its corresponding rule in the rule base;
step 8, outputting a quality result, wherein a quality result output module outputs error samples for analysis.
2. The method for testing data quality of a crawler according to claim 1, wherein: when step 5 determines that a field dependency relationship exists, the field dependency check of step 6 is invoked; otherwise, the field comparison of step 7 is invoked.
3. The method for testing data quality of a crawler according to claim 1, wherein: in step 8, the quality result output module outputs the missing rate, null rate, accuracy rate and error rate of each field as an Excel spreadsheet, and outputs the error samples to a txt file for analysis.
4. The method for testing data quality of a crawler according to claim 1, wherein: the table field rules are regular expressions configured according to the business meaning of the field itself.
5. The method for testing data quality of a crawler according to claim 4, wherein: in step 1, the field to which a rule applies is a crawled company name.
6. The method for testing data quality of a crawler according to claim 1, wherein: before the crawler data is sampled, the user IP is used as a credential, and the browser software and a remote network server are used as a connection channel, to form a database; the database is compared with the sampled data to obtain the quality analysis.
7. The method for testing data quality of a crawler according to claim 6, wherein: the database comprises previously recorded crawler data and network classification data; the past crawler data is assembled by the browser software from search records, search results and website-ID browsing records, and is stored on a hard disk.
8. The method for testing data quality of a crawler according to claim 7, wherein: the network classification data comprises website information held on record and classified field information, which are compared with the sampled crawler information to obtain the information coincidence rate.
CN201910632404.5A 2019-07-13 2019-07-13 Method for testing data quality of crawler Active CN110377515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910632404.5A CN110377515B (en) 2019-07-13 2019-07-13 Method for testing data quality of crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910632404.5A CN110377515B (en) 2019-07-13 2019-07-13 Method for testing data quality of crawler

Publications (2)

Publication Number Publication Date
CN110377515A (en) 2019-10-25
CN110377515B (en) 2022-10-21

Family

ID=68252998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910632404.5A Active CN110377515B (en) 2019-07-13 2019-07-13 Method for testing data quality of crawler

Country Status (1)

Country Link
CN (1) CN110377515B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2407877A1 (en) * 2010-07-14 2012-01-18 Fujitsu Limited Methods and systems for extensive crawling of web applications
CN103942335A (en) * 2014-05-07 2014-07-23 武汉大学 Construction method of uninterrupted crawler system oriented to web page structure change
CN109933514A (en) * 2017-12-18 2019-06-25 北京京东尚科信息技术有限公司 A kind of data test method and apparatus

Also Published As

Publication number Publication date
CN110377515A (en) 2019-10-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant