CN110377515B - Method for testing data quality of crawler - Google Patents
Method for testing data quality of crawler
- Publication number: CN110377515B
- Authority: CN (China)
- Prior art keywords: field, data, crawler, quality, dependency
- Prior art date
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
Abstract
The invention discloses a method for testing the data quality of a crawler, comprising: configuring a table field rule base; configuring field dependency relationships; sampling crawler data by randomly extracting a specified number of data samples from the data source under test; calling the rules; judging whether a field dependency relationship exists, that is, whether the current field is subject to a dependency rule; checking field dependencies, in which a field dependency checking module verifies the value of field B according to the value of field A; comparing fields, in which a field comparison module verifies whether each field satisfies its corresponding rule in the rule base; and outputting quality results, in which a quality result output module outputs error samples for analysis.
Description
Technical Field
The invention relates to the field of internet technology, and in particular to a method for testing crawler data quality.
Background
A web crawler is a program that automatically extracts web page data. Because web pages are diverse and unpredictable, the accuracy of the crawled data is also highly uncertain.
Disclosure of Invention
The invention aims to provide a method for testing the data quality of a crawler, which aims to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme: a method of testing crawler data quality, comprising:
step 1, configuring a table field rule base so as to form a matching relationship among tables, fields and regular expressions;
step 2, configuring field dependency relationships;
step 3, sampling crawler data by randomly extracting a specified number of data samples from the data source under test;
step 4, calling the rules: iterating over each sample record and judging whether the value of each field of the table configured in step 1 satisfies its rule;
step 5, judging whether a field dependency relationship exists, that is, whether the current field is contained in the dependency relationships configured in step 2;
step 6, field dependency checking, wherein a field dependency checking module verifies the value of field B according to the value of field A;
step 7, field comparison, wherein a field comparison module verifies whether each field satisfies its corresponding rule in the rule base;
step 8, outputting quality results, wherein a quality result output module outputs error samples for analysis.
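By way of illustration only, the random sampling of step 3 might be sketched as follows. The function name `sample_records`, the `seed` parameter and the example rows are assumptions introduced here and do not appear in the original text.

```python
import random

def sample_records(records, sample_size, seed=None):
    """Randomly draw up to `sample_size` records without replacement (step 3)."""
    rng = random.Random(seed)            # a seed makes a test run repeatable
    k = min(sample_size, len(records))   # never ask for more rows than exist
    return rng.sample(records, k)

# Hypothetical crawled rows standing in for the data source under test.
crawled = [{"id": i, "company_name": f"Example {i} Co., Ltd."} for i in range(1000)]
samples = sample_records(crawled, sample_size=50, seed=42)
```

Sampling without replacement keeps each record from being counted twice in the later per-field statistics.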
Preferably, when step 5 determines that a field dependency relationship exists, the field dependency check of step 6 is invoked; otherwise, the field comparison of step 7 is invoked.
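The branch between step 6 and step 7 can be pictured with a minimal sketch. The two checker functions below are placeholder stand-ins that only report which step handled the field; their names and the shape of `dependencies` are assumptions, not part of the original disclosure.

```python
def check_field(field, record, dependencies, dependency_check, field_comparison):
    """Route one field to the step 6 or step 7 checker, as decided in step 5."""
    if field in dependencies:                    # step 5: field is listed in the step 2 configuration
        return dependency_check(field, record)   # step 6
    return field_comparison(field, record)       # step 7

# Hypothetical stand-ins for the two modules; each reports which step handled the field.
def dependency_check(field, record):
    return ("step 6", field)

def field_comparison(field, record):
    return ("step 7", field)

dependencies = {"field1": {}}   # only field1 appears in the dependency configuration
record = {"field1": "A", "field2": "B"}
results = [
    check_field(f, record, dependencies, dependency_check, field_comparison)
    for f in record
]
```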
Preferably, in step 8 the quality result output module outputs the missing rate, null rate, accuracy rate and error rate of each field as an Excel spreadsheet, and outputs the error samples as a txt file for analysis.
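The four rates are not defined further in the text, so the sketch below rests on assumed definitions: a field is "missing" when the key is absent from a record, "null" when it is present but empty, "accurate" when its value fully matches the rule, and an "error" otherwise, each divided by the number of samples. A CSV file (which Excel can open) is used here as a stand-in for the Excel output; the function names and example data are hypothetical.

```python
import csv
import re

def field_stats(samples, rules):
    """Return per-field rates and the list of (field, record) error samples."""
    stats, errors = {}, []
    total = len(samples)
    for field, pattern in rules.items():
        counts = {"missing": 0, "null": 0, "correct": 0, "error": 0}
        for record in samples:
            if field not in record:
                counts["missing"] += 1            # key absent from the record
            elif record[field] in (None, ""):
                counts["null"] += 1               # key present but empty
            elif re.fullmatch(pattern, str(record[field])):
                counts["correct"] += 1            # value satisfies the rule
            else:
                counts["error"] += 1              # value violates the rule
                errors.append((field, record))
        stats[field] = {k: (n / total if total else 0.0) for k, n in counts.items()}
    return stats, errors

def write_report(stats, errors, csv_path, txt_path):
    """Write the rates to a CSV file and the error samples to a txt file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["field", "missing_rate", "null_rate", "accuracy_rate", "error_rate"])
        for field, s in stats.items():
            writer.writerow([field, s["missing"], s["null"], s["correct"], s["error"]])
    with open(txt_path, "w", encoding="utf-8") as f:
        for field, record in errors:
            f.write(f"{field}\t{record!r}\n")

# Hypothetical rule and samples: one accurate, one erroneous, one null, one missing.
rules = {"company_name": r".+Co\., Ltd\."}
samples = [
    {"company_name": "Acme Trading Co., Ltd."},
    {"company_name": "Acme Trading"},
    {"company_name": ""},
    {},
]
stats, errors = field_stats(samples, rules)
```

Under these assumed definitions the four rates of a field always sum to 1, which gives a quick sanity check on the report.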
Preferably, the table field rules may be regular expressions configured according to the business meaning of the field itself.
Preferably, in step 1, the field may be a crawled company name.
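A rule base of this kind can be sketched as a nested mapping from table name to field name to regular expression. The table name `company`, the field names and the English suffixes below are illustrative assumptions; an actual rule base would use patterns chosen for the business meaning of each field.

```python
import re

# Hypothetical rule base: table -> field -> compiled regular expression.
RULE_BASE = {
    "company": {
        # A company name is assumed to end with one of a few known suffixes.
        "company_name": re.compile(r".+(Co\., Ltd\.|LLC|Factory)"),
        # A founding date is assumed to be written as YYYY-MM-DD.
        "founded": re.compile(r"\d{4}-\d{2}-\d{2}"),
    }
}

def satisfies_rule(table, field, value):
    """Return True when `value` fully matches the rule configured for table.field."""
    pattern = RULE_BASE[table][field]
    return value is not None and pattern.fullmatch(str(value)) is not None
```

For example, `satisfies_rule("company", "company_name", "Acme Trading Co., Ltd.")` holds, while the same call with `"Acme Trading"` does not.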
Preferably, before the crawler data is sampled, the user IP is used as a credential, and browser software and a remote network server are used as a connection channel to form a database, which is compared with the sampled data to obtain the quality analysis result.
Preferably, the database includes past crawler data and recorded network classification data; the past crawler data is assembled by the browser software from search records, search results and website ID browsing records, and is stored on a hard disk.
Preferably, the network classification data comprises recorded website information and classified field information, which are compared with the sampled crawler information to obtain the information coincidence rate.
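The text does not define the information coincidence rate. One possible reading, offered only as an assumption, is the fraction of compared field values in the sampled records that equal the values held in the reference database for the same website. The key name `url`, the function name and the example rows are hypothetical.

```python
def coincidence_rate(reference, sampled, key="url"):
    """Fraction of sampled field values equal to the reference values for the same key."""
    ref_by_key = {r[key]: r for r in reference}
    compared = matched = 0
    for record in sampled:
        ref = ref_by_key.get(record.get(key))
        if ref is None:
            continue                      # no reference entry for this website
        for field, value in record.items():
            if field == key or field not in ref:
                continue                  # only compare fields known to the reference
            compared += 1
            matched += value == ref[field]
    return matched / compared if compared else 0.0

# Hypothetical reference database rows and sampled crawler rows.
reference = [
    {"url": "https://example.com/a", "category": "manufacturing", "company_name": "Acme Factory"},
    {"url": "https://example.com/b", "category": "retail", "company_name": "Bolt Co., Ltd."},
]
sampled = [
    {"url": "https://example.com/a", "category": "manufacturing", "company_name": "Acme Factory"},
    {"url": "https://example.com/b", "category": "retail", "company_name": "Bolt Company"},
]
rate = coincidence_rate(reference, sampled)
```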
Compared with the prior art, the invention has the following beneficial effects: through the steps of table field rule base configuration, crawler data sampling, judgment of whether a field dependency exists, field dependency checking, field comparison and quality result output, the method tests crawler data quality efficiently, quickly and at low cost.
Drawings
FIG. 1 is a block diagram of a method for testing crawler data quality.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "upper", "lower", "inner", "outer", "front", "rear", "both ends", "one end", "the other end", and the like indicate orientations or positional relationships based on orientations or positional relationships shown in the drawings, and are only for convenience of description and simplification of description, but do not indicate or imply that the device or element referred to must have a specific orientation, be configured in a specific orientation, and operate, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it is to be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "disposed," "connected," and the like are to be construed broadly, such as "connected," which may be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
As shown in FIG. 1, the method for testing crawler data quality of this embodiment comprises the following steps:
step 1, configuring a table field rule base, where each rule is generally a regular expression configured according to the business meaning of the field itself. For example, a crawled company name generally ends with "limited company", "limited liability company", "factory" or the like. This forms a matching relationship among tables, fields and regular expressions, such as A = [field1: regEx1, field2: regEx2], where A is a table name, field1 and field2 are fields in table A, regEx1 is the regular expression that the value of field1 should satisfy, and regEx2 is the regular expression that the value of field2 should satisfy;
step 2, configuring field dependency relationships, since the extracted fields are sometimes associated with one another. For example, when the value of field1 is A, the value of field2 must be B; when the value of field1 is C, the value of field2 should be null. Such a configuration takes a form such as A = {field1: {'value': [a1, b1, c1], 'field1': [c1, d1, e1]}, field2: {'value': [a2, b2, c2], 'field3': [c2, d2, e2]}};
step 3, sampling crawler data, and randomly extracting a specified number of data samples from a detected data source;
step 4, calling the rules: iterating over each sample record and judging whether the value of each field of the table configured in step 1 satisfies its rule;
step 5, judging whether a field dependency relationship exists, that is, whether the current field is contained in the dependency relationships configured in step 2; if a field dependency relationship exists, the field dependency check of step 6 is invoked; otherwise, the field comparison of step 7 is invoked;
step 6, field dependence checking, wherein a field dependence checking module verifies the value of the field B according to the value of the field A;
step 7, comparing the fields, wherein a field comparison module verifies whether the corresponding fields meet the corresponding rules in the rule base;
step 8, outputting quality results, wherein the quality result output module outputs the missing rate, null rate, accuracy rate and error rate of each field as an Excel spreadsheet, and outputs the error samples as a txt file for analysis.
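The dependency check of step 6 can be illustrated with the example given in step 2: when field1 is A, field2 must be B, and when field1 is C, field2 should be null. The configuration layout below (`when`, `then_field`, `expect`) is a simplified assumption chosen for readability and is not the structure given in the original text.

```python
# Hypothetical dependency configuration: for each field A, a list of
# "when A has this value, field B must have that value" rules.
DEPENDENCIES = {
    "field1": [
        {"when": "A", "then_field": "field2", "expect": "B"},
        {"when": "C", "then_field": "field2", "expect": None},  # None means "must be null"
    ]
}

def check_dependencies(record, dependencies):
    """Return (field_a, field_b, actual_value) for every violated dependency."""
    violations = []
    for field_a, rules in dependencies.items():
        for rule in rules:
            if record.get(field_a) != rule["when"]:
                continue                           # rule does not apply to this record
            actual = record.get(rule["then_field"])
            if rule["expect"] is None:
                ok = actual in (None, "")          # dependent field must be empty
            else:
                ok = actual == rule["expect"]      # dependent field must match exactly
            if not ok:
                violations.append((field_a, rule["then_field"], actual))
    return violations
```

A record such as `{"field1": "A", "field2": "X"}` would be reported as a violation and could then be written to the error sample file in step 8.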
In the above embodiment, before the crawler data is sampled, the user IP is used as a credential, and browser software and a remote network server are used as a connection channel to form a database, which is compared with the sampled data to obtain the quality analysis result.
Specifically, the database comprises past crawler data and recorded network classification data; the past crawler data is assembled by the browser software from search records, search results and website ID browsing records, and is stored on a hard disk.
Specifically, the network classification data comprises recorded website information and classified field information, which are compared with the sampled crawler information to obtain the information coincidence rate.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Claims (8)
1. A method for testing data quality of a crawler is characterized by comprising the following steps:
step 1, configuring a table field rule base so as to form a table, field and regular matching relation;
step 2, configuring field dependency;
step 3, sampling crawler data, and randomly extracting a specified number of data samples from a detected data source;
step 4, calling the rules: iterating over each sample record and judging whether the value of each field of the table configured in step 1 satisfies its rule;
step 5, judging whether a field dependency relationship exists, specifically, whether the current field is contained in the dependency relationships configured in step 2;
step 6, field dependence checking, wherein a field dependence checking module verifies the value of the field B according to the value of the field A;
step 7, comparing the fields, wherein a field comparison module verifies whether the corresponding fields meet the corresponding rules in the rule base;
step 8, outputting quality results, wherein a quality result output module outputs error samples for analysis.
2. The method for testing data quality of a crawler according to claim 1, wherein: when step 5 judges that a field dependency relationship exists, the field dependency check of step 6 is invoked; otherwise, the field comparison of step 7 is invoked.
3. The method for testing data quality of a crawler according to claim 1, wherein: in step 8, the quality result output module outputs the missing rate, null rate, accuracy rate and error rate of each field as an Excel spreadsheet, and outputs the error samples as a txt file for analysis.
4. The method for testing data quality of a crawler according to claim 1, wherein: the table field rules are regular expressions configured according to the business meaning of the field itself.
5. The method for testing data quality of a crawler according to claim 4, wherein: in step 1, the field is a crawled company name.
6. The method for testing data quality of a crawler according to claim 1, wherein: before the crawler data is sampled, the user IP is used as a credential, and browser software and a remote network server are used as a connection channel to form a database, which is compared with the sampled data to obtain the quality analysis result.
7. The method for testing data quality of a crawler according to claim 6, wherein: the database comprises past crawler data and recorded network classification data; the past crawler data is assembled by the browser software from search records, search results and website ID browsing records, and is stored on a hard disk.
8. The method for testing data quality of a crawler according to claim 7, wherein: the network classification data comprises recorded website information and classified field information, which are compared with the sampled crawler information to obtain the information coincidence rate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910632404.5A CN110377515B (en) | 2019-07-13 | 2019-07-13 | Method for testing data quality of crawler |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910632404.5A CN110377515B (en) | 2019-07-13 | 2019-07-13 | Method for testing data quality of crawler |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110377515A (en) | 2019-10-25 |
CN110377515B (en) | 2022-10-21 |
Family
ID=68252998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910632404.5A Active CN110377515B (en) | 2019-07-13 | 2019-07-13 | Method for testing data quality of crawler |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110377515B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2407877A1 (en) * | 2010-07-14 | 2012-01-18 | Fujitsu Limited | Methods and systems for extensive crawling of web applications |
CN103942335A (en) * | 2014-05-07 | 2014-07-23 | 武汉大学 | Construction method of uninterrupted crawler system oriented to web page structure change |
CN109933514A (en) * | 2017-12-18 | 2019-06-25 | 北京京东尚科信息技术有限公司 | A kind of data test method and apparatus |
Also Published As
Publication number | Publication date |
---|---|
CN110377515A (en) | 2019-10-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||