CN110377515B - Method for testing data quality of crawler

Info

Publication number: CN110377515B
Application number: CN201910632404.5A
Authority: CN
Inventors: 陈双艳
Original assignee: Beijing Haizhi Xingtu Technology Co ltd
Current assignee: Beijing Haizhi Xingtu Technology Co ltd
Priority date: 2019-07-13
Filing date: 2019-07-13
Publication date: 2022-10-21
Anticipated expiration: 2039-07-13
Also published as: CN110377515A

Abstract

The invention discloses a method for testing crawler data quality, comprising the following steps: configuring a table field rule base; configuring field dependency relationships; sampling crawler data by randomly extracting a specified number of data samples from the data source under test; invoking the rules; determining whether a field dependency relationship exists, that is, whether the current field has a rule dependency; checking field dependencies, in which a field dependency checking module verifies the value of field B according to the value of field A; comparing fields, in which a field comparison module verifies whether each field satisfies its corresponding rule in the rule base; and outputting quality results, in which a quality result output module outputs error samples for analysis.

Description

Method for testing data quality of crawler

Technical Field

The invention relates to the field of internet technology, and in particular to a method for testing crawler data quality.

Background

A web crawler is a program that automatically extracts web page data. Because web pages are diverse and unpredictable, the accuracy of the crawled data is also highly uncertain.

Disclosure of Invention

The invention aims to provide a method for testing crawler data quality that solves the problem described in the Background.

To achieve this purpose, the invention provides the following technical solution: a method for testing crawler data quality, comprising:

step 1, configuring a table field rule base, so as to form a mapping between tables, fields, and regular expressions;

step 2, configuring field dependency relationships;

step 3, sampling the crawler data by randomly extracting a specified number of data samples from the data source under test;

step 4, invoking the rules: iterating over each sample record and determining whether the value of each field in the table of step 1 satisfies its rule;

step 5, determining whether a field dependency relationship exists, that is, whether the current field has a rule dependency, i.e. whether the field is contained in the dependency relationships of step 2;

step 6, field dependency checking, in which a field dependency checking module verifies the value of field B according to the value of field A;

step 7, field comparison, in which a field comparison module verifies whether each field satisfies its corresponding rule in the rule base;

step 8, outputting the quality result, in which a quality result output module outputs error samples for analysis.
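
The sampling in step 3 can be illustrated with a short Python sketch. It is only one possible reading of "randomly extracting a specified number of data samples": the function name `sample_records`, the optional `seed` parameter, and the choice to return every record when fewer than `n` are available are assumptions, not details given in the patent.

```python
import random


def sample_records(records, n, seed=None):
    """Draw up to n distinct records uniformly at random.

    records: any iterable of crawled records (hypothetical input format).
    n:       the specified number of samples.
    seed:    optional seed so that a test run can be repeated (assumption).
    """
    rng = random.Random(seed)
    population = list(records)
    # random.sample raises ValueError if n exceeds the population size,
    # so cap n at the number of available records.
    return rng.sample(population, min(n, len(population)))
```

Sampling without replacement keeps each record from being counted twice in the later per-field rates.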

Preferably, when step 5 determines that a field dependency relationship exists, the field dependency check of step 6 is invoked; otherwise, the field comparison module of step 7 is invoked.

Preferably, the quality result output module in step 8 outputs the missing rate, null rate, accuracy rate, and error rate of each field as an Excel spreadsheet, and outputs the error samples as a txt file for analysis.
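
The four per-field rates can be sketched as follows. The patent does not define them, so the definitions in the docstring are assumptions, as are the function names. For a self-contained example, the table is written as CSV (which Excel can open) rather than as a native Excel file, and the error samples are written one per line as plain text.

```python
import csv
import re


def field_rates(samples, rules):
    """Compute per-field rates over the sampled records.

    Assumed definitions (not given in the source):
      missing  = the field key is absent from the record
      null     = the key is present but the value is None or ""
      accuracy = the value fully matches the field's regular expression
      error    = the value is present but does not match
    Returns (report, errors), where errors lists (field, record) pairs.
    """
    total = len(samples)
    report, errors = {}, []
    for field, pattern in rules.items():
        regex = re.compile(pattern)
        counts = {"missing": 0, "null": 0, "accuracy": 0, "error": 0}
        for record in samples:
            if field not in record:
                counts["missing"] += 1
            elif record[field] in (None, ""):
                counts["null"] += 1
            elif regex.fullmatch(str(record[field])):
                counts["accuracy"] += 1
            else:
                counts["error"] += 1
                errors.append((field, record))
        report[field] = {k: (v / total if total else 0.0) for k, v in counts.items()}
    return report, errors


def write_outputs(report, errors, table_stream, error_stream):
    """Write the rate table as CSV and one error sample per line as text."""
    writer = csv.writer(table_stream)
    writer.writerow(["field", "missing", "null", "accuracy", "error"])
    for field, rates in report.items():
        writer.writerow([field, rates["missing"], rates["null"],
                         rates["accuracy"], rates["error"]])
    for field, record in errors:
        error_stream.write(f"{field}\t{record!r}\n")
```

Under these definitions the four rates of a field sum to 1, which gives a quick sanity check on the output table.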

Preferably, the table field rules may be regular expressions configured according to the business meaning of each field.

Preferably, in step 1, the crawled field may be, for example, a company name.
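
A rule base of this kind might look like the sketch below. The table name `company`, the field names, and the date rule are hypothetical. The company-name pattern assumes the Chinese suffixes 有限公司 (limited company), 有限责任公司 (limited liability company), and 厂 (factory) as the endings the translated text refers to; the patent gives no concrete expression.

```python
import re

# Hypothetical rule base: table name -> field name -> compiled regular expression.
RULE_BASE = {
    "company": {
        # A company name is expected to end with one of the assumed suffixes.
        "name": re.compile(r".+(有限公司|有限责任公司|厂)"),
        # Illustrative second field: an ISO-style date such as 2019-07-13.
        "founded": re.compile(r"\d{4}-\d{2}-\d{2}"),
    }
}


def matches(table, field, value):
    """Return True if value fully matches the rule configured for table.field."""
    return RULE_BASE[table][field].fullmatch(value) is not None
```

`fullmatch` is used so that the whole value, including its ending, must satisfy the rule; `search` would accept a value that merely contains a valid substring.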

Preferably, before the crawler data is sampled, a database is formed using the user IP as a credential and the browser software and a remote network server as the connection channel; the database is compared with the sampled data to obtain the quality analysis result.

Preferably, the database includes previously recorded crawler data and network classification data; the past crawler data is assembled by the browser software from search records, search results, and website ID browsing records, and is stored on a hard disk.

Preferably, the network classification data comprises website information on record and classified field information, which are compared with the sampled crawler information to obtain the information coincidence rate.
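
The patent does not say how the information coincidence rate is calculated. One plausible reading, shown here purely as an assumption, is the fraction of sampled records whose compared fields agree with the reference record for the same website. The function name, the `key` and `fields` parameters, and the record layout are all hypothetical.

```python
def coincidence_rate(samples, reference, key, fields):
    """Fraction of sampled records that agree with the reference database.

    samples:   list of sampled crawler records (dicts).
    reference: dict mapping a key value (e.g. a website) to its reference record.
    key:       name of the field used to look up the reference record.
    fields:    names of the fields that must be equal for a record to coincide.
    A sample with no reference record counts as not coinciding (assumption).
    """
    if not samples:
        return 0.0
    matched = 0
    for record in samples:
        ref = reference.get(record.get(key))
        if ref is not None and all(record.get(f) == ref.get(f) for f in fields):
            matched += 1
    return matched / len(samples)
```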

Compared with the prior art, the invention has the following beneficial effects: through the steps of configuring the table field rule base, sampling crawler data, determining whether a field dependency exists, checking field dependencies, comparing fields, and outputting quality results, the method tests crawler data quality efficiently, quickly, and at low cost.

Drawings

FIG. 1 is a block diagram of a method for testing crawler data quality.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be noted that the terms "upper", "lower", "inner", "outer", "front", "rear", "both ends", "one end", "the other end", and the like indicate orientations or positional relationships based on orientations or positional relationships shown in the drawings, and are only for convenience of description and simplification of description, but do not indicate or imply that the device or element referred to must have a specific orientation, be configured in a specific orientation, and operate, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it is to be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "disposed," "connected," and the like are to be construed broadly, such as "connected," which may be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.

As shown in FIG. 1, the method for testing crawler data quality of this embodiment comprises:

step 1, configuring a table field rule base. A rule is generally a regular expression configured according to the business meaning of the field. For example, a crawled company name generally ends with "limited company", "limited liability company", "factory", or the like. This forms a mapping between tables, fields, and regular expressions, such as a = [field1: regEx1, field2: regEx2], where a is a table name, field1 and field2 are fields in table a, regEx1 is the regular expression that the value of field1 should satisfy, and regEx2 is the regular expression that the value of field2 should satisfy;

step 2, configuring field dependency relationships. The extracted fields are sometimes associated with one another: for example, when the value of field1 is A, the value of field2 must be B; when the value of field1 is C, the value of field2 should be null. Such relationships can be written as A = { field1: { 'value': [a1, b1, c1], 'field1': [c1, d1, e1] }, field2: { 'value': [a2, b2, c2], 'field3': [c2, d2, e2] } };

step 3, sampling the crawler data by randomly extracting a specified number of data samples from the data source under test;

step 4, invoking the rules: iterating over each sample record and determining whether the value of each field in the table of step 1 satisfies its rule;

step 5, determining whether a field dependency relationship exists, that is, whether the current field has a rule dependency, i.e. whether the dependency relationships of step 2 contain the field; if so, the field dependency check of step 6 is invoked; otherwise, the field comparison module of step 7 is invoked;

step 6, field dependency checking, in which a field dependency checking module verifies the value of field B according to the value of field A;

step 7, field comparison, in which a field comparison module verifies whether each field satisfies its corresponding rule in the rule base;

step 8, outputting the quality result: the quality result output module outputs the missing rate, null rate, accuracy rate, and error rate of each field as an Excel spreadsheet, and outputs the error samples as a txt file for analysis.
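
The branching between the field dependency check and the field comparison can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. The layout of the `dependencies` dictionary (a simplified form of the configuration in step 2), the function names, and the decision to pass a field when no constraint is configured for the driving value are all assumptions.

```python
import re


def check_dependency(record, field, dep):
    """Verify the value of `field` (field B) from the value of the field it
    depends on (field A). dep has the hypothetical form
    {"depends_on": "<field A>", "expected": {<A value>: <required B value>}}."""
    driver_value = record.get(dep["depends_on"])
    if driver_value not in dep["expected"]:
        return True  # assumption: no constraint configured for this A value
    return record.get(field) == dep["expected"][driver_value]


def check_rule(record, field, pattern):
    """Field comparison: the value must fully match the field's regular expression."""
    value = record.get(field)
    return value is not None and re.fullmatch(pattern, str(value)) is not None


def check_record(record, rules, dependencies):
    """For each configured field, run the dependency check if the field appears
    in the dependency configuration; otherwise run the field comparison."""
    results = {}
    for field, pattern in rules.items():
        if field in dependencies:
            results[field] = check_dependency(record, field, dependencies[field])
        else:
            results[field] = check_rule(record, field, pattern)
    return results
```

For example, with a hypothetical dependency stating that `currency` must be `"CNY"` when `country` is `"CN"` and must be null when `country` is `"unknown"`, a record with `country="unknown"` and `currency="USD"` fails on `currency` even though `"USD"` satisfies the field's own regular expression.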

In the above embodiment, before the crawler data is sampled, a database is formed using the user IP as a credential and the browser software and a remote network server as the connection channel; the database is compared with the sampled data to obtain the quality analysis result.

Specifically, the database comprises previously recorded crawler data and network classification data; the past crawler data is assembled by the browser software from search records, search results, and website ID browsing records, and is stored on a hard disk.

Specifically, the network classification data comprises website information on record and classified field information, which are compared with the sampled crawler information to obtain the information coincidence rate.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims

1. A method for testing data quality of a crawler is characterized by comprising the following steps:

step 2, configuring field dependency;

step 5, determining whether a field dependency relationship exists, that is, whether the current field has a rule dependency, specifically whether the dependency relationships of step 2 contain the field;

2. The method for testing crawler data quality according to claim 1, wherein: when step 5 determines that a field dependency relationship exists, the field dependency check of step 6 is invoked; otherwise, the field comparison module of step 7 is invoked.

3. The method for testing crawler data quality according to claim 1, wherein: in step 8, the quality result output module outputs the missing rate, null rate, accuracy rate, and error rate of each field as an Excel spreadsheet, and outputs the error samples as a txt file for analysis.

4. The method for testing crawler data quality according to claim 1, wherein: the table field rules are regular expressions configured according to the business meaning of each field.

5. The method for testing crawler data quality according to claim 4, wherein: in step 1, the crawled field is a company name.

6. The method for testing crawler data quality according to claim 1, wherein: before the crawler data is sampled, a database is formed using the user IP as a credential and the browser software and a remote network server as the connection channel, and the database is compared with the sampled data to obtain the quality analysis result.

7. The method for testing crawler data quality according to claim 6, wherein: the database comprises previously recorded crawler data and network classification data; the past crawler data is assembled by the browser software from search records, search results, and website ID browsing records, and is stored on a hard disk.

8. The method for testing crawler data quality according to claim 7, wherein: the network classification data comprises website information on record and classified field information, which are compared with the sampled crawler information to obtain the information coincidence rate.