
CN112364040B - Data checking method, device, medium and electronic equipment

Info

Publication number: CN112364040B
Application number: CN202011386405.5A
Authority: CN
Inventors: 项志坚; 谢永恒; 程强
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2020-12-01
Filing date: 2020-12-01
Publication date: 2024-05-10
Anticipated expiration: 2040-12-01
Also published as: CN112364040A

Abstract

The embodiment of the application discloses a data checking method, a data checking device, a medium and electronic equipment. The method comprises the following steps: if a data access event is detected, extracting a characterization field of the data according to a characterization field extraction rule, and determining access identification information of the data according to the characterization field; if a data warehousing event is detected, determining a characterization field of the data based on the characterization field extraction rule, and determining warehousing identification information of the data according to the characterization field; and checking the number of data records and the data content according to the access identification information and the warehousing identification information, to obtain a data checking result. By executing this scheme, the data warehousing process can be monitored accurately.

Description

Data checking method, device, medium and electronic equipment

Technical Field

The embodiment of the application relates to the technical field of the Internet, and in particular to a data checking method, device, medium and electronic equipment.

Background

With the rapid development of Internet technology, keeping accurate statistics while massive amounts of data are being warehoused has become very difficult. On the one hand, the volume of data to be counted is huge. On the other hand, the data content itself may change: even if the number of records counted at each stage matches, the content may have been corrupted during transmission and processing. As a result, the data warehousing process cannot be monitored accurately.

Disclosure of Invention

The embodiment of the application provides a data checking method, a device, a medium and electronic equipment, which can check data accurately when large volumes of data are fed into a big data center, and can ensure the accuracy and validity of the statistical results.

In a first aspect, an embodiment of the present application provides a method for checking data, where the method includes:

if a data access event is detected, extracting a characterization field of the data according to a characterization field extraction rule, and determining access identification information of the data according to the characterization field;

if a data warehousing event is detected, determining a characterization field of the data based on the characterization field extraction rule, and determining warehousing identification information of the data according to the characterization field;

and checking the number of data records and the data content according to the access identification information and the warehousing identification information, to obtain a data checking result.

Further, the characterization field extraction rule is determined based on key fields of the data; wherein the key field is used for recording the difference information of the data.

Further, determining the access identification information of the data according to the characterization field includes:

obtaining an MD5 value from the content of the characterization field through the MD5 hash algorithm;

taking the MD5 value as the access identification information of the data;

and determining the warehousing identification information of the data according to the characterization field includes:

taking the MD5 value as the warehousing identification information of the data.

Further, after taking the MD5 value as the access identification information of the data, the method further comprises:

creating a storage directory on an HDFS disk;

generating an access identification information record file in the storage directory, so as to record the access identification information one by one;

and after taking the MD5 value as the warehousing identification information of the data, the method further comprises:

creating a storage directory on an HDFS disk;

and generating a warehousing identification information record file in the storage directory, so as to record the warehousing identification information one by one.

Further, checking the number of data records and the data content according to the access identification information and the warehousing identification information to obtain a data checking result comprises:

reading the access identification information from the access identification information record file, and reading the warehousing identification information from the warehousing identification information record file;

checking the number and the content of the access identification information against those of the warehousing identification information;

and generating a collation report according to the checking result.

Further, the Spark in-memory computing engine is used for the reading and the checking.

Further, generating a collation report according to the checking result includes: exporting the access identification information and the warehousing identification information that failed the check into a check result table.

In a second aspect, an embodiment of the present application provides a device for checking data, including:

The data access identification information determining module is used for extracting the characterization field of the data according to the characterization field extraction rule if the data access event is detected, and determining the access identification information of the data according to the characterization field;

The data warehouse-in identification information determining module is used for determining a characterization field of the data based on the characterization field extraction rule and determining warehouse-in identification information of the data according to the characterization field if a data warehouse-in event is detected;

and the data checking result determining module is used for checking the number of data records and the data content according to the access identification information and the warehousing identification information to obtain a data checking result.

In a third aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for collation of data according to embodiments of the present application.

In a fourth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and capable of being executed by the processor, where the processor executes the computer program to implement a method for checking data according to the embodiment of the present application.

According to the technical scheme provided by the embodiment of the application, if a data access event is detected, the characterization field of the data is extracted according to the characterization field extraction rule, and the access identification information of the data is determined according to the characterization field; if a data warehousing event is detected, the characterization field of the data is determined based on the characterization field extraction rule, and the warehousing identification information of the data is determined according to the characterization field; and the number of data records and the data content are checked according to the access identification information and the warehousing identification information, to obtain a data checking result. In this way, the data warehousing process is monitored accurately.

Drawings

FIG. 1 is a flow chart of a method for collating data provided in accordance with a first embodiment of the present application;

FIG. 2 is a flow chart of a method for checking data according to a second embodiment of the present application;

FIG. 3 is a schematic structural diagram of a data collation apparatus according to a third embodiment of the present application;

FIG. 4 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present application.

Detailed Description

The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present application are shown in the drawings.

Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart depicts steps as a sequential process, many of the steps may be implemented in parallel, concurrently, or with other steps. Furthermore, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.

Example 1

Fig. 1 is a flowchart of a data collation method provided in an embodiment of the present application, where the embodiment is applicable to a case of monitoring a process of warehousing massive data, the method may be performed by a data collation apparatus provided in an embodiment of the present application, and the apparatus may be implemented by software and/or hardware, and may be integrated in an electronic device.

As shown in fig. 1, the method for checking the data includes:

S110, if a data access event is detected, extracting a characterization field of the data according to a characterization field extraction rule, and determining access identification information of the data according to the characterization field.

A data access event refers to bringing scattered data of various types from various external sources together into a unified big data center. Detecting a data access event may mean detecting data to be accessed, or receiving a data access instruction.

The characterization field is a field that represents the core features of the content of the data record to be accessed; the features of the data include the data content, the data format and the data source. The characterization field extraction rule is the rule for extracting the characterization field of the data, and it includes a selection standard for the key fields of the data and a method for generating the characterization field.

In this scheme, optionally, the characterization field extraction rule is determined based on key fields of the data; wherein the key field is used for recording the difference information of the data.

The key fields of the data contain the core of the data record content. A key field records the information that distinguishes one piece of data from another, so data with similar record content can be effectively told apart through the key fields. There is at least one key field; the specific number of key fields is not limited herein and is determined according to the actual situation. It can be appreciated that the more key fields there are, the more accurately the characterization field extracted by the characterization field extraction rule reflects the core content of the data.

Illustratively, in the case where the data to be accessed is data in a real-time news data set, the real-time news data generally includes the following fields: news headlines, summaries, news content, sources, journalists, release dates, audits, and final reviews. Different real-time news data records may differ in news headlines, summaries, news content, sources, reporters and release dates, and optionally all or part of the above fields are used as key fields of real-time news.
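As a minimal sketch of such an extraction rule, the following Python snippet selects a fixed, ordered list of key fields from one record. The field names, the `NEWS_KEY_FIELDS` tuple and the function `extract_characterization_field` are hypothetical names chosen for illustration; the application does not prescribe any particular data structure.

```python
from typing import Mapping, Sequence, Tuple

# Hypothetical key fields for the real-time news data set in the example above.
NEWS_KEY_FIELDS = ("headline", "summary", "content", "source", "reporter", "release_date")


def extract_characterization_field(
    record: Mapping[str, object],
    key_fields: Sequence[str] = NEWS_KEY_FIELDS,
) -> Tuple[str, ...]:
    """Return the key field values in a fixed order (missing fields become "")."""
    return tuple(str(record.get(name, "")) for name in key_fields)


record = {
    "headline": "Example headline",
    "summary": "Example summary",
    "content": "Example body text",
    "source": "xnet",
    "reporter": "Wang",
    "release_date": "2020.10.29",
    "audit": "editor A",  # not a key field, so it is ignored by the rule
}
print(extract_characterization_field(record))
```

Fields outside the key field list, such as the audit field here, do not influence the extracted characterization field.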

The access identification information of the data is determined according to the characterization field; it is the identification information determined for the data when the data is accessed into the big data center. Because the characterization field represents the core features of the content of the data record to be accessed, the access identification information derived from it can identify one piece of accessed data. The data accessed into the big data center can therefore be determined from the access identification information.

S120, if a data warehousing event is detected, determining a characterization field of the data based on the characterization field extraction rule, and determining warehousing identification information of the data according to the characterization field.

A data warehousing event refers to storing the data to be warehoused into a database of the big data center. Detecting a data warehousing event may mean detecting data to be warehoused, or receiving a data warehousing instruction.

The characterization field extraction rule used to determine the characterization field at data warehousing is the same as the rule used at data access, and the key fields used to determine the rule are also the same. That is, if the data to be accessed is real-time news and the news headline, summary, news content, source, reporter and release date are used as the key fields at access, then the key fields of the real-time news data at warehousing are likewise the news headline, summary, news content, source, reporter and release date.

The warehousing identification information of the data is determined according to the characterization field; it is the identification information determined for the data when the data is stored in a database of the big data center. Because the characterization field represents the core features of the content of the data record to be warehoused, the warehousing identification information derived from it can identify one piece of warehoused data. The data that has been warehoused can therefore be determined from the warehousing identification information.

On the basis of the above technical solutions, optionally, determining access identification information of the data according to the characterization field includes: obtaining an MD5 value from the content of the characterization field through an MD5 encryption algorithm; and taking the MD5 value as access identification information of the data.

The content of the characterization field is the specific content corresponding to the key fields. Continuing the above example, the key fields determined when the real-time news data is accessed are the news headline, summary, news content, source, reporter and release date. Suppose the contents of these fields for one piece of real-time news data are, respectively: "New record! 8874.4 meters", "No. 10.24 smoothly completed drilling, with a completed drilling depth of 8874.4 meters", "Breaking the Asian record again after three months! The oil field successfully completed drilling at No. 10.24, with a completed drilling depth of 8874.4 meters, setting the record for the deepest directional well in Asia", "xnet", "Wang somebody" and "2020.10.29". The content of the characterization field is then the field content corresponding to these key fields. In this way, the content of the characterization field is determined for all of the accessed data.

MD5 (MD5 Message-Digest Algorithm) is a widely used cryptographic hash function rather than an encryption algorithm in the strict sense. It produces a 128-bit (16-byte) hash value, which can be used to verify that information has been transmitted without change. This hash value is the MD5 value.

An MD5 value is obtained from the content of the characterization field through the MD5 algorithm, and the MD5 value is taken as the access identification information of the data. Specifically, the field content values of the data record are combined to obtain a combined result, the MD5 value of the combined result is computed, and that MD5 value is used as the access identification information of the data. Optionally, the MD5 value is appended to the end of each data record.
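The combining and hashing step can be sketched as follows using Python's standard `hashlib` module. The text only says that the field values are "combined"; joining them with a separator character (`FIELD_SEPARATOR`) and encoding as UTF-8 are assumptions made here, as is the function name `identification_info`.

```python
import hashlib
from typing import Iterable

# Assumption: a separator keeps field boundaries unambiguous, so that
# ("ab", "c") and ("a", "bc") do not combine into the same string.
FIELD_SEPARATOR = "\x1f"


def identification_info(field_values: Iterable[str]) -> str:
    """Combine the characterization field values and return their MD5 hex digest."""
    combined = FIELD_SEPARATOR.join(field_values)
    return hashlib.md5(combined.encode("utf-8")).hexdigest()


values = ("Example headline", "Example body text", "2020.10.29")
digest = identification_info(values)
print(digest)       # 32 hexadecimal characters, i.e. 128 bits
print(len(digest))  # 32
```

The same function would be used for both the access identification information and the warehousing identification information, so that identical field content yields identical identifiers.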

Determining the warehousing identification information of the data according to the characterization field, wherein the method comprises the following steps: obtaining an MD5 value from the content of the characterization field through an MD5 encryption algorithm; and taking the MD5 value as warehousing identification information of the data.

On the basis of the above technical solutions, optionally, after taking the MD5 value as the access identification information of the data, the method further includes: creating a storage directory on an HDFS disk; and generating an access identification information record file in the storage directory, so as to record the access identification information one by one.

HDFS (Hadoop Distributed File System) is a distributed file system designed to run on general-purpose hardware. It is highly fault tolerant and suitable for deployment on inexpensive machines, and it provides high-throughput data access, which makes it well suited to applications with very large data sets. The HDFS disk is a disk in the Hadoop distributed file system.

The storage directory is the path under which the identification information of the accessed data is stored. Optionally, a storage directory structured as reconciliation/access/data set name is created on the HDFS disk. After the access program of the big data center reads a batch of data of a certain data set, an access identification information record file is generated under the storage directory on the HDFS disk, and the access identification information of each piece of data is written into this file. The access identification information record file may be in text format.

On the basis of the above technical solutions, optionally, after taking the MD5 value as the warehousing identification information of the data, the method further includes: creating a storage directory on an HDFS disk; and generating a warehousing identification information record file in the storage directory, so as to record the warehousing identification information one by one.

A storage directory structured as reconciliation/warehousing/data set name is created on the HDFS disk. After the warehousing program of the big data center has successfully written a batch of data into the database, a warehousing identification information record file is generated under the storage directory on the HDFS disk, and the warehousing identification information of each piece of data is written into this file. The warehousing identification information record file may be in text format.
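The directory layout and the one-identifier-per-line record file can be illustrated with the sketch below. It uses the local file system through `pathlib` as a stand-in for HDFS, which is an assumption made to keep the example runnable; the function name `append_identifiers`, the batch file naming and the placeholder identifiers are likewise hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def append_identifiers(root: Path, stage: str, dataset: str,
                       batch_name: str, identifiers: Iterable[str]) -> Path:
    """Write one identifier per line under <root>/reconciliation/<stage>/<dataset>/."""
    directory = root / "reconciliation" / stage / dataset
    directory.mkdir(parents=True, exist_ok=True)  # create the storage directory
    record_file = directory / f"{batch_name}.txt"  # plain text record file
    with record_file.open("a", encoding="utf-8") as handle:
        for identifier in identifiers:
            handle.write(identifier + "\n")
    return record_file


# Local temporary directory standing in for the HDFS disk (assumption).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    access_file = append_identifiers(root, "access", "news", "batch-0001", ["id-1", "id-2"])
    stored_file = append_identifiers(root, "warehousing", "news", "batch-0001", ["id-1"])
    print(access_file.relative_to(root).as_posix())
    print(stored_file.read_text(encoding="utf-8").splitlines())
```

Keeping the access and warehousing record files in separate directories per data set means the two sides can later be read and compared independently.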

The advantage of this embodiment is that the records for accessed data and for warehoused data are stored separately, so that access and warehousing are each recorded clearly and accurately and are not confused. Recording the accessed data and the warehoused data one by one ensures that no record is missed.

S130, checking the number of data records and the data content according to the access identification information and the warehousing identification information, to obtain a data checking result.

According to the access identification information and the warehousing identification information, the number of data records and the data content are checked to obtain a record count check result and a data content check result. The record count check result reflects whether data has been lost, and the data content check result reflects whether the data content has changed. The two results are combined into the data checking result, which is used to monitor whether data was lost or its content changed between access and warehousing.

Example 2

Fig. 2 is a flowchart of a data collation method provided in the second embodiment of the present application. This embodiment is optimized on the basis of the above embodiment. Specifically, checking the number of data records and the data content according to the access identification information and the warehousing identification information to obtain a data checking result includes: reading the access identification information from the access identification information record file, and reading the warehousing identification information from the warehousing identification information record file; checking the number and the content of the access identification information against those of the warehousing identification information; and generating a collation report according to the checking result.

As shown in fig. 2, the method for checking the data includes:

S210, if a data access event is detected, extracting a characterization field of the data according to a characterization field extraction rule, and determining access identification information of the data according to the characterization field.

S220, if a data warehousing event is detected, determining a characterization field of the data based on the characterization field extraction rule, and determining warehousing identification information of the data according to the characterization field.

S230, reading access identification information from the access identification information record file, and reading warehouse identification information from the warehouse identification information record file.

The access identification information record file records the access identification information of the accessed data; the access identification information can be read from this file one by one or in batches. Likewise, the warehousing identification information can be read from the warehousing identification information record file one by one or in batches. Specifically, the access identification information is read from the access identification information record file in the storage directory structured as reconciliation/access/data set name on the HDFS disk, and the warehousing identification information is read from the warehousing identification information record file in the storage directory structured as reconciliation/warehousing/data set name on the HDFS disk.

The reading order of the access identification information and the warehouse entry identification information is not limited herein, and the access identification information and the warehouse entry identification information can be read in any order or simultaneously.

S240, checking the number and the content of the access identification information and the warehouse-in identification information.

The number of records and the content of the access identification information and the warehousing identification information are checked. For the record count, the number of access identification information entries and the number of warehousing identification information entries can be counted separately and then compared.

When the warehousing identification information and the access identification information of the same piece of data are determined, the key fields used to determine the characterization field extraction rule are the same. That is, the extraction rule is the same, so the characterization field extracted from the data is the same. Therefore, when the big data center works normally, the warehousing identification information and the access identification information determined for the same piece of data are consistent.
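The reasoning above can be demonstrated with a short sketch: applying one rule at both stages yields equal identifiers for an unchanged record and different identifiers once a key field is altered. The key field names, the separator and the function `identify` are illustrative assumptions, not details taken from the application.

```python
import hashlib

KEY_FIELDS = ("headline", "content", "release_date")  # hypothetical key fields


def identify(record: dict) -> str:
    """Apply one extraction rule and hash the result (same code at access and warehousing)."""
    combined = "\x1f".join(str(record.get(name, "")) for name in KEY_FIELDS)
    return hashlib.md5(combined.encode("utf-8")).hexdigest()


accessed = {"headline": "Example headline",
            "content": "Example body text",
            "release_date": "2020.10.29"}
stored_intact = dict(accessed)                               # warehoused unchanged
stored_damaged = dict(accessed, content="Example bo?y text")  # corrupted in transit

access_id = identify(accessed)
print(access_id == identify(stored_intact))   # True: identifiers agree
print(access_id == identify(stored_damaged))  # False: content change is detected
```

If the two stages used different key fields, identifiers would differ even for intact data, which is why the rule has to be shared.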

To check the data content, for example, one piece of the read warehousing identification information is selected as the target warehousing identification information, and the access identification information consistent with it is looked up among the read access identification information and taken as the target access identification information. The data content identified by the target warehousing identification information is then matched against the data content identified by the target access identification information. Finally, the record count check result and the data content check result are combined to determine the checking result for the data.

The checking result includes: the number of access identification information entries that are identical to warehousing identification information entries, the number of access identification information entries in the access identification information record file that failed to match, and the number of warehousing identification information entries in the warehousing identification information record file that failed to match.
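One way to compute these quantities is a multiset comparison of the two identifier lists, sketched below with `collections.Counter`. Treating the identifiers as a multiset (so that duplicate records are counted) is a design choice made here, and the function name `check_identifiers` and the result keys are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def check_identifiers(access_ids: Iterable[str],
                      warehousing_ids: Iterable[str]) -> Dict[str, object]:
    """Compare access and warehousing identifiers by count and by content."""
    access = Counter(access_ids)
    stored = Counter(warehousing_ids)
    matched = access & stored  # identifiers present on both sides
    return {
        "access_count": sum(access.values()),
        "warehousing_count": sum(stored.values()),
        "matched_count": sum(matched.values()),
        # accessed but never warehoused with the same content
        "unmatched_access": sorted((access - stored).elements()),
        # warehoused with content that matches no accessed record
        "unmatched_warehousing": sorted((stored - access).elements()),
    }


result = check_identifiers(["a", "b", "c"], ["a", "c", "d"])
print(result)
```

In this usage example both sides contain three identifiers, so a count-only comparison would pass, yet one identifier on each side fails to match. This is the situation described in the Background, where the record counts agree but the content has changed.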

S250, generating a proofreading report according to the proofreading result.

The collation report includes: the data set name, the number of accessed records, the number of warehoused records, the number of access identification information entries identical to warehousing identification information entries, the number of access identification information entries in the access identification information record file that failed to match, the number of warehousing identification information entries in the warehousing identification information record file that failed to match, and the like. The collation report as a whole reflects the state of data access and warehousing, which can be read off from it directly.
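The report fields listed above map naturally onto a small record type. The sketch below is one possible shape, assuming the counts have already been computed; the class name `CollationReport`, its attribute names, the pass/fail criterion and the text layout are illustrative assumptions rather than a format defined by the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollationReport:
    dataset_name: str
    access_count: int
    warehousing_count: int
    matched_count: int
    unmatched_access_count: int
    unmatched_warehousing_count: int

    def passed(self) -> bool:
        """Assumed criterion: equal counts and no unmatched identifiers on either side."""
        return (self.access_count == self.warehousing_count
                and self.unmatched_access_count == 0
                and self.unmatched_warehousing_count == 0)

    def render(self) -> str:
        """Render the report as plain text, one field per line."""
        lines = [
            f"data set name: {self.dataset_name}",
            f"accessed records: {self.access_count}",
            f"warehoused records: {self.warehousing_count}",
            f"matching identifiers: {self.matched_count}",
            f"unmatched access identifiers: {self.unmatched_access_count}",
            f"unmatched warehousing identifiers: {self.unmatched_warehousing_count}",
            f"result: {'PASS' if self.passed() else 'FAIL'}",
        ]
        return "\n".join(lines)


report = CollationReport("news", 3, 3, 2, 1, 1)
print(report.render())
```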

In this scheme, optionally, the Spark in-memory computing engine is used for the reading and the checking.

Spark is a fast, general-purpose computing engine designed for large-scale data processing. It supports iterative operations on distributed data sets, complements Hadoop, and can run in parallel on top of HDFS. HDFS is one of Spark's data sources and the one most tightly bound to it. Using the Spark in-memory computing engine for the reading and the checking can improve the efficiency of both.

On the basis of the above technical solutions, optionally, generating a collation report according to the checking result includes: exporting the access identification information and the warehousing identification information that failed the check into a check result table.

The failed access identification information is the access identification information in the access identification information record file that failed to match. The failed warehousing identification information is the warehousing identification information in the warehousing identification information record file that failed to match. The check result table contains the identification information of data whose content changed and the identification information of data that failed to be warehoused. With this arrangement, data that changed or failed to be warehoused can be located quickly from the check result table, and the corresponding data can be retrieved by its identification information for manual checking.
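A minimal sketch of such an export follows, writing the unmatched identifiers to a CSV table with Python's standard `csv` module. The column names, the `origin` labels and the function name `export_check_result_table` are assumptions for illustration; the application does not specify the table format or where it is stored.

```python
import csv
import io
from typing import Iterable, TextIO


def export_check_result_table(unmatched_access: Iterable[str],
                              unmatched_warehousing: Iterable[str],
                              out: TextIO) -> None:
    """Write every identifier that failed the check, tagged with the side it came from."""
    writer = csv.writer(out)
    writer.writerow(["identification_info", "origin"])
    for identifier in unmatched_access:
        writer.writerow([identifier, "access"])
    for identifier in unmatched_warehousing:
        writer.writerow([identifier, "warehousing"])


buffer = io.StringIO()  # stands in for the real table or file
export_check_result_table(["b"], ["d"], buffer)
print(buffer.getvalue())
```

Each row can then serve as a lookup key for retrieving the affected record for manual checking.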

According to the technical scheme provided by the embodiment of the application, if a data access event is detected, the characterization field of the data is extracted according to the characterization field extraction rule, and the access identification information of the data is determined according to the characterization field; if a data warehousing event is detected, the characterization field of the data is determined based on the characterization field extraction rule, and the warehousing identification information of the data is determined according to the characterization field; the access identification information is read from the access identification information record file, and the warehousing identification information is read from the warehousing identification information record file; the number and the content of the access identification information are checked against those of the warehousing identification information; and a collation report is generated according to the checking result. By checking the number of data records and the data content according to the access identification information and the warehousing identification information, the data warehousing process is monitored accurately. Generating a collation report from the checking result reflects the state of data access and warehousing more intuitively and improves the readability of the checking result.

Example 3

Fig. 3 is a schematic structural diagram of a data collating device according to a third embodiment of the present application. As shown in fig. 3, the data collation apparatus includes: the data access identification information determining module 310, the data warehousing identification information determining module 320 and the data checking result determining module 330.

The data access identification information determining module 310 is configured to extract a characterization field of data according to a characterization field extraction rule if a data access event is detected, and determine access identification information of the data according to the characterization field;

The data-warehousing identification information determining module 320 is configured to determine a characterization field of data based on the characterization field extraction rule and determine warehousing identification information of the data according to the characterization field if a data-warehousing event is detected;

and the data checking result determining module 330 is configured to check the data number and the data content according to the access identification information and the warehouse-in identification information, so as to obtain a data checking result.

Optionally, the characterization field extraction rule is determined based on key fields of the data; wherein the key field is used for recording the difference information of the data.

Optionally, the data access identification information determining module 310 includes: the characterization field extraction submodule and the data access identification information determination submodule. The characterization field extraction sub-module is used for extracting the characterization field of the data according to the characterization field extraction rule if the data access event is detected. And the data access identification information determining submodule is used for determining the access identification information of the data according to the characterization field.

The data access identification information determining submodule includes: a characterization field content encryption unit, configured to obtain an MD5 value from the content of the characterization field through the MD5 algorithm (strictly a message-digest hash rather than encryption);

and a data access identification information determining unit, configured to take the MD5 value as the access identification information of the data.
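
A minimal sketch of this step, assuming the characterization field is available as a text string and is encoded as UTF-8 before hashing (the encoding and the function name are assumptions of the sketch, not stated in the application):

```python
import hashlib


def identification_info(characterization_field: str) -> str:
    """Return the MD5 digest of the characterization field as a 32-character hex string."""
    # The encoding must be identical on the access side and the warehousing side,
    # otherwise the same content would produce different identification information.
    return hashlib.md5(characterization_field.encode("utf-8")).hexdigest()


access_id = identification_info("42|2020-12-01T08:00:00|login")
```

The fixed-length digest stands in for the record during collation, so only the digests need to be recorded and compared.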

Optionally, the data-in identification information determining module 320 includes: the characterization field extraction submodule and the data warehouse identification information determination submodule. And the characterization field extraction sub-module is used for determining the characterization field of the data based on the characterization field extraction rule if the data warehousing event is detected. And the data warehouse-in identification information determining submodule is used for determining warehouse-in identification information of the data according to the characterization field.

The data warehousing identification information determining submodule includes: a characterization field content encryption unit, configured to obtain an MD5 value from the content of the characterization field through the MD5 algorithm;

and a data warehousing identification information determining unit, configured to take the MD5 value as the warehousing identification information of the data.
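
Because both sides apply the same extraction rule and the same MD5 step, a record whose content is unchanged produces identical access and warehousing identification information, while a record damaged in transit does not. The sketch below illustrates this under assumed field names and a hypothetical record_id helper:

```python
import hashlib
from typing import Mapping, Sequence

RULE: Sequence[str] = ("user_id", "payload")  # hypothetical key fields


def record_id(record: Mapping[str, object], rule: Sequence[str] = RULE) -> str:
    """Apply the extraction rule, then hash the characterization field with MD5."""
    text = "\x1f".join(str(record.get(name, "")) for name in rule)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


accessed = {"user_id": 7, "payload": "order-123"}
stored_intact = dict(accessed)                            # content unchanged on warehousing
stored_damaged = {"user_id": 7, "payload": "order-12"}    # payload truncated in processing

intact_matches = record_id(accessed) == record_id(stored_intact)
damaged_matches = record_id(accessed) == record_id(stored_damaged)
```

Here intact_matches is true and damaged_matches is false, which is what allows content changes to be detected even when the record counts agree.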

Optionally, the device further includes: a first storage directory creation module, configured to create a storage directory on the HDFS disk after the MD5 value is taken as the access identification information of the data;

an access identification information record file generation module, configured to generate an access identification information record file in the storage directory, so as to record the access identification information one by one;

a second storage directory creation module, configured to create a storage directory on the HDFS disk after the MD5 value is taken as the warehousing identification information of the data;

and a warehousing identification information record file generation module, configured to generate a warehousing identification information record file in the storage directory, so as to record the warehousing identification information one by one.
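
The record files can be pictured as plain text files holding one piece of identification information per line. The sketch below uses a local temporary directory as a stand-in for the HDFS storage directory; the directory layout, file names, and function names are assumptions of the sketch, and an actual deployment would write through an HDFS client instead.

```python
import tempfile
from pathlib import Path
from typing import List


def append_identification_info(storage_dir: Path, file_name: str, ident: str) -> Path:
    """Create the storage directory if needed and append one identifier as its own line."""
    storage_dir.mkdir(parents=True, exist_ok=True)
    record_file = storage_dir / file_name
    with record_file.open("a", encoding="utf-8") as handle:
        handle.write(ident + "\n")
    return record_file


def read_identification_info(record_file: Path) -> List[str]:
    """Read back every non-empty line of a record file."""
    lines = record_file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Local stand-in for a per-data-set storage directory (hypothetical layout).
storage_dir = Path(tempfile.mkdtemp()) / "collation" / "dataset_a"

for ident in ("aaa", "bbb"):
    access_file = append_identification_info(storage_dir, "access_ids.txt", ident)
warehousing_file = append_identification_info(storage_dir, "warehousing_ids.txt", "aaa")
```

Keeping the access and warehousing identifiers in separate files lets the later collation step read each side independently.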

Optionally, the data collation result determining module 330 includes: an identification information reading sub-module, configured to read the access identification information from the access identification information record file and to read the warehousing identification information from the warehousing identification information record file;

an identification information checking sub-module, configured to check the number and the content of the access identification information and the warehousing identification information;

and a collation report generation module, configured to generate a collation report according to the collation result.
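
One possible way to sketch the number and content check is a multiset comparison of the two identifier lists. The report fields below mirror the items the claims list for the collation report; the class name, function name, and the use of collections.Counter are choices made for this sketch rather than the implementation described in the application.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CollationReport:
    data_set_name: str
    access_count: int                 # number of accessed records
    warehousing_count: int            # number of warehoused records
    matched_count: int                # identifiers present on both sides
    unmatched_access: List[str]       # accessed but never warehoused
    unmatched_warehousing: List[str]  # warehoused without a matching access identifier

    @property
    def passed(self) -> bool:
        """Collation passes only if the counts agree and nothing is left unmatched."""
        return (
            self.access_count == self.warehousing_count
            and not self.unmatched_access
            and not self.unmatched_warehousing
        )


def collate(name: str, access_ids: Iterable[str], warehousing_ids: Iterable[str]) -> CollationReport:
    access = Counter(access_ids)
    warehousing = Counter(warehousing_ids)
    matched = access & warehousing  # multiset intersection keeps duplicate records honest
    return CollationReport(
        data_set_name=name,
        access_count=sum(access.values()),
        warehousing_count=sum(warehousing.values()),
        matched_count=sum(matched.values()),
        unmatched_access=sorted((access - warehousing).elements()),
        unmatched_warehousing=sorted((warehousing - access).elements()),
    )


report = collate("dataset_a", ["a1", "b2", "c3", "c3"], ["a1", "c3", "d4"])
```

In this example four records were accessed and three warehoused; two identifiers match, "b2" and one copy of "c3" were never warehoused, and "d4" has no access counterpart, so the collation fails.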

Optionally, the reading and checking are performed using the Spark in-memory computing engine.

Optionally, the collation report generation module is specifically configured to export the access identification information and the warehousing identification information that failed collation into a collation result table.
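
As a sketch of the export step, the identifiers that failed collation can be written as rows of a simple table. CSV is used here only as an assumed table format, and the column names and function name are hypothetical.

```python
import csv
import io
from typing import Iterable, TextIO


def export_failures(
    unmatched_access: Iterable[str],
    unmatched_warehousing: Iterable[str],
    out: TextIO,
) -> None:
    """Write one row per failed identifier, tagged with the side it came from."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["identification_info", "source"])
    for ident in unmatched_access:
        writer.writerow([ident, "access"])
    for ident in unmatched_warehousing:
        writer.writerow([ident, "warehousing"])


buffer = io.StringIO()
export_failures(["b2"], ["d4"], buffer)
table_text = buffer.getvalue()
```

Tagging each row with its source makes it possible to tell records lost before warehousing from records that appeared or changed during warehousing.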

The device can execute the method provided by the embodiments of the application, and has the functional modules and beneficial effects corresponding to the executed method.

Example IV

A fourth embodiment of the present application also provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are for performing a method of collation of data, the method comprising:

A storage medium refers to any of various types of memory devices or storage devices. The term "storage medium" is intended to include: installation media, such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory, such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory, such as flash memory, magnetic media (e.g., a hard disk), or optical storage; registers or other similar types of memory elements, etc. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the computer system in which the program is executed, or may be located in a different, second computer system connected to the first computer system through a network (such as the internet). The second computer system may provide program instructions to the first computer for execution. The term "storage medium" may include two or more storage media that reside in different locations (e.g., in different computer systems connected by a network). The storage medium may store program instructions (e.g., embodied as a computer program) executable by one or more processors.

Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present application is not limited to the above-mentioned data collation operation, and may also perform the relevant operations in the data collation method provided in any embodiment of the present application.

Example V

The fifth embodiment of the present application provides an electronic device, in which the data collating device provided in the present application may be integrated, where the electronic device may be configured in a system, or may be a device that performs some or all of the functions in the system. Fig. 4 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present application. As shown in fig. 4, the present embodiment provides an electronic device 400, which includes: one or more processors 420; a storage device 410, configured to store one or more programs that, when executed by the one or more processors 420, cause the one or more processors 420 to implement a method for collating data provided by an embodiment of the present application, the method comprising:

Of course, those skilled in the art will appreciate that the processor 420 may also implement the technical solution of the data collation method provided by any of the embodiments of the present application.

The electronic device 400 shown in fig. 4 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present application.

As shown in fig. 4, the electronic device 400 includes a processor 420, a storage device 410, an input device 430, and an output device 440; the number of processors 420 in the electronic device may be one or more, one processor 420 being taken as an example in fig. 4; the processor 420, the storage device 410, the input device 430, and the output device 440 in the electronic device may be connected by a bus or other means, as exemplified by connection via a bus 450 in fig. 4.

The storage device 410 is a computer readable storage medium, and may be used to store a software program, a computer executable program, and program instructions corresponding to a method for checking data in an embodiment of the present application.

The storage device 410 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created according to the use of the terminal, etc. In addition, the storage device 410 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the storage device 410 may further include memory located remotely from the processor 420, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 may be used to receive input numeric, character, or voice information, and to generate key signal inputs related to user settings and function control of the electronic device. The output device 440 may include devices such as a display screen and a speaker.

The electronic device provided by the embodiment of the application can achieve accurate monitoring of the data warehousing process.

The data checking device, the medium and the electronic equipment provided in the above embodiments can execute the data checking method provided in any embodiment of the present application, and have the corresponding functional modules and beneficial effects of executing the method. Technical details not described in detail in the above embodiments may be referred to the method for checking data provided in any embodiment of the present application.

Note that the above is only a preferred embodiment of the present application and the technical principle applied. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, while the application has been described in connection with the above embodiments, the application is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the application, which is set forth in the following claims.

Claims

1. A method of collation of data, comprising:

checking the number of data records and the data content according to the access identification information and the warehousing identification information to obtain a data collation result;

wherein checking the number of data records and the data content according to the access identification information and the warehousing identification information to obtain the data collation result comprises:

Reading the access identification information from an access identification information record file, and reading the warehousing identification information from a warehousing identification information record file;

generating a proofreading report according to the proofreading result;

Wherein the collation report comprises: a data set name, an access count, a warehousing count, the number of pieces of access identification information that are the same as warehousing identification information, the number of pieces of access identification information in the access identification information record file that failed to match, and the number of pieces of warehousing identification information in the warehousing identification information record file that failed to match;

The characterization field is used as a core characteristic field for representing the content of the data record to be accessed, so that the access identification information identifies one piece of access data;

Wherein checking the number and the content of the access identification information and the warehousing identification information comprises:

Selecting one piece of the read warehousing identification information as target warehousing identification information, and determining access identification information consistent with the target warehousing identification information from the read access identification information according to the target warehousing identification information as target access identification information;

matching the data content identified by the target warehouse-in identification information with the data content identified by the target access identification information;

And integrating the data piece number and data content proofreading results to determine the final proofreading result of the piece of data.

2. The method of claim 1, wherein the characterization field extraction rule is determined based on key fields of data; wherein the key field is used for recording the difference information of the data.

3. The method of claim 1, wherein

Determining access identification information of the data according to the characterization field comprises the following steps:

Taking the MD5 value as access identification information of data;

and taking the MD5 value as warehousing identification information of the data.

4. The method of claim 3, wherein

After the MD5 value is taken as the access identification information of the data, the method further comprises:

creating a storage directory on an HDFS disk;

5. The method of claim 1, wherein the reading and checking are performed using a Spark in-memory computing engine.

6. The method of claim 1, wherein generating a collation report based on collation results comprises:

exporting the access identification information and the warehousing identification information that failed collation into a collation result table.

7. A data collation apparatus, comprising:

The data warehousing identification information determining module is used for determining a characterization field of the data based on the characterization field extraction rule and determining warehousing identification information of the data according to the characterization field if a data warehousing event is detected;

the data collation result determining module is used for checking the number of data records and the data content according to the access identification information and the warehousing identification information to obtain a data collation result;

the data collation result determining module includes:

the identification information reading sub-module is used for reading the access identification information from the access identification information record file and reading the warehousing identification information from the warehousing identification information record file;

the collation report generation module is used for generating a collation report according to the collation result;

The identification information checking sub-module is specifically configured to select one piece of the read warehousing identification information as target warehousing identification information, and determine, according to the target warehousing identification information, access identification information consistent with the target warehousing identification information from the read access identification information as target access identification information; matching the data content identified by the target warehouse-in identification information with the data content identified by the target access identification information; and integrating the data piece number and data content proofreading results to determine the final proofreading result of the piece of data.

8. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements a method of collating data according to any of claims 1-6.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a method of collating data according to any one of claims 1-6 when executing the computer program.