CN116701381A - Multistage verification system and method for distributed data acquisition and warehousing - Google Patents


Publication number
CN116701381A
CN116701381A (application CN202310967006.5A; granted publication CN116701381B)
Authority
CN
China
Prior art keywords
data
check
verification
checked
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310967006.5A
Other languages
Chinese (zh)
Other versions
CN116701381B (en)
Inventor
姚含
方红渊
崔冬祥
李鸿羽
黄少意
王惠云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Mochou Intelligent Information Technology Co ltd
Original Assignee
Nanjing Mochou Intelligent Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Mochou Intelligent Information Technology Co ltd filed Critical Nanjing Mochou Intelligent Information Technology Co ltd
Priority to CN202310967006.5A priority Critical patent/CN116701381B/en
Publication of CN116701381A publication Critical patent/CN116701381A/en
Application granted granted Critical
Publication of CN116701381B publication Critical patent/CN116701381B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data verification, and particularly relates to a multi-stage verification system and method for distributed data acquisition and warehousing. By applying multi-stage verification to the input data, the invention can clean and normalize the data quickly and flexibly. During verification, the data is checked stage by stage according to the priority of each verification condition, which guarantees the validity of the data without introducing verification errors, thereby effectively reducing the difficulty of subsequent data processing. In addition, flexible configuration information is added, which improves the applicability of the framework.

Description

Multistage verification system and method for distributed data acquisition and warehousing
Technical Field
The invention belongs to the technical field of data verification, and particularly relates to a multi-stage verification system and a verification method for distributed data acquisition and warehousing.
Background
With the advent of the information age, more and more online data need to be integrated and transmitted. Owing to today's explosion of information, data sources are numerous and data repetition is high; in particular, when multiple data sources exist, the data must be processed uniformly within a short time while maintaining high processing efficiency. Meanwhile, duplicate data occupies extra storage space, which not only slows data transmission but also degrades data retrieval.
Existing framework structures cannot meet these requirements: they cannot clean and normalize data quickly and flexibly, redundant data keeps accumulating, and the timeliness of the data cannot be guaranteed, so a large amount of stale data builds up. To solve this problem, the present scheme provides a multi-stage verification system that uses three-stage verification to reduce the difficulty of subsequent data processing, and adds flexible configuration information to solve the problem of framework applicability.
Disclosure of Invention
The invention aims to provide a multi-stage verification system and a verification method for distributed data acquisition and warehousing, which can reduce the difficulty of subsequent data processing by utilizing three-stage verification and solve the problem of the applicability of a framework by adding flexible configuration information.
The technical scheme adopted by the invention is as follows:
a multi-stage verification method for distributed data acquisition and warehousing comprises the following steps:
acquiring input data, and carrying out disassembly processing on the input data to obtain a plurality of message bodies;
adding identification information into the message body to obtain data to be verified, wherein the identification information comprises date, source, destination, size, section, name, row ID and file name;
inputting the data to be checked into a multi-level check model, and judging whether the data to be checked passes the check;
if yes, uploading the data to be checked to an online database through a database operation engine;
if not, word segmentation is carried out on the data to be verified to obtain data to be optimized, and the data to be optimized is synchronously uploaded to an offline database;
inputting the data to be optimized into a data conversion model to obtain unique data, carrying out cluster calculation on the unique data, carrying out cluster combination on calculation results, and uploading combined results to an online database.
In a preferred embodiment, the input data is disassembled in rows.
In a preferred embodiment, the step of inputting the data to be verified into a multi-level verification model to determine whether the data to be verified passes verification includes:
acquiring data to be verified;
invoking a verification condition from the verification model, wherein the verification condition comprises content repetition verification, content deletion verification and content query verification;
and sequentially inputting the data to be checked into the check conditions; data that meets the check conditions is determined to pass the check and is synchronously uploaded to an online database, while data that does not meet the check conditions is determined to fail the check and is synchronously uploaded to an offline database.
In a preferred scheme, the priority of the content duplicate check is higher than the priority of the content deletion check, and the priority of the content deletion check is higher than the priority of the content query check;
and the data to be checked is checked step by step according to the priorities of the content repeated check, the content missing check and the content query check, and when the data to be checked passes the check condition with high priority, the check condition with low priority is not executed.
In a preferred embodiment, the step of repeatedly checking the content includes:
acquiring data to be checked, uploading the data to an online database, and judging whether repeated data consistent with the data to be checked exist in the online database;
if so, the data to be checked is retained, and the repeated data is screened out from the online database;
if not, obtaining the structural information of the data to be checked, calibrating it as primary check data, and judging whether a newly added field exists in the primary check data;
if the newly added field exists in the primary check data, inquiring the total data reporting time according to the structure change time, and judging whether repeated reporting records exist or not;
if yes, the repeated data before the time node is cleaned, the first-level check data is reserved, and is summarized into a first-level data set, otherwise, the first-level check data is directly summarized into the first-level data set;
if no new field exists in the primary check data, acquiring date information of the primary check data, setting a date approval field based on a primary data set, and judging whether only the approval field exists in the primary check data and is inconsistent with the date information;
if yes, judging that the data to be checked passes the content repeated check, and summarizing the data to be checked into a primary data set;
if not, judging that the data to be checked does not pass the content repetition check, calibrating it as second-level check data, and summarizing it into a second-level data set.
In a preferred embodiment, the step of performing the content deletion check includes:
acquiring secondary check data and corresponding missing fields thereof from the secondary data set;
acquiring a key field and an identification field corresponding to the secondary check data from the online database, and comparing the key field and the identification field with the secondary check data;
if the missing field in the second-level check data is a key field, judging that the data does not pass the content missing check, calibrating it as third-level check data, and summarizing it into a third-level data set;
if the missing field in the secondary check data is an identification field or a non-key field, judging that the content missing check is passed, and supplementing identification information and non-key field information into the secondary check data.
In a preferred embodiment, the step of performing the content query verification includes:
acquiring three-level check data from the three-level data set;
counting the number of missing key fields in the three-level check data, and calibrating the number of missing key fields as parameters to be compared;
acquiring an evaluation threshold value and comparing the evaluation threshold value with the parameter to be compared;
if the parameter to be compared is greater than or equal to the evaluation threshold, the third-level check data does not pass the content query check, and the third-level check data is uploaded to an offline database;
and if the parameter to be compared is smaller than the evaluation threshold, indicating that the three-level check data passes the content query check, and supplementing key fields into the three-level check data.
In a preferred embodiment, the step of inputting the data to be optimized into a data conversion model to obtain unique data includes:
obtaining data to be optimized from the offline database;
calling a conversion algorithm from the data conversion model, inputting the data to be optimized into the conversion algorithm, and calibrating a conversion result into unique data;
wherein the conversion algorithm is a hash algorithm.
The invention also provides a multi-stage verification system for distributed data acquisition and storage, which is applied to the multi-stage verification method for distributed data acquisition and storage, and comprises the following steps:
the acquisition module is used for acquiring input data and carrying out disassembly processing on the input data to obtain a plurality of message bodies;
the identification module is used for adding identification information into the message body to obtain data to be verified, wherein the identification information comprises a date, a source, a destination, a size, a section, a name, a row ID and a file name;
the verification module is used for inputting the data to be verified into a multi-level verification model and judging whether the data to be verified passes the verification;
if yes, uploading the data to be checked to an online database through a database operation engine;
if not, word segmentation is carried out on the data to be verified to obtain data to be optimized, and the data to be optimized is synchronously uploaded to an offline database;
the data conversion module is used for inputting the data to be optimized into a data conversion model to obtain unique data, carrying out cluster calculation on the unique data, carrying out cluster combination on calculation results, and uploading combined results to an online database.
And a multi-stage verification terminal for distributed data acquisition and warehousing, comprising:
at least one processor;
and a memory communicatively coupled to the at least one processor;
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor, so that the at least one processor can execute the multi-stage verification method for distributed data acquisition and warehousing.
The invention has the technical effects that:
the invention utilizes the multistage verification of the logarithmic input data, can clean and normalize the data rapidly and flexibly, and can follow the priority of each verification condition to carry out the stage-by-stage verification in the verification process, ensure the validity of the data, and simultaneously can not cause verification errors, thereby being capable of effectively reducing the difficulty of subsequent data processing, and in addition, flexible configuration information is added, thereby solving the application degree of the framework.
Drawings
FIG. 1 is a flow chart of a method provided by the present invention;
fig. 2 is a block diagram of a system provided by the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one preferred embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Referring to fig. 1 and 2, the present invention provides a multi-stage verification method for distributed data acquisition and storage, which includes:
s1, acquiring input data, and carrying out disassembly processing on the input data to obtain a plurality of message bodies;
s2, adding identification information into the message body to obtain data to be checked, wherein the identification information comprises date, source, destination, size, section, name, row ID and file name;
s3, inputting the data to be checked into a multi-stage check model, and judging whether the data to be checked passes the check;
if yes, uploading the data to be checked to an online database through a database operation engine;
if not, word segmentation is carried out on the data to be checked to obtain data to be optimized, and the data to be optimized is synchronously uploaded to an offline database;
s4, inputting the data to be optimized into a data conversion model to obtain unique data, performing cluster calculation on the unique data, performing cluster combination on calculation results, and uploading combined results to an online database.
As described in steps S1 to S4 above, because of today's explosion of information, data sources are numerous and data repetition is high; in particular, when multiple data sources exist, the data must be processed uniformly within a short time while keeping processing efficient. In this embodiment, an acquisition server disassembles the input data by rows and packages each row into a message body; relevant information such as the source, destination, size, section, name, row ID, and file name of the data is added to the message header, and the message is sent to a message queue. A multi-stage verification model then performs verification on the input data. After the database operation engine has collected a certain amount of verified data, that data is committed to the online database; the remaining data is reprocessed by word segmentation and the like and committed to an offline database, where it is calibrated as data to be optimized. The data to be optimized is then fed to a data conversion model to obtain unique data, which is transmitted to a Spark cluster for computation; the computed results are clustered and combined, and the combined results are uploaded to the online database.
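The disassembly and packaging of steps S1-S2 can be sketched in Python as follows. This is a minimal illustration; the `Message` class, the `disassemble` function, and the way each header field is derived are assumptions, since the patent does not specify an implementation, and the section field is left as a placeholder.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Message:
    header: dict   # identification information added per row
    body: str      # one row of the input data

def disassemble(raw: str, source: str, destination: str, file_name: str) -> list[Message]:
    """Split the input data by rows and wrap each row in a message body
    whose header carries the identification fields listed in the text."""
    messages = []
    for row_id, line in enumerate(raw.splitlines()):
        header = {
            "date": date.today().isoformat(),
            "source": source,
            "destination": destination,
            "size": len(line),
            "section": 0,  # placeholder: the patent does not define how sections are derived
            "name": file_name.rsplit(".", 1)[0],
            "row_id": row_id,
            "file_name": file_name,
        }
        messages.append(Message(header=header, body=line))
    return messages
```

In a real deployment each `Message` would then be serialized and pushed to the message queue the embodiment mentions.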
In a preferred embodiment, the step of inputting the data to be verified into the multi-level verification model to determine whether the data to be verified passes the verification includes:
s301, acquiring data to be verified;
s302, calling a verification condition from a verification model, wherein the verification condition comprises content repetition verification, content deletion verification and content query verification;
S303, sequentially inputting the data to be checked into the check conditions; data that meets the check conditions is determined to pass the check and is synchronously uploaded to the online database, while data that does not meet the check conditions is determined to fail the check and is synchronously uploaded to the offline database.
As described in steps S301-S303 above, once the data to be verified is determined, it is input directly into the multi-stage verification model, in which several verification conditions are set: content repetition verification, content missing verification, and content query verification. Data to be verified that passes verification is uploaded to the online database; data that fails is uploaded to the offline database, where it is subsequently optimized and converted until it satisfies the conditions for uploading to the online database.
In a preferred embodiment, the priority of the content duplication check is higher than the priority of the content deletion check, and the priority of the content deletion check is higher than the priority of the content query check;
and when the data to be checked passes the check condition with high priority, the check condition with low priority is not executed.
In this embodiment, content repetition verification has the highest priority among the verification conditions, followed by content missing verification and finally content query verification. When verifying the data to be checked, the verification conditions are executed from high priority to low; once the data passes a higher-priority condition, the subsequent verification steps are no longer executed. This guarantees both the accuracy of data verification and the smoothness of verifying the input data.
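The priority-ordered, short-circuiting execution described above can be sketched as follows. This is illustrative only: the check functions are stand-ins for the patent's three verification conditions, and the return convention is an assumption.

```python
def run_checks(record, checks):
    """checks: list of (name, check_fn) pairs in descending priority order;
    check_fn returns True when the record passes that condition.
    Once a higher-priority condition passes, lower-priority ones are skipped."""
    for name, check in checks:
        if check(record):
            return True, name   # passed: remaining (lower-priority) checks not executed
    return False, None          # failed every condition: routed to the offline database
```

A record that fails one condition falls through to the next-priority condition, matching the staged routing (primary, secondary, tertiary check data) described in the embodiment.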
In a preferred embodiment, the step of performing the content repetition check includes:
stp1, acquiring data to be checked, uploading the data to an online database, and judging whether repeated data consistent with the data to be checked exist in the online database;
Stp2, if so, the data to be checked is retained, and the repeated data is screened out from the online database;
stp3, if not, obtaining the structural information of the data to be checked, calibrating the structural information into first-level check data, and judging whether a new field exists in the first-level check data;
stp4, if the newly added field exists in the primary check data, inquiring the total data reporting time according to the structural change time, and judging whether repeated reporting records exist or not;
stp5, if yes, cleaning the repeated data before the time node, reserving first-level check data, summarizing the first-level check data into a first-level data set, and otherwise, summarizing the first-level check data into the first-level data set directly;
stp6, if no new field exists in the primary check data, acquiring date information of the primary check data, setting a date approval field based on the primary data set, and judging whether only the approval field exists in the primary check data and is inconsistent with the date information;
stp7, if yes, judging that the data to be checked passes the content repeated check, and summarizing the data to be checked to a first-level data set;
Stp8, if not, judging that the data to be checked does not pass the content repetition check, calibrating it as secondary check data, and summarizing it into a secondary data set.
As described in steps Stp1-Stp8 above, when performing content repetition verification, it is first determined whether the online database contains repeated data consistent with the data to be checked. If so, the duplicate with the earlier date is screened out of the online database and the duplicate with the later date is retained. The structure information of the data to be checked is then verified: whether a newly added field exists in the primary check data is determined, the total data reporting time is queried according to the structure change time, and whether a repeated reporting record exists is determined. When there is no repeated reporting record, the date information of the primary check data is approved, which avoids data being reported repeatedly merely because its date information is inconsistent. When the date information and structure information are inconsistent, the data is judged not to have passed the content repetition verification, is marked as secondary check data, and continues to be checked under the verification condition of the next priority.
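A heavily simplified sketch of the duplicate-handling core of this check follows, assuming an in-memory list stands in for the online database and reducing the newly-added-field and reporting-record logic to a date comparison; the real check inspects structure information as well.

```python
def repetition_check(record: dict, online_db: list[dict]) -> bool:
    """Return True if the record passes the content repetition check.
    Duplicates with the earlier date are screened out; the later copy is kept."""
    core = {k: v for k, v in record.items() if k != "date"}
    for existing in list(online_db):
        if {k: v for k, v in existing.items() if k != "date"} == core:
            if existing["date"] <= record["date"]:
                # existing copy is older: screen it out of the online database
                online_db.remove(existing)
            else:
                # incoming copy is older: fails, routed to secondary check data
                return False
    online_db.append(record)
    return True
```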
In a preferred embodiment, the step of performing the content deletion check includes:
stp9, acquiring secondary check data and corresponding missing fields thereof from the secondary data set;
stp10, acquiring a key field and an identification field corresponding to the secondary check data from an online database, and comparing the key field and the identification field with the secondary check data;
stp11, if the missing field in the second-level check data is a key field, judging that the missing field does not pass the content missing check, and calibrating the missing field as third-level check data, and summarizing the missing field as a third-level data set;
stp12, if the missing field in the second-level check data is an identification field or a non-key field, judging that the second-level check data passes the content missing check, and supplementing identification information and non-key field information into the second-level check data.
As described in steps Stp9-Stp12 above, when performing the content missing verification, it must be determined whether a missing field in the secondary check data is a key field. Secondary check data whose missing field is a key field is judged not to pass the content missing verification and is marked as tertiary check data. For secondary check data whose missing field is a non-key field or an identification field, the missing part is supplemented, and the supplemented secondary check data is judged to have passed the verification.
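This key/non-key decision can be sketched as follows, under the assumption that each field's role is known from a schema and that a reference copy obtained from the online database supplies the values used to supplement identification and non-key fields.

```python
def missing_check(record: dict, schema: dict, reference: dict) -> bool:
    """schema maps field name -> 'key' | 'id' | 'other';
    reference holds the corresponding values from the online database.
    A missing key field fails the check; other missing fields are filled in."""
    for name, role in schema.items():
        if record.get(name) is None:
            if role == "key":
                return False              # becomes third-level check data
            record[name] = reference[name]  # supplement id / non-key field
    return True
```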
In a preferred embodiment, the step of performing the content query verification includes:
stp13, acquiring three-level check data from the three-level data set;
stp14, counting the missing quantity of key fields in the three-level verification data, and calibrating the missing quantity as a parameter to be compared;
stp15, acquiring an evaluation threshold value, and comparing the evaluation threshold value with parameters to be compared;
stp16, if the parameter to be compared is greater than or equal to the evaluation threshold, indicating that the three-level verification data do not pass the content query verification, and uploading the three-level verification data to an offline database;
stp17, if the parameter to be compared is smaller than the evaluation threshold, shows that the third-level check data passes the content query check, and supplements key fields to the third-level check data.
As described in steps Stp13-Stp17 above, content query verification is decided by the number of missing key fields in the tertiary check data; in this embodiment that number is calibrated as the parameter to be compared. When the parameter to be compared is greater than or equal to the evaluation threshold, the tertiary check data is judged to fail the verification and is synchronously uploaded to the offline database; otherwise, the key fields are supplemented into the tertiary check data, and the supplemented tertiary check data is judged to pass the verification and is uploaded to the online database.
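The threshold comparison can be sketched as follows. The evaluation threshold value and the source of the supplemented key-field values are illustrative assumptions; the patent does not state a concrete threshold.

```python
def query_check(record: dict, key_fields: list[str], reference: dict,
                threshold: int = 2) -> bool:
    """Count missing key fields (the parameter to be compared) and compare
    against the evaluation threshold; below-threshold records are repaired."""
    missing = [f for f in key_fields if record.get(f) is None]
    if len(missing) >= threshold:
        return False              # fails: uploaded to the offline database
    for f in missing:
        record[f] = reference[f]  # supplement key fields, then pass
    return True
```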
In a preferred embodiment, the step of inputting the data to be optimized into the data conversion model to obtain the unique data includes:
s401, acquiring data to be optimized from an offline database;
s402, calling a conversion algorithm from the data conversion model, inputting data to be optimized into the conversion algorithm, and calibrating a conversion result into unique data;
wherein the conversion algorithm is a hash algorithm.
As described in steps S401-S402 above, the data to be optimized is processed by the hash algorithm and then submitted to the Spark cluster for computation; the computed results are clustered and uploaded to the online database. During uploading, duplicate data already in the online database is screened out, and the currently uploaded data is retained as the updated online data.
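Deriving unique data with a hash algorithm can be sketched as follows. SHA-256 is an assumption standing in for the unspecified hash, records are assumed to be JSON-serializable dicts, and the Spark submission step is omitted.

```python
import hashlib
import json

def to_unique(records: list[dict]) -> dict[str, dict]:
    """Hash each record to a stable digest so that identical records
    collapse to a single entry (the 'unique data') before cluster computation."""
    unique = {}
    for rec in records:
        digest = hashlib.sha256(
            json.dumps(rec, sort_keys=True).encode("utf-8")
        ).hexdigest()
        unique[digest] = rec   # duplicates share a digest and overwrite
    return unique
```

The resulting digest-keyed records would then be distributed to the Spark cluster, with the digest serving as a natural partitioning and deduplication key.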
The invention also provides a multi-stage verification system for distributed data acquisition and storage, which is applied to the multi-stage verification method for distributed data acquisition and storage, and comprises the following steps:
the acquisition module is used for acquiring input data and carrying out disassembly processing on the input data to obtain a plurality of message bodies;
the identification module is used for adding identification information into the message body to obtain data to be verified, wherein the identification information comprises a date, a source, a destination, a size, a section, a name, a row ID and a file name;
the verification module is used for inputting the data to be verified into the multi-level verification model and judging whether the data to be verified passes the verification;
if yes, uploading the data to be checked to an online database through a database operation engine;
if not, word segmentation is carried out on the data to be checked to obtain data to be optimized, and the data to be optimized is synchronously uploaded to an offline database;
the data conversion module is used for inputting the data to be optimized into the data conversion model to obtain the unique data, carrying out cluster calculation on the unique data, carrying out cluster combination on calculation results, and uploading the combination results to the online database.
When the system described above runs, the acquisition module first acquires the input data and disassembles it by rows according to the disassembly template, yielding a number of message bodies. The identification module then adds identification information to the message bodies to obtain the data to be verified; this guarantees the uniqueness of the input data, so no data disorder occurs during subsequent transmission and verification. The verification module then verifies the data to be checked, executing multiple verification steps that ensure the smoothness of the verification process. Data that fails verification is determined to be data to be optimized and is uploaded to the data conversion module for conversion; the converted data then undergoes cluster computation and cluster combination, and finally the combined result is uploaded to the online database.
And a multi-stage verification terminal for distributed data acquisition and warehousing, comprising:
at least one processor;
and a memory communicatively coupled to the at least one processor;
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor, so that the at least one processor can execute the multi-stage verification method for distributed data acquisition and warehousing.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, apparatus, article, or method that comprises the element.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention. Structures, devices, and methods of operation not specifically described and illustrated herein are, unless otherwise indicated and limited, implemented according to conventional means in the art.

Claims (10)

1. A multi-stage verification method for distributed data acquisition and warehousing, characterized by comprising the following steps:
acquiring input data, and carrying out disassembly processing on the input data to obtain a plurality of message bodies;
adding identification information into the message body to obtain data to be verified, wherein the identification information comprises date, source, destination, size, section, name, row ID and file name;
inputting the data to be checked into a multi-level check model, and judging whether the data to be checked passes the check;
if yes, uploading the data to be checked to an online database through a database operation engine;
if not, word segmentation is carried out on the data to be verified to obtain data to be optimized, and the data to be optimized is synchronously uploaded to an offline database;
inputting the data to be optimized into a data conversion model to obtain unique data, carrying out cluster calculation on the unique data, carrying out cluster combination on calculation results, and uploading combined results to an online database.
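A minimal sketch of the identification step of claim 1, assuming a Python record. The eight field names come from the claim; the types and the `tag` helper are hypothetical:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Identification:
    # The eight fields enumerated in claim 1; the types are assumptions.
    date: str
    source: str
    destination: str
    size: int
    section: str
    name: str
    row_id: int
    file_name: str

def tag(body: str, ident: Identification) -> dict:
    # Merge the identification info with the message body to form
    # the data to be checked.
    record = asdict(ident)
    record["body"] = body
    return record
```

The frozen dataclass makes each record's identity immutable once assigned, which is one way to preserve the uniqueness the description relies on.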
2. The multi-stage verification method for distributed data acquisition and warehousing of claim 1, characterized in that: the input data is disassembled row by row.
3. The multi-stage verification method for distributed data acquisition and warehousing of claim 1, characterized in that the step of inputting the data to be checked into a multi-level check model and judging whether the data to be checked passes the check comprises:
acquiring data to be verified;
invoking check conditions from the check model, wherein the check conditions comprise a content repetition check, a content missing check, and a content query check;
and sequentially inputting the data to be checked into the check conditions, wherein data that meets the check conditions is determined to pass the check and is synchronously uploaded to the online database, while data that does not meet the check conditions is determined to fail the check and is synchronously uploaded to the offline database.
4. The multi-stage verification method for distributed data acquisition and warehousing of claim 3, characterized in that: the priority of the content repetition check is higher than that of the content missing check, and the priority of the content missing check is higher than that of the content query check;
and the data to be checked is checked step by step according to the priorities of the content repetition check, the content missing check, and the content query check, and when the data to be checked passes a check condition of higher priority, the check conditions of lower priority are not executed.
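The priority ordering and short-circuiting of claims 3 and 4 can be sketched as a chain of predicates. The three lambdas below are toy stand-ins for the claimed checks, not the patented logic:

```python
def run_checks(record, checks):
    # Run checks from highest to lowest priority; a record that passes a
    # higher-priority check skips the remaining ones, while a failing
    # record escalates to the next level.
    for level, check in enumerate(checks, start=1):
        if check(record):
            return True, level
    return False, len(checks)

# Illustrative stand-ins for the three claimed checks.
seen_bodies = {"duplicate"}
checks = [
    lambda r: r["body"] not in seen_bodies,   # content repetition check
    lambda r: "key" in r,                     # content missing check
    lambda r: r.get("missing_keys", 0) < 2,   # content query check
]
```

A fresh record exits at level 1, a duplicate with its key field intact exits at level 2, and a duplicate with too many missing keys falls all the way through.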
5. The multi-stage verification method for distributed data acquisition and warehousing of claim 3, characterized in that the step of performing the content repetition check comprises:
acquiring the data to be checked, uploading it to the online database, and judging whether repeated data consistent with the data to be checked exists in the online database;
if such repeated data exists, retaining the data to be checked and screening the repeated data out of the online database;
if not, obtaining the structural information of the data to be checked, calibrating it as primary check data, and judging whether a newly added field exists in the primary check data;
if a newly added field exists in the primary check data, querying the overall data reporting time according to the structure change time, and judging whether a repeated reporting record exists;
if so, cleaning the repeated data before the time node, retaining the primary check data, and summarizing it into a primary data set; otherwise, directly summarizing the primary check data into the primary data set;
if no newly added field exists in the primary check data, acquiring the date information of the primary check data, setting a date approval field based on the primary data set, and judging whether only the approval field in the primary check data is inconsistent with the date information;
if so, judging that the data to be checked passes the content repetition check, and summarizing the data to be checked into the primary data set;
if not, judging that the data to be checked does not pass the content repetition check, calibrating the data to be checked as secondary check data, and summarizing it into a secondary data set.
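A simplified sketch of the repetition check of claim 5, assuming dict records keyed by a `body` field. The reporting-time and date-approval branches are collapsed into the fall-through case, so this illustrates only the duplicate and new-field branches:

```python
def repetition_check(record, online_db, known_fields):
    # Duplicate branch: retain the incoming record, screen duplicates out.
    duplicates = [r for r in online_db if r["body"] == record["body"]]
    if duplicates:
        for d in duplicates:
            online_db.remove(d)
        online_db.append(record)
        return True
    # New-field branch: a schema change is treated as a fresh report here.
    if set(record) - known_fields:
        online_db.append(record)
        return True
    # No duplicate, no new field: escalate to the secondary (missing) check.
    return False
```

Note the claimed behavior keeps the newest copy: the incoming record replaces any older duplicate rather than being discarded.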
6. The multi-stage verification method for distributed data acquisition and warehousing of claim 5, characterized in that the step of performing the content missing check comprises:
acquiring secondary check data and corresponding missing fields thereof from the secondary data set;
acquiring a key field and an identification field corresponding to the secondary check data from the online database, and comparing the key field and the identification field with the secondary check data;
if a missing field in the secondary check data is a key field, judging that the secondary check data does not pass the content missing check, calibrating it as third-level check data, and summarizing it into a third-level data set;
if the missing fields in the secondary check data are identification fields or non-key fields, judging that the secondary check data passes the content missing check, and supplementing the identification information and non-key field information into the secondary check data.
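The content missing check of claim 6 distinguishes key fields from identification and non-key fields. A minimal sketch, with `defaults` standing in for the reference information held in the online database:

```python
def missing_check(record, key_fields, id_fields, defaults):
    # A record missing a key field fails and becomes third-level check data;
    # missing identification / non-key fields are supplemented from defaults.
    missing_keys = [f for f in key_fields if f not in record]
    if missing_keys:
        return False, missing_keys
    for f in id_fields:
        record.setdefault(f, defaults.get(f))
    return True, []
```

A record with all key fields present is repaired in place; one missing any key field is handed on with the list of what it lacks.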
7. The multi-stage verification method for distributed data acquisition and warehousing of claim 6, characterized in that the step of performing the content query check comprises:
acquiring the third-level check data from the third-level data set;
counting the number of missing key fields in the third-level check data, and calibrating the number as the parameter to be compared;
acquiring an evaluation threshold and comparing it with the parameter to be compared;
if the parameter to be compared is greater than or equal to the evaluation threshold, the third-level check data does not pass the content query check, and the third-level check data is uploaded to the offline database;
and if the parameter to be compared is smaller than the evaluation threshold, the third-level check data passes the content query check, and the key fields are supplemented into the third-level check data.
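The threshold comparison of claim 7 can be sketched as follows. The claim as printed says both branches pass the check, which appears to be a typo, so this sketch assumes a record at or above the threshold fails and stays offline:

```python
def query_check(record, key_fields, threshold):
    # Count missing key fields and compare against the evaluation threshold.
    missing = [f for f in key_fields if f not in record]
    if len(missing) >= threshold:
        return False          # stays in the offline database
    for f in missing:
        record[f] = None      # supplement key fields with placeholder values
    return True
```

The placeholder value `None` is an assumption; the claim says only that the key fields are supplemented, not with what.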
8. The multi-stage verification method for distributed data acquisition and warehousing of claim 1, characterized in that the step of inputting the data to be optimized into a data conversion model to obtain unique data comprises:
obtaining data to be optimized from the offline database;
invoking a conversion algorithm from the data conversion model, inputting the data to be optimized into the conversion algorithm, and calibrating the conversion result as the unique data;
wherein the conversion algorithm is a hash algorithm.
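Claim 8 names only "a hash algorithm" as the conversion; a sketch using SHA-256 as an assumed choice:

```python
import hashlib

def to_unique(body: str) -> str:
    # The claim does not fix a specific hash; SHA-256 is an assumption.
    # The digest gives each distinct body a stable unique identifier.
    return hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Identical bodies map to the same digest, so downstream cluster calculation and merging can treat the digest as the record's unique key.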
9. A multi-stage verification system for distributed data acquisition and warehousing, applying the multi-stage verification method for distributed data acquisition and warehousing according to any one of claims 1 to 8, characterized in that it comprises:
the acquisition module is used for acquiring input data and carrying out disassembly processing on the input data to obtain a plurality of message bodies;
the identification module is used for adding identification information into the message body to obtain data to be verified, wherein the identification information comprises a date, a source, a destination, a size, a section, a name, a row ID and a file name;
the verification module is used for inputting the data to be verified into a multi-level verification model and judging whether the data to be verified passes the verification;
if yes, uploading the data to be checked to an online database through a database operation engine;
if not, word segmentation is carried out on the data to be verified to obtain data to be optimized, and the data to be optimized is synchronously uploaded to an offline database;
the data conversion module is used for inputting the data to be optimized into a data conversion model to obtain unique data, carrying out cluster calculation on the unique data, carrying out cluster combination on calculation results, and uploading combined results to an online database.
10. A multi-stage verification terminal for distributed data acquisition and warehousing, characterized in that it comprises:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the multi-stage verification method for distributed data acquisition and warehousing of any one of claims 1 to 8.
CN202310967006.5A 2023-08-03 2023-08-03 Multistage verification system and method for distributed data acquisition and warehousing Active CN116701381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310967006.5A CN116701381B (en) 2023-08-03 2023-08-03 Multistage verification system and method for distributed data acquisition and warehousing


Publications (2)

Publication Number Publication Date
CN116701381A true CN116701381A (en) 2023-09-05
CN116701381B CN116701381B (en) 2023-11-03

Family

ID=87839625


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975206A (en) * 2023-09-25 2023-10-31 华云天下(南京)科技有限公司 Vertical field training method and device based on AIGC large model and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002010962A1 (en) * 2000-07-28 2002-02-07 Storymail, Inc. System, method and computer program product for device, operating system, and network transport neutral secure interactive multi-media messaging
CN110598466A (en) * 2019-07-30 2019-12-20 百度时代网络技术(北京)有限公司 Offline field checking method, device and equipment and computer readable storage medium
CN111291026A (en) * 2018-12-07 2020-06-16 北京京东尚科信息技术有限公司 Data access method, system, device and computer readable medium
CN111711623A (en) * 2020-06-15 2020-09-25 深圳前海微众银行股份有限公司 Data verification method and device
CN113343556A (en) * 2021-05-07 2021-09-03 青岛蓝智现代服务业数字工程技术研究中心 Supply chain optimizing system
CN116303385A (en) * 2023-02-13 2023-06-23 中国铁塔股份有限公司 Data auditing method and device, electronic equipment and storage medium
US20230237029A1 (en) * 2022-01-25 2023-07-27 Dell Products L.P. Data deduplication in a storage system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU Quan et al., "Adaptive Phasor Algorithm Based on Multi-level Self-check and Multiple Switching", Southern Power System Technology, vol. 13, no. 4, pp. 18-24 *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant