CN113656395B - Data quality control method, device, equipment and storage medium - Google Patents

Publication number
CN113656395B
Authority
CN
China
Prior art keywords
data segment
new input
input data
database
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111203090.0A
Other languages
Chinese (zh)
Other versions
CN113656395A (en)
Inventor
周文明
花霖
冯建设
陈军
刘桂芬
王春洲
张挺军
杨欢
朱瑜鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xinrun Fulian Digital Technology Co Ltd
Original Assignee
Shenzhen Xinrun Fulian Digital Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution

Abstract

The invention discloses a data quality control method, device, equipment and storage medium. The method comprises the following steps: extracting a new input data segment newly input into a database; disassembling the feature points of the new input data segment; performing similar feature matching between the feature points of the disassembled new input data segment and the feature points of the original data segments of the database; obtaining the similarity level of the new input data segment according to the result of the similar feature matching; extracting the new input data segments whose similarity level is higher than a first threshold; calculating a data similarity result for those new input data segments by means of a data similarity algorithm; and, according to the data similarity result, integrating the new input data segments whose similarity is higher than a second threshold into the database. The invention offers high data integration and quality control efficiency, a low error rate and a fast integration response speed.

Description

Data quality control method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a data quality control method, a data quality control device, data quality control equipment and a storage medium.
Background
Data quality management refers to a series of management activities, such as identification, measurement, monitoring and early warning, directed at the data quality problems that may arise at each stage of the data life cycle (planning, acquisition, storage, sharing, maintenance, application and retirement), and to further improving data quality by raising the management level of the organization. When new master data is added to a database, data similarity must be measured and duplicate master data must be matched. Existing data quality management methods, however, suffer from a high error rate, a slow integration response speed and low efficiency in integrating new data with original data.
Disclosure of Invention
In view of the above technical problems, the present invention provides a method, an apparatus, a device and a storage medium for data quality management, so as to provide a technical scheme with low error rate and fast integration reaction speed.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present invention, a data quality management method is disclosed, the method comprising:
extracting a new input data segment of a new input database;
disassembling the characteristic points of the new input data segment;
performing similar feature matching on the feature points of the disassembled new input data segment and the feature points of the original data segment of the database;
obtaining the similarity grade of the new input data segment according to the result of the similar feature matching;
extracting the new input data segment with the similarity level higher than a first threshold value;
calculating the new input data segment with the similarity level higher than the first threshold value by a data similarity algorithm to obtain a data similarity result;
and according to the data similarity result, integrating the newly input data segment with the similarity higher than a second threshold value into the database.
Further, the extracting the new input data segment of the new input database includes: detecting in real time the new input data segment newly entered into the database; and, when a newly entered new input data segment is detected, extracting and marking the new input data segment.
Further, the disassembling the feature points of the new input data segment includes: and adopting a splitting algorithm to disassemble and extract the characteristic points of the new input data segment.
Further, the performing similar feature matching on the feature points of the disassembled new input data segment and the feature points of the original data segment of the database includes: performing similar feature matching on the feature points of the disassembled new input data segment and a plurality of feature points of the original data segment in the database; and sorting the similar feature matching results, and arranging the similar feature matching results according to the sequence from higher matching degree to lower matching degree.
Further, the obtaining the similarity level of the new input data segment according to the result of the similar feature matching includes: analyzing the result of the similar feature matching; and determining the similarity grade of the new input data segment and other original data segments in the database according to the different matching degrees of the similar features of the new input data segment and the plurality of original data segments in the database.
Further, the integrating the newly input data segment with the similarity higher than the second threshold into the database includes: and integrating the bytes similar to the original data segment and the new input data segment in the database one by one.
Further, the method further comprises: acquiring a record of the new input data segment integrated into the database; generating a quality governance report from the record; and storing the quality governance report in a log.
According to a second aspect of the present disclosure, there is provided a data quality improvement device, including: the identification module is used for extracting a new input data segment of a new input database; the first data processing module is used for disassembling the characteristic points of the new input data segment; the characteristic point matching module is used for carrying out similar characteristic matching on the characteristic points of the disassembled new input data segment and the characteristic points of the original data segment of the database; the analysis module is used for obtaining the similarity grade of the new input data segment according to the result of the similar feature matching; the extracting module is used for extracting the new input data segment with the similarity level higher than a first threshold value; the second data processing module is used for calculating the new input data segment with the similarity level higher than the first threshold value through a data similarity algorithm to obtain a data similarity result; and the integration module is used for integrating the newly input data segment with the similarity higher than a second threshold into the database according to the data similarity result.
According to a third aspect of the present disclosure, there is provided a data quality improvement apparatus comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to: extracting a new input data segment of a new input database; disassembling the characteristic points of the new input data segment; performing similar feature matching on the feature points of the disassembled new input data segment and the feature points of the original data segment of the database; obtaining the similarity grade of the new input data segment according to the result of the similar feature matching; extracting the new input data segment with the similarity level higher than a first threshold value; calculating the new input data segment with the similarity level higher than the first threshold value by a data similarity algorithm to obtain a data similarity result; and according to the data similarity result, integrating the newly input data segment with the similarity higher than a second threshold value into the database.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, performs the above-described data quality improvement method.
The technical scheme of the disclosure has the following beneficial effects:
According to the data quality control scheme of this design, the feature points of the newly input data segment are matched against those of a plurality of data segments in the database to determine the original data segments with a high matching degree. Similarity is then calculated between those original data segments and the newly input data segment, so that original data with high similarity can be integrated with the newly input data segment. As a result, the data quality control method achieves high data integration efficiency, high quality control efficiency, a low error rate and a faster integration response speed.
Drawings
FIG. 1 is a flowchart of a data quality control method in an embodiment of the present specification;
FIG. 2 is an example graph of cumulative variance in an embodiment of the present specification;
FIG. 3 is a flowchart of extracting a new input data segment of a new input database in an embodiment of the present specification;
FIG. 4 is a flowchart of matching similar features between a new input data segment and an original data segment in an embodiment of the present specification;
FIG. 5 is a flowchart of obtaining a similarity level of a new input data segment in an embodiment of the present specification;
FIG. 6 is a flowchart of obtaining a quality governance report in an embodiment of the present specification;
FIG. 7 is a block diagram showing the structure of a data quality control device in an embodiment of the present specification;
FIG. 8 shows a terminal device for a data quality control method in an embodiment of the present specification;
FIG. 9 shows a computer-readable storage medium for a data quality control method in an embodiment of the present specification.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are only schematic illustrations of the present disclosure. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
As shown in fig. 1, an embodiment of the present disclosure provides a data quality management method, where an execution subject of the method may be a server. The method specifically comprises the following steps S101-S107:
In step S101, a new input data segment of the new input database is extracted.
The new input data segment may be data entered into the database from a foreground interface, data imported into the database from Excel, or data transmitted into the database through a data acquisition interface. In other words, new input data is data that has been transmitted into the database but has not yet been processed by the data quality management method.
In step S102, the feature points of the new input data segment are disassembled.
Disassembling the feature points contained in the new input data segment means extracting those feature points one by one.
Exemplarily, after the features are extracted, they may be scaled to unit variance to complete data preprocessing, so that different features are scaled to the same interval. The data preprocessing process may be expressed by the following equation:
$$x'_i = \frac{x_i - \mu}{\sigma}$$
where $\mu$ and $\sigma$ are respectively the mean and standard deviation of a given feature column of the data set, $x_i$ is a data value in that feature column, $\sigma^2$ is the variance of the feature column, and $x'_i$ represents the corresponding scaled feature point of the data.
The mean $\mu$ is the quotient of the sum of all data in the feature column and the number $n$ of data in that column:
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The standard deviation $\sigma$ is the square root of the arithmetic mean of the squared deviations of the feature column's values from their mean:
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}$$
In step S103, similar feature matching is performed on the feature points of the disassembled new input data segment and the feature points of the original data segment of the database.
After the feature points of the new input data segment are extracted, they are matched one by one with the feature points of the original data segments in the database to determine the matching degree between them. This identifies which original data segments in the database the new input data segment resembles and narrows the scope of the subsequent data similarity calculation. The data quality control method can therefore suggest the original data segments similar to the new input data segment, which greatly improves the efficiency of matching and merging with the master data.
Specifically, after the feature point disassembly of step S102 is completed, a covariance matrix of the data samples is constructed. For example, the covariance $\sigma_{jk}$ between two features $x_j$ and $x_k$ of the data can be calculated by the following formula:
$$\sigma_{jk} = \frac{1}{n}\sum_{i=1}^{n}\left(x_j^{(i)} - \mu_j\right)\left(x_k^{(i)} - \mu_k\right)$$
where $\mu_j$ and $\mu_k$ are the means of features j and k. A positive covariance between two features indicates that they increase or decrease together, whereas a negative covariance indicates that they vary in opposite directions. Covariance is a measure of linear correlation in the joint distribution of random variables. The larger the absolute value of the covariance $\sigma_{jk}$, the higher the similarity of the data; the smaller the absolute value, the lower the similarity. A covariance of zero indicates that the data have no similarity or only a weak linear relationship. In this way, the matching degree between the feature points of the new input data segment and the feature points of the original data segments in the database can be obtained.
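The covariance between two features described above can be sketched as follows. Dividing by n (the population form) is an assumption made here, and the function name `covariance` and the sample columns are hypothetical.

```python
import numpy as np

def covariance(xj: np.ndarray, xk: np.ndarray) -> float:
    """Population covariance of two feature columns (divides by n)."""
    return float(np.mean((xj - xj.mean()) * (xk - xk.mean())))

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])   # rises with a
c = np.array([8.0, 6.0, 4.0, 2.0])   # falls as a rises

cov_ab = covariance(a, b)   # positive: the features move together
cov_ac = covariance(a, c)   # negative: the features move in opposite directions
```

The same value is returned by `np.cov(a, b, bias=True)[0, 1]`; without `bias=True`, `np.cov` divides by n - 1 instead.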
In step S104, a similarity level of the new input data segment is obtained according to the result of the similar feature matching.
The similarity grade of the data is determined according to different matching degrees of similar characteristic points of the new input data segment and a plurality of original data segments in the database, and is exemplarily divided into descending series such as A1, A2 and A3, wherein the grade of A1 is the maximum.
In step S105, the new input data segment whose similarity level is higher than the first threshold is extracted.
The first threshold is preset. For example, the first threshold may be level A3, in which case the new input data segments whose similarity level is higher than A3 are extracted according to the levels assigned in step S104.
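Steps S104 and S105 can be sketched as a grading function followed by a filter. The source does not state how matching degrees map to levels, so the cut points 0.8 and 0.5, the function names and the segment identifiers below are all hypothetical.

```python
GRADES = ["A1", "A2", "A3"]  # descending order; A1 is the highest level

def grade(match_degree):
    """Map a matching degree in [0, 1] to a similarity level (assumed cut points)."""
    if match_degree >= 0.8:
        return "A1"
    if match_degree >= 0.5:
        return "A2"
    return "A3"

def above_first_threshold(graded, first_threshold="A3"):
    """Return the segments whose level is strictly higher than the threshold level."""
    limit = GRADES.index(first_threshold)
    return [seg for seg, level in graded.items() if GRADES.index(level) < limit]

# Hypothetical matching degrees of three new input data segments
degrees = {"seg-1": 0.91, "seg-2": 0.62, "seg-3": 0.20}
graded = {seg: grade(d) for seg, d in degrees.items()}
selected = above_first_threshold(graded)
```

With the first threshold at A3, only the segments graded A1 or A2 go on to the similarity calculation of step S106.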
In step S106, the new input data segment with the similarity level higher than the first threshold is calculated by a data similarity algorithm, so as to obtain a data similarity result.
In this step, similarity is calculated between the new input data extracted in step S105 and the original data in the database. Specifically, the importance of the principal component features is analyzed and evaluated by computing the eigenvectors and eigenvalues of the covariance matrix. Exemplarily, a covariance matrix $\Sigma$ covering four features can be expressed as follows:
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} & \sigma_{14} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} & \sigma_{24} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 & \sigma_{34} \\ \sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_4^2 \end{pmatrix}$$
The eigenvectors of the covariance matrix $\Sigma$ represent the principal components, and the corresponding eigenvalues give their magnitudes. A large eigenvalue indicates that the eigenvector is important and contributes strongly to the data similarity; a small eigenvalue indicates that the eigenvector is less important and contributes little to the data similarity. The covariances are those obtained in step S103.
Exemplarily, the numpy.cov function in Python is applied to calculate the covariance matrix $\Sigma$ of the data set, and the linalg.eig function is then used to perform eigendecomposition, yielding a vector of 4 eigenvalues and the corresponding eigenvectors, which are stored as the columns of a 4 × 4 matrix. Because the importance of an eigenvector is determined by the magnitude of its eigenvalue, the eigenvalues can be arranged in descending order.
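The eigendecomposition described above can be sketched with `numpy.cov` and `numpy.linalg.eig`. The random data set stands in for real standardized feature data and is an assumption, as are the variable names.

```python
import numpy as np

# Hypothetical standardized data: 50 samples (rows) with 4 features (columns)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# np.cov treats each row as a variable, so the sample matrix is transposed.
# By default it divides by n - 1.
cov_mat = np.cov(X.T)

# Column eigen_vecs[:, i] is the eigenvector paired with eigen_vals[i].
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)

# Arrange eigenvalues, and their eigenvectors, in descending order of importance.
order = np.argsort(eigen_vals)[::-1]
eigen_vals = eigen_vals[order]
eigen_vecs = eigen_vecs[:, order]
```

Because a covariance matrix is symmetric, `numpy.linalg.eigh` is a possible alternative; it returns eigenvalues in ascending order, so the same reordering step would still apply.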
The variance contribution rate of each eigenvalue can be calculated and plotted. The variance contribution rate of an eigenvalue $\lambda_j$ is defined as the ratio of $\lambda_j$ to the sum of all $d$ eigenvalues:
$$\frac{\lambda_j}{\sum_{k=1}^{d}\lambda_k}$$
Specifically, the cumulative variance may be calculated with NumPy's cumsum function, and the example graph shown in FIG. 2 is generated with matplotlib's step function. According to FIG. 2, the first principal component accounts for about 40% of the total variance, and the first three principal components together account for nearly 90%. The principal components that matter most for data similarity can thus be analyzed further to obtain the data similarity result.
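The variance contribution rate and its cumulative sum can be sketched as follows. The four eigenvalues are made-up numbers chosen only to resemble the proportions quoted for FIG. 2; they are not taken from the patent's data.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order
eigen_vals = np.array([4.0, 3.0, 2.0, 1.0])

# Variance contribution rate: each eigenvalue divided by the sum of all eigenvalues
var_ratio = eigen_vals / eigen_vals.sum()

# Cumulative variance explained by the first 1, 2, 3 and 4 principal components
cum_var = np.cumsum(var_ratio)
```

The `cum_var` array could then be passed to `matplotlib.pyplot.step` to draw a staircase plot of the kind described for FIG. 2.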
In step S107, according to the data similarity result, the newly input data segment with similarity higher than the second threshold is integrated into the database.
The second threshold is preset. The data similarity result obtained in step S106 is compared with the second threshold, and the newly input data segments whose similarity is higher than the second threshold are integrated into the database according to the comparison result, which completes the data governance of the newly input data segment.
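The comparison against the second threshold can be sketched as a simple filter. The source does not detail the integration operation itself, so appending to a list stands in for it here; the function name `integrate`, the record fields and the threshold value 0.8 are assumptions.

```python
def integrate(database, candidates, second_threshold):
    """Append candidates whose similarity exceeds the threshold and return them."""
    accepted = [segment for segment, similarity in candidates
                if similarity > second_threshold]
    database.extend(accepted)   # placeholder for the actual integration step
    return accepted

# Hypothetical database content and similarity results from step S106
database = [{"id": 1, "name": "Acme Ltd"}]
candidates = [({"id": 2, "name": "Acme Limited"}, 0.93),
              ({"id": 3, "name": "Zenith GmbH"}, 0.41)]

accepted = integrate(database, candidates, second_threshold=0.8)
```

Only the first candidate exceeds the threshold, so it is the only segment integrated.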
Additionally, as shown in fig. 3, in another embodiment, when step S101 is executed, the following steps S201 to S203 are specifically executed:
in step S201, the new input data segment newly input into the database is detected in real time.
In step S202, the arrival of a newly input data segment is detected.
In step S203, the new input data segment is extracted and marked.
Through steps S201-S203, the new input data segment is detected in real time and extracted, which makes it convenient to integrate it with the similar original data segments.
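Steps S201-S203 can be sketched as a routine that is called on each polling cycle. How the database is actually monitored is not specified in the source; the `id` key, the `is_new_input` marker and the function name are hypothetical.

```python
def extract_new_segments(rows, seen_ids):
    """Return rows not seen before, each marked as a new input data segment."""
    new_segments = []
    for row in rows:
        if row["id"] not in seen_ids:              # S202: a new input is detected
            seen_ids.add(row["id"])
            marked = {**row, "is_new_input": True}  # S203: extract and mark
            new_segments.append(marked)
    return new_segments

# Hypothetical snapshot of the database; row 1 was processed earlier
seen = {1}
rows = [{"id": 1, "name": "Acme Ltd"}, {"id": 2, "name": "Acme Limited"}]
new_segments = extract_new_segments(rows, seen)
```

Calling the function again with the same rows returns an empty list, because every identifier is then already recorded in `seen`.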
Additionally, in another embodiment, in step S102, a splitting algorithm is used to disassemble and extract the feature points of the new input data segment.
The characteristic points of the newly-input data segment are disassembled and extracted by using a splitting algorithm, so that the characteristic points are conveniently matched with the characteristic points of the original data segment in the database, the original data segment with high matching degree is quickly determined, and the calculation efficiency by using a similarity algorithm in the subsequent steps is improved.
Additionally, as shown in FIG. 4, in another embodiment, when step S103 is executed, the method includes the following steps S301 to S302:
in step S301, similar feature matching is performed on the feature points of the disassembled new input data segment and the plurality of feature points of the original data segment in the database.
In step S302, the similar feature matching results are sorted and arranged in the order from higher matching degree to lower matching degree.
Through steps S301-S302, the original data segments in the database that the newly input data segment resembles are identified and the scope of the data similarity calculation is narrowed, so that the method can suggest the original data segments similar to the newly input data segment and greatly improve the efficiency of matching and merging with the master data.
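The ordering in step S302 can be sketched with Python's built-in `sorted`. The matching degrees and segment identifiers below are invented for illustration.

```python
# Hypothetical matching degrees between one new input data segment
# and three original data segments in the database
matches = {"orig-1": 0.42, "orig-2": 0.87, "orig-3": 0.65}

# Arrange the results from higher matching degree to lower matching degree
ranked = sorted(matches.items(), key=lambda item: item[1], reverse=True)
best_segment, best_degree = ranked[0]
```

The head of `ranked` identifies the original data segment that is the strongest candidate for the later similarity calculation.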
Additionally, as shown in FIG. 5, in another embodiment, when step S104 is executed, the method includes the following steps S401 to S402:
in step S401, the result of the similar feature matching is analyzed.
In step S402, determining a similarity level between the new input data segment and some of the original data segments in the database according to a difference in matching degree between the new input data segment and the similar features of the original data segments in the database.
Through steps S401-S402, the similarity level between the new input data segment and the original data segments in the database is readily determined, and the original data segments with a high degree of similarity are quickly obtained from that level. This improves the efficiency of suggesting original data segments similar to the new input data segment and of matching and merging them with the master data.
In an exemplary embodiment, when step S107 is executed, the integrating the newly input data segment with the similarity higher than the second threshold into the database specifically includes: and integrating the bytes similar to the original data segment and the new input data segment in the database one by one.
In an alternative embodiment, as shown in fig. 6, after the step S107 is executed, the method further includes steps S501 to S503:
in step S501, a record of the new input data segment integrated into the database is obtained.
In step S502, a quality governance report is generated from the record.
In step S503, the quality governance report is stored in a log.
Through the steps S501-S503, the integrated data records can be conveniently checked by the staff, and the subsequent maintenance and use of the database are facilitated.
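Steps S501-S503 can be sketched with the standard `json` and `logging` modules. The report fields, the logger name and the record layout are assumptions, and an in-memory stream stands in for a real log file.

```python
import io
import json
import logging
from datetime import datetime, timezone

def build_report(records):
    """Summarize integration records as a quality governance report."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "integrated_count": len(records),
        "records": records,
    }

# An in-memory stream stands in for a log file in this sketch.
log_stream = io.StringIO()
logger = logging.getLogger("data_quality_governance")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(log_stream))

# S501: hypothetical record of a segment integrated into the database
records = [{"segment_id": 2, "merged_into": 1, "similarity": 0.93}]
report = build_report(records)      # S502: generate the report
logger.info(json.dumps(report))     # S503: store the report in the log
```

Replacing the stream handler with `logging.FileHandler` would write the same JSON line to a file that staff can review later.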
Based on the same idea, an exemplary embodiment of the present disclosure also provides a data quality improvement apparatus, as shown in fig. 7, the data quality improvement apparatus 600 includes: the identification module 601 is used for extracting a new input data segment of a new input database; a first data processing module 602, configured to disassemble the feature points of the new input data segment; a feature point matching module 603, configured to perform similar feature matching on the feature points of the disassembled new input data segment and the feature points of the original data segment of the database; an analysis module 604, configured to obtain a similarity level of the new input data segment according to the result of the similar feature matching; an extracting module 605, configured to extract the new input data segment with a similarity level higher than a first threshold; the second data processing module 606 is configured to calculate, by using a data similarity algorithm, the new input data segment with a similarity level higher than the first threshold to obtain a data similarity result; and an integrating module 607, configured to integrate, according to the data similarity result, the newly input data segment with the similarity higher than the second threshold into the database.
In an alternative embodiment, the identification module 601 specifically includes: the detection unit detects the new input data segment newly input into the database in real time; and the reading unit is used for extracting and marking the new input data segment when the detection unit detects the newly input new input data segment.
In an alternative embodiment, the first data processing module 602 is specifically configured to disassemble and extract the feature points of the new input data segment by using a splitting algorithm.
In an optional embodiment, the feature point matching module 603 specifically includes a matching unit, configured to perform similar feature matching on the feature points of the new input data segment after being disassembled and a plurality of feature points of the original data segment in the database; and the sorting unit is used for sorting the similar feature matching results and arranging the similar feature matching results according to the sequence from higher matching degree to lower matching degree.
In an alternative embodiment, the analysis module 604 specifically includes: the analysis unit analyzes the result of the similar feature matching; and the marking unit is used for determining the similarity grade of the new input data segment and other original data segments in the database according to the different matching degrees of the similar features of the new input data segment and the plurality of original data segments in the database.
In an alternative embodiment, the integration module 607 is specifically configured to integrate the bytes of the original data segment in the database similar to the new input data segment one by one.
In an alternative embodiment, the data quality improvement apparatus 600 further comprises a storage module 608, the storage module 608 comprising: an acquisition unit, used for acquiring a record of the new input data segment integrated into the database; a report generating unit, used for generating a quality governance report from the record; and a storage unit, used for storing the quality governance report in a log.
The embodiment of the specification provides a data quality control device in which the feature points of a new input data segment are matched against those of a plurality of data segments in a database to determine the original data segments with a high matching degree. Similarity is then calculated between those original data segments and the new input data segment, so that original data with high similarity can be integrated with the new input data segment. As a result, data integration efficiency and data quality control efficiency are high, the error rate is low, and the integration response speed is increased.
The specific details of each module/unit in the above-mentioned apparatus have been described in detail in the method section, and the details that are not disclosed may refer to the contents of the method section, and thus are not described again.
Based on the same idea, the embodiment of the present specification further provides a data quality management device, as shown in fig. 8.
The data quality management device may be the terminal device or the server provided in the above embodiments.
The data quality management device may vary considerably depending on configuration or performance. It may include one or more processors 701 and a memory 702, where the memory 702 may store one or more applications or data. The memory 702 may include readable media in the form of volatile memory units, such as random access memory (RAM) units and/or cache memory units, and may further include read-only memory units. The application programs stored in memory 702 may include one or more program modules (not shown), including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment. Further, processor 701 may be configured to communicate with memory 702 and to execute, on the data quality management device, a series of computer-executable instructions in memory 702. The data quality management device may also include one or more power supplies 703, one or more wired or wireless network interfaces 704, one or more I/O (input/output) interfaces 705 and one or more external devices 706 (e.g., keyboard, pointing device, Bluetooth device, etc.). It may also communicate with one or more devices that enable a user to interact with the device, and/or with any device (e.g., router, modem, etc.) that enables the device to communicate with one or more other computing devices. Such communication may occur via I/O interface 705. The device may also communicate with one or more networks (e.g., a local area network (LAN)) via the wired or wireless network interface 704.
Specifically, in this embodiment, the data quality governance device includes a memory 702 and one or more programs stored in the memory 702. The one or more programs may include one or more modules, each of which may include a series of computer-executable instructions for the data quality governance device. The one or more programs, configured to be executed by the one or more processors 701, include computer-executable instructions for:
extracting a new input data segment newly entered into the database; disassembling the feature points of the new input data segment; performing similar feature matching on the feature points of the disassembled new input data segment and the feature points of the original data segment of the database; obtaining the similarity level of the new input data segment according to the result of the similar feature matching; extracting the new input data segment with the similarity level higher than a first threshold value; calculating the new input data segment with the similarity level higher than the first threshold value by a data similarity algorithm to obtain a data similarity result; and, according to the data similarity result, integrating the new input data segment with the similarity higher than a second threshold value into the database.
Extracting the new input data segment newly entered into the database comprises: detecting in real time whether a new input data segment has been entered into the database; and, when a new input data segment is detected, extracting and marking the new input data segment.
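The detect, extract, and mark steps can be illustrated with a minimal Python sketch. The `Segment` class, the `marked_new` flag, and the idea of comparing segment identifiers against a set of already-seen identifiers are assumptions made for illustration; the specification does not say how detection or marking is implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Segment:
    """Hypothetical container for one data segment."""
    segment_id: str
    values: List[float]
    marked_new: bool = False


def extract_new_segments(store: Dict[str, Segment], seen: Set[str]) -> List[Segment]:
    """Return the segments not seen on earlier polls, marking each as new."""
    fresh = []
    for segment_id, segment in store.items():
        if segment_id not in seen:
            segment.marked_new = True  # the "marking" step
            seen.add(segment_id)       # so the next poll skips this segment
            fresh.append(segment)
    return fresh


# Hypothetical database contents: "a" was already present, "b" is new.
store = {"a": Segment("a", [1.0, 2.0]), "b": Segment("b", [3.0, 4.0])}
seen = {"a"}
new_segments = extract_new_segments(store, seen)
```

Calling `extract_new_segments` again with the same `seen` set returns an empty list, which is what a repeated poll would be expected to do when nothing new has arrived.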
Disassembling the feature points of the new input data segment comprises: disassembling and extracting the feature points of the new input data segment by using a splitting algorithm.
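Claim 1 describes this disassembly as a unit-variance step: a value is subtracted from the data in the segment and the difference is divided by the segment's standard deviation. The sketch below assumes that the subtracted value is the segment mean (ordinary standardization), uses the population standard deviation, and returns zeros for a constant segment; all three choices are assumptions, not statements from the specification.

```python
from statistics import mean, pstdev
from typing import List


def standardize(segment: List[float]) -> List[float]:
    """Scale a segment to zero mean and unit variance (assumed reading of claim 1)."""
    mu = mean(segment)
    sigma = pstdev(segment)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant segment has no spread; returning zeros is an assumed fallback.
        return [0.0 for _ in segment]
    return [(x - mu) / sigma for x in segment]


# Example segment with mean 5 and population standard deviation 2.
features = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After this step every segment is on the same scale, so the covariance computed in the matching step is not dominated by segments with large raw values.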
Performing similar feature matching on the feature points of the disassembled new input data segment and the feature points of the original data segment of the database comprises: performing similar feature matching on the feature points of the disassembled new input data segment and a plurality of feature points of the original data segment in the database; and sorting the similar feature matching results in order from highest to lowest matching degree.
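As a rough illustration of matching followed by sorting, the sketch below scores each original segment by the absolute value of its covariance with the new segment, which is how claim 1 describes the matching result, and then sorts from highest to lowest. The function names, the dictionary of original segments, and the use of population covariance are assumptions.

```python
from typing import Dict, List, Tuple


def covariance(a: List[float], b: List[float]) -> float:
    """Population covariance of two equal-length feature vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("feature vectors must be non-empty and equal in length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def rank_matches(
    new_features: List[float], originals: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score each original segment by |covariance| and sort from high to low."""
    scored = [
        (name, abs(covariance(new_features, feats)))
        for name, feats in originals.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical original segments already stored in the database.
originals = {
    "seg_a": [-1.0, 0.0, 1.0],
    "seg_b": [1.0, 0.0, -1.0],
    "seg_c": [0.0, 0.0, 0.0],
}
ranking = rank_matches([-1.0, 0.0, 1.0], originals)
```

Because the absolute value is used, a segment that mirrors the new one (`seg_b`) scores as high as an identical one (`seg_a`), while a flat segment (`seg_c`) scores zero.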
Obtaining the similarity level of the new input data segment according to the result of the similar feature matching comprises: analyzing the result of the similar feature matching; and determining the similarity level between the new input data segment and the original data segments in the database according to the differing matching degrees of the similar features of the new input data segment and the plurality of original data segments in the database.
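One possible way to turn matching degrees into a similarity level is to bucket them against fixed cut-offs and take the best level across the original segments. The cut-off values, the integer levels, the `FIRST_THRESHOLD` value, and the "best match wins" rule below are all hypothetical; the specification only states that the level is determined from the differing matching degrees.

```python
from typing import List, Tuple

# Hypothetical cut-offs: (minimum matching degree, similarity level).
LEVEL_BOUNDS: List[Tuple[float, int]] = [(0.9, 3), (0.6, 2), (0.3, 1)]


def similarity_level(matching_degree: float) -> int:
    """Map one matching degree to a discrete level; 0 means no useful match."""
    for lower_bound, level in LEVEL_BOUNDS:
        if matching_degree >= lower_bound:
            return level
    return 0


def segment_level(matching_degrees: List[float]) -> int:
    """Level of a new segment, taken from its best match among the originals."""
    return max((similarity_level(d) for d in matching_degrees), default=0)


FIRST_THRESHOLD = 1  # hypothetical value of the first threshold
level = segment_level([0.95, 0.4, 0.1])
passes_first_filter = level > FIRST_THRESHOLD
```

Only segments for which `passes_first_filter` is true would go on to the more expensive data similarity calculation.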
Integrating the new input data segment whose similarity is higher than the second threshold into the database comprises: integrating, one by one, the bytes that are similar between the original data segment in the database and the new input data segment.
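The data similarity result that gates this integration is described in claim 1 as coming from a covariance matrix and its eigenvector, but the specification does not say how the eigenvector is turned into a number. The sketch below is one assumed reading: it builds the 2x2 covariance matrix of a new segment and an original segment, takes the dominant eigenvector in closed form, and reports the share of total variance along that eigenvector. `SECOND_THRESHOLD` and all function names are hypothetical.

```python
import math
from typing import List, Tuple


def covariance(a: List[float], b: List[float]) -> float:
    """Population covariance of two equal-length feature vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def dominant_eigenvector(a: float, b: float, c: float) -> Tuple[float, float]:
    """Unit eigenvector for the largest eigenvalue of [[a, b], [b, c]]."""
    if b == 0:
        return (1.0, 0.0) if a >= c else (0.0, 1.0)
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    x, y = lam - c, b
    norm = math.hypot(x, y)
    return x / norm, y / norm


def similarity_result(new: List[float], orig: List[float]) -> float:
    """Share of total variance along the dominant eigenvector (assumed metric)."""
    a, b, c = covariance(new, new), covariance(new, orig), covariance(orig, orig)
    if a + c == 0:
        return 0.0  # both segments are constant; nothing to compare
    x, y = dominant_eigenvector(a, b, c)
    rayleigh = a * x * x + 2 * b * x * y + c * y * y
    return rayleigh / (a + c)


SECOND_THRESHOLD = 0.9  # hypothetical value of the second threshold
k = 1 / math.sqrt(3)
result_same = similarity_result([-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])
result_uncorrelated = similarity_result([-1.0, 0.0, 1.0], [k, -2 * k, k])
should_integrate = result_same > SECOND_THRESHOLD
```

For segments of equal variance this metric runs from about 0.5 (uncorrelated) to 1.0 (perfectly correlated or anti-correlated), so a threshold near the top of that range admits only close matches.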
The instructions further include: acquiring a record of the new input data segment integrated into the database; generating a quality governance report from the record; and storing the quality governance report in a log.
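A minimal sketch of the record, report, and log steps follows. The record fields (`segment_id`, `similarity`), the report layout, and the logger name are assumptions; the specification does not define what the quality governance report contains.

```python
import json
import logging
from typing import Any, Dict, List

logger = logging.getLogger("data_quality_governance")  # hypothetical logger name


def build_report(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize the integration records into a small report dictionary."""
    similarities = [r["similarity"] for r in records]
    return {
        "integrated_count": len(records),
        "segments": [r["segment_id"] for r in records],
        "mean_similarity": sum(similarities) / len(similarities) if records else None,
    }


def store_report(report: Dict[str, Any]) -> str:
    """Serialize the report as one JSON line and write it to the log."""
    line = json.dumps(report, sort_keys=True)
    logger.info(line)
    return line


# Hypothetical records of segments that were integrated into the database.
records = [
    {"segment_id": "b", "similarity": 0.95},
    {"segment_id": "c", "similarity": 0.85},
]
report = build_report(records)
log_line = store_report(report)
```

Writing the report as a single JSON line keeps each log entry machine-readable, which may help if the reports are later aggregated.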
Based on the same idea, the exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which a program product capable of implementing the above-described method of the present specification is stored. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the terminal device.
Referring to FIG. 9, a program product 800 for implementing the above method according to an exemplary embodiment of the present disclosure is described. It may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code, and may be run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto. In this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
From the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the exemplary embodiments of the present disclosure.
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to exemplary embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A data quality governance method, characterized in that the method comprises:
extracting a new input data segment of a new input database;
disassembling the feature points contained in the new input data segment by using unit variance, comprising: subtracting the mean of the data segment from the data in the data segment to obtain a difference value, and dividing the difference value by the standard deviation of the data segment to obtain the feature points of the data segment;
performing similar feature matching on the feature points of the disassembled new input data segment and the feature points of the original data segment of the database, wherein the similar feature matching comprises calculating the covariance of the feature points of the new input data segment and the feature points of the original data segment of the database, and determining the result of the similar feature matching according to the absolute value of the covariance;
obtaining the similarity level of the new input data segment according to the result of the similar feature matching;
extracting the new input data segment with the similarity level higher than a first threshold value;
calculating the new input data segment with the similarity level higher than the first threshold value by a data similarity algorithm, wherein the calculation comprises calculating the covariance of the feature points of the new input data segment with the similarity level higher than the first threshold value and the feature points of the original data segment, forming a covariance matrix from the obtained covariance, calculating the eigenvector of the covariance matrix, and obtaining a data similarity result according to the eigenvector;
and, according to the data similarity result, integrating the new input data segment with the similarity higher than a second threshold value into the database.
2. The data quality governance method of claim 1, wherein said extracting new input data segments of a new input database comprises:
detecting in real time whether a new input data segment has been entered into the database;
and, when a new input data segment is detected, extracting and marking the new input data segment.
3. The data quality governance method of claim 1, wherein the disassembling of the feature points of the new input data segment comprises: disassembling and extracting the feature points of the new input data segment by using a splitting algorithm.
4. The data quality governance method according to claim 1, wherein the performing similar feature matching on the feature points of the disassembled new input data segment and the feature points of the original data segment of the database comprises:
performing similar feature matching on the feature points of the disassembled new input data segment and a plurality of feature points of the original data segment in the database;
and sorting the similar feature matching results, and arranging the similar feature matching results according to the sequence from higher matching degree to lower matching degree.
5. The data quality governance method according to claim 1, wherein the obtaining the similarity level of the new input data segment according to the result of the similar feature matching comprises:
analyzing the result of the similar feature matching;
and determining the similarity level between the new input data segment and the original data segments in the database according to the differing matching degrees of the similar features of the new input data segment and the plurality of original data segments in the database.
6. The data quality governance method according to claim 1, wherein said integrating the new input data segment with a similarity above a second threshold into the database comprises: integrating, one by one, the bytes that are similar between the original data segment in the database and the new input data segment.
7. The data quality governance method according to claim 1, further comprising:
acquiring a record of the new input data segment integrated into the database;
generating a quality governance report from the record;
storing the quality governance report in a log.
8. A data quality governance device, comprising:
the identification module is used for extracting a new input data segment of a new input database;
the first data processing module is used for disassembling the feature points contained in the new input data segment by using unit variance, comprising: subtracting the mean of the data segment from the data in the data segment to obtain a difference value, and dividing the difference value by the standard deviation of the data segment to obtain the feature points of the data segment;
the feature point matching module is used for performing similar feature matching on the feature points of the disassembled new input data segment and the feature points of the original data segment of the database, comprising calculating the covariance of the feature points of the new input data segment and the feature points of the original data segment of the database, and determining the result of the similar feature matching according to the absolute value of the covariance;
the analysis module is used for obtaining the similarity level of the new input data segment according to the result of the similar feature matching;
the extracting module is used for extracting the new input data segment with the similarity level higher than a first threshold value;
the second data processing module is used for calculating the new input data segment with the similarity level higher than the first threshold value through a data similarity algorithm, comprising calculating the covariance of the feature points of the new input data segment with the similarity level higher than the first threshold value and the feature points of the original data segment, forming a covariance matrix from the obtained covariance, calculating the eigenvector of the covariance matrix, and obtaining a data similarity result according to the eigenvector;
and the integration module is used for integrating the new input data segment with the similarity higher than a second threshold into the database according to the data similarity result.
9. A data quality governance device, comprising:
one or more processors;
a storage device to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the data quality governance method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a data quality governance method according to any one of claims 1 to 7.
CN202111203090.0A 2021-10-15 2021-10-15 Data quality control method, device, equipment and storage medium Active CN113656395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111203090.0A CN113656395B (en) 2021-10-15 2021-10-15 Data quality control method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111203090.0A CN113656395B (en) 2021-10-15 2021-10-15 Data quality control method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113656395A CN113656395A (en) 2021-11-16
CN113656395B CN113656395B (en) 2022-03-15

Family

ID=78494578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111203090.0A Active CN113656395B (en) 2021-10-15 2021-10-15 Data quality control method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113656395B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6816848B1 (en) * 2000-06-12 2004-11-09 Ncr Corporation SQL-based analytic algorithm for cluster analysis
CN105989174B (en) * 2015-03-05 2019-11-01 欧姆龙株式会社 Region-of-interest extraction element and region-of-interest extracting method
JP6521053B2 (en) * 2015-03-06 2019-05-29 富士通株式会社 Search program, search method and search device
CN108509771B (en) * 2018-03-27 2020-12-22 华南理工大学 Multi-group chemical data association relation discovery method based on sparse matching
JP2021140694A (en) * 2020-03-09 2021-09-16 富士フイルムビジネスイノベーション株式会社 Information management apparatus and information management program
CN111581298B (en) * 2020-04-29 2023-11-14 北华航天工业学院 Heterogeneous data integration system and method for large data warehouse
CN113312673A (en) * 2020-12-01 2021-08-27 李孔雀 Data processing method applied to big data and big data server

Also Published As

Publication number Publication date
CN113656395A (en) 2021-11-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant