CN114693144A - Interface data quality determination method, device and system - Google Patents

Interface data quality determination method, device and system

Publication number
CN114693144A
Authority
CN
China
Prior art keywords
data
clustering
interface
category
categories
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210361932.3A
Other languages
Chinese (zh)
Inventor
王昊达
李冉
陈震宇
刘国华
李少波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Postal Savings Bank of China Ltd
Original Assignee
Postal Savings Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Postal Savings Bank of China Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395 Quality analysis or management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis

Abstract

The application provides a method, a device and a system for determining interface data quality. The method comprises: acquiring interface data quality data of a plurality of interfaces in a plurality of time periods; clustering the interfaces of each time period to obtain a plurality of first clustering results, wherein one first clustering result comprises a plurality of categories and the category centers corresponding to the categories, and the sum of squared errors between any group of interface data quality data and the corresponding category center is less than a first predetermined threshold; determining the category corresponding to each interface according to the plurality of first clustering results; clustering the interface data quality data of each category to obtain a plurality of second clustering results, wherein one second clustering result comprises a plurality of data piles and the data pile centers corresponding to the data piles, and the sum of squared errors between any group of interface data quality data and the corresponding data pile center is less than a second predetermined threshold; and obtaining a score for each data pile. The method addresses the high degree of human intervention in interface data quality scoring in the prior art.

Description

Interface data quality determination method, device and system
Technical Field
The present application relates to the field of software development technologies, and in particular, to a method, an apparatus, a processor, a computer-readable storage medium, and a system for determining interface data quality.
Background
In both industry and academia, most existing methods for scoring data quality at the interface level score each index category separately and then aggregate the results. A current evaluation method is briefly introduced as follows: (1) The automatic scoring module groups data quality problems under three primary indexes: accuracy, integrity and timeliness. The accuracy index can be refined into secondary indexes such as correctness, accuracy, uniqueness, validity and consistency; the integrity index can be refined into secondary indexes such as data record filling rate, data table filling rate and data item filling rate; and the timeliness index can be refined into secondary indexes such as data timeliness and standard timeliness. A third-level index is an evaluation point obtained by further refining a secondary index; it is usually strongly tied to the business and easy to measure, for example key data item 1, key data item 2, etc. under the correctness index. (2) There are m primary measurement indexes, and each primary index has n secondary measurement indexes. The comprehensive measurement is calculated as follows:
$$\mathrm{CER} = \sum_{i=1}^{m} \omega_i \sum_{j=1}^{n} \omega_{ij} C_{ij}$$
where CER represents the comprehensive measurement result, $\omega_i$ represents the weight of the i-th primary index, $\omega_{ij}$ represents the weight of the j-th secondary index of the i-th primary index, and $C_{ij}$ represents the comprehensive result of the j-th secondary index of the i-th primary index. (3) The data quality score is calculated as follows: when a problem occurs in an index, part or all of the points allotted to that index's weight are deducted according to factors such as the urgency and importance of the problem, the quantity of problem data or the problem occurrence rate, and the delay time.
The scoring formulas currently in use are:
Interface score (daily) = 100 - the sum of the points deducted across all indexes.
System score (daily) = the sum of the scores of all interfaces in the system / the number of interfaces in the system.
Final interface/system score = interface/system score + adjustment score.
(Adjustment items include, but are not limited to, the timeliness of problem resolution, response time, degree of cooperation, etc.)
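The prior-art scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the comprehensive measurement is taken to be a plain weighted sum of secondary-index results, the interface score is taken to be 100 minus the total points deducted, and all function and parameter names are hypothetical rather than taken from the application.

```python
from typing import List, Sequence


def comprehensive_metric(
    primary_weights: Sequence[float],
    secondary_weights: Sequence[Sequence[float]],
    results: Sequence[Sequence[float]],
) -> float:
    """Weighted sum over primary and secondary indexes (assumed form of CER)."""
    return sum(
        w_i * sum(w_ij * c_ij for w_ij, c_ij in zip(ws, cs))
        for w_i, ws, cs in zip(primary_weights, secondary_weights, results)
    )


def interface_daily_score(deductions: Sequence[float]) -> float:
    """100 minus the points deducted across all indexes (assumed reading)."""
    return 100 - sum(deductions)


def system_daily_score(interface_scores: List[float]) -> float:
    """Average of the daily scores of all interfaces in the system."""
    return sum(interface_scores) / len(interface_scores)


def final_score(base_score: float, adjustment: float) -> float:
    """Interface or system score plus a manually chosen adjustment score."""
    return base_score + adjustment
```

Every weight, deduction and adjustment in this sketch is chosen by hand, which is the source of the subjectivity the application sets out to reduce.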
The disadvantages of the existing scoring methods are summarized as follows:
(1) The rules are set arbitrarily by individuals, so the scoring is highly subjective; for example, the consistency index rule's requirements on the quantity and proportion of problem data are quite limited.
(2) Adaptability is poor and there is insufficient consensus on the scoring results. For example, because the upstream and downstream systems do not take part in the scoring work, it is often difficult to convince either of them of the result.
(3) The scoring mode is uniform: the same index scores every system with the same rules and parameters, with no individualization or customization, so the results are coarse.
(4) The scoring workload is heavy and the collected data require considerable manual intervention, so objectivity and rationality need to be improved.
In view of these shortcomings, and in order to improve the comprehensiveness, accuracy and objectivity of scoring, increase the credibility of the scores in communication between upstream and downstream systems, and reduce the difficulty of scoring, a new method based on cluster analysis is proposed for evaluating the data quality of interfaces.
The above information disclosed in this background section is only for enhancement of understanding of the background of the technology described herein and, therefore, certain information may be included in the background that does not form the prior art that is already known in this country to a person of ordinary skill in the art.
Disclosure of Invention
The application mainly aims to provide a method, a device, a processor, a computer-readable storage medium and a system for determining interface data quality, so as to solve the problem of high human intervention degree of interface data quality scoring in the prior art.
According to an aspect of the embodiments of the present invention, there is provided a method for determining quality of interface data, the method including: acquiring interface data quality data of a plurality of interfaces in a plurality of time periods, wherein in one time period, one interface corresponds to one group of interface data quality data, and the group of interface data quality data comprises accuracy data, integrity data and timeliness data of the interface in that time period; clustering the interfaces of each time period to obtain a plurality of first clustering results, wherein the first clustering results correspond to the time periods one to one, one first clustering result comprises a plurality of categories and category centers corresponding to the categories, the sum of squared errors between any group of interface data quality data and the corresponding category center is less than a first predetermined threshold, one category comprises at least one interface, and the number of categories and the names of the categories are the same for any two first clustering results; determining the category corresponding to each interface according to the plurality of first clustering results, wherein the category corresponding to an interface is the category that the interface falls into most often, counted by category name; clustering the interface data quality data of each category to obtain a plurality of second clustering results, wherein the second clustering results correspond to the categories one to one, one second clustering result comprises a plurality of data piles and data pile centers corresponding to the data piles, one data pile comprises at least one group of interface data quality data, the sum of squared errors between any group of interface data quality data and the corresponding data pile center is less than a second predetermined threshold, and the number of data piles is the same for each second clustering result; and obtaining a score for each of the data piles.
Optionally, clustering the interfaces of each time period to obtain a plurality of first clustering results, including: a first clustering step, selecting a first clustering number, wherein the first clustering number is the number of the selected categories, clustering the interface of a target time period for multiple times to obtain a plurality of first preliminary clustering results, clustering each time into a first clustering number of the categories, and the target time period is any time period of the time periods; a first determining step of determining that the first preliminary clustering result is the first clustering result corresponding to the target time period when the first preliminary clustering result meets a first clustering termination condition, wherein the first clustering termination condition is that the positions of category centers corresponding to all the categories after clustering are not changed; and repeating the first clustering step and the first determining step until all the interface clustering of the time period is finished, and obtaining a plurality of first clustering results.
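The first clustering step and its termination condition can be illustrated with a small K-Means loop. This is a minimal sketch, not the application's implementation: it assumes each interface is represented by a numeric (accuracy, integrity, timeliness) tuple, picks the initial centers at random, and stops when no category center moves. The names `k_means`, `squared_distance` and `mean_point` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty set of records."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points: Sequence[Point], k: int, seed: int = 0,
            max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Cluster points into k categories; stop when centers no longer move."""
    rng = random.Random(seed)
    centers = rng.sample(list(points), k)  # assumed: random initial centers
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every record to its nearest category center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centers.append(mean_point(members) if members else centers[c])
        if new_centers == centers:  # termination: centers unchanged
            break
        centers = new_centers
    return labels, centers
```

Running this once per time period with the same `k` yields one first clustering result per period. Mapping the numeric labels to stable category names across periods is a separate step that this sketch does not cover.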
Optionally, before clustering each of the categories respectively to obtain a plurality of second clustering results, the method further includes: and deleting normal data in each category to obtain problem data in each category, wherein the normal data are interface data quality data with the accuracy data, the integrity data and the timeliness data of the interface data all meeting the interface data requirements, and the problem data are interface data quality data with at least one of the accuracy data, the integrity data and the timeliness data of the interface data not meeting the interface data requirements.
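A possible way to separate normal data from problem data before the second clustering is sketched below. The representation is an assumption made for illustration: each record is a dictionary of three ratios in which higher is better, and the "interface data requirements" are modeled as minimum thresholds. Neither the field names nor the threshold form comes from the application.

```python
from typing import Dict, List

Record = Dict[str, float]


def meets_requirements(record: Record, minimums: Record) -> bool:
    """True only if every metric reaches its (hypothetical) minimum."""
    return all(record[name] >= minimum for name, minimum in minimums.items())


def keep_problem_data(records: List[Record], minimums: Record) -> List[Record]:
    """Drop normal data; keep records that fail at least one requirement."""
    return [r for r in records if not meets_requirements(r, minimums)]
```

Only the records returned by `keep_problem_data` would then be passed to the second clustering.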
Optionally, the clustering each of the categories to obtain a plurality of second clustering results includes: a second clustering step of selecting a second cluster number, wherein the second cluster number is the number of the selected data piles of the categories, and clustering the interface data quality data of the target category for multiple times to obtain a plurality of second preliminary clustering results, each clustering is performed to form a second cluster number of the data piles, and the target category is any of the plurality of categories; a second determining step of determining, when the second preliminary clustering result satisfies a second clustering termination condition, that the second preliminary clustering result is the second clustering result corresponding to the target class, and the second clustering termination condition is that the position of the data pile center corresponding to each data pile is not changed; and repeating the second clustering step and the second determining step until all the interface data quality data clusters of the classes are finished, and obtaining a plurality of second clustering results.
Optionally, obtaining a score for each of the data piles comprises: obtaining a plurality of scores for each of the data piles; deleting the highest score and the lowest score from the plurality of scores of each data pile to obtain a plurality of groups of target scores, wherein one group of target scores corresponds to one data pile; and calculating the average of each group of target scores to obtain the score of each data pile.
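The score aggregation for one data pile, removing the extreme values and averaging the rest, can be sketched as follows. The function name `pile_score` and the requirement of at least three scores are assumptions made for this illustration.

```python
from typing import Sequence


def pile_score(expert_scores: Sequence[float]) -> float:
    """Average the scores after dropping one highest and one lowest value."""
    if len(expert_scores) < 3:
        raise ValueError("need at least three scores to drop both extremes")
    trimmed = sorted(expert_scores)[1:-1]  # remove lowest and highest
    return sum(trimmed) / len(trimmed)
```

For example, the scores 60, 80, 85, 90 and 100 leave 80, 85 and 90 after trimming, so the pile score is their average.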
Optionally, after obtaining the score for each of the data piles, the method further comprises: deleting all the interface data quality data of each data pile to obtain a plurality of first target data piles, wherein data pile centers corresponding to the first target data piles are in one-to-one correspondence with data pile centers corresponding to the data piles; acquiring the interface data quality data of at least one time period; calculating the sum of squared errors of the data pile centers corresponding to each interface data quality data and each first target data pile in each time period to obtain a plurality of groups of target distances, wherein one group of target distances correspond to one group of interface data quality data, one group of target distances comprise a plurality of target distances, and the target distances correspond to the first target data piles one by one; determining a plurality of minimum target distances according to the plurality of groups of target distances, wherein the minimum target distance is the minimum target distance in any group of target distances, and one minimum target distance corresponds to one group of interface data quality data; dividing each interface data quality data into a corresponding optimal first target data pile to obtain a plurality of second target data piles, wherein the optimal first target data pile is the first target data pile corresponding to the minimum target distance corresponding to the interface data quality data; a score is obtained for each of the second target data heaps.
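Assigning newly acquired records to the retained data pile centers can be sketched as a nearest-center lookup. This is an illustration only: it assumes records and centers are numeric tuples of equal length and uses the squared Euclidean distance as the per-record error. The names `assign_to_piles` and `squared_distance` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between a record and a pile center."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_piles(records: Sequence[Point],
                    pile_centers: Sequence[Point]) -> List[List[Point]]:
    """Place each new record in the emptied pile whose center is nearest."""
    piles: List[List[Point]] = [[] for _ in pile_centers]
    for record in records:
        distances = [squared_distance(record, c) for c in pile_centers]
        best = distances.index(min(distances))  # minimum target distance
        piles[best].append(record)
    return piles
```

Because the centers are reused, a new record can inherit the score already given to its pile without repeating the clustering.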
According to another aspect of the embodiments of the present invention, there is provided an apparatus for determining quality of interface data, the apparatus including: an acquisition unit, configured to acquire interface data quality data of a plurality of interfaces in a plurality of time periods, wherein in one time period, one interface corresponds to one group of interface data quality data, and the group of interface data quality data comprises accuracy data, integrity data and timeliness data of the interface in that time period; a first clustering unit, configured to cluster the interfaces of each time period to obtain a plurality of first clustering results, wherein the first clustering results correspond to the time periods one to one, one first clustering result comprises a plurality of categories and category centers corresponding to the categories, the sum of squared errors between any group of interface data quality data and the corresponding category center is less than a first predetermined threshold, one category comprises at least one interface, and the number of categories and the names of the categories are the same for any two first clustering results; a classification unit, configured to determine the category corresponding to each interface according to the plurality of first clustering results, wherein the category corresponding to an interface is the category that the interface falls into most often, counted by category name; a second clustering unit, configured to cluster the interface data quality data of each category to obtain a plurality of second clustering results, wherein the second clustering results correspond to the categories one to one, one second clustering result comprises a plurality of data piles and data pile centers corresponding to the data piles, one data pile comprises at least one group of interface data quality data, the sum of squared errors between any group of interface data quality data and the corresponding data pile center is less than a second predetermined threshold, and the number of data piles is the same for each second clustering result; and a scoring unit, configured to obtain the score of each data pile.
According to still another aspect of embodiments of the present invention, there is provided a computer-readable storage medium including a stored program, wherein the program performs any one of the methods.
According to a further aspect of the embodiments of the present invention, there is provided a processor for executing a program, wherein the program executes to perform any one of the methods.
According to an aspect of embodiments of the present invention there is provided a system for determining quality of interface data, comprising one or more processors, memory, display means and one or more programs, wherein the one or more programs are stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for performing any one of the methods, and the display means is for presenting expert scores for each of the data heaps.
In the embodiment of the present invention, the method for determining the quality of the interface data first acquires interface data quality data of a plurality of interfaces in a plurality of time periods, wherein in one time period, one interface corresponds to one group of interface data quality data, and the group of interface data quality data comprises accuracy data, integrity data and timeliness data of the interface in that time period. The interfaces of each time period are then clustered to obtain a plurality of first clustering results, wherein the first clustering results correspond to the time periods one to one, one first clustering result comprises a plurality of categories and category centers corresponding to the categories, the sum of squared errors between any group of interface data quality data and the corresponding category center is less than a first predetermined threshold, one category comprises at least one interface, and the number of categories and the names of the categories are the same for any two first clustering results. Next, the category corresponding to each interface is determined according to the plurality of first clustering results, wherein the category corresponding to an interface is the category that the interface falls into most often, counted by category name. The interface data quality data of each category are then clustered to obtain a plurality of second clustering results, wherein the second clustering results correspond to the categories one to one, one second clustering result comprises a plurality of data piles and data pile centers corresponding to the data piles, one data pile comprises at least one group of interface data quality data, the sum of squared errors between any group of interface data quality data and the corresponding data pile center is less than a second predetermined threshold, and the number of data piles is the same for each second clustering result. Finally, a score is obtained for each data pile.
In this method, the interfaces are clustered in each time period so that interfaces whose interface data quality data are highly similar in that time period fall into the same category; the interface data quality data in each category are then clustered so that highly similar data in each category fall into the same data pile; and expert scores are obtained for each data pile. Because the interface data quality data come from real data, the scoring is more realistic. Instead of scoring every group of interface data quality data, only the data piles in each category are scored, which reduces the scoring workload and manual intervention and makes the scoring more objective, thereby solving the problem of the high degree of human intervention in interface data quality scoring in the prior art.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 shows a flow diagram of a method of determining interface data quality according to an embodiment of the present application;
FIG. 2 illustrates a flow chart of a method of determining interface data quality according to a particular embodiment of the present application;
FIG. 3 illustrates a flow chart of a method of determining interface data quality according to another particular embodiment of the present application;
FIG. 4 illustrates a flow chart of a method of determining interface data quality according to yet another particular embodiment of the present application;
FIG. 5 shows a schematic diagram of an interface data quality determination apparatus according to an embodiment of the present application.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
It will be understood that when an element such as a layer, film, region, or substrate is referred to as being "on" another element, it can be directly on the other element or intervening elements may also be present. Also, in the specification and claims, when an element is described as being "connected" to another element, the element may be "directly connected" to the other element or "connected" to the other element through a third element.
For convenience of description, some terms or expressions referred to in the embodiments of the present application are explained below:
Data: symbols that record and distinguish objective events; a physical symbol, or a combination of physical symbols, that describes the nature, state and interrelationships of objective things. In computer science, data is the general term for all symbolic media that can be input into a computer and processed by a computer program, including numbers, letters, symbols, analog quantities and the like that carry meaning and are input into an electronic computer for processing; that is, data is the carrier of information after it has been collected and stored by electronic equipment. Data is increasingly becoming the most valuable asset and the most decisive factor for financial institutions and enterprises.
Data quality: the degree to which data conform to objective reality, and also the degree to which data meet usage requirements and deliver value in use.
Data quality management: a series of management activities, such as identification, measurement, monitoring and early warning, directed at the data quality problems that may arise at each stage of the data life cycle (planning, acquisition, storage, sharing, maintenance, application and retirement), with data quality further improved by raising the management level of the organization.
Feature: in machine learning, a feature is an independently observable property or characteristic of an observed object. Features are generally numerical rather than textual or of other forms, mainly to facilitate processing and statistical analysis. A feature should be informative, discriminative and independent.
Cluster analysis: as the name suggests, cluster analysis studies a classification method based on the characteristics of a large number of data samples, classifies the data reasonably according to that method, and finally places similar data in one group, so that data are similar within a class and different between classes. Clustering is not classification: classification assigns items according to an existing standard or model, so the basis for division is already known and one only needs to judge which class an item satisfies; in clustering the division criteria are not known in advance, and an algorithm is relied on to analyze similarity and put similar data together.
K-Means clustering algorithm: a simple iterative cluster analysis algorithm that uses distance as the similarity measure to find K classes in a given data set; the center of each class is the mean of all values in the class, and each class is described by its cluster center.
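For reference, the sum of squared errors that K-Means iteratively reduces can be written as below. The symbols K, C_k, \mu_k and x are notation introduced here for illustration and do not appear in the original text; how exactly the application's predetermined thresholds are applied to this quantity is not spelled out in this section.

```latex
\mathrm{SSE} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x \in C_k} x
```

Here x would be one group of interface data quality data (for instance an accuracy, integrity and timeliness vector), C_k the k-th class, and \mu_k its center. The contribution of a single record to the total is \lVert x - \mu_k \rVert^2.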
Mean: also called the arithmetic mean; the quotient of the sum of all data divided by the number of data items. It represents the overall level of a set of data, is a measure of central tendency, and is the representative value with the smallest error. It is also simple and intuitive and can be used in further algebraic operations. The arithmetic mean is therefore the most widely applicable measure of central tendency and can effectively reflect the overall level of a batch of data. However, it has an inherent drawback: it is extremely sensitive to the data, because computing it requires every value in the set, and a change in any value changes the mean. Therefore, when the arithmetic mean is used in scoring, the extreme values in the data set, i.e., the highest and lowest values, should be removed. This avoids the influence of abnormal values on the average score while still reflecting the judges' evaluation tendencies, and is a reasonable scheme.
As mentioned in the Background, interface data quality scoring in the prior art involves a high degree of human intervention. To solve this problem, an exemplary embodiment of the present application provides a method, an apparatus, a processor, a computer-readable storage medium, and a system for determining the interface data quality.
According to an embodiment of the application, a method for determining interface data quality is provided.
Fig. 1 is a flowchart of a method for determining interface data quality according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
step S101, acquiring interface data quality data of a plurality of interfaces in a plurality of time periods, wherein in one time period, one interface corresponds to a group of interface data quality data, and the group of interface data quality data comprises accuracy data, integrity data and timeliness data of the interface in one time period;
step S102, clustering the interfaces of each time period to obtain a plurality of first clustering results, wherein the first clustering results correspond to the time periods one by one, one first clustering result comprises a plurality of categories and category centers corresponding to the categories, the sum of squares of errors between any one group of interface data quality data and the corresponding category centers is smaller than a first preset threshold, one category comprises at least one interface, and the number of the categories and the names of the categories of any two first clustering results are the same;
step S103 of determining the category corresponding to each of the interfaces according to the plurality of first clustering results, where the category corresponding to the interface is the category that appears most frequently among the categories having the same name;
step S104, clustering the interface data quality data of each category to obtain a plurality of second clustering results, wherein the second clustering results correspond to the categories one by one, one second clustering result comprises a plurality of data piles and data pile centers corresponding to the data piles, one data pile comprises at least one group of interface data quality data, the sum of the square errors of any group of interface data quality data and the corresponding data pile center is less than a second preset threshold, and the number of the data piles of each second clustering result is the same;
step S105, acquiring a score for each of the data piles.
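Step S103 amounts to a majority vote over the per-period category labels of each interface. The sketch below assumes the labels from the first clustering results have already been mapped to consistent category names across time periods. The function name `majority_category` and the sample names in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def majority_category(labels_by_period: Dict[str, List[str]]) -> Dict[str, str]:
    """Pick, for each interface, the category name it received most often.

    Ties are resolved in favor of the name seen first, which is how
    Counter.most_common orders equal counts; this tie rule is an assumption.
    """
    return {
        interface: Counter(labels).most_common(1)[0][0]
        for interface, labels in labels_by_period.items()
    }
```

For instance, an interface labeled "good", "good", "poor" over three periods would be assigned to "good".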
In the embodiment of the present invention, the method for determining the quality of the interface data first acquires interface data quality data of a plurality of interfaces in a plurality of time periods, wherein in one time period, one interface corresponds to one group of interface data quality data, and the group of interface data quality data comprises accuracy data, integrity data and timeliness data of the interface in that time period. The interfaces of each time period are then clustered to obtain a plurality of first clustering results, wherein the first clustering results correspond to the time periods one to one, one first clustering result comprises a plurality of categories and category centers corresponding to the categories, the sum of squared errors between any group of interface data quality data and the corresponding category center is less than a first predetermined threshold, one category comprises at least one interface, and the number of categories and the names of the categories are the same for any two first clustering results. Next, the category corresponding to each interface is determined according to the plurality of first clustering results, wherein the category corresponding to an interface is the category that the interface falls into most often, counted by category name. The interface data quality data of each category are then clustered to obtain a plurality of second clustering results, wherein the second clustering results correspond to the categories one to one, one second clustering result comprises a plurality of data piles and data pile centers corresponding to the data piles, one data pile comprises at least one group of interface data quality data, the sum of squared errors between any group of interface data quality data and the corresponding data pile center is less than a second predetermined threshold, and the number of data piles is the same for each second clustering result. Finally, a score is obtained for each data pile.
In this method, the interfaces are clustered in each time period so that interfaces whose interface data quality data are highly similar in that time period fall into the same category; the interface data quality data in each category are then clustered so that highly similar data in each category fall into the same data pile; and expert scores are obtained for each data pile. Because the interface data quality data come from real data, the scoring is more realistic. Instead of scoring every group of interface data quality data, only the data piles in each category are scored, which reduces the scoring workload and manual intervention and makes the scoring more objective, thereby solving the problem of the high degree of human intervention in interface data quality scoring in the prior art.
It should be noted that obtaining the interface data quality data of a plurality of interfaces in a plurality of time periods specifically requires obtaining the interface data quality data of each interface for at least two months. As shown in fig. 2, the accuracy data of the interface data includes correctness data, accuracy data, uniqueness data, validity data, and consistency data. The correctness data reflects the degree to which the content of the interface data correctly describes the logical relationships between the data, and covers: the last tag in a file not being found, an illegal tag in a file, a mismatched number of pages in a file, a mismatched total number of records in a file, a mismatched number of rows, a mismatched number of columns, a mismatched number of data files, an XML error in the data file suffix, and mismatched field order or names. The accuracy data within this group reflects whether key field indexes, data accuracy, and the like meet requirements, and covers: whether the number of key fields, the field length, and the data accuracy meet requirements, a mismatched number of fields in a file, and a field length in a file (too long or too short) that is inconsistent with the table structure. The uniqueness data reflects data repetition in each data table, and covers fully duplicated records and records with duplicated primary keys in each data table. The validity data reflects whether the data content meets the standard requirements, and covers: data item values of the related data tables falling outside the value range, failure to meet the technical specification of the data standard, code values out of range, and invalid characters. The consistency data reflects whether the data volume of the self-owned platform is consistent with that of the production system, and covers inconsistency between the data volume of the big data platform and that of the source business system, and insufficient referential integrity. The timeliness data reflects the degree to which the data content is available for use at the required time node, and covers the actual receiving time of the data being later than the time at which it was due to be received. The integrity data reflects the vacancy rate of key indexes and the data filling rate, and covers the vacancy rate of key indexes in each data table and the data filling rate in each data table.
It should be noted that clustering algorithms such as K-Means, K-medoids, K-modes, CLARA, and PAM may be used both to cluster the interfaces of each time period and to cluster the interface data quality data of each category.
It should be further noted that the advantages of the above method for determining the quality of interface data include:
Improved quality: the evaluations of different experts across multiple dimensions are combined, and the result is obtained through a reasonable way of summarizing and integrating them, so the scoring is more reasonable; source system, downstream, and management personnel can all participate, follow the scoring process clearly, and understand the final result.
Increased objectivity and authenticity: the acquired interface data quality data of the plurality of interfaces in the plurality of time periods come from real data, with no processing, modification, or statistical alteration, which reduces manual intervention; the clustering method finally places interface data quality data with highly similar features in the same data pile, which increases the authenticity and accuracy of the scoring; and new dimensions can be added to the interface data quality data in order to expand the target range.
Reduced operating difficulty: the scoring is simple and easy to carry out. Each expert still needs to score every data pile, but clustering makes the differences between data piles obvious, so scoring is not difficult. The method therefore reduces not only the workload but also the scoring difficulty and human intervention, reflects the objectivity, accuracy, and comprehensiveness of the data quality evaluation, and reduces the subjectivity and error rate of the evaluation to a certain extent.
New functions: scoring can be performed for a single system, and experts can score in a targeted manner after all the data piles are presented to them, which avoids the poor fit that arises when one set of formulas is applied to all systems; the collected index features are more numerous and more detailed, and the evaluation items are refined, so the result is more accurate.
Improved universality: the method is suitable for evaluating the interface data quality of upstream and downstream systems and can also serve as a reference in other data quality evaluation scenarios; it is suitable for the banking industry and can also be used directly, or drawn on for ideas, by other financial institutions; and it can be used in other industries, for example in business scenarios involving data quality evaluation in the IT industry, the consulting industry, and the like.
It should also be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that presented herein.
In an embodiment of the application, clustering the interfaces of each time period to obtain a plurality of first clustering results includes: a first clustering step of selecting a first cluster number, the first cluster number being the selected number of categories, and clustering the interfaces of a target time period multiple times to obtain a plurality of first preliminary clustering results, where each clustering produces the first cluster number of categories and the target time period is any one of the time periods; a first determining step of determining, when a first preliminary clustering result satisfies a first clustering termination condition, that this first preliminary clustering result is the first clustering result corresponding to the target time period, the first clustering termination condition being that the positions of the category centers corresponding to all the categories no longer change after clustering; and repeating the first clustering step and the first determining step until the interfaces of all the time periods have been clustered, thereby obtaining the plurality of first clustering results. In this embodiment, as shown in fig. 3, take as an example clustering the interface data quality data of 100 interfaces over 60 days. In the first clustering step, the first cluster number is set to 20, and a K-Means clustering algorithm is used to cluster the 100 interfaces of any one day of the 60 days (the target time period) multiple times, each clustering dividing the 100 interfaces into 20 categories. In the first determining step, when the position of the category center of each category in a first preliminary clustering result is unchanged from the position of the corresponding category center in the first preliminary clustering result obtained by the previous clustering, the interfaces whose interface data quality data are highly similar among the 100 interfaces of that day are already in the same category. The first clustering step and the first determining step are repeated until the 100 interfaces of each of the 60 days have been clustered, so that on every one of the 60 days the 100 interfaces are divided into 20 categories with an average of 5 interfaces per category, and interfaces whose interface data quality data are highly similar on a given day fall into the same category.
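The first clustering step and its termination condition can be illustrated with a minimal K-Means sketch in plain Python. This is only an illustration under stated assumptions, not the implementation of the embodiment: the function names `sse` and `kmeans`, the `seed` and `max_iter` parameters, the random choice of initial centers, and the sample vectors are all hypothetical. Each point stands for one interface's group of interface data quality data on one day, represented as a numeric vector.

```python
import random


def sse(point, center):
    """Sum of squares of errors between one data vector and a center."""
    return sum((p - c) ** 2 for p, c in zip(point, center))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points into k categories; stop once no category center moves.

    Assumption: initial centers are k randomly sampled points. The source
    does not say how the initial centers are chosen.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment: each interface joins the category with the nearest center.
        labels = [min(range(k), key=lambda j: sse(p, centers[j])) for p in points]
        # Update: each center becomes the mean of its current members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # an empty category keeps its center
        # Termination condition: positions of all category centers are unchanged.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical one-day data: six interfaces, two quality features each.
day_points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels, centers = kmeans(day_points, k=2)
```

In the 100-interface, 60-day example, a routine of this kind would be run once per day with `k=20`, giving one first clustering result per day.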
It should be noted that, when the interfaces in each time period are clustered, the interfaces are classified according to the characteristics of the interface data quality data themselves; individual abnormal values (such as online abnormalities) are ignored; the full data are used so that the normal state of each table is recorded and classified; the special fields added for clustering include basic table information such as the base table size, the size of the file transmitted on the day, and the number of rows; and the interface data quality data of each interface include both problem data and normal data, which ensures that each category can be divided as a whole.
It should be noted that, as shown in fig. 3, after the 100 interfaces of each of the 60 days have been clustered into 20 categories, the category corresponding to each interface is determined. First, the categories of each of the 60 days are numbered: one category of the first day is chosen at random and named A1; on each of days 2 to 60, the category that has the most interfaces in common with A1 is named A2, A3, ..., A60 respectively; the remaining categories of the first day are handled in the same way, giving B1, C1, ..., T1 through B60, C60, ..., T60, until every category has a serial number. All categories Ai (i ranging from 1 to 60) are defined as category A, and so on, giving the 20 categories A to T. Then, the number of times each interface occurs in each of the categories A to T is recorded, and each interface is assigned to the category in which it occurs most often. Once all interfaces have been classified in turn, interfaces whose interface data quality data are highly similar are in the same category among categories A to T.
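The naming and counting procedure can be sketched as follows. The sketch rests on assumptions that the source does not state: each day's clustering result is represented as a dictionary from a category label to the set of interface identifiers in it, the first day serves as the naming reference, and overlaps are matched greedily with ties broken by label order. The names `align_to_reference` and `majority_category` and the sample data are hypothetical.

```python
from collections import Counter


def align_to_reference(reference, day):
    """Map each category label of `day` to the reference name it overlaps most.

    Assumption: greedy one-to-one matching in the order of the reference
    categories; the source does not specify how ties are resolved.
    """
    mapping = {}
    unused = set(day)
    for ref_label, ref_members in reference.items():
        if not unused:
            break
        best = max(sorted(unused), key=lambda lab: len(day[lab] & ref_members))
        mapping[best] = ref_label
        unused.remove(best)
    return mapping


def majority_category(daily_categories):
    """Assign each interface to the category name in which it occurs most often."""
    reference = daily_categories[0]  # first-day names (A, B, ...) are the reference
    votes = {}
    for day in daily_categories:
        mapping = align_to_reference(reference, day)
        for label, members in day.items():
            for interface in members:
                votes.setdefault(interface, Counter())[mapping[label]] += 1
    return {iface: counts.most_common(1)[0][0] for iface, counts in votes.items()}


# Hypothetical three-day example with two categories per day.
days = [
    {"A": {"i1", "i2"}, "B": {"i3", "i4"}},
    {"x": {"i3", "i4"}, "y": {"i1", "i2"}},
    {"p": {"i1"}, "q": {"i2", "i3", "i4"}},
]
final = majority_category(days)
```

Here `i2` falls into category B on the third day but into category A on the other two days, so it is assigned to category A.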
In an embodiment of the application, before clustering each of the categories to obtain a plurality of second clustering results, the method further includes: deleting the normal data in each category to obtain the problem data in each category, where the normal data are the interface data quality data whose accuracy data, integrity data, and timeliness data all meet the interface data requirements, and the problem data are the interface data quality data for which at least one of the accuracy data, the integrity data, and the timeliness data does not meet the interface data requirements. In this embodiment, as shown in fig. 4, only the problem data in each category need to be scored. The normal data in each category are therefore deleted before the interface data quality data of the category are clustered, leaving only the problem data, so that the problem data can be scored more accurately and reasonably.
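A minimal sketch of this filtering step is given below. The record layout is an assumption made for illustration: the `QualityRecord` class and its three boolean fields, which state whether each dimension meets the interface data requirements, do not come from the source.

```python
from dataclasses import dataclass


@dataclass
class QualityRecord:
    """Hypothetical record: one group of interface data quality data."""
    interface_id: str
    accuracy_ok: bool    # accuracy data meet the interface data requirements
    integrity_ok: bool   # integrity data meet the interface data requirements
    timeliness_ok: bool  # timeliness data meet the interface data requirements


def is_problem(record):
    """Problem data: at least one of the three dimensions fails."""
    return not (record.accuracy_ok and record.integrity_ok and record.timeliness_ok)


def keep_problem_data(records):
    """Delete the normal data of a category and keep only the problem data."""
    return [r for r in records if is_problem(r)]


category_a = [
    QualityRecord("if-1", True, True, True),    # normal data, deleted
    QualityRecord("if-2", True, False, True),   # integrity problem, kept
    QualityRecord("if-3", False, True, False),  # two problems, kept
]
problems = keep_problem_data(category_a)
```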
In an embodiment of the present application, clustering the categories respectively to obtain a plurality of second clustering results includes: a second clustering step of selecting a second cluster number, the second cluster number being the selected number of data piles per category, and clustering the interface data quality data of a target category multiple times to obtain a plurality of second preliminary clustering results, where each clustering produces the second cluster number of data piles and the target category is any one of the plurality of categories; a second determining step of determining, when a second preliminary clustering result satisfies a second clustering termination condition, that this second preliminary clustering result is the second clustering result corresponding to the target category, the second clustering termination condition being that the position of the data pile center corresponding to each data pile no longer changes; and repeating the second clustering step and the second determining step until the interface data quality data of all the categories have been clustered, thereby obtaining the plurality of second clustering results. In this embodiment, as shown in fig. 4, take as an example clustering the interface data quality data of 100 interfaces over 60 days. After each interface has been classified, in the second clustering step the second cluster number is set to 15, and a K-Means clustering algorithm is used to cluster the interface data quality data of any one of the 20 categories A to T (the target category) multiple times, each clustering dividing the interface data quality data of the target category into 15 data piles. In the second determining step, when the position of the data pile center of each data pile in a second preliminary clustering result is unchanged from the position of the corresponding data pile center in the second preliminary clustering result obtained by the previous clustering, the interface data quality data with high similarity in the target category are already in the same data pile. The second clustering step and the second determining step are repeated until the interface data quality data of each of the 20 categories A to T have been clustered, each category being divided into 15 data piles, so that the interface data quality data with highly similar features in each of the 20 categories are placed in the same data pile.
In an embodiment of the present application, obtaining the score of each of the data piles includes: obtaining a plurality of scores for each data pile; deleting the highest score and the lowest score among the plurality of scores of each data pile to obtain a plurality of groups of target scores, where one group of target scores corresponds to one data pile; and calculating the average of each group of target scores to obtain the score of each data pile. In this embodiment, taking as an example clustering the interface data quality data of 100 interfaces over 60 days, each expert scores the 300 data piles (15 data piles in each of the 20 categories). Once the experts have scored each data pile, the final score of a data pile is determined by removing the highest score and the lowest score among its plurality of scores and calculating the average of the remaining target scores.
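The scoring rule above amounts to a trimmed average. A minimal sketch follows; the function name `pile_score` and the requirement of at least three scores (so that something remains after trimming) are assumptions added for illustration.

```python
def pile_score(scores):
    """Final score of one data pile: drop one highest and one lowest expert
    score, then average the remaining target scores."""
    if len(scores) < 3:
        # Assumption: at least three experts are needed so that at least one
        # score remains after trimming; the source does not state a minimum.
        raise ValueError("need at least three expert scores")
    target_scores = sorted(scores)[1:-1]
    return sum(target_scores) / len(target_scores)


# Five hypothetical expert scores on the 10-point scale.
print(pile_score([6, 8, 9, 7, 10]))  # 8.0
```

If several experts give the same highest or lowest score, only one copy of each is removed in this sketch.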
It should be noted that the experts are experts in the field of interface data quality evaluation and must be selected in accordance with expert invitation rules, which include a quantity rule, a range rule, and a mode rule. The quantity rule is that, in theory, the more experts with scoring qualifications (for example, meeting requirements on years of work experience, certificates, and the like), the better. The range rule is that the experts include personnel of the source system to which the data belong, data application team personnel, data quality and management team personnel, business personnel, and the like. The mode rule is that the full score is 10 points and that each expert evaluates the interface data quality data from multiple angles and at multiple levels based on professional knowledge, work experience, industry consensus, and the like. To make the scoring rule objective, the final score is determined by removing the highest score and the lowest score among the plurality of scores of each data pile and averaging the remaining scores.
In an embodiment of the application, after obtaining the score of each of the data piles, the method further includes: deleting all the interface data quality data of each data pile to obtain a plurality of first target data piles, where the data pile centers corresponding to the first target data piles are in one-to-one correspondence with the data pile centers corresponding to the data piles; acquiring the interface data quality data of at least one time period; calculating the sum of squares of errors between each group of interface data quality data of each time period and the data pile center corresponding to each first target data pile to obtain a plurality of groups of target distances, where one group of target distances corresponds to one group of interface data quality data, one group of target distances includes a plurality of target distances, and the target distances correspond to the first target data piles one to one; determining a plurality of minimum target distances from the plurality of groups of target distances, where a minimum target distance is the smallest target distance in a group of target distances and corresponds to one group of interface data quality data; assigning each group of interface data quality data to its optimal first target data pile to obtain a plurality of second target data piles, where the optimal first target data pile is the first target data pile corresponding to the minimum target distance of that group of interface data quality data; and obtaining the score of each second target data pile.
In this embodiment, taking as an example clustering the interface data quality data of 100 interfaces over 60 days, all the interface data quality data of the 300 data piles are deleted to obtain 300 first target data piles, and the interface data quality data of at least one time period are acquired. The sum of squares of errors from each group of interface data quality data to the data pile center of each of the 300 data piles is calculated, giving 300 target distances for each group of interface data quality data; the smaller a target distance, the more similar the features of that group of interface data quality data are to the corresponding data pile center. Each group of interface data quality data is assigned to the first target data pile corresponding to its minimum target distance, so that it is placed in the data pile it most resembles. This yields a plurality of second target data piles, for which the experts' scores are then obtained.
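The assignment of newly acquired data to the emptied data piles can be sketched as a nearest-center search using the sum of squares of errors as the target distance. The function names, the pile identifiers such as `"A-1"`, and the sample vectors are hypothetical, and how scores are attached to the resulting second target data piles is left outside the sketch.

```python
def sse(point, center):
    """Sum of squares of errors between one data vector and a pile center."""
    return sum((p - c) ** 2 for p, c in zip(point, center))


def build_second_target_piles(new_records, pile_centers):
    """Place each new data vector in the first target data pile whose center
    gives the minimum target distance.

    `pile_centers` maps a (hypothetical) pile identifier to its center.
    """
    piles = {pile_id: [] for pile_id in pile_centers}  # emptied first target piles
    for record in new_records:
        best = min(pile_centers, key=lambda pid: sse(record, pile_centers[pid]))
        piles[best].append(record)
    return piles


# Two hypothetical pile centers and three newly acquired data vectors.
centers = {"A-1": [0.0, 0.0], "A-2": [10.0, 10.0]}
new_data = [[1.0, 0.5], [9.0, 9.5], [0.2, 0.1]]
second_piles = build_second_target_piles(new_data, centers)
```

In the running example there would be 300 centers, so each new group of interface data quality data is compared against 300 target distances.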
The embodiment of the present application further provides a device for determining quality of interface data, and it should be noted that the device for determining quality of interface data according to the embodiment of the present application may be used to execute the method for determining quality of interface data according to the embodiment of the present application. The following describes an apparatus for determining quality of interface data according to an embodiment of the present application.
Fig. 5 is a schematic diagram of an interface data quality determination apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:
an obtaining unit 10, configured to obtain interface data quality data of a plurality of interfaces in a plurality of time periods, where in one time period, one interface corresponds to a group of interface data quality data, and the group of interface data quality data includes accuracy data, integrity data, and timeliness data of the interface in one time period;
a first clustering unit 20, configured to cluster the interfaces of each time period to obtain a plurality of first clustering results, where the first clustering results correspond to the time periods one to one, one first clustering result includes a plurality of categories and category centers corresponding to the categories, a sum of squares of errors between any one group of the interface data quality data and the corresponding category center is smaller than a first predetermined threshold, one category includes at least one interface, and the number and the category name of the categories of any two first clustering results are the same;
a classification unit 30 configured to determine the category corresponding to each of the interfaces, which is the category having the largest number of occurrences among the categories having the same name, based on the plurality of first clustering results;
a second clustering unit 40, configured to cluster the interface data quality data of each of the categories to obtain a plurality of second clustering results, where the second clustering results are in one-to-one correspondence with the categories, one second clustering result includes a plurality of data piles and data pile centers corresponding to the data piles, one data pile includes at least one group of the interface data quality data, a sum of squares of errors between any one group of the interface data quality data and the corresponding data pile center is smaller than a second predetermined threshold, and the number of the data piles of each second clustering result is the same;
and a scoring unit 50 for obtaining a score for each of the data piles.
In this embodiment of the present invention, in the apparatus for determining the quality of interface data, the obtaining unit obtains interface data quality data of a plurality of interfaces in a plurality of time periods, where, in one time period, one interface corresponds to one group of interface data quality data, and the group includes accuracy data, integrity data, and timeliness data of the interface in that time period. The first clustering unit clusters the interfaces of each time period to obtain a plurality of first clustering results, where the first clustering results correspond to the time periods one to one, one first clustering result includes a plurality of categories and the category centers corresponding to the categories, the sum of squares of errors between any group of interface data quality data and the corresponding category center is smaller than a first predetermined threshold, one category includes at least one interface, and the number of categories and the category names are the same for any two first clustering results. The classification unit determines the category corresponding to each interface according to the plurality of first clustering results, the category corresponding to an interface being the one in which the interface occurs most frequently among the categories having the same name. The second clustering unit clusters the interface data quality data of each category to obtain a plurality of second clustering results, where the second clustering results correspond to the categories one to one, one second clustering result includes a plurality of data piles and the data pile centers corresponding to the data piles, one data pile includes at least one group of interface data quality data, the sum of squares of errors between any group of interface data quality data and the corresponding data pile center is smaller than a second predetermined threshold, and the number of data piles is the same for every second clustering result. The scoring unit obtains the score of each data pile.
By clustering the interfaces in each time period, the apparatus groups interfaces whose interface data quality data are highly similar into the same category; by clustering the interface data quality data within each category, it groups highly similar data into the same data pile; and expert scores are then obtained for each data pile. Because the interface data quality data come from real data, the scoring is more realistic. Because the experts score each data pile in each category rather than each individual group of interface data quality data, the scoring workload and the degree of manual intervention are reduced and the scoring is more objective, which solves the problem in the prior art that interface data quality scoring involves a high degree of manual intervention.
It should be noted that obtaining the interface data quality data of a plurality of interfaces in a plurality of time periods specifically requires obtaining the interface data quality data of each interface for at least two months. As shown in fig. 2, the accuracy data of the interface data includes correctness data, accuracy data, uniqueness data, validity data, and consistency data. The correctness data reflects the degree to which the content of the interface data correctly describes the logical relationships between the data, and covers: the last tag in a file not being found, an illegal tag in a file, a mismatched number of pages in a file, a mismatched total number of records in a file, a mismatched number of rows, a mismatched number of columns, a mismatched number of data files, an XML error in the data file suffix, and mismatched field order or names. The accuracy data within this group reflects whether key field indexes, data accuracy, and the like meet requirements, and covers: whether the number of key fields, the field length, and the data accuracy meet requirements, a mismatched number of fields in a file, and a field length in a file (too long or too short) that is inconsistent with the table structure. The uniqueness data reflects data repetition in each data table, and covers fully duplicated records and records with duplicated primary keys in each data table. The validity data reflects whether the data content meets the standard requirements, and covers: data item values of the related data tables falling outside the value range, failure to meet the technical specification of the data standard, code values out of range, and invalid characters. The consistency data reflects whether the data volume of the self-owned platform is consistent with that of the production system, and covers inconsistency between the data volume of the big data platform and that of the source business system, and insufficient referential integrity. The timeliness data reflects the degree to which the data content is available for use at the required time node, and covers the actual receiving time of the data being later than the time at which it was due to be received. The integrity data reflects the vacancy rate of key indexes and the data filling rate, and covers the vacancy rate of key indexes in each data table and the data filling rate in each data table.
It should be noted that clustering algorithms such as K-Means, K-medoids, K-modes, CLARA, and PAM may be used both to cluster the interfaces of each time period and to cluster the interface data quality data of each category.
It should be further noted that the advantages of the above method for determining the quality of interface data include:
Improved quality: the evaluations of different experts across multiple dimensions are combined, and the result is obtained through a reasonable way of summarizing and integrating them, so the scoring is more reasonable; source system, downstream, and management personnel can all participate, follow the scoring process clearly, and understand the final result.
Increased objectivity and authenticity: the acquired interface data quality data of the plurality of interfaces in the plurality of time periods come from real data, with no processing, modification, or statistical alteration, which reduces manual intervention; the clustering method finally places interface data quality data with highly similar features in the same data pile, which increases the authenticity and accuracy of the scoring; and new dimensions can be added to the interface data quality data in order to expand the target range.
Reduced operating difficulty: the scoring is simple and easy to carry out. Each expert still needs to score every data pile, but clustering makes the differences between data piles obvious, so scoring is not difficult. The method therefore reduces not only the workload but also the scoring difficulty and human intervention, reflects the objectivity, accuracy, and comprehensiveness of the data quality evaluation, and reduces the subjectivity and error rate of the evaluation to a certain extent.
New functions: scoring can be performed for a single system, and experts can score in a targeted manner after all the data piles are presented to them, which avoids the poor fit that arises when one set of formulas is applied to all systems; the collected index features are more numerous and more detailed, and the evaluation items are refined, so the result is more accurate.
Improved universality: the method is suitable for evaluating the interface data quality of upstream and downstream systems and can also serve as a reference in other data quality evaluation scenarios; it is suitable for the banking industry and can also be used directly, or drawn on for ideas, by other financial institutions; and it can be used in other industries, for example in business scenarios involving data quality evaluation in the IT industry, the consulting industry, and the like.
In an embodiment of the present application, the first clustering unit includes a first clustering module, a first determining module, and a first iteration module. The first clustering module is configured to select a first cluster number, which is the number of categories to be formed, and to cluster the interfaces of a target time period multiple times to obtain a plurality of first preliminary clustering results, where each clustering pass produces the first cluster number of categories and the target time period is any one of the time periods. The first determining module is configured to determine that a first preliminary clustering result is the first clustering result corresponding to the target time period when that result meets a first clustering termination condition, namely that the positions of the category centers of all the categories no longer change after clustering. The first iteration module is configured to repeat the first clustering step and the first determining step until the interfaces of all the time periods have been clustered, obtaining a plurality of first clustering results.
In this embodiment, as shown in fig. 3, take the clustering of 60 days of interface data quality data for 100 interfaces as an example. In the first clustering step, the first cluster number is set to 20, and the K-Means clustering algorithm is used to cluster the 100 interfaces of the target time period (any one of the 60 days) multiple times, each pass clustering the 100 interfaces into 20 categories. In the first determining step, when the category centers of a first preliminary clustering result have not moved relative to the category centers obtained in the previous pass, the interfaces whose interface data quality data are highly similar on that day are already in the same category. The first clustering step and the first determining step are then repeated until the 100 interfaces of each of the 60 days have been clustered into 20 categories, with an average of 5 interfaces per category, so that on every one of the 60 days the interfaces with highly similar interface data quality data fall into the same category.
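The per-day clustering with its "centers no longer move" termination condition can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each interface is represented by a numeric feature vector (for example accuracy, integrity, and timeliness values), and the names `sse`, `kmeans`, `seed`, and `max_iter` are hypothetical.

```python
import random


def sse(a, b):
    """Sum of squared errors between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points into k categories; stop once no category center moves.

    points: list of feature vectors, one per interface (assumed layout).
    Returns (labels, centers).
    """
    rng = random.Random(seed)
    # Initial centers: k distinct points chosen at random (an assumption;
    # the source does not say how the centers are initialised).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment: each interface joins the category with the nearest center.
        labels = [min(range(k), key=lambda c: sse(p, centers[c])) for p in points]
        # Update: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # empty category keeps its center
        if new_centers == centers:  # termination: category centers unchanged
            break
        centers = new_centers
    return labels, centers
```

Running `kmeans` once per day, with the same `k` each time, would give one first clustering result per time period, as the embodiment describes.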
It should be noted that, when the interfaces of each time period are clustered, they are classified according to the characteristics of the interface data quality data themselves. Individual outliers (such as online anomalies) are ignored, and the full data set is used so that the classification reflects the normal behavior of each table. The additional fields used in clustering include basic information about the base table, such as the base table size, the size of the file transmitted that day, and the number of rows. The interface data quality data of each interface include both problem data and normal data, which ensures that each category can be partitioned as a whole.
It should be noted that, as shown in fig. 3, after the 100 interfaces of each of the 60 days are clustered into 20 categories, the category corresponding to each interface is determined as follows. First, the categories of each of the 60 days are numbered: one category of the first day is arbitrarily named A1, and on each of days 2 to 60 the category that shares the most interfaces with A1 is named A2, A3, ..., A60 respectively. The remaining categories of the first day are handled in the same way, defining B1, C1, ..., T1 through B60, C60, ..., T60, until every category has a label. All Ai (i ranging from 1 to 60) are then defined as category A, and so on, giving the 20 categories A to T. Next, the number of times each interface appears in each of the categories A to T is recorded, and each interface is assigned to the category in which it appears most often. Once all interfaces have been classified in this way, the interfaces with highly similar interface data quality data belong to the same one of the categories A to T.
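The naming and voting procedure can be sketched as below. It is an illustration under stated assumptions: each day's clustering result is a list holding one category label per interface, categories are matched to the first day greedily by largest interface overlap (the source does not say how ties or conflicts are resolved), and the function names are hypothetical.

```python
from collections import Counter


def align_to_first_day(daily_labels):
    """Rename every day's categories after the day-1 category they overlap most.

    daily_labels: one list per day; daily_labels[d][i] is the raw category of
    interface i on day d. Returns the same structure with day-1 names.
    """
    first = daily_labels[0]
    names = sorted(set(first))
    ref_members = {n: {i for i, lab in enumerate(first) if lab == n} for n in names}
    aligned = [list(first)]
    for labels in daily_labels[1:]:
        groups = {}
        for i, lab in enumerate(labels):
            groups.setdefault(lab, set()).add(i)
        mapping, unused = {}, set(groups)
        for n in names:
            if not unused:
                break
            # Greedy choice: the unused category sharing the most interfaces with n.
            best = max(sorted(unused), key=lambda g: len(groups[g] & ref_members[n]))
            mapping[best] = n
            unused.discard(best)
        # Categories left unmatched map to None and are skipped when voting.
        aligned.append([mapping.get(lab) for lab in labels])
    return aligned


def final_category(aligned):
    """Assign each interface to the category in which it appears most often."""
    result = []
    for i in range(len(aligned[0])):
        counts = Counter(day[i] for day in aligned if day[i] is not None)
        result.append(counts.most_common(1)[0][0])
    return result
```

In the 60-day example, `aligned` would hold 60 lists of 100 labels, and `final_category` would return one of the 20 category names for each interface.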
In an embodiment of the application, the device for determining the quality of the interface data further includes a deleting unit configured to delete the normal data in each of the categories to obtain the problem data in each of the categories. The normal data are the interface data quality data whose accuracy data, integrity data, and timeliness data all meet the interface data requirements, and the problem data are the interface data quality data in which at least one of the accuracy data, integrity data, and timeliness data does not meet the interface data requirements. In this embodiment, as shown in fig. 4, only the problem data in each category need to be scored. The normal data in each category are therefore deleted before the interface data quality data of that category are clustered, leaving only the problem data, so that the problem data can be scored more accurately and reasonably.
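A minimal sketch of this filtering step follows. The source does not define how a requirement is expressed, so the sketch assumes each metric is a number that must be at least a threshold; the record layout, the threshold values in the usage note, and the function names are all hypothetical.

```python
METRICS = ("accuracy", "integrity", "timeliness")


def is_normal(record, requirements):
    """A record is normal only if all three metrics meet their requirement.

    Assumption: "meets the requirement" is modelled as value >= threshold.
    """
    return all(record[m] >= requirements[m] for m in METRICS)


def problem_data(category_records, requirements):
    """Drop the normal records of one category and keep only the problem data."""
    return [r for r in category_records if not is_normal(r, requirements)]
```

For instance, with hypothetical thresholds `{"accuracy": 0.99, "integrity": 0.98, "timeliness": 0.95}`, a record with accuracy 0.97 would be kept as problem data even if its other two metrics are perfect.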
In one embodiment of the present application, the second clustering unit includes a second clustering module, a second determining module, and a second iteration module. The second clustering module is configured to select a second cluster number, which is the number of data piles to be formed in a category, and to cluster the interface data quality data of a target category multiple times to obtain a plurality of second preliminary clustering results, where each clustering pass produces the second cluster number of data piles and the target category is any one of the plurality of categories. The second determining module is configured to determine that a second preliminary clustering result is the second clustering result corresponding to the target category when that result meets a second clustering termination condition, namely that the position of the data pile center of each data pile no longer changes. The second iteration module is configured to repeat the second clustering step and the second determining step until the interface data quality data of all the categories have been clustered, obtaining a plurality of second clustering results.
In this embodiment, as shown in fig. 4, again taking the clustering of 60 days of interface data quality data for 100 interfaces as an example, after each interface has been classified, the second clustering step sets the second cluster number to 15 and uses the K-Means clustering algorithm to cluster the interface data quality data of the target category (any one of the 20 categories A to T) multiple times, each pass clustering the interface data quality data of the target category into 15 data piles. In the second determining step, when the data pile centers of a second preliminary clustering result have not moved relative to the data pile centers obtained in the previous pass, the interface data quality data that are highly similar within the target category are already in the same data pile. The second clustering step and the second determining step are then repeated until the interface data quality data of each of the 20 categories A to T have been clustered into 15 data piles, so that within each category the interface data quality data with highly similar features fall into the same data pile.
In an embodiment of the present application, the scoring unit includes a first obtaining module, a second deleting module, and a third calculating module. The first obtaining module is configured to obtain a plurality of scores for each data pile. The second deleting module is configured to delete the highest score and the lowest score among the plurality of scores of each data pile to obtain a plurality of groups of target scores, where one group of target scores corresponds to one data pile. The third calculating module is configured to calculate the average of each group of target scores to obtain the score of each data pile. In this embodiment, again taking the clustering of 60 days of interface data quality data for 100 interfaces as an example, each expert scores the 300 data piles (15 data piles in each of the 20 categories). The final score of a data pile is obtained by removing the highest and lowest of the scores it received and averaging the remaining target scores.
It should be noted that the experts are specialists in the field of interface data quality evaluation and must be selected according to expert invitation rules, which comprise a quantity rule, a range rule, and a mode rule. The quantity rule is that, in theory, the more experts with scoring qualifications (for example, meeting requirements on years of experience and certificates), the better. The range rule is that the experts include personnel of the source system to which the data belong, data application team personnel, data quality and management team personnel, business personnel, and the like. The mode rule is that the full score is 10 points, and each expert scores the interface data quality data from multiple angles and at multiple levels according to professional knowledge, work experience, industry consensus, and the like. To make the scoring objective, the highest and lowest of the scores given to each data pile are removed and the remaining scores are averaged to give the final score.
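The scoring rule (drop the highest and lowest expert scores, then average the rest) can be written as a short function. This is a sketch; the name `pile_score` and the requirement of at least three scores are assumptions added so that something remains to average.

```python
def pile_score(scores):
    """Final score of one data pile from several expert scores (0 to 10 each).

    Removes one highest and one lowest score, then averages the remainder.
    """
    if len(scores) < 3:
        raise ValueError("need at least three expert scores to trim both ends")
    trimmed = sorted(scores)[1:-1]  # drop one lowest and one highest score
    return sum(trimmed) / len(trimmed)
```

For five hypothetical expert scores 6, 8, 7, 10, and 2, the 10 and the 2 are discarded and the pile receives the mean of 6, 7, and 8.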
In an embodiment of the application, the device for determining the quality of the interface data further includes a first scoring unit, which includes a fourth deleting module, a fifth acquiring module, a sixth calculating module, a seventh determining module, an eighth classification module, and a ninth scoring module. The fourth deleting module is configured to delete all the interface data quality data of each data pile to obtain a plurality of first target data piles, where the data pile centers of the first target data piles are identical to, and in one-to-one correspondence with, the data pile centers of the data piles. The fifth acquiring module is configured to acquire the interface data quality data of at least one time period. The sixth calculating module is configured to calculate the sum of squares of errors between each group of interface data quality data of each time period and the data pile center of each first target data pile, obtaining a plurality of sets of target distances, where one set of target distances corresponds to one group of interface data quality data, one set includes a plurality of target distances, and the target distances correspond to the first target data piles one to one. The seventh determining module is configured to determine a plurality of minimum target distances from the sets of target distances, where a minimum target distance is the smallest target distance in a set, and one minimum target distance corresponds to one group of interface data quality data. The eighth classification module is configured to place each group of interface data quality data into its optimal first target data pile to obtain a plurality of second target data piles, where the optimal first target data pile is the first target data pile corresponding to the minimum target distance of that group of interface data quality data. The ninth scoring module is configured to obtain a score of each of the second target data piles.
In this embodiment, again taking the clustering of 60 days of interface data quality data for 100 interfaces as an example, all interface data quality data are deleted from the 300 data piles to obtain 300 first target data piles; interface data quality data of at least one time period are acquired; and the sum of squares of errors from each group of interface data quality data to the data pile center of each of the 300 data piles is calculated, giving 300 target distances per group. A smaller target distance indicates that the group is more similar to the features of that data pile center. Each group of interface data quality data is therefore placed in the first target data pile with the minimum target distance, so that it joins the data pile it most resembles, yielding a plurality of second target data piles, from which the experts' scores for the second target data piles are obtained.
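Placing newly acquired data into the emptied data piles amounts to a nearest-center lookup. The sketch below assumes numeric feature vectors and a list of already-scored pile centers; `assign_to_piles`, `scores_for`, and the sample values in the usage note are hypothetical.

```python
def sse(a, b):
    """Sum of squared errors between a record and a data pile center."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_piles(records, pile_centers):
    """Return, for each record, the index of the pile center at minimum distance."""
    return [
        min(range(len(pile_centers)), key=lambda j: sse(r, pile_centers[j]))
        for r in records
    ]


def scores_for(records, pile_centers, pile_scores):
    """Look up the score of the pile each record falls into.

    Assumes pile_scores[j] is the score previously obtained for pile j.
    """
    return [pile_scores[j] for j in assign_to_piles(records, pile_centers)]
```

With two hypothetical centers `[0.9, 0.9, 0.9]` and `[0.2, 0.5, 0.1]`, a new record `[0.85, 0.95, 0.9]` lands in the first pile and takes that pile's score.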
The device for determining the quality of the interface data comprises a processor and a memory, wherein the acquiring unit, the first clustering unit, the classifying unit, the second clustering unit, the scoring unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor includes a kernel, which calls the corresponding program unit from the memory. One or more kernels may be provided, and the problem of the high degree of human intervention in interface data quality scoring in the prior art is addressed by adjusting kernel parameters.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
An embodiment of the present invention provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the method for determining the quality of interface data.
The embodiment of the invention provides a processor, which is used for running a program, wherein the method for determining the quality of the interface data is executed when the program runs.
An embodiment of the present invention provides a system for determining interface data quality, including one or more processors, a memory, a display device, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs include a program for executing any one of the above methods, the display device is configured to display expert scores of each of the data heaps, and the processor executes the program to implement at least the following steps:
step S101, acquiring interface data quality data of a plurality of interfaces in a plurality of time periods, wherein in one time period, one interface corresponds to a group of interface data quality data, and the group of interface data quality data comprises accuracy data, integrity data and timeliness data of the interface in one time period;
step S102, clustering the interfaces of each time period to obtain a plurality of first clustering results, wherein the first clustering results correspond to the time periods one by one, one first clustering result comprises a plurality of categories and category centers corresponding to the categories, the sum of squares of errors between any one group of interface data quality data and the corresponding category centers is smaller than a first preset threshold, one category comprises at least one interface, and the number of the categories and the names of the categories of any two first clustering results are the same;
step S103, determining the category corresponding to each of the interfaces according to the plurality of first clustering results, where the category corresponding to the interface is the category that appears most frequently among the categories having the same name;
step S104, clustering the interface data quality data of each category to obtain a plurality of second clustering results, wherein the second clustering results correspond to the categories one by one, one second clustering result comprises a plurality of data piles and data pile centers corresponding to the data piles, one data pile comprises at least one group of interface data quality data, the sum of the square errors of any group of interface data quality data and the corresponding data pile center is smaller than a second preset threshold value, and the number of the data piles of each second clustering result is the same;
step S105, acquiring a score for each of the data piles.
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with at least the following method steps:
step S101, acquiring interface data quality data of a plurality of interfaces in a plurality of time periods, wherein in one time period, one interface corresponds to a group of interface data quality data, and the group of interface data quality data comprises accuracy data, integrity data and timeliness data of the interface in one time period;
step S102, clustering the interfaces of each time period to obtain a plurality of first clustering results, wherein the first clustering results correspond to the time periods one by one, one first clustering result comprises a plurality of categories and category centers corresponding to the categories, the sum of squares of errors between any one group of interface data quality data and the corresponding category centers is smaller than a first preset threshold, one category comprises at least one interface, and the number of the categories and the names of the categories of any two first clustering results are the same;
step S103, determining the category corresponding to each interface according to a plurality of first clustering results, wherein the category corresponding to the interface is the category with the largest occurrence frequency in the categories with the same name;
step S104, clustering the interface data quality data of each category to obtain a plurality of second clustering results, wherein the second clustering results correspond to the categories one by one, one second clustering result comprises a plurality of data piles and data pile centers corresponding to the data piles, one data pile comprises at least one group of interface data quality data, the sum of the square errors of any group of interface data quality data and the corresponding data pile center is less than a second preset threshold, and the number of the data piles of each second clustering result is the same;
step S105, acquiring a score for each of the data piles.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as a separate product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a computer-readable storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned computer-readable storage media include a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other media capable of storing program code.
From the above description, it can be seen that the above-described embodiments of the present application achieve the following technical effects:
1) In the method for determining interface data quality, interface data quality data of a plurality of interfaces in a plurality of time periods are first acquired, where in one time period one interface corresponds to a group of interface data quality data, and the group of interface data quality data comprises accuracy data, integrity data and timeliness data of the interface data of that interface in the time period. The interfaces of each time period are then clustered to obtain a plurality of first clustering results, where the first clustering results are in one-to-one correspondence with the time periods, one first clustering result includes a plurality of categories and the category centers corresponding to the categories, the sum of squares of errors between any group of interface data quality data and the corresponding category center is smaller than a first predetermined threshold, one category includes at least one interface, and the number of categories and the names of the categories are the same for any two first clustering results. Next, the category corresponding to each interface is determined according to the plurality of first clustering results, the category corresponding to an interface being the one in which it appears most often among the categories having the same name. The interface data quality data of each category are then clustered to obtain a plurality of second clustering results, where the second clustering results correspond to the categories one by one, one second clustering result comprises a plurality of data piles and the data pile centers corresponding to the data piles, one data pile comprises at least one group of interface data quality data, the sum of squares of errors between any group of interface data quality data and the corresponding data pile center is smaller than a second predetermined threshold, and the number of data piles is the same in every second clustering result. Finally, the score of each data pile is obtained. By clustering the interfaces of each time period, the method places the interfaces whose interface data quality data are highly similar into the same category; by clustering the interface data quality data of each category, it places highly similar data into the same data pile; and it then obtains expert scores for each data pile. Because the interface data quality data come from real data, the scoring is more realistic. Because each data pile in each category is scored, rather than each group of interface data quality data individually, the scoring workload and the manual intervention are reduced and the scoring is more objective, which solves the problem of the high degree of manual intervention in interface data quality scoring in the prior art.
2) In the device for determining the quality of the interface data, an acquisition unit acquires interface data quality data of a plurality of interfaces in a plurality of time periods, where in one time period one interface corresponds to a group of interface data quality data, and the group of interface data quality data comprises accuracy data, integrity data and timeliness data of the interface in the time period. A first clustering unit clusters the interfaces of each time period to obtain a plurality of first clustering results, where the first clustering results correspond to the time periods one by one, one first clustering result comprises a plurality of categories and the category centers corresponding to the categories, the sum of squares of errors between any group of interface data quality data and the corresponding category center is smaller than a first predetermined threshold, one category comprises at least one interface, and the number of categories and the names of the categories are the same for any two first clustering results. A classification unit determines the category corresponding to each of the interfaces according to the plurality of first clustering results, the category corresponding to an interface being the one in which it appears most often among the categories having the same name. A second clustering unit clusters the interface data quality data of each category to obtain a plurality of second clustering results, where the second clustering results correspond to the categories one by one, one second clustering result comprises a plurality of data piles and the data pile centers corresponding to the data piles, one data pile comprises at least one group of interface data quality data, the sum of squares of errors between any group of interface data quality data and the corresponding data pile center is smaller than a second predetermined threshold, and the number of data piles is the same in every second clustering result. A scoring unit acquires a score of each data pile. By clustering the interfaces of each time period, the device places the interfaces whose interface data quality data are highly similar into the same category; by clustering the interface data quality data of each category, it places highly similar data into the same data pile; and it then obtains expert scores for each data pile. Because the interface data quality data come from real data, the scoring is more realistic. Because each data pile in each category is scored, rather than each group of interface data quality data directly, the scoring workload and the manual intervention are reduced and the scoring is more objective, which solves the problem of the high degree of manual intervention in interface data quality scoring in the prior art.
3) The interface data quality determining system comprises one or more processors, a memory, a display device and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprise a program for executing any one of the above methods, and the display device is used for displaying the expert scores of the data piles. By clustering the interfaces of each time period, the system places the interfaces whose interface data quality data are highly similar into the same category; by clustering the interface data quality data of each category, it places highly similar data into the same data pile; and it then obtains expert scores for each data pile. Because the interface data quality data come from real data, the scoring is more realistic. Because each data pile in each category is scored, rather than each group of interface data quality data directly, the scoring workload and the human intervention are reduced and the scoring is more objective, which solves the problem of the high degree of human intervention in interface data quality scoring in the prior art.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for determining quality of interface data, the method comprising:
acquiring interface data quality data of a plurality of interfaces in a plurality of time periods, wherein in one time period, one interface corresponds to a group of interface data quality data, and the group of interface data quality data comprises accuracy data, integrity data and timeliness data of the interface in one time period;
clustering the interfaces of each time period to obtain a plurality of first clustering results, wherein the first clustering results correspond to the time periods one by one, one first clustering result comprises a plurality of categories and category centers corresponding to the categories, the sum of the square errors between any group of interface data quality data and the corresponding category centers is less than a first preset threshold, one category comprises at least one interface, and the number of the categories and the names of the categories of any two first clustering results are the same;
determining the category corresponding to each interface according to the plurality of first clustering results, wherein the category corresponding to the interface is the category with the largest occurrence frequency in the categories with the same name;
clustering the interface data quality data of each category to obtain a plurality of second clustering results, wherein the second clustering results correspond to the categories one by one, one second clustering result comprises a plurality of data piles and data pile centers corresponding to the data piles, one data pile comprises at least one group of interface data quality data, the sum of squares of errors between any group of interface data quality data and the corresponding data pile center is less than a second preset threshold, and the number of the data piles of each second clustering result is the same;
obtaining a score for each of the data piles.
2. The method of claim 1, wherein clustering the interface for each of the time periods to obtain a plurality of first clustering results comprises:
a first clustering step, selecting a first clustering number, wherein the first clustering number is the number of the selected categories, clustering the interface of a target time period for multiple times to obtain a plurality of first preliminary clustering results, clustering each time into a first clustering number of the categories, and the target time period is any time period of the time periods;
a first determining step of determining that the first preliminary clustering result is the first clustering result corresponding to the target time period when the first preliminary clustering result meets a first clustering termination condition, wherein the first clustering termination condition is that the positions of category centers corresponding to all the categories after clustering are not changed;
and repeating the first clustering step and the first determining step until the interfaces of all the time periods have been clustered, to obtain the plurality of first clustering results.
3. The method of claim 1, wherein prior to clustering each of the categories separately to obtain a plurality of second clustering results, the method further comprises:
and deleting normal data in each category to obtain problem data in each category, wherein the normal data are interface data quality data with the accuracy data, the integrity data and the timeliness data of the interface data all meeting the interface data requirements, and the problem data are interface data quality data with at least one of the accuracy data, the integrity data and the timeliness data of the interface data not meeting the interface data requirements.
4. The method of claim 1, wherein clustering each of the categories separately to obtain a plurality of second clustering results comprises:
a second clustering step of selecting a second clustering number, the second clustering number being the number of data piles selected for the categories, and clustering the interface data quality data of a target category multiple times to obtain a plurality of second preliminary clustering results, wherein each clustering pass produces the second clustering number of data piles, and the target category is any one of the plurality of categories;
a second determining step of determining, when a second preliminary clustering result meets a second clustering termination condition, that this second preliminary clustering result is the second clustering result corresponding to the target category, wherein the second clustering termination condition is that the position of the data pile center corresponding to each data pile no longer changes;
and repeating the second clustering step and the second determining step until the interface data quality data of all the categories have been clustered, thereby obtaining the plurality of second clustering results.
5. The method of claim 1, wherein obtaining a score for each of the data piles comprises:
obtaining a plurality of scores for each of the data piles;
deleting the highest score and the lowest score from the plurality of scores of each data pile to obtain a plurality of groups of target scores, wherein one group of target scores corresponds to one data pile;
calculating the average of each group of target scores to obtain the score of each data pile.
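The scoring in claim 5 reads as a trimmed mean. The sketch below assumes that the deletion step removes exactly one highest and one lowest score per data pile before averaging; the function name `pile_score` and the three-score minimum are hypothetical.

```python
def pile_score(scores):
    """Average the scores of one data pile after dropping one highest and one lowest."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to drop the highest and lowest")
    target_scores = sorted(scores)[1:-1]  # remove the extremes
    return sum(target_scores) / len(target_scores)

# Four expert scores for one data pile: 60 and 100 are dropped, 80 and 90 remain.
score = pile_score([60, 80, 90, 100])  # 85.0
```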
6. The method of any one of claims 1 to 5, wherein after obtaining the score for each of the data piles, the method further comprises:
deleting all the interface data quality data from each data pile to obtain a plurality of first target data piles, wherein the data pile centers corresponding to the first target data piles are in one-to-one correspondence with the data pile centers corresponding to the data piles;
acquiring the interface data quality data of at least one time period;
calculating the sum of squared errors between each group of interface data quality data in each time period and the data pile center corresponding to each first target data pile to obtain a plurality of groups of target distances, wherein one group of target distances corresponds to one group of interface data quality data, one group of target distances comprises a plurality of target distances, and the target distances are in one-to-one correspondence with the first target data piles;
determining a plurality of minimum target distances according to the plurality of groups of target distances, wherein each minimum target distance is the smallest target distance in one group of target distances, and one minimum target distance corresponds to one group of interface data quality data;
assigning each group of interface data quality data to its corresponding optimal first target data pile to obtain a plurality of second target data piles, wherein the optimal first target data pile is the first target data pile corresponding to the minimum target distance of that group of interface data quality data;
obtaining a score for each of the second target data piles.
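The assignment in claim 6 amounts to a nearest-center lookup: the retained data pile centers stay fixed and each newly acquired group of data goes to the pile with the smallest sum of squared errors. The sketch below shows this under assumed names (`sse`, `assign_to_piles`) and made-up center values.

```python
def sse(a, b):
    """Sum of squared errors between a data group and a data pile center."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_to_piles(records, pile_centers):
    """Place each record in the emptied pile whose center is nearest.

    Returns a dict mapping pile index to the records assigned to it,
    i.e. the second target data piles.
    """
    piles = {i: [] for i in range(len(pile_centers))}
    for rec in records:
        # One group of target distances: one distance per first target pile.
        distances = [sse(rec, center) for center in pile_centers]
        best = min(range(len(distances)), key=distances.__getitem__)
        piles[best].append(rec)
    return piles

# Hypothetical centers kept from the earlier clustering.
pile_centers = [(1.0, 1.0, 1.0), (0.5, 0.5, 0.5)]
new_records = [(0.9, 0.95, 1.0), (0.4, 0.6, 0.5), (0.55, 0.5, 0.45)]
second_piles = assign_to_piles(new_records, pile_centers)
```

Because the centers are reused rather than recomputed, new data can be placed without running the clustering again.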
7. An apparatus for determining quality of interface data, the apparatus comprising:
an acquisition unit, configured to acquire interface data quality data of a plurality of interfaces in a plurality of time periods, wherein one interface corresponds to one group of interface data quality data in one time period, and the group of interface data quality data includes accuracy data, integrity data and timeliness data of the interface data of the interface in that time period;
a first clustering unit, configured to cluster the interfaces of each time period to obtain a plurality of first clustering results, wherein the first clustering results are in one-to-one correspondence with the time periods, one first clustering result includes a plurality of categories and the category centers corresponding to the categories, the sum of squared errors between any one group of interface data quality data and the corresponding category center is smaller than a first predetermined threshold, one category includes at least one interface, and any two first clustering results have the same number of categories and the same category names;
a classification unit, configured to determine the category corresponding to each interface according to the plurality of first clustering results, wherein the category corresponding to an interface is the category with the highest occurrence frequency among the categories with the same names;
a second clustering unit, configured to cluster the interface data quality data of each of the categories to obtain a plurality of second clustering results, wherein the second clustering results are in one-to-one correspondence with the categories, one second clustering result includes a plurality of data piles and the data pile centers corresponding to the data piles, one data pile includes at least one group of interface data quality data, the sum of squared errors between any one group of interface data quality data and the corresponding data pile center is smaller than a second predetermined threshold, and every second clustering result has the same number of data piles;
and a scoring unit, configured to obtain the score of each data pile.
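One plausible reading of the classification unit is a majority vote: each time period's first clustering result places an interface in a named category, and the interface keeps the name that occurs most often. The sketch below assumes that reading; the function name `category_per_interface`, the dict layout and the category names are hypothetical, and the source does not say how ties are broken.

```python
from collections import Counter

def category_per_interface(first_clustering_results):
    """Pick, for each interface, the category name it received most often.

    `first_clustering_results` holds one dict per time period, mapping
    interface id to category name.
    """
    votes = {}
    for result in first_clustering_results:
        for interface, category in result.items():
            votes.setdefault(interface, Counter())[category] += 1
    # On a tie, Counter.most_common keeps the name that was counted first.
    return {iface: counts.most_common(1)[0][0] for iface, counts in votes.items()}

results = [
    {"if-001": "good", "if-002": "poor"},  # time period 1
    {"if-001": "good", "if-002": "good"},  # time period 2
    {"if-001": "poor", "if-002": "poor"},  # time period 3
]
final = category_per_interface(results)
```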
8. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program performs the method of any one of claims 1 to 6.
9. A processor, configured to run a program, wherein the program when running performs the method of any one of claims 1 to 6.
10. A system for determining quality of interface data, comprising one or more processors, a memory, a display device and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs include instructions for performing the method of any one of claims 1 to 6, and the display device is configured to present the expert scores for each of the data piles.
CN202210361932.3A 2022-04-07 2022-04-07 Interface data quality determination method, device and system Pending CN114693144A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210361932.3A CN114693144A (en) 2022-04-07 2022-04-07 Interface data quality determination method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210361932.3A CN114693144A (en) 2022-04-07 2022-04-07 Interface data quality determination method, device and system

Publications (1)

Publication Number Publication Date
CN114693144A true CN114693144A (en) 2022-07-01

Family

ID=82143249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210361932.3A Pending CN114693144A (en) 2022-04-07 2022-04-07 Interface data quality determination method, device and system

Country Status (1)

Country Link
CN (1) CN114693144A (en)

Similar Documents

Publication Publication Date Title
WO2021052031A1 (en) Statistical interquartile range-based commodity inventory risk early warning method and system, and computer readable storage medium
CN104756106B (en) Data source in characterize data storage system
NZ541411A (en) Technology evaluating device, technology evaluating program, and technology evaluating method
CN108764705A (en) A kind of data quality accessment platform and method
CN112102073A (en) Credit risk control method and system, electronic device and readable storage medium
CN113051291A (en) Work order information processing method, device, equipment and storage medium
CN111062597A (en) Method and device for detecting criminal suspicion of financial statement of listed company
WO2021174699A1 (en) User screening method, apparatus and device, and storage medium
US10803124B2 (en) Technological emergence scoring and analysis platform
CN114022269A (en) Enterprise credit risk assessment method in public credit field
CN113326255A (en) Method and device for screening effective test data, terminal equipment and storage medium
CN116883070A (en) Bank generation payroll customer loss early warning method
CN114693144A (en) Interface data quality determination method, device and system
CN115829722A (en) Training method of credit risk scoring model and credit risk scoring method
CN112506930B (en) Data insight system based on machine learning technology
CN114626940A (en) Data analysis method and device and electronic equipment
CN114511409A (en) User sample processing method and device and electronic equipment
CN114611515A (en) Method and system for identifying actual control person of enterprise based on enterprise public opinion information
CN108446907A (en) Safe checking method and device
CN113205270B (en) Method and system for automatically generating satisfaction evaluation table and calculating evaluation score
CN112258095B (en) Standard normal distribution based scoring method, device, equipment and storage medium
CN117349728A (en) Quality evaluation method and device for intelligent model
Dokic et al. Towards a data quality index for data valuation in the data economy
CN115439166A (en) Enterprise classification method and device
CN114549217A (en) Credit data generation method and device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination