CN117235063A - Data quality management method based on artificial intelligence technology

Info

Publication number: CN117235063A
Application number: CN202311489902.1A
Authority: CN
Inventors: 李保平; 谢超; 杨建荣; 陈木辉; 麦新伟; 黄月梅; 戴思敏
Original assignee: Guangzhou Huitong Guoxin Technology Co ltd
Current assignee: Guangzhou Huitong Guoxin Technology Co ltd
Priority date: 2023-11-10
Filing date: 2023-11-10
Publication date: 2023-12-15
Anticipated expiration: 2043-11-10
Also published as: CN117235063B

Abstract

The invention relates to the technical field of data management, and in particular to a data quality management method based on artificial intelligence technology. Part of the data in a database is extracted at a specified time and set as primary sampled data. Whether the primary sampled data is abnormal is judged; when it is abnormal, the corresponding responsibility number is determined, part of the data in the database is extracted again based on that responsibility number and set as secondary sampled data, and whether the secondary sampled data is abnormal is judged. The risk level of the responsibility number is then judged by combining the sampling results of the primary and secondary sampled data. A supervision mode for the responsibility number is defined based on the risk level, and whether to execute a risk level regulation operation is judged according to the condition of the subsequent data. By combining the first and second extractions to judge the responsibility number and its risk level, and by judging the abnormal condition of the subsequent data under that responsibility number to decide whether to execute the risk level regulation operation, the responsibility number and its corresponding data are managed reasonably.

Description

Data quality management method based on artificial intelligence technology

Technical Field

The invention relates to the technical field of data management, in particular to a data quality management method based on an artificial intelligence technology.

Background

As the business information of enterprises gradually increases, the systems in an enterprise platform generate a large amount of business data, such as order data, sales data and product data. This data is usually recorded by personnel or by the system as it is formed, and errors inevitably arise during recording because of personnel misoperation and system faults, so the data quality of enterprise databases is currently problematic. Data with quality problems is usually handled by extracting the data in the database and checking and correcting it item by item. For massive data, however, not all of the data has quality problems, and in this case item-by-item checking reduces the efficiency of data quality analysis and treatment.

In addition, after data is entered, a responsible person for entering the data needs to be set. When the data corresponding to that responsible person is abnormal, how to perform corresponding risk management and control on the responsible person and the data (without relying on item-by-item correction of massive data), so as to reduce continued abnormal changes in subsequent data, is a problem that currently needs to be solved.

Disclosure of Invention

Aiming at the defects existing in the prior art, the invention provides a data quality management method based on an artificial intelligence technology, which can effectively solve the problem of how to manage and control the quality risk of mass data when the mass data is abnormal in the prior art.

In order to achieve the above purpose, the invention is realized by the following technical scheme:

the invention provides a data quality management method based on an artificial intelligence technology, which comprises the following method steps:

S1, extracting part of the data in a database at a specified time, and setting the extracted data as primary sampled data;

S2, analyzing and judging whether the primary sampled data is abnormal; when the primary sampled data is judged abnormal, determining the responsibility number corresponding to the primary sampled data, extracting part of the data in the database again based on the responsibility number, setting that data as secondary sampled data, judging whether the secondary sampled data is abnormal, and judging the risk level of the responsibility number by combining the sampling results of the primary sampled data and the secondary sampled data;

S3, defining a supervision mode for the responsibility number based on the risk level, and regulating the data quality in the current database through the supervision mode.

Further, the specified time includes:

weekly, monthly and quarterly.

Further, whether the primary and secondary sampled data are abnormal is determined according to the following method:

whether the data contains missing items, invalid items, duplicate items or abnormal items, wherein:

the abnormal items are judged by a Z-score algorithm, which comprises the following steps:

step one: collecting the historical data corresponding to the sampled data in the database at the specified time;

step two: calculating the average value: $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$ wherein: $\bar{x}$ is the average value of the historical data, $x_1, x_2, \ldots, x_n$ are the historical data, and $n$ is the total number of days of the historical data;

step three: calculating the standard deviation: $$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$ wherein: $\sigma$ is the standard deviation of the historical data and $x_i$ is the i-th historical data;

step four: setting a threshold constant and judging based on the Z-score algorithm: $$Z\text{-score} = \frac{x - \bar{x}}{\sigma}$$ wherein: Z-score is the decision value and $x$ is the currently extracted variable data; determining whether the Z-score is greater than the threshold constant; when it is greater than the threshold constant, $x$ is abnormal and a correction operation is performed on the abnormal data based on $\bar{x}$; when it is less than or equal to the threshold constant, $x$ is normal.
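The four steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the default threshold of 3.0, the use of the population standard deviation, and the choice to replace an abnormal value with the historical mean are all assumptions made for the example.

```python
from statistics import fmean, pstdev


def z_score(history, x):
    """Steps two to four: decision value (x - mean) / standard deviation."""
    mean = fmean(history)   # step two: average of the historical data
    std = pstdev(history)   # step three: population standard deviation (assumed)
    if std == 0:
        # Degenerate history (all values equal): any deviation is unbounded.
        if x == mean:
            return 0.0
        return float("inf") if x > mean else float("-inf")
    return (x - mean) / std


def check_value(history, x, threshold=3.0):
    """Return (is_abnormal, value_to_keep) for one extracted value.

    threshold is the threshold constant; 3.0 is an assumed default.
    An abnormal value is replaced by the historical mean (assumed correction).
    """
    abnormal = z_score(history, x) > threshold
    return abnormal, (fmean(history) if abnormal else x)


# Hypothetical history with order quantities of 10 to 15, then a value of 30.
history = [10, 12, 11, 13, 14, 12, 15, 11, 12, 10]
print(check_value(history, 30))
print(check_value(history, 13))
```

The comparison is one-sided, as in the text. Comparing the absolute value of the Z-score against the threshold would also flag unusually low values, but that variant is not stated in the source.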

Further, the risk level comprises a first level, a second level and a third level, ordered from high to low, wherein the risk level is judged as follows:

when only the primary sampled data is abnormal, the risk level corresponding to the responsibility number is the third level;

when both the primary sampled data and the secondary sampled data are abnormal, the risk level corresponding to the responsibility number is the second level, and a supervision mode is set for the responsibility number;

when the responsibility number is being monitored under the defined supervision mode and the monitored data corresponding to the responsibility number is judged abnormal, the risk level corresponding to the responsibility number is the first level.

Further, the supervision mode is set according to the risk level, and includes:

when the risk level is the third level, no supervision mode is set for the responsibility number;

when the risk level is the second level, the supervision mode of the responsibility number is to monitor the data corresponding to the responsibility number in real time and to record the quantity of abnormal data corresponding to the responsibility number;

when the risk level is the first level, the supervision mode of the responsibility number is an interruption-limited input state, and the data corresponding to the responsibility number is monitored in real time.
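The mapping from sampling results to risk level, and from risk level to supervision mode, can be sketched as a small lookup. The class, field and function names are hypothetical, and the flags only encode what the text states for each level; this is an illustration under those assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class RiskLevel(IntEnum):
    FIRST = 1   # highest risk
    SECOND = 2
    THIRD = 3   # lowest risk


@dataclass(frozen=True)
class Supervision:
    realtime_monitoring: bool
    record_abnormal_count: bool
    limit_input: bool   # interruption-limited input state


def judge_risk_level(primary_abnormal: bool,
                     secondary_abnormal: bool,
                     abnormal_under_supervision: bool = False) -> Optional[RiskLevel]:
    """Return the risk level of a responsibility number, or None if no anomaly."""
    if not primary_abnormal:
        return None
    if not secondary_abnormal:
        return RiskLevel.THIRD
    if abnormal_under_supervision:
        return RiskLevel.FIRST
    return RiskLevel.SECOND


SUPERVISION_BY_LEVEL = {
    RiskLevel.THIRD: Supervision(False, False, False),   # no supervision mode
    RiskLevel.SECOND: Supervision(True, True, False),    # monitor and count anomalies
    # Counting anomalies is not stated for the first level, so it is left off here.
    RiskLevel.FIRST: Supervision(True, False, True),
}

level = judge_risk_level(primary_abnormal=True, secondary_abnormal=True)
print(level.name, SUPERVISION_BY_LEVEL[level])
```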

Further, when the risk level is the second level and the responsibility number is in the monitoring state, the number of abnormal data currently corresponding to the responsibility number and the number of abnormal data in the secondary sampled data are determined, a regulation interval is set, and whether to execute the risk level decreasing/increasing operation is judged based on the regulation interval, wherein the risk level decreasing/increasing operation is judged and executed as follows:

wherein p is the number of currently corresponding abnormal data and y is the number of abnormal data in the secondary sampled data; a three-step change sequence is set for the abnormal data corresponding to the responsibility number in the monitoring state; A, B and C respectively denote grades of the number p of currently corresponding abnormal data, ordered in sequence from fewest to most; when p is in A after at most three changes, the background real-time monitoring operation is paused and the risk level supervision operation is executed: the data next corresponding to the responsibility number is extracted, and whether to execute the risk level regulation operation is judged according to the amount of abnormal data.

Further, when the interruption-limited input state is formed, the data next corresponding to the responsibility number is extracted, and whether to execute the risk level decreasing/increasing operation is determined according to the abnormal condition of that data.

Further, when the primary sampled data is abnormal and multiple responsibility numbers correspond to it, the per-number sampling amount and the total sampling amount of the secondary sampled data are defined, and they are determined as follows:

step one: obtaining, from the historical data, the amount of abnormal data for each responsibility number, accumulating these amounts to obtain the total amount of abnormal data, and determining the proportion of each responsibility number's abnormal data amount in the total;

step two: determining the total sampling amount based on the total amount of abnormal data corresponding to the responsibility numbers, and determining each responsibility number's sampling amount within the total sampling amount according to its proportion;

step three: setting a reduction threshold; when the number of responsibility numbers in the historical data is smaller than the number of responsibility numbers corresponding to the primary sampled data, uniformly reducing the proportions from step one according to the reduction threshold and regenerating the proportions, and determining the corresponding sampling amounts according to the reduction threshold and the proportions.
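Steps one and two amount to a proportional allocation, which might look like the sketch below. The function names and the sample figures are hypothetical, and the total sampling amount is passed in as a parameter because the text does not say how it is derived from the total amount of abnormal data.

```python
def sampling_proportions(history_abnormal):
    """Step one: share of each responsibility number in the accumulated anomalies.

    history_abnormal maps a responsibility number to its historical
    abnormal-data count.
    """
    total = sum(history_abnormal.values())
    if total == 0:
        raise ValueError("no historical abnormal data to derive proportions from")
    return {rid: count / total for rid, count in history_abnormal.items()}


def allocate_samples(total_sample, proportions):
    """Step two: split the total sampling amount according to the proportions.

    Rounding each share independently means the parts may not add up exactly
    to total_sample for arbitrary inputs.
    """
    return {rid: round(total_sample * p) for rid, p in proportions.items()}


# Hypothetical figures: 30 and 10 historical anomalies, 200 records to sample.
proportions = sampling_proportions({"124": 30, "125": 10})
print(proportions)
print(allocate_samples(200, proportions))
```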

Compared with the prior art, the technical scheme provided by the invention has the following beneficial effects:

1. Data in the database is extracted and evaluated for abnormality; based on the abnormal data found, data is extracted from the database again; the corresponding responsibility number is judged comprehensively by combining the abnormal conditions of the first and second extractions; the risk level of the responsibility number is defined at the same time, and risk management measures are set for each risk level, which reduces the influence of the risk management measures on other data in the database.

2. The risk level of the responsibility number is judged, the data entered into the database is monitored on that basis, and a supervision operation is formed for the responsibility number, so that whether to switch the risk level can be judged. This improves the accuracy of subsequent data management and control for the responsibility number and reduces the occurrence of low-quality data in the database, so that the data in the database can provide accurate support for enterprise decisions.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.

FIG. 1 is a schematic diagram of the overall process of the present invention;

FIG. 2 is a schematic diagram of a risk level determination method according to the present invention;

FIG. 3 is a schematic diagram of a supervision method according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The invention is further described below with reference to examples.

Example 1 (see fig. 1-3): a data quality management method based on artificial intelligence technology comprises the following steps:

S1, extracting part of the data in a database at a specified time, and setting the extracted data as primary sampled data, wherein the specified time may be weekly, monthly or quarterly; for example, part of the data in the database is extracted in a given month to form the primary sampled data. Each record in the database generally includes the specific information of the data and the responsible person (the responsibility number); for example, current product data includes product size information, application scene information, price information, order quantity and the responsible person who submitted the data;

S2, analyzing and judging whether the primary sampled data is abnormal; when the primary sampled data is judged abnormal, determining the responsibility number corresponding to the primary sampled data, extracting part of the data in the database again based on the responsibility number, setting that data as secondary sampled data, judging whether the secondary sampled data is abnormal, and judging the risk level of the responsibility number by combining the sampling results of the primary sampled data and the secondary sampled data. That is, whether the data is abnormal is judged first; when it is abnormal, the responsibility number corresponding to the abnormal data in the primary sampled data is determined, data is extracted from the database again according to that responsibility number to form the secondary sampled data, the secondary sampled data is likewise judged for abnormality, and the risk level of the responsibility number is judged according to the abnormal conditions of the primary and secondary sampled data;

S3, defining a supervision mode for the responsibility number based on the risk level, regulating the data quality in the current database through the supervision mode, and judging, according to the abnormal condition of the data subsequently corresponding to the responsibility number under the supervision mode, whether to execute a risk level regulation operation so as to re-judge the risk level of the responsibility number. Because the abnormal data in the primary and secondary sampled data correspond to the same responsibility number, the risk level of that responsibility number can be estimated from the abnormal data. A corresponding supervision mode (explained in detail below) is set for the risk level, and the responsibility number is controlled under that supervision mode, which reduces the probability of abnormal data appearing in the database subsequently and improves the accuracy of the data in the database. The abnormal condition of the data subsequently corresponding to the responsibility number is then used to judge whether to switch the original risk level of the responsibility number. Defining the regulation mode in this way allows the risk condition of the responsibility number to be analyzed and judged reasonably and specific control to be applied according to the specific condition of the responsibility number, and the resulting more accurate data can support data analysis and decisions about changes in the state of enterprise products.

In this scheme, whether the primary and secondary sampled data are abnormal is judged according to the following method:

whether the data is abnormal is judged by checking whether the primary and secondary sampled data contain blanks (unfilled data exists in the list), superfluous filled data (for example, data for non-holidays is extracted, but holiday data exists in the list and is judged superfluous), identical data (the same size information, application scene information and price information appear in different product data), or abnormal items. An abnormal item means that the list contains data that does not conform to the product size information, application scene information or price information; for example, when the order quantity in the historical data lies in the interval 10-15 and the currently extracted data contains an order quantity of 30, that data is judged abnormal. The specific judgment and correction follow the Z-score steps described above.

It should be noted that missing data is also corrected in this way; invalid data is corrected by direct deletion; and duplicate data is corrected by retrieving the original data of the given product for verification and re-entering it into the list.
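The three non-statistical checks described above (blanks, superfluous entries and identical entries) can be sketched with plain Python records. The field names `size`, `scene`, `price` and `date`, the function names and the sample rows are hypothetical and chosen only for illustration.

```python
def find_missing(records, required_fields):
    """Indices of records where a required field is absent or blank."""
    return [i for i, r in enumerate(records)
            if any(r.get(f) in (None, "") for f in required_fields)]


def find_invalid(records, allowed_dates):
    """Indices of records dated outside the extracted period (superfluous data)."""
    return [i for i, r in enumerate(records) if r.get("date") not in allowed_dates]


def find_duplicates(records, key_fields):
    """Indices of records repeating an earlier record's key fields."""
    seen = set()
    duplicates = []
    for i, r in enumerate(records):
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


records = [
    {"size": "M", "scene": "office", "price": 20, "date": "d1"},
    {"size": "M", "scene": "office", "price": 20, "date": "d2"},  # same key fields
    {"size": "L", "scene": "", "price": 35, "date": "d2"},        # blank field
    {"size": "S", "scene": "home", "price": 15, "date": "d9"},    # outside period
]
print(find_missing(records, ["size", "scene", "price"]))
print(find_invalid(records, {"d1", "d2"}))
print(find_duplicates(records, ["size", "scene", "price"]))
```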

Further, the risk levels comprise a first level, a second level and a third level, ordered from high to low, wherein the risk levels are judged as follows:

when only the primary sampled data is abnormal, the risk level corresponding to the responsibility number is the third level. The third level means that the abnormal data has occurred for the first time, the risk probability is low, and no supervision measures are set for the responsibility number;

when both the primary and secondary sampled data are abnormal, the risk level corresponding to the responsibility number is the second level, and a supervision mode is set for the responsibility number. The second level means that the abnormal data has occurred multiple times and the risk probability is medium, so supervision measures are set for the responsibility number and the data subsequently corresponding to the responsibility number needs to be monitored;

when the responsibility number is being monitored under the defined supervision mode and the monitored data corresponding to the responsibility number is judged abnormal, the risk level corresponding to the responsibility number is the first level. The first level means that, starting from the second level, the data corresponding to the responsibility number has continued to be abnormal during subsequent supervision, so the responsibility number is placed at the first level in order to limit its influence on the data.

Next, the supervision mode is set according to the risk level, including:

210. when the risk level is the third level, no supervision mode is set for the responsibility number;

220. when the risk level is the second level, the supervision mode of the responsibility number is to monitor the data corresponding to the responsibility number in real time and to record the quantity of abnormal data corresponding to the responsibility number;

230. when the risk level is the first level, the supervision mode of the responsibility number is the interruption-limited input state, and the data corresponding to the responsibility number is monitored in real time. The interruption-limited input state means that, within the time originally set for the responsibility number to enter its data, entry, modification and other operations on the data are permitted only at the middle of the set time (when the set time spans an odd number of units, the permitted time is the median; when it spans an even number of units, the set time is divided equally into two periods and the median of each period is a permitted time). For example, where the person with responsibility number 123 was originally required to enter or change data within the set time, the time at which that responsibility number is permitted to enter data under the interruption-limited input state is the date 10.4. The responsibility number and its corresponding data are thereby managed and controlled, which improves the accuracy of the data.

When the risk level is the second level and the responsibility number is in the monitoring state, the number of abnormal data currently corresponding to the responsibility number and the number of abnormal data in the secondary sampled data are determined, a regulation interval is set, and whether to execute the risk regulation operation, namely the risk level decreasing/increasing operation, is judged based on the regulation interval. The risk level decreasing/increasing operation is judged and executed as follows:

wherein p is the number of currently corresponding abnormal data and y is the number of abnormal data in the secondary sampled data. A three-step change sequence is set for the abnormal data corresponding to the responsibility number in the monitoring state; A, B and C respectively denote grades of the number of currently corresponding abnormal data, ordered in sequence from fewest to most. When p is in A after at most three changes, the background real-time monitoring operation is suspended and the risk level supervision operation is executed: the data next corresponding to the responsibility number is extracted, and whether to execute the risk level decreasing/increasing operation is judged according to the amount of abnormal data. According to the judgment formula, the abnormal condition of the data next corresponding to the responsibility number is monitored through the background real-time monitoring operation; at most three changes in the abnormal data are allowed for the responsibility number, and the amount of abnormal data corresponding to the responsibility number needs to decrease from C to A. For example, after the second level is entered, the amount of abnormal data corresponding to the responsibility number may follow C-A, B-A or A; that is, as soon as it is in A, the real-time monitoring operation (the system background monitoring the entry of data corresponding to the responsibility number) is stopped immediately, the risk level supervision operation is executed to extract the data next corresponding to the responsibility number, and whether to execute the risk level decreasing/increasing operation is judged according to the amount of abnormal data. Specifically:

when the extracted data is abnormal, the number of changes the responsibility number's data went through before reaching A is determined:

R1, when it is three times and the amount of abnormal data in the extracted data is C, B, A or exceeds y, the abnormal data still cannot be eliminated after the maximum allowable number of abnormal changes, and the risk level is set to the first level;

R2, when it is twice, the risk level is neither decreased nor increased and remains unchanged; the decreasing operation is executed if the next amount of abnormal data is none of C, B, A and does not exceed y, and otherwise the level remains unchanged; whether to execute R1 is judged according to whether the next amount of abnormal data is A;

R3, when it is once (A), the risk level is neither decreased nor increased and remains unchanged; the decreasing operation is executed if the next amount of abnormal data is none of C, B, A and does not exceed y, and otherwise the level remains unchanged; whether to execute R2 is judged according to whether the next amount of abnormal data is A;

It should be noted that when, after the maximum allowable three changes, the amount of abnormal data corresponding to the responsibility number is not A (that is, it is B or C), the risk level is increased: this indicates that after the maximum allowable number of abnormal changes the amount of abnormal data still has not been reduced (and remains higher than before), and the risk level is set to the first level.
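The second-level monitoring rule (reach grade A within at most three changes, otherwise move to the first level) can be sketched as below. The interval formula that defines grades A, B and C is not reproduced in this text, so the grading function is supplied by the caller; the `example_grade` thresholds, the function names and the returned action labels are purely hypothetical.

```python
def monitor_second_level(abnormal_counts, grade_of, max_changes=3):
    """Walk through successive abnormal counts p for one responsibility number.

    grade_of maps a count p to "A", "B" or "C" (caller-supplied assumption).
    Returns (action, changes_seen).
    """
    seen = 0
    for p in abnormal_counts:
        seen += 1
        if grade_of(p) == "A":
            # Grade A reached: pause background monitoring and re-sample.
            return "pause_monitoring_and_resample", seen
        if seen == max_changes:
            # Still B or C after the maximum allowed changes.
            return "raise_to_first_level", seen
    return "keep_monitoring", seen


def example_grade(p, y=9):
    """Hypothetical grading relative to y, the secondary-sample anomaly count."""
    if p <= y / 3:
        return "A"
    if p <= 2 * y / 3:
        return "B"
    return "C"


print(monitor_second_level([8, 5, 2], example_grade))
print(monitor_second_level([8, 7, 5], example_grade))
```

Passing the grading function in keeps the sketch independent of the missing interval definition: any rule that maps p to A, B or C can be substituted without touching the monitoring loop.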

Finally, when the interruption-limited input state is formed, the data next corresponding to the responsibility number is extracted, and whether to execute the risk level regulation operation is judged according to the abnormal condition of that data. Specifically, the decreasing operation is executed if the next amount of abnormal data is none of C, B, A and does not exceed y; otherwise the level remains unchanged. When the risk level is the first level and the responsibility number is restricted by the interruption-limited input state, the data corresponding to the responsibility number is extracted three times; when the abnormal condition of that data is C, B, A or exceeds y, the responsibility number is locked (it can no longer enter or adjust data), and lock information is generated and sent to the system terminal so that the system terminal's management personnel can handle it.

Example 2:

Unlike the above embodiment, this embodiment specifically explains how the data amount of the secondary sampled data is determined:

when the primary sampled data is abnormal and multiple responsibility numbers correspond to it, the per-number sampling amount and the total sampling amount of the secondary sampled data are defined, and they are determined as follows:

step two: determining the total sampling amount based on the total amount of abnormal data corresponding to the responsibility numbers, and determining each responsibility number's sampling amount within the total sampling amount according to its proportion. The amount of abnormal data for each responsibility number in the historical data is determined and accumulated to obtain the total amount; the subsequent total sampling amount and each number's share of it are then determined according to the proportion of each responsibility number's abnormal data amount in the total, so that after the primary sampled data is formed, the amount of data to be analyzed and evaluated can be acquired accurately from the database through the secondary sampled data;

step three: setting a reduction threshold; when the number of responsibility numbers in the historical data is smaller than the number of responsibility numbers corresponding to the primary sampled data, uniformly reducing the proportions from step one according to the reduction threshold and regenerating the proportions, and determining the corresponding sampling amounts according to the reduction threshold and the proportions. For example: the list for the date 10.1 contains the responsibility numbers 123, 124 and 125, while the historical data contains only 124 and 125, so the historical data lacks the current number 123. To make the secondary sampling practicable, a reduction threshold is set in the system in advance to reduce the proportions of the existing numbers 124 and 125 equally. If the reduction threshold is preset to 10%, the proportions of 124 and 125 are each reduced by 5%, which avoids an unbalanced reduction that would affect the accuracy of the data extraction and analysis; the 10% is assigned as the proportion of 123, and the total sampling amount remains unchanged, so that the data in the database is extracted and judged more accurately.
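The reduction in step three can be sketched as follows. The starting proportions of 60% and 40% are hypothetical, since the example above gives only the 10% threshold and the 5% reductions; splitting the freed share evenly among several new responsibility numbers is also an assumption, as the text covers only one new number.

```python
def apply_reduction(proportions, new_ids, reduction_threshold=0.10):
    """Reduce existing proportions equally and give the freed share to new IDs.

    proportions: historical proportions per responsibility number (sum to 1).
    new_ids: responsibility numbers in the primary sample but not in history.
    """
    if not proportions or not new_ids:
        return dict(proportions)
    cut = reduction_threshold / len(proportions)   # equal cut per existing number
    if any(p < cut for p in proportions.values()):
        raise ValueError("reduction threshold too large for the existing proportions")
    updated = {rid: p - cut for rid, p in proportions.items()}
    share = reduction_threshold / len(new_ids)     # assumed even split among new IDs
    for rid in new_ids:
        updated[rid] = share
    return updated


# Hypothetical history: 124 holds 60% and 125 holds 40%; 123 is new.
result = apply_reduction({"124": 0.60, "125": 0.40}, ["123"])
print(result)
```

Because the freed share equals the total amount cut, the proportions still sum to one and the total sampling amount is unchanged.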

The above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can be modified, or some of their technical features can be replaced by equivalents, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the protection scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data quality management method based on artificial intelligence technology, characterized by comprising the following steps:

S3, defining a supervision mode for the responsibility number based on the risk level, and judging, according to the abnormal condition of the data subsequently corresponding to the responsibility number under the supervision mode, whether to execute a risk level regulation operation so as to re-judge the risk level of the responsibility number.

2. The method for managing data quality based on artificial intelligence technology according to claim 1, wherein the specified time includes:

weekly, monthly and quarterly.

3. The method for managing data quality based on artificial intelligence technology according to claim 1, wherein whether the primary and secondary sampled data are abnormal is determined according to the following method:

step three: calculating the standard deviation: $$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$ wherein: $\sigma$ is the standard deviation of the historical data and $x_i$ is the i-th historical data;

step four: setting a threshold constant and judging based on the Z-score algorithm: $$Z\text{-score} = \frac{x - \bar{x}}{\sigma}$$ wherein: Z-score is the decision value and $x$ is the currently extracted variable data; determining whether the Z-score is greater than the threshold constant; when it is greater than the threshold constant, $x$ is abnormal and a correction operation is performed on the abnormal data based on $\bar{x}$; when it is less than or equal to the threshold constant, $x$ is normal.

4. The method for managing data quality based on artificial intelligence technology according to claim 1, wherein the risk levels include a first level, a second level and a third level, and the risk levels of the first level, the second level and the third level are in a sequence from high to low, wherein the risk levels are determined as follows:

5. The method for managing data quality based on artificial intelligence technology according to claim 4, wherein the supervision mode is set according to a risk level, and the method comprises:

6. The method for managing data quality based on artificial intelligence technology according to claim 5, wherein when the risk level is the second level and the responsibility number is in the monitoring state, the number of abnormal data currently corresponding to the responsibility number and the number of abnormal data in the secondary sampled data are determined, a regulation interval is set, and whether to execute the risk level regulation operation is determined based on the regulation interval, wherein the risk level regulation operation is determined and executed as follows:

wherein p is the number of currently corresponding abnormal data and y is the number of abnormal data in the secondary sampled data; a three-step change sequence is set for the abnormal data corresponding to the responsibility number in the monitoring state; A, B and C respectively denote grades of the number of currently corresponding abnormal data, ordered in sequence from fewest to most; when p is in A after at most three changes, the background real-time monitoring operation is stopped and the risk level supervision operation is executed: the data next corresponding to the responsibility number is extracted, and whether to execute the risk level regulation operation is judged according to the amount of abnormal data.

7. The method according to claim 6, wherein when the interrupt limiting input state is formed, the next corresponding data of the responsibility number is extracted, and whether to execute the risk level controlling operation is determined according to the abnormal condition.

8. The method for managing data quality based on artificial intelligence technology according to claim 1, wherein when the primary sampled data is abnormal and multiple responsibility numbers correspond to it, the per-number sampling amount and the total sampling amount of the secondary sampled data are defined, and they are determined as follows: