CN115017969A - Data quality monitoring method and device for numerical label and electronic equipment - Google Patents

Data quality monitoring method and device for numerical label and electronic equipment

Info

Publication number
CN115017969A
Authority
CN
China
Prior art keywords
statistical
numerical
monitored
label
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210422610.5A
Other languages
Chinese (zh)
Inventor
侯宗元
张茜
胡立波
刘允侃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210422610.5A
Publication of CN115017969A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Abstract

The invention provides a data quality monitoring method and device for a numerical label, electronic equipment and a computer readable storage medium. The data quality monitoring method comprises the following steps: acquiring a current numerical label to be monitored, and acquiring a plurality of statistical indexes according to the current numerical label to be monitored; detecting the plurality of statistical indexes by using a fully trained isolated forest model to determine whether abnormal statistical indexes exist in the plurality of statistical indexes; and determining whether the current numerical label to be monitored is abnormal according to whether abnormal statistical indexes exist in the plurality of statistical indexes. The method can identify abnormal numerical labels in a timely and accurate manner, improving the timeliness and accuracy of data quality monitoring.

Description

Data quality monitoring method and device for numerical label and electronic equipment
Technical Field
The present invention relates to the field of data quality detection technologies, and in particular, to a data quality monitoring method and apparatus for a numerical label, an electronic device, and a computer-readable storage medium.
Background
A tag (label) is a form of data used to characterize a business entity. Describing a business entity through labels reflects its characteristics from multiple angles. For example, when a user is described, the corresponding labels are called user labels and cover aspects such as gender, age, region, hobbies, product preference and income. By data type, labels can be divided into numerical labels, enumerated value labels, text labels and the like. The value of a numerical label is a number, such as the user's age, income, or active time over the last 7 days.
The label structure is extremely simple: all labels are laid out flat around the business entity, are independent of one another, and are very easy to manage. Labels are generated in an analytical system and then imported into an application system for use. No matter how complex the computation logic and computation process of a label are, the label can still be accessed very efficiently in the application system. Data can be screened and analyzed through simple operations on different labels. For example, customer groups with different characteristics can be screened out through labels such as gender, age and region, and a profile of each customer group can then be obtained by analyzing it through other labels, so the whole operation is very simple and data use efficiency is greatly improved. As cleaned data, labels can be used directly as input for model training, which shortens data preparation time for modeling. Because of these characteristics, labels are used on a large scale: many companies have hundreds or thousands of labels, and large internet companies have labels on the scale of millions, so the timeliness and accuracy of quality monitoring for large-scale labels are very important.
Disclosure of Invention
The invention aims to provide a data quality monitoring method and device for a numerical label, electronic equipment and a computer readable storage medium, so as to solve the technical problem that data quality monitoring of large-scale numerical labels in the prior art is insufficiently timely and accurate.
The technical solution of the invention is as follows. A data quality monitoring method for a numerical label is provided, which comprises the following steps:
acquiring a current numerical label to be monitored, and acquiring a plurality of statistical indexes according to the current numerical label to be monitored;
detecting the plurality of statistical indexes by using a completely trained isolated forest model to determine whether abnormal statistical indexes exist in the plurality of statistical indexes;
and determining whether the current numerical label to be monitored is abnormal or not according to whether abnormal statistical indexes exist in the plurality of statistical indexes or not.
Further, obtaining a plurality of statistical indexes according to the current numerical label to be monitored includes:
sampling the current numerical type label to be monitored to obtain a sampled numerical type label to be monitored, and calculating the statistical indexes of the sampled numerical type label to be monitored to obtain a plurality of statistical indexes; and the current numerical label to be monitored is a numerical label with preset time granularity.
Further, calculating the statistical index of the sampled numerical label to be monitored, including:
and calculating the average value, the maximum value, the minimum value, the quantile values and the standard deviation of the tag data in the sampled numerical tag to be monitored.
Further, the training step of the isolated forest model comprises the following steps:
acquiring a numerical label in a history period of time, and acquiring a plurality of corresponding statistical indexes according to the numerical label in the history period of time;
and randomly selecting m statistical indexes from the corresponding statistical indexes as subsamples, putting the subsamples into root nodes of the isolated tree, and constructing leaf nodes by a recursion method to finish training of the isolated forest model.
Further, the constructing the leaf nodes by a recursive method to complete the training of the isolated forest model includes:
randomly appointing a dimension, randomly generating a cutting point in the data range of the current node, forming a hyperplane according to the cutting point, determining which branch of the left branch and the right branch of the current node the statistical index which is not in the current isolated tree belongs to according to the hyperplane, and putting the statistical index into the corresponding branch;
and re-executing the above steps to construct new leaf nodes until a leaf node contains only one statistical index or the isolated tree reaches the preset height, thereby finishing the training of the isolated forest model.
Further, determining which branch of the left branch and the right branch of the current node the statistical indicator not in the current isolated tree belongs to according to the hyperplane, and putting the statistical indicator into the corresponding branch, including:
and dividing the data space of the current node into 2 subspaces according to the hyperplane, placing the statistical indexes which are smaller than the cut point under the currently selected dimensionality on the left branch of the current node, and placing the statistical indexes which are larger than or equal to the cut point under the currently selected dimensionality on the right branch of the current node.
Further, determining whether the current numerical label to be monitored is abnormal according to whether an abnormal statistical index exists in the plurality of statistical indexes, includes: if an abnormal statistical index exists in the statistical indexes, the current numerical label to be monitored is abnormal, otherwise, the current numerical label to be monitored is not abnormal.
Another aspect of the invention provides a data quality monitoring device for a numerical label, which comprises a data acquisition module, an index abnormality monitoring module and a label abnormality monitoring module;
the data acquisition module is used for acquiring a current numerical label to be monitored and acquiring a plurality of statistical indexes according to the current numerical label to be monitored;
the index abnormality monitoring module is used for detecting the plurality of statistical indexes by utilizing a completely trained isolated forest model so as to determine whether abnormal statistical indexes exist in the plurality of statistical indexes;
and the label abnormality monitoring module is used for determining whether the current numerical label to be monitored is abnormal or not according to whether abnormal statistical indexes exist in the plurality of statistical indexes.
Another aspect of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the data quality monitoring method for a numerical label according to any one of the above aspects.
Another aspect of the present invention provides a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the method for monitoring data quality of a numerical label according to any one of the above aspects.
The invention has the beneficial effects that: acquiring a current numerical label to be monitored, and acquiring a plurality of statistical indexes according to the current numerical label to be monitored; detecting a plurality of statistical indexes by using the isolated forest model with complete training to determine whether abnormal statistical indexes exist in the plurality of statistical indexes; determining whether the numerical label to be monitored is abnormal or not according to whether the abnormal statistical indexes exist in the plurality of statistical indexes or not; by the scheme, the abnormal numerical value type label can be timely and accurately acquired, and timeliness and accuracy of data quality monitoring are improved.
Drawings
Fig. 1 is a schematic flow chart of a data quality monitoring method for a numerical label according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a data quality monitoring device of a numerical tag according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Fig. 1 is a schematic flow chart of a data quality monitoring method for a numeric tag according to a first embodiment of the present invention. It should be noted that the data quality monitoring method of the numerical label of the present invention is not limited to the flow sequence shown in fig. 1 if substantially the same result is obtained. As shown in fig. 1, the method for monitoring data quality of a numerical label mainly includes the following steps:
S101, acquiring a current numerical label to be monitored, and acquiring a plurality of statistical indexes according to the current numerical label to be monitored;
S102, detecting the plurality of statistical indexes by using a completely trained isolated forest model to determine whether abnormal statistical indexes exist in the plurality of statistical indexes;
S103, determining whether the current numerical label to be monitored is abnormal or not according to whether abnormal statistical indexes exist in the plurality of statistical indexes or not.
In one embodiment, the numerical tag includes tag attributes and tag data. The tag attributes may include a subject object type, a time caliber, a tag name, and the like, and the tag data may include detail records consisting of a time, an object ID, and a tag value. For example, the tag attributes of one numerical tag are: customer tag, December 1, 2021, current-day data, and customer consumption amount for the day; the tag data of that numerical tag consists of customer IDs and the corresponding customers' consumption amounts for the day, each record being one piece of specific detail data. As another example, the tag attributes of another numerical tag are: employee tag, December 25, 2021, month-to-date cumulative data, and the employee's cumulative attendance days for the month; the tag data of that numerical tag consists of employee IDs and the corresponding employees' attendance days within the specified date range. The tag attributes may also include other attributes, such as data source, service scenario, processing party, and online time. The time caliber may be, for example, year-to-date cumulative, historical cumulative, or weekly cumulative, and may also use a finer time unit such as the hour or minute.
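The split between tag attributes and tag data described above can be pictured with a small data structure. The following is only an illustrative sketch: the class names `TagAttributes` and `NumericalTag`, the field names, and the sample customer IDs and amounts are all assumptions made for this example and do not come from the patent.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict


@dataclass(frozen=True)
class TagAttributes:
    """Descriptive attributes of a tag (hypothetical field names)."""
    subject_type: str   # subject object type, e.g. "customer"
    time_caliber: str   # e.g. "current day", "month-to-date"
    tag_name: str
    as_of: date


@dataclass
class NumericalTag:
    """A numerical tag: attributes plus detail records (object ID -> value)."""
    attributes: TagAttributes
    data: Dict[str, float] = field(default_factory=dict)


# Made-up example mirroring the customer consumption tag in the text.
tag = NumericalTag(
    TagAttributes("customer", "current day", "daily consumption amount", date(2021, 12, 1)),
    {"C001": 120.5, "C002": 0.0, "C003": 89.9},
)
values = list(tag.data.values())  # the numbers that statistics are computed from
```

Keeping the attributes immutable and separate from the detail records makes it straightforward to store computed statistics under the same attribute key later on.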
In an optional embodiment, obtaining a plurality of statistical indicators according to the current numerical label to be monitored includes:
sampling the current numerical type label to be monitored to obtain a sampled numerical type label to be monitored, and calculating the statistical indexes of the sampled numerical type label to be monitored to obtain a plurality of statistical indexes; and the current numerical label to be monitored is a numerical label with preset time granularity.
It should be noted that the data quality monitoring method for the numerical label provided by the embodiment of the present invention is not limited to checking whether a single numerical label value (one piece of label data) is normal; rather, it monitors whether the numerical label at a given time granularity is normal as a whole, for example, whether the overall customer consumption amount data for the current day or for several days is normal.
In a specific embodiment, a batch of data is generated for a numerical label every day, so a large amount of data accumulates, and monitoring all of it directly would increase the monitoring cost. From a statistical point of view, as long as the sample is large enough, the statistical indexes calculated from the sampled numerical label will be very close to those of the full population (the numerical label to be monitored before sampling). For customer labels at internet scale, the data volume can reach hundreds of millions of records; computing the statistical indexes over the full data is then expensive in both resources and time, and sampling reduces both. The calculated statistical indexes are stored together with explicit tag attributes. For example, if the tag attributes include the subject object type, time caliber and tag name, the calculated statistical indexes are stored with an explicit subject object type, time caliber and tag name.
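The claim that a sufficiently large sample yields statistics close to those of the full data can be checked with a short experiment. This is a minimal sketch on synthetic data: the population size, the sample size, and the Gaussian distribution are arbitrary assumptions chosen for illustration and are not taken from the patent.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic stand-in for one day's tag values (assumed distribution).
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Simple random sample without replacement; 5% of the population here.
sample = rng.sample(population, 5_000)

pop_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
pop_std = statistics.pstdev(population)
sample_std = statistics.pstdev(sample)
```

With these settings the sample mean and standard deviation land close to the population values while touching only a fraction of the records, which is the cost saving the paragraph describes. Extreme statistics such as the maximum are less stable under sampling, so a real deployment would need to decide how to treat them.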
In an alternative embodiment, calculating the statistical index of the sampled numerical tags to be monitored includes:
and calculating the average value, the maximum value, the minimum value, the quantile values and the standard deviation of the tag data in the sampled numerical tag to be monitored.
For example, for the daily active duration of users of a website or app, the statistical indexes of the daily-granularity user detail data (tag data) may be calculated. The statistical indexes may include the average duration, maximum duration, minimum duration, 95th-percentile duration, 5th-percentile duration, the proportion of active users among all users, the standard deviation of the duration, and the like. The resulting record may be described as, for example: date 2022-01-01, average duration 1.2 hours, maximum duration 100 minutes, and so on.
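The statistical indexes listed above can be computed with the Python standard library. The sketch below is one possible reading: the function name `statistical_indexes`, the dictionary keys, the choice of the "inclusive" quantile method, and the use of the population standard deviation are assumptions for illustration, since the patent does not specify them.

```python
import statistics


def statistical_indexes(values):
    """Summarize one time granule of tag data (needs at least two values)."""
    ordered = sorted(values)
    # quantiles(n=100) returns 99 cut points; index 4 is the 5th percentile
    # and index 94 is the 95th percentile.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "max": ordered[-1],
        "min": ordered[0],
        "p05": cuts[4],
        "p95": cuts[94],
        "std": statistics.pstdev(ordered),  # population standard deviation
    }


# Made-up daily active durations in minutes.
durations = [10, 20, 30, 40, 50]
indexes = statistical_indexes(durations)
```

Each call produces one row of indexes for one date, which is the shape of input the anomaly detector is later trained and evaluated on.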
In an alternative embodiment, the training step of the isolated forest model comprises:
obtaining a numerical label in a history period of time, and obtaining a plurality of corresponding statistical indexes according to the numerical label in the history period of time;
and randomly selecting m statistical indexes from the corresponding statistical indexes as subsamples, putting the subsamples into root nodes of the isolated trees, and constructing leaf nodes by a recursion method so as to finish training of the isolated forest model.
It should be noted that m is the number of statistical indexes selected as the subsample, and its value is smaller than the total number of statistical indexes. Outliers are data points that differ significantly from the other data points in the data set, and anomaly detection is the process of finding such points. Compared with normal data points, abnormal data points can be isolated with fewer random feature splits. A batch of data in a sample space is densely distributed in some regions and sparsely distributed in others; if most samples lie in the dense regions, the points in the sparse regions are the outliers. The isolated forest model can detect abnormal points in a data set, that is, it can detect abnormal statistical indexes among the statistical indexes of the preset time granularity.
In an optional embodiment, the constructing the leaf nodes by a recursive method to complete the training of the isolated forest model includes:
randomly appointing a dimension, randomly generating a cutting point in the data range of the current node, forming a hyperplane according to the cutting point, determining which branch of the left branch and the right branch of the current node the statistical index which is not in the current isolated tree belongs to according to the hyperplane, and putting the statistical index into the corresponding branch;
and re-executing the above steps to construct new leaf nodes until a leaf node contains only one statistical index or the isolated tree reaches the preset height, thereby finishing the training of the isolated forest model.
In one embodiment, the training of the isolated forest model comprises: randomly selecting m points from the training data as a sub-sample and putting them into the root node of an isolated tree; randomly specifying a dimension and randomly generating a cut point within the data range of the current node, that is, between the minimum and maximum values of the specified dimension among the data of the current node; using the cut point to define a hyperplane that divides the data space of the current node into 2 subspaces; placing points whose value in the selected dimension is smaller than the cut point on the left branch of the current node, and points whose value is greater than or equal to the cut point on the right branch; and recursively repeating the above steps on the left and right branch nodes to construct new leaf nodes, until a leaf node holds only one data point or the tree has grown to the set height.
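The tree-building steps above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the patent's implementation: the dictionary-based node layout, the helper names `build_tree` and `path_length`, the synthetic two-dimensional data, and the settings (m = 32, height limit 8, 100 trees) are all assumptions.

```python
import random


def build_tree(points, height, max_height, rng):
    """Recursively build one isolated tree over a list of equal-length tuples."""
    if len(points) <= 1 or height >= max_height:
        return {"size": len(points)}              # leaf node
    dim = rng.randrange(len(points[0]))           # randomly specified dimension
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}              # simplification: stop if no cut is possible
    cut = rng.uniform(lo, hi)                     # random cut point within the node's range
    left = [p for p in points if p[dim] < cut]    # smaller than the cut point: left branch
    right = [p for p in points if p[dim] >= cut]  # greater than or equal: right branch
    return {"dim": dim, "cut": cut,
            "left": build_tree(left, height + 1, max_height, rng),
            "right": build_tree(right, height + 1, max_height, rng)}


def path_length(tree, point):
    """Count the splits a point passes through before reaching a leaf."""
    depth = 0
    while "dim" in tree:
        tree = tree["left"] if point[tree["dim"]] < tree["cut"] else tree["right"]
        depth += 1
    return depth


rng = random.Random(42)
inlier, outlier = (0.0, 0.0), (10.0, 10.0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(62)] + [inlier, outlier]
m = 32  # sub-sample size per tree (illustrative)
forest = [build_tree(rng.sample(data, m), 0, 8, rng) for _ in range(100)]
mean_inlier = sum(path_length(t, inlier) for t in forest) / len(forest)
mean_outlier = sum(path_length(t, outlier) for t in forest) / len(forest)
```

Averaged over the forest, the far-away point reaches a leaf after fewer splits than the point in the dense cluster, which is the property the model relies on to separate abnormal statistical indexes from normal ones.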
It should be noted that the isolated forest is similar to a random forest: each tree is trained on a random sample of the data, which ensures that the variance across the forest is large enough, that is, the more the trees differ from one another, the better. When an isolated forest is constructed, two parameters need to be set: the number of trees and the maximum sample size for each tree. The isolated forest measures whether a data point x is an abnormal point by introducing an anomaly score function s(x, n):
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$
where $E(h(x))$ is the expected value of the path length $h(x)$ of x over the plurality of trees; $c(n) = 2H(n-1) - \frac{2(n-1)}{n}$ is the average path length of a tree built on a data set containing n samples and is used to normalize the path length of the record x; $H(k) = \ln(k) + \xi$ is the harmonic number; and $\xi$ is the Euler constant.
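The normalization can be made concrete with a short calculation. The sketch below assumes the standard isolation forest definitions, s(x, n) = 2^(-E(h(x)) / c(n)) with c(n) = 2H(n-1) - 2(n-1)/n and H(k) approximated by ln(k) + ξ; the function names are hypothetical and the guard for n < 2 is an added assumption.

```python
import math

EULER_GAMMA = 0.5772156649015329  # the Euler constant, written as xi in the text


def harmonic(k):
    """Approximate harmonic number H(k) = ln(k) + xi."""
    return math.log(k) + EULER_GAMMA


def c(n):
    """Average path length used to normalize, for a sub-sample of n >= 2 points."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)); values near 1 indicate likely anomalies."""
    return 2.0 ** (-mean_path_length / c(n))


score_short = anomaly_score(2.0, 256)       # isolated after very few splits
score_average = anomaly_score(c(256), 256)  # path length equal to the average
```

A point whose expected path length equals c(n) scores exactly 0.5, shorter paths push the score toward 1, and longer paths push it toward 0.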
In an optional embodiment, determining to which branch of the left branch and the right branch of the current node the statistical indicator not in the current isolated tree belongs according to the hyperplane, and placing the statistical indicator into the corresponding branch includes:
and dividing the data space of the current node into 2 subspaces according to the hyperplane, placing the statistical indexes smaller than the cut point under the currently selected dimensionality on the left branch of the current node, and placing the statistical indexes larger than or equal to the cut point under the currently selected dimensionality on the right branch of the current node.
In one embodiment, part of the code for training the isolated forest model is as follows:
from sklearn.ensemble import IsolationForest
X_train = ...  # training data: rows of statistical indexes
clf = IsolationForest()
clf.fit(X_train)  # train the model
X_test = ...  # statistical indexes to be detected
y_pred_test = clf.predict(X_test)  # 1 = normal, -1 = abnormal
In the above code, the first line imports the IsolationForest class from sklearn.ensemble. X_train represents the input training data, which consists of a plurality of statistical index data; it is multidimensional, and the number of dimensions can be adjusted according to actual needs. clf.fit(X_train) trains the model, and clf.predict(X_test) performs detection on the data to be checked. The isolated forest model outputs a score for each row of data (the data under each statistical index); a smaller score indicates more abnormal data, and a negative score indicates that the row of data is abnormal.
In an optional embodiment, determining whether the current to-be-monitored numerical label is abnormal according to whether an abnormal statistical indicator exists in the plurality of statistical indicators includes: if an abnormal statistical index exists in the statistical indexes, the current numerical label to be monitored is abnormal, otherwise, the current numerical label to be monitored is not abnormal.
A fully trained isolated forest model can detect whether the statistical indexes at a preset time granularity (such as one day or several days) are abnormal, and thereby determine whether the corresponding numerical label is abnormal: if any of the statistical indexes is abnormal, the numerical label is abnormal; otherwise, it is not.
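The decision rule is a simple "any abnormal index" check. The sketch below assumes that each statistical index has already received a score where a negative value means abnormal, as the description of the model output states; the score values and the function name `abnormal_indexes` are made up for illustration.

```python
def abnormal_indexes(scores):
    """Return the names of indexes whose score is negative (treated as abnormal)."""
    return [name for name, score in scores.items() if score < 0]


# Hypothetical scores for one day's statistical indexes.
scores = {"mean": 0.12, "max": -0.08, "p95": 0.05, "std": 0.02}

flagged = abnormal_indexes(scores)
tag_abnormal = len(flagged) > 0  # one abnormal index is enough to flag the tag
```

Returning the flagged index names rather than a bare boolean keeps the information needed to explain why a tag was reported.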
In a specific embodiment, in addition to tag attributes such as the subject object type, time caliber, and tag name, date dimensions such as the day of the week, whether the day is a working day, and whether the day is a holiday can be abstracted and used as tag attributes. Each tag attribute has corresponding tag data, and the numerical tag formed by the tag attributes and their tag data can serve either as the raw training data for the isolated forest model or as the numerical tag to be monitored. If a tag has been online for a long time and enough tag data has been collected, the isolated forest model is trained using only that numerical tag; during training, the numerical tag is sampled and the statistical indexes of the sampled numerical tag are obtained. If a tag has only just come online, very little data can be collected, and tags of other types may additionally be used when training the isolated forest model. For example, if the tag data is the number of clicks per user per day on a newly launched page and no historical data exists for it, tag data similar to it may be used to train the model.
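The date dimensions mentioned above can be derived directly from a calendar date. This is a hedged sketch: the `HOLIDAYS` set is a hypothetical placeholder for a real holiday calendar, and treating Monday to Friday as working days is an assumption that ignores adjusted working weekends.

```python
from datetime import date

# Hypothetical holiday calendar; a real system would load this from a data source.
HOLIDAYS = {date(2022, 1, 1)}


def date_features(d, holidays=HOLIDAYS):
    """Derive date-dimension attributes for one day of tag data."""
    is_holiday = d in holidays
    return {
        "day_of_week": d.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "is_working_day": d.weekday() < 5 and not is_holiday,
        "is_holiday": is_holiday,
    }


new_year = date_features(date(2022, 1, 1))
first_monday = date_features(date(2022, 1, 3))
```

Adding these features alongside the statistical indexes lets a detector treat, for instance, a quiet holiday as normal rather than flagging it against weekday levels.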
In another embodiment, after the abnormal numerical label is obtained, the user can be notified of the data abnormality by docking the mail system or other instant message system; and the detection result data of the numerical type label can be written into the database, and a report is made based on the data to monitor the label data condition.
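Writing detection results to a database for later reporting could look like the following. This is a minimal sketch using an in-memory SQLite database; the table name `tag_check`, its columns, and the sample tag name are assumptions, and the mail or instant-message notification step is deliberately left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for illustration only
conn.execute(
    "CREATE TABLE tag_check (tag_name TEXT, check_date TEXT, is_abnormal INTEGER)"
)


def record_result(conn, tag_name, check_date, is_abnormal):
    """Persist one detection result so reports can be built from it."""
    conn.execute(
        "INSERT INTO tag_check VALUES (?, ?, ?)",
        (tag_name, check_date, int(is_abnormal)),
    )
    conn.commit()


record_result(conn, "daily_active_duration", "2022-01-01", True)
record_result(conn, "daily_active_duration", "2022-01-02", False)

# A report query: which tag/date pairs were flagged as abnormal?
abnormal_rows = conn.execute(
    "SELECT tag_name, check_date FROM tag_check WHERE is_abnormal = 1"
).fetchall()
```

Storing both normal and abnormal results, not only the alerts, makes it possible to chart the abnormality rate of each tag over time.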
The embodiment of the invention provides a data quality monitoring method for a numerical label, which comprises: acquiring the current numerical label to be monitored, and acquiring a plurality of statistical indexes according to it; detecting the plurality of statistical indexes by using a fully trained isolated forest model to determine whether abnormal statistical indexes exist among them; and determining whether the current numerical label to be monitored is abnormal according to whether abnormal statistical indexes exist. With this scheme, abnormal numerical labels can be identified in a timely and accurate manner, which improves the timeliness and accuracy of data quality monitoring. The data quality of numerical labels can be monitored intelligently and automatically and potential label data problems can be found in time, which improves label data quality and helps keep the downstream functional modules that use the labels operating normally.
The data quality monitoring method of the numerical label provided by the embodiment of the invention can be built on artificial intelligence, with related data acquired and processed using artificial intelligence technology, so as to realize unattended, artificial-intelligence-based data quality monitoring of numerical labels. Here, Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use that knowledge to obtain the best result.
The artificial intelligence base technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Fig. 2 is a schematic structural diagram of a data quality monitoring apparatus for a numerical label according to a second embodiment of the present invention. As shown in fig. 2, the data quality monitoring apparatus 20 for numerical labels includes a data obtaining module 21, an index anomaly monitoring module 22 and a label anomaly monitoring module 23; the data obtaining module 21 is configured to acquire a current numerical label to be monitored, and acquire a plurality of statistical indexes according to the current numerical label to be monitored; the index anomaly monitoring module 22 is configured to detect the plurality of statistical indexes by using a fully trained isolated forest model to determine whether an abnormal statistical index exists in the plurality of statistical indexes; the label anomaly monitoring module 23 is configured to determine whether the current numerical label to be monitored is abnormal according to whether an abnormal statistical index exists in the plurality of statistical indexes.
The numerical label comprises label attributes and label data. The label attributes may include a subject object type, a time caliber, a label name, and the like, and the label data may include detail records consisting of a time, an object ID, and a label value. For example, the label attributes of one numerical label are: customer label, December 1, 2021, current-day data, and customer consumption amount for the day; the label data of that numerical label consists of customer IDs and the corresponding customers' consumption amounts for the day, each record being one piece of specific detail data. As another example, the label attributes of another numerical label are: employee label, December 25, 2021, month-to-date cumulative data, and the employee's cumulative attendance days for the month; the label data of that numerical label consists of employee IDs and the corresponding employees' attendance days within the specified date range. The label attributes may also include other attributes, such as data source, service scenario, processing party, and online time. The time caliber may be, for example, year-to-date cumulative, historical cumulative, or weekly cumulative, and may also use a finer time unit such as the hour or minute.
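The division of work among the three modules can be sketched as three small functions chained together. Everything named here is hypothetical: the function names, the reduced set of indexes, and in particular `range_detector`, which uses fixed bounds as a stand-in so the example stays self-contained. It is not the trained isolated forest model that the index anomaly monitoring module is described as using.

```python
import statistics


def acquire_indexes(values):
    """Data obtaining module: derive statistical indexes from tag data."""
    return {"mean": statistics.fmean(values), "max": max(values), "min": min(values)}


def detect_abnormal_indexes(indexes, detector):
    """Index anomaly monitoring module: flag each index with the given detector."""
    return {name: detector(name, value) for name, value in indexes.items()}


def tag_is_abnormal(flags):
    """Label anomaly monitoring module: abnormal if any index is abnormal."""
    return any(flags.values())


# Stand-in detector with made-up bounds (NOT an isolated forest).
BOUNDS = {"mean": (0.0, 200.0), "max": (0.0, 1000.0), "min": (0.0, 50.0)}


def range_detector(name, value):
    lo, hi = BOUNDS[name]
    return not (lo <= value <= hi)


indexes = acquire_indexes([12.0, 30.5, 45.0, 5000.0])
flags = detect_abnormal_indexes(indexes, range_detector)
result = tag_is_abnormal(flags)
```

Passing the detector in as a parameter keeps the module boundaries visible: the same pipeline would accept a function backed by a trained model in place of the fixed bounds.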
Further, the data obtaining module 21 is further configured to sample the current to-be-monitored numerical label to obtain a sampled to-be-monitored numerical label, and calculate a statistical index of the sampled to-be-monitored numerical label to obtain the plurality of statistical indexes; and the current numerical label to be monitored is a numerical label with preset time granularity.
In the data quality monitoring method for the numerical label provided by the embodiment of the present invention, it is monitored whether not only a numerical label is normal, but also a time-granular numerical label is normal, for example, the customer consumption amount in the current day or several days, and the overall data of the label in the current day or several days is normal.
For a numerical label, new data is generated every day, so a large amount of data accumulates, and monitoring all of it directly would increase the monitoring cost. From a statistical point of view, as long as the sample is large enough, the statistical indexes calculated from the sampled numerical label are very close to those calculated from the numerical label to be monitored before sampling. For customer labels in Internet services, the data volume can reach hundreds of millions of records; computing the statistical indexes over the full data then costs considerable resources and time, and sampling reduces both. The calculated statistical indexes are stored together with well-defined label attributes. For example, if the label attributes include the subject object type, time scope and label name, the calculated statistical indexes are stored with an explicit subject object type, time scope and label name.
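The claim that sampled statistics track the full-data statistics can be checked with a small experiment. The sketch below is illustrative only: the log-normal synthetic data, the population size and the sample size of 10,000 are assumptions chosen for the example, and simple random sampling is used because the text does not specify a sampling scheme.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for one day of label values (assumed log-normal amounts).
population = [random.lognormvariate(3.0, 1.0) for _ in range(200_000)]

# Simple random sample without replacement; 10,000 is an arbitrary choice.
sample = random.sample(population, 10_000)

full_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# Relative difference between the sampled mean and the full-data mean.
relative_error = abs(sample_mean - full_mean) / full_mean
```

With a sample of this size the relative error of the mean is typically on the order of a percent, while the sample is twenty times cheaper to scan than the full data.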
Further, the data obtaining module 21 is further configured to calculate an average value, a maximum value, a minimum value, quantile values, and a standard deviation of the label data in the sampled numerical label to be monitored.
The label data in the numerical label may be described by statistical indexes, which may include an average value, a maximum value, a minimum value, quantile values, and a standard deviation. For example, for the daily active duration of users of a website or APP, the statistical indexes of the user detail data (label data) at daily granularity may be calculated; these may be the average duration, maximum duration, minimum duration, 95th percentile duration, 5th percentile duration, proportion of active users to total users, and standard deviation of the duration. The label data for a given day may then be described as, for example, date 2022-01-01, average duration 1.2 hours, maximum duration 100 minutes, and so on.
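As a rough illustration of how such statistical indexes could be computed for one day of label values, the following sketch uses only the Python standard library. The function name `compute_indicators`, the dictionary keys and the sample durations are assumptions made for the example; the population standard deviation and the "inclusive" quantile method are likewise illustrative choices that the text does not prescribe.

```python
import statistics
from typing import Dict, Sequence


def compute_indicators(values: Sequence[float]) -> Dict[str, float]:
    """Summarise one batch of label values as a dictionary of statistical indexes."""
    # quantiles(n=100) returns the 99 cut points P1..P99,
    # so index 4 is the 5th percentile and index 94 is the 95th percentile.
    percentiles = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "max": max(values),
        "min": min(values),
        "p05": percentiles[4],
        "p95": percentiles[94],
        "std": statistics.pstdev(values),
    }


# Hypothetical daily active durations in minutes.
durations = [5.0, 12.0, 30.0, 45.0, 60.0, 72.0, 80.0, 95.0, 100.0]
indicators = compute_indicators(durations)
```

Each day (or other time granularity) yields one such dictionary, which can be stored alongside the label attributes and later fed to the anomaly detection model.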
Further, the data quality monitoring device 20 for the numerical label further includes a model training module, which is configured to obtain historical numerical labels over a period of time and to obtain a plurality of corresponding statistical indexes from those historical numerical labels; and to randomly select m statistical indexes from the corresponding statistical indexes as sub-samples, put the sub-samples into the root node of an isolated tree, and construct leaf nodes recursively to complete the training of the isolated forest model.
Here m is the number of statistical indexes selected as sub-samples, and m is less than the total number of statistical indexes. Outliers are data points that differ significantly from the other data points in a data set, and anomaly detection is the process of finding such points. Abnormal data points can be isolated with fewer random feature splits than normal data points. In a batch of data in a sample space, some regions are densely populated and others sparsely populated; if most samples lie in dense regions, the points in the sparse regions are the outliers. Outliers can be detected in a data set by the isolated forest model, that is, abnormal statistical indexes can be detected among the statistical indexes at the preset time granularity.
Further, the model training module is also configured to randomly select a dimension, randomly generate a cut point within the data range of the current node, form a hyperplane from the cut point, determine from the hyperplane which of the left and right branches of the current node each statistical index not yet placed in the current isolated tree belongs to, and place the statistical index into the corresponding branch; and to repeat these steps, constructing new leaf nodes until a leaf node holds only one statistical index or the isolated tree reaches the preset height, thereby completing the training of the isolated forest model.
The training of the isolated forest model comprises the following steps: randomly selecting m points from the training data as sub-samples and putting them into the root node of an isolated tree; randomly selecting a dimension and randomly generating a cut point within the data range of the current node, the cut point lying between the minimum and maximum values of the selected dimension among the data of the current node; the cut point defines a hyperplane that divides the data space of the current node into 2 subspaces; placing the points smaller than the cut point in the selected dimension on the left branch of the current node, and the points greater than or equal to the cut point on the right branch; and repeating the above steps recursively on the left and right branch nodes, constructing new leaf nodes until a leaf node holds only one data point or the tree has grown to the set height.
The isolated forest is similar to the random forest in that each tree is trained on randomly sampled data, which keeps the trees of the forest sufficiently different from one another. When constructing an isolated forest, two parameters need to be set: the number of trees and the maximum sample size per tree. The isolated forest measures whether a data point x is an outlier by introducing an anomaly score function s(x, n):
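The tree construction steps above can be sketched in a few lines of Python. This is a simplified, illustrative version of one isolation tree (the "isolated tree" of the text), not the implementation of the embodiment: the names `Node`, `build_tree` and `path_length`, the two-dimensional indicator vectors, the sub-sample size of 16 and the height limit of 8 are all assumptions made for the example.

```python
import random
from typing import List, Optional, Sequence

Point = Sequence[float]


class Node:
    """A node of an isolation tree; leaves keep only the number of points they hold."""

    def __init__(self, size: int, dim: Optional[int] = None,
                 cut: Optional[float] = None,
                 left: "Optional[Node]" = None,
                 right: "Optional[Node]" = None) -> None:
        self.size, self.dim, self.cut = size, dim, cut
        self.left, self.right = left, right


def build_tree(points: List[Point], height: int, max_height: int,
               rng: random.Random) -> Node:
    # Stop when the node holds at most one point or the height limit is reached.
    if len(points) <= 1 or height >= max_height:
        return Node(size=len(points))
    dim = rng.randrange(len(points[0]))           # randomly selected dimension
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                                  # no cut can separate these points
        return Node(size=len(points))
    cut = rng.uniform(lo, hi)                     # random cut between min and max
    left = [p for p in points if p[dim] < cut]    # smaller than the cut: left branch
    right = [p for p in points if p[dim] >= cut]  # greater or equal: right branch
    return Node(len(points), dim, cut,
                build_tree(left, height + 1, max_height, rng),
                build_tree(right, height + 1, max_height, rng))


def path_length(point: Point, node: Node, depth: int = 0) -> int:
    """Number of edges traversed before the point reaches a leaf."""
    if node.left is None or node.right is None:
        return depth
    child = node.left if point[node.dim] < node.cut else node.right
    return path_length(point, child, depth + 1)


rng = random.Random(42)
# Hypothetical (mean, standard deviation) indicator vectors; the last one is far away.
training = [(10.0 + 0.1 * i, 2.0 + 0.01 * i) for i in range(31)] + [(50.0, 9.0)]
subsample = rng.sample(training, 16)              # m = 16 sub-samples for this tree
tree = build_tree(list(subsample), 0, max_height=8, rng=rng)
depths = [path_length(p, tree) for p in subsample]
```

A forest repeats this construction for many trees, each on its own random sub-sample; points that are isolated after few cuts, and therefore have short average path lengths, are the candidates for outliers.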
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$$

where $E(h(x))$ is the expected value of the path length of $x$ over the trees; $c(n) = 2H(n-1) - \frac{2(n-1)}{n}$ is the average path length of a tree built on a data set containing $n$ samples and is used to normalize the path length of the record $x$; $H(k) = \ln(k) + \xi$ is the harmonic number; and $\xi$ is Euler's constant.
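For illustration, the anomaly score function s(x, n) and its normalizing term c(n) can be evaluated numerically as below. This is a sketch under stated assumptions: the function names are invented for the example, the harmonic number uses the logarithmic approximation given in the text, and the value n = 256 and the path lengths 2.0 and 20.0 are arbitrary inputs.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant


def harmonic(k: int) -> float:
    """Approximate harmonic number H(k) = ln(k) + Euler's constant."""
    return math.log(k) + EULER_GAMMA


def c(n: int) -> float:
    """Average path length of a tree built on n samples, used for normalisation."""
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(expected_path_length: float, n: int) -> float:
    """s(x, n) = 2 ** (-E(h(x)) / c(n)); requires n >= 2 so that c(n) > 0."""
    return 2.0 ** (-expected_path_length / c(n))


n = 256
score_short = anomaly_score(2.0, n)      # isolated after very few cuts
score_average = anomaly_score(c(n), n)   # path length equal to the average
score_long = anomaly_score(20.0, n)      # deep in a dense region
```

A point whose expected path length equals c(n) scores exactly 0.5; shorter paths push the score towards 1 (likely outlier) and longer paths push it towards 0 (likely normal).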
Further, the model training module is further configured to divide the data space of the current node into 2 subspaces according to the hyperplane, place the statistical indicator smaller than the cut point in the currently selected dimension on the left branch of the current node, and place the statistical indicator greater than or equal to the cut point in the currently selected dimension on the right branch of the current node.
Further, the tag anomaly monitoring module 23 is further configured to determine that the current to-be-monitored numerical tag is anomalous when an anomaly statistical indicator exists in the plurality of statistical indicators, and otherwise, determine that the current to-be-monitored numerical tag is not anomalous.
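The label-level decision described above reduces to an "any indicator abnormal" rule. The following sketch illustrates it under assumptions that the text does not state: each statistical index is taken to have its own anomaly score between 0 and 1, and the cut-off of 0.6 (`SCORE_THRESHOLD`) is a hypothetical value, as are the function names and the sample scores.

```python
from typing import Dict

SCORE_THRESHOLD = 0.6  # assumed cut-off; the text does not specify a value


def abnormal_indicators(scores: Dict[str, float],
                        threshold: float = SCORE_THRESHOLD) -> Dict[str, float]:
    """Return the statistical indexes whose anomaly score exceeds the threshold."""
    return {name: s for name, s in scores.items() if s > threshold}


def label_is_abnormal(scores: Dict[str, float],
                      threshold: float = SCORE_THRESHOLD) -> bool:
    """The label is abnormal if at least one statistical index is abnormal."""
    return bool(abnormal_indicators(scores, threshold))


# Hypothetical per-index anomaly scores for one day of a numerical label.
today = {"mean": 0.48, "max": 0.71, "min": 0.45, "std": 0.52}
flagged = abnormal_indicators(today)
```

Returning the flagged indexes as well as the boolean verdict makes it easier to see which statistic triggered the alert.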
A fully trained isolated forest model is used to detect whether the statistical indexes at a preset time granularity (for example one day or several days) are abnormal, and from this to judge whether the corresponding numerical label is abnormal: if any of the statistical indexes is abnormal, the numerical label is abnormal; otherwise it is not. In addition to label attributes such as the subject object type, time scope and label name, date information such as the day of the week, whether the day is a working day and whether it is a holiday can be abstracted into a date dimension and used as label attributes. Each label attribute has corresponding label data, and the numerical label formed by the label attributes and their label data can serve either as raw training data for the isolated forest model or as the numerical label to be monitored. If a label has been online for a long time and enough label data has been collected, the isolated forest model is trained using only that numerical label; during training, the numerical label is sampled and the statistical indexes of the sampled numerical label are calculated. If a label has only just gone online and little data can be collected, other types of labels can also be used when training the isolated forest model, in addition to attributes such as the subject object type, time scope and label name. For example, if the label data is the number of clicks per user per day on a newly launched page and no historical data exists, data from similar labels may be used to train the model.
The embodiment of the invention provides a data quality monitoring device for a numerical label. The data obtaining module obtains the current numerical label to be monitored and derives a plurality of statistical indexes from it; the index abnormality monitoring module examines the plurality of statistical indexes with a fully trained isolated forest model to determine whether any of them is abnormal; and the label abnormality monitoring module determines whether the numerical label to be monitored is abnormal according to whether an abnormal statistical index exists among the plurality of statistical indexes. Abnormal numerical labels can thus be identified promptly and accurately, improving the timeliness and accuracy of data quality monitoring.
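The date-dimension attributes mentioned above (day of the week, working day, holiday) could be derived from a calendar date roughly as follows. The sketch is illustrative: the function name `date_attributes`, the integer encoding and the one-entry holiday calendar `HOLIDAYS` are assumptions, and a real deployment would need an authoritative holiday and make-up workday calendar.

```python
from datetime import date
from typing import Dict, FrozenSet

# Hypothetical holiday calendar; a real deployment would load an official one.
HOLIDAYS: FrozenSet[date] = frozenset({date(2022, 1, 1)})


def date_attributes(day: date,
                    holidays: FrozenSet[date] = HOLIDAYS) -> Dict[str, int]:
    """Encode date-dimension attributes as small integers usable as model features."""
    is_holiday = day in holidays
    is_weekend = day.weekday() >= 5       # Monday is 0, so 5 and 6 are the weekend
    return {
        "day_of_week": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "is_workday": int(not is_weekend and not is_holiday),
        "is_holiday": int(is_holiday),
    }


features = date_attributes(date(2022, 1, 3))
```

Attaching these attributes to each day's statistical indexes lets the model compare a Monday with other Mondays, or a holiday with other holidays, instead of treating every day alike.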
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. As shown in fig. 3, the electronic device 30 includes a processor 31 and a memory 32 communicatively coupled to the processor 31.
The memory 32 stores program instructions for implementing the data quality monitoring method of the numerical tag of any of the above embodiments.
The processor 31 is configured to execute the program instructions stored in the memory 32 to perform data quality monitoring of the numerical label.
The processor 31 may also be referred to as a CPU (Central Processing Unit). The processor 31 may be an integrated circuit chip having signal processing capabilities. The processor 31 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 32 can be used for storing the computer programs and/or modules, and the processor 31 can implement various functions of the electronic device by running or executing the computer programs and/or modules stored in the memory 32 and calling the data stored in the memory 32. The memory 32 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like.
The memory 32 may be integrated in the processor 31 or may be provided separately from the processor 31.
A fourth embodiment of the present invention provides a storage medium storing program instructions capable of implementing all the methods described above; the storage medium may be non-volatile or volatile. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM). The program instructions may be stored in the storage medium in the form of a software product and include several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, or terminal devices such as a computer, a server, a mobile phone, or a tablet.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into modules is only a division by logical function, and an actual implementation may divide them differently: a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be electrical, mechanical or of another form.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each module may exist alone physically, or two or more modules may be integrated into one unit. The integrated unit can be realized in the form of hardware or in the form of a software functional unit. The above description is only an embodiment of the present invention and is not intended to limit its scope; any equivalent structure or equivalent process transformation made on the basis of the present invention, or its direct or indirect application in other related technical fields, is likewise included in the scope of protection of the present invention.
While the foregoing is directed to embodiments of the present invention, it will be understood by those skilled in the art that various changes may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A data quality monitoring method of a numerical label is characterized by comprising the following steps:
acquiring a current numerical label to be monitored, and acquiring a plurality of statistical indexes according to the current numerical label to be monitored;
detecting the plurality of statistical indexes by using a completely trained isolated forest model to determine whether abnormal statistical indexes exist in the plurality of statistical indexes;
and determining whether the current numerical label to be monitored is abnormal or not according to whether abnormal statistical indexes exist in the plurality of statistical indexes or not.
2. The method for monitoring the data quality of the numerical label according to claim 1, wherein obtaining a plurality of statistical indicators according to the current numerical label to be monitored comprises:
sampling the current numerical label to be monitored to obtain a sampled numerical label to be monitored, and calculating the statistical indexes of the sampled numerical label to be monitored to obtain a plurality of statistical indexes; and the current numerical label to be monitored is a numerical label with preset time granularity.
3. The method for monitoring data quality of a numerical label according to claim 2, wherein calculating the statistical indicator of the sampled numerical label to be monitored comprises:
and calculating the average value, the maximum value, the minimum value, the quantile value and the standard deviation of the tag data in the sampled numerical tag to be monitored.
4. The data quality monitoring method of the numerical labels as claimed in claim 1, wherein the training step of the isolated forest model comprises:
obtaining numerical labels from a historical period of time, and obtaining a plurality of corresponding statistical indexes according to the numerical labels from the historical period of time;
and randomly selecting m statistical indexes from the corresponding statistical indexes as subsamples, putting the subsamples into root nodes of the isolated tree, and constructing leaf nodes by a recursion method to finish training of the isolated forest model.
5. The data quality monitoring method for the numerical labels as claimed in claim 4, wherein the constructing leaf nodes by a recursive method to complete the training of the isolated forest model comprises:
randomly appointing a dimension, randomly generating a cutting point in the data range of the current node, forming a hyperplane according to the cutting point, determining which branch of the left branch and the right branch of the current node the statistical index which is not in the current isolated tree belongs to according to the hyperplane, and putting the statistical index into the corresponding branch;
and re-executing the steps, and constructing new leaf nodes until a leaf node holds only one statistical index or the isolated tree reaches the preset height, thereby finishing the training of the isolated forest model.
6. The method for monitoring data quality of numerical labels according to claim 5, wherein determining, according to the hyperplane, which of the left branch and the right branch of the current node the statistical index which is not in the current isolated tree belongs to, and placing the statistical index into the corresponding branch comprises:
and dividing the data space of the current node into 2 subspaces according to the hyperplane, placing the statistical indexes which are smaller than the cut point under the currently selected dimensionality on the left branch of the current node, and placing the statistical indexes which are larger than or equal to the cut point under the currently selected dimensionality on the right branch of the current node.
7. The method for monitoring data quality of a numerical label according to claim 1, wherein determining whether the current numerical label to be monitored is abnormal according to whether an abnormal statistical indicator exists in the plurality of statistical indicators comprises: if abnormal statistical indexes exist in the statistical indexes, the current numerical label to be monitored is abnormal, otherwise, the current numerical label to be monitored is not abnormal.
8. A data quality monitoring device of a numerical label is characterized by comprising a data acquisition module, an index abnormity monitoring module and a label abnormity monitoring module;
the data acquisition module is used for acquiring a current numerical label to be monitored and acquiring a plurality of statistical indexes according to the current numerical label to be monitored;
the index abnormality monitoring module is used for detecting the plurality of statistical indexes by utilizing a completely trained isolated forest model so as to determine whether abnormal statistical indexes exist in the plurality of statistical indexes;
and the tag abnormity monitoring module is used for determining whether the numerical tag to be monitored is abnormal or not according to whether the abnormal statistical indexes exist in the plurality of statistical indexes.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the data quality monitoring method of the numerical tag according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements a data quality monitoring method of a numeric tag according to any one of claims 1 to 7.
CN202210422610.5A 2022-04-21 2022-04-21 Data quality monitoring method and device for numerical label and electronic equipment Pending CN115017969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210422610.5A CN115017969A (en) 2022-04-21 2022-04-21 Data quality monitoring method and device for numerical label and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210422610.5A CN115017969A (en) 2022-04-21 2022-04-21 Data quality monitoring method and device for numerical label and electronic equipment

Publications (1)

Publication Number Publication Date
CN115017969A true CN115017969A (en) 2022-09-06

Family

ID=83066493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210422610.5A Pending CN115017969A (en) 2022-04-21 2022-04-21 Data quality monitoring method and device for numerical label and electronic equipment

Country Status (1)

Country Link
CN (1) CN115017969A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116028481A (en) * 2023-03-30 2023-04-28 紫金诚征信有限公司 Data quality detection method, device, equipment and storage medium
CN116028481B (en) * 2023-03-30 2023-06-27 紫金诚征信有限公司 Data quality detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination