CN112183644A - Index stability monitoring method and device, computer equipment and medium - Google Patents

Info

Publication number
CN112183644A
Authority
CN
China
Prior art keywords
data
stability
target
preset
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011056363.9A
Other languages
Chinese (zh)
Other versions
CN112183644B (en)
Inventor
罗健
陈远波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202011056363.9A priority Critical patent/CN112183644B/en
Priority claimed from CN202011056363.9A external-priority patent/CN112183644B/en
Publication of CN112183644A publication Critical patent/CN112183644A/en
Application granted granted Critical
Publication of CN112183644B publication Critical patent/CN112183644B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention relates to the field of data processing and discloses a method, a device, computer equipment and a medium for monitoring index stability. The method comprises: obtaining historical data features of a first preset period as an initial training feature set; performing stability detection on the initial training set to obtain a detection result; determining a target training set according to the detection result; obtaining data features in a prediction set according to a second preset period, wherein the second preset period is shorter than the first preset period; calculating, in a preset manner, the stability of the data features in the prediction set relative to the data features in the target training set as a target stability; and determining a monitoring result for the indexes of the prediction set based on the target stability and a preset stability threshold.

Description

Index stability monitoring method and device, computer equipment and medium
Technical Field
The invention relates to the field of data processing, in particular to a method and a device for monitoring index stability, computer equipment and a medium.
Background
With the rapid development of computer technology, more and more data processing involves machine learning and artificial intelligence. Before machine learning or artificial intelligence can be applied to data processing, a good model must first be trained so that the model can then process data quickly.
In model training, the quality of the indexes fed into the model directly determines the accuracy of the training result. Good indexes (small fluctuation, strong predictive power) are irreplaceable for a model, but real data changes constantly and never exists in an ideal form. The stability of the indexes therefore needs to be monitored in advance in order to stabilize the model's performance and ensure model quality.
At present, existing index monitoring schemes all rely on index calculation tools to calculate the stability of each index and then give an evaluation. In practice, however, a large number of indexes are involved, and calculating them directly with such tools is inefficient, so an efficient index stability monitoring method is urgently needed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for monitoring index stability, computer equipment and a medium, which are used for improving the efficiency of monitoring the index stability.
In order to solve the foregoing technical problem, an embodiment of the present application provides a method for monitoring index stability, including:
acquiring historical data characteristics of a first preset period as an initial training characteristic set;
performing stability detection on the initial training set to obtain a detection result;
determining a target training set according to the detection result;
acquiring data characteristics in a prediction set according to a second preset period, wherein the second preset period is smaller than the first preset period;
calculating the stability of the data features in the prediction set relative to the data features in the target training set in a preset mode to serve as target stability;
and determining the monitoring result of the prediction set index based on the target stability and a preset stability threshold.
Optionally, the obtaining of the historical data feature of the first preset period as the initial training feature set includes:
carrying out data cleaning and normalization processing on the historical data characteristics to obtain initial data;
if continuous data exist in the initial data, discretizing the continuous data to obtain discrete data;
and carrying out one-hot coding on the discrete data, and taking the data subjected to one-hot coding as the initial training feature set.
Optionally, the determining a target training set according to the detection result includes:
if the detection result is that unstable data exist in the initial training set, adding the stable data into a target training set, and taking the unstable data as abnormal data;
obtaining an unstable type corresponding to the abnormal data, and repairing the abnormal data according to a repair scheme corresponding to the unstable type;
and if the repair is successful, adding the repaired abnormal data into the target training set, and if the repair is failed, removing the abnormal data.
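The stable / repair / remove logic above can be illustrated with a minimal Python sketch. The record layout, the instability type names ("logic", "fluctuation"), the detector and the repair function are all hypothetical placeholders chosen for illustration; the source does not specify them.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, object]
Repair = Callable[[Record], Optional[Record]]


def detect_instability(rec: Record) -> Optional[str]:
    """Hypothetical detector: return an instability type, or None if stable."""
    value = rec.get("value")
    if not isinstance(value, (int, float)):
        return "logic"            # value is not a number at all
    if not 0 <= value <= 100:
        return "fluctuation"      # value outside the assumed valid range
    return None


def clamp_value(rec: Record) -> Optional[Record]:
    """Hypothetical repair scheme for 'fluctuation': clamp into [0, 100]."""
    value = rec["value"]
    return {**rec, "value": min(max(value, 0), 100)}


# Hypothetical mapping from instability type to repair scheme.
# No scheme is registered for "logic", so such records cannot be repaired.
REPAIRS: Dict[str, Repair] = {"fluctuation": clamp_value}


def build_target_training_set(
    records: List[Record],
    detect: Callable[[Record], Optional[str]],
    repairs: Dict[str, Repair],
) -> Tuple[List[Record], List[Record]]:
    """Keep stable records, repair abnormal ones where possible, drop the rest."""
    target, removed = [], []
    for rec in records:
        kind = detect(rec)
        if kind is None:                  # stable data goes straight in
            target.append(rec)
            continue
        repair = repairs.get(kind)
        fixed = repair(rec) if repair else None
        if fixed is not None:             # repair succeeded
            target.append(fixed)
        else:                             # repair failed: remove the record
            removed.append(rec)
    return target, removed
```

Keeping the repair schemes in a dictionary keyed by instability type means new types can be supported by registering one more function, without touching the loop.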
Optionally, the calculating, in a preset manner, a stability of the data features in the prediction set with respect to the data features in the target training set includes:
according to a preset binning mode, performing binning processing on the data features in the prediction set to obtain a first bin, and performing binning processing on the data features in the target training set to obtain a second bin;
for each sub-bin of the first bin, calculating a stability index between the data features in that sub-bin and the data features in the corresponding sub-bin of the second bin, to obtain a basic stability;
and accumulating all the basic stability degrees to obtain the target stability degree corresponding to the data characteristic.
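As an illustration of the per-bin calculation and accumulation, the following Python sketch uses the commonly used Population Stability Index formula, summing (a_i - e_i) * ln(a_i / e_i) over bins, where e_i and a_i are the shares of training-set and prediction-set samples in bin i. The source does not state the formula, so this choice is an assumption, as are the shared bin edges, the `eps` floor for empty bins, and the function names. Both inputs are assumed to be non-empty.

```python
import math
from typing import List, Sequence


def bin_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values in each bin; `edges` are ascending inner boundaries."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(v > e for e in edges)   # number of edges strictly below v
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(train_values: Sequence[float],
        pred_values: Sequence[float],
        edges: Sequence[float],
        eps: float = 1e-6) -> float:
    """Accumulate the per-bin stability terms into one target stability."""
    expected = bin_shares(train_values, edges)   # target training set
    actual = bin_shares(pred_values, edges)      # prediction set
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)          # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)       # basic stability of one bin
    return total
```

For example, `psi(train, train, edges)` is 0 for any sample, and the value grows as the prediction-set distribution drifts away from the training-set distribution.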
Optionally, the binning the data features in the prediction set according to a preset binning mode to obtain a first bin includes:
acquiring a binning configuration parameter from a preset configuration file, wherein the binning configuration parameter comprises a bin number threshold;
acquiring m feature values contained in the feature data, wherein m is a positive integer greater than 1;
storing the m feature values in a preset feature value set, setting the initial value of the binning round number k to 0, and setting the binning result of round 0 to empty, wherein k ∈ [0, m-1];
for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculating the relevance index value corresponding to the feature value, thereby obtaining m-k relevance index values;
taking the feature value corresponding to the maximum of the m-k relevance index values as a target split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, taking this division as the binning result of the (k+1)-th round, and removing the feature value from the feature value set;
if k+2 has not reached the preset bin number threshold, returning to the step of taking each feature value in the feature value set as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculating the corresponding relevance index value to obtain m-k relevance index values; otherwise, stopping the binning and determining the binning result of the (k+1)-th round as the first bin.
Optionally, the relevance index value is any one of an IV (information value), a Gini coefficient, and an information entropy.
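The round-by-round binning above can be sketched in Python as a greedy search over split points. Several things here are assumptions made only for illustration: the feature values are treated as ordered numbers, each sample carries a binary label `y` against which relevance is measured, and the relevance index is the information gain derived from information entropy. The function names are hypothetical.

```python
import math
from bisect import bisect_left
from typing import List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Binary information entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(xs: Sequence[float], ys: Sequence[int],
                     splits: Sequence[float]) -> float:
    """Entropy of all labels minus the size-weighted entropy of each bin."""
    bins: List[List[int]] = [[] for _ in range(len(splits) + 1)]
    for x, y in zip(xs, ys):
        bins[bisect_left(splits, x)].append(y)   # x <= splits[i] goes to bin i
    n = len(ys)
    return entropy(ys) - sum(len(b) / n * entropy(b) for b in bins)


def greedy_binning(xs: Sequence[float], ys: Sequence[int],
                   max_bins: int) -> List[float]:
    """Return sorted split points; k split points define k + 1 bins."""
    candidates = sorted(set(xs))   # the m distinct feature values
    splits: List[float] = []       # binning result of round 0 is empty
    # Round k: k split points exist, adding one more yields k + 2 bins.
    while candidates and len(splits) + 2 <= max_bins:
        best = max(
            candidates,
            key=lambda v: information_gain(xs, ys, sorted(splits + [v])),
        )
        splits = sorted(splits + [best])   # result of round k + 1
        candidates.remove(best)            # drop it from the value set
    return splits
```

With `xs = [1, 2, 3, 4, 5, 6]`, `ys = [0, 0, 0, 1, 1, 1]` and `max_bins = 2`, the single split point chosen is 3, which separates the two label groups exactly.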
In order to solve the above technical problem, an embodiment of the present application further provides a monitoring device for index stability, including:
the first data acquisition module is used for acquiring historical data characteristics of a first preset period as an initial training characteristic set;
the first stability detection module is used for performing stability detection on the initial training set to obtain a detection result;
the target training set determining module is used for determining a target training set according to the detection result;
the second data acquisition module is used for acquiring data characteristics in the prediction set according to a second preset period, wherein the second preset period is smaller than the first preset period;
the second stability detection module is used for calculating the stability of the data features in the prediction set relative to the data features in the target training set in a preset mode to serve as target stability;
and the result determining module is used for determining the monitoring result of the prediction set index based on the target stability and a preset stability threshold.
Optionally, the first data obtaining module includes:
the data preprocessing unit is used for carrying out data cleaning and normalization processing on the historical data characteristics to obtain initial data;
the data discretization unit is used for discretizing continuous data to obtain discretization data if the continuous data exists in the initial data;
and the one-hot coding unit is used for carrying out one-hot coding on the discrete data and taking the data subjected to one-hot coding as the initial training feature set.
Optionally, the target training set determining module includes:
an abnormal data determining unit, configured to add the stable data to a target training set if the detection result is that unstable data exists in an initial training set, and use the unstable data as abnormal data;
the abnormal data repairing unit is used for acquiring an unstable type corresponding to the abnormal data and repairing the abnormal data according to a repairing scheme corresponding to the unstable type;
and the abnormal data classification unit is used for adding the repaired abnormal data into the target training set if the repair is successful, and removing the abnormal data if the repair is failed.
Optionally, the second stability detection module includes:
the data characteristic binning unit is used for performing binning processing on the data characteristics in the prediction set according to a preset binning mode to obtain a first bin, and performing binning processing on the data characteristics in the target training set to obtain a second bin;
the basic stability calculation unit is used for calculating, for each sub-bin of the first bin, a stability index between the data features in that sub-bin and the data features in the corresponding sub-bin of the second bin, to obtain a basic stability;
and the target stability determining unit is used for accumulating all the basic stabilities to obtain the target stability corresponding to the data characteristics.
Optionally, the data feature binning unit comprises:
the parameter obtaining subunit is configured to obtain a binning configuration parameter from a preset configuration file, where the binning configuration parameter includes a bin number threshold;
a feature value obtaining subunit, configured to obtain m feature values included in the feature data, where m is a positive integer greater than 1;
the initialization unit is used for storing the m feature values in a preset feature value set, setting the initial value of the binning round number k to 0, and setting the binning result of round 0 to empty, wherein k ∈ [0, m-1];
a relevance index value calculation unit, configured to, for each feature value in the feature value set, take the feature value as a test split point, divide the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculate the relevance index value corresponding to the feature value, to obtain m-k relevance index values;
a binning result determining unit, configured to take the feature value corresponding to the maximum of the m-k relevance index values as a target split point, divide the nominal variable into k+2 bins on the basis of the binning result of the k-th round, take this division as the binning result of the (k+1)-th round, and remove the feature value from the feature value set;
and the cyclic iteration unit is used for, if k+2 has not reached the preset bin number threshold, returning to the step of taking each feature value in the feature value set as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round, and calculating the corresponding relevance index value to obtain m-k relevance index values, and otherwise stopping the binning and determining the binning result of the (k+1)-th round as the first bin.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the monitoring method for index stability when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the monitoring method for index stability.
In the index stability monitoring method, device, computer equipment and medium provided by the embodiments of the invention, historical data features of a first preset period are acquired as an initial training feature set, stability detection is performed on the initial training set to obtain a detection result, and a target training set is determined according to the detection result. Subsequent data stability detection can then be performed quickly with the target training set as a reference, which helps improve the accuracy of subsequent data stability identification. Meanwhile, data features in a prediction set are acquired according to a second preset period, the second preset period being shorter than the first preset period; the stability of the data features in the prediction set relative to the data features in the target training set is calculated in a preset manner as a target stability; and the monitoring result for the indexes of the prediction set is determined based on the target stability and a preset stability threshold. Monitoring stability by calculating the stability of the prediction-set features relative to the target-training-set features improves the efficiency of data stability detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method for index stability monitoring of the present application;
FIG. 3 is a schematic diagram of an embodiment of an index stability monitoring apparatus according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like.
The terminal devices 101, 102, 103 may be various electronic devices having display screens and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the monitoring method for index stability provided in the embodiment of the present application is executed by a server, and accordingly, a monitoring device for index stability is disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. Any number of terminal devices, networks and servers may be provided according to implementation needs, and the terminal devices 101, 102 and 103 in this embodiment may specifically correspond to an application system in actual production.
FIG. 2 shows a monitoring method for index stability according to an embodiment of the present invention. The method is described below, taking its application to the server in FIG. 1 as an example:
S201: Acquiring historical data features of a first preset period as an initial training feature set.
Specifically, in actual business requirements, a prediction model needs to be iterated periodically, where model iteration involves screening data features, and the quality of the data features (i.e., model entry indexes) of the input model directly affects the prediction result of the model.
The historical data features refer to data features used in historical services, and different data features exist according to different services, which is not limited herein.
Preferably, in this embodiment, the first preset period is one month.
For example, in a specific embodiment, the preset model is a traffic condition trend prediction model and the first preset period is 1 month, so the historical data features of the most recent 1 month are obtained from a service library as the initial training feature set.
S202: Performing stability detection on the initial training set to obtain a detection result.
Specifically, after the initial training set is obtained, the stability of the data features in the initial training set needs to be detected, and the unstable data features in the initial training set need to be repaired.
The stability of each feature in the initial training set is measured with the PSI (Population Stability Index). In this embodiment, the stability of the initial training set may specifically be detected as follows: the data features in the initial training set are preprocessed and one-hot encoded to obtain digitized data; similarity is calculated between the digitized data, and digitized data whose similarity exceeds a preset threshold are treated as belonging to one family; the family containing the most data is taken as the reference family, and the average of the digitized data in the reference family is computed; the PSI between each piece of digitized data and this average is then calculated as the stable value of that piece of digitized data; and the detection result is determined by comparing each stable value with the preset stability threshold.
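One possible reading of this detection procedure is sketched below in Python. The source leaves most details open, so the following are assumptions made for illustration only: each piece of digitized data is a non-negative vector with a positive sum that can be normalized into a distribution, similarity is cosine similarity, the family of a piece of data is every piece whose similarity to it exceeds the threshold, and PSI uses the common formula summing (a_i - e_i) * ln(a_i / e_i). The names and default thresholds are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def normalize(v: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Scale a non-negative vector to shares, flooring zeros at eps."""
    s = sum(v)
    return [max(x / s, eps) for x in v]


def psi(expected: Sequence[float], actual: Sequence[float]) -> float:
    """PSI between two share vectors of equal length."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def detect_unstable(rows: Sequence[Sequence[float]],
                    sim_threshold: float = 0.9,
                    psi_threshold: float = 0.25) -> List[bool]:
    """Return one flag per row: True if the row is judged unstable."""
    # Family of each row: all rows sufficiently similar to it.
    families = [[r for r in rows if cosine(seed, r) > sim_threshold]
                for seed in rows]
    reference = max(families, key=len)               # largest family
    mean = [sum(col) / len(reference) for col in zip(*reference)]
    baseline = normalize(mean)
    # Stable value of each row is its PSI against the reference average.
    return [psi(baseline, normalize(r)) > psi_threshold for r in rows]
```

The pairwise similarity step is quadratic in the number of rows, which is acceptable for a sketch but would need a cheaper grouping strategy on large training sets.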
In this embodiment, the detection result is either that the data features in the initial training set are stable or that some of the data features in the initial training set are unstable. The types of instability include, but are not limited to, logic anomalies, periodic index anomalies, and fluctuation anomalies.
S203: Determining a target training set according to the detection result.
Specifically, after the detection result is obtained, if the detection result is that the data features in the initial training set are stable, the initial training set is determined as the target training set. If some of the data features in the initial training set are unstable, the unstable data features are repaired, and the stable data features together with the successfully repaired ones are used as the target training set.
Furthermore, the instability types and a repair method for the data features of each instability type can be preset, so that the source data are retained as far as possible and the accuracy of subsequent stability monitoring is not reduced because all data of a certain type have been eliminated. For the specific process of data repair, reference may be made to the description of the subsequent embodiments; details are not repeated here.
S204: Acquiring data features in the prediction set according to a second preset period, wherein the second preset period is shorter than the first preset period.
Specifically, after a target training set with relatively stable data features is obtained, the target training set is used to periodically monitor the stability of the data features in the prediction set.
The prediction set is the set of feature data that must be input into the model to predict a business processing result according to business requirements. If the feature data input into the model is not stable enough, the accuracy of the model's prediction may be reduced, so the stability of the data features in the prediction set needs to be monitored against stable and reliable data features.
Preferably, when the first preset period is 1 month, the second preset period is 1 day. That is, at the end of each month, the historical data features of that month are acquired and a target training set is derived from them; this target training set is then used to evaluate the stability of each day's real-time data in the following month.
S205: Calculating, in a preset manner, the stability of the data features in the prediction set relative to the data features in the target training set as the target stability.
Specifically, the data features in the target training set are all relatively stable, and the stability of the data features in the prediction set is determined by evaluating how much they fluctuate relative to the data features in the target training set.
The preset manner of stability evaluation includes, but is not limited to, precision, recall, F-measure, rank correlation, and the Singular Value Decomposition (SVD) algorithm; it may be chosen according to actual requirements and is not specifically limited here.
S206: Determining the monitoring result for the indexes of the prediction set based on the target stability and a preset stability threshold.
Specifically, when the target stability does not exceed the preset stability threshold, the monitoring result is that the index is normal; when the target stability exceeds the preset stability threshold, the monitoring result is that the index is not stable enough.
Preferably, the preset stability threshold is 0.25.
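The threshold comparison in S206 can be expressed as a short Python sketch. The function name, the result labels and the per-feature dictionary layout are hypothetical; only the comparison rule and the default threshold of 0.25 come from the text.

```python
from typing import Dict


def monitor(stabilities: Dict[str, float],
            threshold: float = 0.25) -> Dict[str, str]:
    """Map each feature's target stability to a monitoring result.

    A value that does not exceed the threshold is reported as "normal";
    a value above the threshold is reported as "unstable".
    """
    return {
        name: "normal" if value <= threshold else "unstable"
        for name, value in stabilities.items()
    }
```

For instance, `monitor({"age": 0.03, "income": 0.41})` reports `age` as normal and `income` as unstable.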
In this embodiment, historical data features of a first preset period are acquired as an initial training feature set, stability detection is performed on the initial training set to obtain a detection result, and a target training set is determined according to the detection result. Subsequent data stability detection can then be performed quickly with the target training set as a reference, which helps improve the accuracy of subsequent data stability identification. Meanwhile, data features in the prediction set are acquired according to a second preset period, the second preset period being shorter than the first preset period; the stability of the data features in the prediction set relative to the data features in the target training set is calculated in a preset manner as the target stability; and the monitoring result for the indexes of the prediction set is determined based on the target stability and the preset stability threshold. Monitoring stability by calculating the stability of the prediction-set features relative to the target-training-set features improves the efficiency of data stability detection.
In some optional implementation manners of this embodiment, in step S201, acquiring the historical data feature of the first preset period, as an initial training feature set, includes:
performing data cleaning and normalization processing on the historical data characteristics to obtain initial data;
if the initial data contains continuous data, discretizing the continuous data to obtain discrete data;
and carrying out one-hot coding on the discrete data, and taking the data subjected to one-hot coding as an initial training feature set.
Specifically, after the historical data features are obtained, data cleaning and normalization are performed on them to obtain initial data. Each piece of initial data contains several attribute features, and each attribute feature is either continuous or discrete. The feature type of the initial data is therefore identified; if a feature is continuous, it is discretized and converted into discrete data, and the discrete data is then one-hot encoded to obtain the initial training feature set.
Data cleaning refers to removing data with incomplete attributes, redundant data, and data outside a preset range; normalization refers to standardizing the format, attributes, range and the like of the data.
In the financial field, a piece of data often includes several attribute features. For example, a piece of initial data may be user information containing a user name, user gender, contact information, transacted business, and the like, each of which is an attribute feature.
The continuous attribute features refer to attribute features which can be randomly valued in a certain interval, the numerical values are continuous, two adjacent numerical values can be infinitely divided, namely infinite numerical values can be obtained, for example, the specification and the size of a produced part, the height, the weight, the chest circumference and the like measured by a human body are continuous attribute features, and the numerical values can only be obtained by a measuring or metering method.
The discrete attribute features refer to data in which feature values can be listed one by one in a certain order, and are usually valued in integer numbers. Such as the number of workers, the number of factories, the number of machines, etc., the numerical value of the discrete attribute feature is obtained by a counting method.
It should be noted that, in this embodiment, missing values of discrete attribute features are filled with the special character string "NA", so that no attribute feature in the initial data lacks a feature value, which would otherwise affect the subsequent stability calculation.
Optionally, when the historical data features are numerous, dimension reduction is performed on them, so that unimportant features do not take part in the computation; this reduces the amount of computation and the use of system resources and improves data processing efficiency.
Further, for each attribute feature of the initial data, if it has m different feature values, m binary features are obtained by one-hot encoding. The feature values are mutually exclusive: only one is activated at a time, the activated value is set to 1 and the remaining values are set to 0, finally giving the basic digital code corresponding to each feature value of the attribute feature.
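The discretization and "NA" filling described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names `discretize` and `fill_missing`, the cut points, and the sample values are all hypothetical.

```python
from bisect import bisect_right


def discretize(value, edges):
    """Map a continuous value to the index of the interval it falls into.

    `edges` holds ascending interior cut points; a value equal to a cut
    point is placed in the upper interval.
    """
    return bisect_right(edges, value)


def fill_missing(value, placeholder="NA"):
    """Replace a missing discrete value (None or empty string) with "NA"."""
    if value is None or value == "":
        return placeholder
    return value


# Hypothetical continuous feature (height in cm) and cut points.
heights_cm = [158.2, 171.5, 183.0]
height_edges = [160.0, 175.0]
height_bins = [discretize(h, height_edges) for h in heights_cm]

# Hypothetical discrete feature with missing entries.
businesses = ["loan", None, "insurance", ""]
businesses_filled = [fill_missing(b) for b in businesses]
```

After this step every attribute feature has a discrete value, so "NA" can simply be treated as one more category during encoding.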
It should be understood that the manner of the one-hot encoding can change the data in the original state into sparse data, can better solve the problem of classifying the attribute feature data samples by data mining, and plays a role in extending the features to a certain extent, wherein the data in the original state refers to the initial data and the value range of the attribute features thereof.
For example, when the attribute is "sex", its value range includes two values, male and female; the digital code corresponding to male is [1,0], and the digital code corresponding to female is [0,1].
It is worth noting that differences in how attribute features take values, and in their value ranges, affect the stability evaluation. One-hot encoding encodes the feature values of different attribute features in a unified way, turning the values in their original state into sparse data; this avoids the negative influence of differing value schemes on the stability calculation and effectively improves its accuracy.
In this embodiment, data cleaning, normalization, discretization and one-hot encoding of the historical data features reduce the dimensionality and complexity of the data, which reduces the amount of computation in subsequent data processing and improves data processing efficiency; at the same time, features of different kinds are quantified, which helps improve the accuracy of subsequent stability detection.
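The "sex" example can be reproduced with a few lines of Python. This is only a sketch; the helper name `one_hot` and the category order are assumptions made for illustration.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown feature value: {value!r}")
    return [1 if value == category else 0 for category in categories]


# The category order is an assumption; it fixes which position is activated.
sex_categories = ["male", "female"]
male_code = one_hot("male", sex_categories)
female_code = one_hot("female", sex_categories)
```

With this ordering, `male_code` is `[1, 0]` and `female_code` is `[0, 1]`, matching the codes given in the example.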
In some optional implementation manners of this embodiment, in step S203, determining the target training set according to the detection result includes:
if the detection result is that unstable data exist in the initial training set, adding the stable data into the target training set, and taking the unstable data as abnormal data;
obtaining an unstable type corresponding to the abnormal data, and repairing the abnormal data according to a repair scheme corresponding to the unstable type;
and if the repair is successful, adding the repaired abnormal data into the target training set, and if the repair is failed, removing the abnormal data.
Specifically, if the detection result is that unstable data exists in the initial training set, the stable data is added to the target training set, the unstable data is taken as abnormal data, the unstable type corresponding to the abnormal data is determined, and the abnormal data is repaired according to the repair scheme corresponding to that unstable type.
Repairing unstable data features requires different strategies depending on the cause of the instability, specifically:
for instability caused by a logic anomaly, the processing logic is corrected and the repair of the training set data is rerun;
for anomalies in which an index moves up or down because of an overall shift of the index, normalization is applied to ensure the stability of the index (data feature);
for instability caused by periodic indexes, no processing is performed, considering that some indexes change periodically over time;
and data features whose fluctuation is normal, as well as data features whose fluctuation is abnormal but normal within one or more bins, are binned and then entered into the model.
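The repair-then-keep-or-remove flow above can be sketched as a dispatch on the unstable type. This is an illustration under several assumptions: the type labels ("overall_shift", "periodic"), the use of z-score scaling as the normalization, and all function names are hypothetical; logic anomalies and bin-level handling are not modeled and are treated here as failed repairs.

```python
from statistics import mean, pstdev


def normalize(values):
    """Z-score scaling, used here as one possible normalization (assumption)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def repair(values, unstable_type):
    """Return repaired values, or None when no repair scheme applies."""
    if unstable_type == "overall_shift":
        return normalize(values)
    if unstable_type == "periodic":
        return list(values)  # periodic indexes are left unprocessed
    return None  # e.g. a logic anomaly that needs manual correction


def build_target_training_set(features):
    """`features` maps a name to (values, unstable_type or None if stable)."""
    target = {}
    for name, (values, unstable_type) in features.items():
        if unstable_type is None:
            target[name] = list(values)  # stable data is added directly
            continue
        repaired = repair(values, unstable_type)
        if repaired is not None:
            target[name] = repaired  # repair succeeded
        # repair failed: the abnormal data is removed
    return target


features = {
    "age": ([30, 40, 50], None),
    "premium": ([100.0, 200.0, 300.0], "overall_shift"),
    "visits": ([1, 5, 1, 5], "periodic"),
    "score": ([1, 2], "unknown"),
}
target_training_set = build_target_training_set(features)
```

In this toy input, "score" has no applicable repair scheme and is dropped, while the other three features end up in the target training set.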
In the embodiment, abnormal data with unstable detection results are repaired, so that the accuracy of data in a target training set is ensured, and the accuracy of subsequent target stability calculation is ensured.
In some optional implementations of this embodiment, in step S205, calculating, in a preset manner, the stability of the data features in the prediction set relative to the data features in the target training set as the target stability includes:
according to a preset binning mode, performing binning processing on the data features in the prediction set to obtain a first bin, and performing binning processing on the data features in the target training set to obtain a second bin;
for any sub-bin in the first bin, calculating the stability index between the data features in that sub-bin and the data features in the corresponding sub-bin of the second bin, to obtain a basic stability;
and accumulating all the basic stability degrees to obtain the target stability degree corresponding to the data characteristics.
Specifically, the data features in the prediction set and in the target training set are both numerous. To determine the stability of the data features in the prediction set more quickly and accurately, this embodiment bins the data features in the prediction set and in the target training set according to a preset binning mode, calculates the stability of each sub-bin, and determines the target stability from the stability of all sub-bins. This avoids the large amount of computation that would result from calculating stability directly on all data and helps improve data processing efficiency.
Here, data binning (also known as discrete binning or segmentation) is a data preprocessing technique for reducing the effects of minor observation errors; it groups multiple continuous values into a smaller number of "bins".
The preset binning mode can be selected according to actual requirements, and common binning modes include but are not limited to equal-frequency binning, equal-width binning, binning based on k-means clustering and the like.
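Two of the binning modes named above, equal-width and equal-frequency binning, can be sketched as follows. The function names and sample values are hypothetical, and the equal-frequency variant uses a simple index-based quantile rule chosen for brevity.

```python
def equal_width_edges(values, n_bins):
    """Interior cut points that split [min, max] into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior cut points that put roughly the same number of values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // n_bins] for i in range(1, n_bins)]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
width_edges = equal_width_edges(values, 3)
freq_edges = equal_frequency_edges(values, 3)
```

On this sample, which contains one outlier, the equal-width cut points are pulled far to the right while the equal-frequency cut points stay inside the bulk of the data, which is the usual reason to prefer one mode over the other.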
In this embodiment, the data features are binned, and the stability of the data features in the prediction set relative to those in the target training set is calculated bin by bin to obtain the target stability, which improves the efficiency of the stability calculation.
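The per-bin calculation and accumulation can be sketched as below, assuming the stability index is the Population Stability Index (PSI), where each bin contributes (actual share - expected share) * ln(actual share / expected share). The section itself does not give the formula, so this choice, the function names, the smoothing constant `eps` and the sample data are all assumptions.

```python
import math
from bisect import bisect_right
from collections import Counter


def bin_shares(values, edges, eps=1e-6):
    """Share of values falling in each bin; empty bins are floored at eps."""
    counts = Counter(bisect_right(edges, v) for v in values)
    total = len(values)
    return [max(counts.get(i, 0) / total, eps) for i in range(len(edges) + 1)]


def psi(train_values, pred_values, edges):
    """Return (target stability, list of per-bin basic stabilities)."""
    expected = bin_shares(train_values, edges)  # target training set
    actual = bin_shares(pred_values, edges)     # prediction set
    per_bin = [(a - e) * math.log(a / e) for a, e in zip(actual, expected)]
    return sum(per_bin), per_bin


train = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
edges = [5.5]  # hypothetical shared cut point for both sets

same_total, _ = psi(train, train, edges)
shifted_total, shifted_bins = psi(train, [6, 7, 8, 9, 10, 6, 7, 8, 9, 10], edges)
```

Identical distributions give a total of zero, and the total grows as the prediction set drifts away from the training set. Both sets must be binned with the same cut points for the per-bin comparison to be meaningful.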
In some optional implementation manners of this embodiment, the binning the data features in the prediction set according to a preset binning manner, and obtaining the first bin includes:
acquiring a binning configuration parameter from a preset configuration file, wherein the binning configuration parameter comprises a bin number threshold;
acquiring m feature values contained in the feature data, wherein m is a positive integer greater than 1;
storing the m feature values into a preset feature value set, setting the initial value of the binning round number k to 0, and setting the binning result of the 0th round of binning to null, wherein k belongs to [0, m-1];
for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round of binning, and calculating the association index value corresponding to the feature value, to obtain m-k association index values;
taking the feature value corresponding to the maximum of the m-k association index values as the target split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round of binning, taking this division as the binning result of the (k+1)-th round of binning, and removing the feature value from the feature value set;
if the value of k+2 has not reached the preset bin number threshold, returning to the step of, for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round of binning, and calculating the association index value corresponding to the feature value to obtain m-k association index values, and continuing execution; otherwise, stopping the binning and determining the binning result of the (k+1)-th round of binning as the first bin.
Specifically, the preset configuration parameters and the m feature values contained in the feature data are obtained, target split points are determined by calculating association index values, and splitting stops when the number of bins reaches the bin number threshold, yielding the first bin.
It should be understood that the preset configuration parameters include the threshold on the final number of bins, which may be configured according to actual needs and is not limited herein.
The association index value is any one of an IV (information value), a Gini coefficient and an information entropy.
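Two of these indices can be computed over the labels that fall into a bin, as sketched below. The sketch takes the Gini coefficient in its decision-tree sense (Gini impurity), which is an assumption, and the function names and sample labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Information entropy (in bits) of the label distribution."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def gini(labels):
    """Gini impurity of the label distribution."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


mixed_bin = [0, 0, 1, 1]  # evenly mixed labels: least informative bin
pure_bin = [1, 1, 1]      # a single label: most informative bin

mixed_entropy, mixed_gini = entropy(mixed_bin), gini(mixed_bin)
pure_entropy, pure_gini = entropy(pure_bin), gini(pure_bin)
```

Both measures are lowest for a pure bin, so when they serve as the association index a candidate split is usually scored by how much it reduces the weighted impurity rather than by the raw value.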
In this embodiment, split points are determined by calculating association index values, which reduces the risk of model overfitting; the discretized features are easy to add or remove, which facilitates rapid iteration of the model and improves the efficiency and accuracy of binning the data features.
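The round-by-round selection of split points can be sketched as a greedy loop. This is a simplified reading, not the claimed procedure itself: it assumes the feature values can be ordered so that a split point separates lower from higher values, uses IV against a binary label as the association index, and introduces hypothetical names (`information_value`, `greedy_binning`, `max_bins`) and a smoothing constant `eps`.

```python
import math
from bisect import bisect_right


def information_value(values, labels, cuts, eps=1e-6):
    """IV of a binary label (0 or 1) over the bins defined by the cut points."""
    cuts = sorted(cuts)
    zeros = [0] * (len(cuts) + 1)
    ones = [0] * (len(cuts) + 1)
    for value, label in zip(values, labels):
        index = bisect_right(cuts, value)
        if label == 1:
            ones[index] += 1
        else:
            zeros[index] += 1
    total_zeros = max(sum(zeros), 1)
    total_ones = max(sum(ones), 1)
    iv = 0.0
    for z, o in zip(zeros, ones):
        zero_share = max(z / total_zeros, eps)
        one_share = max(o / total_ones, eps)
        iv += (zero_share - one_share) * math.log(zero_share / one_share)
    return iv


def greedy_binning(values, labels, max_bins):
    """Add one split point per round until the bin count reaches max_bins."""
    candidates = sorted(set(values))  # the m feature values
    cuts = []                         # round 0: empty binning result
    while candidates and len(cuts) + 1 < max_bins:
        # Try every remaining feature value as a test split point.
        best = max(
            candidates,
            key=lambda c: information_value(values, labels, cuts + [c]),
        )
        cuts.append(best)        # target split point for this round
        candidates.remove(best)  # remove it from the feature value set
    return sorted(cuts)


values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
two_bin_cuts = greedy_binning(values, labels, max_bins=2)
three_bin_cuts = greedy_binning(values, labels, max_bins=3)
```

Each round evaluates every remaining candidate, so the cost grows with the number of feature values times the number of rounds; the bin number threshold keeps the loop short.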
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 3 is a schematic block diagram of an index stability monitoring apparatus corresponding to the index stability monitoring method according to the foregoing embodiment. As shown in fig. 3, the apparatus for monitoring index stability includes a first data acquisition module 31, a first stability detection module 32, a target training set determination module 33, a second data acquisition module 34, a second stability detection module 35, and a result determination module 36. The functional modules are explained in detail as follows:
the first data acquisition module 31 is configured to acquire historical data features of a first preset period as an initial training feature set;
the first stability detection module 32 is configured to perform stability detection on the initial training set to obtain a detection result;
a target training set determining module 33, configured to determine a target training set according to the detection result;
a second data obtaining module 34, configured to obtain data features in the prediction set according to a second preset period, where the second preset period is smaller than the first preset period;
the second stability detection module 35 is configured to calculate, in a preset manner, a stability of the data features in the prediction set with respect to the data features in the target training set, and use the stability as a target stability;
and a result determining module 36, configured to determine a monitoring result of the prediction set index based on the target stability and a preset stability threshold.
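The role of the result determining module can be sketched as a comparison of each feature's target stability against the preset threshold. The sketch assumes a PSI-like index in which larger values mean less stable; the class name, function name, sample values and the threshold of 0.25 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MonitoringResult:
    feature: str
    target_stability: float
    stable: bool


def determine_results(target_stabilities, threshold):
    """Mark a feature as stable when its index does not exceed the threshold."""
    return [
        MonitoringResult(name, value, value <= threshold)
        for name, value in target_stabilities.items()
    ]


results = determine_results({"age": 0.03, "premium": 0.41}, threshold=0.25)
unstable_features = [r.feature for r in results if not r.stable]
```

Here only "premium" exceeds the threshold and would be reported as unstable in the monitoring result.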
Optionally, the first data obtaining module 31 includes:
the data preprocessing unit is used for carrying out data cleaning and normalization processing on the historical data characteristics to obtain initial data;
the data discretization unit is used for discretizing the continuous data to obtain discrete data if the continuous data exists in the initial data;
and the one-hot coding unit is used for carrying out one-hot coding on the discrete data and taking the data subjected to one-hot coding as an initial training feature set.
Optionally, the target training set determining module 33 includes:
the abnormal data determining unit is used for adding the stable data into the target training set and taking the unstable data as abnormal data if the detection result is that the unstable data exist in the initial training set;
the abnormal data repairing unit is used for acquiring the unstable type corresponding to the abnormal data and repairing the abnormal data according to the repairing scheme corresponding to the unstable type;
and the abnormal data classification unit is used for adding the repaired abnormal data into the target training set if the repair is successful, and removing the abnormal data if the repair is failed.
Optionally, the second stability detection module 35 includes:
the data characteristic binning unit is used for performing binning processing on the data characteristics in the prediction set according to a preset binning mode to obtain a first bin, and performing binning processing on the data characteristics in the target training set to obtain a second bin;
the basic stability calculation unit is used for calculating, for any sub-bin in the first bin, the stability index between the data features in that sub-bin and the data features in the corresponding sub-bin of the second bin, to obtain a basic stability;
and the target stability determining unit is used for accumulating all the basic stabilities to obtain the target stability corresponding to the data characteristics.
Optionally, the data feature binning unit comprises:
the parameter acquisition subunit is used for acquiring the binning configuration parameters from a preset configuration file, wherein the binning configuration parameters comprise a bin number threshold;
the feature value acquisition subunit is used for acquiring m feature values contained in the feature data, wherein m is a positive integer greater than 1;
the initialization unit is used for storing the m feature values into a preset feature value set, setting the initial value of the binning round number k to 0, and setting the binning result of the 0th round of binning to null, wherein k belongs to [0, m-1];
the association index value calculation unit is used for, for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round of binning, and calculating the association index value corresponding to the feature value, to obtain m-k association index values;
a binning result determining unit, configured to take the feature value corresponding to the maximum of the m-k association index values as the target split point, divide the nominal variable into k+2 bins on the basis of the binning result of the k-th round of binning, take this division as the binning result of the (k+1)-th round of binning, and remove the feature value from the feature value set;
and the loop iteration unit is used for, if the value of k+2 has not reached the preset bin number threshold, returning to the step of, for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round of binning, and calculating the association index value corresponding to the feature value to obtain m-k association index values, and continuing execution; otherwise, stopping the binning and determining the binning result of the (k+1)-th round of binning as the first bin.
For the specific definition of the index stability monitoring device, reference may be made to the above definition of the index stability monitoring method, and details are not repeated here. All or part of the modules in the index stability monitoring device can be implemented by software, hardware or a combination thereof. The modules can be embedded in hardware form in, or be independent of, a processor in the computer device, or can be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the modules.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 communicatively connected to each other via a system bus. It is noted that only the computer device 4 having the memory 41, the processor 42 and the network interface 43 is shown, but it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as program codes for controlling electronic files. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the program code stored in the memory 41 or process data, such as program code for executing control of an electronic file.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The present application provides another embodiment, which is to provide a computer-readable storage medium, wherein the computer-readable storage medium stores an interface display program, and the interface display program is executable by at least one processor, so as to cause the at least one processor to execute the steps of the index stability monitoring method.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are only some, not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be thoroughly understood. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A method for monitoring index stability is characterized by comprising the following steps:
acquiring historical data characteristics of a first preset period as an initial training characteristic set;
performing stability detection on the initial training set to obtain a detection result;
determining a target training set according to the detection result;
acquiring data characteristics in a prediction set according to a second preset period, wherein the second preset period is smaller than the first preset period;
calculating the stability of the data features in the prediction set relative to the data features in the target training set in a preset mode to serve as target stability;
and determining the monitoring result of the prediction set index based on the target stability and a preset stability threshold.
2. The method for monitoring index stability according to claim 1, wherein the acquiring the historical data feature of the first preset period as an initial training feature set comprises:
carrying out data cleaning and normalization processing on the historical data characteristics to obtain initial data;
if continuous data exist in the initial data, discretizing the continuous data to obtain discrete data;
and carrying out one-hot coding on the discrete data, and taking the data subjected to one-hot coding as the initial training feature set.
3. The method for monitoring the stability of the index as claimed in claim 1, wherein the determining the target training set according to the detection result comprises:
if the detection result is that unstable data exist in the initial training set, adding the stable data into a target training set, and taking the unstable data as abnormal data;
obtaining an unstable type corresponding to the abnormal data, and repairing the abnormal data according to a repair scheme corresponding to the unstable type;
and if the repair is successful, adding the repaired abnormal data into the target training set, and if the repair is failed, removing the abnormal data.
4. The method for monitoring the stability of the index according to any one of claims 1 to 3, wherein the calculating the stability of the data features in the prediction set relative to the data features in the target training set in a preset manner as the target stability includes:
according to a preset binning mode, performing binning processing on the data features in the prediction set to obtain a first bin, and performing binning processing on the data features in the target training set to obtain a second bin;
for any sub-bin in the first bin, calculating the stability index between the data features in that sub-bin and the data features in the corresponding sub-bin of the second bin, to obtain a basic stability;
and accumulating all the basic stability degrees to obtain the target stability degree corresponding to the data characteristic.
5. The method for monitoring index stability according to claim 4, wherein the binning the data features in the prediction set according to a preset binning mode to obtain a first bin comprises:
acquiring a binning configuration parameter from a preset configuration file, wherein the binning configuration parameter comprises the bin number threshold;
acquiring m characteristic values contained in the characteristic data, wherein m is a positive integer greater than 1;
storing the m feature values into a preset feature value set, setting the initial value of the binning round number k to 0, and setting the binning result of the 0th round of binning to null, wherein k belongs to [0, m-1];
for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round of binning, and calculating the association index value corresponding to the feature value, to obtain m-k association index values;
taking the feature value corresponding to the maximum of the m-k association index values as the target split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round of binning, taking this division as the binning result of the (k+1)-th round of binning, and removing the feature value from the feature value set;
if the value of k+2 has not reached the preset bin number threshold, returning to the step of, for each feature value in the feature value set, taking the feature value as a test split point, dividing the nominal variable into k+2 bins on the basis of the binning result of the k-th round of binning, and calculating the association index value corresponding to the feature value to obtain m-k association index values, and continuing execution; otherwise, stopping the binning and determining the binning result of the (k+1)-th round of binning as the first bin.
6. The method for monitoring index stability of claim 5, wherein the association index value is any one of an IV value, a Gini coefficient and an information entropy.
7. A monitoring device for index stability is characterized by comprising:
the first data acquisition module is used for acquiring historical data characteristics of a first preset period as an initial training characteristic set;
the first stability detection module is used for performing stability detection on the initial training set to obtain a detection result;
the target training set determining module is used for determining a target training set according to the detection result;
the second data acquisition module is used for acquiring data characteristics in the prediction set according to a second preset period, wherein the second preset period is smaller than the first preset period;
the second stability detection module is used for calculating the stability of the data features in the prediction set relative to the data features in the target training set in a preset mode to serve as target stability;
and the result determining module is used for determining the monitoring result of the prediction set index based on the target stability and a preset stability threshold.
8. The apparatus for monitoring the stability of an indicator of claim 7, wherein the target training set determining module comprises:
an abnormal data determining unit, configured to add the stable data to a target training set if the detection result is that unstable data exists in an initial training set, and use the unstable data as abnormal data;
the abnormal data repairing unit is used for acquiring an unstable type corresponding to the abnormal data and repairing the abnormal data according to a repairing scheme corresponding to the unstable type;
and the abnormal data classification unit is used for adding the repaired abnormal data into the target training set if the repair is successful, and removing the abnormal data if the repair is failed.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the index stability monitoring method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the method for monitoring the stability of an index according to any one of claims 1 to 6.
CN202011056363.9A 2020-09-29 Index stability monitoring method and device, computer equipment and medium CN112183644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011056363.9A CN112183644B (en) 2020-09-29 Index stability monitoring method and device, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011056363.9A CN112183644B (en) 2020-09-29 Index stability monitoring method and device, computer equipment and medium

Publications (2)

Publication Number Publication Date
CN112183644A true CN112183644A (en) 2021-01-05
CN112183644B CN112183644B (en) 2024-05-03

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9449283B1 (en) * 2012-08-20 2016-09-20 Context Relevant, Inc. Selecting a training strategy for training a machine learning model
CN108764273A (en) * 2018-04-09 2018-11-06 中国平安人寿保险股份有限公司 Data processing method and apparatus, terminal device and storage medium
CN108959187A (en) * 2018-04-09 2018-12-07 中国平安人寿保险股份有限公司 Variable binning method and apparatus, terminal device and storage medium
CN109213755A (en) * 2018-09-30 2019-01-15 长安大学 Traffic flow data cleaning and repair method based on spatio-temporal series
CN109857593A (en) * 2019-01-21 2019-06-07 北京工业大学 Method for recovering missing data in data center logs
CN110428087A (en) * 2019-06-25 2019-11-08 万翼科技有限公司 Business stability prediction method, device, computer equipment and storage medium
CN110659325A (en) * 2018-05-31 2020-01-07 罗伯特·博世有限公司 System and method for large scale multidimensional spatiotemporal data analysis
CN111191731A (en) * 2020-01-02 2020-05-22 同盾控股有限公司 Data processing method and device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
温粉莲: "A hybrid-model method for anomaly detection in time-series data", 数字通信世界 (Digital Communication World), no. 01, pages 23-24 *

Similar Documents

Publication Publication Date Title
CN110866181B (en) Resource recommendation method, device and storage medium
CN112148987B (en) Message pushing method based on target object activity and related equipment
CN109284372B (en) User operation behavior analysis method, electronic device and computer readable storage medium
CN112085565A (en) Deep learning-based information recommendation method, device, equipment and storage medium
CN112328909B (en) Information recommendation method and device, computer equipment and medium
CN112529477A (en) Credit evaluation variable screening method, device, computer equipment and storage medium
CN113836131A (en) Big data cleaning method and device, computer equipment and storage medium
CN112328657A (en) Feature derivation method, feature derivation device, computer equipment and medium
CN115081025A (en) Sensitive data management method and device based on digital middlebox and electronic equipment
CN115794578A (en) Data management method, device, equipment and medium for power system
CN112990583B (en) Method and equipment for determining model entering characteristics of data prediction model
CN110019193B (en) Similar account number identification method, device, equipment, system and readable medium
CN112579781A (en) Text classification method and device, electronic equipment and medium
CN112541595A (en) Model construction method and device, storage medium and electronic equipment
CN111752734A (en) Abnormal data classification method, abnormal data analysis method, abnormal data classification device and abnormal data analysis device, and storage medium
CN110852893A (en) Risk identification method, system, equipment and storage medium based on mass data
CN115757075A (en) Task abnormity detection method and device, computer equipment and storage medium
CN113705201B (en) Text-based event probability prediction evaluation algorithm, electronic device and storage medium
CN112183644B (en) Index stability monitoring method and device, computer equipment and medium
CN112183644A (en) Index stability monitoring method and device, computer equipment and medium
CN115392361A (en) Intelligent sorting method and device, computer equipment and storage medium
CN114997419A (en) Updating method and device of scorecard model, electronic equipment and storage medium
CN111652281B (en) Information data classification method, device and readable storage medium
CN112381458A (en) Project evaluation method, project evaluation device, equipment and storage medium
CN113449062A (en) Track processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination