CN113673595A - Data processing method, device and equipment - Google Patents

Info

Publication number
CN113673595A
Authority
CN
China
Prior art keywords
index
index data
data
determining
hill
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110958825.4A
Other languages
Chinese (zh)
Inventor
周玮理
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCB Finetech Co Ltd
Priority to CN202110958825.4A
Publication of CN113673595A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Computing Systems (AREA)
  • Development Economics (AREA)
  • Mathematical Physics (AREA)
  • Educational Administration (AREA)
  • Medical Informatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Complex Calculations (AREA)

Abstract

Embodiments of this specification provide a data processing method, apparatus and device. The method comprises: obtaining index data, wherein the index data comprises a plurality of attribute values corresponding to the same index; determining distribution information of the index data; removing abnormal values from the index data based on the distribution information to obtain first index data; and binning the first index data to obtain a binning result. The embodiments of this specification can solve the prior-art problem that binning results are inaccurate because binning is affected by noise.

Description

Data processing method, device and equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method, apparatus and device.
Background
With the rapid development of computer technology and the arrival of the big data era, internet information technology has been widely applied across industries and has strongly promoted their development. For example, in the broader financial field, big data and various financial technologies help people predict risks, so that those risks can be effectively avoided and the industry can develop healthily and sustainably. However, as science and technology continue to develop, risk-related data in the financial industry is growing exponentially. To measure or predict risks more accurately, a risk prediction model is generally established based on key index data for each type of risk.
In the prior art, to make an established risk prediction model more stable, continuous variables are generally discretized by equal-frequency binning, equal-width binning, or K-means clustering binning. However, equal-frequency binning and equal-width binning do not group values by the similarity of their feature distribution as well as K-means clustering binning does, while K-means clustering binning is very susceptible to abnormal values. As a result, an accurate binning result cannot be obtained, which affects risk prediction accuracy.
Therefore, there is a need for a solution to the above technical problems.
Disclosure of Invention
The embodiment of the specification provides a data processing method, a data processing device and data processing equipment, so that the binning accuracy can be improved, and the risk prediction accuracy is further improved.
The data processing method, the data processing device and the data processing equipment provided by the specification are realized in the following modes.
A data processing method, comprising: obtaining index data, wherein the index data comprises a plurality of attribute values corresponding to the same index; determining distribution information of the index data; removing abnormal values from the index data based on the distribution information to obtain first index data; and binning the first index data to obtain a binning result.
A data processing apparatus, comprising: an obtaining module, configured to obtain index data, wherein the index data comprises a plurality of attribute values corresponding to the same index; a determining module, configured to determine distribution information of the index data; a removing module, configured to remove abnormal values from the index data based on the distribution information to obtain first index data; and a binning module, configured to bin the first index data to obtain a binning result.
A data processing device comprising at least one processor and a memory storing computer-executable instructions which, when executed by the processor, implement the steps of any one of the method embodiments of the present specification.
A computer readable storage medium having stored thereon computer instructions which, when executed, implement the steps of any one of the method embodiments in the present specification.
The specification provides a data processing method, apparatus and device. In some embodiments, index data may be obtained, the index data comprising a plurality of attribute values corresponding to the same index, and distribution information of the index data may be determined. Based on the distribution information, abnormal values in the index data are removed to obtain first index data, and the first index data is binned to obtain a binning result. Because the abnormal values in the index data are detected and analyzed before the index data is binned, and the abnormal values are binned separately, the prior-art problem that binning results are inaccurate because binning is affected by noise can be solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, are incorporated in and constitute a part of this specification, and are not intended to limit the specification. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a data processing method provided herein;
FIG. 2 is a block diagram of one embodiment of a data processing apparatus provided herein;
FIG. 3 is a block diagram of a hardware configuration of an embodiment of a data processing server provided in the present specification.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments in the present specification, and not all of the embodiments. All other embodiments that can be obtained by a person skilled in the art based on one or more embodiments of the present disclosure without making any creative effort shall fall within the protection scope of the embodiments of the present disclosure.
The following describes an embodiment of the present disclosure with a specific application scenario as an example. Specifically, FIG. 1 is a schematic flow chart of an embodiment of a data processing method provided in this specification. Although the present specification provides the method steps or apparatus structures as shown in the following examples or figures, more or fewer steps or modules may be included in the method or apparatus based on conventional or non-inventive effort.
One embodiment provided by the present specification can be applied to a client, a server, and the like. The client may include a terminal device, such as a smart phone, a tablet computer, and the like. The server may include a single computer device, or may include a server cluster formed by a plurality of servers, or a server structure of a distributed system, and the like.
It should be noted that the following description of the embodiments does not limit the technical solutions in other extensible application scenarios based on the present specification. In an embodiment of a data processing method provided in the present specification, as shown in FIG. 1, the method may include the following steps.
S0: acquiring index data; the index data comprises a plurality of attribute values corresponding to the same index.
The index may represent an attribute of the target object. Each index may correspond to a plurality of attribute values. For example, the attribute values corresponding to an age index may include 1 year, 7 years, 21 years, 68 years, etc., and the attribute values corresponding to an enterprise registration place index may include Beijing, Shanghai, Guangzhou, Suzhou, etc. The plurality of attribute values corresponding to each index may be numeric, categorical, discrete, continuous, and the like.
In some implementation scenarios, before the index data is obtained, the index data set corresponding to each application scenario may be obtained. The index data set may include a plurality of pieces of index data, and each piece of index data may include a plurality of attribute values.
In some implementation scenarios, when the index data set corresponding to each application scenario is obtained, the relevant data may be pulled from a preset database or a memory, and then processed, so as to obtain the index data set. The preset database may include an Oracle database, a MySQL database, and the like.
For example, in a pre-loan risk control scenario for an institution's corporate customers, the relevant data pulled from the preset database or memory may include the customer's industrial and commercial registration data, tax data, financial data, data on the financial products applied for, credit reporting data, transaction data, complex network data, and the like. In an enterprise credit risk scenario, the relevant data may include the enterprise's industrial and commercial and tax information, financial data, judicial and administrative data, guarantee information data, and the like. In a personal credit risk scenario, the relevant data may include personal identity information, multi-platform borrowing data, behavior data, social network information, and the like. It is to be understood that the foregoing is only exemplary and the embodiments of this specification are not limited to these examples; those skilled in the art may make other modifications in light of the technical spirit of this application, and such modifications fall within the protection scope of this application as long as the functions and effects achieved are the same as or similar to those of this application.
In a machine learning task, a given set of attributes (different indexes) typically contains some that are critical for learning and others that carry little meaning. It is therefore important to process the relevant data after pulling it for each scenario from the preset database or memory.
In some implementation scenarios, after the relevant data of each scenario is pulled from the preset database or memory, the relevant data may be processed by index derivation, index screening, data cleaning, and the like. Index derivation may also be referred to as feature construction. It can be used to explore different statistical descriptions of the indexes over different time slices and to extract valuable information from them in order to develop a prediction model. Attributes or indexes that are useful for the current learning task may be referred to as relevant indexes, and those that are not useful may be referred to as irrelevant indexes. Index screening is the process of selecting a subset of relevant indexes from a given set of attributes. Taking credit risk features as an example, a good index may have the following properties: (1) when macroeconomic factors, regulatory policies, the distribution of customer groups, and the marketing of financial products are all stable, the distribution of the index should also be stable; (2) the relationship between the index and credit risk conforms to risk-control business logic; (3) the distribution of the index differs significantly between customers who later default and customers who do not. Data cleaning can preliminarily filter out noise in the data and improve the results of subsequent processing.
In some implementation scenarios, the ways of deriving indexes may include summation, scaling, frequency, averaging, and so forth. For example, derived indexes may include the total number of credit grants to an enterprise customer on different platforms over a period of time, the total amount of funds transferred out of or into the enterprise customer's accounts over a period of time, the enterprise customer's transaction frequency over a period of time, the monthly average of the enterprise customer's net profit over a period of time, and so on. It is to be understood that the above description is only exemplary and the ways of deriving indexes are not limited to these examples; those skilled in the art may make other modifications in light of the technical spirit of this application, and such modifications fall within the protection scope of this application as long as the functions and effects achieved are the same as or similar to those of this application.
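For illustration only, the summation, frequency, and averaging derivations described above might be sketched as follows. The record layout, the customer identifiers, and the function name `derive_indexes` are hypothetical and do not come from this specification.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction records: (customer_id, transaction_date, amount).
transactions = [
    ("C001", date(2021, 1, 5), 1200.0),
    ("C001", date(2021, 1, 20), 800.0),
    ("C001", date(2021, 2, 11), 500.0),
    ("C002", date(2021, 1, 9), 300.0),
]


def derive_indexes(records):
    """Derive sum, frequency, and monthly-average indexes per customer."""
    totals = defaultdict(float)   # summation: total amount per customer
    counts = defaultdict(int)     # frequency: number of transactions
    months = defaultdict(set)     # distinct (year, month) pairs with activity
    for customer_id, day, amount in records:
        totals[customer_id] += amount
        counts[customer_id] += 1
        months[customer_id].add((day.year, day.month))
    return {
        cid: {
            "total_amount": totals[cid],
            "transaction_count": counts[cid],
            # Averaging: total divided by the number of active months.
            "monthly_average": totals[cid] / len(months[cid]),
        }
        for cid in totals
    }


derived = derive_indexes(transactions)
```

Each derived value then becomes one attribute value of the corresponding index for that customer.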
In some implementation scenarios, after the relevant data of each scenario is processed, an index data set corresponding to each scenario may be obtained. Furthermore, when one or more pieces of index data need to be processed (for example, binned), the corresponding index data can be obtained directly from the corresponding index data set according to the index name, which improves the efficiency of subsequent data processing.
S2: determining distribution information of the index data.
In some implementation scenarios, after the index data is obtained, distribution information of the index data may be further determined. The distribution information may indicate the distribution of the index data, and may include, for example, a normal distribution, a non-normal distribution, and the like.
In some implementation scenarios, determining the distribution information of the index data may include: performing a normality test on the index data to determine the distribution information of the index data.
For example, in some implementation scenarios, a Q-Q plot test may be performed on each piece of index data to determine whether the index data is normally distributed. A Q-Q plot can be used to visually check whether a group of data comes from a certain distribution, or whether two groups of data come from the same distribution (family). Here, the Q-Q plot is used to check whether the data comes from a normal distribution.
Of course, the above description is only exemplary, and the way of checking whether the data is normally distributed is not limited to the above example; a histogram, skewness and kurtosis, and the like may also be used. Those skilled in the art may make other modifications in light of the technical spirit of this application, and such modifications fall within the protection scope of this application as long as the functions and effects achieved are the same as or similar to those of this application.
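As a minimal sketch of the skewness-and-kurtosis alternative mentioned above, the following computes the moment-based sample skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The function names and the tolerance values are assumptions chosen for illustration; the specification does not prescribe them.

```python
def skewness_and_excess_kurtosis(values):
    """Return moment-based (skewness, excess kurtosis) of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; moments are undefined")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def looks_normal(values, skew_tol=0.5, kurt_tol=1.0):
    """Rough normality screen; the tolerances are illustrative assumptions."""
    skew, kurt = skewness_and_excess_kurtosis(values)
    return abs(skew) <= skew_tol and abs(kurt) <= kurt_tol
```

A screen of this kind is only a coarse substitute for inspecting a Q-Q plot; in practice the tolerances would be tuned to the data at hand.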
S4: removing abnormal values from the index data based on the distribution information to obtain first index data.
In some implementation scenarios, after determining the distribution information of the index data, the first index data may be obtained by removing abnormal values in the index data based on the distribution information. The attribute values included in the first index data are attribute values remaining after the abnormal attribute values are removed from the plurality of attribute values corresponding to the index data.
In some implementation scenarios, removing abnormal values from the index data based on the distribution information to obtain the first index data may include: determining a confidence interval corresponding to the index data when the index data is determined to follow a normal distribution; and removing the attribute values outside the confidence interval to obtain the first index data. A confidence interval is an estimation interval for a population parameter constructed from sample statistics. In statistics, the confidence interval of a probability sample is an interval estimate of some population parameter of that sample. The confidence interval is a commonly used interval estimation method; it is the interval whose upper and lower boundaries are the upper and lower confidence limits of the statistic.
For example, in some implementation scenarios, after a Q-Q plot test is performed on each piece of index data to determine whether it is normally distributed, if the distribution information is a normal distribution, the confidence interval of the index data, that is, the interval thresholds at a preset confidence level, may be calculated. Extreme values (that is, abnormal values) lying beyond the thresholds are then identified and removed to obtain the first index data. The confidence level may also be understood as the degree of confidence. The preset confidence level may be 95%, 90%, etc., and may be set according to the actual scenario.
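One possible reading of this step is sketched below: the thresholds are taken as the central interval of a normal distribution fitted to the sample (mean plus or minus z standard deviations at the preset confidence level). This interpretation, and the function name `remove_outliers_normal`, are assumptions; the specification does not give the exact computation.

```python
from statistics import NormalDist, mean, stdev


def remove_outliers_normal(values, confidence=0.95):
    """Split values into (kept, removed) using mean +/- z * stdev thresholds."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs >= 2 values
    # Two-sided quantile of the standard normal, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    lower, upper = mu - z * sigma, mu + z * sigma
    kept = [v for v in values if lower <= v <= upper]
    removed = [v for v in values if not lower <= v <= upper]
    return kept, removed, (lower, upper)


kept, removed, thresholds = remove_outliers_normal(
    [9, 10, 11, 10, 9, 11, 10, 10, 9, 11, 50]
)
```

Here `kept` plays the role of the first index data, and `removed` holds the abnormal values, which can be stored as a separate category.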
In some implementation scenarios, removing abnormal values from the index data based on the distribution information to obtain the first index data may further include: when the index data is determined not to follow a normal distribution, calculating a Hill estimate corresponding to each attribute value in the index data; constructing a Hill plot based on the order information and the Hill estimate of each attribute value; selecting the abscissa of the starting point of a region in the Hill plot that satisfies a preset condition to obtain a first threshold; and removing the attribute values corresponding to abscissas greater than the first threshold to obtain the first index data. The Hill estimate may also be referred to as the Hill estimate of the extreme value index. Hill estimation is a very convenient method of selecting a threshold. Calculating the Hill estimates provides the basis for the subsequent determination of the first threshold.
In some implementation scenarios, calculating the Hill estimate corresponding to each attribute value in the index data may include: sorting the plurality of attribute values in the index data to obtain sorted attribute values; and calculating the Hill estimate corresponding to each attribute value in a preset manner based on the sorted attribute values.
In some implementation scenarios, the Hill estimation value corresponding to each attribute value may be calculated by:
[Formula (1): Hill estimate H_{k,n}; the formula image is not reproduced in this text]
where X_{1,n}, X_{2,n}, …, X_{n,n} denote the n sorted attribute values, H_{k,n} denotes the Hill estimate corresponding to X_{k,n}, k and j are index numbers, and 1 ≤ k ≤ n-1.
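For reference, the Hill estimator in its commonly used form, for positive order statistics sorted in ascending order as defined here, is shown below. That formula (1) takes exactly this form is an assumption, since the text only defines its symbols.

```latex
H_{k,n} \;=\; \frac{1}{k}\sum_{j=1}^{k} \ln X_{n-j+1,n} \;-\; \ln X_{n-k,n},
\qquad 1 \le k \le n-1
```

In this form, H_{k,n} averages the log-excesses of the k largest values over the (k+1)-th largest value X_{n-k,n}.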
For example, in some implementation scenarios, after a Q-Q plot test is performed on each piece of index data to determine whether it is normally distributed, if the distribution information is a non-normal distribution, the attribute values {X_1, X_2, …, X_n} corresponding to the same index may be sorted in ascending order to obtain the sorted attribute values {X_{1,n}, X_{2,n}, …, X_{n,n}}, where X_{1,n} ≤ X_{2,n} ≤ … ≤ X_{n,n}. Further, the Hill estimate corresponding to each attribute value may be calculated according to formula (1) above.
In some implementation scenarios, after the Hill estimate corresponding to each attribute value is obtained, a point set {(k, H_{k,n}) : 1 ≤ k ≤ n-1} may be obtained. Furthermore, a curve may be constructed based on the point set; the constructed curve may be called a Hill plot. A Hill plot may be used to select the threshold.
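A minimal sketch of building the point set for a Hill plot is given below. It assumes the commonly used Hill estimator for ascending-sorted, strictly positive values (the mean of the logs of the k largest values minus the log of the next-largest value); whether this matches formula (1) exactly is an assumption, and the function name `hill_estimates` is hypothetical.

```python
import math


def hill_estimates(values):
    """Return the Hill plot point set [(k, H_k)] for k = 1 .. n-1."""
    xs = sorted(values)  # ascending: xs[0] <= ... <= xs[n-1]
    if len(xs) < 2 or xs[0] <= 0:
        raise ValueError("need at least two strictly positive values")
    logs = [math.log(x) for x in xs]
    n = len(logs)
    points = []
    running = 0.0  # running sum of the logs of the k largest values
    for k in range(1, n):
        running += logs[n - k]  # adds the log of the k-th largest value
        # Mean log of the top k values minus log of the (k+1)-th largest.
        points.append((k, running / k - logs[n - k - 1]))
    return points
```

Plotting the second coordinate against k gives the curve from which a stable region can be read off by inspection.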
In some implementation scenarios, after the Hill plot is constructed, the abscissa of the starting point of the region satisfying the preset condition may be selected as the first threshold by observing how the Hill estimates change across the plot. Further, the attribute values corresponding to abscissas greater than the first threshold may be removed to obtain the first index data. The region satisfying the preset condition may be the interval corresponding to the stable part (or relatively stable part, exhibiting a steady, roughly linear state) of the Hill plot.
In some implementation scenarios, the abnormal value removed from the index data may be saved as a category.
S6: binning the first index data to obtain a binning result.
In some implementation scenarios, after the abnormal values in the index data are removed based on the distribution information to obtain the first index data, the first index data may be binned to obtain a binning result. Data binning groups samples with high similarity together. Binning can discretize a continuous variable, that is, produce discrete features. Discrete features may have the following advantages: (1) discrete features are easy to add and remove, which makes rapid iteration of the model easy; (2) discretized features are highly robust to abnormal data. For example, suppose a feature is 1 when the age is greater than 20 and 0 otherwise; if the feature were not discretized, an abnormal record with an age of 90 would greatly disturb the model; (3) logistic regression is a generalized linear model with limited expressive power; after a single variable is discretized into N variables, each variable has its own weight, which is equivalent to introducing nonlinearity into the model and can improve its expressive power and fit; (4) the model is more stable after the features are discretized. For example, if customer age is discretized with 20 to 30 years as one interval, a customer does not become a completely different sample merely by growing one year older. Discretization also simplifies the logistic regression model and reduces the risk of overfitting; (5) all variables can be transformed to similar scales.
In some implementation scenarios, binning the first index data to obtain a binning result may include: determining a target number of bins; clustering the first index data using the K-Means algorithm based on the target number of bins to obtain a classification result; and determining the binning intervals corresponding to the first index data according to the classification result. It is to be understood that the above description is only exemplary and the clustering method is not limited to the above example; those skilled in the art may make other modifications in light of the technical spirit of this application, and such modifications fall within the protection scope of this application as long as the functions and effects achieved are the same as or similar to those of this application.
In some implementation scenarios, the target number of bins (which may also be referred to as the number of clusters) for each index may be determined using the inflection point (elbow) method, and the index data is then binned according to the K-Means algorithm.
In some implementation scenarios, after the target bin number is determined, each attribute value corresponding to the first index data may be used as a sample, then samples with the same number as the target bin number are randomly selected as centers, and the closest center point is found for the remaining samples, so as to divide them into several categories. Further, in each category, a new central point may be selected, and the above process is repeated until the central point is not changed any more, and the classification is finished, and a classification result is obtained. The classification result includes the same number of categories as the target bin number, and each category may include one or more attribute values.
In some implementation scenarios, after obtaining the classification result, the binning interval may be set according to the classification result.
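The clustering and interval-setting steps above might be sketched as follows for one-dimensional numeric index data. The function names, the optional `init` parameter, the fixed random seed, and the choice of cut points halfway between adjacent cluster centers are all illustrative assumptions rather than details given in this specification.

```python
import random


def kmeans_1d(values, k, init=None, seed=0, max_iter=100):
    """Cluster 1-D values with K-Means; return the sorted cluster centers."""
    if init is None:
        # Randomly pick k distinct samples as the initial centers.
        init = random.Random(seed).sample(sorted(set(values)), k)
    centers = sorted(init)
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = sorted(
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        )
        if new_centers == centers:  # centers no longer change: finished
            break
        centers = new_centers
    return centers


def bin_edges(centers):
    """Cut points halfway between adjacent centers (nearest-center boundary)."""
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


def assign_bin(value, edges):
    """Index of the bin containing value; a value equal to an edge goes low."""
    return sum(value > e for e in edges)
```

For example, with values `[1, 2, 3, 10, 11, 12, 100, 101, 102]` and initial centers `[1, 10, 100]`, the centers converge to 2, 11 and 101, giving the intervals up to 6.5, from 6.5 to 56, and above 56. With random initialization, K-Means may converge to a different local optimum, so results can vary with the seed.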
In some implementations, the binning interval may be a number of finite segments for numerical metric data. For example, the business registration years are divided into less than 5 years, 5 to 10 years, 10 to 20 years, more than 20 years, and the like.
In some implementation scenarios, for categorical index data with a large number of attribute values, the binning intervals may be a small number of groups obtained by merging the values. For example, the provinces of enterprise registration may be divided into {Beijing, Shanghai, Guangzhou}, {Heilongjiang, Jilin, Liaoning}, {Jiangsu, Zhejiang, Shanghai}, {Fujian, Guangdong, Hunan}, and the like.
Of course, the above description is only exemplary and the binning intervals are not limited to the above examples; those skilled in the art may make other modifications in light of the technical spirit of this application, and such modifications fall within the protection scope of this application as long as the functions and effects achieved are the same as or similar to those of this application.
In some implementation scenarios, after the binning result is obtained, the method may further include: determining characteristic indexes for predicting institution default based on the binning result corresponding to each piece of index data.
In some implementation scenarios, determining the characteristic indexes for predicting institution default based on the binning result corresponding to each piece of index data may include: calculating an IV value of each index based on the binning result corresponding to each piece of index data; and determining the characteristic indexes for predicting institution default according to the IV value of each index. IV stands for Information Value. The IV value may be used to indicate how much an index contributes to predicting the target, that is, the predictive ability of the index. Generally, the higher the IV value, the stronger the predictive ability of the index and the greater its information contribution. Of course, the above description is only exemplary; the characteristic indexes for predicting institution default may also be determined by calculating the WOE (Weight of Evidence) of each index. Those skilled in the art may make other modifications in light of the technical spirit of this application, and such modifications fall within the protection scope of this application as long as the functions and effects achieved are the same as or similar to those of this application.
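The specification does not state how WOE and IV are computed. The sketch below uses the definitions commonly applied in credit scoring, which is an assumption: for each bin, WOE is the log of the ratio between the bin's share of defaulting samples and its share of non-defaulting samples, and IV sums the share difference weighted by WOE. The function name `woe_iv` and its input layout are hypothetical.

```python
import math


def woe_iv(bins):
    """bins: list of (good_count, bad_count) per bin. Returns (woe_list, iv)."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woe_list, iv = [], 0.0
    for good, bad in bins:
        if good == 0 or bad == 0:
            # WOE is undefined for an empty class; merge bins or smooth first.
            raise ValueError("each bin needs at least one good and one bad sample")
        p_good = good / total_good  # bin's share of non-defaulting samples
        p_bad = bad / total_bad     # bin's share of defaulting samples
        woe = math.log(p_bad / p_good)
        woe_list.append(woe)
        iv += (p_bad - p_good) * woe
    return woe_list, iv
```

The sign convention for WOE varies between references, but the IV value is the same under either convention.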
In some implementation scenarios, determining the characteristic indexes for predicting institution default according to the IV value of each index may include: ranking all the indexes by their IV values, and selecting the indexes that satisfy a preset condition as the characteristic indexes for predicting institution default. The preset condition may be ranking in the top 3, ranking in the top 5, having an IV value greater than 0.3, and the like, which is not limited in this specification.
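The ranking and selection step can be sketched as below. The function name `select_indexes`, the index names, and the IV values are hypothetical and serve only to illustrate the top-N and IV-threshold conditions.

```python
def select_indexes(iv_by_index, top_n=None, min_iv=None):
    """Rank indexes by IV (descending) and apply the preset condition(s)."""
    ranked = sorted(iv_by_index.items(), key=lambda item: item[1], reverse=True)
    if min_iv is not None:
        ranked = [(name, iv) for name, iv in ranked if iv > min_iv]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [name for name, _ in ranked]


# Hypothetical IV values for four indexes.
iv_values = {
    "profitability": 0.42,
    "solvency": 0.35,
    "registration_years": 0.08,
    "operating_capacity": 0.31,
}
top_three = select_indexes(iv_values, top_n=3)
above_threshold = select_indexes(iv_values, min_iv=0.3)
```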
In some implementation scenarios, the characteristic indexes for predicting institution default may include at least one of: a profitability index, a solvency index, an operating capacity index, a monthly fund flow change index of the institution, and a business prosperity index of the institution.
In the embodiments of this specification, screening by IV value after the binning result is obtained reduces the number of indexes and the data redundancy, thereby forming a globally optimal index system.
In some implementation scenarios, after the characteristic indexes for predicting institution default are determined, they may be presented as important risk-indicating variables. A risk warning threshold may also be set for each characteristic index according to the bad-debt rate of each bin and each bin's share of the total data, and then applied in a credit-granting scenario or a supervision system.
In some implementation scenarios, after the characteristic index for predicting institution default is determined, the method may further include: acquiring a sample data set, where the sample data set comprises sample data corresponding to a plurality of institutions, and each piece of sample data comprises data corresponding to the characteristic index and institution default information; and training a preset model with the sample data set to obtain a default prediction model. The preset model may be a logistic regression model, a neural network model, or the like.
The logistic regression model has the following advantages: (1) it is highly interpretable: the relationship among the variables is linear and additive, and a risk threshold can be set separately for the final score and for each variable; (2) its structure is simple, and the influence of each input variable on the target variable is easy to obtain; (3) testing, deploying, monitoring, and tuning the model are relatively simple, so it is relatively easy to engineer. By contrast, other machine learning models operate as black boxes between data input and result output and are therefore poorly interpretable, so the preset model in the embodiments of this specification is preferably a logistic regression model.
In some implementation scenarios, after the default prediction model is obtained, the method may further include: acquiring data corresponding to the characteristic index of a target institution; and determining default information of the target institution according to the data corresponding to the characteristic index of the target institution and the default prediction model. The target institution may be an institution whose risk needs to be predicted.
In the embodiments of this specification, the distribution of each piece of index data is analyzed, the abnormal values are placed in a separate bin according to the applicable case, and the binning operation is then performed on the remaining attribute values. This addresses the fact that other clustering-based binning methods ignore abnormal values, solves the prior-art problem that binning is easily affected by noise, and yields an accurate binning result.
The embodiments of this specification can be applied to the characteristic-variable preprocessing stage that precedes machine-learning modeling for selecting key risk indexes.
It is to be understood that the foregoing is only exemplary and that the embodiments of the present disclosure are not limited to the above examples. Those skilled in the art may make other modifications within the spirit of the present disclosure, and such modifications shall fall within the protection scope of the present disclosure as long as the functions and effects achieved are the same as or similar to those of the present disclosure.
From the above description, it can be seen that the embodiments of the present application can obtain index data, where the index data comprises a plurality of attribute values corresponding to the same index; determine distribution information of the index data; remove abnormal values from the index data based on the distribution information to obtain first index data; and perform binning on the first index data to obtain a binning result. Because the abnormal values in the index data are detected, analyzed, and placed in a separate bin before the index data is binned, the prior-art problem of inaccurate binning results caused by noise can be solved.
In the present specification, the method embodiments are described in a progressive manner; the same or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. For related details, refer to the description of the method embodiments.
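The training and prediction steps above can be sketched as follows, assuming a plain logistic regression fitted by batch gradient descent. The function names, the learning rate, and the number of epochs are illustrative assumptions and do not come from the specification.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(samples, labels, lr=0.5, epochs=3000):
    """Fit weights and bias by batch gradient descent on the mean log-loss.

    samples -- list of feature vectors (values of the characteristic indexes)
    labels  -- 1 if the institution defaulted, 0 otherwise
    """
    n = len(samples)
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_default_probability(model, x):
    """Apply the trained model to the index data of a target institution."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

Because the fitted score is a weighted sum of the input indexes, each weight can be read directly as the direction and strength of that index's influence, which is the interpretability advantage noted for logistic regression.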
Based on the data processing method described above, one or more embodiments of the present specification further provide a data processing apparatus. The apparatus may include systems (including distributed systems), software (applications), modules, components, servers, clients, and the like that use the methods described in the embodiments of the present specification, combined with any necessary hardware. Based on the same inventive concept, the embodiments of the present specification provide the apparatus described in the following embodiments. Since the apparatus solves the problem in a manner similar to the method, the specific implementation of the apparatus may refer to the implementation of the foregoing method, and repeated details are omitted. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Specifically, fig. 2 is a schematic block diagram of an embodiment of a data processing apparatus provided in this specification. As shown in fig. 2, the data processing apparatus provided in this specification may include: an obtaining module 120, a determining module 122, a culling module 124, and a binning module 126.
an obtaining module 120, which may be configured to obtain index data, where the index data comprises a plurality of attribute values corresponding to the same index;
a determining module 122, which may be configured to determine distribution information of the index data;
a culling module 124, which may be configured to remove abnormal values from the index data based on the distribution information to obtain first index data;
a binning module 126, which may be configured to perform binning on the first index data to obtain a binning result.
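How the four modules fit together can be sketched as a small pipeline. The class name `DataProcessingApparatus`, the callable signatures, and the choice to append the abnormal values as one extra bin are assumptions made for illustration; the specification only states that the abnormal values are binned separately.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class DataProcessingApparatus:
    # Each field stands in for one module of fig. 2 (hypothetical signatures).
    obtain: Callable[[], List[float]]                                    # module 120
    determine: Callable[[Sequence[float]], str]                          # module 122
    cull: Callable[[Sequence[float], str], Tuple[List[float], List[float]]]  # module 124
    bin_values: Callable[[Sequence[float]], List[List[float]]]           # module 126

    def run(self) -> List[List[float]]:
        index_data = self.obtain()
        distribution = self.determine(index_data)
        first_index_data, abnormal = self.cull(index_data, distribution)
        bins = self.bin_values(first_index_data)
        if abnormal:
            bins.append(list(abnormal))  # abnormal values form their own bin
        return bins
```

Passing the modules in as callables keeps the sketch independent of any particular normality test, culling rule, or binning algorithm.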
In some implementation scenarios, before the index data is obtained, the index data set corresponding to each application scenario may be obtained. The index data set may include a plurality of pieces of index data, and each piece of index data may include a plurality of attribute values.
In some implementation scenarios, when the index data set corresponding to each application scenario is obtained, the relevant data may be pulled from a preset database or a memory, and then processed, so as to obtain the index data set. The preset database may include an Oracle database, a MySQL database, and the like.
In some implementation scenarios, after the relevant data of each scenario is pulled from the preset database or the memory, the manner of processing the relevant data may include index derivation, index screening, data cleaning, and the like. In some implementations, the manner in which the metrics are derived can include summation, scaling, frequency, averaging, and so forth.
In some implementation scenarios, the determining module 122 may include:
a checking unit, which may be configured to perform a normality test on the index data and determine the distribution information of the index data.
In some implementation scenarios, a Q-Q plot test can be performed on each piece of index data to determine whether it is normally distributed. A Q-Q plot can be used to visually check whether a group of data comes from a given distribution, or whether two groups of data come from the same distribution (family); in particular, it can be used to check whether data comes from a normal distribution. Of course, the above description is only exemplary, and the way of checking whether data is normally distributed is not limited to the above example; a histogram, skewness and kurtosis, and the like may also be used. Those skilled in the art may make other modifications based on the technical spirit of the present application, and such modifications shall fall within the protection scope of the present application as long as the functions and effects achieved are the same as or similar to those of the present application.
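A Q-Q plot is normally inspected by eye. The sketch below shows one way the check could be automated, under the assumption that the correlation between the sample quantiles and the theoretical normal quantiles is used as the criterion. The plotting position `(i + 0.5) / n` and the cut-off `min_corr=0.98` are illustrative choices, not values from the specification.

```python
import math
from statistics import NormalDist

def qq_points(values):
    """Return (theoretical quantile, sample quantile) pairs of a normal Q-Q plot."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()
    # (i + 0.5) / n is one common plotting position; others exist.
    return [(std_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

def qq_correlation(values):
    """Pearson correlation of the Q-Q points; close to 1 for normal data.

    Raises ZeroDivisionError if all values are identical.
    """
    pts = qq_points(values)
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_x = sum(x for _, x in pts) / n
    cov = sum((t - mean_t) * (x - mean_x) for t, x in pts)
    var_t = sum((t - mean_t) ** 2 for t, _ in pts)
    var_x = sum((x - mean_x) ** 2 for _, x in pts)
    return cov / math.sqrt(var_t * var_x)

def looks_normal(values, min_corr=0.98):
    """Treat the index data as normally distributed if the Q-Q points are nearly linear."""
    return qq_correlation(values) >= min_corr
```

Points that fall on a straight line in the Q-Q plot give a correlation near 1, whereas heavy-tailed data bends away from the line and lowers it.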
In some implementation scenarios, the culling module 124 may include:
a first determining unit, which may be configured to determine a confidence interval corresponding to the index data when the index data is determined to satisfy a normal distribution;
a first removing unit, which may be configured to remove the attribute values outside the confidence interval to obtain the first index data.
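The two units above can be sketched as follows, assuming the confidence interval is the symmetric interval around the sample mean that covers a chosen share of a fitted normal distribution. The default coverage of 0.9973 (about three standard deviations) and the function name are assumptions.

```python
from statistics import NormalDist, mean, stdev

def split_by_confidence_interval(values, confidence=0.9973):
    """Return (kept, abnormal, (low, high)) for normally distributed index data."""
    mu = mean(values)
    sigma = stdev(values)
    # Two-sided z value for the requested coverage, e.g. about 3.0 for 0.9973.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    low, high = mu - z * sigma, mu + z * sigma
    kept = [v for v in values if low <= v <= high]
    abnormal = [v for v in values if v < low or v > high]
    return kept, abnormal, (low, high)
```

The mean and standard deviation here are computed from data that still contains the abnormal values, so a very large outlier widens the interval; a more robust estimate could be substituted if that matters.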
In some implementation scenarios, the culling module 124 may further include:
a calculating unit, which may be configured to calculate a Hill estimate corresponding to each attribute value in the index data when the index data is determined not to satisfy a normal distribution;
a construction unit, which may be configured to construct a Hill plot based on the order information and the Hill estimate of each attribute value;
a first obtaining unit, which may be configured to select the abscissa of the starting point of the region in the Hill plot that satisfies a preset condition, to obtain a first threshold;
a second obtaining unit, which may be configured to remove the attribute values whose abscissa is larger than the first threshold, to obtain the first index data.
In some implementation scenarios, calculating the Hill estimate corresponding to each attribute value in the index data may include: sorting the plurality of attribute values in the index data to obtain sorted attribute values; and calculating the Hill estimate corresponding to each attribute value in a preset manner based on the sorted attribute values.
In some implementation scenarios, the Hill estimation value corresponding to each attribute value may be calculated by:
Figure BDA0003221333450000111
where $X_{1,n}, X_{2,n}, \ldots, X_{n,n}$ denote the $n$ sorted attribute values, $H_{k,n}$ denotes the Hill estimate corresponding to $X_{k,n}$, $k$ and $j$ are sequence numbers, and $1 \le k \le n-1$.
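The sketch below assumes that the formula referenced above is the standard Hill estimator with the attribute values sorted in descending order, $H_{k,n} = \frac{1}{k}\sum_{j=1}^{k}\ln X_{j,n} - \ln X_{k+1,n}$ for $1 \le k \le n-1$. This matches the stated index range but is an assumption about the formula image. The stability rule (`window`, `tol`) and the reading of the first threshold as the attribute value at the start of the stable region are likewise assumptions.

```python
import math

def hill_estimates(values):
    """Return (sorted_desc, estimates) where estimates[k - 1] is H_{k,n}."""
    xs = sorted(values, reverse=True)
    if xs[-1] <= 0:
        raise ValueError("the Hill estimator needs strictly positive values")
    logs = [math.log(x) for x in xs]
    estimates, running = [], 0.0
    for k in range(1, len(xs)):                  # k = 1 .. n-1
        running += logs[k - 1]                   # sum of ln X_{j,n} for j <= k
        estimates.append(running / k - logs[k])  # logs[k] is ln X_{k+1,n}
    return xs, estimates

def first_stable_k(estimates, window=5, tol=0.05):
    """Return the 1-based k where the Hill plot first stays within tol for `window` points."""
    for start in range(len(estimates) - window + 1):
        segment = estimates[start:start + window]
        if max(segment) - min(segment) <= tol:
            return start + 1
    return None

def split_tail(values, window=5, tol=0.05):
    """Return (first_index_data, abnormal) using the start of the stable region as threshold."""
    xs, estimates = hill_estimates(values)
    k = first_stable_k(estimates, window, tol)
    if k is None:
        return list(values), []
    threshold = xs[k - 1]
    kept = [v for v in values if v <= threshold]
    abnormal = [v for v in values if v > threshold]
    return kept, abnormal
```

The points `(k, estimates[k - 1])` are what a Hill plot displays; in practice the stable region is often chosen by inspecting that plot rather than by a fixed tolerance.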
In some implementation scenarios, the binning module 126 may include:
a second determining unit, which may be configured to determine a target bin number;
a third obtaining unit, which may be configured to cluster the first index data with a K-Means algorithm based on the target bin number to obtain a classification result;
a third determining unit, which may be configured to determine the binning intervals corresponding to the first index data according to the classification result.
In some implementation scenarios, the target bin number (which may also be referred to as the number of clusters) of each index may be determined using the inflection point (elbow) method, and the index data may then be binned according to the K-Means algorithm.
In some implementation scenarios, after the target bin number is determined, each attribute value in the first index data may be used as a sample. A number of samples equal to the target bin number are then randomly selected as centers, and each remaining sample is assigned to its closest center, so that the samples are divided into several categories. Further, a new center point may be selected in each category, and the above process is repeated until the center points no longer change, at which point the classification ends and the classification result is obtained. The classification result contains as many categories as the target bin number, and each category may include one or more attribute values.
In some implementation scenarios, after the classification result is obtained, the binning intervals may be set according to the classification result.
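The clustering and interval-setting steps above can be sketched for one-dimensional index data as follows. The seeded random generator, the use of the cluster mean as the new center, and the choice to place each cut point midway between neighbouring clusters are assumptions; the specification does not fix how the intervals are derived from the classification result.

```python
import random

def kmeans_1d(values, k, seed=0, max_iter=100):
    """Cluster 1-D attribute values into at most k groups, returned in ascending order."""
    rng = random.Random(seed)
    # Randomly pick k distinct samples as initial centers
    # (raises ValueError if there are fewer than k distinct values).
    centers = rng.sample(sorted(set(values)), k)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # New center = mean of the cluster; an empty cluster keeps its old center.
        new_centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    non_empty = [sorted(c) for c in clusters if c]
    return sorted(non_empty, key=lambda c: c[0])

def bin_edges(clusters):
    """Cut points halfway between the largest value of one cluster and the smallest of the next."""
    return [(a[-1] + b[0]) / 2 for a, b in zip(clusters, clusters[1:])]
```

For the values `[1, 2, 3, 20, 21, 22]` with `k=2`, the two clusters are `[1, 2, 3]` and `[20, 21, 22]`, giving a single cut point at 11.5. K-Means with random initialization can settle in a local optimum for larger `k`, so several seeds may be tried in practice.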
In some implementation scenarios, after the binning result is obtained, the method may further include: determining a characteristic index for predicting institution default based on the binning result corresponding to each piece of index data.
In some implementation scenarios, determining the characteristic index for predicting institution default based on the binning result corresponding to each piece of index data may include: calculating an IV value for each index based on the binning result corresponding to each piece of index data; and determining the characteristic index for predicting institution default according to the IV value of each index.
In some implementation scenarios, the characteristic index for predicting institution default can include at least one of: a profitability index, a repayment capacity index, an operating capacity index, an index of the monthly change in the institution's fund flow, and a business scenario degree index of the institution.
In some implementation scenarios, after the characteristic index for predicting institution default is determined, the method may further include: acquiring a sample data set, where the sample data set comprises sample data corresponding to a plurality of institutions, and each piece of sample data comprises data corresponding to the characteristic index and institution default information; and training a preset model with the sample data set to obtain a default prediction model.
In some implementation scenarios, after the default prediction model is obtained, the method may further include: acquiring data corresponding to the characteristic index of a target institution; and determining default information of the target institution according to the data corresponding to the characteristic index of the target institution and the default prediction model.
It should be noted that, in accordance with the description of the method embodiments, the apparatus described above may also include other implementations; for the specific implementation, refer to the description of the related method embodiments, which is not repeated here.
The present specification also provides an embodiment of a data processing device comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement steps comprising: acquiring index data, where the index data comprises a plurality of attribute values corresponding to the same index; determining distribution information of the index data; removing abnormal values from the index data based on the distribution information to obtain first index data; and performing binning on the first index data to obtain a binning result.
It should be noted that the above-mentioned apparatuses may also include other embodiments according to the description of the method or apparatus embodiments. The specific implementation manner may refer to the description of the related method embodiment, and is not described in detail herein.
The method embodiments provided in the present specification may be executed in a mobile terminal, a computer terminal, a server, or a similar computing device. Taking execution on a server as an example, fig. 3 is a block diagram of the hardware structure of an embodiment of a data processing server provided in this specification, where the server may be the data processing apparatus or data processing device of the foregoing embodiments. As shown in fig. 3, the server 10 may include one or more (only one shown) processors 100 (the processors 100 may include, but are not limited to, a processing device such as a microcontroller (MCU) or a programmable logic device (FPGA)), a memory 200 for storing data, and a transmission module 300 for communication functions. It will be understood by those skilled in the art that the structure shown in fig. 3 is only an illustration and is not intended to limit the structure of the electronic device. For example, the server 10 may include more or fewer components than shown in fig. 3, may include other processing hardware such as a database, a multi-level cache, or a GPU, or may have a configuration different from that shown in fig. 3.
The memory 200 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the data processing method in the embodiments of the present specification, and the processor 100 executes various functional applications and data processing by executing the software programs and modules stored in the memory 200. Memory 200 may include high speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 200 may further include memory located remotely from processor 100, which may be connected to a computer terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 300 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission module 300 includes a network interface controller (NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission module 300 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The method or apparatus provided by the present specification and described in the foregoing embodiments may implement service logic through a computer program and record the service logic on a storage medium, where the storage medium may be read and executed by a computer, so as to implement the effect of the solution described in the embodiments of the present specification. The storage medium may include a physical device for storing information, and typically, the information is digitized and then stored using an electrical, magnetic, or optical media. The storage medium may include: devices that store information using electrical energy, such as various types of memory, e.g., RAM, ROM, etc.; devices that store information using magnetic energy, such as hard disks, floppy disks, tapes, core memories, bubble memories, and usb disks; devices that store information optically, such as CDs or DVDs. Of course, there are other ways of storing media that can be read, such as quantum memory, graphene memory, and so forth.
The embodiments of the data processing method or apparatus provided in this specification may be implemented in a computer by a processor executing corresponding program instructions, for example, implemented on a PC in C++ on a Windows operating system, implemented on a Linux system, implemented on an intelligent terminal in a programming language for the Android or iOS system, or implemented in processing logic based on a quantum computer.
It should be noted that descriptions of the apparatus, the device, and the system described above according to the related method embodiments may also include other embodiments, and specific implementations may refer to descriptions of corresponding method embodiments, which are not described in detail herein.
The embodiments in the present application are described in a progressive manner; the same or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the hardware-plus-program embodiments are described briefly because they are substantially similar to the method embodiments; for related details, refer to the corresponding parts of the description of the method embodiments.
For convenience of description, the above devices are described as being divided into various modules by function, with each module described separately. Of course, when implementing one or more embodiments of the present description, the functions of the modules may be implemented in one or more pieces of software and/or hardware, or a module implementing a given function may be implemented by a combination of a plurality of sub-modules or sub-units.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, devices, systems according to embodiments of the invention. It will be understood that the implementation can be by computer program instructions which can be provided to a processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
As will be appreciated by one skilled in the art, one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.
The above description is merely exemplary of one or more embodiments of the present disclosure and is not intended to limit the scope of one or more embodiments of the present disclosure. Various modifications and alterations to one or more embodiments described herein will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims.

Claims (15)

1. A data processing method, comprising:
acquiring index data; the index data comprises a plurality of attribute values corresponding to the same index;
determining distribution information of the index data;
based on the distribution information, removing abnormal values in the index data to obtain first index data;
and performing binning on the first index data to obtain a binning result.
2. The method of claim 1, wherein the determining distribution information of the index data comprises:
performing a normality test on the index data to determine the distribution information of the index data.
3. The method according to claim 2, wherein the removing abnormal values in the index data based on the distribution information to obtain first index data comprises:
determining a confidence interval corresponding to the index data when the index data is determined to satisfy a normal distribution;
and removing the attribute values outside the confidence interval to obtain the first index data.
4. The method according to claim 2, wherein the removing abnormal values in the index data based on the distribution information to obtain first index data further comprises:
calculating a Hill estimate corresponding to each attribute value in the index data when the index data is determined not to satisfy a normal distribution;
constructing a Hill plot based on the order information and the Hill estimate of each attribute value;
selecting the abscissa of the starting point of the region in the Hill plot that satisfies a preset condition, to obtain a first threshold;
and removing the attribute values whose abscissa is larger than the first threshold, to obtain the first index data.
5. The method as claimed in claim 4, wherein the calculating the Hill estimate corresponding to each attribute value in the index data comprises:
sorting the plurality of attribute values in the index data to obtain sorted attribute values;
and calculating the Hill estimate corresponding to each attribute value in a preset manner based on the sorted attribute values.
6. The method of claim 5, wherein the Hill estimate for each attribute value is calculated by:
Figure FDA0003221333440000021
where $X_{1,n}, X_{2,n}, \ldots, X_{n,n}$ denote the $n$ sorted attribute values, $H_{k,n}$ denotes the Hill estimate corresponding to $X_{k,n}$, $k$ and $j$ are sequence numbers, and $1 \le k \le n-1$.
7. The method according to claim 1, wherein the performing binning on the first index data to obtain a binning result comprises:
determining a target bin number;
clustering the first index data with a K-Means algorithm based on the target bin number to obtain a classification result;
and determining the binning intervals corresponding to the first index data according to the classification result.
8. The method of claim 1, further comprising, after obtaining the binning result:
determining a characteristic index for predicting institution default based on the binning result corresponding to each piece of index data.
9. The method of claim 8, wherein the determining a characteristic index for predicting institution default based on the binning result corresponding to each piece of index data comprises:
calculating an IV value for each index based on the binning result corresponding to each piece of index data;
and determining the characteristic index for predicting institution default according to the IV value of each index.
10. The method of claim 8, wherein the characteristic index for predicting institution default comprises at least one of: a profitability index, a repayment capacity index, an operating capacity index, an index of the monthly change in the institution's fund flow, and a business scenario degree index of the institution.
11. The method of claim 10, further comprising, after determining the characteristic index for predicting institution default:
acquiring a sample data set, wherein the sample data set comprises sample data corresponding to a plurality of institutions, and each piece of sample data comprises data corresponding to the characteristic index and institution default information;
and training a preset model with the sample data set to obtain a default prediction model.
12. The method of claim 11, further comprising, after obtaining the default prediction model:
acquiring data corresponding to the characteristic index of a target institution;
and determining default information of the target institution according to the data corresponding to the characteristic index of the target institution and the default prediction model.
13. A data processing apparatus, comprising:
an obtaining module, configured to obtain index data, wherein the index data comprises a plurality of attribute values corresponding to the same index;
a determining module, configured to determine distribution information of the index data;
a culling module, configured to remove abnormal values from the index data based on the distribution information to obtain first index data;
and a binning module, configured to perform binning on the first index data to obtain a binning result.
14. A data processing apparatus comprising at least one processor and a memory storing computer-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 12.
15. A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1-12.
CN202110958825.4A 2021-08-20 2021-08-20 Data processing method, device and equipment Pending CN113673595A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110958825.4A CN113673595A (en) 2021-08-20 2021-08-20 Data processing method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110958825.4A CN113673595A (en) 2021-08-20 2021-08-20 Data processing method, device and equipment

Publications (1)

Publication Number Publication Date
CN113673595A true CN113673595A (en) 2021-11-19

Family

ID=78544216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110958825.4A Pending CN113673595A (en) 2021-08-20 2021-08-20 Data processing method, device and equipment

Country Status (1)

Country Link
CN (1) CN113673595A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033037A (en) * 2018-07-26 2018-12-18 厦门大学 Buoy automatic monitoring system data quality control method
CN110942171A (en) * 2019-09-12 2020-03-31 中电科新型智慧城市研究院有限公司 Enterprise labor and resource dispute risk prediction method based on machine learning
CN111311128A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 Consumption financial credit scoring card development method based on third-party data
CN111311402A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 XGboost-based internet financial wind control model
CN112100291A (en) * 2020-09-18 2020-12-18 中国建设银行股份有限公司 Data binning method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
欧阳资生: "极值估计在金融保险中的应用", 30 April 2006, 北京:中国经济出版社, pages: 61 *

Similar Documents

Publication Publication Date Title
CN108564286B (en) Artificial intelligent financial wind-control credit assessment method and system based on big data credit investigation
WO2017143919A1 (en) Method and apparatus for establishing data identification model
CN112734559B (en) Enterprise credit risk evaluation method and device and electronic equipment
CN110751557A (en) Abnormal fund transaction behavior analysis method and system based on sequence model
CN109635010B (en) User characteristic and characteristic factor extraction and query method and system
CN112990386B (en) User value clustering method and device, computer equipment and storage medium
CN110766481A (en) Client data processing method and device, electronic equipment and computer readable medium
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
CN110782349A (en) Model training method and system
CN109242165A (en) Model training and prediction method and device based on model training
CN117611011A (en) Data processing method and device, electronic equipment and storage medium
CN115481694B (en) Data enhancement method, device and equipment for training sample set and storage medium
CN115237970A (en) Data prediction method, device, equipment, storage medium and program product
CN114936714A (en) Vehicle replacement prediction model training method, prediction method, device, medium and equipment
CN113673595A (en) Data processing method, device and equipment
CN115619430A (en) User value evaluation method and device
CN110570301B (en) Risk identification method, device, equipment and medium
CN114596152A (en) Method, device and storage medium for predicting debt subject default based on unsupervised model
CN112685610A (en) False registration account identification method and related device
Zeng A comparison study on the era of internet finance China construction of credit scoring system model
CN113435655B (en) Sector dynamic management decision method, server and system
CN117788133A (en) Method for constructing retail credit risk prediction model and retail credit score model
CN118333737A (en) Method for constructing retail credit risk prediction model and consumer credit business Scorebetai model
CN118798977A (en) Method, apparatus, device, storage medium and product for analyzing conversion of paid member
CN118071482A (en) Method for constructing retail credit risk prediction model and consumer credit business Scorebetad model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination