CN116976187A

CN116976187A - Modeling variable determining method, abnormal data prediction model construction method and device

Info

Publication number: CN116976187A
Application number: CN202310566293.9A
Authority: CN
Inventors: 李欣
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2023-05-18
Filing date: 2023-05-18
Publication date: 2023-10-31

Abstract

The application relates to a modeling variable determining method, an abnormal data prediction model construction method, and a device. The method involves artificial intelligence and comprises: performing variable derivation processing on each first payment data sample and each second payment data sample to obtain first derivative variables and second derivative variables; performing step-by-step feature screening on the first derivative variables to obtain first modeling variables; and determining the first modeling variables and second derivative variables that share the same service attribute. Variable distribution statistics are then computed for the first modeling variables and second derivative variables with the same service attribute to obtain distribution statistics data, and second modeling variables are screened out according to the distribution statistics data, the first modeling variables, and the second derivative variables, where the second modeling variables are used to construct an abnormal data prediction model. With this method, comprehensive and accurate modeling variables can be obtained, and an abnormal data prediction model with high prediction accuracy can be constructed.

Description

Modeling variable determining method, abnormal data prediction model construction method and device

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a modeling variable determining method, an abnormal data prediction model constructing method, an abnormal data determining method, an apparatus, a computer device, a storage medium, and a computer program product.

Background

With the development of artificial intelligence technology and the widespread adoption of internet-based monetary transactions, improving processing efficiency in service promotion, service handling, and similar activities requires evaluating abnormal service conditions for the potential usage objects of different financial services and for usage objects that have already submitted service applications, so that ineffective service promotion or service handling can be reduced according to the resulting abnormality evaluation data.

In the conventional technology, abnormal service conditions of a usage object are generally evaluated based on the basic attribute information of each usage object, historical data (including historical service transaction data, historical overdue records, etc.), and the like. However, the modeling variables required for abnormality evaluation modeling can only be obtained after a long performance period following the observation point; that is, the modeling variables are usually far removed from the current actual use time and represent the historical performance of the usage object. Therefore, when an abnormality evaluation model constructed from such modeling variables is used for abnormality evaluation at the current actual use time, changes that have since occurred in the performance data of the usage object may cause the model to fail, and the accuracy of the actual evaluation result still needs to be improved.

Disclosure of Invention

Based on this, it is necessary to provide a modeling variable determination method, an abnormal data prediction model construction method, an abnormal data determination method, an apparatus, a computer device, a storage medium, and a computer program product capable of improving accuracy of an abnormal evaluation result for different usage objects, in view of the above-described technical problems.

In a first aspect, the present application provides a modeling variable determination method. The method comprises the following steps:

performing variable derivative processing according to the first payment data samples corresponding to the first target objects and the second payment data samples corresponding to the second target objects to obtain a plurality of first derivative variables and a plurality of second derivative variables;

step-by-step feature screening processing is carried out on the basis of each first derivative variable, each first modeling variable is obtained through screening, and first service attribute information corresponding to each first modeling variable is obtained;

performing service attribute matching processing according to second service attribute information corresponding to each second derivative variable and the first service attribute information, and determining a first modeling variable and a second derivative variable of the same service attribute;

based on the first modeling variable and the second derivative variable of the same service attribute, carrying out variable distribution statistics to obtain distribution statistics data between the first modeling variable and the second derivative variable of the same service attribute;

screening to obtain second modeling variables matched with the second target objects according to the distribution statistical data, the first modeling variables and the second derivative variables; each second modeling variable is used for constructing an abnormal data prediction model.

In a second aspect, the present application provides a method for constructing an abnormal data prediction model. The method comprises the following steps:

acquiring first modeling variables determined according to first payment data samples corresponding to each first target object, and constructing an initial abnormal data prediction model based on each first modeling variable;

acquiring second payment data samples corresponding to each second target object and each first modeling variable, and performing modeling variable screening processing on the second payment data samples and each first modeling variable to determine each second modeling variable;

based on the second modeling variables and the first payment data samples, performing transfer learning processing on the initial abnormal data prediction model to construct an abnormal data prediction model; the abnormal data prediction model is used for performing abnormal data prediction processing on each payment data to be identified to obtain corresponding abnormal data.

In a third aspect, the present application provides an abnormal data determination method. The method comprises the following steps:

receiving an abnormal data determining request, and acquiring payment data to be identified corresponding to the abnormal data determining request;

based on a trained abnormal data prediction model, carrying out abnormal data prediction processing on the payment data to be identified, and obtaining abnormal data corresponding to the payment data to be identified;

the trained abnormal data prediction model is obtained by performing transfer learning processing on the initial abnormal data prediction model based on each second modeling variable and each first payment data sample; the initial abnormal data prediction model is constructed according to first modeling variables determined from the first payment data samples corresponding to each first target object; the second modeling variables are determined by performing modeling variable screening processing based on the second payment data samples corresponding to the second target objects and the first modeling variables.

In one embodiment, the method further comprises:

determining an abnormal interval to which the payment data to be identified belongs according to the abnormal data;

and acquiring service processing logic corresponding to each abnormal interval, and performing service evaluation processing on the service application associated with the payment data to be identified according to the service processing logic.
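
The interval lookup in this embodiment can be illustrated with a minimal sketch. It assumes that the abnormal data is a numeric score between 0 and 1; the interval boundaries, the labels of the service processing logic, and the function names below are hypothetical and are not specified by the application.

```python
from bisect import bisect_right

# Hypothetical upper boundaries separating three abnormal intervals.
INTERVAL_BOUNDARIES = [0.3, 0.7]

# Hypothetical service processing logic, one entry per abnormal interval.
PROCESSING_LOGIC = ["approve", "manual_review", "reject"]


def abnormal_interval(score: float) -> int:
    """Return the index of the abnormal interval that contains the score."""
    return bisect_right(INTERVAL_BOUNDARIES, score)


def evaluate_service_application(score: float) -> str:
    """Pick the service processing logic for the interval of the score."""
    return PROCESSING_LOGIC[abnormal_interval(score)]
```

A score that falls exactly on a boundary is assigned to the higher interval here, which is one possible convention among several.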

In a fourth aspect, the application further provides a modeling variable determining device. The device comprises:

the variable derivative processing module is used for performing variable derivative processing according to the first payment data samples corresponding to the first target objects and the second payment data samples corresponding to the second target objects to obtain a plurality of first derivative variables and a plurality of second derivative variables;

the step feature screening processing module is used for carrying out step feature screening processing based on each first derivative variable, screening to obtain each first modeling variable, and obtaining first service attribute information corresponding to each first modeling variable;

the service attribute matching processing module is used for carrying out service attribute matching processing according to the second service attribute information corresponding to each second derivative variable and the first service attribute information, and determining a first modeling variable and a second derivative variable with the same service attribute;

the variable distribution statistics module is used for carrying out variable distribution statistics based on the first modeling variable and the second derivative variable with the same service attribute to obtain distribution statistics data between the first modeling variable and the second derivative variable with the same service attribute;

the second modeling variable determining module is used for screening and obtaining second modeling variables matched with the second target objects according to the distribution statistical data, the first modeling variables and the second derivative variables; each second modeling variable is used for constructing an abnormal data prediction model.

In a fifth aspect, the present application further provides an abnormal data prediction model construction device. The device comprises:

the initial abnormal data prediction model construction module is used for acquiring first modeling variables determined according to the first payment data samples corresponding to each first target object and constructing and obtaining an initial abnormal data prediction model based on each first modeling variable;

the modeling variable screening processing module is used for acquiring second payment data samples corresponding to each second target object and each first modeling variable, carrying out modeling variable screening processing, and determining each second modeling variable;

the abnormal data prediction model construction module is used for performing transfer learning processing on the initial abnormal data prediction model based on the second modeling variables and the first payment data samples to construct an abnormal data prediction model; the abnormal data prediction model is used for performing abnormal data prediction processing on each payment data to be identified to obtain corresponding abnormal data.

In a sixth aspect, the present application further provides an abnormal data determining apparatus. The device comprises:

the payment data acquisition module to be identified is used for receiving the abnormal data determination request and acquiring the payment data to be identified corresponding to the abnormal data determination request;

the abnormal data prediction processing module is used for performing abnormal data prediction processing on the payment data to be identified based on the trained abnormal data prediction model to obtain abnormal data corresponding to the payment data to be identified; the trained abnormal data prediction model is obtained by performing transfer learning processing on the initial abnormal data prediction model based on each second modeling variable and each first payment data sample; the initial abnormal data prediction model is constructed according to first modeling variables determined from the first payment data samples corresponding to each first target object; the second modeling variables are determined by performing modeling variable screening processing based on the second payment data samples corresponding to the second target objects and the first modeling variables.

In a seventh aspect, there is provided a computer device comprising: a memory storing a computer program and a processor implementing the method as in the first aspect or its respective implementation, or in the second aspect or its respective implementation, or in the third aspect or its respective implementation when the computer program stored in the memory is executed.

In an eighth aspect, a computer readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, implements a method as in the first aspect or in each implementation manner thereof, or in the second aspect or in each implementation manner thereof, or in the third aspect or in each implementation manner thereof.

A ninth aspect provides a computer program product comprising a computer program which, when executed by a processor, performs the method as in the first aspect or in each of its implementations, or in the second aspect or in each of its implementations, or in the third aspect or in each of its implementations.

In the above modeling variable determining method, abnormal data prediction model constructing method, abnormal data determining method, apparatus, computer device, storage medium, and computer program product, variable derivation processing is performed on the first payment data sample corresponding to each first target object and the second payment data sample corresponding to each second target object to obtain a plurality of first derivative variables and second derivative variables, so that more comprehensive and general derivative variables are obtained. Step-by-step feature screening is performed on the first derivative variables to obtain the first modeling variables, and the first service attribute information corresponding to each first modeling variable is obtained. Service attribute matching is then performed according to the second service attribute information corresponding to each second derivative variable and the first service attribute information to determine the first modeling variables and second derivative variables that share the same service attribute, so that the second derivative variables can be preliminarily screened on service attribute information according to the existing first modeling variables. Further, variable distribution statistics are computed for the first modeling variables and second derivative variables with the same service attribute to obtain the distribution statistics data between them. In this way, service information matching, variable distribution statistics, and variable screening are carried out against the first modeling variables screened from the comprehensive first payment data samples, and the second modeling variables matched with the second target objects are obtained. Finally, an abnormal data prediction model can be constructed from the second modeling variables, which avoids large modeling variable errors caused by an overly long observation time or by missing data samples during modeling variable determination, so that abnormal data corresponding to the payment data to be identified can be obtained quickly and accurately from the constructed model.

Drawings

FIG. 1 is an application environment diagram of a modeling variable determination method, an abnormal data prediction model construction method, and an abnormal data determination method in one embodiment;

FIG. 2 is a flow diagram of a method of modeling variable determination in one embodiment;

FIG. 3 is a schematic flow chart of a step feature screening process based on first derivative variables to obtain first modeling variables according to an embodiment;

FIG. 4 is a schematic flow chart of a step feature screening process based on each first derivative variable to obtain each first modeling variable according to another embodiment;

FIG. 5 is a flow chart of a method of modeling variable determination in another embodiment;

FIG. 6 is a flow chart of a method of constructing an abnormal data prediction model in one embodiment;

FIG. 7 is a flowchart of a method for constructing an abnormal data prediction model according to another embodiment;

FIG. 8 is a flow chart of an abnormal data determination method in one embodiment;

FIG. 9 is a schematic diagram of interval distribution of anomaly data for an anomaly data determination method in one embodiment;

FIG. 10 is a block diagram of a modeling variable determination apparatus in one embodiment;

FIG. 11 is a block diagram of an abnormal data prediction model construction device in one embodiment;

FIG. 12 is a block diagram showing the construction of an abnormal data determination apparatus in one embodiment;

FIG. 13 is an internal structure diagram of a computer device in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

The modeling variable determining method, the abnormal data prediction model constructing method, and the abnormal data determining method provided by the embodiments of the application relate to artificial intelligence technology and can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, network media, and assisted driving. Artificial intelligence (AI) is the theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.

Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing technologies typically include text processing, semantic understanding, machine translation, question answering by robots, knowledge graphs, and the like. Machine learning (ML) is a multi-disciplinary field that involves probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

With the research and advancement of artificial intelligence technology, it is being studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, and smart customer service. It is believed that, as technology develops, artificial intelligence will be applied in more fields and will become increasingly valuable.

The modeling variable determining method, the abnormal data prediction model constructing method, and the abnormal data determining method provided by the embodiments of the application relate to technologies such as computer vision, natural language processing, and machine learning within artificial intelligence, and can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or located on a cloud or another network server. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, Internet of Things device, portable wearable device, aircraft, or the like. The Internet of Things device may be a smart speaker, smart in-vehicle device, or the like, and the portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be an independent physical server, a server cluster formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and big data and artificial intelligence platforms. The terminal 102 and the server 104 may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of the application.

Further, each of the terminal 102 and the server 104 may be separately used to perform the modeling variable determining method, the abnormal data prediction model constructing method, and the abnormal data determining method provided in the embodiment of the present application, and the terminal 102 and the server 104 may also cooperatively perform the modeling variable determining method, the abnormal data prediction model constructing method, and the abnormal data determining method provided in the embodiment of the present application. For example, taking the terminal 102 and the server 104 cooperatively execute the modeling variable determining method provided in the embodiment of the present application as an example, the server 104 performs variable derivation processing according to the first payment data samples corresponding to each first target object and the second payment data samples corresponding to each second target object, so as to obtain a plurality of first derived variables and a plurality of second derived variables. The first payment data sample, the second payment data sample, and the like may be stored in a cloud storage of the server 104, or in a data storage system, or in a local storage of the terminal 102, and may be acquired from the server 104, or the data storage system, or the terminal 102 when the modeling variable determination process is required. Further, the server 104 performs step feature screening processing based on each first derivative variable, screens to obtain each first modeling variable, obtains first service attribute information corresponding to each first modeling variable, performs service attribute matching processing according to second service attribute information corresponding to each second derivative variable and the first service attribute information, and determines the first modeling variable and the second derivative variable with the same service attribute. 
Similarly, the server 104 performs variable distribution statistics based on the first modeling variable and the second derivative variable with the same service attribute to obtain the distribution statistics data between them, and then screens out the second modeling variables matched with each second target object according to the distribution statistics data, each first modeling variable, and each second derivative variable. The second modeling variables are used to construct an abnormal data prediction model. The constructed model can be used to predict abnormal data for the payment data to be identified sent by the terminal 102, and the server 104 can also feed the resulting abnormal data back to the terminal 102.
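
The matching and screening flow described above can be sketched as follows. The text does not name the distribution statistic at this point, so this sketch assumes the population stability index (PSI) over fixed bins, and assumes that second derivative variables whose distribution stays close to the matched first modeling variable are kept. The function names, the bin edges, and the 0.25 threshold are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population stability index of two non-empty samples over shared bins."""
    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        # Floor each share so that empty bins do not produce log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def match_by_attribute(first_attrs: Dict[str, str],
                       second_attrs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair first modeling variables and second derivative variables
    that carry the same service attribute."""
    return [(f, s)
            for f, fa in first_attrs.items()
            for s, sa in second_attrs.items()
            if fa == sa]


def select_second_modeling_variables(
        first_values: Dict[str, Sequence[float]],
        second_values: Dict[str, Sequence[float]],
        first_attrs: Dict[str, str],
        second_attrs: Dict[str, str],
        edges: Dict[str, Sequence[float]],
        threshold: float = 0.25) -> List[str]:
    """Keep matched second derivative variables whose PSI is below the
    (assumed) threshold."""
    selected = []
    for f, s in match_by_attribute(first_attrs, second_attrs):
        if psi(first_values[f], second_values[s], edges[f]) < threshold:
            selected.append(s)
    return selected
```

Other distribution statistics, such as a two-sample test statistic, could be substituted for `psi` without changing the surrounding flow.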

Similarly, taking as an example the case where the terminal 102 and the server 104 cooperatively execute the abnormal data prediction model constructing method provided by the embodiment of the present application, the server 104 obtains the first modeling variables determined according to the first payment data samples corresponding to the first target objects and constructs an initial abnormal data prediction model based on the first modeling variables. The server 104 then acquires the second payment data samples corresponding to the second target objects and the first modeling variables, performs modeling variable screening processing, and determines the second modeling variables. The first payment data samples, the first modeling variables, the second payment data samples, the second modeling variables, and the like may be stored in a cloud storage of the server 104, in a data storage system, or in a local storage of the terminal 102, and may be acquired from the server 104, the data storage system, or the terminal 102 when the abnormal data prediction model needs to be constructed. Further, the server 104 may perform transfer learning processing on the initial abnormal data prediction model based on each second modeling variable and each first payment data sample to construct the abnormal data prediction model. The constructed abnormal data prediction model is used for performing abnormal data prediction processing on the payment data to be identified sent by the terminal 102 to obtain corresponding abnormal data, and the obtained abnormal data can be fed back to the terminal 102.

Similarly, taking as an example the case where the terminal 102 and the server 104 cooperatively execute the abnormal data determining method provided by the embodiment of the present application, the server 104 receives the abnormal data determining request sent by the terminal 102 and obtains the payment data to be identified corresponding to the request. The payment data to be identified may be stored in a cloud storage of the server 104, in a data storage system, or in a local storage of the terminal 102, and may be obtained from the server 104, the data storage system, or the terminal 102 when abnormal data needs to be determined. Further, the server 104 performs abnormal data prediction processing on the payment data to be identified based on the trained abnormal data prediction model and obtains the abnormal data corresponding to the payment data to be identified. The server 104 may further feed the obtained abnormal data back to the terminal 102. The trained abnormal data prediction model is obtained by the server 104 performing transfer learning processing on the initial abnormal data prediction model based on each second modeling variable and each first payment data sample. The initial abnormal data prediction model is constructed by the server 104 from the first modeling variables determined from the first payment data samples corresponding to the first target objects, and the second modeling variables are determined by the server 104 by performing modeling variable screening processing based on the second payment data samples corresponding to the second target objects and the first modeling variables.

In one embodiment, as shown in FIG. 2, a modeling variable determining method is provided. The method is described as applied to the server in FIG. 1 by way of example and includes the following steps:

Step S202, performing variable derivation processing according to the first payment data samples corresponding to the first target objects and the second payment data samples corresponding to the second target objects to obtain a plurality of first derivative variables and a plurality of second derivative variables.

The first target objects and the second target objects can be understood as usage objects in different regions, whose service handling records, service overdue records, consumption data, and the like differ. For example, each first target object has service handling records and service overdue records for certain financial services, as well as corresponding consumption data for shopping malls, restaurant services, and the like, whereas a second target object has only service handling records for certain financial services and consumption data for shopping malls, restaurant services, and the like, and has no service overdue records for those financial services.

The first payment data sample and the second payment data sample can be understood as the object basic data, service-related data, consumption data, and the like of the first target object and the second target object, respectively, within a preset time period. The object basic data includes basic information about the usage object, such as age, gender, and employment information. The service-related data may include specific service handling records, service types, and the like. The consumption data may include the consumption scene, consumption mode (such as card payment, non-card payment, and application payment), consumption amount, consumption frequency, last consumption duration, and the like. For a given consumption mode, such as card payment, the service-related data of the usage object can be further determined, for example the number of bound cards under the card-binding service and the consumption frequency and consumption amount of each card.
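
As a rough illustration of how such a payment data sample might be represented, the sketch below groups the three kinds of data into one record. All class and field names are hypothetical, and the choice to model a missing service overdue record as `None` is an assumption made only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConsumptionRecord:
    scene: str     # consumption scene, e.g. "mall" or "restaurant"
    mode: str      # consumption mode, e.g. "card", "non_card", "app"
    amount: float  # consumption amount
    day: int       # day offset inside the preset time window


@dataclass
class PaymentDataSample:
    # Object basic data
    object_id: str
    age: int
    gender: str
    employment: str
    # Service-related data (service handling records, service types)
    service_records: List[str] = field(default_factory=list)
    # Consumption data
    consumption: List[ConsumptionRecord] = field(default_factory=list)
    # Service overdue record; None when the usage object has none,
    # as described for second target objects.
    overdue: Optional[bool] = None

    def has_overdue_record(self) -> bool:
        """True when a service overdue record is available for this object."""
        return self.overdue is not None
```

Under this representation, a first payment data sample would carry a value in `overdue`, while a second payment data sample would leave it as `None`.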

Specifically, a target variable derivative processing logic matched with a current business scene is selected from candidate variable derivative processing logics, and multi-attribute-level variable derivative processing is carried out on a first payment data sample and a second payment data sample according to the target variable derivative processing logic to obtain a plurality of first derivative variables and a plurality of second derivative variables.

The attribute levels include at least two of a consumption attribute, a service attribute and a feature attribute. The consumption attribute can be understood as attribute information such as consumption time, consumption frequency and consumption amount in the first payment data sample and the second payment data sample. The service attribute can be understood as attribute information related to service handling in the two samples, such as the service type (including binding card service, loan service, deposit service, consumption service, financial product purchase service, etc.) and the service handling record (including service handling time, service validity period, etc.). The feature attribute can be understood as the feature processing, such as feature discretization, feature crossing and feature combination, performed on the data features in the first payment data sample and the second payment data sample.

The candidate variable derivation processing logic specifically comprises: variable derivative logic based on consumption time-consumption frequency-consumption amount corresponding to consumption attributes, variable derivative logic based on business meaning corresponding to business attributes, feature discrete-cross derivative logic corresponding to feature attributes, and the like. The target variable derivative processing logic matched with the current service scene can be determined according to the derivative requirement of the current service scene, for example, the current service scene specifically needs to perform variable derivative processing on the consumption attribute firstly and then on the service attribute, or performs variable derivative processing on the characteristic attribute firstly and then on the consumption attribute, and the calling sequence of each candidate variable derivative processing logic can be determined according to different derivative requirements so as to realize variable derivative processing of multiple attribute levels.

For example, when determining the calling sequence of each candidate variable derivative processing logic according to the derivative requirement of the current service scenario, the corresponding target variable derivative processing logic may be further determined, for example, the calling sequence of the candidate variable derivative processing logic is: the variable derivative logic based on the consumption time, the consumption frequency and the consumption amount is called first, and then the variable derivative logic based on the business meaning is called, and the target variable derivative processing logic comprises the variable derivative logic based on the consumption time, the consumption frequency and the consumption amount and the variable derivative logic based on the business meaning. For another example, the call sequence to candidate variable derivative processing logic is: firstly, calling variable derivative logic based on business meaning, and then calling characteristic discrete-cross derivative logic, wherein the target variable derivative processing logic comprises variable derivative logic based on business meaning and characteristic discrete-cross derivative logic.
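The calling sequence described above can be sketched as a small registry of candidate logics that are applied in a configured order. This is only an illustrative sketch: the function names, the field names (`amount_3m`, `count_3m`, `bound_cards`, `years_active`) and the derived variables are hypothetical and do not come from the application.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]
DeriveFn = Callable[[Sample], Dict[str, float]]


def derive_consumption(sample: Sample) -> Dict[str, float]:
    # Hypothetical stand-in for the consumption time-frequency-amount logic.
    return {"avg_amount": sample["amount_3m"] / max(sample["count_3m"], 1)}


def derive_business(sample: Sample) -> Dict[str, float]:
    # Hypothetical stand-in for the business-meaning logic.
    return {"cards_per_year": sample["bound_cards"] / max(sample["years_active"], 1)}


# Candidate variable derivative processing logics, keyed by an assumed name.
CANDIDATE_LOGICS: Dict[str, DeriveFn] = {
    "consumption": derive_consumption,
    "business": derive_business,
}


def run_derivation(sample: Sample, call_order: List[str]) -> Dict[str, float]:
    """Apply the selected logics in call_order and collect the derived variables."""
    features = dict(sample)   # later logics may read variables derived earlier
    derived: Dict[str, float] = {}
    for name in call_order:
        new_vars = CANDIDATE_LOGICS[name](features)
        features.update(new_vars)
        derived.update(new_vars)
    return derived


sample = {"amount_3m": 300.0, "count_3m": 3, "bound_cards": 4, "years_active": 2}
result = run_derivation(sample, ["consumption", "business"])
```

Changing `call_order` to `["business", "consumption"]` runs the same logics in the opposite order, which matters once a later logic depends on a variable derived by an earlier one.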

Further, performing multi-attribute-level variable derivation processing on the first payment data sample and the second payment data sample according to the target variable derivative processing logic, to obtain a plurality of first derivative variables and a plurality of second derivative variables, may specifically include: calling the variable derivative logic based on consumption time-consumption frequency-consumption amount to perform variable derivation processing on the first payment data sample and the second payment data sample. For example, the consumption data of a usage object in the retail industry (including the consumption amount, consumption frequency and time elapsed since the last consumption in retail venues such as shopping malls and catering services) is derived into derivative variables such as the time of the last retail consumption, the number of retail consumptions within a preset time period (such as 3 months) and the retail consumption amount within the preset time period (such as 3 months).
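As a minimal sketch of the consumption time-consumption frequency-consumption amount derivation, the function below turns a list of retail transactions into the three derivative variables named above. The function name, the variable names, the 90-day approximation of "3 months" and the sample transactions are assumptions made for illustration.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple


def derive_retail_variables(
    transactions: List[Tuple[date, float]],
    as_of: date,
    window_days: int = 90,  # assumed approximation of a 3-month period
) -> Dict[str, Optional[float]]:
    """Derive recency, frequency and amount variables from (date, amount) records."""
    past = [(d, a) for d, a in transactions if d <= as_of]
    recent = [(d, a) for d, a in past if (as_of - d).days <= window_days]
    last = max((d for d, _ in past), default=None)
    return {
        # Days since the last retail consumption (None if there is no record).
        "retail_days_since_last": (as_of - last).days if last else None,
        # Number of retail consumptions inside the window.
        "retail_count_3m": len(recent),
        # Total retail consumption amount inside the window.
        "retail_amount_3m": sum(a for _, a in recent),
    }


transactions = [
    (date(2023, 1, 10), 50.0),
    (date(2023, 4, 1), 20.0),
    (date(2023, 5, 1), 30.0),
]
variables = derive_retail_variables(transactions, as_of=date(2023, 5, 18))
```

In this example the January transaction falls outside the 90-day window, so only the April and May records contribute to the count and the amount.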

It may likewise include: calling the variable derivative logic based on business meaning to perform variable derivation processing on the first payment data sample and the second payment data sample. That is, the samples are derived according to attribute information related to service handling, such as the service type and the service handling record, to obtain derivative variables such as the employment information of the usage object and the retail consumption ratio of usage objects with different employment information.

It may also include: calling the feature discrete-cross derivative logic to perform feature discretization, feature crossing, feature combination and similar processing on the first payment data sample and the second payment data sample, so as to derive new data features (namely new derivative variables) and thereby expand the first payment data sample and the second payment data sample. Specifically, derivation based on an ensemble tree model may be adopted: the ensemble tree model is used to discretize, cross and combine the data features in the first payment data sample and the second payment data sample, and the new data features so constructed serve as new derivative variables.
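The idea of discretizing features and then crossing them can be illustrated with the simplified sketch below. Note that it swaps in fixed, hand-chosen bin edges instead of the ensemble tree model described above, purely to keep the example self-contained; the edges, field names and output names are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Union


def discretize(value: float, edges: List[float]) -> int:
    """Return the index of the bin that value falls into, given sorted edges."""
    return bisect_right(edges, value)


def cross(*bins: int) -> str:
    """Combine several bin indices into one crossed categorical feature."""
    return "_".join(str(b) for b in bins)


def derive_cross_features(sample: Dict[str, float]) -> Dict[str, Union[int, str]]:
    # Assumed bin edges; a tree-based approach would learn split points instead.
    amount_bin = discretize(sample["amount_3m"], [100, 500])
    freq_bin = discretize(sample["count_3m"], [2, 10])
    return {
        "amount_bin": amount_bin,
        "freq_bin": freq_bin,
        "amount_x_freq": cross(amount_bin, freq_bin),
    }


crossed = derive_cross_features({"amount_3m": 250, "count_3m": 12})
```

With a tree-based derivation, the split thresholds along each root-to-leaf path play the role of the bin edges, and each leaf corresponds to one crossed feature value.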

Step S204, based on each first derivative variable, step feature screening processing is carried out, each first modeling variable is obtained through screening, and first business attribute information corresponding to each first modeling variable is obtained.

The step feature screening processing specifically includes: first-level screening processing based on the information quantity attribute data of the first derivative variables, second-level screening processing based on the group stability attribute data of the first-level derivative variables, third-level screening processing based on the feature association attribute data of the second-level derivative variables, and sign consistency screening processing and variance inflation attribute screening processing performed on the third-level derivative variables.

The first-level screening processing screens out the first-level derivative variables whose information quantity attribute data meets the first-level screening condition. The second-level screening processing screens out the second-level derivative variables whose group stability attribute data meets the second-level screening condition. The third-level screening processing screens out the third-level derivative variables whose feature association attribute data meets the third-level screening condition. The sign consistency screening processing performed on the third-level derivative variables checks that the sign of each estimated coefficient is consistent, namely that a positive coefficient is estimated as positive, a negative coefficient is estimated as negative and a zero coefficient is estimated as zero. The variance inflation attribute screening processing screens the third-level derivative variables according to the degree of collinearity (i.e., the variance inflation attribute value) among them.

Specifically, first-level screening processing is performed on the first derivative variables to obtain the first-level derivative variables whose information quantity attribute data meets the first-level screening condition, and second-level screening processing is performed on the first-level derivative variables to obtain the second-level derivative variables whose group stability attribute data meets the second-level screening condition. Further, third-level screening processing is performed on the second-level derivative variables to obtain the third-level derivative variables whose feature association attribute data meets the third-level screening condition. Sign consistency screening processing and variance inflation attribute screening processing are then performed on the third-level derivative variables to obtain the first modeling variables, and the first business attribute information corresponding to each first modeling variable is obtained.
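The order of the screening stages can be sketched as a cascade of predicates, each applied only to the survivors of the previous stage. The sketch assumes the per-variable metrics have already been computed; the metric keys (`iv`, `psi`, `assoc_ok`, `sign_ok`, `vif`), the variable names and all thresholds are hypothetical placeholders rather than values prescribed by the application.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]

# (stage name, keep-predicate) pairs, in the order described above.
STAGES: List[Tuple[str, Callable[[Metrics], bool]]] = [
    ("level1_information_value", lambda m: m["iv"] >= 0.01),   # assumed threshold
    ("level2_group_stability", lambda m: m["psi"] <= 0.1),     # assumed threshold
    ("level3_feature_association", lambda m: bool(m["assoc_ok"])),
    ("sign_consistency", lambda m: bool(m["sign_ok"])),
    ("variance_inflation", lambda m: m["vif"] < 10),           # assumed threshold
]


def cascade_screen(metrics: Dict[str, Metrics]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Run every stage on the survivors of the previous one; record each stage's output."""
    survivors = list(metrics)
    trace: Dict[str, List[str]] = {}
    for stage_name, keep in STAGES:
        survivors = [name for name in survivors if keep(metrics[name])]
        trace[stage_name] = list(survivors)
    return survivors, trace


metrics = {
    "retail_count_3m": {"iv": 0.20, "psi": 0.02, "assoc_ok": 1, "sign_ok": 1, "vif": 2.0},
    "noise": {"iv": 0.001, "psi": 0.01, "assoc_ok": 1, "sign_ok": 1, "vif": 1.0},
    "unstable": {"iv": 0.30, "psi": 0.40, "assoc_ok": 1, "sign_ok": 1, "vif": 1.5},
    "collinear": {"iv": 0.15, "psi": 0.05, "assoc_ok": 1, "sign_ok": 1, "vif": 25.0},
}
first_modeling_variables, trace = cascade_screen(metrics)
```

Keeping the per-stage trace makes it easy to see at which level a given derivative variable was dropped.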

The first business attribute information corresponding to each first modeling variable may be specifically understood as business meaning of each first modeling variable, such as a business handling type (including binding card business, loan business, deposit business, financial product purchasing business, consumption business, etc.), and a business handling record (including business handling time, business validity time limit, etc.). When the service handling type is the binding card service, the service attributes such as the binding card time, the number of binding cards, the effective card holding time and the like corresponding to the binding card service can be further determined. If the business handling type is a loan business, the business attributes such as the loan amount, the loan time, the loan times, the repayment period and the like corresponding to the loan business can be further determined. Similarly, if the service transaction type is a consumption service, the service attributes such as a consumption scenario (for example, a mall, a restaurant service shop, etc.), a consumption mode (for example, card payment, non-card payment, application payment, etc.), a consumption amount, a consumption number, and a consumption frequency corresponding to the consumption service may be further determined.

Further, if the feature association attribute data includes feature statistical test attribute data and feature combination attribute data, the third-level screening processing specifically includes stepwise regression feature screening processing corresponding to the feature statistical test attribute data and sequential feature screening processing corresponding to the feature combination attribute data. In this case, obtaining the third-level derivative variables whose feature association attribute data meets the third-level screening condition specifically includes: performing stepwise regression feature screening processing on the second-level derivative variables to obtain the sub-derivative variables whose feature statistical test attribute data meets the stepwise regression feature screening condition, and performing sequential feature screening processing on the sub-derivative variables to obtain the third-level derivative variables whose feature combination attribute data meets the sequential feature screening condition.

In one embodiment, before the step feature screening processing is performed based on the first derivative variables, feature preprocessing is further performed on each first derivative variable. The feature preprocessing modes specifically include: missing value proportion processing, single value proportion processing, missing value processing, extremum processing, data type conversion processing, and the like.

Further, the feature preprocessing modes are shown in Table 1 below:

Table 1 Feature preprocessing modes

As can be seen from Table 1, missing value proportion processing removes the first derivative variables whose proportion of missing values is greater than a first preset proportion threshold, and single value proportion processing removes the first derivative variables whose proportion of a single value is greater than a second preset proportion threshold. The first preset proportion threshold and the second preset proportion threshold may be set to the same value or to different values; for example, both may be set to 90%. It can be understood that the actual values of the two thresholds can be adjusted according to the actual service scenario and are not limited to any specific value or values.

Similarly, referring to Table 1, missing value processing fills each first derivative variable that has missing values based on its business meaning, for example by filling the missing content with a service type (including binding card service, loan service, deposit service, financial product purchase service, etc.), a service handling record, and the like. Extremum processing replaces values of a first derivative variable that are greater than a preset maximum value with the preset maximum value, and values that are smaller than a preset minimum value with the preset minimum value; that is, the value of each first derivative variable is adjusted to lie within the preset [minimum value, maximum value] range, which reduces the occurrence of excessively large or excessively small values. Data type conversion processing unifies the data format of the first derivative variables, for example by converting all first derivative variables into structured data, so that each first derivative variable is processed in a uniform, standardized way and cumbersome conversions during data processing are reduced.
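The preprocessing modes described above can be sketched as one pass over the columns of derivative variables. The 90% thresholds follow the example given in the text; the constant fill value (standing in for a business-meaning fill), the clipping bounds and the column names are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List, Optional


def preprocess(
    columns: Dict[str, List[Optional[float]]],
    max_missing_ratio: float = 0.9,   # first preset proportion threshold (example value)
    max_single_ratio: float = 0.9,    # second preset proportion threshold (example value)
    fill_value: float = 0.0,          # assumed stand-in for a business-meaning fill
    lower: float = 0.0,               # assumed preset minimum value
    upper: float = 1000.0,            # assumed preset maximum value
) -> Dict[str, List[float]]:
    cleaned: Dict[str, List[float]] = {}
    for name, values in columns.items():
        total = len(values)
        present = [v for v in values if v is not None]
        # Missing value proportion processing: drop mostly-missing variables.
        if not present or (total - len(present)) / total > max_missing_ratio:
            continue
        # Single value proportion processing: drop near-constant variables.
        most_common_count = Counter(present).most_common(1)[0][1]
        if most_common_count / total > max_single_ratio:
            continue
        # Missing value fill, type conversion to float, then extremum clipping.
        filled = [float(fill_value if v is None else v) for v in values]
        cleaned[name] = [min(max(v, lower), upper) for v in filled]
    return cleaned


columns = {
    "mostly_missing": [None] * 19 + [1.0],
    "constant": [5] * 20,
    "amount": [10, None, 5000, -3],
}
cleaned = preprocess(columns)
```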

Step S206, service attribute matching processing is performed according to the second service attribute information corresponding to each second derivative variable and the first service attribute information corresponding to each first modeling variable, to determine the first modeling variables and second derivative variables with the same service attribute.

The second service attribute information corresponding to each second derivative variable can be understood as the business meaning of that second derivative variable. In the actual promotion and application of different internet financial services, the service handling conditions, payment data and the like of usage objects in different regions differ, so the first service attribute information corresponding to the first modeling variables and the second service attribute information corresponding to the second derivative variables do not correspond completely. Service attribute matching processing therefore needs to be performed based on the first service attribute information and the second service attribute information, and the first modeling variables and second derivative variables whose service attribute information does not correspond are removed, so as to obtain the first modeling variables and second derivative variables with the same service attribute.

Specifically, the first service attribute information corresponding to each first modeling variable specifically includes a service handling type (including a binding card service, a loan service, a deposit service, a financial product purchase service, a consumption service, etc.), a service handling record (including a service handling time, a service valid time limit, etc.), and may also include a binding card time, a binding card number, a valid card holding time, etc. corresponding to the binding card service, a loan amount, a loan time, a loan number, a repayment period, etc. corresponding to the loan service, and service attributes such as a consumption scene, a consumption mode, a consumption amount, a consumption number, a consumption frequency, etc. corresponding to the consumption service.

Similarly, the second service attribute information corresponding to each second derivative variable may specifically include: the type of business transaction (including, for example, binding card business and consumption business, etc.), and the business transaction record (including business transaction time, etc.). When the service handling type is the binding card service, the service attributes such as the binding card time, the number of binding cards, the effective card holding time and the like corresponding to the binding card service can be further determined. If the transaction type is a consumption service, the service attributes such as consumption scenario (such as a mall, a restaurant service shop, etc.), consumption mode (such as card payment and no-card payment), consumption amount, consumption times, and consumption frequency corresponding to the consumption service can be further determined.

For example, the service handling types of the first modeling variables include binding card service, loan service, deposit service, financial product purchase service, consumption service and the like, while the service handling types of the second derivative variables include only binding card service, consumption service and the like. Service attribute matching processing is then performed based on the service handling types of the first modeling variables and the second derivative variables; that is, the service handling types they share are determined, for example the first modeling variables and second derivative variables that both belong to the binding card service. Under the binding card service, the first modeling variables and second derivative variables with the same service attribute, namely the variables whose service attributes overlap, may include the number of bound cards, the card binding time and the valid card holding time of the usage object.

Likewise, once a shared service handling type is determined, for example the consumption service, the variables with the same service attribute among the first modeling variables and the second derivative variables, namely the variables whose service attributes overlap, are taken as the first modeling variables and second derivative variables with the same service attribute. Under the consumption service, these may include the consumption amount, consumption time and consumption frequency of the usage object.
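The matching step amounts to pairing variables whose service attributes coincide and discarding the rest. In the sketch below each variable is tagged with a hypothetical `(service type, attribute)` pair; the tagging scheme and all variable names are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Attribute = Tuple[str, str]  # (service handling type, service attribute)


def match_by_service_attribute(
    first_modeling: Dict[str, Attribute],
    second_derivative: Dict[str, Attribute],
) -> List[Tuple[str, str]]:
    """Pair each first modeling variable with the second derivative variable
    that carries the same service attribute; unmatched variables are dropped."""
    second_by_attribute = {attr: name for name, attr in second_derivative.items()}
    return [
        (name, second_by_attribute[attr])
        for name, attr in first_modeling.items()
        if attr in second_by_attribute
    ]


first_modeling = {
    "a_card_count": ("card_binding", "bound_card_count"),
    "a_loan_amount": ("loan", "loan_amount"),
    "a_consume_freq": ("consumption", "frequency"),
}
second_derivative = {
    "b_card_count": ("card_binding", "bound_card_count"),
    "b_consume_freq": ("consumption", "frequency"),
    "b_consume_scene": ("consumption", "scene"),
}
matched = match_by_service_attribute(first_modeling, second_derivative)
```

Here the loan variable has no counterpart among the second derivative variables and the consumption scene variable has no counterpart among the first modeling variables, so both are removed.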

Step S208, based on the first modeling variable and the second derivative variable of the same service attribute, performing variable distribution statistics to obtain distribution statistics data between the first modeling variable and the second derivative variable of the same service attribute.

For example, under the binding card service, the number of bound cards of the usage object may be determined as a first modeling variable and second derivative variable with the same service attribute. That is, the first modeling variables and the second derivative variables both contain the variable "number of bound cards of the usage object", and this variable has the same business meaning in both: it represents the total number of cards bound by the usage object within a preset time period during binding card service handling.

Likewise, under the consumption service, the consumption frequency of the usage object within a preset time period may be determined as a first modeling variable and second derivative variable with the same service attribute. That is, the first modeling variables and the second derivative variables both contain the variable "consumption frequency of the usage object within the preset time period", and this variable has the same business meaning in both; the consumption frequency can be determined from the number of consumptions and the preset time period.

Specifically, when variable distribution statistics is performed based on the first modeling variables and second derivative variables with the same service attribute, the statistics cover the central tendency of the data distribution, the dispersion of the data distribution and the shape of the data frequency distribution. The resulting distribution statistical data between the first modeling variables and second derivative variables with the same service attribute includes the central tendency, the degree of dispersion and the distribution shape of the data.

The central tendency of the data distribution requires determining a value that represents the data as a whole. Specifically, the mean or median of the first modeling variable and of the second derivative variable may be determined separately to obtain a representative value for each, and it is then judged whether the two representative values are consistent. The dispersion of the data distribution requires determining how far the data deviate from the mean, which reflects how densely the data are distributed. The standard deviation or variance of the first modeling variable and of the second derivative variable under the same service attribute may be calculated separately, and it is judged whether the calculated standard deviation or variance is smaller than a corresponding preset data threshold, that is, whether it lies within the allowable error range. Similarly, the shape of the data frequency distribution requires determining the data distribution form of the first modeling variable and second derivative variable with the same service attribute, for example judging whether both follow a normal distribution.
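The three kinds of statistics can be computed per variable with a few lines of standard-library Python. This is a sketch only: the use of population standard deviation and of skewness as a rough indicator of distribution shape are assumptions, since the text does not fix a particular shape test.

```python
import statistics
from typing import Dict, List


def distribution_stats(values: List[float]) -> Dict[str, float]:
    """Summarise central tendency, dispersion and shape of one variable."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Skewness as a simple shape indicator (0 for a symmetric sample).
    if std == 0:
        skewness = 0.0
    else:
        skewness = sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": std,
        "skewness": skewness,
    }


stats = distribution_stats([1, 2, 3, 4, 5])
```

The same function would be called once on the first modeling variable's data and once on the second derivative variable's data, and the two summaries compared.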

Step S210, screening to obtain second modeling variables matched with second target objects according to the distribution statistical data, the first modeling variables and the second derivative variables, wherein the second modeling variables are used for constructing and obtaining an abnormal data prediction model.

Specifically, based on the distribution statistical data, the distribution trend of the first modeling variable and the second derivative variable with the same service attribute is determined, and the first modeling variable and the second derivative variable with the same service attribute and the same distribution trend are determined as the second modeling variable matched with each second target object. And each second modeling variable is used for constructing an abnormal data prediction model for carrying out abnormal recognition or abnormal prediction on the payment data of the to-be-recognized use object.

The distribution statistical data specifically includes the central tendency, the degree of dispersion and the distribution shape of the data for the first modeling variables and second derivative variables with the same service attribute. Based on the distribution statistical data, the distribution trend of the first modeling variables and second derivative variables with the same service attribute, namely their central tendency, dispersion and distribution shape, can be obtained.

Further, the first modeling variables and second derivative variables with the same service attribute and a consistent distribution trend are obtained as follows. When the service attributes are the same, for example both belong to the binding card service, a variable with the same business meaning exists in both the first modeling variables and the second derivative variables, for example the variable "number of bound cards of the usage object". The distribution trend of the data samples of this variable for the first target objects and the distribution trend of its data samples for the second target objects are obtained, and it is judged whether the two distribution trends are consistent. If they are consistent, the variable "number of bound cards of the usage object" is determined as a second modeling variable matched with the second target objects.

The data samples of the variable "number of bound cards of the usage object" for the first target objects can be understood as the actual number of bound cards of each first target object; for example, user 1 has 3 bound cards, user 2 has 5 bound cards, and so on. Similarly, the data samples of this variable for the second target objects can be understood as the actual number of bound cards of each second target object; for example, user 3 has 2 bound cards, user 4 has 3 bound cards, and so on. It can be understood that judging whether the distribution trends of the data samples of the variable "number of bound cards of the usage object" are consistent between the first target objects and the second target objects means judging whether the distribution of the actual numbers of bound cards of the first target objects is consistent with that of the second target objects.
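One simple way to turn the consistency judgment into code is to compare the mean and standard deviation of the two groups against tolerances, and keep only the variables that pass. The tolerances, the choice of absolute differences and the sample values below are hypothetical; the text does not prescribe a specific consistency criterion.

```python
import statistics
from typing import Dict, List, Tuple


def same_trend(
    first_values: List[float],
    second_values: List[float],
    mean_tolerance: float = 0.5,  # assumed allowable gap between the means
    std_tolerance: float = 0.5,   # assumed allowable gap between the standard deviations
) -> bool:
    """Judge whether two samples of one variable have a consistent distribution trend."""
    mean_gap = abs(statistics.fmean(first_values) - statistics.fmean(second_values))
    std_gap = abs(statistics.pstdev(first_values) - statistics.pstdev(second_values))
    return mean_gap <= mean_tolerance and std_gap <= std_tolerance


def select_second_modeling_variables(
    matched: Dict[str, Tuple[List[float], List[float]]],
) -> List[str]:
    """Keep the matched variables whose two samples show a consistent trend."""
    return [name for name, (first, second) in matched.items() if same_trend(first, second)]


matched_samples = {
    # variable name: (values for first target objects, values for second target objects)
    "bound_card_count": ([3, 5, 4, 4], [3, 4, 5, 4]),
    "consumption_frequency": ([1, 2, 1, 2], [9, 12, 10, 11]),
}
second_modeling_variables = select_second_modeling_variables(matched_samples)
```

In this example the card counts of the two groups have the same mean and spread and are kept, while the consumption frequencies differ markedly and are dropped.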

In the modeling variable determining method, variable derivation processing is performed on the first payment data sample corresponding to each first target object and the second payment data sample corresponding to each second target object to obtain a plurality of first derivative variables and second derivative variables, so that more comprehensive and general derivative variables are obtained. Step feature screening processing is performed based on the first derivative variables to obtain the first modeling variables and the first service attribute information corresponding to each first modeling variable. Service attribute matching processing is then performed according to the second service attribute information corresponding to each second derivative variable and the first service attribute information, to determine the first modeling variables and second derivative variables with the same service attribute, so that the second derivative variables can be preliminarily screened on service attribute information against the existing first modeling variables.
Further, variable distribution statistics is performed based on the first modeling variables and second derivative variables with the same service attribute to obtain the distribution statistical data between them. In this way, service information matching, variable distribution statistics and variable screening are performed on the basis of the first modeling variables, which were screened from the comprehensive first payment data samples, and the second modeling variables matched with the second target objects are obtained. Finally, an abnormal data prediction model can be constructed from the second modeling variables, which avoids the large errors in modeling variables that incomplete data samples would cause, so that abnormal data corresponding to the payment data to be identified can be obtained quickly and accurately from the constructed abnormal data prediction model.

In one embodiment, as shown in fig. 3, the step of performing a step feature screening process based on each first derivative variable to obtain each first modeling variable includes:

Step S302, first-level screening processing is performed according to the information quantity attribute data corresponding to each first derivative variable, and the first-level derivative variables whose information quantity attribute data meets the first-level screening condition are screened out from the first derivative variables.

The information quantity attribute data corresponding to each first derivative variable can be understood as the IV value (Information Value, understood as the information quantity or information value) of that first derivative variable. The IV value measures the predictive power of an independent variable for the dependent variable, and features with low correlation with the target dependent variable can be removed by means of the IV value, which achieves feature screening and reduces the number of features. Performing first-level screening processing according to the information quantity attribute data corresponding to each first derivative variable can be understood as a filter-type feature screening method: the variables whose information quantity attribute data, namely the IV value, meets the first-level screening condition are selected from the first derivative variables and determined as first-level derivative variables.

Specifically, the information quantity attribute data value, namely the IV value, corresponding to each first derivative variable is determined, and the preset IV threshold corresponding to the first-level screening condition is obtained. The first derivative variables are then screened according to this preset IV threshold: the variables whose IV value is greater than or equal to the preset IV threshold are selected and determined as the first-level derivative variables whose information quantity attribute data meets the first-level screening condition.

It can be understood that the preset IV threshold corresponding to the first-level screening condition can be adjusted according to actual requirements and is not limited to a specific value. For example, if the preset IV threshold is set to 0.01, the first-level screening condition is that the IV value is greater than or equal to 0.01; the threshold may also be set to other values.

Specifically, the information quantity attribute data value (IV value) corresponding to each first derivative variable is calculated using the following formula (1):

$$IV = \sum_{i=1}^{n} IV_i \tag{1}$$

where $i$ denotes the i-th feature bin and $n$ denotes the total number of feature bins, that is, n feature bins are obtained after feature binning (such as chi-square binning) is performed on the first derivative variable. $IV_i$ denotes the information quantity attribute data value in the i-th feature bin, so $IV$ is the sum of the information quantity attribute data values in the n feature bins.

Further, the IV value in the i-th feature bin, $IV_i$, is calculated by the following formula (2):

$$IV_i = \left(\frac{Bad_i}{Bad_T} - \frac{Good_i}{Good_T}\right) \times \ln\left(\frac{Bad_i / Bad_T}{Good_i / Good_T}\right) \tag{2}$$

where $Bad_i / Bad_T$ denotes the ratio of the number of bad samples in the i-th bin to the total number of bad samples, $Good_i / Good_T$ denotes the ratio of the number of good samples in the i-th bin to the total number of good samples, the first factor is the difference between these two ratios in the i-th bin, and the second factor is the logarithm of the quotient of these two ratios in the i-th bin.
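The calculation in formulas (1) and (2), followed by the first-stage screening, can be sketched as below. This is a minimal illustration, assuming the standard IV definition with a natural logarithm; the function names, the `eps` smoothing term and the bin counts are hypothetical and are not taken from the application.

```python
import math

def information_value(bad_counts, good_counts, eps=1e-6):
    """Return (IV, [IV_i, ...]) for one variable that is already binned.

    bad_counts[i] / good_counts[i] are the numbers of bad / good samples
    in the i-th feature bin.
    """
    bad_total = sum(bad_counts)
    good_total = sum(good_counts)
    per_bin = []
    for bad, good in zip(bad_counts, good_counts):
        # Share of all bad / good samples that fall into this bin.
        # eps is an assumed smoothing term that avoids log(0) for empty bins.
        bad_share = max(bad / bad_total, eps)
        good_share = max(good / good_total, eps)
        per_bin.append((bad_share - good_share) * math.log(bad_share / good_share))
    return sum(per_bin), per_bin

def first_stage_screen(iv_by_variable, threshold=0.01):
    """Keep the variables whose IV value is >= the preset IV value threshold."""
    return [name for name, iv in iv_by_variable.items() if iv >= threshold]

# Hypothetical bin counts for two derived variables.
iv_a, _ = information_value([30, 50, 120], [400, 300, 100])
iv_b, _ = information_value([10, 10, 10], [100, 100, 100])
kept = first_stage_screen({"var_a": iv_a, "var_b": iv_b})
```

Here `var_b` distributes bad and good samples identically across bins, so its IV is zero and it is dropped, while `var_a` separates the two groups and is kept.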

Step S304, performing secondary screening processing based on the population stability attribute data of each primary derivative variable, and screening out the secondary derivative variables of which the population stability attribute data meets the secondary screening condition from each primary derivative variable.

After the primary derivative variables are obtained through screening, the population stability attribute data of each primary derivative variable is further obtained. The population stability attribute data is the PSI index (Population Stability Index, used for measuring the stability of a model or a feature). Performing the secondary screening processing according to the population stability attribute data of each primary derivative variable is also a filter-type feature screening method: the variables whose population stability attribute data (PSI index) meets the secondary screening condition are filtered from the primary derivative variables and determined as the secondary derivative variables.

Specifically, the population stability attribute data (PSI index) of each primary derivative variable is obtained, together with the population stability attribute threshold corresponding to the secondary screening condition. Each primary derivative variable is then screened against this threshold: the variables whose population stability attribute data is smaller than the population stability attribute threshold are screened out and determined as the secondary derivative variables whose population stability attribute data meets the secondary screening condition.

It can be understood that the population stability attribute threshold corresponding to the second-level screening condition can be adjusted and set according to actual requirements, and is not limited to a specific value, for example, the population stability attribute threshold is set to 10%, and the second-level screening condition can be understood as PSI index < 10%, and can also be set to other different values.

Specifically, the population stability attribute data (PSI index) of each primary derivative variable is calculated using the following formula (3):

$$PSI = \sum_{i=1}^{n} PSI_i \tag{3}$$

where $i$ denotes the i-th feature bin and $n$ denotes the total number of feature bins, that is, n feature bins are obtained after feature binning (such as chi-square binning) is performed on each primary derivative variable. $PSI_i$ denotes the population stability attribute data value in the i-th feature bin, so $PSI$ is the sum of the population stability attribute data values in the n feature bins.

Further, the population stability attribute data value $PSI_i$ in the i-th feature bin is calculated using the following formula (4):

$$PSI_i = \left(\frac{Test_i}{Test_T} - \frac{Train_i}{Train_T}\right) \times \ln\left(\frac{Test_i / Test_T}{Train_i / Train_T}\right) \tag{4}$$

where $Train_i / Train_T$ denotes the ratio of the number of samples in the i-th bin to the total number of samples in the training set, $Test_i / Test_T$ denotes the ratio of the number of samples in the i-th bin (with the same binning standard as the training set) to the total number of samples in the test set, the first factor is the difference between the test-set ratio and the training-set ratio in the i-th bin, and the second factor is the logarithm of the quotient of the test-set ratio and the training-set ratio in the i-th bin.
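The PSI calculation in formulas (3) and (4) and the secondary screening can be sketched as follows. This is a minimal illustration, assuming the standard PSI definition with a natural logarithm and bins shared between the training and test sets; the function names, the `eps` term and the sample counts are hypothetical.

```python
import math

def population_stability_index(train_counts, test_counts, eps=1e-6):
    """PSI of one variable, given per-bin sample counts in both data sets.

    The same bin boundaries must be used for the training and test sets.
    """
    train_total = sum(train_counts)
    test_total = sum(test_counts)
    psi = 0.0
    for train_n, test_n in zip(train_counts, test_counts):
        # eps is an assumed smoothing term that avoids log(0) for empty bins.
        train_share = max(train_n / train_total, eps)
        test_share = max(test_n / test_total, eps)
        psi += (test_share - train_share) * math.log(test_share / train_share)
    return psi

def second_stage_screen(psi_by_variable, threshold=0.10):
    """Keep the variables whose PSI is below the stability threshold."""
    return [name for name, psi in psi_by_variable.items() if psi < threshold]

# Hypothetical counts: var_a is stable, var_b drifts towards the upper bins.
psi_a = population_stability_index([250, 250, 250, 250], [240, 260, 255, 245])
psi_b = population_stability_index([250, 250, 250, 250], [100, 150, 300, 450])
kept = second_stage_screen({"var_a": psi_a, "var_b": psi_b})
```

With the example threshold of 10%, only `var_a` passes the secondary screening.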

Step S306, performing tertiary screening processing according to the feature association attribute data of each secondary derivative variable, and screening out the tertiary derivative variables whose feature association attribute data meets the tertiary screening condition from each secondary derivative variable.

The feature association attribute data comprises feature statistics checking attribute data and feature combination attribute data, the three-stage screening process specifically comprises stepwise regression feature screening process corresponding to the feature statistics checking attribute data and sequence feature screening process corresponding to the feature combination attribute data, and the three-stage screening condition comprises stepwise regression feature screening condition corresponding to the feature statistics checking attribute data and sequence feature screening condition corresponding to the feature combination attribute data.

Specifically, stepwise regression feature screening processing is performed according to feature statistics verification attribute data of each secondary derivative variable, and sub-derivative variables with feature statistics verification attribute data meeting stepwise regression feature screening conditions are screened out from each secondary derivative variable. Further, based on the characteristic combination attribute data of each sub-derivative variable, sequence characteristic screening processing is carried out, and three-level derivative variables with characteristic combination attribute data meeting sequence characteristic screening conditions are screened out from each sub-derivative variable.

The feature statistical verification attribute data of each secondary derivative variable can be understood as the significance statistical test value, namely the statistical significance, obtained by modeling with the secondary derivative variables and performing a significance statistical test on the resulting model. The stepwise regression feature screening proceeds as follows: one secondary derivative variable is first selected from the secondary derivative variables, and the mean square error of the linear regression on the currently selected variable is calculated; the unselected secondary derivative variables are then introduced one by one, and the one that reduces the mean square error to the greatest extent is selected; it is then judged whether this variable reduces the mean square error significantly in the statistical sense. If so, the feature statistical verification attribute data of this secondary derivative variable meets the stepwise regression feature screening condition, and the variable is determined as a sub-derivative variable.

Similarly, the feature combination attribute data of each sub-derivative variable may be understood as the model effect attribute data of a model obtained by combining the sub-derivative variables into a plurality of feature combinations and modeling on each feature combination. The sequence feature screening processing performed according to the feature combination attribute data of each sub-derivative variable is a wrapper-type feature screening method and can be understood as follows: the feature combination X of sub-derivative variables starts from the empty set, and each time one sub-derivative variable x is selected and added to the feature combination X such that the evaluation function of the model (namely the model effect attribute data) is optimal. In other words, each time the sub-derivative variable that brings the evaluation function of the model to its optimal value is selected and used as a three-level derivative variable meeting the sequence feature screening condition.
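The sequence feature screening described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions: the `evaluate` callable stands in for the model evaluation function, the rule of stopping when no candidate improves the score is an assumption (the text does not fix a stopping rule), and the toy scoring table is hypothetical.

```python
def sequential_forward_selection(candidates, evaluate, max_features=None):
    """Greedy forward selection starting from the empty feature combination.

    evaluate(features) must return a score where larger is better.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    limit = max_features if max_features is not None else len(remaining)
    while remaining and len(selected) < limit:
        # Score every one-variable extension of the current combination.
        scored = [(evaluate(selected + [c]), c) for c in remaining]
        score, winner = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # assumed stopping rule: no extension improves the score
        selected.append(winner)
        remaining.remove(winner)
        best_score = score
    return selected, best_score

# Hypothetical evaluation function: additive gains minus a size penalty.
GAIN = {"amount_sum": 0.30, "txn_count": 0.20, "card_count": 0.05, "noise": -0.02}

def toy_score(features):
    return sum(GAIN[f] for f in features) - 0.06 * len(features)

chosen, score = sequential_forward_selection(list(GAIN), toy_score)
```

In practice `evaluate` would fit a model on the candidate combination and return a validation metric; the toy table only makes the selection order easy to follow.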

Step S308, based on the three-level derivative variables, symbol consistency screening processing and variance expansion attribute screening processing are performed to obtain first modeling variables.

Specifically, the symbol (sign) consistency screening processing performed on the three-level derivative variables checks whether the coefficient signs are estimated consistently, that is, positive coefficients are estimated as positive, negative coefficients are estimated as negative, and zero coefficients are estimated as zero. Variables whose sign estimates are inconsistent are removed from the three-level derivative variables, for example three-level derivative variables whose positive or negative coefficients cannot be estimated correctly. The variance expansion attribute screening processing screens the three-level derivative variables according to the degree of collinearity between them (i.e., the variance expansion attribute value, commonly known as the variance inflation factor): variables whose degree of collinearity with the other three-level derivative variables is greater than the preset attribute threshold are eliminated. The preset attribute threshold can be adjusted according to actual requirements and is not particularly limited.
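Both screenings can be sketched in plain Python. This is a minimal illustration under stated assumptions: the expected sign of each coefficient is supplied by the caller, the variance expansion attribute value is computed as the variance inflation factor (the j-th diagonal entry of the inverse correlation matrix), and the threshold of 10 and all data are hypothetical.

```python
import math

def sign_of(value, tol=1e-12):
    """Return +1, -1 or 0, treating |value| <= tol as zero."""
    return 1 if value > tol else (-1 if value < -tol else 0)

def sign_consistent(fitted_coefs, expected_signs):
    """Keep variables whose fitted coefficient sign matches the expected sign."""
    return [n for n, c in fitted_coefs.items() if sign_of(c) == expected_signs[n]]

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long, non-constant columns."""
    n = len(columns[0])
    centered = [[v - sum(col) / n for v in col] for col in columns]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (assumes a non-singular matrix)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def variance_inflation_factors(columns):
    """VIF_j = 1 / (1 - R_j^2), read off the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

def vif_screen(named_columns, threshold=10.0):
    """Eliminate variables whose VIF exceeds the preset attribute threshold."""
    names = list(named_columns)
    vifs = variance_inflation_factors([named_columns[n] for n in names])
    return [n for n, v in zip(names, vifs) if v <= threshold]

# Hypothetical inputs: x2 has the wrong sign; x1 and x2 are nearly collinear.
kept_by_sign = sign_consistent({"x1": 0.8, "x2": -0.3, "x3": 0.1},
                               {"x1": 1, "x2": 1, "x3": 1})
data = {"x1": [1, 2, 3, 4, 5, 6], "x2": [2, 4, 6, 8, 10, 13], "x3": [3, 1, 4, 1, 5, 2]}
vifs = variance_inflation_factors(list(data.values()))
kept_by_vif = vif_screen(data)
```

Removing every variable above the threshold in one pass follows the text literally; a common alternative is to drop only the variable with the largest value and recompute, which keeps one member of a collinear pair.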

In addition to the symbol consistency screening processing and the variance expansion attribute screening processing, significance statistical test value screening processing is also performed on the three-level derivative variables. Specifically, after modeling with the three-level derivative variables, a significance statistical test is performed on the constructed model to obtain the significance statistical test value of each three-level derivative variable in the model, and screening is performed accordingly, for example by removing the three-level derivative variables whose significance statistical test value is greater than a preset statistical threshold.

In this embodiment, a first-stage screening process is performed according to the information quantity attribute data corresponding to each first derivative variable, a first-stage derivative variable whose information quantity attribute data satisfies a first-stage screening condition is screened out from each first derivative variable, a second-stage screening process is performed based on the population stability attribute data of each first-stage derivative variable, and a second-stage derivative variable whose population stability attribute data satisfies a second-stage screening condition is screened out from each first-stage derivative variable. Further, three-level screening processing is performed according to the characteristic association attribute data of each two-level derivative variable, and three-level derivative variables with characteristic association attribute data meeting three-level screening conditions are screened out from each two-level derivative variable, so that symbol consistency screening processing and variance expansion attribute screening processing can be performed based on each three-level derivative variable, and each first modeling variable is obtained. The method and the device realize multi-level and more comprehensive step feature screening processing of each first derivative variable, thereby realizing accurate screening of a large number of derivative variables from different angles and directions, rapidly removing invalid variables with low relevance, poor stability, lower model effect than expected effect and other problems, and further improving the working efficiency of further modeling variable screening processing according to each first modeling variable obtained by screening.

In one embodiment, as shown in fig. 4, the step of performing a step feature screening process based on each first derivative variable to obtain each first modeling variable includes:

Step S402, performing primary screening processing according to the information quantity attribute data corresponding to each first derivative variable, and screening out the primary derivative variables of which the information quantity attribute data meets the primary screening condition from each first derivative variable.

Step S404, performing secondary screening processing based on the population stability attribute data of each primary derivative variable, and screening out the secondary derivative variables of which the population stability attribute data meets the secondary screening condition from each primary derivative variable.

Step S406, stepwise regression feature screening processing is performed according to the feature statistical verification attribute data of each secondary derivative variable, and sub-derivative variables with feature statistical verification attribute data meeting stepwise regression feature screening conditions are screened out from each secondary derivative variable.

Step S408, based on the feature combination attribute data of each sub-derivative variable, performing sequence feature screening processing, and screening out three-level derivative variables with feature combination attribute data meeting sequence feature screening conditions from each sub-derivative variable.

Step S410, based on each three-level derivative variable, symbol consistency screening processing and variance expansion attribute screening processing are performed to obtain each first modeling variable.
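Steps S402 to S410 form a chain in which each stage receives the survivors of the previous one. A minimal sketch of that chaining is given below; the function name, the stage labels and the precomputed statistics are hypothetical, and only the first two stages are filled in.

```python
def run_stepwise_screening(variables, stages):
    """Apply the screening stages in order and record what survives each one.

    Each stage is a (label, function) pair; the function takes the list of
    surviving variable names and returns the names that pass the stage.
    """
    survivors = list(variables)
    trace = []
    for label, stage in stages:
        survivors = stage(survivors)
        trace.append((label, list(survivors)))
    return survivors, trace

# Hypothetical precomputed statistics for three derived variables.
STATS = {
    "v1": {"iv": 0.200, "psi": 0.02},
    "v2": {"iv": 0.005, "psi": 0.01},
    "v3": {"iv": 0.150, "psi": 0.30},
}

stages = [
    ("S402 IV >= 0.01", lambda names: [n for n in names if STATS[n]["iv"] >= 0.01]),
    ("S404 PSI < 10%", lambda names: [n for n in names if STATS[n]["psi"] < 0.10]),
    # S406, S408 and S410 would be appended here as further list-to-list functions.
]

first_modeling_variables, trace = run_stepwise_screening(STATS, stages)
```

Modeling every stage as a list-to-list function lets set-level stages such as stepwise regression and sequence feature screening use the same interface as the per-variable threshold filters.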

In one embodiment, the screening process based on each first derivative variable is as shown in Table 2 below:

TABLE 2 screening process based on first derivative variables

As can be seen from Table 2, the first-stage screening processing is understood as follows: the information quantity attribute data value (IV value) corresponding to each first derivative variable is determined, and the preset IV value threshold (such as 0.01) corresponding to the first-stage screening condition is obtained; each first derivative variable is then screened against this threshold, and the variables whose information quantity attribute data (IV value) is greater than or equal to the preset IV value threshold are screened out and determined as the first-stage derivative variables whose information quantity attribute data (IV value) meets the first-stage screening condition.

The secondary screening processing is understood as follows: the population stability attribute data (PSI index) of each primary derivative variable is obtained, together with the population stability attribute threshold (such as 10%) corresponding to the secondary screening condition; each primary derivative variable is then screened against this threshold, and the variables whose population stability attribute data is smaller than the population stability attribute threshold are screened out and determined as the secondary derivative variables whose population stability attribute data meets the secondary screening condition.

Wherein, tertiary screening treatment includes: stepwise regression feature screening processing including forward stepwise regression feature screening processing and bidirectional stepwise regression feature screening processing, and sequential feature screening processing including sequential feature forward screening processing and sequential feature floating forward screening processing, specifically:

The forward stepwise regression feature screening processing (i.e., forward selection) proceeds as follows: one secondary derivative variable is first selected from the secondary derivative variables, and the mean square error of the linear regression on the currently selected variable is calculated; the unselected secondary derivative variables are then introduced one by one, and the one that reduces the mean square error to the greatest extent is selected; it is then judged whether this variable reduces the mean square error significantly in the statistical sense. If so, the feature statistical verification attribute data of this secondary derivative variable meets the stepwise regression feature screening condition, and the variable is determined as a sub-derivative variable. Specifically, the variables whose feature statistical verification attribute data, namely the p value, is smaller than 0.5 are retained as the sub-derivative variables.

The bi-directional stepwise regression feature screening process (i.e., bidirectional elimination) represents that the secondary derivative variables are introduced into the model one by one, variance checking (i.e., F checking) is performed after each secondary derivative variable is introduced, and mean checking (i.e., t checking) is performed one by one on the secondary derivative variables that have been selected, and the originally introduced secondary derivative variable is deleted when it becomes no longer significant due to the subsequent introduction of the secondary derivative variable. The process is iterated repeatedly until no significant secondary derivative variable is selected into the regression equation, and no insignificant secondary derivative variable is removed from the regression equation, so that the final tertiary derivative variable is obtained. The process of selecting variables by the two-way stepwise regression method comprises two basic steps: firstly, eliminating the variables which are not obvious through t test from the regression model, and secondly, introducing new variables which are obvious through F test into the regression model. The method specifically comprises the steps of reserving characteristic statistics checking attribute data, namely variables with p values smaller than 0.5, and eliminating variables with p values larger than 0.5 to obtain sub-derivative variables.

Likewise, the sequence feature forward screening processing (i.e., SFS, Sequential Forward Selection) is understood as follows: the feature combination X of sub-derivative variables starts from the empty set, and each time one sub-derivative variable x is selected and added to the feature combination X such that the evaluation function of the model (namely the model effect attribute data) is optimal. In other words, each time the sub-derivative variable that brings the evaluation function of the model to its optimal value is selected and used as a three-level derivative variable meeting the sequence feature screening condition.

The sequence feature floating forward screening processing (i.e., SFFS, Sequential Floating Forward Selection) is understood as follows: starting from the empty set, in each round a subset x is selected from the unselected sub-derivative variables such that the evaluation function is optimal after x is added, and then a subset z is selected from the selected sub-derivative variables such that the evaluation function is optimal after z is removed; the three-level derivative variables meeting the sequence feature screening condition are finally obtained through screening.
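The floating variant can be sketched as a forward step followed by conditional removals. This is a minimal illustration under stated assumptions: one variable is added or removed at a time, a removal is accepted only when it strictly improves the evaluation function, the variable just added is not removed in the same round, and the score table is hypothetical.

```python
def sequential_floating_forward_selection(candidates, evaluate, max_rounds=50):
    """SFFS sketch; evaluate(features) returns a score where larger is better."""
    selected = []
    best = float("-inf")
    for _ in range(max_rounds):
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        # Forward step: add the variable that gives the best score.
        score, added = max(((evaluate(selected + [c]), c) for c in remaining),
                           key=lambda pair: pair[0])
        if score <= best:
            break  # assumed stopping rule: no addition improves the score
        selected.append(added)
        best = score
        # Floating step: drop earlier variables while that improves the score.
        while len(selected) > 2:
            score, dropped = max(
                ((evaluate([s for s in selected if s != d]), d)
                 for d in selected if d != added),
                key=lambda pair: pair[0])
            if score <= best:
                break
            selected.remove(dropped)
            best = score
    return selected, best

# Hypothetical scores: "a" is best alone but redundant once "b" and "c" are in.
SCORES = {
    frozenset("a"): 0.50, frozenset("b"): 0.40, frozenset("c"): 0.40,
    frozenset("ab"): 0.60, frozenset("ac"): 0.60, frozenset("bc"): 0.90,
    frozenset("abc"): 0.85,
}

def table_score(features):
    return SCORES[frozenset(features)]

chosen, score = sequential_floating_forward_selection(["a", "b", "c"], table_score)
```

Plain forward selection would keep `a` in the final set on this table; the floating step is what allows it to be discarded after `b` and `c` have both been added.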

In this embodiment, a first-stage screening process is performed according to the information quantity attribute data corresponding to each first derivative variable, a first-stage derivative variable whose information quantity attribute data satisfies a first-stage screening condition is screened out from each first derivative variable, a second-stage screening process is performed based on the population stability attribute data of each first-stage derivative variable, and a second-stage derivative variable whose population stability attribute data satisfies a second-stage screening condition is screened out from each first-stage derivative variable. Further, stepwise regression feature screening processing is performed according to feature statistics verification attribute data of each secondary derivative variable, sub-derivative variables with feature statistics verification attribute data meeting stepwise regression feature screening conditions are screened out from each secondary derivative variable, sequence feature screening processing is performed based on feature combination attribute data of each sub-derivative variable, and three-level derivative variables with feature combination attribute data meeting sequence feature screening conditions are screened out from each sub-derivative variable. 
Finally, based on each three-level derivative variable, symbol consistency screening processing and variance expansion attribute screening processing are performed to obtain each first modeling variable. In this way, multi-level and more comprehensive step feature screening processing is performed on each first derivative variable, so that a large number of derivative variables are screened accurately from different angles, and invalid variables with low relevance, poor stability, a model effect below the expected effect and similar problems are removed quickly. This improves the working efficiency of the subsequent modeling variable screening processing performed according to each first modeling variable obtained by screening.

In one embodiment, as shown in fig. 5, a modeling variable determining method is provided, which specifically includes the following steps:

Step S501, selecting target variable derivative processing logic matched with the current business scene from all candidate variable derivative processing logic.

Step S502, according to target variable derivation processing logic, performing variable derivation processing of a multi-attribute hierarchy on the first payment data sample and the second payment data sample to obtain a plurality of first derived variables and a plurality of second derived variables, wherein the attribute hierarchy comprises at least two of a consumption attribute, a business attribute and a feature attribute.

Step S503, performing a first-stage screening process according to the information quantity attribute data corresponding to each first derivative variable, and screening out the first-stage derivative variables of which the information quantity attribute data meets the first-stage screening condition from each first derivative variable.

Step S504, performing secondary screening processing based on the population stability attribute data of each primary derivative variable, and screening out the secondary derivative variables of which the population stability attribute data meets the secondary screening condition from each primary derivative variable.

Step S505, stepwise regression feature screening processing is performed according to the feature statistical verification attribute data of each secondary derivative variable, and sub-derivative variables with feature statistical verification attribute data meeting stepwise regression feature screening conditions are screened out from each secondary derivative variable.

Step S506, based on the characteristic combination attribute data of each sub-derivative variable, performing sequence characteristic screening processing, and screening three-level derivative variables with characteristic combination attribute data meeting sequence characteristic screening conditions from each sub-derivative variable.

Step S507, based on each three-level derivative variable, symbol consistency screening processing and variance expansion attribute screening processing are carried out to obtain each first modeling variable.

Step S508, obtaining first service attribute information corresponding to each first modeling variable, and carrying out service attribute matching processing according to second service attribute information corresponding to each second derivative variable and the first service attribute information to determine the first modeling variable and the second derivative variable with the same service attribute.

Step S509, based on the first modeling variable and the second derivative variable of the same service attribute, performing variable distribution statistics to obtain distribution statistics data between the first modeling variable and the second derivative variable of the same service attribute.

Step S510, determining the distribution trend of the first modeling variable and the second derivative variable of the same business attribute based on the distribution statistical data.

Step S511, determining the first modeling variable and the second derivative variable which have the same service attribute and the same distribution trend as the second modeling variable matched with each second target object, wherein each second modeling variable is used for constructing an abnormal data prediction model.

In one embodiment, as shown in fig. 6, there is provided an abnormal data prediction model construction method, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:

Step S602, obtaining first modeling variables determined according to the first payment data samples corresponding to the first target objects, and constructing and obtaining an initial abnormal data prediction model based on the first modeling variables.

Specifically, target variable derivation processing logic matched with a current business scene is selected from candidate variable derivation processing logic, and according to the target variable derivation processing logic, variable derivation processing of multiple attribute levels is performed on first payment data samples corresponding to first target objects to obtain multiple first derived variables, so that step feature screening processing is performed on the basis of the first derived variables, and each first modeling variable is obtained through screening.

The first target object may be understood as a usage object belonging to a certain area, where the service transaction record, the service overdue record, the payment data, and the like of the usage object in different areas are different. The first payment data sample may be understood as object base data, business association data, consumption data, etc. of the first target object within a preset time. The object basic data includes basic information of the object, such as age, sex, employment information, etc., the business related data may include specific business handling records and business types, etc., and the consumption data may include data such as consumption scene, consumption mode (such as card payment, card-free payment, application payment, etc.), consumption amount, consumption frequency, and last time duration. Different consumption modes, such as card payment, can further determine the service related data corresponding to the object of use, such as the number of binding cards under the binding card service, the consumption frequency and the consumption amount of different cards, and the like.

Specifically, after the first modeling variables determined according to the first payment data samples corresponding to the first target objects are obtained, an initial abnormal data prediction model is constructed and obtained based on the first modeling variables and model parameters corresponding to the modeling variables.

The first modeling variables may be specifically understood as the information variables corresponding to the first target objects obtained through screening, including the basic information variables, service information variables, consumption information variables and the like corresponding to the first target objects. The basic information variables may specifically include variables such as the age, sex and employment information of the object. The service information variables specifically include variables such as the service handling records and service handling types of the object, for example service types such as binding card service, loan service, financial product purchase service and consumption service. The consumption information variables may specifically include variables such as the consumption scene (such as a mall or a restaurant), the consumption mode (such as card payment, card-free payment and application payment), the consumption amount, the consumption times and the consumption frequency.

Further, the constructed initial abnormal data prediction model is expressed by the formula "Y = α1×X1 + α2×X2 + α3×X3 + … + αn×Xn", where Y is the abnormal data of the usage object, which may specifically be the overdue risk data of the usage object; Y lies in [0,1], and a larger Y indicates a higher overdue risk of the usage object. X1, X2, X3, …, Xn are the first modeling variables, namely the information variables corresponding to the first target objects obtained by screening, and α1, α2, α3, …, αn are the model parameters of the initial abnormal data prediction model.

The initial abnormal data prediction model may be specifically a trained classification model, such as a logistic regression model, a KNN model, a decision tree model, a GBDT model, an XGBoost model, a deep neural network, and the like.
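As an illustration of how the weighted sum of modeling variables could yield a risk value in [0,1], the sketch below passes the linear combination through a logistic function. The logistic link and the intercept are assumptions that correspond to the logistic regression option mentioned above, not a statement of the model actually used; the parameter and variable values are hypothetical.

```python
import math

def linear_score(alphas, xs):
    """Weighted sum alpha_1*X_1 + ... + alpha_n*X_n of the modeling variables."""
    return sum(a * x for a, x in zip(alphas, xs))

def overdue_risk(alphas, xs, intercept=0.0):
    """Map the linear score to [0, 1] with a logistic function (an assumption)."""
    z = intercept + linear_score(alphas, xs)
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical parameters and modeling variable values for one usage object.
alphas = [0.5, -0.2]
xs = [2.0, 5.0]
risk = overdue_risk(alphas, xs)
```

A larger output corresponds to a higher overdue risk, matching the interpretation of Y given for the model formula.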

In one embodiment, the step feature screening process is performed based on each first derivative variable, and the process of screening to obtain each first modeling variable includes:

Primary screening processing is performed according to the information quantity attribute data corresponding to each first derivative variable, and the primary derivative variables whose information quantity attribute data meets the primary screening condition are screened out from the first derivative variables. Secondary screening processing is performed based on the population stability attribute data of each primary derivative variable, and the secondary derivative variables whose population stability attribute data meets the secondary screening condition are screened out from the primary derivative variables. Further, three-level screening processing is performed according to the feature association attribute data of each secondary derivative variable, and the three-level derivative variables whose feature association attribute data meets the three-level screening condition are screened out from the secondary derivative variables. Symbol consistency screening processing and variance expansion attribute screening processing are then performed based on the three-level derivative variables, so that each first modeling variable is obtained.

The feature association attribute data comprises feature statistics checking attribute data and feature combination attribute data, and the three-level screening conditions comprise stepwise regression feature screening conditions corresponding to the feature statistics checking attribute data and sequence feature screening conditions corresponding to the feature combination attribute data.

Further, performing tertiary screening processing according to the feature association attribute data of each secondary derivative variable, and screening out tertiary derivative variables with feature association attribute data meeting tertiary screening conditions from each secondary derivative variable, wherein the tertiary screening processing comprises the following steps:

according to the feature statistics verification attribute data of each secondary derivative variable, stepwise regression feature screening processing is carried out, and sub-derivative variables with feature statistics verification attribute data meeting stepwise regression feature screening conditions are screened out from each secondary derivative variable; and (3) performing sequence feature screening processing based on feature combination attribute data of each sub-derivative variable, and screening three-stage derivative variables of which the feature combination attribute data meets sequence feature screening conditions from each sub-derivative variable.
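The first two screening levels described above can be sketched as follows. This is a hedged illustration, not the method of the application: it assumes that the information quantity attribute data is an information value (IV) and that the group stability attribute data is a population stability index (PSI), and the thresholds 0.02 and 0.25, the function names and the synthetic data are all hypothetical. The tertiary screening, symbol consistency screening and variance expansion screening are omitted.

```python
import numpy as np

def quantile_edges(values, n_bins=5):
    # Quantile bin edges; np.unique drops duplicate edges for tied values.
    return np.unique(np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)))

def bin_index(values, edges):
    # Interior edges define the bins; out-of-range values fall in the end bins.
    return np.digitize(values, edges[1:-1])

def information_value(x, y, n_bins=5, eps=1e-6):
    """IV of feature x against a 0/1 label y (1 = overdue)."""
    edges = quantile_edges(x, n_bins)
    idx = bin_index(x, edges)
    n = max(len(edges) - 1, 1)
    good = np.bincount(idx[y == 0], minlength=n) / max(int((y == 0).sum()), 1) + eps
    bad = np.bincount(idx[y == 1], minlength=n) / max(int((y == 1).sum()), 1) + eps
    return float(np.sum((bad - good) * np.log(bad / good)))

def population_stability_index(expected, actual, n_bins=5, eps=1e-6):
    """PSI between a reference sample and a later sample of one feature."""
    edges = quantile_edges(expected, n_bins)
    n = max(len(edges) - 1, 1)
    e = np.bincount(bin_index(expected, edges), minlength=n) / len(expected) + eps
    a = np.bincount(bin_index(actual, edges), minlength=n) / len(actual) + eps
    return float(np.sum((a - e) * np.log(a / e)))

def two_level_screen(train, y, later, iv_min=0.02, psi_max=0.25):
    """Keep features that pass the primary (IV) and secondary (PSI) levels."""
    kept = []
    for name, x in train.items():
        if information_value(x, y) < iv_min:
            continue  # primary level: too little information
        if population_stability_index(x, later[name]) > psi_max:
            continue  # secondary level: distribution not stable
        kept.append(name)
    return kept

# Synthetic data: one useful feature, one noise feature, one drifting feature.
rng = np.random.default_rng(0)
n = 20000
y = rng.integers(0, 2, n)
train = {
    "signal": y + rng.normal(0, 0.5, n),
    "noise": rng.normal(0, 1, n),
    "drift": y + rng.normal(0, 0.5, n),
}
y_later = rng.integers(0, 2, n)
later = {
    "signal": y_later + rng.normal(0, 0.5, n),
    "noise": rng.normal(0, 1, n),
    "drift": train["drift"] + 5.0,  # shifted, so it fails the stability check
}
kept = two_level_screen(train, y, later)
print(kept)
```

In this sketch the noise feature is dropped at the primary level and the drifting feature at the secondary level, which mirrors the order of the screening levels in the text.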

Step S604, obtaining second modeling variables determined by performing modeling variable screening processing based on the second payment data samples corresponding to the second target objects and the first modeling variables.

Specifically, a second payment data sample corresponding to each second target object is obtained, target variable derivation processing logic matched with the current business scene is screened out from candidate variable derivation processing logic, and variable derivation processing of multiple attribute levels is performed on the second payment data sample corresponding to each second target object according to the target variable derivation processing logic, so that multiple second derivative variables are obtained. Further, first service attribute information corresponding to each first modeling variable and second service attribute information corresponding to each second derivative variable are acquired, and service attribute matching processing is performed according to the first service attribute information and the second service attribute information to determine the first modeling variables and second derivative variables that have the same service attribute.

Further, modeling variable screening processing is performed based on the first modeling variables and second derivative variables that have the same service attribute, and each second modeling variable is determined. Specifically, variable distribution statistics are performed on the first modeling variables and second derivative variables that have the same service attribute to obtain distribution statistics data between them, and the distribution trends of these first modeling variables and second derivative variables are determined based on the distribution statistics data. The first modeling variables and second derivative variables that have both the same service attribute and the same distribution trend are then determined as the second modeling variables matched with each second target object.
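The matching described above can be sketched as follows. The source does not specify how the distribution trend is measured, so this example makes an assumption: two variables are treated as having the same distribution trend when the deciles of their standardized values differ by at most a tolerance. The attribute tags, variable names, tolerance and synthetic data are hypothetical.

```python
import numpy as np

def quantile_profile(values, qs=np.linspace(0.1, 0.9, 9)):
    # Standardize first so that only the shape of the distribution matters.
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / (v.std() + 1e-12)
    return np.quantile(z, qs)

def same_trend(a, b, tol=0.25):
    # Assumed criterion: largest gap between standardized deciles <= tol.
    gap = np.max(np.abs(quantile_profile(a) - quantile_profile(b)))
    return bool(gap <= tol)

def select_second_modeling_variables(first_vars, second_vars, tol=0.25):
    """Map each first modeling variable to a matching second derivative variable.

    Both arguments are dicts: name -> (service_attribute_tag, values).
    A pair matches when the tags are equal and the trends agree.
    """
    selected = {}
    for f_name, (f_attr, f_vals) in first_vars.items():
        for s_name, (s_attr, s_vals) in second_vars.items():
            if f_attr == s_attr and same_trend(f_vals, s_vals, tol):
                selected[f_name] = s_name
    return selected

rng = np.random.default_rng(1)
first_vars = {
    "pay_count_A": ("payment_frequency", rng.exponential(2.0, 20000)),
    "pay_amount_A": ("payment_amount", rng.normal(100.0, 20.0, 20000)),
}
second_vars = {
    # Same attribute, same shape (different scale only): expected to match.
    "pay_count_B": ("payment_frequency", rng.exponential(5.0, 20000)),
    # Same attribute but a differently shaped distribution: expected to fail.
    "pay_amount_B": ("payment_amount", rng.exponential(50.0, 20000)),
    # No first modeling variable shares this attribute.
    "login_count_B": ("login_activity", rng.normal(0.0, 1.0, 20000)),
}
matches = select_second_modeling_variables(first_vars, second_vars)
print(matches)
```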

Step S606, based on the second modeling variables and the first payment data samples, performing migration learning processing on the initial abnormal data prediction model, and constructing an abnormal data prediction model, wherein the abnormal data prediction model is used for performing abnormal data prediction processing on the payment data to be identified to obtain corresponding abnormal data.

Specifically, when migration learning processing is performed on the initial abnormal data prediction model based on each second modeling variable and each first payment data sample, the information variables (namely the first modeling variables) in the initial abnormal data prediction model are screened, and the model parameters of the initial abnormal data prediction model are iteratively updated based on the screened information variables and each first payment data sample, so as to obtain the migration version of the initial abnormal data prediction model.

Further, the first modeling variables and second derivative variables that have the same service attribute and the same distribution trend are determined as the second modeling variables. Each information variable in the migration version of the initial abnormal data prediction model is replaced with the second modeling variable that has the same service attribute and the same distribution trend as that information variable, while the model parameters of the migration version are retained and combined with the second modeling variables. This completes the migration of the modeling variables and the retained learning of the model parameters, and the abnormal data prediction model is constructed.

Iteratively updating the model parameters of the initial abnormal data prediction model based on the screened information variables and each first payment data sample yields the migration version of the initial abnormal data prediction model, and ensures that the model parameters remain valid after some of the information variables have been removed.

Specifically, the constructed migration version of the initial abnormal data prediction model is represented by the formula "Y = β1×X1 + β3×X3 + … + βn×Xn", where Y is the abnormal data of the usage object, which may specifically be the overdue risk data of the usage object. Y lies in [0,1], and a larger Y indicates a higher overdue risk of the usage object. X1, X3, …, Xn are the information variables obtained by further screening, and β1, β3, …, βn are the model parameters of the migration version of the initial abnormal data prediction model. When some information variables are removed from the initial abnormal data prediction model, the variables removed are those that have low relevance to the business information of the second target objects, or that are unsuitable for evaluating or predicting the abnormal data of the second target objects, for example the information variables X2 and X4, while the information variables X1 and X3 are retained. Which information variables are removed, and how many, can be adjusted according to the application scenario or actual requirements, and is not limited to removing the information variables X2 and X4.

Further, for the migration version of the initial abnormal data prediction model represented by the formula "Y = β1×X1 + β3×X3 + … + βn×Xn", each information variable X1, X3, …, Xn is replaced with the second modeling variable X1', X3', …, Xn' that has the same business attribute and the same distribution trend as that information variable: X1 is replaced with X1', X3 with X3', and so on, until Xn is replaced with Xn'. Meanwhile, the model parameters β1, β3, …, βn of the migration version are retained and combined with the second modeling variables X1', X3', …, Xn'. This completes the migration of the modeling variables and the retained learning of the model parameters, and yields the abnormal data prediction model represented by the formula "Y = β1×X1' + β3×X3' + … + βn×Xn'".
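The three stages above (initial model, migration version, final model) can be sketched with an ordinary least-squares fit. This is an illustrative assumption: the source does not state how the parameters are estimated, and the data, the choice of four variables and the removal of X2 and X4 are taken only as an example.

```python
import numpy as np

def fit_linear(X, y):
    # Least-squares estimate of the parameters in Y = sum_i param_i * X_i.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)

# First payment data samples: columns are X1, X2, X3, X4 (synthetic).
X_first = rng.normal(size=(1000, 4))
true_alpha = np.array([0.4, 0.1, 0.3, 0.2])
y_first = X_first @ true_alpha

# Stage 1: initial model with parameters alpha_1..alpha_4.
alpha = fit_linear(X_first, y_first)

# Stage 2: migration version. Drop X2 and X4, keep X1 and X3, and
# re-estimate the parameters on the first samples to obtain beta_1, beta_3.
keep = [0, 2]
beta = fit_linear(X_first[:, keep], y_first)

# Stage 3: final model. Keep beta unchanged and feed it the matched
# second modeling variables X1', X3' (synthetic here).
X_second = rng.normal(size=(5, 2))
y_second = X_second @ beta

print(alpha.round(3), beta.round(3), y_second.shape)
```

Re-estimating the parameters in stage 2, rather than copying α1 and α3, lets the retained variables absorb whatever the removed variables explained in the first samples, which is the purpose the description gives for the iterative update.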

In one embodiment, in the actual promotion and application of different internet financial services, the service handling conditions, consumption data and the like of usage objects differ between areas, so the corresponding risk assessment methods, and the data actually used for assessment, differ as well. For example, in area A, historical service handling data exists for certain financial services: different financial institutions and consumer-facing institutions, such as banks, shopping malls and catering services, hold the corresponding service handling records, service overdue records, product purchase records, payment records and the like. Therefore, for each usage object in area A, abnormal data evaluation can be performed based on that usage object's service handling records, service overdue records and the like to obtain the corresponding abnormal data.

For the internet financial service, the abnormal data of each user object may be understood as overdue risk data of the user object, that is, overdue risk assessment needs to be performed on each user object, so as to obtain overdue risk data corresponding to the user object, determine whether each user object has a service overdue risk, and according to an assessment result of whether the service overdue risk exists, perform targeted service popularization on the user object or pass or reject a service application provided by the corresponding user object.

When the abnormal data of each use object in the area A is obtained, the association relationship between the abnormal data and the product purchase record, payment record and the like of each use object can be further established, so that when the data such as the business handling record, the business overdue record and the like are absent, the abnormal data of the corresponding use object on the financial business can be determined by using the product purchase record, the payment record and the like of the object. It is understood that, based on the product purchase record, payment record, and the like of the usage object, specifically, overdue risk data of the corresponding usage object on the financial business can be determined.

Further, if no business handling records or business overdue records have been generated for certain financial services in area B, the business handling records and business overdue records of area B cannot be obtained, but the product purchase records, payment records and the like of area B can still be obtained.

In one embodiment, as shown in fig. 7, a method for constructing an abnormal data prediction model is provided, which specifically includes:

P1, constructing to obtain an initial abnormal data prediction model

Specifically, target variable derivation processing logic matched with the current business scene is selected from candidate variable derivation processing logic, multiple attribute-level variable derivation processing is performed on first payment data samples corresponding to first target objects according to the target variable derivation processing logic, multiple first derived variables are obtained, step feature screening processing is performed on the basis of the first derived variables, and first modeling variables are obtained through screening.

Further, after the first modeling variables determined according to the first payment data samples corresponding to the first target objects are obtained, an initial abnormal data prediction model is constructed and obtained based on the first modeling variables and model parameters corresponding to the modeling variables.

Specifically, the constructed initial abnormal data prediction model is represented by the formula "Y = α1×X1 + α2×X2 + α3×X3 + α4×X4 + … + αn×Xn", where Y is the abnormal data of the usage object. Y lies in [0,1], and a larger Y indicates a higher overdue risk of the usage object. X1, X2, X3, …, Xn are the first modeling variables, namely the information variables corresponding to each first target object obtained by screening, and α1, α2, α3, …, αn are the model parameters of the initial abnormal data prediction model.

P2, obtaining the migration version of the initial abnormal data prediction model

Specifically, a second payment data sample corresponding to each second target object is obtained, target variable derivation processing logic matched with the current business scene is screened out from candidate variable derivation processing logic, and variable derivation processing of multiple attribute levels is performed on the second payment data sample corresponding to each second target object according to the target variable derivation processing logic, so that multiple second derivative variables are obtained. Further, first service attribute information corresponding to each first modeling variable and second service attribute information corresponding to each second derivative variable are acquired, service attribute matching processing is performed according to the first service attribute information and the second service attribute information to determine the first modeling variables and second derivative variables that have the same service attribute, and modeling variable screening processing is performed based on these variables to determine each second modeling variable.

Further, based on the second modeling variables and the first payment data samples, the initial abnormal data prediction model is subjected to transfer learning processing, and an abnormal data prediction model is constructed. Specifically, the method includes screening information variables (namely first modeling variables) in an initial abnormal data prediction model, and based on each screened information variable and each first payment data sample, iteratively updating model parameters of the initial abnormal data prediction model to obtain an initial abnormal data prediction model of a migration version.

Specifically, the constructed migration version of the initial abnormal data prediction model is represented by the formula "Y = β1×X1 + β3×X3 + … + βn×Xn", where Y is the abnormal data of the usage object, which may specifically be the overdue risk data of the usage object. Y lies in [0,1], and a larger Y indicates a higher overdue risk of the usage object. X1, X3, …, Xn are the information variables obtained by further screening, and β1, β3, …, βn are the model parameters of the migration version of the initial abnormal data prediction model.

When removing part of information variables in the initial abnormal data prediction model, the information variables which have low relevance with the business information of the second target object or are not suitable for evaluating or predicting the abnormal data of the second target object, such as the information variables X2 and X4, are removed, and the information variables X1 and X3 are reserved.

P3, obtaining abnormal data prediction model

Specifically, after the first modeling variables and second derivative variables that have the same service attribute and the same distribution trend are determined as the second modeling variables, each information variable in the migration version of the initial abnormal data prediction model is replaced with the second modeling variable that has the same service attribute and the same distribution trend as that information variable. At the same time, the model parameters of the migration version are retained and combined with the second modeling variables, which completes the migration of the modeling variables and the retained learning of the model parameters, and the abnormal data prediction model is constructed.

Further, for the migration version of the initial abnormal data prediction model represented by the formula "Y = β1×X1 + β3×X3 + … + βn×Xn", each information variable X1, X3, …, Xn is replaced with the second modeling variable X1', X3', …, Xn' that has the same business attribute and the same distribution trend as that information variable: X1 is replaced with X1', X3 with X3', and so on, until Xn is replaced with Xn'.

Meanwhile, the model parameters β1, β3, …, βn of the migration version are retained and combined with the second modeling variables X1', X3', …, Xn'. This completes the migration of the modeling variables and the retained learning of the model parameters, and yields the abnormal data prediction model represented by the formula "Y = β1×X1' + β3×X3' + … + βn×Xn'".

In the method for constructing the abnormal data prediction model, the initial abnormal data prediction model is constructed and obtained based on the first modeling variables by acquiring the first modeling variables determined according to the first payment data samples corresponding to the first target objects, and the second modeling variables determined by performing modeling variable screening processing based on the second payment data samples corresponding to the second target objects and the first modeling variables are acquired. Further, through utilizing each second modeling variable and each first payment data sample, migration learning processing is conducted on the initial abnormal data prediction model, an abnormal data prediction model is quickly built, abnormal data prediction processing is conducted on each payment data to be identified according to the abnormal data prediction model, and accuracy of the obtained abnormal data is improved.

In one embodiment, as shown in fig. 8, there is provided an abnormal data determining method, which is described by taking the server in fig. 1 as an example, including the steps of:

Step S802, an abnormal data determination request is received, and payment data to be identified corresponding to the abnormal data determination request is obtained.

Specifically, the payment data to be identified corresponding to the abnormal data determination request is obtained by receiving the abnormal data determination request and analyzing the abnormal data determination request. The payment data to be identified corresponding to the abnormal data determining request specifically comprises object basic data, service association data, consumption data and the like of the used object in a preset time.

The object basic data includes basic information of the usage object, such as age, gender and employment information. The service association data may include specific service handling records, service types (such as card binding service, loan service, deposit service, consumption service and financial product purchase service) and the like. The consumption data may include data such as consumption scene, consumption mode (such as card payment, non-card payment and application payment), consumption amount, consumption frequency and time of the last consumption. For a given consumption mode, such as card payment, further service association data of the usage object can be determined, such as the number of bound cards under the card binding service and the consumption frequency and consumption amount of each card.

Step S804, based on the trained abnormal data prediction model, abnormal data prediction processing is performed on the payment data to be identified, and abnormal data corresponding to the payment data to be identified is obtained. The trained abnormal data prediction model is obtained by performing migration learning processing on the initial abnormal data prediction model based on each second modeling variable and each first payment data sample. The initial abnormal data prediction model is constructed according to the first modeling variables determined from the first payment data samples corresponding to each first target object. The second modeling variables are determined by performing modeling variable screening processing based on the second payment data samples corresponding to each second target object and the first modeling variables.

Specifically, the trained abnormal data prediction model is used to perform abnormal data prediction on the payment data to be identified, and the abnormal data corresponding to the payment data to be identified is obtained. This abnormal data indicates the abnormal data of the usage object in the course of handling financial services, and may specifically be the overdue risk data of the usage object, for example the risk of overdue repayment after the usage object has taken out a loan; that is, it can be predicted whether the usage object will repay on the corresponding repayment date and whether an overdue risk exists.

In one embodiment, after obtaining the abnormal data corresponding to the payment data to be identified, determining an abnormal section to which the payment data to be identified belongs further according to the abnormal data, and obtaining service processing logic corresponding to each abnormal section, so that service evaluation processing can be performed on a service application associated with the payment data to be identified according to the service processing logic.

The abnormal intervals to which payment data to be identified may belong specifically include a high-segment interval, a middle-segment interval, a middle-low-segment interval and a low-segment interval, and the abnormal data, namely the overdue risk, of the usage objects increases in that order from the high-segment interval to the low-segment interval. The service processing logic corresponding to each abnormal interval specifically includes first processing logic corresponding to the high-segment interval, second processing logic corresponding to the middle-segment interval, third processing logic corresponding to the middle-low-segment interval and fourth processing logic corresponding to the low-segment interval.

Further, the service processing logic corresponding to each abnormal interval is acquired, namely the first processing logic corresponding to the high-segment interval, the second processing logic corresponding to the middle-segment interval, the third processing logic corresponding to the middle-low-segment interval and the fourth processing logic corresponding to the low-segment interval, and service evaluation processing is performed on the service applications associated with the payment data to be identified in each abnormal interval according to the corresponding service processing logic.

When service evaluation processing is performed on a service application associated with payment data to be identified in the high-segment interval, the overdue risk corresponding to the high-segment interval is low, so the usage object is unlikely to become overdue; for a submitted service application, such as loan service handling or card binding service handling, the evaluation result obtained by performing service evaluation with the first processing logic is that the application passes. When service evaluation processing is performed on a service application associated with the middle-segment interval or the middle-low-segment interval, the overdue risk is higher than that of the high-segment interval, so the usage object is more likely to become overdue; the submitted service application requires further processing such as information collection, evaluation and investigation, and the evaluation result obtained by performing service evaluation with the second or third processing logic is that the application is pending and continues to be reviewed.

Likewise, when service evaluation processing is performed on a service application associated with payment data to be identified in the low-segment interval, the overdue risk corresponding to the low-segment interval is very high, so the usage object is very likely to become overdue; for a submitted service application, such as loan service handling or card binding service handling, the evaluation result obtained by performing service evaluation with the fourth processing logic is that the application is rejected.
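The mapping from predicted abnormal data to an abnormal interval and then to an evaluation result can be sketched as below. The cut points 0.2, 0.5 and 0.8 and the function names are hypothetical; the source does not give the boundaries of the intervals. Note that the high-segment interval is the one with the lowest overdue risk.

```python
# Evaluation result produced by the first to fourth processing logic.
DECISIONS = {
    "high": "pass",            # first processing logic
    "middle": "pending",       # second processing logic
    "middle-low": "pending",   # third processing logic
    "low": "reject",           # fourth processing logic
}

def interval_of(risk, cuts=(0.2, 0.5, 0.8)):
    """Map overdue risk Y in [0, 1] to an abnormal interval.

    The cut points are assumptions of this sketch. A larger Y means a
    higher overdue risk, so small Y falls in the high-segment interval.
    """
    first_cut, second_cut, third_cut = cuts
    if risk < first_cut:
        return "high"
    if risk < second_cut:
        return "middle"
    if risk < third_cut:
        return "middle-low"
    return "low"

def evaluate_application(risk):
    """Return the evaluation result for a service application."""
    return DECISIONS[interval_of(risk)]

for y in (0.05, 0.30, 0.60, 0.90):
    print(y, interval_of(y), evaluate_application(y))
```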

In one embodiment, first modeling variables determined according to the first payment data samples corresponding to each first target object are acquired, and an initial abnormal data prediction model is constructed based on the first modeling variables. Second modeling variables, determined by performing modeling variable screening processing based on the second payment data samples corresponding to each second target object and the first modeling variables, are then acquired, so that migration learning processing is performed on the initial abnormal data prediction model based on the second modeling variables and the first payment data samples to construct the abnormal data prediction model.

In one embodiment, as shown in fig. 9, a schematic diagram of the interval distribution of abnormal data in the abnormal data determining method is provided. Referring to fig. 9, the actual abnormal data of a certain area in the service handling process (covering 0.0% to 4.5%), namely the standard abnormal data, is acquired, and the abnormal data determined according to the abnormal data determining method is divided into abnormal intervals; that is, the abnormal interval to which the payment data to be identified belongs is determined, the abnormal intervals including a high-segment interval, a middle-segment interval, a middle-low-segment interval and a low-segment interval. The proportion of usage objects in each abnormal interval can be obtained by counting the usage objects in that interval.

Further, referring to fig. 9, for the abnormal data determined by the abnormal data determining method, the abnormal data of the usage objects is 1.8% in the high-segment interval, 2.7% in the middle-segment interval, 3.1% in the middle-low-segment interval and 3.9% in the low-segment interval. The usage objects in the high-segment interval account for 10%, those in the middle-segment interval for 32%, those in the middle-low-segment interval for 39%, and those in the low-segment interval for 19%.

The abnormal data determined by the abnormal data determining method, namely the overdue risk data of the usage objects, therefore discriminates between risk levels. Referring to fig. 9, the overdue risk of usage objects in the low-segment interval (3.9%) is about 2 times that of usage objects in the high-segment interval (1.8%), so the overdue risk can be predicted accurately. According to the predicted overdue risk, service evaluation processing is performed on the service applications associated with the payment data to be identified in each abnormal interval using the service processing logic associated with that interval, which avoids providing service handling to usage objects with a higher overdue risk and reduces losses in the service handling process.
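The figures quoted from fig. 9 can be checked with a short calculation. The sketch below assumes that the percentage given for each interval is the overdue rate of the usage objects in that interval; the population-weighted overall rate is derived here and is not stated in the source.

```python
# Interval -> (share of usage objects, abnormal data) as quoted from fig. 9.
intervals = {
    "high": (0.10, 0.018),
    "middle": (0.32, 0.027),
    "middle-low": (0.39, 0.031),
    "low": (0.19, 0.039),
}

# The shares of the four intervals should cover all usage objects.
share_total = sum(share for share, _ in intervals.values())

# Risk discrimination: low-segment risk relative to high-segment risk.
risk_ratio = intervals["low"][1] / intervals["high"][1]

# Population-weighted overall rate implied by the four intervals
# (assuming each percentage is the overdue rate of its interval).
overall_rate = sum(share * rate for share, rate in intervals.values())

print(f"share total: {share_total:.2f}")
print(f"low/high risk ratio: {risk_ratio:.2f}")
print(f"implied overall rate: {overall_rate:.3%}")
```

The ratio comes out at roughly 2.17, consistent with the statement that the low-segment overdue risk is about 2 times the high-segment overdue risk, and the implied overall rate of roughly 2.99% falls inside the 0.0% to 4.5% range mentioned for the actual abnormal data.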

In the abnormal data determining method, an abnormal data determination request is received, and the payment data to be identified corresponding to the abnormal data determination request is obtained. An initial abnormal data prediction model is constructed according to the first modeling variables determined from the first payment data samples corresponding to each first target object; modeling variable screening processing is performed based on the second payment data samples corresponding to each second target object and the first modeling variables to determine the second modeling variables; and migration learning is performed on the initial abnormal data prediction model based on the second modeling variables and the first payment data samples to obtain the trained abnormal data prediction model. Finally, abnormal data prediction processing is performed on the payment data to be identified based on the trained abnormal data prediction model, so that the abnormal data corresponding to the payment data to be identified is obtained accurately and rapidly.

It should be understood that, although the steps in the flowcharts related to the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially either, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiment of the application also provides a modeling variable determining device, an abnormal data prediction model constructing device and an abnormal data determining device for realizing the modeling variable determining method, the abnormal data prediction model constructing method and the abnormal data determining method. The implementation of the solution provided by the apparatus is similar to the implementation described in the above method, so the specific limitations in the embodiments of the modeling variable determining apparatus, the abnormal data prediction model constructing apparatus, and the abnormal data determining apparatus provided below may be referred to the limitations in the modeling variable determining method, the abnormal data prediction model constructing method, and the abnormal data determining method described above, and will not be repeated herein.

In one embodiment, as shown in fig. 10, a modeling variable determining apparatus is provided, including a variable derivation processing module 1002, a step feature screening processing module 1004, a business attribute matching processing module 1006, a variable distribution statistics module 1008, and a second modeling variable determining module 1010, wherein:

the variable derivation processing module 1002 is configured to perform variable derivation processing according to the first payment data samples corresponding to the first target objects and the second payment data samples corresponding to the second target objects, so as to obtain a plurality of first derived variables and a plurality of second derived variables;

the step feature screening processing module 1004 is configured to perform step feature screening processing based on each first derivative variable, screen to obtain each first modeling variable, and obtain first service attribute information corresponding to each first modeling variable;

the service attribute matching processing module 1006 is configured to perform service attribute matching processing according to second service attribute information corresponding to each second derivative variable and the first service attribute information, and determine a first modeling variable and a second derivative variable of the same service attribute;

the variable distribution statistics module 1008 is configured to perform variable distribution statistics based on the first modeling variable and the second derivative variable of the same service attribute, and obtain distribution statistics data between the first modeling variable and the second derivative variable of the same service attribute;

the second modeling variable determining module 1010 is configured to screen to obtain second modeling variables matched with each second target object according to the distribution statistical data, each first modeling variable, and each second derivative variable, where each second modeling variable is used to construct and obtain an abnormal data prediction model.

In the modeling variable determining apparatus, variable derivation processing is performed on the first payment data sample corresponding to each first target object and the second payment data sample corresponding to each second target object to obtain a plurality of first derived variables and a plurality of second derived variables, so that more comprehensive and general derived variables are obtained. Step feature screening processing is then performed on each first derivative variable to obtain each first modeling variable, and the first service attribute information corresponding to each first modeling variable is obtained. Service attribute matching processing is performed according to the second service attribute information corresponding to each second derivative variable and the first service attribute information, and the first modeling variable and the second derivative variable with the same service attribute are determined, so that each second derivative variable is preliminarily screened by service attribute against the existing first modeling variables.

Further, variable distribution statistics are performed on the first modeling variable and the second derivative variable with the same service attribute to obtain the distribution statistics data between them. In this way, service attribute matching, variable distribution statistics, and variable screening are applied on top of the first modeling variables screened from the comprehensive first payment data samples, and the second modeling variables matched with each second target object are obtained. An abnormal data prediction model can then be constructed from the second modeling variables. This avoids large modeling variable errors caused by an overly long observation time or by missing data samples during modeling variable determination, so that the abnormal data corresponding to the payment data to be identified can be obtained quickly and accurately from the constructed abnormal data prediction model.

In one embodiment, the variable derivative processing module is further configured to: screening out target variable derivative processing logic matched with the current service scene from the candidate variable derivative processing logic; performing variable derivation processing of multiple attribute levels on the first payment data sample and the second payment data sample according to target variable derivation processing logic to obtain a plurality of first derived variables and a plurality of second derived variables; wherein the attribute hierarchy includes at least two of a consumption attribute, a business attribute, and a feature attribute.
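By way of a non-limiting illustration, the multi-level variable derivation described above can be sketched as a per-object aggregation of payment records. The record fields (`object_id`, `amount`, `category`), the derived variable names, and the assignment of each variable to a consumption, business, or feature level are assumptions made for this example and are not taken from the application.

```python
from collections import defaultdict


def derive_variables(records):
    """Derive per-object variables from raw payment records.

    Each record is a dict with the hypothetical keys
    object_id, amount, and category.
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["object_id"]].append(rec)

    derived = {}
    for object_id, recs in grouped.items():
        amounts = [r["amount"] for r in recs]
        categories = {r["category"] for r in recs}
        derived[object_id] = {
            # consumption-level variables (assumed grouping)
            "txn_count": len(amounts),
            "total_amount": sum(amounts),
            "max_amount": max(amounts),
            # business-level variable (assumed grouping)
            "distinct_categories": len(categories),
            # feature-level variable computed from the ones above
            "avg_amount": sum(amounts) / len(amounts),
        }
    return derived


records = [
    {"object_id": "a", "amount": 10.0, "category": "food"},
    {"object_id": "a", "amount": 30.0, "category": "travel"},
    {"object_id": "b", "amount": 5.0, "category": "food"},
]
features = derive_variables(records)
```

The same function would be applied to the first and the second payment data samples so that both sets of derived variables share one definition.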

In one embodiment, the step feature screening processing module is further configured to: perform primary screening processing according to the information quantity attribute data corresponding to each first derivative variable, and screen out, from the first derivative variables, primary derivative variables whose information quantity attribute data meets the primary screening condition; perform secondary screening processing based on the group stability attribute data of each primary derivative variable, and screen out, from the primary derivative variables, secondary derivative variables whose group stability attribute data meets the secondary screening condition; perform tertiary screening processing according to the characteristic association attribute data of each secondary derivative variable, and screen out, from the secondary derivative variables, tertiary derivative variables whose characteristic association attribute data meets the tertiary screening condition; and perform sign consistency screening processing and variance inflation attribute screening processing based on each tertiary derivative variable to obtain each first modeling variable.
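The first two screening stages can be illustrated with a minimal sketch. It assumes that the information quantity attribute data corresponds to the information value (IV) of a binned variable and that the group stability attribute data corresponds to the population stability index (PSI); this section does not name either statistic. The thresholds `iv_min` and `psi_max` and the per-bin counts are hypothetical.

```python
import math


def information_value(good_counts, bad_counts, eps=1e-6):
    """IV = sum over bins of (good% - bad%) * ln(good% / bad%)."""
    total_good, total_bad = sum(good_counts), sum(bad_counts)
    iv = 0.0
    for g, b in zip(good_counts, bad_counts):
        pg = max(g / total_good, eps)  # eps guards against empty bins
        pb = max(b / total_bad, eps)
        iv += (pg - pb) * math.log(pg / pb)
    return iv


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    total_e, total_a = sum(expected_counts), sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe = max(e / total_e, eps)
        pa = max(a / total_a, eps)
        psi += (pa - pe) * math.log(pa / pe)
    return psi


def screen(candidates, iv_min=0.02, psi_max=0.25):
    """Keep variables that pass the primary (IV) and secondary (PSI) checks."""
    kept = []
    for name, stats in candidates.items():
        if information_value(stats["good"], stats["bad"]) < iv_min:
            continue  # primary screening: too little information
        if population_stability_index(stats["train"], stats["recent"]) > psi_max:
            continue  # secondary screening: distribution has shifted
        kept.append(name)
    return kept


# Hypothetical per-bin counts for three candidate variables.
candidates = {
    "informative_stable": {"good": [80, 20], "bad": [20, 80],
                           "train": [50, 50], "recent": [50, 50]},
    "uninformative": {"good": [50, 50], "bad": [50, 50],
                      "train": [50, 50], "recent": [50, 50]},
    "unstable": {"good": [80, 20], "bad": [20, 80],
                 "train": [90, 10], "recent": [10, 90]},
}
kept = screen(candidates)
```

Ordering the checks from cheap to expensive keeps the later, costlier stages (tertiary screening, sign consistency, variance inflation) working on a smaller candidate set.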

In one embodiment, the step feature screening processing module is further configured to: perform stepwise regression feature screening processing according to the feature statistical verification attribute data of each secondary derivative variable, and screen out, from the secondary derivative variables, sub-derivative variables whose feature statistical verification attribute data meets the stepwise regression feature screening condition; and perform sequence feature screening processing based on the feature combination attribute data of each sub-derivative variable, and screen out, from the sub-derivative variables, tertiary derivative variables whose feature combination attribute data meets the sequence feature screening condition.
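As a rough illustration of the sequence feature screening step only, the sketch below uses greedy forward selection: at each round it adds the variable that most improves a score computed on the combined set, and stops when the gain falls below a threshold. The scoring function, the variable names, and `min_gain` are hypothetical; the stepwise regression screening on statistical test data is not shown.

```python
def sequential_forward_selection(candidates, score_fn, min_gain=1e-3):
    """Greedily add the variable that most improves score_fn(selected)."""
    selected = []
    remaining = list(candidates)
    best_score = score_fn(selected)
    while remaining:
        scored = [(score_fn(selected + [c]), c) for c in remaining]
        top_score, top_candidate = max(scored, key=lambda item: item[0])
        if top_score - best_score < min_gain:
            break  # no remaining variable adds enough to the combination
        selected.append(top_candidate)
        remaining.remove(top_candidate)
        best_score = top_score
    return selected


# Toy score: each variable contributes some value, but "b" and "c" are
# redundant, so holding both earns a penalty equal to one of them.
VALUES = {"a": 0.5, "b": 0.3, "c": 0.3}


def toy_score(subset):
    score = sum(VALUES[name] for name in subset)
    if "b" in subset and "c" in subset:
        score -= 0.3
    return score


selected = sequential_forward_selection(["a", "b", "c"], toy_score)
```

In practice `score_fn` would be a validation metric of a model fitted on the selected variables; the toy score only shows why a redundant variable is left out.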

In one embodiment, the second modeling variable determination module is further configured to: determining a distribution trend of a first modeling variable and a second derivative variable of the same business attribute based on the distribution statistical data; and determining the first modeling variable and the second derivative variable which have the same service attribute and the same distribution trend as second modeling variables matched with the second target objects.
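One possible reading of the distribution trend comparison is sketched below. It assumes that the distribution statistics are per-bin anomaly rates, listed in bin order and keyed by service attribute, and that "same distribution trend" means the same monotonic direction; treating flat or non-monotonic variables as having no usable trend is a design choice of this example. All attribute names and rates are hypothetical.

```python
def trend(rates):
    """Classify a sequence of per-bin rates by monotonic direction."""
    diffs = [b - a for a, b in zip(rates, rates[1:])]
    if all(d >= 0 for d in diffs) and any(d > 0 for d in diffs):
        return "increasing"
    if all(d <= 0 for d in diffs) and any(d < 0 for d in diffs):
        return "decreasing"
    return "none"  # flat or non-monotonic


def select_second_modeling_variables(first_stats, second_stats):
    """Keep service attributes whose two variables share a monotonic trend."""
    selected = []
    for attribute, first_rates in first_stats.items():
        second_rates = second_stats.get(attribute)
        if second_rates is None:
            continue  # no second derivative variable with this attribute
        first_trend = trend(first_rates)
        if first_trend != "none" and first_trend == trend(second_rates):
            selected.append(attribute)
    return selected


# Hypothetical anomaly rates per bin for the two samples.
first_stats = {
    "monthly_spend": [0.10, 0.06, 0.03],
    "night_txn_ratio": [0.02, 0.05, 0.09],
    "refund_count": [0.03, 0.05, 0.08],
}
second_stats = {
    "monthly_spend": [0.12, 0.07, 0.05],
    "night_txn_ratio": [0.08, 0.04, 0.03],
}
second_modeling_variables = select_second_modeling_variables(first_stats, second_stats)
```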

In one embodiment, as shown in fig. 11, there is provided an abnormal data prediction model construction apparatus including: an initial abnormal data prediction model construction module 1102, a modeling variable screening processing module 1104, and an abnormal data prediction model construction module 1106, wherein:

the initial abnormal data prediction model construction module 1102 is configured to obtain first modeling variables determined according to first payment data samples corresponding to each first target object, and construct an initial abnormal data prediction model based on each first modeling variable;

the modeling variable screening processing module 1104 is configured to obtain second modeling variables determined by performing modeling variable screening processing based on the second payment data samples corresponding to the second target objects and the first modeling variables;

the abnormal data prediction model construction module 1106 is configured to perform transfer learning processing on the initial abnormal data prediction model based on each second modeling variable and each first payment data sample, so as to construct an abnormal data prediction model; the abnormal data prediction model is used for performing abnormal data prediction processing on each payment data to be identified to obtain corresponding abnormal data.

In the abnormal data prediction model construction apparatus, the first modeling variables determined according to the first payment data samples corresponding to the first target objects are acquired, and an initial abnormal data prediction model is constructed based on the first modeling variables. The second modeling variables, determined by performing modeling variable screening processing based on the second payment data samples corresponding to the second target objects and the first modeling variables, are also acquired. Further, transfer learning processing is performed on the initial abnormal data prediction model using each second modeling variable and each first payment data sample, so that the abnormal data prediction model is constructed quickly. Abnormal data prediction processing is then performed on each payment data to be identified according to the abnormal data prediction model, which improves the accuracy of the obtained abnormal data.
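The section does not specify the model family or the transfer learning technique. As one simple possibility, the sketch below uses logistic regression trained by batch gradient descent: an initial model is fitted on all first modeling variables, and a second model restricted to the second modeling variables is warm-started from the initial weights and trained further on the first payment data samples. The variable names `x1` and `x2`, the learning rate, and the epoch counts are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, names, init=None, lr=0.5, epochs=200):
    """Batch gradient descent; `init` optionally warm-starts weights by name."""
    init = init or {}
    weights = {n: init.get(n, 0.0) for n in names}
    bias = init.get("__bias__", 0.0)
    for _ in range(epochs):
        grad = {n: 0.0 for n in names}
        grad_bias = 0.0
        for row, y in zip(rows, labels):
            err = sigmoid(bias + sum(weights[n] * row[n] for n in names)) - y
            for n in names:
                grad[n] += err * row[n]
            grad_bias += err
        for n in names:
            weights[n] -= lr * grad[n] / len(rows)
        bias -= lr * grad_bias / len(rows)
    model = dict(weights)
    model["__bias__"] = bias
    return model


def predict(model, row):
    z = model["__bias__"]
    z += sum(w * row[n] for n, w in model.items() if n != "__bias__")
    return sigmoid(z)


# Hypothetical first payment data samples: the label follows x1, not x2.
rows = [{"x1": 0, "x2": 1}, {"x1": 0, "x2": 0},
        {"x1": 1, "x2": 1}, {"x1": 1, "x2": 0}]
labels = [0, 0, 1, 1]

first_modeling_variables = ["x1", "x2"]
second_modeling_variables = ["x1"]  # assumed outcome of the screening step

initial_model = train_logistic(rows, labels, first_modeling_variables)
final_model = train_logistic(rows, labels, second_modeling_variables,
                             init=initial_model, epochs=50)
```

Warm-starting lets the second model reuse what the initial model learned about the shared variables, so far fewer epochs are needed than when training from zero.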

In one embodiment, as shown in fig. 12, there is provided an abnormal data determining apparatus including: a to-be-identified payment data acquisition module 1202 and an abnormal data prediction processing module 1204, wherein:

the to-be-identified payment data acquisition module 1202 is configured to receive an abnormal data determination request and acquire payment data to be identified corresponding to the abnormal data determination request;

the abnormal data prediction processing module 1204 is configured to perform abnormal data prediction processing on the payment data to be identified based on the trained abnormal data prediction model, so as to obtain abnormal data corresponding to the payment data to be identified; the trained abnormal data prediction model is obtained by performing transfer learning processing on the initial abnormal data prediction model based on each second modeling variable and each first payment data sample; the initial abnormal data prediction model is constructed according to first modeling variables determined from the first payment data samples corresponding to each first target object; the second modeling variables are determined by performing modeling variable screening processing based on the second payment data samples corresponding to the second target objects and the first modeling variables.

In the abnormal data determining apparatus, the abnormal data determination request is received, and the payment data to be identified corresponding to the abnormal data determination request is acquired. An initial abnormal data prediction model is constructed according to the first modeling variables determined from the first payment data samples corresponding to the first target objects; modeling variable screening processing is performed based on the second payment data samples corresponding to the second target objects and the first modeling variables to determine the second modeling variables; and transfer learning processing is performed on the initial abnormal data prediction model based on the second modeling variables and the first payment data samples to obtain the trained abnormal data prediction model. Finally, abnormal data prediction processing is performed on the payment data to be identified based on the trained abnormal data prediction model, so that the abnormal data corresponding to the payment data to be identified is obtained accurately and quickly.

In one embodiment, an abnormal data determining apparatus is provided, further including a service evaluation processing module configured to:

determining an abnormal interval to which the payment data to be identified belongs according to the abnormal data; and acquiring service processing logic corresponding to each abnormal interval, and performing service evaluation processing on the service application associated with the payment data to be identified according to the service processing logic.
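The interval lookup can be sketched as follows, assuming the abnormal data is a numeric anomaly score in [0, 1]. The interval boundaries and the names of the service processing logic are hypothetical placeholders.

```python
from bisect import bisect_right

# Hypothetical boundaries splitting the score range into three intervals:
# [0, 0.3), [0.3, 0.7), and [0.7, 1].
BOUNDARIES = [0.3, 0.7]
SERVICE_LOGIC = ["approve", "manual_review", "reject"]


def evaluate_service(anomaly_score):
    """Return the service processing logic for the interval holding the score."""
    interval_index = bisect_right(BOUNDARIES, anomaly_score)
    return SERVICE_LOGIC[interval_index]
```

For example, `evaluate_service(0.5)` falls in the middle interval and returns `"manual_review"`; a score equal to a boundary is assigned to the higher interval.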

Each module in the above modeling variable determining apparatus, abnormal data prediction model construction apparatus, and abnormal data determining apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or be independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 13. The computer device includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data such as the first payment data sample corresponding to each first target object, the second payment data sample corresponding to each second target object, the first derivative variables, the second derivative variables, each first modeling variable, the first service attribute information, the second service attribute information, the first modeling variables and second derivative variables of the same service attribute, the distribution statistical data between the first modeling variables and the second derivative variables of the same service attribute, the second modeling variables, the initial abnormal data prediction model, the abnormal data prediction model, the payment data to be identified, and the abnormal data. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a modeling variable determination method, an abnormal data prediction model construction method, and an abnormal data determination method.

It will be appreciated by those skilled in the art that the structure shown in fig. 13 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.

In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.

In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.

Those skilled in the art will appreciate that all or part of the processes of the above method embodiments may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the computer program, when executed, may include the steps of the above method embodiments. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.

The foregoing embodiments represent only a few implementations of the application. They are described in relative detail, but this should not be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the concept of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the application shall be determined by the appended claims.

Claims

1. A modeling variable determination method, the method comprising:

2. The method according to claim 1, wherein performing a variable derivation process according to the first payment data sample corresponding to each first target object and the second payment data sample corresponding to each second target object to obtain a plurality of first derived variables and a plurality of second derived variables, includes:

screening out target variable derivative processing logic matched with the current service scene from the candidate variable derivative processing logic;

performing variable derivation processing of multiple attribute levels on the first payment data sample and the second payment data sample according to the target variable derivation processing logic to obtain a plurality of first derived variables and a plurality of second derived variables; wherein the attribute hierarchy includes at least two of a consumption attribute, a business attribute, and a feature attribute.

3. The method of claim 1, wherein the step-wise feature screening process is performed based on each of the first derivative variables, the screening resulting in each of the first modeled variables, comprising:

performing primary screening processing according to the information quantity attribute data corresponding to each first derivative variable, and screening out primary derivative variables of which the information quantity attribute data meets primary screening conditions from each first derivative variable;

performing secondary screening processing based on the group stability attribute data of each primary derivative variable, and screening out the secondary derivative variables of which the group stability attribute data meets the secondary screening condition from each primary derivative variable;

performing tertiary screening processing according to the characteristic association attribute data of each secondary derivative variable, and screening out tertiary derivative variables with characteristic association attribute data meeting tertiary screening conditions from each secondary derivative variable;

and performing sign consistency screening processing and variance inflation attribute screening processing based on each tertiary derivative variable to obtain each first modeling variable.

4. The method according to claim 3, wherein the characteristic association attribute data comprises feature statistical verification attribute data and feature combination attribute data, and the tertiary screening condition comprises a stepwise regression feature screening condition corresponding to the feature statistical verification attribute data and a sequence feature screening condition corresponding to the feature combination attribute data; the performing tertiary screening processing according to the characteristic association attribute data of each secondary derivative variable, and screening out tertiary derivative variables with characteristic association attribute data meeting tertiary screening conditions from each secondary derivative variable, includes:

according to the feature statistical verification attribute data of each secondary derivative variable, stepwise regression feature screening processing is carried out, and sub-derivative variables of which the feature statistical verification attribute data meet stepwise regression feature screening conditions are screened out from each secondary derivative variable;

and performing sequence feature screening processing based on the feature combination attribute data of each sub-derivative variable, and screening out, from the sub-derivative variables, tertiary derivative variables whose feature combination attribute data meets the sequence feature screening condition.

5. The method according to any one of claims 1 to 4, wherein the screening to obtain the second modeling variables matched with the second target objects according to the distribution statistical data, the first modeling variables and the second derivative variables includes:

determining distribution trends of a first modeling variable and a second derivative variable of the same service attribute based on the distribution statistical data;

and determining the first modeling variable and the second derivative variable which have the same service attribute and the same distribution trend as second modeling variables matched with the second target objects.

6. A method for constructing an abnormal data prediction model, the method comprising:

7. A method of determining anomaly data, the method comprising:

8. A modeling variable determination apparatus, characterized in that the apparatus comprises:

9. An abnormal data prediction model construction apparatus, characterized in that the apparatus comprises:

10. An abnormal data determination apparatus, characterized in that the apparatus comprises:

11. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.

12. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.

13. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.