CN113239024A

CN113239024A - Bank abnormal data detection method based on outlier detection

Info

Publication number: CN113239024A
Application number: CN202110434414.5A
Authority: CN
Inventors: 郭鹏飞; 王灏
Original assignee: Liaoning Technical University
Current assignee: Liaoning Technical University
Priority date: 2021-04-22
Filing date: 2021-04-22
Publication date: 2021-08-10
Anticipated expiration: 2041-04-22
Also published as: CN113239024B

Abstract

The invention provides a method for detecting abnormal bank data based on outlier detection, and relates to the technical field of outlier detection. The method first defines bank user information data in the form of a triple. It then learns the couplings within and between the attributes of the user information data and constructs an attribute in-coupling space and an inter-attribute coupling space. Next, heterogeneous learning is performed in the kernel space of the user information data, and similarity learning determines the similarity between objects in the heterogeneous kernel space. Finally, an outlier score is calculated for each object from these similarities, the scores are ranked, and the most probable outliers are selected, thereby detecting the abnormal bank data. The method achieves a high outlier detection rate, is suitable for outlier detection in high-dimensional big data, and improves the detection of abnormal bank data by accurately detecting outliers in the bank user information data.

Description

Bank abnormal data detection method based on outlier detection

Technical Field

The invention relates to the technical field of outlier detection, in particular to a bank abnormal data detection method based on outlier detection.

Background

At present, every bank system serves a large number of users and therefore generates a large amount of user information data, from which banks can formulate better customer service schemes, bank positioning schemes and the like. However, this data contains many abnormal records. By detecting the abnormal data, a bank can focus its management on the customers concerned and discover fraud and similar behavior in time.

Outlier detection is an important part of data mining. It focuses on identifying the data objects in a real data set that are inconsistent with, or unexpected relative to, the normal data. Outliers are rare objects that behave differently from most normal objects. Their presence strongly affects normal data analysis and may mislead its results, so detecting them plays an important role in many real-world applications, such as network intrusion detection, detection of rare diseases and bank fraud detection. Identifying outliers is difficult, especially in data with complex relationships that is not independent and identically distributed (non-IID).

Coupling learning is an emerging subfield of data mining that models the original data by analyzing the various complex relationships in non-IID data. Coupling learning raises two key issues. The first is whether learning more couplings improves the quality of the data representation; for example, exploiting the correlation between two attributes can greatly reduce redundant information and yield better representation performance. The second is the need to learn different types of couplings so as to capture interactions and relationships that are complementary or inconsistent. Heterogeneous couplings exist because different interactions and different data distributions give rise to different types of coupling, and a data set usually contains several distributions. Early coupling learning focused on a single coupling or a few strong couplings and rarely considered multiple couplings or weak couplings, although weak couplings can often determine much of the important content; earlier work ignored this.

Coupling between attribute values describes how, and how strongly, values are coupled within an attribute, and coupling between attributes describes how, and how strongly, attributes are coupled with each other. However, existing work involves only a single type of coupling and ignores many other characteristics of categorical data. An outlier detection method that learns heterogeneous and hierarchical couplings can represent the complex relationships in the original data more accurately, and can therefore identify more accurately the outlier objects that differ from normal objects.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a bank abnormal data detection method based on outlier detection, so as to implement the detection of the bank abnormal data.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a bank abnormal data detection method based on outlier detection comprises the following steps:

step 1: defining bank user information data in a triple form;

the user information data is defined as a triple $C = \langle O, A, V \rangle$, where $O = \{o_i \mid i \in N_o\}$ is the set of $n_o$ user objects; $A = \{a_j \mid j \in N_a\}$ is the set of $n_a$ attributes; $V$ is the set of attribute values, containing $n_v$ values; $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of values of attribute $a_j$, containing $n_v^{(j)}$ attribute values; $N_o$, $N_a$ and $N_v^{(j)}$ are the index sets of the objects in $O$, the attributes in $A$ and the categorical values in $V^{(j)}$, respectively; and the value of the j-th attribute $a_j$ for the i-th object $o_i$ is denoted $v_{o_i}^{(j)}$;
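As a rough illustration of the triple in step 1, the sketch below stores objects, attributes and per-object values in a small Python structure. The class name `CategoricalData`, the attribute names and the four sample customers are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class CategoricalData:
    """Hypothetical container for the triple C = <O, A, V>."""
    objects: list      # O: object (user) identifiers
    attributes: list   # A: attribute names
    rows: list         # rows[i][j]: value of attribute j for object i

    def values(self, j):
        """V^(j): distinct values of attribute j, in first-seen order."""
        seen = []
        for row in self.rows:
            if row[j] not in seen:
                seen.append(row[j])
        return seen


# Made-up sample data for illustration only.
data = CategoricalData(
    objects=["u1", "u2", "u3", "u4"],
    attributes=["marital_status", "education"],
    rows=[
        ["married", "bachelor"],
        ["married", "junior_high"],
        ["single", "bachelor"],
        ["married", "bachelor"],
    ],
)

n_o = len(data.objects)                              # number of objects
n_a = len(data.attributes)                           # number of attributes
n_v = sum(len(data.values(j)) for j in range(n_a))   # total number of values
```

Here `V` is simply the collection of the per-attribute value sets, so `n_v` is the sum of their sizes.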

Step 2: learning the coupling within the attributes of the user information data and constructing an attribute in-coupling space;

attribute in-coupling represents the interaction between the attribute values of one attribute, as reflected in the distribution of values within that attribute; it is measured from the value distribution within the attribute by a value frequency function; for a categorical value $v_i^{(j)}$ of the j-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the attribute in-coupling between this value and the other categorical values of the attribute into a one-dimensional attribute in-coupling vector, as shown in the following formula:

$$m_{Ia}^{(j)}(v_i^{(j)}) = \left[ \frac{|g^{(j)}(v_i^{(j)})|}{n_o} \right] \tag{1}$$

where $g^{(j)}(\cdot): V^{(j)} \to O$ maps the value $v_i^{(j)}$ to the set of objects that take this value in the j-th attribute, and $|\cdot|$ denotes the number of elements of a set;

the attribute in-coupling space $M_{Ia}^{(j)}$ of the j-th attribute is then formed from the attribute in-coupling vectors obtained from equation (1) and is defined as follows:

$$M_{Ia}^{(j)} = \{ m_{Ia}^{(j)}(v_i^{(j)}) \mid v_i^{(j)} \in V^{(j)} \} \tag{2}$$

thus, for user information data with $n_a$ attributes, the attribute in-coupling space is $M_{Ia} = \{ M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)} \}$;
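The value frequency function of step 2 can be sketched as follows. This is a minimal illustration that assumes the frequency of a value is the number of objects holding it divided by the total number of objects; the function name `intra_coupling_space` and the sample column are hypothetical.

```python
from collections import Counter


def intra_coupling_space(column):
    """Map each distinct value of one attribute to a one-dimensional
    vector holding its relative frequency (assumed: count / n_o)."""
    n_o = len(column)
    counts = Counter(column)
    return {value: [count / n_o] for value, count in counts.items()}


# Made-up marital-status column for four customers.
marital = ["married", "married", "single", "married"]
space = intra_coupling_space(marital)
print(space)  # {'married': [0.75], 'single': [0.25]}
```

Applying the function to every attribute column in turn gives one such space per attribute.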

Step 3: learning the coupling between the attributes of the user information data and constructing an inter-attribute coupling space;

inter-attribute coupling refers to the interaction between an attribute value and the context information provided by the values of other attributes; this attribute-based interaction and context information supplements the value distribution and interaction captured by the attribute in-coupling;

the inter-attribute coupling is represented by the information conditional probability, which reveals the distribution of an attribute value in the space spanned by the values of other attributes; given a value $v^{(j)}$ of attribute $a_j$ and a value $v^{(k)}$ of attribute $a_k$, the information conditional probability function $p(v^{(j)} \mid v^{(k)})$ between the two attribute values is defined as follows:

$$p(v^{(j)} \mid v^{(k)}) = \frac{|g^{(j)}(v^{(j)}) \cap g^{(k)}(v^{(k)})|}{|g^{(k)}(v^{(k)})|} \tag{3}$$

where $\cap$ returns the intersection of two sets;

based on the information conditional probability function, the inter-attribute coupling learning function $m_{Ie}^{(j)}(\cdot)$ embeds the interaction of the attribute value $v_i^{(j)}$ with the other attributes as a $|V_*|$-dimensional inter-attribute coupling vector, as shown in the following formula:

$$m_{Ie}^{(j)}(v_i^{(j)}) = \left[ p(v_i^{(j)} \mid v_{*1}), \ldots, p(v_i^{(j)} \mid v_{*|V_*|}) \right]^{\top} \tag{4}$$

where $V_* = \{V^{(k)} \mid k \in N_a, k \neq j\}$ is the set of attribute values of all attributes except $a_j$, and $v_{*i} \in V_*$ is an attribute value in the set $V_*$;

the inter-attribute coupling space $M_{Ie}^{(j)}$ of the j-th attribute is composed of the inter-attribute coupling vectors obtained from equation (4), as shown in the following formula:

$$M_{Ie}^{(j)} = \{ m_{Ie}^{(j)}(v_i^{(j)}) \mid v_i^{(j)} \in V^{(j)} \} \tag{5}$$

thus, for user information data with $n_a$ attributes, the inter-attribute coupling space is $M_{Ie} = \{ M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)} \}$;
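The inter-attribute coupling of step 3 can be illustrated with a short sketch. It assumes the information conditional probability is the share of objects holding the conditioning value that also hold the target value; the helper names and the two sample columns are hypothetical.

```python
def objects_with(column, value):
    """Indices of the objects whose value in this attribute equals `value`."""
    return {i for i, v in enumerate(column) if v == value}


def cond_prob(col_j, v_j, col_k, v_k):
    """Assumed form of p(v_j | v_k): |g_j(v_j) & g_k(v_k)| / |g_k(v_k)|."""
    g_j = objects_with(col_j, v_j)
    g_k = objects_with(col_k, v_k)
    return len(g_j & g_k) / len(g_k)


def inter_coupling_vector(columns, j, v_j):
    """Conditional probabilities of v_j given every value of every other attribute."""
    vec = []
    for k, col_k in enumerate(columns):
        if k == j:
            continue
        for v_k in sorted(set(col_k)):   # fixed value order for reproducibility
            vec.append(cond_prob(columns[j], v_j, col_k, v_k))
    return vec


# Made-up columns for four customers.
marital = ["married", "married", "single", "married"]
education = ["bachelor", "junior_high", "bachelor", "bachelor"]
vec = inter_coupling_vector([marital, education], 0, "married")
# vec[0] = p(married | bachelor), vec[1] = p(married | junior_high)
```

The vector has one entry per value of the other attributes, which is why its length grows with the total number of values in the data set.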

Step 4: performing heterogeneous learning in the kernel space of the user information data using the constructed attribute in-coupling space and inter-attribute coupling space;

from the constructed attribute in-coupling space and inter-attribute coupling space, a complete set of heterogeneous coupling spaces $M$ is built as the union of the attribute in-coupling space $M_{Ia}$ and the inter-attribute coupling space $M_{Ie}$, as shown in the following formula:

$$M = M_{Ia} \cup M_{Ie} \tag{6}$$

namely, $M = \{ M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}, M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)} \}$;

each coupling space is converted into kernel spaces using multiple kernels, each kernel space corresponding to one converted coupling space; this generates a set of $n_k$ kernel spaces $\{K_1, \ldots, K_{n_k}\}$, where $F$ is the set of kernel functions used for the conversion; the p-th kernel space, $p \leq n_k$, is spanned by the kernel matrix $K_p$, which is built from a coupling space $M_j$ through the kernel function $k_p(\cdot, \cdot)$, as shown in the following formula:

$$K_p = \begin{bmatrix} k_p(m_1, m_1) & \cdots & k_p(m_1, m_{n_v^{(j)}}) \\ \vdots & \ddots & \vdots \\ k_p(m_{n_v^{(j)}}, m_1) & \cdots & k_p(m_{n_v^{(j)}}, m_{n_v^{(j)}}) \end{bmatrix} \tag{7}$$

where $n_v^{(j)}$ is the number of attribute values represented by $M_j$, and $m_i$ is the vector in the coupling space $M_j$ that represents the i-th attribute value;

the kernel space is reconstructed by a set of transformation matrices $\{T_1, \ldots, T_{n_k}\}$ into $\{K'_1, \ldots, K'_{n_k}\}$, where the p-th kernel matrix $K'_p$ retains only the distribution to which the p-th kernel is sensitive within the corresponding coupling; the reconstructed kernel space is called the heterogeneous kernel space; $K'_p$ is defined as:

$$K'_p = T_p \cdot K_p \tag{8}$$

$T_p$ is specified as a diagonal matrix, as shown in the following formula:

$$T_p = \mathrm{diag}(\alpha_{p1}, \ldots, \alpha_{p n_v^{(j)}}) \tag{9}$$

where $\alpha_{pi}$ is the weight of the i-th attribute value in the p-th kernel space;
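The kernel conversion and diagonal re-weighting of step 4 can be sketched in plain Python. This is an illustration only: the two kernel functions are common choices rather than the ones prescribed by the patent, and the weights passed to `apply_diagonal` are fixed by hand here, whereas the method learns them.

```python
import math


def linear_kernel(x, y):
    """Dot product of two coupling vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is an illustrative parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_matrix(vectors, kernel):
    """K_p: pairwise kernel values between the vectors of one coupling space."""
    return [[kernel(x, y) for y in vectors] for x in vectors]


def apply_diagonal(alphas, K):
    """K'_p = T_p K_p with T_p = diag(alphas): scales row i of K by alphas[i]."""
    return [[a * k for k in row] for a, row in zip(alphas, K)]


# One-dimensional coupling vectors for two attribute values (made-up numbers).
vectors = [[0.75], [0.25]]
K = kernel_matrix(vectors, linear_kernel)
K_prime = apply_diagonal([2.0, 1.0], K)   # hand-picked, not learned, weights
K_rbf = kernel_matrix(vectors, gaussian_kernel)
```

Using several kernel functions on the same coupling space yields several kernel matrices, one per kernel space.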

Step 5: performing similarity learning on the user information data and determining the similarity between objects in the heterogeneous kernel space;

a similarity measure between objects in the heterogeneous kernel spaces is defined first, and the weight of each kernel space is then learned based on this similarity measure to reflect its contribution; given a categorical data set, for the p-th kernel matrix, let $i$ and $i'$ denote the indices of the values in the p-th kernel space that correspond to the i-th object and the i'-th object, respectively, and let $K'_{p,i}$ represent the i-th object in the p-th heterogeneous kernel space; the similarity $S_{p,ii'}$ of the i-th and i'-th objects in that space, measured by a linear kernel, is shown in the following formula:

$$S_{p,ii'} = K'_{p,i} \cdot (K'_{p,i'})^{\top} \tag{10}$$

the final similarity $S_{ii'}$ between the i-th object and the i'-th object is defined as a linear combination of the similarity measures from the heterogeneous kernel spaces, which filters redundant information and integrates complementary information between couplings, as shown in the following formula:

$$S_{ii'} = \sum_{p=1}^{n_k} \beta_p S_{p,ii'} \tag{11}$$

where $\beta_p \geq 0$ is the weight of the similarity in the p-th heterogeneous kernel space,

representing a diagonal matrix;
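The similarity of step 5 can be sketched as a weighted sum of per-kernel-space row products. The matrices, the weights `betas` and the object-to-row mapping `value_index` below are made-up inputs for illustration; in the method the weights are learned.

```python
def row_dot(u, v):
    """Linear kernel between two rows of a kernel matrix."""
    return sum(a * b for a, b in zip(u, v))


def similarity(kernel_spaces, betas, value_index, i, i2):
    """Weighted sum over kernel spaces of the linear-kernel similarity
    between the rows that represent objects i and i2.

    kernel_spaces[p]  : reconstructed kernel matrix K'_p (list of rows)
    betas[p]          : non-negative weight of kernel space p
    value_index[p][i] : row of K'_p that represents object i
    """
    total = 0.0
    for K, beta, idx in zip(kernel_spaces, betas, value_index):
        total += beta * row_dot(K[idx[i]], K[idx[i2]])
    return total


# Two toy kernel spaces and three objects.
kernel_spaces = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 1.0]],
]
betas = [0.5, 0.5]
value_index = [[0, 0, 1], [0, 1, 1]]

s01 = similarity(kernel_spaces, betas, value_index, 0, 1)
s00 = similarity(kernel_spaces, betas, value_index, 0, 0)
s12 = similarity(kernel_spaces, betas, value_index, 1, 2)
```

Objects 0 and 1 share a value in the first kernel space but not in the second, so only the first space contributes to `s01`.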

Step 6: calculating the outlier score of each object based on the similarity between objects in the heterogeneous kernel space, ranking the scores with a top-k method, and selecting the most probable outliers, thereby detecting the abnormal bank data;

the outlier score of the i-th object is defined as the sum of the similarities between object i and all objects, divided by the number of objects, as shown in the following formula:
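The scoring and ranking of step 6 can be sketched as follows. The sketch assumes one reading of the description, in which an object's own similarity is excluded and the sum is divided by the number of remaining objects, and it treats a low average similarity as a sign of an outlier. The similarity matrix is made up for illustration.

```python
def normality_scores(S):
    """Average similarity of each object to all other objects
    (assumption: the object itself is excluded from the average)."""
    n = len(S)
    return [
        sum(S[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def top_k_outliers(S, k):
    """Indices of the k objects with the lowest average similarity."""
    scores = normality_scores(S)
    order = sorted(range(len(S)), key=lambda i: scores[i])
    return order[:k]


# Made-up symmetric similarity matrix for four customers.
S = [
    [1.0, 0.8, 0.9, 0.1],
    [0.8, 1.0, 0.7, 0.2],
    [0.9, 0.7, 1.0, 0.0],
    [0.1, 0.2, 0.0, 1.0],
]
outliers = top_k_outliers(S, 2)
```

Customer 3 is barely similar to anyone else and is therefore ranked first.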

The beneficial effects of the above technical solution are as follows: the bank abnormal data detection method based on outlier detection provided by the invention studies the complex coupling relationships between different data objects in greater depth and can reflect the real relationships between data objects more accurately; applying heterogeneous coupling to outlier detection effectively improves the accuracy of the outlier detection algorithm; the method achieves a high outlier detection rate, is suitable for outlier detection in high-dimensional big data, and improves the detection of abnormal bank data by accurately detecting outliers in the bank user information data.

Drawings

Fig. 1 is a flowchart of a bank abnormal data detection method based on outlier detection according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

In real-world situations, data objects interact with one another. Data are often non-IID (not independent and identically distributed); that is, coupling relationships of varying complexity exist within the same feature, between different features, and even between different objects. Coupling learning is an important idea for handling non-IID data. When learning each object, the couplings within a feature, between features and between data objects need to be considered hierarchically. The method of the invention therefore adds heterogeneous coupling to the coupling analysis and can analyze the complex relationships between the original data objects more accurately, thereby improving the accuracy of outlier detection.

In this embodiment, user information data in a certain bank within a period of time is taken as an example, and the outlier in the user information data within the period of time is detected by using the bank abnormal data detection method based on outlier detection of the present invention, so as to implement detection of abnormal data. In this embodiment, a bank abnormal data detection method based on outlier detection, as shown in fig. 1, includes the following steps:

step 1: defining bank user information data in a triple form;

the user information data is defined as a triple $C = \langle O, A, V \rangle$, where $O = \{o_i \mid i \in N_o\}$ is the set of $n_o$ user objects; $A = \{a_j \mid j \in N_a\}$ is the set of $n_a$ attributes (i.e., customer information such as gender, education, marital status, job status, deposit status, recent transaction status, etc.); $V$ is the set of attribute values, containing $n_v$ values (e.g., male and female, junior high school education, bachelor's degree, deposit amount, recent transactions, etc.); $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of values of attribute $a_j$, containing $n_v^{(j)}$ attribute values (for example, the value "married" of the marital status attribute identifies all objects whose marital status is married); $N_o$, $N_a$ and $N_v^{(j)}$ are the index sets of the objects in $O$, the attributes in $A$ and the categorical values in $V^{(j)}$, respectively; and the value of the j-th attribute $a_j$ for the i-th object $o_i$ is denoted $v_{o_i}^{(j)}$;

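Step 2: learning the coupling within the attributes of the user information data and constructing an attribute in-coupling space;

attribute in-coupling represents the interaction between the attribute values of one attribute, as reflected in the distribution of values within that attribute; it is measured from the value distribution within the attribute by a value frequency function; although the value frequency function takes only one value as input, it measures the distribution over all values. For example, the proportions of different groups among bank customers, such as the ratio of married to unmarried customers or the shares of different education levels, can be analyzed so that the data is better understood.

For a categorical value $v_i^{(j)}$ of the j-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the attribute in-coupling between this value and the other categorical values of the attribute into a one-dimensional attribute in-coupling vector, as shown in the following formula:

$$m_{Ia}^{(j)}(v_i^{(j)}) = \left[ \frac{|g^{(j)}(v_i^{(j)})|}{n_o} \right] \tag{1}$$

where $g^{(j)}(\cdot): V^{(j)} \to O$ maps the value $v_i^{(j)}$ to the set of objects that take this value in the j-th attribute, and $|\cdot|$ denotes the number of elements of a set;

the attribute in-coupling space $M_{Ia}^{(j)}$ of the j-th attribute is then formed from the attribute in-coupling vectors obtained from equation (1):

$$M_{Ia}^{(j)} = \{ m_{Ia}^{(j)}(v_i^{(j)}) \mid v_i^{(j)} \in V^{(j)} \} \tag{2}$$

thus, for user information data with $n_a$ attributes, the attribute in-coupling space is $M_{Ia} = \{ M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)} \}$.

The attribute in-coupling space is only a one-dimensional embedding of the categorical data space with respect to each attribute; the inter-attribute coupling described next takes the interaction between attributes into account;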

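Step 3: learning the coupling between the attributes of the user information data and constructing an inter-attribute coupling space;

inter-attribute coupling refers to the interaction between an attribute value and the contextual (and/or semantic) information provided by the values of other attributes; this attribute-based interaction and context information supplements the value distribution and interaction captured by the attribute in-coupling; for example, one user may have low credit and no transaction history and then suddenly record a large transaction, while another user has fixed monthly deposits and withdrawals; the former behavior is the more suspicious, which shows that attributes are interrelated and that inter-attribute coupling works together with attribute in-coupling to help analyze customers.

The inter-attribute coupling is represented by the information conditional probability; given a value $v^{(j)}$ of attribute $a_j$ and a value $v^{(k)}$ of attribute $a_k$, the information conditional probability function is defined as follows:

$$p(v^{(j)} \mid v^{(k)}) = \frac{|g^{(j)}(v^{(j)}) \cap g^{(k)}(v^{(k)})|}{|g^{(k)}(v^{(k)})|} \tag{3}$$

where $\cap$ returns the intersection of two sets;

based on the information conditional probability function, the inter-attribute coupling learning function $m_{Ie}^{(j)}(\cdot)$ embeds the interaction of the attribute value $v_i^{(j)}$ with the other attributes as a $|V_*|$-dimensional inter-attribute coupling vector:

$$m_{Ie}^{(j)}(v_i^{(j)}) = \left[ p(v_i^{(j)} \mid v_{*1}), \ldots, p(v_i^{(j)} \mid v_{*|V_*|}) \right]^{\top} \tag{4}$$

where $V_* = \{V^{(k)} \mid k \in N_a, k \neq j\}$ is the set of attribute values of all attributes except $a_j$, and $v_{*i} \in V_*$ is an attribute value in the set $V_*$;

the inter-attribute coupling space $M_{Ie}^{(j)}$ of the j-th attribute is composed of the inter-attribute coupling vectors obtained from equation (4):

$$M_{Ie}^{(j)} = \{ m_{Ie}^{(j)}(v_i^{(j)}) \mid v_i^{(j)} \in V^{(j)} \} \tag{5}$$

If $|V| > 2|V^{(j)}| - 1$, the inter-attribute coupling learning function projects the categorical values into a higher-dimensional space, because the dimension of the inter-attribute coupling space equals $|V| - |V^{(j)}|$, while the degrees of freedom of the j-th attribute (equal to the number of dimensions obtained by converting the categorical values into dummy variables) is $|V^{(j)}| - 1$. The inter-attribute coupling thus captures the value couplings induced by the other attributes and complements the attribute in-coupling to form a complete representation of the categorical attribute space.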
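The dimension argument above can be checked with a few lines of arithmetic. The sketch assumes the condition reads "total number of values greater than twice the attribute's value count minus one"; the value counts are made up.

```python
def inter_coupling_dim(value_counts, j):
    """Dimension of the inter-attribute coupling space of attribute j:
    total number of values minus the values of attribute j itself."""
    return sum(value_counts) - value_counts[j]


def degrees_of_freedom(value_counts, j):
    """Dummy-variable dimension of attribute j."""
    return value_counts[j] - 1


def projects_higher(value_counts, j):
    """True when the inter-attribute embedding has more dimensions
    than the dummy-variable encoding of attribute j."""
    return sum(value_counts) > 2 * value_counts[j] - 1


# Hypothetical data set: four attributes with 2, 5, 4 and 3 distinct values.
value_counts = [2, 5, 4, 3]
dim = inter_coupling_dim(value_counts, 1)     # 14 - 5
dof = degrees_of_freedom(value_counts, 1)     # 5 - 1
higher = projects_higher(value_counts, 1)     # 14 > 9
```

The inequality is simply a rearrangement of "coupling-space dimension greater than degrees of freedom", so the two checks always agree.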

Step 4: performing heterogeneous learning in the kernel space of the user information data using the constructed attribute in-coupling space and inter-attribute coupling space;

from the perspectives of coupling within attributes and coupling between attributes, a complete set of heterogeneous coupling spaces $M$ is built from the constructed attribute in-coupling space and inter-attribute coupling space, as the union of the attribute in-coupling space $M_{Ia}$ and the inter-attribute coupling space $M_{Ie}$:

$$M = M_{Ia} \cup M_{Ie} \tag{6}$$

namely, $M = \{ M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}, M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)} \}$.

To integrate the heterogeneous couplings efficiently, the learned heterogeneous coupling spaces are converted into a unified space in which heterogeneous couplings are comparable. Specifically, each coupling space is converted into kernel spaces using multiple kernels, each kernel space corresponding to one converted coupling space; this generates a set of $n_k$ kernel spaces $\{K_1, \ldots, K_{n_k}\}$, where the p-th kernel matrix $K_p$ is built from a coupling space $M_j$ through the kernel function $k_p(\cdot, \cdot)$:

$$K_p = \begin{bmatrix} k_p(m_1, m_1) & \cdots & k_p(m_1, m_{n_v^{(j)}}) \\ \vdots & \ddots & \vdots \\ k_p(m_{n_v^{(j)}}, m_1) & \cdots & k_p(m_{n_v^{(j)}}, m_{n_v^{(j)}}) \end{bmatrix} \tag{7}$$

where $n_v^{(j)}$ is the number of attribute values represented by $M_j$, and $m_i$ is the vector in the coupling space $M_j$ that represents the i-th attribute value;

to reveal the heterogeneity within the couplings, the weights of the values in each kernel space are learned. Specifically, the kernel space is reconstructed by a set of transformation matrices $\{T_1, \ldots, T_{n_k}\}$ into $\{K'_1, \ldots, K'_{n_k}\}$, where

$$K'_p = T_p \cdot K_p \tag{8}$$

$T_p$ is specified as a diagonal matrix, as shown in the following formula:

$$T_p = \mathrm{diag}(\alpha_{p1}, \ldots, \alpha_{p n_v^{(j)}}) \tag{9}$$

where $\alpha_{pi}$ is the weight of the i-th attribute value in the p-th kernel space; the larger $\alpha_{pi}$ is, the stronger the coupling of the i-th attribute value revealed by the coupling space corresponding to the p-th kernel space.

In summary, the bank customer information is converted into multiple kernel spaces, and the relationships within the customer information are represented by kernel matrices. To reveal the heterogeneity within the couplings, the weight of each attribute value in each kernel space is learned: a set of transformation matrices is learned to reconstruct the kernel space and to find the attribute values with large weights, i.e. the attribute values with high coupling strength.

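Step 5: performing similarity learning on the user information data and determining the similarity between objects in the heterogeneous kernel space;

to further capture the heterogeneity between couplings, the influence of each heterogeneous kernel space on the final result is learned. A similarity measure between objects in the heterogeneous kernel spaces is defined first, and the weight of each kernel space is then learned based on this similarity measure to reflect its contribution; given a categorical data set, for the p-th kernel matrix, let $i$ and $i'$ denote the indices of the values in the p-th kernel space that correspond to the i-th object and the i'-th object, respectively, and let $K'_{p,i}$ (row $i$ of $K'_p$) represent the i-th object in the p-th heterogeneous kernel space; the similarity $S_{p,ii'}$ of the i-th and i'-th objects in that space, measured by a linear kernel, is shown in the following formula:

$$S_{p,ii'} = K'_{p,i} \cdot (K'_{p,i'})^{\top} \tag{10}$$

the final similarity $S_{ii'}$ between the i-th object and the i'-th object is defined as a linear combination of the similarity measures from the heterogeneous kernel spaces:

$$S_{ii'} = \sum_{p=1}^{n_k} \beta_p S_{p,ii'} \tag{11}$$

where $\beta_p \geq 0$ is the weight of the similarity in the p-th heterogeneous kernel space,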

representing a diagonal matrix;

similarity learning is performed on the bank customer information: the similarity between every pair of objects is calculated, so that dissimilar bank customer records can be found, and these problematic records are the abnormal data. (This process filters out attributes that do not help the detection, i.e. attributes with low coupling, such as gender.)

Step 6: calculating the outlier score of each object based on the similarity between objects in the heterogeneous kernel space, ranking the scores with a top-k method, and selecting the most probable outliers; the similarity between two objects is calculated by formula (11); the similarities between object i and all objects are summed and divided by the number of objects (excluding object i) to obtain the normality of the object; the lower the normality, the more likely the object is an outlier, so the top-k method is used to rank the objects and select the most probable outliers. The outlier score of the i-th object is defined as the sum of the similarities between object i and all objects, divided by the number of objects, as shown in the following formula:

through similarity learning, the similarity between objects can be calculated, and customers with low similarity are ranked higher in terms of data abnormality. When customer data is assessed for abnormality, the highly ranked objects require extra attention to check whether fraud is involved, so that fraudulent behavior can be clearly identified.

After the outliers are detected by the method of the invention, they can be deleted, because outliers are the few data points in a data set that differ significantly from the mainstream data. The remaining normal data can then be analyzed so that the bank can design better protection measures. With the abnormal data removed, customer conditions can also be analyzed in order to formulate better customer service schemes, bank positioning schemes and the like. The detected outliers can also be analyzed in their own right, since these few data objects may carry important information.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims

1. A bank abnormal data detection method based on outlier detection is characterized in that: the method comprises the following steps:

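step 1: defining bank user information data in the form of a triple;

step 2: learning the coupling within the attributes of the user information data and constructing an attribute in-coupling space;

step 3: learning the coupling between the attributes of the user information data and constructing an inter-attribute coupling space;

step 4: performing heterogeneous learning in the kernel space of the user information data using the constructed attribute in-coupling space and inter-attribute coupling space;

step 5: performing similarity learning on the user information data and determining the similarity between objects in the heterogeneous kernel space;

step 6: calculating the outlier score of each object based on the similarity between objects in the heterogeneous kernel space, ranking the scores with a top-k method, and selecting the most probable outliers, thereby detecting the abnormal bank data.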

2. The bank abnormal data detection method based on outlier detection according to claim 1, characterized in that: the specific method of the step 1 comprises the following steps:

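the user information data is defined as a triple $C = \langle O, A, V \rangle$, where $O = \{o_i \mid i \in N_o\}$ is the set of $n_o$ user objects; $A = \{a_j \mid j \in N_a\}$ is the set of $n_a$ attributes; $V$ is the set of attribute values, containing $n_v$ values; $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of values of attribute $a_j$, containing $n_v^{(j)}$ attribute values; $N_o$, $N_a$ and $N_v^{(j)}$ are the index sets of the objects in $O$, the attributes in $A$ and the categorical values in $V^{(j)}$, respectively; and the value of the j-th attribute $a_j$ for the i-th object $o_i$ is denoted $v_{o_i}^{(j)}$.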

3. The bank abnormal data detection method based on outlier detection according to claim 2, characterized in that: the specific method of the step 2 comprises the following steps:

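for a categorical value $v_i^{(j)}$ of the j-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the attribute in-coupling between this value and the other categorical values of the attribute into a one-dimensional attribute in-coupling vector, as shown in the following formula:

$$m_{Ia}^{(j)}(v_i^{(j)}) = \left[ \frac{|g^{(j)}(v_i^{(j)})|}{n_o} \right] \tag{1}$$

where $g^{(j)}(\cdot): V^{(j)} \to O$ maps the value $v_i^{(j)}$ to the set of objects that take this value in the j-th attribute, and $|\cdot|$ denotes the number of elements of a set;

the attribute in-coupling space $M_{Ia}^{(j)}$ of the j-th attribute is then formed from the attribute in-coupling vectors obtained from equation (1):

$$M_{Ia}^{(j)} = \{ m_{Ia}^{(j)}(v_i^{(j)}) \mid v_i^{(j)} \in V^{(j)} \} \tag{2}$$

thus, for user information data with $n_a$ attributes, the attribute in-coupling space is $M_{Ia} = \{ M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)} \}$.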

4. The bank abnormal data detection method based on outlier detection according to claim 3, characterized in that: the specific method of the step 3 comprises the following steps:

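the inter-attribute coupling is represented by the information conditional probability, which reveals the distribution of an attribute value in the space spanned by the values of other attributes; given a value $v^{(j)}$ of attribute $a_j$ and a value $v^{(k)}$ of attribute $a_k$, the information conditional probability function $p(v^{(j)} \mid v^{(k)})$ between the two attribute values is defined as follows:

$$p(v^{(j)} \mid v^{(k)}) = \frac{|g^{(j)}(v^{(j)}) \cap g^{(k)}(v^{(k)})|}{|g^{(k)}(v^{(k)})|} \tag{3}$$

where $\cap$ returns the intersection of two sets;

based on the information conditional probability function, the inter-attribute coupling learning function $m_{Ie}^{(j)}(\cdot)$ embeds the interaction of the attribute value $v_i^{(j)}$ with the other attributes as a $|V_*|$-dimensional inter-attribute coupling vector:

$$m_{Ie}^{(j)}(v_i^{(j)}) = \left[ p(v_i^{(j)} \mid v_{*1}), \ldots, p(v_i^{(j)} \mid v_{*|V_*|}) \right]^{\top} \tag{4}$$

where $V_* = \{V^{(k)} \mid k \in N_a, k \neq j\}$ is the set of attribute values of all attributes except $a_j$, and $v_{*i} \in V_*$ is an attribute value in the set $V_*$;

the inter-attribute coupling space $M_{Ie}^{(j)}$ of the j-th attribute is composed of the inter-attribute coupling vectors obtained from equation (4):

$$M_{Ie}^{(j)} = \{ m_{Ie}^{(j)}(v_i^{(j)}) \mid v_i^{(j)} \in V^{(j)} \} \tag{5}$$

thus, for user information data with $n_a$ attributes, the inter-attribute coupling space is $M_{Ie} = \{ M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)} \}$.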

5. The bank anomaly data detection method based on outlier detection according to claim 4, wherein: the specific method of the step 4 comprises the following steps:

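$$M = M_{Ia} \cup M_{Ie} \tag{6}$$

namely, $M = \{ M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}, M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)} \}$;

each coupling space is converted into kernel spaces using multiple kernels, generating a set of $n_k$ kernel spaces $\{K_1, \ldots, K_{n_k}\}$; the p-th kernel matrix $K_p$ is built from a coupling space $M_j$ through the kernel function $k_p(\cdot, \cdot)$:

$$K_p = \begin{bmatrix} k_p(m_1, m_1) & \cdots & k_p(m_1, m_{n_v^{(j)}}) \\ \vdots & \ddots & \vdots \\ k_p(m_{n_v^{(j)}}, m_1) & \cdots & k_p(m_{n_v^{(j)}}, m_{n_v^{(j)}}) \end{bmatrix} \tag{7}$$

where $n_v^{(j)}$ is the number of attribute values represented by $M_j$, and $m_i$ is the vector in the coupling space $M_j$ that represents the i-th attribute value;

the kernel space is reconstructed by a set of transformation matrices $\{T_1, \ldots, T_{n_k}\}$ into $\{K'_1, \ldots, K'_{n_k}\}$, where

$$K'_p = T_p \cdot K_p \tag{8}$$

$T_p$ is specified as a diagonal matrix, as shown in the following formula:

$$T_p = \mathrm{diag}(\alpha_{p1}, \ldots, \alpha_{p n_v^{(j)}}) \tag{9}$$

where $\alpha_{pi}$ is the weight of the i-th attribute value in the p-th kernel space.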

6. The bank abnormal data detection method based on outlier detection according to claim 5, characterized in that: the specific method of the step 5 comprises the following steps:

representing a diagonal matrix.

7. The bank anomaly data detection method based on outlier detection according to claim 6, wherein: the step 6 defines the outlier of the ith object as the sum of the similarities of the object i and all the objects, and then divides the sum by the number of the objects, as shown in the following formula: