CN113239024B

CN113239024B - Bank abnormal data detection method based on outlier detection

Info

Publication number: CN113239024B
Application number: CN202110434414.5A
Authority: CN
Inventors: 郭鹏飞; 王灏
Original assignee: Liaoning Technical University
Current assignee: Liaoning Technical University
Priority date: 2021-04-22
Filing date: 2021-04-22
Publication date: 2023-11-07
Anticipated expiration: 2041-04-22
Also published as: CN113239024A

Abstract

The invention provides a bank abnormal data detection method based on outlier detection, and relates to the technical field of outlier detection. First, bank user information data is defined in triplet form. Next, the couplings within and between the attributes of the user information data are learned, and an intra-attribute coupling space and an inter-attribute coupling space are constructed. Heterogeneous learning is then performed in the kernel spaces of the user information data. Similarity learning follows, which determines the similarity between objects in the heterogeneous kernel spaces. Finally, an outlier score is calculated for each object from these similarities, the scores are ranked, and the most likely outliers are selected, thereby detecting abnormal bank data. The method has a high outlier detection rate and is suitable for outlier detection on high-dimensional big data, so that accurate detection of outliers in bank user information data improves the detection of abnormal bank data.

Description

Bank abnormal data detection method based on outlier detection

Technical Field

The invention relates to the technical field of outlier detection, in particular to a bank abnormal data detection method based on outlier detection.

Background

At present, each banking system has a large number of users and therefore generates a large amount of user information data. Based on this data, a bank can formulate better customer service schemes, positioning strategies, and the like. However, the data contains many abnormal records; by detecting them, the bank can place the corresponding customers under focused management and discover fraud and similar behavior in time.

Outlier detection is an important part of data mining. It focuses on identifying the data objects in a real data set that are inconsistent with, or do not conform to, normal data. Unlike the majority of normal objects, outliers are rare objects whose characteristics differ from those of normal objects. Their presence strongly affects normal data analysis and may mislead its results, so outlier detection plays an important role in many real-world applications, such as network intrusion detection, detection of rare diseases, and bank fraud detection. Identifying outliers is very difficult, especially in data that is not independent and identically distributed (non-IID) and has complex relationships.

Coupling learning is an emerging sub-field of data mining. It analyzes the various complex relationships in non-IID data and models the original data accordingly. Coupling learning raises the question of whether learning more couplings improves the representation quality of the data; for example, exploiting the correlation between two attributes can greatly reduce redundant information and achieve better representation performance. Another key issue is that, when different types of couplings are learned, the interactions and relationships that complement or conflict with one another must be captured. Different interactions in the data and different data distributions give rise to different types of coupling, and data sets follow various distributions; both are reasons why heterogeneous couplings exist. Early coupling learning focused on a single coupling or only a few strong couplings and rarely considered multiple couplings or weak couplings, although weak couplings often determine much important content, which earlier work neglected.

Coupling between attribute values represents the manner and degree to which the values within an attribute are coupled, and coupling between attributes represents the manner and degree to which attributes are coupled; methods based on these two kinds of coupling have proven effective. However, existing work involves only a single coupling and ignores many other characteristics of categorical data. An outlier detection method that learns heterogeneous and hierarchical couplings can represent the complex relationships in the original data more accurately, so that outlier objects that differ from normal objects can be identified more accurately.

Disclosure of Invention

The invention aims to solve the technical problem of providing a bank abnormal data detection method based on outlier detection to realize detection of bank abnormal data.

In order to solve the technical problems, the invention adopts the following technical scheme: a bank abnormal data detection method based on outlier detection comprises the following steps:

Step 1: defining bank user information data in a triplet form;

defining the user information data as a triplet $C = \langle O, A, V \rangle$, wherein $O = \{o_i \mid i \in N_o\}$ is a set of objects with $n_o$ users; $A = \{a_j \mid j \in N_a\}$ is a set of attributes with $n_a$ attributes; $V = \bigcup_{j=1}^{n_a} V^{(j)}$ is the collection of the attribute values of the $n_a$ attributes; $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of attribute values of attribute $a_j$, with $n_v^{(j)}$ attribute values; $N_o$, $N_a$ and $N_v^{(j)}$ are respectively the index sets of the objects in $O$, the attributes in $A$ and the attribute values in $V^{(j)}$; for the $i$-th object $o_i$, the attribute value of the $j$-th attribute $a_j$ is expressed as $v_{o_i}^{(j)}$;

Step 2: learning the couplings within the attributes of the user information data and constructing an intra-attribute coupling space;

intra-attribute coupling represents the interactions between the attribute values within an attribute and their value distribution; the intra-attribute coupling is measured from the distribution within the attribute by a value frequency function; for an attribute value $v_i^{(j)}$ in the $j$-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the intra-attribute coupling between this attribute value and the other attribute values of the attribute into a one-dimensional intra-attribute coupling vector, as follows:

$$m_{Ia}^{(j)}\left(v_i^{(j)}\right) = \left[\frac{\left|g^{(j)}\left(v_i^{(j)}\right)\right|}{n_o}\right] \tag{1}$$

wherein $g^{(j)}(\cdot): V^{(j)} \to O$ maps the value $v_i^{(j)}$ to the set of objects that take the value $v_i^{(j)}$ in the $j$-th attribute, and $|\cdot|$ denotes the number of elements in a set;

the intra-attribute coupling space $M_{Ia}^{(j)}$ of the $j$-th attribute is then composed of the intra-attribute coupling vectors obtained by formula (1), and is defined as follows:

$$M_{Ia}^{(j)} = \left\{ m_{Ia}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \tag{2}$$

thus, for user information data with $n_a$ attributes, the intra-attribute coupling space is $M_{Ia} = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}\}$;

Step 3: learning the couplings between the attributes of the user information data and constructing an inter-attribute coupling space;

inter-attribute coupling refers to the interactions between an attribute and the context information of the attribute values of other attributes; these attribute-based interactions and context information supplement the value distributions and interactions captured by the intra-attribute coupling;

the inter-attribute coupling is represented by an information conditional probability, which reveals the distribution of an attribute value in the space spanned by the values of other attributes; given an attribute value $v^{(j)}$ of attribute $a_j$ and an attribute value $v^{(k)}$ of attribute $a_k$, the information conditional probability function $p(v^{(j)} \mid v^{(k)})$ is defined as follows:

$$p\left(v^{(j)} \mid v^{(k)}\right) = \frac{\left|g^{(j)}\left(v^{(j)}\right) \cap g^{(k)}\left(v^{(k)}\right)\right|}{\left|g^{(k)}\left(v^{(k)}\right)\right|} \tag{3}$$

wherein $\cap$ returns the intersection of the two sets;

based on the information conditional probability function, the inter-attribute coupling learning function $m_{Ie}^{(j)}(\cdot)$ embeds the interactions between the attribute value $v_i^{(j)}$ and the other attributes as a $|V_*|$-dimensional inter-attribute coupling vector, as follows:

$$m_{Ie}^{(j)}\left(v_i^{(j)}\right) = \left[p\left(v_i^{(j)} \mid v_{*1}\right), \ldots, p\left(v_i^{(j)} \mid v_{*|V_*|}\right)\right]^{\top} \tag{4}$$

wherein $V_* = \{V^{(k)} \mid k \in N_a, k \neq j\}$ is the set of attribute values of all attributes other than $a_j$, and $v_{*i} \in V_*$ is an attribute value in the set $V_*$;

the inter-attribute coupling space $M_{Ie}^{(j)}$ of the $j$-th attribute is composed of the inter-attribute coupling vectors obtained by formula (4), as follows:

$$M_{Ie}^{(j)} = \left\{ m_{Ie}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \tag{5}$$

thus, for user information data with $n_a$ attributes, the inter-attribute coupling space is $M_{Ie} = \{M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)}\}$;

Step 4: performing heterogeneous learning in the kernel spaces of the user information data using the constructed intra-attribute coupling space and inter-attribute coupling space;

from the constructed intra-attribute coupling space $M_{Ia}$ and inter-attribute coupling space $M_{Ie}$, a complete set of heterogeneous coupling spaces $M$ is further constructed as the union of the intra-attribute coupling space $M_{Ia}$ and the inter-attribute coupling space $M_{Ie}$, as follows:

$$M = M_{Ia} \cup M_{Ie} \tag{6}$$

i.e. $M = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}, M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)}\}$;

each coupling space is converted into its respective kernel space using a plurality of kernels, wherein each kernel space corresponds to one converted coupling space; a set of $n_k$ kernel spaces $\{K_1, \ldots, K_{n_k}\}$ is generated, with $n_k = |M| \times |F|$, where $F$ is the set of kernel functions used for the conversion; the $p$-th kernel space is spanned by the kernel matrix $K_p$, $p \le n_k$; the kernel matrix $K_p$ is obtained from the heterogeneous coupling space $M_j$ through the kernel function $k_p(\cdot, \cdot)$, as follows:

$$K_p = \begin{bmatrix} k_p(m_1, m_1) & \cdots & k_p(m_1, m_{n_v^{(j)}}) \\ \vdots & \ddots & \vdots \\ k_p(m_{n_v^{(j)}}, m_1) & \cdots & k_p(m_{n_v^{(j)}}, m_{n_v^{(j)}}) \end{bmatrix} \tag{7}$$

wherein $n_v^{(j)}$ is the number of attribute values represented by $M_j$, and $m_j$ is the representation of the $j$-th attribute value in the heterogeneous coupling space $M_j$;

through a set of transformation matrices $\{T_1, \ldots, T_{n_k}\}$, the kernel spaces $\{K_1, \ldots, K_{n_k}\}$ are reconstructed as $\{K'_1, \ldots, K'_{n_k}\}$, wherein the $p$-th kernel matrix $K'_p$ contains only the distribution to which the $p$-th kernel is sensitive and which suits the corresponding coupling; the reconstructed kernel spaces are the heterogeneous kernel spaces; $K'_p$ is defined as follows:

$$K'_p = T_p \cdot K_p \tag{8}$$

$T_p$ is defined as a diagonal matrix, as follows:

$$T_p = \operatorname{diag}\left(\alpha_{p1}, \ldots, \alpha_{p n_v^{(j)}}\right) \tag{9}$$

wherein $\alpha_{pj}$ is the weight of the $j$-th attribute value in the $p$-th kernel space;

Step 5: performing similarity learning on the user information data and determining the similarity between objects in the heterogeneous kernel spaces;

first, a similarity measure between objects in the heterogeneous kernel spaces is defined; then the weight of each kernel space is learned based on the similarity measure to reflect the contribution of that kernel space; given an attribute data set, for the $p$-th kernel matrix, let $i$ and $i'$ denote the indexes of the values in the $p$-th kernel space that correspond to the $i$-th object and the $i'$-th object, respectively, and let $K'_{p,i}$ represent the $i$-th object in the $p$-th heterogeneous kernel space; the similarity $S_{p,ii'}$ between the $i$-th and $i'$-th objects, measured by the linear kernel in that space, is as follows:

$$S_{p,ii'} = K'_{p,i} \cdot {K'_{p,i'}}^{\top} \tag{10}$$

the final similarity $S_{ii'}$ between the $i$-th object and the $i'$-th object is defined as a linear combination of the similarity measures from the heterogeneous kernel spaces, so as to filter redundant information and integrate complementary information between couplings, as follows:

$$S_{ii'} = \sum_{p=1}^{n_k} \beta_p S_{p,ii'} \tag{11}$$

wherein $\beta_p \ge 0$ is the weight of the similarity in the $p$-th heterogeneous kernel space, and $T_p$ in $K'_p = T_p \cdot K_p$ is the diagonal matrix of formula (9);

Step 6: calculating an outlier score for each object based on the similarity between objects in the heterogeneous kernel spaces, ranking the scores by a top-k method, and selecting the most likely outliers, thereby detecting the abnormal bank data;

the outlier score of the $i$-th object is defined as the sum of the similarities between object $i$ and all other objects, divided by the number of those objects, as follows:

$$\mathrm{score}(o_i) = \frac{1}{n_o - 1} \sum_{i' \in N_o,\, i' \neq i} S_{ii'} \tag{12}$$

The beneficial effects of the above technical scheme are as follows: the bank abnormal data detection method based on outlier detection studies the complex coupling relationships among data objects in greater depth and reflects the real relationships among them more accurately; applying heterogeneous coupling to outlier detection effectively improves the accuracy of the outlier detection algorithm; the method has a high outlier detection rate and is suitable for outlier detection on high-dimensional big data, so that accurate detection of outliers in bank user information data improves the detection of abnormal bank data.

Drawings

Fig. 1 is a flowchart of a method for detecting abnormal bank data based on outlier detection according to an embodiment of the present invention.

Detailed Description

The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.

In real-world situations, data objects interact with one another. Data is often not independent and identically distributed (non-IID); that is, the same features, different features, and even different objects are coupled in more or less complex ways. Coupling learning is a central idea in handling non-IID data. When each object is learned, the couplings within the same feature, between different features, and between different data objects need to be considered hierarchically. The method therefore adds heterogeneous coupling to the coupling analysis, so that the complex relationships between the original data objects can be analyzed more accurately and the accuracy of outlier detection is improved.

In this embodiment, taking the user information data of a bank within a certain period of time as an example, the bank abnormal data detection method based on outlier detection is used to detect outliers in the user information data of that period, thereby detecting abnormal data. As shown in Fig. 1, the method of this embodiment comprises the following steps:

step 1: defining bank user information data in a triplet form;

defining the user information data as a triplet $C = \langle O, A, V \rangle$, wherein $O = \{o_i \mid i \in N_o\}$ is a set of objects with $n_o$ users; $A = \{a_j \mid j \in N_a\}$ is a set of attributes with $n_a$ attributes (i.e., customer information such as gender, education, marital status, job status, deposit status, recent transaction status, etc.); $V = \bigcup_{j=1}^{n_a} V^{(j)}$ is the collection of the attribute values of the $n_a$ attributes (e.g., male and female, junior middle school, family status, deposit amount, recent transactions, etc.); $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of attribute values of attribute $a_j$ (for example, the value "married" of the marital-status attribute, shared by all married users), with $n_v^{(j)}$ attribute values; $N_o$, $N_a$ and $N_v^{(j)}$ are respectively the index sets of the objects in $O$, the attributes in $A$ and the attribute values in $V^{(j)}$; for the $i$-th object $o_i$, the attribute value of the $j$-th attribute $a_j$ is expressed as $v_{o_i}^{(j)}$;
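As an illustration only, the triplet of step 1 could be held in memory as sketched below. This is a minimal sketch and not part of the described method: the class name `CategoricalData`, its field names, and the four sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CategoricalData:
    """Triplet C = <O, A, V> held as a table of categorical records (hypothetical layout)."""
    objects: list      # O: one identifier per user
    attributes: list   # A: attribute names a_1 .. a_{n_a}
    rows: list         # rows[i][j] is the value of attribute a_j for object o_i

    def values(self, j):
        """V^(j): the distinct values that attribute a_j takes, sorted."""
        return sorted({row[j] for row in self.rows})

    def all_values(self):
        """V: the union of V^(j) over all attributes, tagged by attribute index."""
        return [(j, v) for j in range(len(self.attributes)) for v in self.values(j)]


# Hypothetical sample: four users, two attributes.
data = CategoricalData(
    objects=["u1", "u2", "u3", "u4"],
    attributes=["marital_status", "education"],
    rows=[
        ["married", "bachelor"],
        ["single", "bachelor"],
        ["married", "master"],
        ["married", "bachelor"],
    ],
)

print(data.values(0))          # distinct marital-status values
print(len(data.all_values()))  # |V|
```

Tagging each value with its attribute index keeps values from different attributes distinct even if they share the same label.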

intra-attribute coupling represents the interactions between the attribute values within an attribute and their value distribution; the intra-attribute coupling is measured from the distribution within the attribute by a value frequency function; although the value frequency function has only one input value, it measures the value distribution over all values. For example, the proportions of various groups among the bank's customers can be analyzed, such as the ratio of married to unmarried customers and the proportions of different education levels, which facilitates better analysis of the data.

For an attribute value $v_i^{(j)}$ in the $j$-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the intra-attribute coupling between this attribute value and the other attribute values of the attribute into a one-dimensional intra-attribute coupling vector, as follows:

$$m_{Ia}^{(j)}\left(v_i^{(j)}\right) = \left[\frac{\left|g^{(j)}\left(v_i^{(j)}\right)\right|}{n_o}\right] \tag{1}$$

wherein $g^{(j)}(\cdot): V^{(j)} \to O$ maps the value $v_i^{(j)}$ to the set of objects that take the value $v_i^{(j)}$ in the $j$-th attribute, and $|\cdot|$ denotes the number of elements in a set;

thus, for user information data with $n_a$ attributes, the intra-attribute coupling space is $M_{Ia} = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}\}$. The intra-attribute coupling space represents only a one-dimensional embedding of the categorical data space with respect to each attribute; the inter-attribute coupling below takes the interactions between attributes into account;
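The value frequency function can be sketched in a few lines. The sketch assumes that the frequency is the share of objects holding a value (the count of objects divided by the number of objects); the function name `intra_attribute_coupling` and the sample column are hypothetical.

```python
from collections import Counter


def intra_attribute_coupling(column):
    """Map each distinct value v of one attribute to the 1-D vector [|g(v)| / n_o]."""
    n_o = len(column)          # number of objects
    counts = Counter(column)   # |g(v)| for every value v
    return {v: [counts[v] / n_o] for v in sorted(counts)}


# Hypothetical marital-status column for four users.
marital = ["married", "single", "married", "married"]
m_ia = intra_attribute_coupling(marital)
print(m_ia)
```

Each value receives a one-element vector, which matches the one-dimensional embedding described above.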

inter-attribute coupling refers to the interactions between an attribute and the context (and/or semantic) information of the attribute values of other attributes; these attribute-based interactions and context information supplement the value distributions and interactions captured by the intra-attribute coupling; for example, one user has a low deposit and no transaction records and then suddenly has a large transaction, while another user has fixed deposits and expenditures each month, so the behavior of the former customer is suspicious. This shows that the attributes are interrelated, and combining inter-attribute coupling with intra-attribute coupling helps analyze customers' circumstances better.

The inter-attribute coupling is represented by an information conditional probability, which reveals the distribution of an attribute value in the space spanned by the values of other attributes; given an attribute value $v^{(j)}$ of attribute $a_j$ and an attribute value $v^{(k)}$ of attribute $a_k$, the information conditional probability function $p(v^{(j)} \mid v^{(k)})$ is defined as follows:

$$p\left(v^{(j)} \mid v^{(k)}\right) = \frac{\left|g^{(j)}\left(v^{(j)}\right) \cap g^{(k)}\left(v^{(k)}\right)\right|}{\left|g^{(k)}\left(v^{(k)}\right)\right|} \tag{3}$$

wherein $\cap$ returns the intersection of the two sets;

the inter-attribute coupling space $M_{Ie}^{(j)}$ of the $j$-th attribute is composed of the inter-attribute coupling vectors obtained by formula (4), as follows:

$$M_{Ie}^{(j)} = \left\{ m_{Ie}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \tag{5}$$

thus, for user information data with $n_a$ attributes, the inter-attribute coupling space is $M_{Ie} = \{M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)}\}$;
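The conditional-probability embedding can be illustrated as follows. This is a sketch under assumptions: the probability of a value given a value of another attribute is taken as the size of the intersection of the two object sets divided by the size of the conditioning set, and the helper names `value_objects` and `inter_attribute_coupling`, as well as the two sample columns, are hypothetical.

```python
def value_objects(column):
    """g(v): the set of object indexes that hold value v in this column."""
    groups = {}
    for i, v in enumerate(column):
        groups.setdefault(v, set()).add(i)
    return groups


def inter_attribute_coupling(columns, j):
    """For each value v of attribute j, return [p(v | v*) for every value v* of the other attributes]."""
    g_j = value_objects(columns[j])
    others = []  # object sets of all values of all attributes k != j, in a fixed order
    for k, column in enumerate(columns):
        if k == j:
            continue
        g_k = value_objects(column)
        others.extend(g_k[v] for v in sorted(g_k))
    return {
        v: [len(g_j[v] & objs) / len(objs) for objs in others]
        for v in sorted(g_j)
    }


# Hypothetical columns for four users.
marital = ["married", "single", "married", "married"]
education = ["bachelor", "bachelor", "master", "bachelor"]
m_ie = inter_attribute_coupling([marital, education], j=0)
print(m_ie)
```

With these two columns, each marital-status value is embedded as a two-dimensional vector, one entry per education value.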

If $|V| > 2|V^{(j)}| - 1$, the inter-attribute coupling learning function projects the attribute values into a higher-dimensional space, because the dimension of the inter-attribute coupling space equals $|V| - |V^{(j)}|$, whereas the degree of freedom of the $j$-th attribute (equal to the dimension obtained by converting its attribute values into dummy variables) is $|V^{(j)}| - 1$. The inter-attribute coupling space thus captures the value couplings induced by the other attributes, which complements the intra-attribute coupling to form a complete representation.
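A short worked example makes the dimension condition concrete. The sizes are assumed purely for illustration: three attributes with 2, 3 and 5 values, with the three-valued attribute taken as the $j$-th attribute.

$$\begin{aligned}
|V| &= 2 + 3 + 5 = 10, \qquad |V^{(j)}| = 3 \\
\text{dimension of the inter-attribute coupling space} &= |V| - |V^{(j)}| = 10 - 3 = 7 \\
\text{degree of freedom of the } j\text{-th attribute} &= |V^{(j)}| - 1 = 3 - 1 = 2 \\
|V| > 2|V^{(j)}| - 1 &\iff 10 > 5
\end{aligned}$$

The condition holds, and accordingly the embedding dimension 7 exceeds the degree of freedom 2. The two statements are equivalent in general, since $|V| - |V^{(j)}| > |V^{(j)}| - 1$ rearranges to $|V| > 2|V^{(j)}| - 1$.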

from the intra-attribute and inter-attribute points of view, using the constructed intra-attribute coupling space $M_{Ia}$ and inter-attribute coupling space $M_{Ie}$, a complete set of heterogeneous coupling spaces $M$ is further constructed as the union of the intra-attribute coupling space $M_{Ia}$ and the inter-attribute coupling space $M_{Ie}$, as follows:

$$M = M_{Ia} \cup M_{Ie} \tag{6}$$

i.e. $M = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}, M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)}\}$.

In order to integrate the heterogeneous couplings effectively, the learned heterogeneous coupling spaces are converted into uniform spaces in which the heterogeneous couplings are comparable. Specifically, each coupling space is converted into its corresponding kernel space using multiple kernels, where each kernel space corresponds to one converted coupling space; a set of $n_k$ kernel spaces $\{K_1, \ldots, K_{n_k}\}$ is generated, with $n_k = |M| \times |F|$, where $F$ is the set of kernel functions used for the conversion; the $p$-th kernel space is spanned by the kernel matrix $K_p$, $p \le n_k$; the kernel matrix $K_p$ is obtained from the heterogeneous coupling space $M_j$ through the kernel function $k_p(\cdot, \cdot)$, as follows:

$$K_p = \begin{bmatrix} k_p(m_1, m_1) & \cdots & k_p(m_1, m_{n_v^{(j)}}) \\ \vdots & \ddots & \vdots \\ k_p(m_{n_v^{(j)}}, m_1) & \cdots & k_p(m_{n_v^{(j)}}, m_{n_v^{(j)}}) \end{bmatrix} \tag{7}$$

to reveal the heterogeneity within the couplings, the weights of the values in each kernel space are learned. Specifically, through a set of transformation matrices $\{T_1, \ldots, T_{n_k}\}$, the kernel spaces $\{K_1, \ldots, K_{n_k}\}$ are reconstructed as $\{K'_1, \ldots, K'_{n_k}\}$, wherein the $p$-th kernel matrix $K'_p$ contains only the distribution to which the $p$-th kernel is sensitive and which suits the corresponding coupling; the reconstructed kernel spaces are the heterogeneous kernel spaces; $K'_p$ is defined as follows:

$$K'_p = T_p \cdot K_p \tag{8}$$

$T_p$ is defined as a diagonal matrix, as follows:

$$T_p = \operatorname{diag}\left(\alpha_{p1}, \ldots, \alpha_{p n_v^{(j)}}\right) \tag{9}$$

wherein $\alpha_{pj}$ is the weight of the $j$-th attribute value in the $p$-th kernel space; the larger $\alpha_{pj}$ is, the stronger the coupling that the $j$-th attribute value exhibits in the coupling space corresponding to the $p$-th kernel space.
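The construction of one kernel matrix and its reweighting can be sketched as below. Several points are assumptions rather than details of the embodiment: the Gaussian kernel stands in for one unspecified member of the kernel set, the weights `alpha` are fixed by hand although the method learns them, and the function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2); an assumed choice of kernel function."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_matrix(vectors, kernel):
    """K_p: pairwise kernel values between the value representations of one coupling space."""
    return [[kernel(x, y) for y in vectors] for x in vectors]


def reweight(K, alpha):
    """K'_p = T_p . K_p with T_p = diag(alpha): row j of K_p is scaled by alpha[j]."""
    return [[alpha[j] * entry for entry in row] for j, row in enumerate(K)]


# Hypothetical coupling space: intra-attribute vectors of two attribute values.
M_j = [[0.75], [0.25]]
K = kernel_matrix(M_j, gaussian_kernel)
K_prime = reweight(K, alpha=[1.0, 0.5])  # hand-picked weights, for illustration only
print(K)
print(K_prime)
```

Left-multiplying by a diagonal matrix scales rows, so a value with a small weight contributes a correspondingly damped row to the reconstructed kernel matrix.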

The above converts the bank customer information into a plurality of kernel spaces and represents the relationships between pieces of customer information by kernel matrices. To reveal the heterogeneity within the couplings, the weights of the attribute values in each kernel space are learned: a set of transformation matrices is learned to reconstruct the kernel spaces and to find the attribute values with large weights, i.e., the attribute values with high coupling strength.
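To make the number of kernel spaces concrete, the pairs of coupling space and kernel function can simply be enumerated. The sketch assumes that the set of coupling spaces holds one intra-attribute and one inter-attribute space per attribute; the attribute count and the two kernel names are hypothetical.

```python
from itertools import product

n_a = 3  # hypothetical number of attributes

# M: one intra-attribute and one inter-attribute coupling space per attribute.
coupling_spaces = [f"M_Ia^({j})" for j in range(1, n_a + 1)]
coupling_spaces += [f"M_Ie^({j})" for j in range(1, n_a + 1)]

kernels = ["linear", "gaussian"]  # F: assumed set of kernel functions

# Every (coupling space, kernel) pair yields one kernel space.
kernel_spaces = list(product(coupling_spaces, kernels))
n_k = len(kernel_spaces)
print(n_k)
```

The count grows linearly in both the number of attributes and the number of kernels, which is why a per-space weight is useful for suppressing uninformative spaces.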

to further capture the heterogeneity between couplings, the effect of each heterogeneous kernel space on the final result is learned. First, a similarity measure between objects in the heterogeneous kernel spaces is defined; then the weight of each kernel space is learned based on the similarity measure to reflect its contribution; given an attribute data set, for the $p$-th kernel matrix, let $i$ and $i'$ denote the indexes of the values in the $p$-th kernel space that correspond to the $i$-th object and the $i'$-th object, respectively, and let $K'_{p,i}$ (the $i$-th row of $K'_p$) represent the $i$-th object in the $p$-th heterogeneous kernel space; the similarity $S_{p,ii'}$ between the $i$-th and $i'$-th objects, measured by the linear kernel in that space, is as follows:

$$S_{p,ii'} = K'_{p,i} \cdot {K'_{p,i'}}^{\top} \tag{10}$$

the final similarity $S_{ii'}$ between the $i$-th object and the $i'$-th object is defined as a linear combination of the similarity measures from the heterogeneous kernel spaces, so as to filter redundant information and integrate complementary information between couplings, as follows:

$$S_{ii'} = \sum_{p=1}^{n_k} \beta_p S_{p,ii'} \tag{11}$$

Similarity learning is performed on the bank customer information: the similarity between each pair of objects is calculated, which helps find the bank customer records that are dissimilar to the rest; such problematic data is abnormal data. (This process filters out attributes that do not help detection, such as gender, i.e., attributes with low coupling.)
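The similarity computation can be sketched as a weighted sum of per-space dot products. The sketch assumes that each object is represented in a kernel space by the row of the reweighted kernel matrix belonging to the value the object takes; the function names, the identity kernel matrix, and the weight 2.0 are hypothetical.

```python
def dot(x, y):
    """Linear kernel between two row vectors."""
    return sum(a * b for a, b in zip(x, y))


def object_similarity(kernel_spaces, value_index, beta):
    """Combine per-space linear-kernel similarities into one matrix S.

    kernel_spaces[p]  -- reweighted kernel matrix K'_p as a list of rows
    value_index[p][i] -- row of K'_p that represents object o_i in space p
    beta[p]           -- nonnegative weight of space p
    """
    n_o = len(value_index[0])
    S = [[0.0] * n_o for _ in range(n_o)]
    for K_p, idx, b in zip(kernel_spaces, value_index, beta):
        for i in range(n_o):
            for i2 in range(n_o):
                # Per-space similarity: linear kernel between the two objects' rows.
                S[i][i2] += b * dot(K_p[idx[i]], K_p[idx[i2]])
    return S


# Hypothetical input: one kernel space over two attribute values, three users.
K1 = [[1.0, 0.0], [0.0, 1.0]]
S = object_similarity([K1], value_index=[[0, 1, 0]], beta=[2.0])
print(S[0][2], S[0][1])
```

Users 0 and 2 share the same attribute value and therefore the same row, so they come out as similar, while users 0 and 1 do not.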

The similarity between two objects is calculated by formula (11). The sum of the similarities between object $i$ and all other objects is then divided by the number of objects (excluding object $i$), which gives the normality degree of the object; the lower the normality degree, the more likely the object is an outlier. An outlier ranking is therefore generated by the top-k method, and the most likely outliers are selected. The outlier score of the $i$-th object is defined as the sum of the similarities between object $i$ and all other objects, divided by the number of those objects, as follows:

$$\mathrm{score}(o_i) = \frac{1}{n_o - 1} \sum_{i' \in N_o,\, i' \neq i} S_{ii'} \tag{12}$$

Through the above similarity learning, the similarity between objects can be calculated; customers with low similarity rank high in data abnormality. When customer data is evaluated for anomalies, extra attention must be paid to the highly ranked objects, which are checked for fraudulent behavior so that fraud can be clearly identified.
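The ranking step can be sketched as follows. The sketch assumes that the score averages the similarity to all other objects, excluding the object itself, and that the k objects with the lowest average are reported; the function name `outlier_ranking` and the similarity matrix are hypothetical.

```python
def outlier_ranking(S, k):
    """Return the average-similarity scores and the k objects with the lowest scores."""
    n_o = len(S)
    scores = [
        sum(S[i][j] for j in range(n_o) if j != i) / (n_o - 1)
        for i in range(n_o)
    ]
    # A low average similarity means the object resembles few others.
    order = sorted(range(n_o), key=lambda i: scores[i])
    return scores, order[:k]


# Hypothetical similarity matrix: users 0-2 resemble each other, user 3 does not.
S = [
    [1.0, 0.9, 0.9, 0.1],
    [0.9, 1.0, 0.9, 0.1],
    [0.9, 0.9, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
scores, top_k = outlier_ranking(S, k=1)
print(top_k)
```

Excluding the diagonal prevents each object's similarity to itself from inflating its score, which matters most in small data sets.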

After outliers are detected by the method of the invention, they can be deleted, because outliers are the very few data points in a data set that differ significantly from the mainstream data. The remaining normal data can then be analyzed to better formulate the bank's protective measures and to analyze customers' circumstances, so as to formulate better customer service schemes, bank positioning schemes, and the like. The detected outliers can also be analyzed, since these few data objects may carry important information.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims

1. A bank abnormal data detection method based on outlier detection is characterized in that: the method comprises the following steps:

step 1: defining bank user information data in a triplet form;

defining the user information data as a triplet $C = \langle O, A, V \rangle$, wherein $O = \{o_i \mid i \in N_o\}$ is a set of objects with $n_o$ users; $A = \{a_j \mid j \in N_a\}$ is a set of attributes with $n_a$ attributes; $V = \bigcup_{j=1}^{n_a} V^{(j)}$ is the collection of the attribute values of the $n_a$ attributes; $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of attribute values of attribute $a_j$, with $n_v^{(j)}$ attribute values; $N_o$, $N_a$ and $N_v^{(j)}$ are respectively the index sets of the objects in $O$, the attributes in $A$ and the attribute values in $V^{(j)}$; for the $i$-th object $o_i$, the attribute value of the $j$-th attribute $a_j$ is expressed as $v_{o_i}^{(j)}$;

intra-attribute coupling represents the interactions between the attribute values within an attribute and their value distribution; the intra-attribute coupling is measured from the distribution within the attribute by a value frequency function; for an attribute value $v_i^{(j)}$ in the $j$-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the intra-attribute coupling between this attribute value and the other attribute values of the attribute into a one-dimensional intra-attribute coupling vector, as follows:

$$m_{Ia}^{(j)}\left(v_i^{(j)}\right) = \left[\frac{\left|g^{(j)}\left(v_i^{(j)}\right)\right|}{n_o}\right] \tag{1}$$

thus, for user information data with $n_a$ attributes, the intra-attribute coupling space is $M_{Ia} = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}\}$;

wherein $\cap$ returns the intersection of the two sets;

based on the information conditional probability function, the inter-attribute coupling learning function $m_{Ie}^{(j)}(\cdot)$ embeds the interactions between the attribute value $v_i^{(j)}$ and the other attributes as a $|V_*|$-dimensional inter-attribute coupling vector, as follows:

$$m_{Ie}^{(j)}\left(v_i^{(j)}\right) = \left[p\left(v_i^{(j)} \mid v_{*1}\right), \ldots, p\left(v_i^{(j)} \mid v_{*|V_*|}\right)\right]^{\top} \tag{4}$$

thus, for user information data with $n_a$ attributes, the inter-attribute coupling space is $M_{Ie} = \{M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)}\}$;

from the constructed intra-attribute coupling space $M_{Ia}$ and inter-attribute coupling space $M_{Ie}$, a complete set of heterogeneous coupling spaces $M$ is further constructed as the union of the intra-attribute coupling space $M_{Ia}$ and the inter-attribute coupling space $M_{Ie}$, as follows:

$$M = M_{Ia} \cup M_{Ie} \tag{6}$$

i.e. $M = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}, M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)}\}$;

each coupling space is converted into its respective kernel space using a plurality of kernels, wherein each kernel space corresponds to one converted coupling space; a set of $n_k$ kernel spaces $\{K_1, \ldots, K_{n_k}\}$ is generated, with $n_k = |M| \times |F|$, where $F$ is the set of kernel functions used for the conversion; the $p$-th kernel space is spanned by the kernel matrix $K_p$, $p \le n_k$; the kernel matrix $K_p$ is obtained from the heterogeneous coupling space $M_j$ through the kernel function $k_p(\cdot, \cdot)$, as follows:

$$K_p = \begin{bmatrix} k_p(m_1, m_1) & \cdots & k_p(m_1, m_{n_v^{(j)}}) \\ \vdots & \ddots & \vdots \\ k_p(m_{n_v^{(j)}}, m_1) & \cdots & k_p(m_{n_v^{(j)}}, m_{n_v^{(j)}}) \end{bmatrix} \tag{7}$$

through a set of transformation matrices $\{T_1, \ldots, T_{n_k}\}$, the kernel spaces $\{K_1, \ldots, K_{n_k}\}$ are reconstructed as $\{K'_1, \ldots, K'_{n_k}\}$, wherein the $p$-th kernel matrix $K'_p$ contains only the distribution to which the $p$-th kernel is sensitive and which suits the corresponding coupling; the reconstructed kernel spaces are the heterogeneous kernel spaces; $K'_p$ is defined as follows:

$$K'_p = T_p \cdot K_p \tag{8}$$

$T_p$ is defined as a diagonal matrix, as follows:

$$T_p = \operatorname{diag}\left(\alpha_{p1}, \ldots, \alpha_{p n_v^{(j)}}\right) \tag{9}$$

step 6: based on the similarity between objects in the heterogeneous kernel space, calculating the outliers of each object, sequencing the outliers by adopting a top-k method, and selecting the most likely outliers to realize the detection of abnormal data of the bank.

2. The method for detecting abnormal data of a bank based on outlier detection according to claim 1, wherein: the specific method in the step 5 is as follows:

first, a similarity measure between objects in the heterogeneous kernel spaces is defined; then the weight of each kernel space is learned based on the similarity measure to reflect the contribution of that kernel space; given an attribute data set, for the $p$-th kernel matrix, let $i$ and $i'$ denote the indexes of the values in the $p$-th kernel space that correspond to the $i$-th object and the $i'$-th object, respectively, and let $K'_{p,i}$ represent the $i$-th object in the $p$-th heterogeneous kernel space; the similarity $S_{p,ii'}$ between the $i$-th and $i'$-th objects, measured by the linear kernel in that space, is as follows:

$$S_{p,ii'} = K'_{p,i} \cdot {K'_{p,i'}}^{\top} \tag{10}$$

wherein $\beta_p \ge 0$ is the weight of the similarity in the $p$-th heterogeneous kernel space, and $T_p$ is a diagonal matrix.

3. The method for detecting abnormal data of a bank based on outlier detection according to claim 2, wherein: step 6 defines the outlier score of the $i$-th object as the sum of the similarities between object $i$ and all other objects, divided by the number of those objects, as follows:

$$\mathrm{score}(o_i) = \frac{1}{n_o - 1} \sum_{i' \in N_o,\, i' \neq i} S_{ii'} \tag{12}$$