CN113239024B - Bank abnormal data detection method based on outlier detection - Google Patents

Bank abnormal data detection method based on outlier detection

Info

Publication number
CN113239024B
CN113239024B CN202110434414.5A CN202110434414A CN113239024B CN 113239024 B CN113239024 B CN 113239024B CN 202110434414 A CN202110434414 A CN 202110434414A CN 113239024 B CN113239024 B CN 113239024B
Authority
CN
China
Prior art keywords
attribute
coupling
space
kernel
attributes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110434414.5A
Other languages
Chinese (zh)
Other versions
CN113239024A (en
Inventor
郭鹏飞
王灏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Original Assignee
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Technical University filed Critical Liaoning Technical University
Priority to CN202110434414.5A priority Critical patent/CN113239024B/en
Publication of CN113239024A publication Critical patent/CN113239024A/en
Application granted granted Critical
Publication of CN113239024B publication Critical patent/CN113239024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02: Banking, e.g. interest calculation or account maintenance
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04: Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a bank abnormal data detection method based on outlier detection and relates to the technical field of outlier detection. First, the bank user information data is defined in triplet form. Next, the couplings within and between the attributes of the user information data are learned, and an intra-attribute coupling space and an inter-attribute coupling space are constructed. Heterogeneity learning is then performed in the kernel spaces of the user information data, after which similarity learning determines the similarity between objects in the heterogeneous kernel spaces. Finally, an outlier score is calculated for each object from these similarities, the scores are ranked, and the most likely outliers are selected, thereby detecting the abnormal bank data. The method has a high outlier detection rate and is suitable for outlier detection on high-dimensional big data, so that accurate detection of outliers in bank user information data improves the detection of abnormal bank data.

Description

Bank abnormal data detection method based on outlier detection
Technical Field
The invention relates to the technical field of outlier detection, in particular to a bank abnormal data detection method based on outlier detection.
Background
At present, every banking system has a large number of users and therefore generates a large amount of user information data, on the basis of which a bank can formulate better customer service schemes, positioning schemes and the like. However, these data contain many abnormal records; by detecting them, a bank can place the corresponding clients under focused management and discover fraud and similar behavior in time.
Outlier detection is an important part of the data mining field. It focuses on identifying the data objects in a real data set that are inconsistent with, or deviate from, normal data. Outliers are rare objects whose characteristics differ from those of the majority of normal objects; their presence strongly affects normal data analysis and may mislead its results. Outlier detection therefore plays an important role in many real-world applications, such as detecting network intrusion, rare diseases and bank fraud. Identifying outliers is very difficult, especially for data that are not independent and identically distributed (non-IID) and have complex relationships.
Coupling learning is an emerging sub-field of data mining. It analyzes the various complex relationships in non-IID data and models the original data accordingly. One question raised by coupling learning is whether learning more couplings improves the representation quality of the data; exploiting the correlation between two attributes can greatly reduce redundant information and yield better representation performance. Another key issue is the need to capture interactions and relationships that complement, and that disagree with, each other while learning different types of couplings. Different interactions and different data distributions give rise to different types of coupling, and data sets follow various distributions; both are reasons why heterogeneous couplings exist. Early coupling learning focused on a single coupling or only a few strong couplings and rarely considered multiple couplings or weak couplings, although weak couplings can often determine much important content; this is what previous work neglected.
Coupling between attribute values represents the manner and degree in which the values within an attribute are coupled, and coupling between attributes represents the manner and degree in which attributes are coupled; methods based on these two kinds of coupling have proven effective. However, existing work involves only a single coupling and ignores many other features of categorical data. An outlier detection method that learns heterogeneous and hierarchical couplings can represent the complex relationships in the original data more accurately and can therefore identify outlying objects, which differ from normal objects, more accurately.
Disclosure of Invention
The invention aims to solve the technical problem of providing a bank abnormal data detection method based on outlier detection to realize detection of bank abnormal data.
In order to solve the technical problems, the invention adopts the following technical scheme: a bank abnormal data detection method based on outlier detection comprises the following steps:
Step 1: defining the bank user information data in triplet form;
The user information data is defined as a triplet $C = \langle O, A, V \rangle$, where $O = \{o_i \mid i \in N_o\}$ is the set of objects, containing $n_o$ users; $A = \{a_j \mid j \in N_a\}$ is the set of attributes, containing $n_a$ attributes; $V = \bigcup_{j=1}^{n_a} V^{(j)}$ is the set of all attribute values; $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of the $n_v^{(j)}$ attribute values of attribute $a_j$; $N_o$, $N_a$ and $N_v^{(j)}$ are the index sets of the objects in $O$, the attributes in $A$ and the attribute values in $V^{(j)}$, respectively; for the i-th object $o_i$, the value of the j-th attribute $a_j$ is denoted $v_{o_i}^{(j)}$;
Step 2: learning the coupling within the attributes of the user information data and constructing the intra-attribute coupling space;
Intra-attribute coupling represents the interaction between the attribute values of an attribute and the distribution of those values; it is measured from the distribution within the attribute through a value frequency function. For an attribute value $v_i^{(j)}$ of the j-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the intra-attribute coupling between this value and the other values of the attribute to a one-dimensional intra-attribute coupling vector, as shown in the following formula:
$$m_{Ia}^{(j)}\left(v_i^{(j)}\right) = \left[\frac{\left|g^{(j)}\left(v_i^{(j)}\right)\right|}{n_o}\right]^{\top} \tag{1}$$
where $g^{(j)}(\cdot): V^{(j)} \to O$ maps the value $v_i^{(j)}$ to the set of objects whose j-th attribute takes the value $v_i^{(j)}$, and $|\cdot|$ denotes the number of elements of a set;
The intra-attribute coupling space $M_{Ia}^{(j)}$ of the j-th attribute then consists of the intra-attribute coupling vectors obtained from equation (1), and is defined as follows:
$$M_{Ia}^{(j)} = \left\{ m_{Ia}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \tag{2}$$
Thus, for user information data with $n_a$ attributes, the intra-attribute coupling space is $M_{Ia} = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}\}$;
Step 3: learning the coupling between the attributes of the user information data and constructing the inter-attribute coupling space;
Inter-attribute coupling refers to the interaction between an attribute and the context information carried by the values of the other attributes; these attribute-based interactions and this context information complement the value distributions and interactions captured by the intra-attribute coupling;
The inter-attribute coupling is represented by the information conditional probability, which reveals the distribution of an attribute value in the space spanned by the values of the other attributes; given a value $v^{(j)}$ of attribute $a_j$ and a value $v^{(k)}$ of attribute $a_k$, the information conditional probability function $p(v^{(j)} \mid v^{(k)})$ is defined as follows:
$$p\left(v^{(j)} \mid v^{(k)}\right) = \frac{\left|g^{(j)}\left(v^{(j)}\right) \cap g^{(k)}\left(v^{(k)}\right)\right|}{\left|g^{(k)}\left(v^{(k)}\right)\right|} \tag{3}$$
where $\cap$ returns the intersection of two sets;
Based on the information conditional probability function, the inter-attribute coupling learning function $m_{Ie}^{(j)}(\cdot)$ embeds the interactions of the attribute value $v_i^{(j)}$ with the other attributes as an inter-attribute coupling vector of dimension $|V_*|$, as shown in the following formula:
$$m_{Ie}^{(j)}\left(v_i^{(j)}\right) = \left[ p\left(v_i^{(j)} \mid v_{*1}\right), \ldots, p\left(v_i^{(j)} \mid v_{*|V_*|}\right) \right]^{\top} \tag{4}$$
where $V_* = \{V^{(k)} \mid k \in N_a, k \neq j\}$ is the set of attribute values of all attributes except $a_j$, and $v_{*i} \in V_*$ is an attribute value in the set $V_*$;
The inter-attribute coupling space $M_{Ie}^{(j)}$ of the j-th attribute consists of the inter-attribute coupling vectors obtained from equation (4), as shown in the following formula:
$$M_{Ie}^{(j)} = \left\{ m_{Ie}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \tag{5}$$
Thus, for user information data with $n_a$ attributes, the inter-attribute coupling space is $M_{Ie} = \{M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)}\}$;
Step 4: performing heterogeneity learning in the kernel spaces of the user information data through the constructed intra-attribute coupling space and inter-attribute coupling space;
From the constructed intra-attribute coupling space $M_{Ia}$ and inter-attribute coupling space $M_{Ie}$, a complete set of heterogeneous coupling spaces $M$ is constructed as the union of $M_{Ia}$ and $M_{Ie}$, as shown in the following formula:
$$M = M_{Ia} \cup M_{Ie} \tag{6}$$
i.e. $M = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}, M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)}\}$;
Each coupling space is converted into its own kernel spaces using a plurality of kernels, where each kernel space corresponds to one converted coupling space; a group of $n_k$ kernel spaces $\{K_1, \ldots, K_{n_k}\}$ is generated, with $n_k = |M| \times |F|$, where $F$ is the set of kernel functions used for the conversion; the p-th kernel space is spanned by the kernel matrix $K_p$, $p \le n_k$; the kernel matrix $K_p$ is computed from the heterogeneous coupling space $M_j$ through the kernel function $k_p(\cdot, \cdot)$, as shown in the following formula:
$$K_p = \begin{bmatrix} k_p(m_1, m_1) & \cdots & k_p\left(m_1, m_{n_v^{(j)}}\right) \\ \vdots & \ddots & \vdots \\ k_p\left(m_{n_v^{(j)}}, m_1\right) & \cdots & k_p\left(m_{n_v^{(j)}}, m_{n_v^{(j)}}\right) \end{bmatrix} \tag{7}$$
where $n_v^{(j)}$ is the number of attribute values represented by $M_j$, and $m_j$ is the representation of the j-th attribute value in the heterogeneous coupling space $M_j$;
Through a set of transformation matrices $\{T_1, \ldots, T_{n_k}\}$, the kernel spaces $\{K_1, \ldots, K_{n_k}\}$ are reconstructed as $\{K'_1, \ldots, K'_{n_k}\}$, where the p-th kernel matrix $K'_p$ contains only the distribution to which the p-th kernel is sensitive and which suits the corresponding coupling; the re-weighted kernel spaces are the heterogeneous kernel spaces; $K'_p$ is defined as follows:
$$K'_p = T_p \cdot K_p \tag{8}$$
$T_p$ is defined as a diagonal matrix, as shown in the following formula:
$$T_p = \begin{bmatrix} \alpha_{p1} & & \\ & \ddots & \\ & & \alpha_{p n_v^{(j)}} \end{bmatrix} \tag{9}$$
where $\alpha_{pj}$ is the weight of the j-th attribute value in the p-th kernel space;
Step 5: performing similarity learning on the user information data and determining the similarity between objects in the heterogeneous kernel spaces;
A similarity measure between objects in the heterogeneous kernel spaces is defined first, and the weight of each kernel space is then learned on the basis of this measure to reflect the contribution of that kernel space; given a categorical data set, for the p-th kernel matrix, let $i$ and $i'$ denote the indexes of the values in the p-th kernel space that correspond to the i-th object and the i'-th object, respectively, and let $K'_{p,i}$ represent the i-th object in the p-th heterogeneous kernel space; the similarity $S_{p,ii'}$ of the i-th and i'-th objects in that space, measured by the linear kernel, is shown in the following formula:
$$S_{p,ii'} = K'_{p,i} \cdot {K'_{p,i'}}^{\top} \tag{10}$$
The final similarity $S_{ii'}$ between the i-th object and the i'-th object is defined as a linear combination of the similarity measures from the heterogeneous kernel spaces, in order to filter redundant information and integrate complementary information between the couplings, as shown in the following formula:
$$S_{ii'} = \sum_{p=1}^{n_k} \beta_p S_{p,ii'} \tag{11}$$
where $\beta_p \ge 0$ is the weight of the similarity in the p-th heterogeneous kernel space, and [...] represents a diagonal matrix;
Step 6: calculating the outlier score of each object based on the similarity between objects in the heterogeneous kernel spaces, ranking the scores with a top-k method, and selecting the most likely outliers, thereby detecting the abnormal bank data;
The outlier score of the i-th object is defined as the sum of the similarities between object i and all objects, divided by the number of objects, as shown in the following formula:
The beneficial effects of the above technical scheme are as follows: the bank abnormal data detection method based on outlier detection studies the complex coupling relations among different data objects in greater depth and can reflect the real relations among the data objects more accurately; applying heterogeneous coupling to outlier detection effectively improves the accuracy of the outlier detection algorithm; the method has a high outlier detection rate and is suitable for outlier detection on high-dimensional big data, so that accurate detection of outliers in bank user information data improves the detection of abnormal bank data.
Drawings
Fig. 1 is a flowchart of a method for detecting abnormal bank data based on outlier detection according to an embodiment of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
In real-world situations, data objects interact with one another. The data are often not independent and identically distributed (non-IID); that is, values of the same feature, different features and even different objects are related by more or less complex couplings. Coupling learning is an important approach to non-IID data: when learning each object, the couplings within the same feature, between different features and between different data objects need to be considered hierarchically. The method therefore adds heterogeneous coupling to the coupling analysis, so that the complex relations between the original data objects can be analyzed more accurately and the accuracy of outlier detection is improved.
In this embodiment, the user information data of a bank over a certain period of time is taken as an example, and the bank abnormal data detection method based on outlier detection is used to detect the outliers in the user information data of that period, thereby detecting the abnormal data. In this embodiment, the bank abnormal data detection method based on outlier detection, as shown in fig. 1, includes the following steps:
Step 1: defining the bank user information data in triplet form;
The user information data is defined as a triplet $C = \langle O, A, V \rangle$, where $O = \{o_i \mid i \in N_o\}$ is the set of objects, containing $n_o$ users; $A = \{a_j \mid j \in N_a\}$ is the set of attributes, containing $n_a$ attributes (that is, customer information such as gender, education level, marital status, job status, deposit status and recent transaction status); $V = \bigcup_{j=1}^{n_a} V^{(j)}$ is the set of all attribute values (for example male and female, junior middle school, family, deposit amount, recent transactions and so on); $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of the $n_v^{(j)}$ attribute values of attribute $a_j$ (for example, the value "married" of the marital status attribute corresponds to the set of all objects whose marital status is married); $N_o$, $N_a$ and $N_v^{(j)}$ are the index sets of the objects in $O$, the attributes in $A$ and the attribute values in $V^{(j)}$, respectively; for the i-th object $o_i$, the value of the j-th attribute $a_j$ is denoted $v_{o_i}^{(j)}$;
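The triplet of objects, attributes and attribute values described in Step 1 can be pictured as a small table. The following is a minimal sketch, not the patent's implementation: the class name `CategoricalData`, its field names and the toy customer records are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class CategoricalData:
    """Illustrative container for the triplet <O, A, V> (names are assumptions)."""
    objects: list      # O: one identifier per bank user
    attributes: list   # A: attribute names
    rows: list         # rows[i][j] is the value of attribute j for object i

    def values(self, j):
        """V^(j): the distinct values observed for attribute j, sorted."""
        return sorted({row[j] for row in self.rows})


# Hypothetical toy records, invented for this example only.
data = CategoricalData(
    objects=["u1", "u2", "u3", "u4"],
    attributes=["marital_status", "job_status"],
    rows=[
        ("married", "employed"),
        ("married", "employed"),
        ("single", "employed"),
        ("single", "unemployed"),
    ],
)

print(data.values(0))  # distinct marital-status values
print(data.values(1))  # distinct job-status values
```

Storing one tuple per object keeps the lookup of an object's attribute value a plain index operation, which is all the later steps need.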
Step 2: learning the coupling within the attributes of the user information data and constructing the intra-attribute coupling space;
Intra-attribute coupling represents the interaction between the attribute values of an attribute and the distribution of those values; it is measured from the distribution within the attribute through a value frequency function; although the value frequency function takes only one value as input, it measures the distribution over all values. For example, the proportions of the various groups among the bank's clients can be analyzed, such as the ratio of married to unmarried people or the shares of different education levels, which supports a better analysis of the data.
For an attribute value $v_i^{(j)}$ of the j-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the intra-attribute coupling between this value and the other values of the attribute to a one-dimensional intra-attribute coupling vector, as shown in the following formula:
$$m_{Ia}^{(j)}\left(v_i^{(j)}\right) = \left[\frac{\left|g^{(j)}\left(v_i^{(j)}\right)\right|}{n_o}\right]^{\top} \tag{1}$$
where $g^{(j)}(\cdot): V^{(j)} \to O$ maps the value $v_i^{(j)}$ to the set of objects whose j-th attribute takes the value $v_i^{(j)}$, and $|\cdot|$ denotes the number of elements of a set;
The intra-attribute coupling space $M_{Ia}^{(j)}$ of the j-th attribute then consists of the intra-attribute coupling vectors obtained from equation (1), and is defined as follows:
$$M_{Ia}^{(j)} = \left\{ m_{Ia}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \tag{2}$$
Thus, for user information data with $n_a$ attributes, the intra-attribute coupling space is $M_{Ia} = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}\}$. The intra-attribute coupling space represents only a one-dimensional embedding of the categorical data space with respect to each attribute; the inter-attribute coupling described next takes the interactions between attributes into account;
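As an illustration of the value frequency function in Step 2, the sketch below maps each distinct value of one attribute to a one-dimensional vector holding its relative frequency. It assumes that the frequency is the count of objects with that value divided by the total number of objects; the function name `intra_attribute_space` and the sample column are hypothetical.

```python
from collections import Counter


def intra_attribute_space(column):
    """Map each distinct value of one attribute to a 1-D frequency vector.

    Assumption: the frequency of a value is the number of objects that
    take it, divided by the total number of objects.
    """
    n_objects = len(column)
    counts = Counter(column)
    return {value: [count / n_objects] for value, count in counts.items()}


# Hypothetical marital-status column for four customers.
marital = ["married", "married", "single", "married"]
space = intra_attribute_space(marital)

for value, vector in sorted(space.items()):
    print(value, vector)
```

Calling the function once per attribute and collecting the results would give one such space per attribute, matching the per-attribute construction described above.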
Step 3: learning the coupling between the attributes of the user information data and constructing the inter-attribute coupling space;
Inter-attribute coupling refers to the interaction between an attribute and the context (and/or semantic) information carried by the values of the other attributes; these attribute-based interactions and this context information complement the value distributions and interactions captured by the intra-attribute coupling. For example, one user has a low deposit and no transaction records and then suddenly makes a large transaction, while another user has regular deposits and expenses every month; the behavior of the former is the more suspicious. This shows that the attributes are interrelated, and combining inter-attribute coupling with intra-attribute coupling helps to analyze the condition of the customers better.
The inter-attribute coupling is represented by the information conditional probability, which reveals the distribution of an attribute value in the space spanned by the values of the other attributes; given a value $v^{(j)}$ of attribute $a_j$ and a value $v^{(k)}$ of attribute $a_k$, the information conditional probability function $p(v^{(j)} \mid v^{(k)})$ is defined as follows:
$$p\left(v^{(j)} \mid v^{(k)}\right) = \frac{\left|g^{(j)}\left(v^{(j)}\right) \cap g^{(k)}\left(v^{(k)}\right)\right|}{\left|g^{(k)}\left(v^{(k)}\right)\right|} \tag{3}$$
where $\cap$ returns the intersection of two sets;
Based on the information conditional probability function, the inter-attribute coupling learning function $m_{Ie}^{(j)}(\cdot)$ embeds the interactions of the attribute value $v_i^{(j)}$ with the other attributes as an inter-attribute coupling vector of dimension $|V_*|$, as shown in the following formula:
$$m_{Ie}^{(j)}\left(v_i^{(j)}\right) = \left[ p\left(v_i^{(j)} \mid v_{*1}\right), \ldots, p\left(v_i^{(j)} \mid v_{*|V_*|}\right) \right]^{\top} \tag{4}$$
where $V_* = \{V^{(k)} \mid k \in N_a, k \neq j\}$ is the set of attribute values of all attributes except $a_j$, and $v_{*i} \in V_*$ is an attribute value in the set $V_*$;
The inter-attribute coupling space $M_{Ie}^{(j)}$ of the j-th attribute consists of the inter-attribute coupling vectors obtained from equation (4), as shown in the following formula:
$$M_{Ie}^{(j)} = \left\{ m_{Ie}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \tag{5}$$
Thus, for user information data with $n_a$ attributes, the inter-attribute coupling space is $M_{Ie} = \{M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)}\}$.
If $|V| > 2|V^{(j)}| - 1$, the inter-attribute coupling learning function projects the attribute values into a higher-dimensional space, because the dimension of the inter-attribute coupling space equals $|V| - |V^{(j)}|$, whereas the degree of freedom of the j-th attribute (equal to the number of dimensions obtained by converting its values into dummy variables) is $|V^{(j)}| - 1$. The inter-attribute coupling space thus captures the value couplings induced by the other attributes, which complements the intra-attribute coupling to form a complete representation.
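The inter-attribute coupling vector of Step 3 can be sketched as follows, assuming the information conditional probability is the size of the intersection of the two object sets divided by the size of the conditioning set. The helper names (`objects_with`, `conditional_probability`, `inter_attribute_vector`), the ordering of the vector entries and the toy rows are hypothetical choices made for this example.

```python
def objects_with(rows, j, value):
    """Indices of the objects whose attribute j equals `value`."""
    return {i for i, row in enumerate(rows) if row[j] == value}


def conditional_probability(rows, j, vj, k, vk):
    """Share of the objects with attribute k == vk that also have attribute j == vj."""
    given = objects_with(rows, k, vk)
    if not given:
        return 0.0  # convention chosen here for values that never occur
    return len(objects_with(rows, j, vj) & given) / len(given)


def inter_attribute_vector(rows, j, vj):
    """Embed value vj of attribute j against every value of every other attribute."""
    vector = []
    for k in range(len(rows[0])):
        if k == j:
            continue  # only the other attributes contribute
        for vk in sorted({row[k] for row in rows}):
            vector.append(conditional_probability(rows, j, vj, k, vk))
    return vector


# Hypothetical records: (marital_status, job_status).
rows = [
    ("married", "employed"),
    ("married", "employed"),
    ("single", "employed"),
    ("single", "unemployed"),
]
vec = inter_attribute_vector(rows, 0, "married")
print(vec)  # one entry per value of the other attribute
```

With two job-status values, the vector for "married" has two entries, in line with the statement that its dimension is the number of values of all other attributes.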
Step 4: performing heterogeneity learning in the kernel spaces of the user information data through the constructed intra-attribute coupling space and inter-attribute coupling space;
From the intra-attribute and inter-attribute points of view, the constructed intra-attribute coupling space $M_{Ia}$ and inter-attribute coupling space $M_{Ie}$ are used to construct a complete set of heterogeneous coupling spaces $M$ as the union of $M_{Ia}$ and $M_{Ie}$, as shown in the following formula:
$$M = M_{Ia} \cup M_{Ie} \tag{6}$$
i.e. $M = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}, M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)}\}$.
In order to integrate the heterogeneous couplings in the learned set of coupling spaces effectively, the learned heterogeneous coupling spaces are converted into uniform spaces in which the heterogeneous couplings are comparable. Specifically, each coupling space is converted into its corresponding kernel spaces using multiple kernels, where each kernel space corresponds to one converted coupling space; a group of $n_k$ kernel spaces $\{K_1, \ldots, K_{n_k}\}$ is generated, with $n_k = |M| \times |F|$, where $F$ is the set of kernel functions used for the conversion; the p-th kernel space is spanned by the kernel matrix $K_p$, $p \le n_k$; the kernel matrix $K_p$ is computed from the heterogeneous coupling space $M_j$ through the kernel function $k_p(\cdot, \cdot)$, as shown in the following formula:
$$K_p = \begin{bmatrix} k_p(m_1, m_1) & \cdots & k_p\left(m_1, m_{n_v^{(j)}}\right) \\ \vdots & \ddots & \vdots \\ k_p\left(m_{n_v^{(j)}}, m_1\right) & \cdots & k_p\left(m_{n_v^{(j)}}, m_{n_v^{(j)}}\right) \end{bmatrix} \tag{7}$$
where $n_v^{(j)}$ is the number of attribute values represented by $M_j$, and $m_j$ is the representation of the j-th attribute value in the heterogeneous coupling space $M_j$;
To reveal the heterogeneity within the couplings, the weights of the values in each kernel space are learned. Specifically, through a set of transformation matrices $\{T_1, \ldots, T_{n_k}\}$, the kernel spaces $\{K_1, \ldots, K_{n_k}\}$ are reconstructed as $\{K'_1, \ldots, K'_{n_k}\}$, where the p-th kernel matrix $K'_p$ contains only the distribution to which the p-th kernel is sensitive and which suits the corresponding coupling; the re-weighted kernel spaces are the heterogeneous kernel spaces; $K'_p$ is defined as follows:
$$K'_p = T_p \cdot K_p \tag{8}$$
$T_p$ is defined as a diagonal matrix, as shown in the following formula:
$$T_p = \begin{bmatrix} \alpha_{p1} & & \\ & \ddots & \\ & & \alpha_{p n_v^{(j)}} \end{bmatrix} \tag{9}$$
where $\alpha_{pj}$ is the weight of the j-th attribute value in the p-th kernel space; the larger $\alpha_{pj}$ is, the stronger the coupling of the j-th attribute value in the coupling space corresponding to the p-th kernel space.
In summary, the bank customer information is converted into multiple kernel spaces, and the relationships within the customer information are represented by kernel matrices. To reveal the heterogeneity within the couplings, the weights of the attribute values in each kernel space are learned: a set of transformation matrices is learned to reconstruct the kernel spaces, which identifies the attribute values with large weights, that is, the attribute values with high coupling strength.
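The conversion of a coupling space into a kernel matrix, followed by the diagonal re-weighting described in Step 4, can be sketched as below. This is only an illustration under stated assumptions: the Gaussian kernel is one possible member of the kernel set and is not named by the source, the weights are fixed by hand here whereas the method learns them, and the function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; chosen here only as an example kernel function."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def kernel_matrix(vectors, kernel):
    """Kernel matrix over the coupling vectors of one coupling space."""
    return [[kernel(x, y) for y in vectors] for x in vectors]


def apply_weights(alphas, K):
    """Left-multiply K by diag(alphas): row j of K is scaled by alphas[j]."""
    return [[alpha * entry for entry in row] for alpha, row in zip(alphas, K)]


# Coupling vectors of two attribute values (for instance their frequencies).
vectors = [[0.75], [0.25]]
K = kernel_matrix(vectors, gaussian_kernel)

# Hand-picked weights standing in for the learned ones.
K_prime = apply_weights([1.0, 0.5], K)

for row in K_prime:
    print(row)
```

Repeating this for every coupling space and every kernel function would give the full group of kernel spaces, one matrix per (coupling space, kernel function) pair.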
Step 5: performing similarity learning on the user information data and determining the similarity between objects in the heterogeneous kernel spaces;
To further capture the heterogeneity between the couplings, the effect of each heterogeneous kernel space on the final result is learned. A similarity measure between objects in the heterogeneous kernel spaces is defined first, and the weight of each kernel space is then learned on the basis of this measure to reflect its contribution; given a categorical data set, for the p-th kernel matrix, let $i$ and $i'$ denote the indexes of the values in the p-th kernel space that correspond to the i-th object and the i'-th object, respectively, and let $K'_{p,i}$ (the i-th row of $K'_p$) represent the i-th object in the p-th heterogeneous kernel space; the similarity $S_{p,ii'}$ of the i-th and i'-th objects in that space, measured by the linear kernel, is shown in the following formula:
$$S_{p,ii'} = K'_{p,i} \cdot {K'_{p,i'}}^{\top} \tag{10}$$
The final similarity $S_{ii'}$ between the i-th object and the i'-th object is defined as a linear combination of the similarity measures from the heterogeneous kernel spaces, in order to filter redundant information and integrate complementary information between the couplings, as shown in the following formula:
$$S_{ii'} = \sum_{p=1}^{n_k} \beta_p S_{p,ii'} \tag{11}$$
where $\beta_p \ge 0$ is the weight of the similarity in the p-th heterogeneous kernel space, and [...] represents a diagonal matrix;
Similarity learning is thus performed on the bank customer information: the similarity between every pair of objects is calculated, which helps to find dissimilar customer records, and these problematic records are the abnormal data. (This process filters out attributes with low coupling, such as gender, because they do not help the detection.)
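The per-space linear-kernel similarity and its weighted combination in Step 5 can be sketched as follows. The sketch assumes that each object has already been mapped to its row in every heterogeneous kernel space, and it uses hand-picked non-negative weights in place of the learned ones; `dot`, `combined_similarity` and the toy rows are hypothetical.

```python
def dot(x, y):
    """Linear kernel between two row vectors."""
    return sum(a * b for a, b in zip(x, y))


def combined_similarity(spaces, betas, i, i2):
    """Weighted sum over kernel spaces of the linear-kernel similarity.

    spaces[p][i] is assumed to be the row representing object i in the
    p-th heterogeneous kernel space; betas[p] is that space's weight.
    """
    return sum(beta * dot(space[i], space[i2])
               for beta, space in zip(betas, spaces))


# Two hypothetical kernel spaces, three objects, two-dimensional rows.
spaces = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]],
]
betas = [0.5, 2.0]  # hand-picked stand-ins for the learned weights

s01 = combined_similarity(spaces, betas, 0, 1)
s02 = combined_similarity(spaces, betas, 0, 2)
print(s01, s02)
```

Objects 0 and 1 share the same rows in both spaces and therefore come out more similar to each other than objects 0 and 2.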
Step 6: calculating the outlier score of each object based on the similarity between objects in the heterogeneous kernel spaces, ranking the scores with a top-k method, and selecting the most likely outliers, thereby detecting the abnormal bank data;
The similarity between two objects is calculated with formula (11). The sum of the similarities between object i and all other objects is then divided by the number of objects (excluding object i), which gives the normality of the object; the lower the normality, the more likely the object is an outlier. An outlier ranking is therefore generated with the top-k method, and the most likely outliers are selected. The outlier score of the i-th object is defined as the sum of the similarities between object i and all objects, divided by the number of objects, as shown in the following formula:
Through the above similarity learning, the similarity between objects can be calculated, and clients with low similarity rank high in data anomaly. When the customer data are evaluated for anomalies, extra attention should be paid to the highly ranked objects, which should be checked for fraud, so that fraud can be clearly identified.
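The scoring and top-k ranking of Step 6 can be sketched as below. The sketch follows one reading of the description, namely that an object's normality is its average similarity to the other objects (the object itself excluded) and that the k objects with the lowest normality are reported; the function names and the similarity matrix are hypothetical.

```python
def normality_scores(S):
    """Average similarity of each object to all other objects."""
    n = len(S)
    return [
        sum(S[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def top_k_outliers(S, k):
    """Indices of the k objects with the lowest normality, least normal first."""
    scores = normality_scores(S)
    return sorted(range(len(S)), key=lambda i: scores[i])[:k]


# Hypothetical pairwise similarities between four customers.
S = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]

print(top_k_outliers(S, 2))  # candidates to review first
```

In this toy matrix the fourth customer is dissimilar to everyone else and is ranked first, which is the kind of record the text suggests reviewing for possible fraud.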
After the outliers are detected by the method of the invention, they can be deleted, because outliers are the very few data points in a data set that differ significantly from the mainstream data. The remaining normal data can then be analyzed to formulate better protective measures for the bank and, by analyzing the condition of the clients, better customer service schemes, bank positioning schemes and the like. At the same time, the detected outliers can also be analyzed, since these few data objects may carry important information.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims (3)

1. A bank abnormal data detection method based on outlier detection is characterized in that: the method comprises the following steps:
Step 1: defining the bank user information data in triplet form;
The user information data is defined as a triplet $C = \langle O, A, V \rangle$, where $O = \{o_i \mid i \in N_o\}$ is the set of objects, containing $n_o$ users; $A = \{a_j \mid j \in N_a\}$ is the set of attributes, containing $n_a$ attributes; $V = \bigcup_{j=1}^{n_a} V^{(j)}$ is the set of all attribute values; $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of the $n_v^{(j)}$ attribute values of attribute $a_j$; $N_o$, $N_a$ and $N_v^{(j)}$ are the index sets of the objects in $O$, the attributes in $A$ and the attribute values in $V^{(j)}$, respectively; for the i-th object $o_i$, the value of the j-th attribute $a_j$ is denoted $v_{o_i}^{(j)}$;
Step 2: learning the coupling within the attributes of the user information data and constructing the intra-attribute coupling space;
Intra-attribute coupling represents the interaction between the attribute values of an attribute and the distribution of those values; it is measured from the distribution within the attribute through a value frequency function; for an attribute value $v_i^{(j)}$ of the j-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the intra-attribute coupling between this value and the other values of the attribute to a one-dimensional intra-attribute coupling vector, as shown in the following formula:
$$m_{Ia}^{(j)}\left(v_i^{(j)}\right) = \left[\frac{\left|g^{(j)}\left(v_i^{(j)}\right)\right|}{n_o}\right]^{\top} \tag{1}$$
where $g^{(j)}(\cdot): V^{(j)} \to O$ maps the value $v_i^{(j)}$ to the set of objects whose j-th attribute takes the value $v_i^{(j)}$, and $|\cdot|$ denotes the number of elements of a set;
The intra-attribute coupling space $M_{Ia}^{(j)}$ of the j-th attribute then consists of the intra-attribute coupling vectors obtained from equation (1), and is defined as follows:
$$M_{Ia}^{(j)} = \left\{ m_{Ia}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \tag{2}$$
Thus, for user information data with $n_a$ attributes, the intra-attribute coupling space is $M_{Ia} = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}\}$;
Step 3: learning the inter-attribute couplings of the user information data and constructing the inter-attribute coupling space;
inter-attribute coupling refers to the interaction between an attribute and the context formed by the attribute values of the other attributes; these attribute-level interactions and context information complement the value distributions and interactions captured by intra-attribute coupling;
inter-attribute coupling is represented by the information conditional probability, which reveals how an attribute value is distributed over the space spanned by the values of the other attributes; given a value $v^{(j)}$ of attribute $a_j$ and a value $v^{(k)}$ of attribute $a_k$, the information conditional probability function $p(v^{(j)} \mid v^{(k)})$ is defined as follows:
$$p\big(v^{(j)} \mid v^{(k)}\big) = \frac{\lvert g^{(j)}(v^{(j)}) \cap g^{(k)}(v^{(k)})\rvert}{\lvert g^{(k)}(v^{(k)})\rvert} \tag{3}$$
where $\cap$ returns the intersection of two sets;
based on the information conditional probability function, the inter-attribute coupling learning function $m_{Ie}^{(j)}(\cdot)$ embeds the interactions between the attribute value $v_i^{(j)}$ and the other attributes as a $\lvert V_* \rvert$-dimensional inter-attribute coupling vector, as shown in the following formula:
$$m_{Ie}^{(j)}\big(v_i^{(j)}\big) = \left[ p\big(v_i^{(j)} \mid v_{*1}\big), \dots, p\big(v_i^{(j)} \mid v_{*\lvert V_* \rvert}\big) \right]^{\top} \tag{4}$$
where $V_* = \{V^{(k)} \mid k \in N_a, k \neq j\}$ is the set of attribute values of all attributes except $a_j$, and $v_{*i} \in V_*$ is an attribute value in the set $V_*$;
the inter-attribute coupling space $M_{Ie}^{(j)}$ of the $j$-th attribute consists of the inter-attribute coupling vectors obtained by equation (4), as shown in the following formula:
$$M_{Ie}^{(j)} = \left\{ m_{Ie}^{(j)}\big(v_i^{(j)}\big) \;\middle|\; v_i^{(j)} \in V^{(j)} \right\} \tag{5}$$
thus, for user information data with $n_a$ attributes, the inter-attribute coupling space is $M_{Ie} = \{M_{Ie}^{(1)}, \dots, M_{Ie}^{(n_a)}\}$;
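A short Python sketch may help make Step 3 concrete. It assumes the information conditional probability is the size of the intersection of the two object sets divided by the size of the conditioning object set; the helper names `object_sets` and `inter_attribute_coupling`, the ordering of the context values, and the toy records are all hypothetical.

```python
def object_sets(records, j):
    """Map each value of attribute j to the set of object indices holding it."""
    groups = {}
    for i, row in enumerate(records):
        groups.setdefault(row[j], set()).add(i)
    return groups


def inter_attribute_coupling(records, j):
    """Return (context, vectors) for attribute j.

    context: ordered (attribute index, value) pairs from all attributes except j.
    vectors: maps each value v of attribute j to [p(v | v*) for v* in context].
    """
    n_attributes = len(records[0])
    g = [object_sets(records, k) for k in range(n_attributes)]
    # Fixed ordering so every value of attribute j gets a comparable vector.
    context = [(k, v) for k in range(n_attributes) if k != j for v in sorted(g[k])]
    vectors = {
        v: [len(objs & g[k][v_star]) / len(g[k][v_star]) for k, v_star in context]
        for v, objs in g[j].items()
    }
    return context, vectors


# Hypothetical toy data: (account type, region) for four users.
records = [
    ("savings", "north"),
    ("savings", "south"),
    ("savings", "north"),
    ("credit", "north"),
]
context, M_Ie_0 = inter_attribute_coupling(records, j=0)
```

Here `M_Ie_0["credit"]` describes how the value `"credit"` is distributed across the regions, which is the kind of context information the intra-attribute frequencies alone cannot capture.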
Step 4: performing heterogeneity learning in the kernel spaces of the user information data, using the constructed intra-attribute and inter-attribute coupling spaces;
from the constructed intra-attribute coupling space $M_{Ia}$ and inter-attribute coupling space $M_{Ie}$, the complete set of heterogeneous coupling spaces $M$ is constructed as the union of $M_{Ia}$ and $M_{Ie}$, as shown in the following formula:
$$M = M_{Ia} \cup M_{Ie} \tag{6}$$
i.e. $M = \{M_{Ia}^{(1)}, \dots, M_{Ia}^{(n_a)}, M_{Ie}^{(1)}, \dots, M_{Ie}^{(n_a)}\}$;
each coupling space is converted into its respective kernel spaces using multiple kernels, each kernel space corresponding to one converted coupling space; this generates a set of $n_k$ kernel spaces $\{K_1, \dots, K_{n_k}\}$, where $F$ is the set of kernel functions used for the conversion; the $p$-th kernel space is spanned by the kernel matrix $K_p$, $p \le n_k$; the kernel matrix $K_p$ is obtained from the heterogeneous coupling space $M_j$ through the kernel function $k_p(\cdot,\cdot)$, as shown in the following formula:
$$K_p = \begin{bmatrix} k_p(m_1, m_1) & \cdots & k_p(m_1, m_{n_v^{(j)}}) \\ \vdots & \ddots & \vdots \\ k_p(m_{n_v^{(j)}}, m_1) & \cdots & k_p(m_{n_v^{(j)}}, m_{n_v^{(j)}}) \end{bmatrix} \tag{7}$$
where $n_v^{(j)}$ is the number of attribute values represented by $M_j$, and $m_i$ is the representation of the $i$-th attribute value in the heterogeneous coupling space $M_j$;
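through a set of transformation matrices $\{T_1, \dots, T_{n_k}\}$, the kernel spaces $\{K_1, \dots, K_{n_k}\}$ are reconstructed as $\{K_1', \dots, K_{n_k}'\}$, where the $p$-th reconstructed kernel matrix $K_p'$ contains only the $p$-th kernel-sensitive distribution adapted to the corresponding coupling; the reconstructed kernel spaces are called the heterogeneous kernel spaces; $K_p'$ is defined as follows:
$$K_p' = T_p \cdot K_p \tag{8}$$
$T_p$ is defined as a diagonal matrix, as shown in the following formula:
$$T_p = \mathrm{diag}(\alpha_{p1}, \alpha_{p2}, \dots, \alpha_{pn}) \tag{9}$$
where $\alpha_{pj}$ is the weight of the $j$-th attribute value in the $p$-th kernel space, and $n$ is the number of attribute values in that space, so that $T_p$ has the same dimension as $K_p$;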
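The kernel conversion and diagonal reweighting of Step 4 can be sketched as follows. The Gaussian kernel, the bandwidth `gamma`, and the fixed weights `alphas` are assumptions chosen purely for illustration: the source does not name the kernel functions in $F$, and in the method the weights are learned rather than set by hand. The function names are hypothetical.

```python
import math


def gaussian_kernel(gamma):
    """Return a kernel function k(x, y) = exp(-gamma * ||x - y||^2)."""
    def k(x, y):
        return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
    return k


def kernel_matrix(vectors, k):
    """Kernel matrix of one coupling space: entry (a, b) is k(vectors[a], vectors[b])."""
    return [[k(x, y) for y in vectors] for x in vectors]


def reweight(K, alphas):
    """Compute diag(alphas) @ K, which scales row j of K by alphas[j]."""
    return [[a * entry for entry in row] for a, row in zip(alphas, K)]


# Hypothetical coupling space with two attribute values (1-D frequency vectors).
space = [[0.75], [0.25]]
K = kernel_matrix(space, gaussian_kernel(gamma=1.0))
K_prime = reweight(K, alphas=[0.5, 2.0])
```

Applying several kernels (for example, several values of `gamma`) to every coupling space would yield the full family of kernel matrices, each with its own diagonal weight matrix.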
Step 5: performing similarity learning on the user information data to determine the similarity between objects in the heterogeneous kernel spaces;
Step 6: based on the similarity between objects in the heterogeneous kernel spaces, calculating the outlier score of each object, ranking the outlier scores by the top-k method, and selecting the objects most likely to be outliers, thereby detecting abnormal bank data.
2. The method for detecting abnormal bank data based on outlier detection according to claim 1, wherein the specific method of Step 5 is as follows:
first, a similarity measure between objects in the heterogeneous kernel spaces is defined, and then the weight of each kernel space is learned on the basis of this similarity measure to reflect the contribution of that kernel space; given an attribute data set, for the $p$-th kernel matrix, let $i$ and $i'$ denote the indices of the values in the $p$-th kernel space that correspond to the $i$-th object and the $i'$-th object, respectively, and let $K'_{p,i}$ denote the $i$-th object in the $p$-th heterogeneous kernel space; the similarity $S_{p,ii'}$ of the $i$-th and $i'$-th objects in that space, measured by the linear kernel, is shown in the following formula:
$$S_{p,ii'} = K'_{p,i} \cdot {K'_{p,i'}}^{\top} \tag{10}$$
the final similarity $S_{ii'}$ between the $i$-th object and the $i'$-th object is defined as a linear combination of the similarity measures from the heterogeneous kernel spaces, so as to filter redundant information and integrate the complementary information between couplings, as shown in the following formula:
$$S_{ii'} = \sum_{p=1}^{n_k} \beta_p \, S_{p,ii'} \tag{11}$$
where $\beta_p \ge 0$ is the weight of the similarity in the $p$-th heterogeneous kernel space, and […] denotes a diagonal matrix.
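The similarity combination of claim 2 can be illustrated with a small Python sketch. It assumes that each object has already been assigned one row of each heterogeneous kernel matrix and that the weights are given; in the method the weights are learned, so the fixed `betas` below, like the function names and the toy rows, are hypothetical.

```python
def dot(x, y):
    """Linear kernel: inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def combined_similarity(rows_per_space, betas, i, i2):
    """Weighted sum over kernel spaces of the linear-kernel similarity of objects i and i2.

    rows_per_space[p][i] is the row of the p-th heterogeneous kernel matrix
    that represents object i.
    """
    return sum(
        beta * dot(rows[i], rows[i2])
        for beta, rows in zip(betas, rows_per_space)
    )


# Hypothetical rows for three objects in two heterogeneous kernel spaces.
rows_per_space = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 1.0], [0.5, 0.5]],
]
betas = [0.5, 0.5]  # non-negative weights, fixed here for illustration
s_01 = combined_similarity(rows_per_space, betas, 0, 1)
s_02 = combined_similarity(rows_per_space, betas, 0, 2)
```

Objects 0 and 1 agree in the first space and partly in the second, so they end up more similar than objects 0 and 2, which agree only in the second space.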
3. The method for detecting abnormal bank data based on outlier detection according to claim 2, wherein Step 6 defines the outlier score of the $i$-th object as the sum of the similarities between object $i$ and all objects, divided by the number of objects, as shown in the following formula:
$$\mathrm{score}(o_i) = \frac{1}{n_o} \sum_{i'=1}^{n_o} S_{ii'} \tag{12}$$
where $n_o$ is the number of objects.
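The scoring and top-k selection of Step 6 can be sketched in a few lines of Python. The score follows the description above (average similarity of an object to all objects). The ranking direction is an assumption: the sketch treats the objects with the lowest average similarity as the most likely outliers, which the source does not state explicitly. The function names and the toy similarity matrix are hypothetical.

```python
def outlier_scores(S):
    """Average similarity of each object to all objects; S is a square similarity matrix."""
    n = len(S)
    return [sum(row) / n for row in S]


def top_k_outliers(S, k):
    """Indices of the k objects with the lowest average similarity.

    Assumption: a low average similarity marks an object as more anomalous.
    """
    scores = outlier_scores(S)
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]


# Hypothetical similarity matrix for three objects.
S = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
]
scores = outlier_scores(S)
suspects = top_k_outliers(S, k=1)
```

If a deployment instead treats high scores as anomalous, only the sort order in `top_k_outliers` would need to change.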

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110434414.5A CN113239024B (en) 2021-04-22 2021-04-22 Bank abnormal data detection method based on outlier detection

Publications (2)

Publication Number Publication Date
CN113239024A CN113239024A (en) 2021-08-10
CN113239024B CN113239024B (en) 2023-11-07

Family

ID=77128868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110434414.5A Active CN113239024B (en) 2021-04-22 2021-04-22 Bank abnormal data detection method based on outlier detection

Country Status (1)

Country Link
CN (1) CN113239024B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105844102A (en) * 2016-03-25 2016-08-10 中国农业大学 Self-adaptive parameter-free spatial outlier detection algorithm
CN106446256A (en) * 2016-10-17 2017-02-22 鞍钢集团矿业有限公司 Real-time industrial production information sensation system based on context calculation
CN106503086A (en) * 2016-10-11 2017-03-15 成都云麒麟软件有限公司 The detection method of distributed local outlier
CN108038211A (en) * 2017-12-13 2018-05-15 南京大学 A kind of unsupervised relation data method for detecting abnormality based on context
CN109525453A (en) * 2018-11-02 2019-03-26 长沙学院 Networking CPS method for detecting abnormality and system based on node dependence
CN110826686A (en) * 2018-08-07 2020-02-21 艾玛迪斯简易股份公司 Machine learning system and method with attribute sequence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11169865B2 (en) * 2018-09-18 2021-11-09 Nec Corporation Anomalous account detection from transaction data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
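Research on outlier detection methods in heterogeneous information networks; Liu Lu; China Doctoral Dissertations Full-text Database, Information Science and Technology (Monthly), No. 09; full text *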

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant