CN113239024A - Bank abnormal data detection method based on outlier detection - Google Patents

Publication number: CN113239024A
Application number: CN202110434414.5A
Other versions: CN113239024B
Authority: China (CN)
Other languages: Chinese (zh)
Inventors: 郭鹏飞, 王灏
Current assignee: Liaoning Technical University (as listed; may be inaccurate)
Original assignee: Liaoning Technical University
Application filed by: Liaoning Technical University
Legal status: Granted, active (as listed; an assumption, not a legal conclusion)
Prior art keywords: attribute, coupling, space, kernel, bank

Classifications

    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a bank abnormal data detection method based on outlier detection, and relates to the technical field of outlier detection. The method first defines bank user information data as a triple. It then learns the couplings within and between the attributes of the user information data and constructs an attribute in-coupling space and an inter-attribute coupling space. Heterogeneity learning is then performed in the kernel spaces of the user information data, similarity learning is carried out on the user information data, and the similarity between objects in the heterogeneous kernel spaces is determined. Finally, an outlier score is computed for each object from the similarity between objects in the heterogeneous kernel spaces, the scores are ranked, and the most probable outliers are selected, thereby detecting abnormal bank data. The method has a high outlier detection rate, is suitable for detecting outliers in high-dimensional big data, and improves the detection of abnormal bank data by accurately detecting outliers in the bank user information data.

Description

Bank abnormal data detection method based on outlier detection
Technical Field
The invention relates to the technical field of outlier detection, in particular to a bank abnormal data detection method based on outlier detection.
Background
At present, every banking system has a large number of users and therefore generates a large volume of user information data, from which banks can formulate better customer service schemes, bank positioning schemes and the like. However, this data contains many abnormal records. By detecting the abnormal data, a bank can focus its management on the relevant clients and discover fraud and similar behavior in time.
Outlier detection is an important component of the data mining field. It focuses on identifying the data objects in a real data set that are inconsistent with, or unexpected relative to, normal data. Outliers are rare objects that behave differently from the majority of normal objects. Their presence strongly affects normal data analysis and may mislead its results, so detecting outliers plays an important role in many real-world applications, such as network intrusion detection, detection of rare diseases and bank fraud detection. Identifying outliers is very difficult, especially in data that has complex relationships and is not independent and identically distributed (non-IID).
Coupling learning is an emerging subfield of data mining. It models the original data by analyzing the various complex relationships in non-IID data. Coupling learning asks whether learning more couplings improves the representation quality of the data; the correlation between two attributes, for example, can greatly reduce redundant information and achieve better representation performance. Another key issue is the need to learn different types of couplings while capturing complementary and inconsistent interactions and relationships. Different interactions and different data distributions give rise to different types of coupling, and a data set usually contains several distributions; both are reasons why heterogeneous couplings exist. Early coupling learning focused on a single coupling or a few strong couplings and rarely considered multiple couplings or weak couplings, even though weak couplings can often determine much of the important content; previous work ignored this.
The coupling between attribute values describes how, and how strongly, attribute values are coupled within an attribute, and the coupling between attributes describes how, and how strongly, attributes are coupled with one another. However, existing work involves only a single coupling and ignores many other characteristics of categorical data. An outlier detection method that learns heterogeneous and hierarchical couplings can represent the complex relationships in the original data more accurately, so that outlier objects that differ from normal objects can be identified more accurately.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a bank abnormal data detection method based on outlier detection, so as to implement the detection of the bank abnormal data.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a bank abnormal data detection method based on outlier detection comprises the following steps:
Step 1: defining bank user information data in the form of a triple;
the user information data is defined as a triple $C = \langle O, A, V \rangle$, where $O = \{o_i \mid i \in N_o\}$ is the object set of $n_o$ users; $A = \{a_j \mid j \in N_a\}$ is the attribute set of $n_a$ attributes; $V = \bigcup_{j=1}^{n_a} V^{(j)}$ is the set of attribute values, with $n_v$ values; $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of attribute values of attribute $a_j$, which has $n_v^{(j)}$ attribute values; $N_o$, $N_a$ and $N_v^{(j)}$ are the index sets of the objects in $O$, the attributes in $A$ and the categorical values in $V^{(j)}$, respectively; the value of the $j$-th attribute $a_j$ of the $i$-th object $o_i$ is denoted $v_{o_i}^{(j)}$;
Step 2: learning the couplings within the attributes of the user information data and constructing the attribute in-coupling space;
attribute in-coupling represents the interaction between attribute values and the distribution of values within an attribute; it is measured from the intra-attribute value distribution by a value frequency function; for a categorical value $v_i^{(j)}$ of the $j$-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the in-coupling between this value and the other categorical values of the attribute to a one-dimensional attribute in-coupling vector, as shown in the following formula:
$$m_{Ia}^{(j)}\left(v_i^{(j)}\right) = \left[\frac{\left|g^{(j)}\left(v_i^{(j)}\right)\right|}{n_o}\right] \quad (1)$$
where $g^{(j)}(\cdot): V^{(j)} \to O$ maps the value $v_i^{(j)}$ to the set of objects that have the value $v_i^{(j)}$ in the $j$-th attribute, and $|\cdot|$ denotes the number of elements of a set;
the attribute in-coupling space $M_{Ia}^{(j)}$ of the $j$-th attribute is then formed from the attribute in-coupling vectors obtained from equation (1) and is defined as follows:
$$M_{Ia}^{(j)} = \left\{ m_{Ia}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \quad (2)$$
thus, for user information data with $n_a$ attributes, the attribute in-coupling space is $M_{Ia} = \left\{ M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)} \right\}$;
Step 3: learning the couplings between the attributes of the user information data and constructing the inter-attribute coupling space;
inter-attribute coupling refers to the interaction between the contextual information of an attribute and the attribute values of other attributes; this attribute-based interaction and contextual information supplements the value distribution and interaction captured by the attribute in-coupling;
the inter-attribute coupling is represented by an information conditional probability, which reveals the distribution of an attribute value in the space spanned by the values of other attributes; given a value $v^{(j)}$ of attribute $a_j$ and a value $v^{(k)}$ of attribute $a_k$, the information conditional probability function $p(v^{(j)} \mid v^{(k)})$ between the two attribute values is defined as follows:
$$p\left(v^{(j)} \mid v^{(k)}\right) = \frac{\left|g^{(j)}\left(v^{(j)}\right) \cap g^{(k)}\left(v^{(k)}\right)\right|}{\left|g^{(k)}\left(v^{(k)}\right)\right|} \quad (3)$$
where $\cap$ returns the intersection of two sets;
based on the information conditional probability function, the inter-attribute coupling learning function $m_{Ie}^{(j)}(\cdot)$ embeds the interaction of the attribute value $v_i^{(j)}$ with the other attributes as a $|V_*|$-dimensional inter-attribute coupling vector, as shown in the following formula:
$$m_{Ie}^{(j)}\left(v_i^{(j)}\right) = \left[ p\left(v_i^{(j)} \mid v_{*1}\right), \ldots, p\left(v_i^{(j)} \mid v_{*|V_*|}\right) \right]^{\top} \quad (4)$$
where $V_* = \{V^{(k)} \mid k \in N_a, k \neq j\}$ is the set of attribute values of all attributes other than $a_j$, and $v_{*i} \in V_*$ is an attribute value in the set $V_*$;
the inter-attribute coupling space $M_{Ie}^{(j)}$ of the $j$-th attribute is composed of the inter-attribute coupling vectors obtained from equation (4), as shown in the following formula:
$$M_{Ie}^{(j)} = \left\{ m_{Ie}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \quad (5)$$
thus, for user information data with $n_a$ attributes, the inter-attribute coupling space is $M_{Ie} = \left\{ M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)} \right\}$;
Step 4: performing heterogeneity learning in the kernel spaces of the user information data by means of the constructed attribute in-coupling space and inter-attribute coupling space;
from the constructed attribute in-coupling space and inter-attribute coupling space, a complete set $M$ of heterogeneous coupling spaces is further constructed; this set is the union of the heterogeneous attribute in-coupling spaces $M_{Ia}$ and inter-attribute coupling spaces $M_{Ie}$, as shown in the following formula:
$$M = M_{Ia} \cup M_{Ie} \quad (6)$$
that is, $M = \left\{ M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}, M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)} \right\}$;
each coupling space is converted into its own kernel spaces using multiple kernels, where each kernel space corresponds to one converted coupling space; this generates a set of $n_k$ kernel spaces $\{K_1, \ldots, K_{n_k}\}$ with $n_k = |M| \cdot |F|$, where $F$ is the set of kernel functions used for the conversion; the $p$-th kernel space is spanned by the kernel matrix $K_p$, $p \le n_k$; the kernel matrix $K_p$ is built from a coupling space $M_j$ by the kernel function $k_p(\cdot, \cdot)$, as shown in the following formula:
$$K_p = \begin{bmatrix} k_p(m_1, m_1) & \cdots & k_p\left(m_1, m_{n_v^{(j)}}\right) \\ \vdots & \ddots & \vdots \\ k_p\left(m_{n_v^{(j)}}, m_1\right) & \cdots & k_p\left(m_{n_v^{(j)}}, m_{n_v^{(j)}}\right) \end{bmatrix} \quad (7)$$
where $n_v^{(j)}$ is the number of attribute values represented by $M_j$, and $m_j$ is the vector in the coupling space $M_j$ that represents the $j$-th attribute value;
by a set of transformation matrices $\{T_1, \ldots, T_{n_k}\}$, the kernel spaces $\{K_1, \ldots, K_{n_k}\}$ are reconstructed as $\{K'_1, \ldots, K'_{n_k}\}$, where the $p$-th kernel matrix $K'_p$ contains only the distribution to which the $p$-th kernel is sensitive for the corresponding coupling; the reconstructed kernel spaces are called heterogeneous kernel spaces; $K'_p$ is defined as:
$$K'_p = T_p \cdot K_p \quad (8)$$
$T_p$ is specified as a diagonal matrix, as shown in the following formula:
$$T_p = \begin{bmatrix} \alpha_{p1} & & \\ & \ddots & \\ & & \alpha_{p n_v^{(j)}} \end{bmatrix} \quad (9)$$
where $\alpha_{pj}$ is the weight of the $j$-th attribute value in the $p$-th kernel space;
Step 5: performing similarity learning on the user information data and determining the similarity between objects in the heterogeneous kernel spaces;
a similarity measure between objects in the heterogeneous kernel spaces is first defined, and the weight of each kernel space is then learned on the basis of this measure to reflect its contribution; given a categorical data set, for the $p$-th kernel matrix, let $i$ and $i'$ denote the indices of the values in the $p$-th kernel space that correspond to the $i$-th object and the $i'$-th object, respectively, and let $K'_{p,i}$ denote the $i$-th object in the $p$-th heterogeneous kernel space; the similarity $S_{p,ii'}$ of the $i$-th and $i'$-th objects in that space, measured by a linear kernel, is shown in the following formula:
$$S_{p,ii'} = K'_{p,i} \cdot {K'_{p,i'}}^{\top} \quad (10)$$
the final similarity $S_{ii'}$ between the $i$-th object and the $i'$-th object is defined as a linear combination of the similarity measures from the heterogeneous kernel spaces, in order to filter redundant information and integrate complementary information between couplings, as shown in the following formula:
$$S_{ii'} = \sum_{p=1}^{n_k} \beta_p S_{p,ii'} \quad (11)$$
where $\beta_p \ge 0$ is the weight of the similarity in the $p$-th heterogeneous kernel space, and [Figure BDA0003032522660000047] represents a diagonal matrix;
Step 6: computing the outlier score of each object from the similarity between objects in the heterogeneous kernel spaces, ranking the scores with a top-k method, and selecting the most probable outliers, thereby detecting abnormal bank data;
the outlier score of the $i$-th object is defined as the sum of the similarities between object $i$ and all other objects, divided by the number of those objects, as shown in the following formula:
$$\mathrm{score}(o_i) = \frac{1}{n_o - 1} \sum_{i' \neq i} S_{ii'} \quad (12)$$
The beneficial effects of the above technical solution are as follows: the bank abnormal data detection method based on outlier detection provided by the invention studies the complex coupling relationships between different data objects in greater depth and can reflect the real relationships between data objects more accurately; applying heterogeneous coupling to outlier detection effectively improves the accuracy of the outlier detection algorithm; the method has a high outlier detection rate, is suitable for detecting outliers in high-dimensional big data, and improves the detection of abnormal bank data by accurately detecting outliers in the bank user information data.
Drawings
Fig. 1 is a flowchart of a bank abnormal data detection method based on outlier detection according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
In real-world situations, data objects interact with one another. Data is often not independent and identically distributed (non-IID), that is, more or less complex coupling relationships exist within the same feature, between different features, and even between different objects. Coupling learning is an important approach to non-IID data. When learning each object, the couplings within the same feature, between different features and between different data objects need to be considered hierarchically. The method of the invention therefore adds heterogeneous coupling to the coupling analysis, so that the complex relationships between the original data objects can be analyzed more accurately, thereby improving the accuracy of outlier detection.
In this embodiment, user information data in a certain bank within a period of time is taken as an example, and the outlier in the user information data within the period of time is detected by using the bank abnormal data detection method based on outlier detection of the present invention, so as to implement detection of abnormal data. In this embodiment, a bank abnormal data detection method based on outlier detection, as shown in fig. 1, includes the following steps:
Step 1: defining bank user information data in the form of a triple;
the user information data is defined as a triple $C = \langle O, A, V \rangle$, where $O = \{o_i \mid i \in N_o\}$ is the object set of $n_o$ users; $A = \{a_j \mid j \in N_a\}$ is the attribute set of $n_a$ attributes (i.e., customer information such as gender, education, marital status, employment status, deposit status, recent transaction status, etc.); $V = \bigcup_{j=1}^{n_a} V^{(j)}$ is the set of attribute values, with $n_v$ values (e.g., male and female, junior middle school education, bachelor's degree, deposit amount, recent transactions, etc.); $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of attribute values of attribute $a_j$ (e.g., the values of marital status, such as married and unmarried), which has $n_v^{(j)}$ attribute values; $N_o$, $N_a$ and $N_v^{(j)}$ are the index sets of the objects in $O$, the attributes in $A$ and the categorical values in $V^{(j)}$, respectively; the value of the $j$-th attribute $a_j$ of the $i$-th object $o_i$ is denoted $v_{o_i}^{(j)}$.
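The triple of Step 1 can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `CategoricalData`, its field names and the toy records are all hypothetical, and only the roles of O, A, V and the mapping g(j) come from the text.

```python
from dataclasses import dataclass


@dataclass
class CategoricalData:
    """Hypothetical container for the triple C = <O, A, V>."""
    objects: list      # O: one identifier per user
    attributes: list   # A: attribute names
    rows: list         # rows[i][j] is the value of attribute j for object i

    def values(self, j):
        """V^(j): distinct values of attribute j, in first-seen order."""
        seen = []
        for row in self.rows:
            if row[j] not in seen:
                seen.append(row[j])
        return seen

    def g(self, j, v):
        """g^(j)(v): indices of the objects whose attribute j equals v."""
        return {i for i, row in enumerate(self.rows) if row[j] == v}


# Toy records, invented purely for illustration.
data = CategoricalData(
    objects=["u1", "u2", "u3", "u4"],
    attributes=["marital_status", "education"],
    rows=[
        ["married", "bachelor"],
        ["married", "bachelor"],
        ["unmarried", "bachelor"],
        ["married", "junior_middle"],
    ],
)

print(data.values(0))        # distinct values of marital_status
print(data.g(0, "married"))  # objects whose marital_status is married
```

Storing the records row-wise keeps the object index i and the attribute index j aligned with the notation used in the following steps.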
Step 2: learning the couplings within the attributes of the user information data and constructing the attribute in-coupling space;
attribute in-coupling represents the interaction between attribute values and the distribution of values within an attribute; it is measured from the intra-attribute value distribution by a value frequency function. Although the value frequency function takes a single value as input, it measures the distribution over all values of the attribute. For example, the proportions of different groups among the bank's customers, such as the ratio of married to unmarried customers or the proportions of different education levels, can be analyzed, which supports better analysis of the data.
For a categorical value $v_i^{(j)}$ of the $j$-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the in-coupling between this value and the other categorical values of the attribute to a one-dimensional attribute in-coupling vector, as shown in the following formula:
$$m_{Ia}^{(j)}\left(v_i^{(j)}\right) = \left[\frac{\left|g^{(j)}\left(v_i^{(j)}\right)\right|}{n_o}\right] \quad (1)$$
where $g^{(j)}(\cdot): V^{(j)} \to O$ maps the value $v_i^{(j)}$ to the set of objects that have the value $v_i^{(j)}$ in the $j$-th attribute, and $|\cdot|$ denotes the number of elements of a set;
the attribute in-coupling space $M_{Ia}^{(j)}$ of the $j$-th attribute is then formed from the attribute in-coupling vectors obtained from equation (1) and is defined as follows:
$$M_{Ia}^{(j)} = \left\{ m_{Ia}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \quad (2)$$
thus, for user information data with $n_a$ attributes, the attribute in-coupling space is $M_{Ia} = \left\{ M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)} \right\}$.
The attribute in-coupling space is only a one-dimensional embedding of the categorical data space with respect to each attribute; the inter-attribute coupling described next takes the interactions between attributes into account;
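The value frequency mapping of Step 2 can be sketched as follows. It is a minimal sketch that assumes the frequency is normalized by the number of objects; the function name `in_coupling_space` and the toy column are hypothetical.

```python
from collections import Counter


def in_coupling_space(column):
    """Map each distinct value of one attribute to its one-dimensional
    in-coupling vector [|g(v)| / n_o], i.e. its relative frequency."""
    n_o = len(column)
    counts = Counter(column)
    return {value: [count / n_o] for value, count in counts.items()}


# Toy marital-status column for four users (illustrative only).
marital_status = ["married", "married", "unmarried", "married"]
M_ia = in_coupling_space(marital_status)
print(M_ia)
```

Each attribute yields one such dictionary; collecting them over all attributes gives the attribute in-coupling space of the data set.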
Step 3: learning the couplings between the attributes of the user information data and constructing the inter-attribute coupling space;
inter-attribute coupling refers to the interaction between the contextual (and/or semantic) information of the attribute values of an attribute and the other attributes; this attribute-based interaction and contextual information supplements the value distribution and interaction captured by the attribute in-coupling; for example, one user may have low credit and no transaction record and then suddenly record a large transaction, while another user has fixed monthly credits and debits; the behavior of the former is more suspicious than that of the latter, which shows that the attributes are interrelated and that inter-attribute coupling works together with attribute in-coupling to help analyze the client.
The inter-attribute coupling is represented by an information conditional probability, which reveals the distribution of an attribute value in the space spanned by the values of other attributes; given a value $v^{(j)}$ of attribute $a_j$ and a value $v^{(k)}$ of attribute $a_k$, the information conditional probability function $p(v^{(j)} \mid v^{(k)})$ between the two attribute values is defined as follows:
$$p\left(v^{(j)} \mid v^{(k)}\right) = \frac{\left|g^{(j)}\left(v^{(j)}\right) \cap g^{(k)}\left(v^{(k)}\right)\right|}{\left|g^{(k)}\left(v^{(k)}\right)\right|} \quad (3)$$
where $\cap$ returns the intersection of two sets;
based on the information conditional probability function, the inter-attribute coupling learning function $m_{Ie}^{(j)}(\cdot)$ embeds the interaction of the attribute value $v_i^{(j)}$ with the other attributes as a $|V_*|$-dimensional inter-attribute coupling vector, as shown in the following formula:
$$m_{Ie}^{(j)}\left(v_i^{(j)}\right) = \left[ p\left(v_i^{(j)} \mid v_{*1}\right), \ldots, p\left(v_i^{(j)} \mid v_{*|V_*|}\right) \right]^{\top} \quad (4)$$
where $V_* = \{V^{(k)} \mid k \in N_a, k \neq j\}$ is the set of attribute values of all attributes other than $a_j$, and $v_{*i} \in V_*$ is an attribute value in the set $V_*$;
the inter-attribute coupling space $M_{Ie}^{(j)}$ of the $j$-th attribute is composed of the inter-attribute coupling vectors obtained from equation (4), as shown in the following formula:
$$M_{Ie}^{(j)} = \left\{ m_{Ie}^{(j)}\left(v_i^{(j)}\right) \mid v_i^{(j)} \in V^{(j)} \right\} \quad (5)$$
thus, for user information data with $n_a$ attributes, the inter-attribute coupling space is $M_{Ie} = \left\{ M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)} \right\}$.
If $|V| > 2|V^{(j)}| - 1$, the inter-attribute coupling learning function projects the categorical value into a higher-dimensional space, because the dimension of the inter-attribute coupling space is $|V| - |V^{(j)}|$, whereas the degrees of freedom of the $j$-th attribute (equal to the dimension obtained when the categorical values are converted into dummy variables) are $|V^{(j)}| - 1$. The function thus captures the value couplings induced by the other attributes, which complement the attribute in-coupling to form a complete representation of the categorical attribute space.
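The conditional-probability embedding of Step 3 can be sketched in plain Python. This is an illustrative sketch, not the patent's code: the helper names and toy records are hypothetical, and the values of the other attributes are visited in sorted order only to make the vector layout deterministic.

```python
def objects_with(rows, j, v):
    """g^(j)(v): indices of the objects whose attribute j equals v."""
    return {i for i, row in enumerate(rows) if row[j] == v}


def cond_prob(rows, j, vj, k, vk):
    """p(vj | vk) = |g^(j)(vj) & g^(k)(vk)| / |g^(k)(vk)|."""
    g_k = objects_with(rows, k, vk)
    return len(objects_with(rows, j, vj) & g_k) / len(g_k)


def inter_coupling_vector(rows, j, vj):
    """Embed value vj of attribute j as its conditional probabilities
    given every value of every other attribute."""
    vector = []
    for k in range(len(rows[0])):
        if k == j:
            continue
        for vk in sorted({row[k] for row in rows}):
            vector.append(cond_prob(rows, j, vj, k, vk))
    return vector


# Toy records: [marital_status, education] (illustrative only).
rows = [
    ["married", "bachelor"],
    ["married", "bachelor"],
    ["unmarried", "bachelor"],
    ["married", "junior_middle"],
]
vec = inter_coupling_vector(rows, 0, "married")
print(vec)  # p(married | bachelor), p(married | junior_middle)
```

With two attributes and two education levels the vector has two entries, which matches the dimension $|V| - |V^{(j)}|$ stated in the text (4 values in total, 2 of them belonging to marital status).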
Step 4: performing heterogeneity learning in the kernel spaces of the user information data by means of the constructed attribute in-coupling space and inter-attribute coupling space;
from both the intra-attribute and the inter-attribute perspective, a complete set $M$ of heterogeneous coupling spaces is further constructed from the constructed attribute in-coupling space and inter-attribute coupling space; this set is the union of the heterogeneous attribute in-coupling spaces $M_{Ia}$ and inter-attribute coupling spaces $M_{Ie}$, as shown in the following formula:
$$M = M_{Ia} \cup M_{Ie} \quad (6)$$
that is, $M = \left\{ M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}, M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)} \right\}$.
To integrate the heterogeneous couplings in the learned set of coupling spaces effectively, the learned heterogeneous coupling spaces are converted into a unified space in which the heterogeneous couplings are comparable. Specifically, each coupling space is converted into its own kernel spaces using multiple kernels, where each kernel space corresponds to one converted coupling space; this generates a set of $n_k$ kernel spaces $\{K_1, \ldots, K_{n_k}\}$ with $n_k = |M| \cdot |F|$, where $F$ is the set of kernel functions used for the conversion; the $p$-th kernel space is spanned by the kernel matrix $K_p$, $p \le n_k$; the kernel matrix $K_p$ is built from a coupling space $M_j$ by the kernel function $k_p(\cdot, \cdot)$, as shown in the following formula:
$$K_p = \begin{bmatrix} k_p(m_1, m_1) & \cdots & k_p\left(m_1, m_{n_v^{(j)}}\right) \\ \vdots & \ddots & \vdots \\ k_p\left(m_{n_v^{(j)}}, m_1\right) & \cdots & k_p\left(m_{n_v^{(j)}}, m_{n_v^{(j)}}\right) \end{bmatrix} \quad (7)$$
where $n_v^{(j)}$ is the number of attribute values represented by $M_j$, and $m_j$ is the vector in the coupling space $M_j$ that represents the $j$-th attribute value;
to reveal the heterogeneity within the couplings, the weights of the values in each kernel space are learned. Specifically, by a set of transformation matrices $\{T_1, \ldots, T_{n_k}\}$, the kernel spaces $\{K_1, \ldots, K_{n_k}\}$ are reconstructed as $\{K'_1, \ldots, K'_{n_k}\}$, where the $p$-th kernel matrix $K'_p$ contains only the distribution to which the $p$-th kernel is sensitive for the corresponding coupling; the reconstructed kernel spaces are called heterogeneous kernel spaces; $K'_p$ is defined as:
$$K'_p = T_p \cdot K_p \quad (8)$$
$T_p$ is specified as a diagonal matrix, as shown in the following formula:
$$T_p = \begin{bmatrix} \alpha_{p1} & & \\ & \ddots & \\ & & \alpha_{p n_v^{(j)}} \end{bmatrix} \quad (9)$$
where $\alpha_{pj}$ is the weight of the $j$-th attribute value in the $p$-th kernel space; the larger $\alpha_{pj}$ is, the stronger the coupling of the $j$-th attribute value shown by the coupling space corresponding to the $p$-th kernel space.
The above converts the bank customer information into multiple kernel spaces and represents the relationships between items of customer information by kernel matrices. To reveal the heterogeneity within the couplings, the weight of each attribute value in each kernel space is learned: a set of transformation matrices is learned to reconstruct the kernel spaces and to find the attribute values with large weights, i.e., the attribute values with high coupling strength.
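The construction of one kernel matrix and its diagonal re-weighting can be sketched as below. Several things here are assumptions: the text does not name the kernel functions in F, so a Gaussian kernel is used as a stand-in, and the weights `alpha_p` are fixed placeholder numbers, whereas the method learns them.

```python
import math


def gaussian_kernel_matrix(vectors, width=1.0):
    """K[a][b] = exp(-||m_a - m_b||^2 / (2 * width^2)).
    The Gaussian kernel is an assumed example of a kernel in F."""
    def k(a, b):
        sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-sq_dist / (2.0 * width ** 2))
    return [[k(a, b) for b in vectors] for a in vectors]


def apply_diagonal(alpha, K):
    """K' = T K with T = diag(alpha): row j of K is scaled by alpha[j]."""
    return [[a * x for x in row] for a, row in zip(alpha, K)]


# One coupling space M_j holding two one-dimensional in-coupling vectors.
M_j = [[0.75], [0.25]]
K_p = gaussian_kernel_matrix(M_j)

alpha_p = [0.8, 0.2]  # placeholder weights, not learned values
K_p_prime = apply_diagonal(alpha_p, K_p)
print(K_p_prime)
```

Multiplying by a diagonal matrix on the left only rescales rows, so a value with a larger weight contributes a larger row to the reconstructed kernel space.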
Step 5: performing similarity learning on the user information data and determining the similarity between objects in the heterogeneous kernel spaces;
to further capture the heterogeneity between couplings, the influence of each heterogeneous kernel space on the final result is learned. A similarity measure between objects in the heterogeneous kernel spaces is first defined, and the weight of each kernel space is then learned on the basis of this measure to reflect its contribution; given a categorical data set, for the $p$-th kernel matrix, let $i$ and $i'$ denote the indices of the values in the $p$-th kernel space that correspond to the $i$-th object and the $i'$-th object, respectively, and let $K'_{p,i}$ (the $i$-th row of $K'_p$) denote the $i$-th object in the $p$-th heterogeneous kernel space; the similarity $S_{p,ii'}$ of the $i$-th and $i'$-th objects in that space, measured by a linear kernel, is shown in the following formula:
$$S_{p,ii'} = K'_{p,i} \cdot {K'_{p,i'}}^{\top} \quad (10)$$
the final similarity $S_{ii'}$ between the $i$-th object and the $i'$-th object is defined as a linear combination of the similarity measures from the heterogeneous kernel spaces, in order to filter redundant information and integrate complementary information between couplings, as shown in the following formula:
$$S_{ii'} = \sum_{p=1}^{n_k} \beta_p S_{p,ii'} \quad (11)$$
where $\beta_p \ge 0$ is the weight of the similarity in the $p$-th heterogeneous kernel space, and [Figure BDA0003032522660000093] represents a diagonal matrix;
similarity learning is performed on the bank client information: the similarity between every pair of objects is computed and measured, so that dissimilar client records can be found; such problematic records are the abnormal data. (This process filters out attributes that do not help detection, i.e., weakly coupled attributes such as gender.)
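The linear-kernel similarity and its weighted combination over kernel spaces can be sketched as follows. The rows and the weights `beta` are made-up numbers for illustration; in the method the weights are learned, and the function names are hypothetical.

```python
def linear_similarity(row_a, row_b):
    """Linear kernel: inner product of two rows of one kernel matrix."""
    return sum(x * y for x, y in zip(row_a, row_b))


def combined_similarity(rows_a, rows_b, beta):
    """Weighted sum over kernel spaces of the per-space similarities.
    rows_a[p] and rows_b[p] are the rows representing the two objects
    in the p-th heterogeneous kernel space."""
    return sum(
        b * linear_similarity(ra, rb)
        for b, ra, rb in zip(beta, rows_a, rows_b)
    )


# Two objects represented in two kernel spaces (illustrative numbers).
obj_a = [[1.0, 0.5], [0.2, 0.1]]
obj_b = [[0.5, 1.0], [0.2, 0.1]]
beta = [0.5, 0.5]  # placeholder non-negative weights

s = combined_similarity(obj_a, obj_b, beta)
print(s)
```

A kernel space whose weight is zero drops out of the sum, which is one way to read the remark that uninformative, weakly coupled attributes are filtered out.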
Step 6: computing the outlier score of each object from the similarity between objects in the heterogeneous kernel spaces, ranking the scores with a top-k method, and selecting the most probable outliers, thereby detecting abnormal bank data;
the similarity between two objects is computed by equation (11); the sum of the similarities between object $i$ and all other objects is then divided by the number of those objects (object $i$ excluded) to obtain the normality of the object; the lower the normality, the more probable it is that the object is an outlier, so a top-k method is used to generate the outlier ranking and select the most probable outliers. The outlier score of the $i$-th object is defined as the sum of the similarities between object $i$ and all other objects, divided by the number of those objects, as shown in the following formula:
$$\mathrm{score}(o_i) = \frac{1}{n_o - 1} \sum_{i' \neq i} S_{ii'} \quad (12)$$
through similarity learning, the similarity between objects can be computed, and clients with low similarity rank higher in the anomaly ranking. When client data is assessed for anomalies, the highly ranked objects need extra attention to check whether fraud is involved, so that fraudulent behavior can be identified clearly.
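The scoring and top-k selection of Step 6 can be sketched as below, assuming the score is the mean similarity to all other objects, with the lowest scores ranked first. The similarity matrix is invented for illustration and the function names are hypothetical.

```python
def normality_scores(S):
    """Mean similarity of each object to all other objects."""
    n = len(S)
    return [
        sum(S[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def top_k_outliers(S, k):
    """Indices of the k objects with the lowest normality."""
    scores = normality_scores(S)
    ranking = sorted(range(len(S)), key=lambda i: scores[i])
    return ranking[:k]


# Symmetric toy similarity matrix for four users; user 3 is unlike the rest.
S = [
    [1.00, 0.90, 0.80, 0.10],
    [0.90, 1.00, 0.85, 0.20],
    [0.80, 0.85, 1.00, 0.15],
    [0.10, 0.20, 0.15, 1.00],
]
scores = normality_scores(S)
print(scores)
print(top_k_outliers(S, 1))
```

Excluding the self-similarity on the diagonal keeps an object's score from being inflated by its similarity to itself.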
After the outliers have been detected by the method of the invention, they can be removed, since outliers are the few data points in the data set that differ significantly from the mainstream data. The remaining normal data can then be analyzed so that the bank's protective measures can be improved, and the condition of the customers can be analyzed in order to formulate better customer service schemes, bank positioning schemes and the like. The detected outliers can also be analyzed in their own right, since this small set of data objects may carry important information.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (7)

1. A bank abnormal data detection method based on outlier detection is characterized in that: the method comprises the following steps:
Step 1: defining bank user information data in the form of a triple;
Step 2: learning the couplings within the attributes of the user information data and constructing an attribute in-coupling space;
Step 3: learning the couplings between the attributes of the user information data and constructing an inter-attribute coupling space;
Step 4: performing heterogeneity learning in the kernel spaces of the user information data by means of the constructed attribute in-coupling space and inter-attribute coupling space;
Step 5: performing similarity learning on the user information data and determining the similarity between objects in the heterogeneous kernel spaces;
Step 6: computing the outlier score of each object from the similarity between objects in the heterogeneous kernel spaces, ranking the scores with a top-k method, and selecting the most probable outliers, thereby detecting abnormal bank data.
2. The bank abnormal data detection method based on outlier detection according to claim 1, characterized in that: the specific method of the step 1 comprises the following steps:
the user information data are defined as a triple $C = \langle O, A, V \rangle$, wherein $O = \{o_i \mid i \in N_o\}$ is a set of $n_o$ user objects; $A = \{a_j \mid j \in N_a\}$ is a set of $n_a$ attributes; $V = \bigcup_{j=1}^{n_a} V^{(j)}$ is a set of $n_v$ attribute values, in which $V^{(j)} = \{v_i^{(j)} \mid i \in N_v^{(j)}\}$ is the set of the $n_v^{(j)}$ attribute values of attribute $a_j$; $N_o$, $N_a$ and $N_v^{(j)}$ are the index sets of the objects in $O$, the attributes in $A$ and the categorical values in $V^{(j)}$, respectively; for the $i$-th object $o_i$, the value of the $j$-th attribute $a_j$ is expressed as
Figure FDA0003032522650000017
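The triple described in claim 2 can be pictured as a small in-memory table. The following is only a minimal sketch under stated assumptions: the class name `CategoricalData`, its field names and the sample bank attributes are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class CategoricalData:
    """Hypothetical container for the triple C = <O, A, V>."""
    objects: list      # O: one identifier per user object
    attributes: list   # A: attribute names
    rows: list         # rows[i][j]: value of attribute j for object i

    def values(self, j):
        """V^(j): distinct values of attribute j, in first-seen order."""
        seen = []
        for row in self.rows:
            if row[j] not in seen:
                seen.append(row[j])
        return seen


# Illustrative data only: four customers, two categorical attributes.
data = CategoricalData(
    objects=["c1", "c2", "c3", "c4"],
    attributes=["account_type", "region"],
    rows=[
        ["savings", "north"],
        ["savings", "south"],
        ["credit", "north"],
        ["savings", "north"],
    ],
)
```

Here `data.values(0)` plays the role of the value set of the first attribute, and the positions in `objects`, `attributes` and each value list play the role of the index sets.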
3. The bank abnormal data detection method based on outlier detection according to claim 2, characterized in that: the specific method of the step 2 comprises the following steps:
attribute in-coupling represents the interaction between the attribute values within an attribute and the distribution of the values in that attribute; the attribute in-coupling is measured from the intra-attribute distribution through a value frequency function; for a categorical value $v_i^{(j)} \in V^{(j)}$ of the $j$-th attribute, the value frequency function $m_{Ia}^{(j)}(\cdot)$ maps the attribute in-coupling between this categorical value and the other categorical values of the attribute into a one-dimensional attribute in-coupling vector, as shown in the following formula:
$$ m_{Ia}^{(j)}(v_i^{(j)}) = \frac{|g^{(j)}(v_i^{(j)})|}{n_o} \qquad (1) $$
wherein $g^{(j)}(\cdot): V^{(j)} \to O$ maps the value $v_i^{(j)}$ to the set of objects that have the value $v_i^{(j)}$ in the $j$-th attribute, and $|\cdot|$ represents the count of a set;
then the attribute in-coupling space $M_{Ia}^{(j)}$ of the $j$-th attribute is formed from the attribute in-coupling vectors obtained by equation (1) and is defined as follows:
$$ M_{Ia}^{(j)} = \{ m_{Ia}^{(j)}(v_i^{(j)}) \mid v_i^{(j)} \in V^{(j)} \} \qquad (2) $$
thus, for user information data having $n_a$ attributes, the attribute in-coupling space is $M_{Ia} = \{M_{Ia}^{(1)}, \ldots, M_{Ia}^{(n_a)}\}$.
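The value frequency function of claim 3 can be illustrated with a few lines of Python. This is a sketch, not the patented implementation: it assumes that the frequency of a value is the number of objects taking that value divided by the total number of objects, and the function name `intra_attribute_coupling` and the sample column are hypothetical.

```python
from collections import Counter


def intra_attribute_coupling(column):
    """Map each distinct value of one attribute to its relative frequency.

    Assumption: frequency = (number of objects with the value) / n_o.
    """
    n_o = len(column)
    counts = Counter(column)
    return {value: count / n_o for value, count in counts.items()}


# Illustrative column: the account type of four customers.
account_type = ["savings", "savings", "credit", "savings"]
m_ia = intra_attribute_coupling(account_type)
```

Applying the function to every attribute column in turn would give one in-coupling space per attribute, which together make up the attribute in-coupling space of the data set.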
4. The bank abnormal data detection method based on outlier detection according to claim 3, characterized in that: the specific method of the step 3 comprises the following steps:
inter-attribute coupling refers to the interaction between the context information of an attribute and the attribute values of the other attributes; this attribute-based interaction and context information supplement the value distribution and interaction captured by the attribute in-coupling;
the inter-attribute coupling is represented by an information conditional probability, which reveals the distribution of an attribute value in the space spanned by the values of the other attributes; given an attribute value $v^{(j)}$ of attribute $a_j$ and an attribute value $v^{(k)}$ of attribute $a_k$, the information conditional probability function $p(v^{(j)} \mid v^{(k)})$ between the two attribute values is defined as follows:
$$ p(v^{(j)} \mid v^{(k)}) = \frac{|g^{(j)}(v^{(j)}) \cap g^{(k)}(v^{(k)})|}{|g^{(k)}(v^{(k)})|} \qquad (3) $$
wherein $\cap$ returns the intersection of two sets;
the inter-attribute coupling learning function $m_{Ie}^{(j)}(\cdot)$, based on the information conditional probability function, embeds the interaction of the attribute value $v_i^{(j)}$ with the other attributes as a $|V_*|$-dimensional inter-attribute coupling vector, as shown in the following formula:
$$ m_{Ie}^{(j)}(v_i^{(j)}) = [\, p(v_i^{(j)} \mid v_{*1}), \ldots, p(v_i^{(j)} \mid v_{*|V_*|}) \,]^{\top} \qquad (4) $$
wherein $V_* = \{V^{(k)} \mid k \in N_a, k \neq j\}$ is the set of the attribute values of all attributes other than $a_j$, and $v_{*i} \in V_*$ is an attribute value in the set $V_*$;
the inter-attribute coupling space $M_{Ie}^{(j)}$ of the $j$-th attribute is composed of the inter-attribute coupling vectors obtained by equation (4), as shown in the following formula:
$$ M_{Ie}^{(j)} = \{ m_{Ie}^{(j)}(v_i^{(j)}) \mid v_i^{(j)} \in V^{(j)} \} \qquad (5) $$
thus, for user information data having $n_a$ attributes, the inter-attribute coupling space is $M_{Ie} = \{M_{Ie}^{(1)}, \ldots, M_{Ie}^{(n_a)}\}$.
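The inter-attribute coupling of claim 4 can be sketched as follows. The sketch assumes that the information conditional probability is the share of the objects holding one value that also hold the other value, in line with the intersection mentioned in the claim. The helper names (`object_set`, `conditional_probability`, `inter_attribute_vector`), the sorted ordering of the other attributes' values and the sample data are hypothetical.

```python
def object_set(column, value):
    """Indices of the objects that take `value` in this attribute."""
    return {i for i, v in enumerate(column) if v == value}


def conditional_probability(col_j, v_j, col_k, v_k):
    """Share of the objects with v_k (attribute k) that also have v_j (attribute j)."""
    g_j = object_set(col_j, v_j)
    g_k = object_set(col_k, v_k)
    return len(g_j & g_k) / len(g_k) if g_k else 0.0


def inter_attribute_vector(columns, j, v_j):
    """Coupling vector of value v_j of attribute j against every value of the other attributes."""
    vector = []
    for k, col_k in enumerate(columns):
        if k == j:
            continue  # only attributes other than j contribute
        for v_k in sorted(set(col_k)):  # fixed order so vectors are comparable
            vector.append(conditional_probability(columns[j], v_j, col_k, v_k))
    return vector


# Illustrative data: attribute 0 is the account type, attribute 1 the region.
columns = [
    ["savings", "savings", "credit", "savings"],
    ["north", "south", "north", "north"],
]
vec = inter_attribute_vector(columns, 0, "savings")
```

The resulting vector has one entry per value of the other attributes, so its length equals the number of such values. Collecting the vectors of all values of one attribute gives that attribute's inter-attribute coupling space.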
5. The bank anomaly data detection method based on outlier detection according to claim 4, wherein: the specific method of the step 4 comprises the following steps:
through the constructed attribute in-coupling space and inter-attribute coupling space, a complete set $M$ of heterogeneous coupling spaces is further constructed; this set is the union of the heterogeneous attribute in-coupling space $M_{Ia}$ and inter-attribute coupling space $M_{Ie}$, as shown in the following formula:
$$ M = M_{Ia} \cup M_{Ie} \qquad (6) $$
namely,
Figure FDA0003032522650000031
each coupling space is converted into its corresponding kernel spaces using a plurality of kernels, wherein each kernel space corresponds to one converted coupling space; a set of $n_k$ kernel spaces $\{K_1, \ldots, K_{n_k}\}$ is generated, where $F$ is the set of kernel functions used for the conversion; the $p$-th kernel space is spanned by the kernel matrix $K_p$, $p \le n_k$; the kernel matrix $K_p$ is formed from the coupling space $M_j$ through the kernel function $k_p(\cdot,\cdot)$, as shown in the following equation:
Figure FDA0003032522650000033
wherein
Figure FDA0003032522650000034
is the number of attribute values represented by $M_j$, and $m_j$ is the element of the coupling space $M_j$ that represents the $j$-th attribute value;
through a set of transformation matrices $\{T_1, \ldots, T_{n_k}\}$, the kernel spaces $\{K_1, \ldots, K_{n_k}\}$ are reconstructed as $\{K'_1, \ldots, K'_{n_k}\}$, wherein the $p$-th kernel matrix $K'_p$ contains only the distribution to which the $p$-th kernel is sensitive and which fits the corresponding coupling; the reconstructed kernel spaces are called heterogeneous kernel spaces; $K'_p$ is defined as:
$$ K'_p = T_p \cdot K_p \qquad (8) $$
$T_p$ is specified as a diagonal matrix, as shown in the following equation:
Figure FDA0003032522650000037
wherein $\alpha_{pj}$ is the weight of the $j$-th attribute value in the $p$-th kernel space.
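The kernel conversion and the diagonal reweighting of claim 5 can be sketched in pure Python. Several things here are assumptions rather than content of the patent: a Gaussian kernel stands in for one member of the kernel set (the claim does not name the kernels), the weights in `alpha` are fixed by hand instead of being learned, and all function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Assumed example kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(coupling_space, kernel):
    """Entry [a][b] is kernel(m_a, m_b) over the value vectors of one coupling space."""
    return [[kernel(m_a, m_b) for m_b in coupling_space] for m_a in coupling_space]


def apply_diagonal_transform(alpha, K):
    """Compute T * K for T = diag(alpha): row j of K is scaled by alpha[j]."""
    return [[alpha[j] * entry for entry in K[j]] for j in range(len(K))]


# Coupling space of one attribute with two values, each represented by a
# one-dimensional in-coupling vector (its relative frequency).
space = [[0.75], [0.25]]
K = kernel_matrix(space, gaussian_kernel)

# Placeholder weights, one per attribute value; a real system would learn them.
K_prime = apply_diagonal_transform([1.0, 0.5], K)
```

Because the transformation matrix is diagonal, multiplying it into the kernel matrix only rescales rows, so each attribute value keeps its own weight in each kernel space.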
6. The bank abnormal data detection method based on outlier detection according to claim 5, characterized in that: the specific method of the step 5 comprises the following steps:
a similarity measure between objects in the heterogeneous kernel spaces is first defined, and a weight for each kernel space is then learned on the basis of this similarity measure to reflect the contribution of that space; given an attribute data set, for the $p$-th kernel matrix, let $i$ and $i'$ denote the indices of the values in the $p$-th kernel space that correspond to the $i$-th object and the $i'$-th object, respectively, and let $K'_{p,i}$ denote the $i$-th object in the $p$-th heterogeneous kernel space; the similarity $S_{p,ii'}$ of the $i$-th and $i'$-th objects in that space, measured by a linear kernel, is as shown in the following equation:
Figure FDA0003032522650000038
Figure FDA0003032522650000041
the final similarity $S_{ii'}$ between the $i$-th object and the $i'$-th object is defined as a linear combination of the similarity measures from the heterogeneous kernel spaces, so as to filter redundant information and integrate complementary information between the couplings, as shown in the following equation:
Figure FDA0003032522650000042
wherein $\beta_p \ge 0$ is the weight of the similarity in the $p$-th heterogeneous kernel space, and
Figure FDA0003032522650000043
represents a diagonal matrix.
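The similarity of claim 6 can be sketched as a weighted sum of linear-kernel similarities. The sketch assumes that an object is represented in each heterogeneous kernel space by the column of the reweighted kernel matrix belonging to the object's attribute value. The weights in `betas`, the `value_index` lookup and the small matrices are hypothetical and set by hand, whereas the claim states that the weights are learned.

```python
def column(matrix, c):
    """Column c of a matrix stored as a list of rows."""
    return [row[c] for row in matrix]


def dot(x, y):
    """Linear kernel: plain inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def object_similarity(kernels, value_index, betas, i, i2):
    """Weighted sum over kernel spaces of the linear-kernel similarity of objects i and i2."""
    total = 0.0
    for K_p, idx_p, beta_p in zip(kernels, value_index, betas):
        total += beta_p * dot(column(K_p, idx_p[i]), column(K_p, idx_p[i2]))
    return total


# Two reweighted kernel matrices (illustrative numbers only).
kernels = [
    [[1.0, 0.5], [0.5, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
# value_index[p][i]: which column of kernel matrix p represents object i.
value_index = [
    [0, 0, 1],
    [0, 1, 1],
]
betas = [0.5, 0.5]  # placeholder non-negative weights

s_01 = object_similarity(kernels, value_index, betas, 0, 1)
s_12 = object_similarity(kernels, value_index, betas, 1, 2)
```

Evaluating the function for every pair of objects would fill the similarity matrix used in the final ranking step.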
7. The bank abnormal data detection method based on outlier detection according to claim 6, characterized in that: in step 6, the outlier score of the $i$-th object is defined as the sum of the similarities between object $i$ and all objects, divided by the number of objects, as shown in the following formula:
Figure FDA0003032522650000044
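The scoring and top-k selection of claims 1 and 7 can be sketched as below. Two points are assumptions: the sum runs over all objects including the object itself, and objects are ranked from the lowest mean similarity upward, following the description's statement that customers with low similarity rank higher in abnormality. The function names and the similarity matrix are hypothetical.

```python
def outlier_scores(S):
    """Mean similarity of each object to all objects (row sum divided by object count)."""
    n = len(S)
    return [sum(row) / n for row in S]


def top_k_outliers(S, k):
    """Indices of the k objects with the lowest mean similarity, most anomalous first."""
    scores = outlier_scores(S)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:k]


# Illustrative symmetric similarity matrix for four customers;
# customer 3 is dissimilar to everyone else.
S = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
suspects = top_k_outliers(S, 2)
```

With this matrix, customer 3 has the lowest mean similarity and is returned first, followed by customer 2. In practice `k` would be chosen according to how many records the bank can review.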
CN202110434414.5A 2021-04-22 2021-04-22 Bank abnormal data detection method based on outlier detection Active CN113239024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110434414.5A CN113239024B (en) 2021-04-22 2021-04-22 Bank abnormal data detection method based on outlier detection

Publications (2)

Publication Number Publication Date
CN113239024A true CN113239024A (en) 2021-08-10
CN113239024B CN113239024B (en) 2023-11-07

Family

ID=77128868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110434414.5A Active CN113239024B (en) 2021-04-22 2021-04-22 Bank abnormal data detection method based on outlier detection

Country Status (1)

Country Link
CN (1) CN113239024B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105844102A (en) * 2016-03-25 2016-08-10 中国农业大学 Self-adaptive parameter-free spatial outlier detection algorithm
CN106446256A (en) * 2016-10-17 2017-02-22 鞍钢集团矿业有限公司 Real-time industrial production information sensation system based on context calculation
CN106503086A (en) * 2016-10-11 2017-03-15 成都云麒麟软件有限公司 The detection method of distributed local outlier
CN108038211A (en) * 2017-12-13 2018-05-15 南京大学 A kind of unsupervised relation data method for detecting abnormality based on context
CN109525453A (en) * 2018-11-02 2019-03-26 长沙学院 Networking CPS method for detecting abnormality and system based on node dependence
CN110826686A (en) * 2018-08-07 2020-02-21 艾玛迪斯简易股份公司 Machine learning system and method with attribute sequence
US20200089556A1 (en) * 2018-09-18 2020-03-19 Nec Laboratories America, Inc. Anomalous account detection from transaction data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘露: "异质信息网络中离群点检测方法研究", 《中国博士学位论文全文数据库信息科技辑(月刊)》, no. 09 *

Also Published As

Publication number Publication date
CN113239024B (en) 2023-11-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant