CN112381546A - Method for detecting abnormal risk account based on time series clustering - Google Patents
- Publication number
- CN112381546A (application CN202011389228.6A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- transaction
- time
- data
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q20/00—Payment architectures, schemes or protocols
- G06Q20/38—Payment protocols; Details thereof
- G06Q20/40—Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
- G06Q20/401—Transaction verification
- G06Q20/4016—Transaction verification involving fraud or risk level assessment in transaction processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
Abstract
The invention discloses a method for detecting abnormal risk accounts based on time series clustering. Starting from the current state of, and existing problems in, e-commerce transaction risk assessment, and guided by risk management methodology combined with an analysis of the business characteristics of risky transactions, the invention studies the methods used at home and abroad to evaluate transaction risk on e-commerce platforms and, on that basis, designs a method for detecting abnormal risk accounts based on time series clustering. Compared with traditional risk-control rules and direct clustering, the method uses time series data to broaden the range of data being clustered, so that identical or similar behaviors are grouped together along the time dimension, and risk accounts engaged in behaviors such as fake-order brushing, illegal arbitrage and scalper rush-buying can be detected effectively.
Description
Technical Field
The invention relates to the field of emerging information technology, and in particular to a method for detecting abnormal risk accounts based on time series clustering.
Background
In recent years electronic commerce has developed rapidly, and ordering and shopping online has become one of the most important forms of consumption. Behind this prosperity, a large number of transaction risk problems, such as fake-order brushing, illegal arbitrage and scalper rush-buying, have broken out on some e-commerce platforms. Outdated transaction risk management has become a bottleneck that hinders the healthy development of these platforms: it causes heavy losses to the platforms, harms the rights and interests of ordinary customers, and leads the public to question the fairness, impartiality and authenticity of the platforms. E-commerce companies have little experience in transaction risk management, started risk management relatively late, and assess transaction risk mainly on the basis of the experience of business personnel. As business volume and business complexity grow, and especially as black-market industry chains such as scalper companies and paid-poster ("water army") companies become more professional, the traditional approach to transaction risk management can no longer meet the needs of risk management. Building a scientific and intelligent transaction risk assessment system to detect abnormal risk accounts and black-market accounts is therefore of great significance for the healthy development of e-commerce platforms.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for detecting abnormal risk accounts based on time series clustering.
In order to solve the technical problems, the invention provides the following technical scheme:
The invention discloses a method for detecting abnormal risk accounts based on time series clustering, which comprises the following steps:
step 1, data acquisition: acquiring user transaction data, user operation data and user basic attribute data for the area under study, wherein the transaction data and operation data include the detailed name and time of each operation and transaction, and the basic attribute data include a unique user identifier, blacklist status, geographic-location information and the like;
step 2, data preprocessing: grouping the data acquired in step 1 by user, and sorting the data within each user group by the time at which each operation and transaction occurred;
step 3, generating time sequences: from the data processed in step 2, arranging each user's operation names and transaction names in time order to form a first sequence, the operation-transaction name sequence; arranging the time points of each user's operations and transactions in time order to form a second sequence, the operation-transaction time point sequence; and subtracting each time point from the one that follows it to form a third sequence, the operation-transaction time interval sequence;
step 4, time series numeralization: for the first operation-transaction name sequence, modeling the name sequence with a seq2seq method, then using the model to vectorize each name in the sequence, and finally summing the vectors of all names in the sequence and taking their average; for the second operation-transaction time point sequence, subtracting a fixed time node (such as January 1, 2020) from each time point of the sequence and storing the result in days, hours, minutes or seconds as required; for the third operation-transaction time interval sequence, converting directly into days, hours, minutes or seconds as required;
step 5, clustering and grouping the time series: taking the vector of the first operation-transaction name sequence from step 4 as input, clustering with the MeanShift clustering algorithm so that users whose operations and transactions are similar over time fall into the same group, and numbering each group;
step 6, calculating indexes within each time series group: calculating business indexes for the users in each group, such as the proportion of blacklisted samples, the average operation-transaction time and the average sequence length, and determining the users in groups that exceed the business index thresholds to be abnormal risk accounts.
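The averaging at the end of step 4 can be illustrated with a minimal sketch. The per-name vectors below are hypothetical placeholders standing in for the output of a trained seq2seq model, and `mean_pool` and `name_vectors` are illustrative names that do not appear in the patent.

```python
def mean_pool(vectors):
    """Element-wise average of a list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot pool an empty sequence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical per-name vectors; in the patent these would come from the model.
name_vectors = {
    "login": [1.0, 0.0],
    "browse": [0.0, 1.0],
    "pay": [1.0, 1.0],
}

# One user's operation-transaction name sequence, already in time order.
sequence = ["login", "browse", "pay", "pay"]

# A single fixed-length vector representing the whole sequence.
user_vector = mean_pool([name_vectors[name] for name in sequence])
```

Averaging gives every user a vector of the same dimension regardless of sequence length, which is what a distance-based clustering step needs as input.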
Compared with the prior art, the invention has the following beneficial effects:
The invention designs a method for detecting abnormal risk accounts based on time series clustering. Compared with traditional risk-control rules and direct clustering, the method uses time series data to broaden the range of data being clustered, so that identical or similar behaviors are grouped together along the time dimension, and risk accounts engaged in behaviors such as fake-order brushing, illegal arbitrage and scalper rush-buying can be detected effectively.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a general schematic of the system of the present invention;
FIG. 2 is a diagram of the seq2seq model of the present invention;
FIG. 3 is a diagram of a MeanShift clustering model of the present invention;
FIG. 4 is a computed composite index map of the present invention;
FIG. 5 is a first schematic diagram of an embodiment of the present invention;
FIG. 6 is a second schematic diagram of an embodiment of the present invention;
FIG. 7 is a third schematic diagram of an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1
In the method for detecting abnormal risk accounts based on time series clustering provided by this embodiment of the invention, users are clustered into groups according to the similarity of the time series built from their operation data and transaction data, and various risk indexes are then calculated within each group in order to identify the abnormal risk accounts. The specific implementation steps are shown in fig. 1.
The following is a detailed description:
First, comprehensive data for users over a period of time are acquired, including transaction data, operation data and basic attribute data. Taking Winged Payment as an example, the fields of the transaction data include the user id, transaction name, transaction type, whether a marketing discount was applied, and transaction time; the fields of the operation data include the user id, operation name, operation type and operation time; the fields of the user's basic attributes include the user id, home province and whether the user is blacklisted. After the data are received, the three types of data are joined on the user id, and the joined data are deduplicated, cleaned and so on. Once joined, the data are sorted by user and then by the order in which transactions and operations occurred.
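The join, deduplication and sort described above can be sketched in a few lines of standard-library Python. The records, field names and the `build_event_table` helper are hypothetical and chosen only for illustration; the cleaning rule used here (dropping users without an attribute record) is likewise an assumption, since the patent does not specify its cleaning rules.

```python
from datetime import datetime

# Hypothetical sample records; field names are illustrative only.
transactions = [
    {"user_id": "u1", "name": "pay_bill", "time": "2020-11-02 10:00:05"},
    {"user_id": "u1", "name": "pay_bill", "time": "2020-11-02 10:00:05"},  # exact duplicate
    {"user_id": "u2", "name": "top_up", "time": "2020-11-03 09:30:00"},
]
operations = [
    {"user_id": "u1", "name": "login", "time": "2020-11-02 09:59:50"},
    {"user_id": "u2", "name": "login", "time": "2020-11-03 09:29:00"},
    {"user_id": "u3", "name": "login", "time": "2020-11-03 12:00:00"},  # no attribute record
]
attributes = {
    "u1": {"province": "Shanghai", "blacklisted": True},
    "u2": {"province": "Beijing", "blacklisted": False},
}


def build_event_table(transactions, operations, attributes):
    """Join the three sources on user_id, drop duplicates, sort by user and time."""
    seen = set()
    events = []
    for kind, rows in (("transaction", transactions), ("operation", operations)):
        for row in rows:
            key = (row["user_id"], kind, row["name"], row["time"])
            # Assumed cleaning rule: skip exact duplicates and unknown users.
            if key in seen or row["user_id"] not in attributes:
                continue
            seen.add(key)
            events.append({
                "user_id": row["user_id"],
                "kind": kind,
                "name": row["name"],
                "time": datetime.strptime(row["time"], "%Y-%m-%d %H:%M:%S"),
                **attributes[row["user_id"]],
            })
    # Sort by user first, then chronologically within each user.
    events.sort(key=lambda e: (e["user_id"], e["time"]))
    return events


events = build_event_table(transactions, operations, attributes)
```

Each resulting row carries the user's attributes alongside the event, so later steps can read the name sequence and the blacklist flag from the same table.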
Next, the sorted data are preprocessed. A large number of transaction and operation names each account for only a small share of the data set, which would harm the accuracy of later modeling; therefore the count and share of each name across the whole data set are computed first, and names with small counts and low shares are deleted. The time nodes at which transactions and operations occur are not numerical data, so no numerical calculation can be performed on them directly; the time node data are therefore converted into numerical data in two different ways. The first forms a time node sequence by subtracting a fixed time node from each time node, and the second forms a time interval sequence by subtracting each time node from the one that follows it.
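As a rough illustration of this preprocessing, the sketch below filters out rare names and converts timestamps both ways. The `min_share` cutoff, the function names and the sample values are assumptions made for the example; the patent does not state a specific threshold or unit.

```python
from collections import Counter
from datetime import datetime


def drop_rare_names(names, min_share):
    """Keep only names whose share of all events is at least min_share."""
    counts = Counter(names)
    total = len(names)
    return [n for n in names if counts[n] / total >= min_share]


def to_offsets(times, reference, unit_seconds=1):
    """Time point sequence: each time minus a fixed reference, in the chosen unit."""
    return [(t - reference).total_seconds() / unit_seconds for t in times]


def to_intervals(times, unit_seconds=1):
    """Time interval sequence: each time minus the one before it."""
    return [(b - a).total_seconds() / unit_seconds for a, b in zip(times, times[1:])]


names = ["login", "pay", "login", "pay", "login",
         "pay", "login", "pay", "login", "rare_op"]
times = [
    datetime(2020, 1, 1, 0, 0, 0),
    datetime(2020, 1, 1, 0, 1, 0),
    datetime(2020, 1, 1, 1, 1, 0),
]
reference = datetime(2020, 1, 1)  # fixed time node, as in the patent's example

kept = drop_rare_names(names, min_share=0.2)       # hypothetical cutoff
offsets_min = to_offsets(times, reference, 60)      # minutes since the reference
intervals_min = to_intervals(times, 60)             # minutes between events
```

In a full pipeline the rare-name filter would be applied to whole event rows so that names and timestamps stay aligned; the lists are kept separate here only to keep the example short.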
After data preprocessing is complete, the transaction-operation name sequence is encoded numerically, using the seq2seq algorithm shown in fig. 2. Seq2Seq, short for Sequence-to-Sequence, is a general Encoder-Decoder framework. It is a very important and popular sequence model in current natural language processing: it breaks out of the traditional fixed-size-input framework, opens the way to applying classical deep neural network models to sequence tasks such as translation and intelligent question answering, and has been shown to perform very well in translation between major languages and in short question answering in voice assistants. The Seq2Seq task refers to the problem of mapping one sequence to another, where a sequence is understood here as a sequence of strings (such as the transaction and operation names in this patent); when, given one string sequence, we want to obtain another string sequence (for example its translation, or a semantically corresponding sequence), the task can be called Seq2Seq. Seq2Seq is a neural network with an Encoder-Decoder structure, as shown in fig. 5; its input is a sequence and its output is also a sequence, hence the name "Seq2Seq". The Encoder converts a variable-length sequence into a fixed-length vector representation, and the Decoder converts this fixed-length vector into the variable-length target sequence. The most basic Seq2Seq model comprises three parts (some of which are not shown in fig. 5): an Encoder, a Decoder, and an intermediate state vector C connecting the two. The Encoder learns from the input to encode it into a fixed-size state vector C (also called the semantic encoding) and passes C to the Decoder, and the Decoder learns from the state vector C to output the corresponding sequence.
The basic Seq2Seq model has several drawbacks. First, when the Encoder encodes the input into a fixed-size state vector (hidden state), it is in effect performing lossy compression of the information: the more information the input carries, the more is lost in the conversion. Second, as the sequence grows longer, that is, longer in the time dimension, the RNN model suffers from vanishing gradients. Finally, the only component connecting the Encoder and Decoder modules in the basic model is a single fixed-size state vector, so the Decoder cannot attend directly to the finer details of the input. Because of these deficiencies, this patent uses a Seq2Seq model based on Attention. The Attention mechanism was introduced to solve the problems of the basic Seq2Seq model. Its distinguishing feature is that the whole input sequence is no longer encoded into a single fixed-length intermediate semantic vector; instead, a new context vector is computed for each word currently being generated, so that the input to the Decoder differs at every time step and the loss of word information is mitigated. An Encoder-Decoder model with Attention is shown in FIG. 6.
A plain Encoder-Decoder framework cannot focus effectively on the relevant part of the input, so a model such as seq2seq does not reach its full potential when used alone. For example, in fig. 6, the encoder encodes the input into a context variable C, and every output Y is decoded from this same C indiscriminately. What the attention model does is have the encoder produce a different C for each time step of the sequence; when decoding, each output is produced in combination with its own C, so the result is more accurate. In this patent the input sequence and the output sequence are the same, namely the name sequence of operations and transactions. The model is trained iteratively until it converges, and the Encoder part of the model is saved after convergence. When the model is used, the operation-transaction name sequence is fed to the Encoder part, which outputs a numerical vector; this vector represents the operation-transaction name sequence, and the numerical encoding of the name sequence is thereby achieved.
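The idea of a different context C for each decoding step can be made concrete with a small numerical sketch. It assumes dot-product scoring followed by a softmax, which is one common choice; the patent does not say which scoring function its model uses, and the vectors and function names below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    Scores are dot products between the decoder state and each encoder state
    (an assumed scoring function); the context is the weighted sum of the
    encoder states.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, h_j)) for h_j in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h_j[i] for w, h_j in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical encoder outputs for a three-name input sequence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two different decoder states yield two different context vectors.
w1, c1 = attention_context([2.0, 0.0], encoder_states)
w2, c2 = attention_context([0.0, 2.0], encoder_states)
```

Because the weights depend on the decoder state, `c1` leans toward the first input dimension and `c2` toward the second, whereas a basic Seq2Seq model would hand the same fixed vector to both steps.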
After the transaction-operation name sequences have been numerically encoded, the resulting numerical vectors need to be clustered; this patent mainly uses the MeanShift clustering algorithm shown in fig. 3. MeanShift is a hill-climbing algorithm based on kernel density estimation and can be used in applications such as clustering, image segmentation and tracking. Its key operation is to compute a shift vector for the center point from the change in data density within the region of interest, and to move the center point accordingly for the next iteration, until the position of maximum density is reached (the center point no longer moves). This can be done starting from each data point, and during the process the number of times each data point appears in the region of interest is counted; this count is the basis for the final classification. Unlike the K-Means algorithm, the MeanShift algorithm determines the number of classes automatically. Like K-Means, it uses the mean of the data points in the set to move the center point.
As shown in fig. 7, the steps of the MeanShift clustering algorithm are as follows:
1. randomly select one of the unmarked data points as the starting center point, center;
2. find all data points that lie within distance radius of center, record them as the set M, consider these points to belong to cluster C, and add 1 to the recorded number of times each of these points has appeared in the cluster;
3. taking center as the center point, compute the vector from center to each element of the set M, and combine these vectors (averaging them, in the standard mean-shift formulation) to obtain the vector shift;
4. set center = center + shift, that is, move center along the direction of shift by a distance of ||shift||;
5. repeat steps 2, 3 and 4 until shift is very small (that is, the iteration has converged), record the center at that moment, and note that all points encountered during the iteration are assigned to cluster C;
6. if, at convergence, the distance between the center of the current cluster C and the center of another existing cluster C2 is smaller than a threshold, merge C2 and C and merge the occurrence counts of their data points accordingly; otherwise, take C as a new cluster;
7. repeat steps 1 to 5 until all points have been marked as visited;
8. classification: for each point, take the cluster that visited it most often as the cluster to which the point belongs.
The present application uses the MeanShift algorithm to cluster and group the numerically encoded vectors of the transaction-operation name sequences, so that identical or similar transaction-operation name sequences are effectively placed in the same group.
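The steps above can be sketched as a small flat-kernel mean shift in standard-library Python. This is a simplified illustration under stated assumptions, not the patent's implementation: the window is a hard ball of the given radius, the center is moved to the mean of the points inside it, starting points are taken in index order rather than at random so the result is reproducible, and `radius`, `merge_threshold`, `tol` and `max_iter` are hypothetical parameter names.

```python
import math


def mean_shift(points, radius, merge_threshold, tol=1e-6, max_iter=100):
    """Flat-kernel mean shift. Returns (centers, labels)."""
    centers = []                      # converged cluster centers
    visits = []                       # visits[k][i]: times point i fell in cluster k's window
    visited = [False] * len(points)

    for start in range(len(points)):
        if visited[start]:
            continue                  # step 1: only start from unmarked points
        center = list(points[start])
        counts = [0] * len(points)
        for _ in range(max_iter):
            # Step 2: collect the points inside the window and count the visit.
            members = [i for i, p in enumerate(points) if math.dist(p, center) <= radius]
            if not members:
                break
            for i in members:
                counts[i] += 1
                visited[i] = True
            # Steps 3-4: move the center to the mean of the window.
            new_center = [
                sum(points[i][d] for i in members) / len(members)
                for d in range(len(center))
            ]
            shift = math.dist(new_center, center)
            center = new_center
            if shift < tol:           # step 5: converged
                break
        # Step 6: merge with an existing cluster if the centers are close.
        for k, existing in enumerate(centers):
            if math.dist(existing, center) < merge_threshold:
                visits[k] = [a + b for a, b in zip(visits[k], counts)]
                break
        else:
            centers.append(center)
            visits.append(counts)

    # Step 8: each point joins the cluster that visited it most often.
    labels = [
        max(range(len(centers)), key=lambda k: visits[k][i])
        for i in range(len(points))
    ]
    return centers, labels


# Two well-separated hypothetical groups of 2-D sequence vectors.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, labels = mean_shift(points, radius=1.0, merge_threshold=0.5)
```

The number of clusters is not passed in; it falls out of the radius and the data, which matches the property the description contrasts with K-Means.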
The last important step of the method is to calculate the risk-related business indexes within each group. The main indexes include the proportion of blacklisted users in the group, the proportion of marketing users in the group, the distribution of the group's users by home province, the average sequence length of users in the group, and the average time interval between transactions and operations in the group. Then, according to the business rules, any group whose calculated indexes exceed the thresholds is an abnormal group, and the accounts within an abnormal group are abnormal risk accounts.
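A minimal sketch of this last step follows, covering a subset of the indexes named above. The user records, field names and thresholds are hypothetical; the patent leaves the concrete business rules to the operator, so the rule used here (a high blacklist ratio or a very short average interval) is only one possible example.

```python
def group_indices(users):
    """Aggregate per-group risk indexes from per-user records."""
    groups = {}
    for u in users:
        groups.setdefault(u["group"], []).append(u)
    result = {}
    for gid, members in groups.items():
        n = len(members)
        result[gid] = {
            "size": n,
            "black_ratio": sum(m["blacklisted"] for m in members) / n,
            "marketing_ratio": sum(m["used_marketing"] for m in members) / n,
            "avg_seq_len": sum(m["seq_len"] for m in members) / n,
            "avg_interval_s": sum(m["mean_interval_s"] for m in members) / n,
        }
    return result


def flag_abnormal_groups(indices, max_black_ratio, min_avg_interval_s):
    """Assumed business rule: too many blacklisted users or implausibly fast activity."""
    return sorted(
        gid for gid, ix in indices.items()
        if ix["black_ratio"] > max_black_ratio or ix["avg_interval_s"] < min_avg_interval_s
    )


# Hypothetical per-user records after clustering (group = cluster number).
users = [
    {"id": "u1", "group": 0, "blacklisted": True, "used_marketing": True, "seq_len": 10, "mean_interval_s": 2.0},
    {"id": "u2", "group": 0, "blacklisted": True, "used_marketing": True, "seq_len": 12, "mean_interval_s": 3.0},
    {"id": "u3", "group": 0, "blacklisted": False, "used_marketing": True, "seq_len": 14, "mean_interval_s": 4.0},
    {"id": "u4", "group": 1, "blacklisted": False, "used_marketing": False, "seq_len": 4, "mean_interval_s": 600.0},
    {"id": "u5", "group": 1, "blacklisted": False, "used_marketing": True, "seq_len": 6, "mean_interval_s": 1200.0},
]

indices = group_indices(users)
abnormal_groups = flag_abnormal_groups(indices, max_black_ratio=0.5, min_avg_interval_s=10.0)
risk_accounts = [u["id"] for u in users if u["group"] in abnormal_groups]
```

Note that `u3` is flagged even though it is not blacklisted: membership in an abnormal group is what marks an account as risky, which is how the method surfaces accounts that existing lists have missed.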
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (1)
1. A method for detecting abnormal risk accounts based on time series clustering is characterized by comprising the following steps:
step 1, data acquisition: acquiring user transaction data, user operation data and user basic attribute data for the area under study, wherein the user transaction data and operation data comprise the detailed name and time of each operation and transaction, and the user basic attribute data comprise a unique user identifier, blacklist status, geographic-location information and the like;
step 2, data preprocessing: grouping the data acquired in step 1 by user, and sorting the data within each user group by the time at which each operation and transaction occurred;
step 3, generating time sequences: from the data processed in step 2, arranging each user's operation names and transaction names in time order to form a first sequence, the operation-transaction name sequence; arranging the time points of each user's operations and transactions in time order to form a second sequence, the operation-transaction time point sequence; and subtracting each time point from the one that follows it to form a third sequence, the operation-transaction time interval sequence;
step 4, time series numeralization: for the first operation-transaction name sequence, modeling the name sequence with a seq2seq method, then using the model to vectorize each name in the sequence, and finally summing the vectors of all names in the sequence and taking their average; for the second operation-transaction time point sequence, subtracting a fixed time node (such as January 1, 2020) from each time point of the sequence and storing the result in days, hours, minutes or seconds as required; for the third operation-transaction time interval sequence, converting directly into days, hours, minutes or seconds as required;
step 5, clustering and grouping the time series: taking the vector of the first operation-transaction name sequence from step 4 as input, clustering with the MeanShift clustering algorithm so that users whose operations and transactions are similar over time fall into the same group, and numbering each group;
step 6, calculating indexes within each time series group: calculating business indexes for the users in each group, such as the proportion of blacklisted samples, the average operation-transaction time and the average sequence length, and determining the users in groups that exceed the business index thresholds to be abnormal risk accounts.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011389228.6A CN112381546A (en) | 2020-12-01 | 2020-12-01 | Method for detecting abnormal risk account based on time series clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011389228.6A CN112381546A (en) | 2020-12-01 | 2020-12-01 | Method for detecting abnormal risk account based on time series clustering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112381546A true CN112381546A (en) | 2021-02-19 |
Family
ID=74589852
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011389228.6A Pending CN112381546A (en) | 2020-12-01 | 2020-12-01 | Method for detecting abnormal risk account based on time series clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112381546A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116204843A (en) * | 2023-04-24 | 2023-06-02 | 北京芯盾时代科技有限公司 | Abnormal account detection method and device, electronic equipment and storage medium |
CN117010905A (en) * | 2023-10-08 | 2023-11-07 | 中国建设银行股份有限公司 | Dynamic identification processing method and device for transaction risk list data |
CN117455497A (en) * | 2023-11-12 | 2024-01-26 | 北京营加品牌管理有限公司 | Transaction risk detection method and device |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116204843A (en) * | 2023-04-24 | 2023-06-02 | 北京芯盾时代科技有限公司 | Abnormal account detection method and device, electronic equipment and storage medium |
CN117010905A (en) * | 2023-10-08 | 2023-11-07 | 中国建设银行股份有限公司 | Dynamic identification processing method and device for transaction risk list data |
CN117010905B (en) * | 2023-10-08 | 2023-12-29 | 中国建设银行股份有限公司 | Dynamic identification processing method and device for transaction risk list data |
CN117455497A (en) * | 2023-11-12 | 2024-01-26 | 北京营加品牌管理有限公司 | Transaction risk detection method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112381546A (en) | Method for detecting abnormal risk account based on time series clustering | |
CN112100369B (en) | Semantic-combined network fault association rule generation method and network fault detection method | |
CN111488582B (en) | Intelligent contract reentrant vulnerability detection method based on graph neural network | |
CN111159387B (en) | Recommendation method based on multi-dimensional alarm information text similarity analysis | |
CN111127146A (en) | Information recommendation method and system based on convolutional neural network and noise reduction self-encoder | |
CN103577592A (en) | Network community user friend recommending method based on character similarity matching calculation | |
CN117237559B (en) | Digital twin city-oriented three-dimensional model data intelligent analysis method and system | |
CN113283902B (en) | Multichannel blockchain phishing node detection method based on graphic neural network | |
CN109446519B (en) | Text feature extraction method fusing data category information | |
CN112417063B (en) | Heterogeneous relation network-based compatible function item recommendation method | |
CN115357728A (en) | Large model knowledge graph representation method based on Transformer | |
CN112328859A (en) | False news detection method based on knowledge-aware attention network | |
CN115049472B (en) | Unsupervised credit card anomaly detection method based on multidimensional feature tensor | |
CN111026852B (en) | Financial event-oriented hybrid causal relationship discovery method | |
CN112507224A (en) | Service recommendation method of man-machine object fusion system based on heterogeneous network representation learning | |
CN116934270A (en) | Library book borrowing management system based on data analysis | |
Bakirli et al. | DTreeSim: A new approach to compute decision tree similarity using re-mining | |
CN106097090A (en) | A kind of taxpayer interests theoretical based on figure associate group's recognition methods | |
CN109033952A (en) | M-sequence recognition methods based on sparse self-encoding encoder | |
Velikova et al. | Decision trees for monotone price models | |
CN112069392B (en) | Method and device for preventing and controlling network-related crime, computer equipment and storage medium | |
CN113988083A (en) | Factual information coding and evaluating method for shipping news abstract generation | |
CN115578100A (en) | Payment verification mode identification method and device, electronic equipment and storage medium | |
Frank et al. | Applications of neural networks to telecommunications systems | |
CN118395985B (en) | Named entity identification method based on knowledge distillation and variation self-encoder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20210219 |