CN112381546A - Method for detecting abnormal risk account based on time series clustering - Google Patents

Publication number
CN112381546A
Authority
CN
China
Prior art keywords
sequence
transaction
time
data
clustering
Prior art date
Legal status
Pending
Application number
CN202011389228.6A
Other languages
Chinese (zh)
Inventor
施炎
徐德华
徐华建
余杰潮
汤敏伟
李�真
Current Assignee
Tianyi Electronic Commerce Co Ltd
Original Assignee
Tianyi Electronic Commerce Co Ltd
Priority date
2020-12-01
Filing date
2020-12-01
Publication date
2021-02-19
Application filed by Tianyi Electronic Commerce Co Ltd
Priority to CN202011389228.6A
Publication of CN112381546A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/40 Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q20/401 Transaction verification
    • G06Q20/4016 Transaction verification involving fraud or risk level assessment in transaction processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for detecting abnormal risk accounts based on time series clustering. Starting from the current state and existing problems of e-commerce transaction risk assessment, and guided by risk management methodology combined with an analysis of the business characteristics of risky transactions, it studies and analyzes the transaction risk assessment methods used by e-commerce platforms at home and abroad, and on that basis designs a method for detecting abnormal risk accounts based on time series clustering. Compared with traditional risk control rules and direct clustering, using time series data widens the range of data available for clustering and groups identical or similar behaviors together along the time dimension, so that risk accounts engaged in behaviors such as order brushing (fake orders), illegal arbitrage and scalper rush purchasing can be detected effectively.

Description

Method for detecting abnormal risk account based on time series clustering
Technical Field
The invention relates to the field of emerging information technology, and in particular to a method for detecting abnormal risk accounts based on time series clustering.
Background
In recent years, electronic commerce has developed rapidly, and online ordering and shopping have become one of the most important forms of consumption. Behind this prosperity, a large number of transaction risk problems, such as order brushing (fake orders), illegal arbitrage and scalper rush purchasing, have broken out on some e-commerce platforms. Outdated transaction risk management has become a bottleneck that hinders the healthy development of these platforms: it causes great losses to the platforms, harms the rights and interests of ordinary customers, and leads the public to question the fairness, impartiality and authenticity of the platforms. E-commerce companies have little experience in transaction risk management, started on risk management relatively late, and assess transaction risk mainly on the basis of the experience of business staff. As business volume and business complexity grow, and especially as black-market industry chains such as scalper outfits and paid-poster ("water army") outfits become professionalized, the traditional approach to transaction risk management can no longer meet the needs of risk management. Building a scientific and intelligent transaction risk assessment system to detect abnormal risk accounts and black-market accounts is therefore of great significance for the healthy development of e-commerce platforms.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for detecting abnormal risk accounts based on time series clustering.
To solve the above technical problems, the invention provides the following technical solution:
The invention discloses a method for detecting abnormal risk accounts based on time series clustering, which comprises the following steps:
Step 1, data acquisition: acquiring user transaction data, user operation data and user basic attribute data for the scope under study, wherein the user transaction data and operation data comprise the detailed name and time of each operation and transaction, and the user basic attribute data comprise a unique user identifier, a blacklist flag, geographical location information and the like;
Step 2, data preprocessing: grouping the data acquired in step 1 by user, and sorting the data within each user group in the chronological order in which the operations and transactions occurred;
Step 3, time series generation: from the data processed in step 2, arranging each user's operation names and transaction names in chronological order to form a first sequence, the operation-transaction name sequence; arranging each user's operation and transaction time points in chronological order to form a second sequence, the operation-transaction time point sequence; and subtracting each time point from the one that follows it, in chronological order, to form a third sequence, the operation-transaction time interval sequence;
Step 4, time series numericalization: for the first operation-transaction name sequence, modeling the name sequence with a seq2seq method, then using the model to vectorize each name in the sequence, and finally averaging the vectors of all the names in the sequence; for the second operation-transaction time point sequence, subtracting a fixed time node (for example, 2020-01-01) from each time point of the sequence and storing the result in days, hours, minutes or seconds as required; for the third operation-transaction time interval sequence, converting the intervals directly into days, hours, minutes or seconds as required;
Step 5, time series clustering and grouping: taking the vector of the first operation-transaction name sequence from step 4 as input, clustering with the MeanShift clustering algorithm so that users whose operations and transactions are similar over time fall into the same group, and numbering each group;
Step 6, calculating indicators within each time series group: calculating business indicators such as the proportion of blacklisted samples, the average operation-transaction time and the average sequence length for the users in each group, and determining the users whose groups exceed the thresholds set for these indicators to be abnormal risk accounts.
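Steps 2 to 4 can be illustrated for the time-related sequences with a minimal Python sketch. The record layout `(user_id, event_name, timestamp)`, the sample events and the function name `build_sequences` are assumptions made for illustration; only the reference date 2020-01-01 comes from the text, and seconds are chosen arbitrarily among the units the method allows.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event records: (user_id, event_name, timestamp).
EVENTS = [
    ("u1", "login", datetime(2020, 1, 2, 8, 0, 0)),
    ("u2", "login", datetime(2020, 1, 3, 9, 0, 0)),
    ("u1", "pay", datetime(2020, 1, 2, 8, 0, 30)),
    ("u1", "browse", datetime(2020, 1, 2, 8, 0, 10)),
]

# Fixed reference time node; 2020-01-01 is the example given in step 4.
REFERENCE = datetime(2020, 1, 1)


def build_sequences(events, reference=REFERENCE):
    """Return {user_id: (names, offsets_in_seconds, intervals_in_seconds)}."""
    by_user = defaultdict(list)
    for user_id, name, ts in events:          # step 2: group by user
        by_user[user_id].append((ts, name))

    result = {}
    for user_id, rows in by_user.items():
        rows.sort(key=lambda row: row[0])     # step 2: sort by time
        names = [name for _, name in rows]    # first sequence: names
        times = [ts for ts, _ in rows]
        # Second sequence: each time point minus the fixed reference node.
        offsets = [(ts - reference).total_seconds() for ts in times]
        # Third sequence: each time point minus the one before it.
        intervals = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        result[user_id] = (names, offsets, intervals)
    return result


SEQUENCES = build_sequences(EVENTS)
```

A user with a single event ends up with an empty interval sequence, which a real pipeline would need to handle before computing per-group averages.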
Compared with the prior art, the invention has the following beneficial effects:
The invention designs a method for detecting abnormal risk accounts based on time series clustering. Compared with traditional risk control rules and direct clustering, using time series data widens the range of data available for clustering and groups identical or similar behaviors together along the time dimension, so that risk accounts engaged in behaviors such as order brushing (fake orders), illegal arbitrage and scalper rush purchasing can be detected effectively.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a general schematic of the system of the present invention;
FIG. 2 is a diagram of the seq2seq model of the present invention;
FIG. 3 is a diagram of a MeanShift clustering model of the present invention;
FIG. 4 is a computed composite index map of the present invention;
FIG. 5 is a first schematic diagram of an embodiment of the present invention;
FIG. 6 is a second schematic diagram of an embodiment of the present invention;
FIG. 7 is a third schematic diagram of an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1
According to the method for detecting the abnormal risk account based on the time series clustering, provided by the embodiment of the invention, clustering grouping is carried out by utilizing the similarity of the time series according to the operation data and the transaction data of a user, and then various risk indexes in the group are calculated, so that the abnormal risk account is found out. The specific implementation steps are shown in fig. 1.
The following is a detailed description:
First, comprehensive data for users over a period of time is acquired, comprising transaction data, operation data and basic attribute data. Taking Wing Payment as an example, the transaction data fields are user id, transaction name, transaction type, whether a marketing discount was applied, and transaction time; the operation data fields are user id, operation name, operation type and operation time; and the basic user attribute fields are user id, home province and whether the user is blacklisted. Once the data has been received, the three kinds of data are joined on user id, and the joined data is deduplicated and cleaned. The joined data is then sorted by user and, within each user, by the order of the transactions and operations.
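The join, deduplication and sorting described above might look like the following Python sketch. The dictionary keys, the sample rows and the function name `merge_events` are hypothetical, and the cleaning rule (dropping events whose user has no attribute record) is an assumption, since the text does not say what cleaning involves.

```python
from datetime import datetime

# Hypothetical rows; field names loosely follow the description, values are made up.
TRANSACTIONS = [
    {"user_id": "u1", "name": "pay", "type": "purchase", "discount": True,
     "time": datetime(2020, 1, 2, 8, 0, 30)},
    {"user_id": "u1", "name": "pay", "type": "purchase", "discount": True,
     "time": datetime(2020, 1, 2, 8, 0, 30)},  # exact duplicate
]
OPERATIONS = [
    {"user_id": "u1", "name": "login", "type": "auth",
     "time": datetime(2020, 1, 2, 8, 0, 0)},
    {"user_id": "u2", "name": "login", "type": "auth",
     "time": datetime(2020, 1, 3, 9, 0, 0)},
]
ATTRIBUTES = {
    "u1": {"home_province": "A", "blacklisted": True},
    "u2": {"home_province": "B", "blacklisted": False},
}


def merge_events(transactions, operations, attributes):
    """Join both event tables with user attributes, deduplicate, and sort."""
    seen = set()
    merged = []
    for kind, rows in (("transaction", transactions), ("operation", operations)):
        for row in rows:
            key = (kind, row["user_id"], row["name"], row["time"])
            if key in seen:            # deduplication: drop exact repeats
                continue
            seen.add(key)
            attrs = attributes.get(row["user_id"])
            if attrs is None:          # assumed cleaning rule: unknown user
                continue
            merged.append({"kind": kind, **row, **attrs})
    # Sort by user, then by the time the transaction or operation occurred.
    merged.sort(key=lambda r: (r["user_id"], r["time"]))
    return merged


MERGED = merge_events(TRANSACTIONS, OPERATIONS, ATTRIBUTES)
```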
Next, the sorted data is preprocessed. The data set contains a large number of transaction and operation names that each account for only a small share of the records, which would hurt the accuracy of the later modeling; the count and share of each name across the whole data set are therefore computed first, and names with small counts and low shares are deleted. The time nodes at which transactions and operations occur are not numerical data, so no numerical calculation can be performed on them directly; they are therefore converted into numerical data in two different ways. The first subtracts a fixed time node from each time node to form a time node sequence; the second subtracts the preceding time node from the following one to form a time interval sequence.
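The rare-name filtering can be sketched as follows. The thresholds `min_count` and `min_share` and the function name `drop_rare_names` are illustrative assumptions; the text does not state what counts as a small number or a low share.

```python
from collections import Counter


def drop_rare_names(sequences, min_count=2, min_share=0.05):
    """Remove names whose count or share over the whole data set is too low.

    `sequences` maps a user id to that user's ordered list of names.
    Both thresholds are hypothetical values chosen for the example.
    """
    counts = Counter(name for seq in sequences.values() for name in seq)
    total = sum(counts.values())
    keep = {
        name
        for name, count in counts.items()
        if count >= min_count and count / total >= min_share
    }
    # Preserve the original order of the names that survive.
    return {user: [n for n in seq if n in keep] for user, seq in sequences.items()}


RAW = {
    "u1": ["login", "browse", "pay"],
    "u2": ["login", "pay", "rare_popup"],
    "u3": ["login", "browse", "pay"],
}
FILTERED = drop_rare_names(RAW)
```

Here `rare_popup` occurs once in nine events and is dropped, while the order of the remaining names in each user's sequence is unchanged.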
After data preprocessing is complete, the transaction-operation name sequence is encoded numerically using the seq2seq algorithm shown in fig. 2. Seq2Seq, short for Sequence-to-Sequence, is a general Encoder-Decoder framework. It is an important and widely used sequence model in current natural language processing: it breaks out of the traditional fixed-size input framing, opens the way to applying classical deep neural network models to sequence tasks such as translation and intelligent question answering, and has been shown to perform very well in translation between major languages and in short human-machine question answering in voice assistants. The Seq2Seq task refers to the problem of mapping one sequence to another. A sequence here can be understood as a sequence of strings (such as the transaction and operation names in this patent); when, given one string sequence, we want to obtain another string sequence (for example its translation, or a semantically corresponding sequence), the task can be called Seq2Seq.
Seq2Seq is a neural network with an Encoder-Decoder structure, as shown in fig. 5. Its input is a sequence and its output is also a sequence, hence the name. The Encoder converts a variable-length sequence into a fixed-length vector representation, and the Decoder converts that fixed-length vector into the variable-length target sequence. The most basic Seq2Seq model comprises three parts (some of which are not shown in fig. 5): an Encoder, a Decoder, and an intermediate state vector C connecting the two. The Encoder learns from the input to encode it into a fixed-size state vector C (also called the semantic encoding) and passes C to the Decoder, and the Decoder learns from C to output the corresponding sequence.
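The basic Encoder-Decoder structure can be written compactly in one common recurrent formulation. The symbols below (input tokens x_t, encoder states h_t, decoder states s_k, output tokens y_k, recurrent functions f and g, output projection W) are notation assumed for illustration; the patent does not specify the cell type or the output layer.

```latex
\begin{aligned}
h_t &= f(x_t,\, h_{t-1}), \qquad t = 1, \dots, T
  && \text{(Encoder reads the input sequence)} \\
C &= h_T
  && \text{(fixed-size state vector)} \\
s_k &= g(y_{k-1},\, s_{k-1},\, C)
  && \text{(Decoder state at output step } k\text{)} \\
P(y_k \mid y_{<k},\, x_{1:T}) &= \operatorname{softmax}(W s_k)
  && \text{(distribution over the next output token)}
\end{aligned}
```

Every decoder step depends on the same vector C, which is the bottleneck that attention is meant to relieve.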
The basic Seq2Seq model has several drawbacks. First, encoding the input into a fixed-size state vector (hidden state) is in effect lossy compression: the more information the input carries, the more is lost in the conversion. Second, as the sequence grows longer along the time dimension, the RNN model suffers from vanishing gradients. Finally, the only component connecting the Encoder and Decoder modules in the basic model is a fixed-size state vector, so the Decoder cannot attend directly to the finer details of the input. Because of these deficiencies, this patent uses an Attention-based Seq2Seq model. The Attention mechanism was introduced to address the problems of the basic model: the input sequence is no longer compressed into a single fixed-length intermediate semantic vector; instead, a new context vector is computed for each word being generated, so the context fed to the Decoder differs at every step, which addresses the loss of word information. An Encoder-Decoder model with Attention is shown in FIG. 6.
A plain Encoder-Decoder framework cannot focus effectively on the relevant parts of the input, so a model such as seq2seq does not reach its full potential when used alone. For example, in fig. 6, the encoder encodes the input into a context variable C, and every output Y is decoded from this same C without distinction. The attention model instead produces a different C for each time step of the sequence, and during decoding each output is produced with its own C, which makes the result more accurate. In this patent the input sequence and the output sequence are identical: both are the name sequence of operations and transactions. The model is trained iteratively until it converges, and the Encoder part is then saved. When the model is used, an operation-transaction name sequence is fed to the Encoder, which outputs a numeric vector that represents the sequence; this is how the transaction-operation name sequence is encoded numerically.
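How a per-step context vector differs from a single shared C can be shown with a small pure-Python sketch. Dot-product scoring is an assumption made here for brevity (the patent does not name a scoring function), and `attention_context` and `mean_pool` are hypothetical helper names; `mean_pool` illustrates the averaging of per-name vectors mentioned in step 4.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(encoder_states, decoder_state):
    """Return (weights, context) for one decoding step.

    Scores are dot products between the decoder state and each encoder
    state (an assumed scoring function); the context is the weighted sum
    of the encoder states.
    """
    scores = [sum(h * s for h, s in zip(state, decoder_state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


STATES = [[1.0, 0.0], [0.0, 1.0]]
UNIFORM_WEIGHTS, UNIFORM_CONTEXT = attention_context(STATES, [0.0, 0.0])
FOCUSED_WEIGHTS, FOCUSED_CONTEXT = attention_context(STATES, [10.0, 0.0])
```

With a neutral decoder state both encoder states receive equal weight; with a decoder state aligned to the first encoder state, nearly all the weight moves to it, so the context changes from step to step.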
After the transaction-operation name sequences have been encoded numerically, the resulting vectors need to be clustered; this patent mainly uses the MeanShift clustering algorithm shown in fig. 3. MeanShift is a hill-climbing algorithm based on kernel density estimation and can be used for clustering, image segmentation, tracking and similar applications. Its key operation is to compute a shift vector for the center point from the change in data density within the region of interest, move the center point accordingly for the next iteration, and repeat until the position of maximum density is reached (the center point no longer moves). This procedure can be started from every data point, and during it the number of times each data point appears in the region of interest is counted; this count is the basis for the final classification. Unlike the K-Means algorithm, MeanShift determines the number of classes automatically. Like K-Means, it moves the center point using the mean of the data points in the set.
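For reference, the shift vector can be written out for the flat (uniform) kernel, which corresponds to a fixed-radius region of interest. The symbols x (current center), h (radius) and S_h(x) (the points inside the region) are notation assumed here, not taken from the patent.

```latex
\begin{aligned}
S_h(x) &= \{\, x_i : \lVert x_i - x \rVert \le h \,\} \\
m_h(x) &= \frac{1}{\lvert S_h(x) \rvert} \sum_{x_i \in S_h(x)} (x_i - x) \\
x &\leftarrow x + m_h(x)
\end{aligned}
```

The update therefore moves the center to the mean of the points currently inside the region, and iteration stops when the norm of m_h(x) falls below a small tolerance.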
As shown in fig. 7, the steps of the MeanShift clustering algorithm are as follows:
1. Randomly select one of the unmarked data points as the starting center point, center;
2. Find all data points that lie within distance radius of center and denote this set M; these points are considered to belong to a cluster C, and the recorded number of times each of them has appeared in this cluster is increased by 1;
3. With center as the origin, compute the vector from center to each element of the set M and add these vectors to obtain the vector shift;
4. center = center + shift, that is, center moves along the direction of shift by a distance of ||shift||;
5. Repeat steps 2, 3 and 4 until shift is very small (that is, the iteration has converged), and record the center at that point; all points encountered during this iteration are assigned to cluster C;
6. If, at convergence, the distance between the center of the current cluster C and the center of another existing cluster C2 is smaller than a threshold, merge C2 and C and merge the occurrence counts of their data points accordingly; otherwise, take C as a new cluster;
7. Repeat steps 1, 2, 3, 4 and 5 until all points have been marked as visited;
8. Classification: for each point, compare how often each cluster visited it and assign the point to the cluster with the highest visit frequency.
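The eight steps above can be sketched in pure Python as follows. This is an illustrative flat-kernel version, not the patent's implementation: the shift is taken as the mean (rather than the raw sum) of the vectors to the points in M, which is the usual MeanShift update, and the default `merge_threshold`, the tolerance `tol`, the iteration cap `max_iter` and the sample points are all assumptions.

```python
import math
import random


def mean_shift(points, radius, merge_threshold=None, tol=1e-6,
               max_iter=300, seed=0):
    """Flat-kernel MeanShift; returns (cluster_centers, label_per_point)."""
    if merge_threshold is None:
        merge_threshold = radius / 2          # assumed default
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    visited = [False] * n
    centers = []   # one converged center per cluster
    votes = []     # votes[c][i]: how often cluster c's window covered point i

    while not all(visited):                   # step 7
        # Step 1: pick a random unvisited point as the starting center.
        start = rng.choice([i for i in range(n) if not visited[i]])
        center = list(points[start])
        counts = [0] * n
        for _ in range(max_iter):
            # Step 2: the set M of points within `radius` of the center.
            members = [i for i in range(n)
                       if math.dist(points[i], center) <= radius]
            if not members:
                break
            for i in members:
                counts[i] += 1
                visited[i] = True
            # Step 3: mean of the vectors from the center to the members.
            shift = [sum(points[i][d] - center[d] for i in members) / len(members)
                     for d in range(dim)]
            # Step 4: move the center along the shift vector.
            center = [c + s for c, s in zip(center, shift)]
            # Step 5: stop once the shift is negligible.
            if math.hypot(*shift) < tol:
                break
        # Step 6: merge with an existing cluster whose center is close enough.
        for c, other in enumerate(centers):
            if math.dist(center, other) < merge_threshold:
                votes[c] = [a + b for a, b in zip(votes[c], counts)]
                break
        else:
            centers.append(center)
            votes.append(counts)

    # Step 8: each point joins the cluster that covered it most often.
    labels = [max(range(len(centers)), key=lambda c: votes[c][i])
              for i in range(n)]
    return centers, labels


POINTS = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
CENTERS, LABELS = mean_shift(POINTS, radius=1.0)
```

The two well-separated blobs end up in two clusters; which blob receives label 0 depends on the random starting point, so only label equality within a blob should be relied on.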
This method uses the MeanShift algorithm to cluster the numerically encoded vectors of the transaction-operation name sequences, so that identical or similar name sequences are effectively placed in the same group.
The last important step of the method is to calculate the risk-related business indicators within each group. The main indicators include the proportion of blacklisted users in the group, the proportion of marketing users in the group, the distribution of the group's users across home provinces, the average sequence length of the group's users, and the average time interval between the group's transactions and operations. According to the business rules, a group whose calculated indicators exceed the thresholds is an abnormal group, and the accounts in an abnormal group are abnormal risk accounts.
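A minimal sketch of the per-group indicator calculation follows. The record fields, the two thresholds and the rule that a group is abnormal when its blacklist ratio is too high or its average interval too short are hypothetical stand-ins for the unspecified business rules.

```python
from statistics import mean

# Hypothetical per-user records: cluster label, blacklist flag,
# sequence length and mean interval between events in seconds.
USERS = [
    {"user_id": "u1", "group": 0, "blacklisted": True, "seq_len": 40, "avg_interval": 2.0},
    {"user_id": "u2", "group": 0, "blacklisted": True, "seq_len": 38, "avg_interval": 3.0},
    {"user_id": "u3", "group": 0, "blacklisted": False, "seq_len": 42, "avg_interval": 2.5},
    {"user_id": "u4", "group": 1, "blacklisted": False, "seq_len": 5, "avg_interval": 600.0},
    {"user_id": "u5", "group": 1, "blacklisted": False, "seq_len": 7, "avg_interval": 900.0},
]


def group_indicators(users):
    """Compute a few of the indicators named above for each cluster."""
    groups = {}
    for user in users:
        groups.setdefault(user["group"], []).append(user)
    return {
        label: {
            "size": len(members),
            "black_ratio": sum(m["blacklisted"] for m in members) / len(members),
            "avg_seq_len": mean(m["seq_len"] for m in members),
            "avg_interval": mean(m["avg_interval"] for m in members),
        }
        for label, members in groups.items()
    }


def abnormal_accounts(users, max_black_ratio=0.5, min_avg_interval=5.0):
    """Flag every account in a group that breaks an (assumed) threshold."""
    stats = group_indicators(users)
    bad_groups = {
        label for label, s in stats.items()
        if s["black_ratio"] > max_black_ratio or s["avg_interval"] < min_avg_interval
    }
    return sorted(u["user_id"] for u in users if u["group"] in bad_groups)


STATS = group_indicators(USERS)
FLAGGED = abnormal_accounts(USERS)
```

Note that `u3` is flagged even though it is not blacklisted: the decision is made at group level, which is how the method surfaces accounts that rules on individual users would miss.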
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A method for detecting abnormal risk accounts based on time series clustering is characterized by comprising the following steps:
step 1, data acquisition: acquiring user transaction data, user operation data and user basic attribute data for the scope under study, wherein the user transaction data and operation data comprise the detailed name and time of each operation and transaction, and the user basic attribute data comprise a unique user identifier, a blacklist flag, geographical location information and the like;
step 2, data preprocessing: grouping the data acquired in step 1 by user, and sorting the data within each user group in the chronological order in which the operations and transactions occurred;
step 3, time series generation: from the data processed in step 2, arranging each user's operation names and transaction names in chronological order to form a first sequence, the operation-transaction name sequence; arranging each user's operation and transaction time points in chronological order to form a second sequence, the operation-transaction time point sequence; and subtracting each time point from the one that follows it, in chronological order, to form a third sequence, the operation-transaction time interval sequence;
step 4, time series numericalization: for the first operation-transaction name sequence, modeling the name sequence with a seq2seq method, then using the model to vectorize each name in the sequence, and finally averaging the vectors of all the names in the sequence; for the second operation-transaction time point sequence, subtracting a fixed time node (for example, 2020-01-01) from each time point of the sequence and storing the result in days, hours, minutes or seconds as required; for the third operation-transaction time interval sequence, converting the intervals directly into days, hours, minutes or seconds as required;
step 5, time series clustering and grouping: taking the vector of the first operation-transaction name sequence from step 4 as input, clustering with the MeanShift clustering algorithm so that users whose operations and transactions are similar over time fall into the same group, and numbering each group;
step 6, calculating indicators within each time series group: calculating business indicators such as the proportion of blacklisted samples, the average operation-transaction time and the average sequence length for the users in each group, and determining the users whose groups exceed the thresholds set for these indicators to be abnormal risk accounts.
CN202011389228.6A 2020-12-01 2020-12-01 Method for detecting abnormal risk account based on time series clustering Pending CN112381546A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011389228.6A CN112381546A (en) 2020-12-01 2020-12-01 Method for detecting abnormal risk account based on time series clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011389228.6A CN112381546A (en) 2020-12-01 2020-12-01 Method for detecting abnormal risk account based on time series clustering

Publications (1)

Publication Number Publication Date
CN112381546A true CN112381546A (en) 2021-02-19

Family

ID=74589852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011389228.6A Pending CN112381546A (en) 2020-12-01 2020-12-01 Method for detecting abnormal risk account based on time series clustering

Country Status (1)

Country Link
CN (1) CN112381546A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116204843A (en) * 2023-04-24 2023-06-02 北京芯盾时代科技有限公司 Abnormal account detection method and device, electronic equipment and storage medium
CN117010905A (en) * 2023-10-08 2023-11-07 中国建设银行股份有限公司 Dynamic identification processing method and device for transaction risk list data
CN117010905B (en) * 2023-10-08 2023-12-29 中国建设银行股份有限公司 Dynamic identification processing method and device for transaction risk list data
CN117455497A (en) * 2023-11-12 2024-01-26 北京营加品牌管理有限公司 Transaction risk detection method and device

Similar Documents

Publication Publication Date Title
CN112381546A (en) Method for detecting abnormal risk account based on time series clustering
CN112100369B (en) Semantic-combined network fault association rule generation method and network fault detection method
CN111488582B (en) Intelligent contract reentrant vulnerability detection method based on graph neural network
CN111159387B (en) Recommendation method based on multi-dimensional alarm information text similarity analysis
CN111127146A (en) Information recommendation method and system based on convolutional neural network and noise reduction self-encoder
CN103577592A (en) Network community user friend recommending method based on character similarity matching calculation
CN117237559B (en) Digital twin city-oriented three-dimensional model data intelligent analysis method and system
CN113283902B (en) Multichannel blockchain phishing node detection method based on graphic neural network
CN109446519B (en) Text feature extraction method fusing data category information
CN112417063B (en) Heterogeneous relation network-based compatible function item recommendation method
CN115357728A (en) Large model knowledge graph representation method based on Transformer
CN112328859A (en) False news detection method based on knowledge-aware attention network
CN115049472B (en) Unsupervised credit card anomaly detection method based on multidimensional feature tensor
CN111026852B (en) Financial event-oriented hybrid causal relationship discovery method
CN112507224A (en) Service recommendation method of man-machine object fusion system based on heterogeneous network representation learning
CN116934270A (en) Library book borrowing management system based on data analysis
Bakirli et al. DTreeSim: A new approach to compute decision tree similarity using re-mining
CN106097090A (en) A kind of taxpayer interests theoretical based on figure associate group's recognition methods
CN109033952A (en) M-sequence recognition methods based on sparse self-encoding encoder
Velikova et al. Decision trees for monotone price models
CN112069392B (en) Method and device for preventing and controlling network-related crime, computer equipment and storage medium
CN113988083A (en) Factual information coding and evaluating method for shipping news abstract generation
CN115578100A (en) Payment verification mode identification method and device, electronic equipment and storage medium
Frank et al. Applications of neural networks to telecommunications systems
CN118395985B (en) Named entity identification method based on knowledge distillation and variation self-encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210219