Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for detecting abnormal risk accounts based on time series clustering.
In order to solve the above technical problems, the invention provides the following technical solution:
the invention discloses a method for detecting abnormal risk accounts based on time series clustering, which comprises the following steps:
step 1, data acquisition: acquiring user transaction data, user operation data and user basic attribute data for the area under study, wherein the user transaction data and the user operation data comprise the detailed name and time of each transaction and operation, and the user basic attribute data comprise a unique user identifier, a blacklist or whitelist flag, geographical location information and the like;
step 2, data preprocessing: grouping the data acquired in the step 1 according to users, and sequencing the data in each user group according to the time sequence of operation and transaction occurrence;
step 3, generating time sequences: based on the data processed in step 2, arranging the operation names and transaction names of each user in chronological order to form a first sequence, the operation-transaction name sequence; arranging the operation and transaction time points of each user in chronological order to form a second sequence, the operation-transaction time point sequence; and subtracting each time point from the next one in chronological order to form a third sequence, the operation-transaction time interval sequence;
step 4, time series numeralization: for the first operation-transaction name sequence, modeling the name sequence with a seq2seq method, vectorizing each name in the sequence with the trained model, and averaging the vectors of all names in the sequence; for the second operation-transaction time point sequence, subtracting a fixed time node (such as January 1, 2020) from each time point of the sequence and storing the result in days, hours, minutes or seconds as required; for the third operation-transaction time interval sequence, directly converting the intervals into days, hours, minutes or seconds as required;
step 5, clustering and grouping time series: taking the vector of the first operation-transaction name sequence in the step 4 as an input value, clustering and grouping by using a MeanShift clustering algorithm, grouping users with similar operations and transactions in time into the same group, and numbering each group;
step 6, calculating indexes within each time series group: calculating service indexes such as the blacklisted-sample ratio, the average operation-transaction time and the average sequence length of the users in each group, and determining the users in groups whose service indexes exceed the thresholds as abnormal risk accounts.
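By way of illustration only, steps 2 to 4 (for the two time-based sequences) may be sketched in Python as follows. The record layout `(user_id, event_name, timestamp)`, the sample data, the function name `build_sequences` and the choice of seconds as the unit are assumptions made for this sketch and are not prescribed by the method.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event records: (user_id, event_name, timestamp string).
EVENTS = [
    ("u1", "login", "2020-01-02 00:00:00"),
    ("u2", "login", "2020-01-01 12:00:00"),
    ("u1", "pay", "2020-01-02 00:00:30"),
    ("u1", "query_balance", "2020-01-02 00:00:10"),
]

# Fixed reference time node; January 1, 2020 is the example given in step 4.
REFERENCE = datetime(2020, 1, 1)


def build_sequences(events, reference=REFERENCE):
    """Return {user_id: (names, time_points, intervals)}, times in seconds."""
    grouped = defaultdict(list)
    for user_id, name, ts in events:
        grouped[user_id].append((datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), name))

    sequences = {}
    for user_id, rows in grouped.items():
        rows.sort(key=lambda r: r[0])  # step 2: chronological order per user
        names = [n for _, n in rows]   # first sequence: names
        # Second sequence: offset of each time point from the reference node.
        points = [(t - reference).total_seconds() for t, _ in rows]
        # Third sequence: next time point minus the previous one.
        intervals = [b - a for a, b in zip(points, points[1:])]
        sequences[user_id] = (names, points, intervals)
    return sequences


sequences = build_sequences(EVENTS)
```

In this sketch a user with a single event receives an empty interval sequence; how such users are handled in practice is left open here.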
Compared with the prior art, the invention has the following beneficial effects:
The invention designs a method for detecting abnormal risk accounts based on time series clustering. Compared with traditional risk control rules and direct clustering, it uses time series data to expand the range of the data being clustered, and identical or similar behaviors can be gathered together in the time dimension, so that risk accounts engaged in behaviors such as fake-order brushing, illegal arbitrage and scalper rush purchasing can be effectively detected.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Example 1
According to the method for detecting the abnormal risk account based on the time series clustering, provided by the embodiment of the invention, clustering grouping is carried out by utilizing the similarity of the time series according to the operation data and the transaction data of a user, and then various risk indexes in the group are calculated, so that the abnormal risk account is found out. The specific implementation steps are shown in fig. 1.
The following is a detailed description:
First, the comprehensive data of users over a period of time are acquired, including transaction data, operation data and basic attribute data. Taking winged payment as an example, the fields of the transaction data comprise the user id, transaction name, transaction type, whether a marketing discount was applied, and transaction time; the fields of the operation data comprise the user id, operation name, operation type and operation time; the fields of the user basic attribute data comprise the user id, the home province and whether the user is blacklisted. After the data are received, the three types of data are joined on the user id, and the joined data are deduplicated and cleaned. The joined data are then sorted by user and, within each user, by the time at which each transaction and operation occurred.
Next, the sorted data are preprocessed. For transaction and operation names, a large number of names account for only a small share of the data set, which affects the accuracy of the later modeling; the count and share of each name in the whole data set are therefore computed first, and names with a small count and a low share are deleted. For the time nodes at which transactions and operations occur, the time node data are not numerical and no numerical calculation can be performed on them directly, so they are converted into numerical data in two different ways. The first forms a time node sequence by subtracting a fixed time node from each time node; the second forms a time interval sequence by subtracting the previous time node from the next one.
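The deletion of rarely occurring names described above may be sketched as follows. The function name `drop_rare_names`, the thresholds `min_count` and `min_share`, and the sample sequences are hypothetical; the description does not fix specific values.

```python
from collections import Counter


def drop_rare_names(sequences, min_count=2, min_share=0.05):
    """Remove names whose count or share in the whole data set is too low."""
    counts = Counter(name for seq in sequences for name in seq)
    total = sum(counts.values())
    keep = {
        name for name, c in counts.items()
        if c >= min_count and c / total >= min_share
    }
    return [[name for name in seq if name in keep] for seq in sequences]


# Hypothetical name sequences for three users.
SAMPLE = [
    ["login", "pay", "logout"],
    ["login", "pay", "rare_promo_click"],
    ["login", "logout"],
]
filtered = drop_rare_names(SAMPLE)
```

Here a name is kept only if it passes both thresholds; requiring only one of them would be an equally plausible reading of the text.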
After the data preprocessing is completed, the transaction operation name sequence is numerically encoded using the seq2seq algorithm shown in fig. 2. Seq2Seq, short for Sequence-to-Sequence, is a general Encoder-Decoder framework. It is a very important and popular sequence model in current natural language processing; it breaks through the traditional framing of problems with fixed-size inputs, opens the way to applying classical deep neural network models to sequence tasks such as translation and intelligent question answering, and has been shown to perform very well in translation between major languages and in short question answering between humans and voice assistants. The so-called Seq2Seq task mainly refers to the problem of mapping one sequence to another, where a sequence is understood here as a string sequence (such as the transaction operation names in this patent); when, given one string sequence, another string sequence is to be obtained (for example its translation or its semantic counterpart), the task can be called Seq2Seq. Seq2Seq is a neural network with an Encoder-Decoder structure, as shown in fig. 5, whose input is a sequence and whose output is also a sequence, hence the name "Seq2Seq". The Encoder converts a variable-length sequence into a fixed-length vector representation, and the Decoder converts this fixed-length vector into the variable-length target sequence. The most basic Seq2Seq model comprises three parts (some of which are not shown in fig. 5), namely an Encoder, a Decoder and an intermediate state vector C connecting the two: the Encoder learns from the input and encodes it into a fixed-size state vector C (also called the semantic encoding), the Encoder then passes C to the Decoder, and the Decoder outputs the corresponding sequence by learning from the state vector C.
The basic Seq2Seq model has many drawbacks. First, the process in which the Encoder encodes the input into a fixed-size state vector (hidden state) is in effect a lossy compression of information: the larger the amount of information, the larger the loss incurred in converting it into the vector. Second, as the sequence length increases, meaning that the sequence is long in the time dimension, the RNN model suffers from vanishing gradients. Finally, the component of the basic model that connects the Encoder and Decoder modules is merely a fixed-size state vector, which makes it impossible for the Decoder to attend directly to finer details of the input. Because of these deficiencies, this patent uses a Seq2Seq model based on Attention. The principle of the Attention mechanism is as follows: the input sequence is no longer encoded into a single intermediate semantic vector of fixed length; instead, a new context vector is calculated at each step according to the word currently being generated, so that the context supplied to the Decoder differs at each moment and the loss of word information is alleviated. An Encoder-Decoder model with Attention is shown in FIG. 6.
The simple Encoder-Decoder framework does not focus effectively on the relevant parts of the input, so a model such as seq2seq cannot reach its full effectiveness when used alone. For example, in fig. 6, the encoder encodes the input into a context variable C, and each output Y is decoded using this same C indiscriminately. What the attention model does is produce a different C for each time step of the sequence, and the decoder produces each output in combination with the C for that step, so that the result is more accurate. In this patent the input sequence and the output sequence are the same, both being the name sequence of operations and transactions; the model is trained iteratively until it converges, and the Encoder part of the model is saved after convergence. When the model is used, the name sequence of operations and transactions is input to the Encoder part, which outputs a numerical vector; this vector represents the name sequence, thereby realizing the numerical encoding of the transaction operation name sequence.
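The idea that each decoding step receives its own context vector C may be illustrated with the following minimal sketch. It uses dot-product scoring followed by a softmax, which is one common way to compute attention weights and is an assumption here; the description does not specify the scoring function. The encoder states and decoder states are made-up toy values, and `attention_context` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    Assumption: scores are dot products between the decoder state and each
    encoder state; the context is the weighted sum of the encoder states.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * hs[i] for w, hs in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy encoder states, one per input position.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two different decoder states yield two different context vectors.
weights_a, context_a = attention_context([2.0, 0.0], encoder_states)
weights_b, context_b = attention_context([0.0, 2.0], encoder_states)
```

With a single fixed state vector both steps would receive the same C; in the sketch the two decoder states weight the input positions differently and therefore obtain different contexts.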
After the transaction operation name sequences are numerically encoded, the resulting numerical vectors need to be clustered; this patent mainly uses the MeanShift clustering algorithm shown in fig. 3. The MeanShift algorithm is a hill-climbing algorithm based on kernel density estimation and can be used in application scenarios such as clustering, image segmentation and tracking. The key operation of the MeanShift algorithm is to calculate the shift vector of the center point from the change in data density within the region of interest, and to move the center point accordingly for the next iteration until the position of maximum density is reached (where the center point no longer moves). This procedure can be started from each data point, and during it the number of times each data point appears in the region of interest is counted; this count is the basis for the final classification. Unlike the K-Means algorithm, the MeanShift algorithm determines the number of classes automatically. Like the K-Means algorithm, it uses the mean of the data points in a set to move the center point.
As shown in fig. 7, the steps related to the MeanShift clustering algorithm are as follows:
1. randomly selecting one point from the unmarked data points as a starting center point center;
2. finding all data points that fall within the region centered at center with radius radius, recording them as a set M, considering these points to belong to a cluster C, and adding 1 to the recorded number of times each of these data points has appeared in the cluster;
3. taking the center as a central point, calculating vectors from the center to each element in the set M, and adding the vectors to obtain a vector shift;
4. center = center + shift, that is, the center moves along the shift direction by a distance of ||shift||;
5. repeating the steps 2, 3 and 4 until shift is very small (namely iteration is converged), remembering the center at this time, and noting that all the points encountered in the iteration process should be classified into a cluster C;
6. if the distance between the center of the current cluster C and the centers of other existing clusters C2 is smaller than the threshold value during convergence, merging C2 and C, and correspondingly merging the occurrence times of data points, otherwise, taking C as a new cluster;
7. repeating steps 1, 2, 3, 4 and 5 until all points are marked as visited;
8. classification: for each point, according to the number of times each class has visited it, taking the class with the largest number of visits as the class to which the point belongs.
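The eight steps above may be sketched in plain Python as follows. This is a minimal flat-kernel illustration, not a definitive implementation: the function name `mean_shift`, the parameters `merge_threshold`, `tol` and `seed`, the default of half the radius for the merge threshold, and the sample points are all assumptions. In step 3 the sketch averages the vectors rather than summing them, in line with the statement that the mean of the data points is used to move the center.

```python
import math
import random


def mean_shift(points, radius, merge_threshold=None, tol=1e-6, seed=0):
    """Flat-kernel MeanShift following the numbered steps; returns (centers, labels)."""
    rng = random.Random(seed)
    if merge_threshold is None:
        merge_threshold = radius / 2  # assumption: the text leaves this value open
    dim = len(points[0])
    unvisited = set(range(len(points)))
    centers = []  # one converged center per cluster
    visits = []   # visits[k][i]: times point i fell inside cluster k's window

    while unvisited:
        # Step 1: pick a random unmarked point as the starting center.
        center = list(points[rng.choice(sorted(unvisited))])
        count = [0] * len(points)
        while True:
            # Step 2: collect the set M of points within `radius` of the center.
            members = [i for i, p in enumerate(points) if math.dist(p, center) <= radius]
            for i in members:
                count[i] += 1
                unvisited.discard(i)
            # Step 3: average the vectors from the center to the members of M.
            shift = [
                sum(points[i][d] - center[d] for i in members) / len(members)
                for d in range(dim)
            ]
            # Step 4: move the center along the shift vector.
            center = [c + s for c, s in zip(center, shift)]
            # Step 5: stop once the shift is negligible.
            if math.hypot(*shift) < tol:
                break
        # Step 6: merge into an existing cluster whose center is close enough.
        for k, other in enumerate(centers):
            if math.dist(center, other) < merge_threshold:
                visits[k] = [a + b for a, b in zip(visits[k], count)]
                break
        else:
            centers.append(center)
            visits.append(count)
        # Step 7: the outer loop repeats until every point has been visited.

    # Step 8: assign each point to the cluster that visited it most often.
    labels = [
        max(range(len(centers)), key=lambda k: visits[k][i])
        for i in range(len(points))
    ]
    return centers, labels


# Hypothetical 2-D vectors forming two well-separated groups.
POINTS = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, labels = mean_shift(POINTS, radius=1.0)
```

The number of clusters is not passed in; it follows from the radius and the data, which is the property that distinguishes MeanShift from K-Means in the description. The sketch requires Python 3.8 or later for `math.dist`.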
The method uses the MeanShift algorithm to cluster and group the numerically encoded vectors of the transaction operation name sequences, so that identical or similar transaction operation name sequences are effectively placed in the same group.
The last important step of the method is to calculate the risk-related business indexes within each group. The main indexes include the proportion of blacklisted users in the group, the proportion of marketing users in the group, the proportion of each home province among the users of the group, the average sequence length of the users in the group, and the average time interval between transactions and operations in the group. Then, according to the business rules, a group whose calculated indexes exceed the thresholds is an abnormal group, and the accounts within the abnormal group are abnormal risk accounts.
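The calculation of group indexes and the threshold rule may be sketched as follows. The record fields, the sample users, the function names and the two thresholds (`max_black_ratio`, `min_avg_interval_s`) are hypothetical; in practice the thresholds come from the business rules, which the description does not enumerate.

```python
from statistics import mean

# Hypothetical per-user records after clustering (group = cluster number).
USERS = [
    {"user": "u1", "group": 0, "blacklisted": True, "seq_len": 40, "avg_interval_s": 2.0},
    {"user": "u2", "group": 0, "blacklisted": True, "seq_len": 38, "avg_interval_s": 3.0},
    {"user": "u3", "group": 0, "blacklisted": False, "seq_len": 42, "avg_interval_s": 1.0},
    {"user": "u4", "group": 1, "blacklisted": False, "seq_len": 6, "avg_interval_s": 90.0},
    {"user": "u5", "group": 1, "blacklisted": False, "seq_len": 8, "avg_interval_s": 110.0},
]


def group_indicators(users):
    """Compute a few of the per-group indexes named in the description."""
    groups = {}
    for u in users:
        groups.setdefault(u["group"], []).append(u)
    return {
        g: {
            "black_ratio": sum(m["blacklisted"] for m in members) / len(members),
            "avg_seq_len": mean(m["seq_len"] for m in members),
            "avg_interval_s": mean(m["avg_interval_s"] for m in members),
        }
        for g, members in groups.items()
    }


def flag_risky_users(users, max_black_ratio=0.5, min_avg_interval_s=5.0):
    """Return the users of every group that violates a (hypothetical) threshold."""
    indicators = group_indicators(users)
    risky_groups = {
        g for g, ind in indicators.items()
        if ind["black_ratio"] > max_black_ratio
        or ind["avg_interval_s"] < min_avg_interval_s
    }
    return sorted(u["user"] for u in users if u["group"] in risky_groups)


risky = flag_risky_users(USERS)
```

Note that the whole group is flagged, so a user who is not blacklisted (u3 in the sample) is still reported when the group it was clustered into is abnormal.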
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.