CN108197795B - Malicious group account identification method, device, terminal and storage medium

Info

Publication number: CN108197795B
Application number: CN201711460104.0A
Authority: CN
Inventors: 郭秀军; 吴建; 徐勋; 罗顺风; 杨弢
Original assignee: Zhejiang Geely Holding Group Co Ltd; Hangzhou Youxing Technology Co Ltd
Current assignee: Zhejiang Geely Holding Group Co Ltd; Hangzhou Youxing Technology Co Ltd
Priority date: 2017-12-28
Filing date: 2017-12-28
Publication date: 2020-11-03
Anticipated expiration: 2037-12-28
Also published as: CN108197795A

Abstract

The invention relates to the technical field of big data mining, and provides a malicious group account identification method, a malicious group account identification device, a malicious group account identification terminal and a storage medium, wherein the malicious group account identification method comprises the following steps: acquiring feature data sets of all accounts to be identified; standardizing each characteristic data set to obtain a corresponding standard data set; predicting each standard data set by using a preset account relation prediction model to obtain a first weight coefficient corresponding to each standard data set; extracting features of each standard data set to obtain a second weight coefficient corresponding to each standard data set; and obtaining malicious group accounts in all the accounts to be identified according to the first weight coefficient and the second weight coefficient corresponding to all the standard data sets. According to the invention, the association weight among the accounts is comprehensively considered from multiple dimensions, so that the accuracy of the classification result when the community discovery algorithm is used for processing the actual problem is improved.

Description

Malicious group account identification method, device, terminal and storage medium

Technical Field

The invention relates to the technical field of big data mining, in particular to a malicious group account identification method, a malicious group account identification device, a malicious group account identification terminal and a storage medium.

Background

With the development of internet technology, the internet mobility market in China has entered a period of rapid growth. To promote the market, mobility platforms often adopt aggressive subsidy strategies to attract more drivers to use the platform. At the same time, various illegal cheating schemes have emerged to collect these high subsidies, and such cheating shows a trend of evolving from individual offenses into group offenses. Meanwhile, with the development of technologies such as machine learning and artificial intelligence, data mining methods for intelligently identifying groups have matured. However, when applied in practical scenarios, current group identification methods consider only a single factor of the relationship between group accounts, so their identification accuracy is low.

Disclosure of Invention

An object of the embodiments of the present invention is to provide a malicious group account identification method, apparatus, terminal and storage medium to address the above problem.

In order to achieve the above purpose, the embodiment of the present invention adopts the following technical solutions:

In a first aspect, an embodiment of the present invention provides a malicious group account identification method, where the method includes: acquiring feature data sets of all accounts to be identified; standardizing each feature data set to obtain a corresponding standard data set; predicting each standard data set by using a preset account relation prediction model to obtain a first weight coefficient corresponding to each standard data set; performing feature extraction on each standard data set to obtain a second weight coefficient corresponding to each standard data set; and obtaining the malicious group accounts among all the accounts to be identified according to the first weight coefficients and the second weight coefficients corresponding to all the standard data sets.

In a second aspect, an embodiment of the present invention further provides a malicious group account identification apparatus, where the apparatus includes a feature data set acquisition module, a data standardization module, a data prediction module, a feature extraction module, and a malicious account division module. The characteristic data set acquisition module is used for acquiring characteristic data sets of all accounts to be identified; the data standardization module is used for carrying out standardization processing on each characteristic data set to obtain a corresponding standard data set; the data prediction module is used for predicting each standard data set by using a preset account relation prediction model to obtain a first weight coefficient corresponding to each standard data set; the characteristic extraction module is used for extracting characteristics of each standard data set to obtain a second weight coefficient corresponding to each standard data set; and the malicious account division module is used for obtaining malicious group accounts in all the accounts to be identified according to the first weight coefficient and the second weight coefficient corresponding to all the standard data sets.

In a third aspect, an embodiment of the present invention further provides a terminal, where the terminal includes: one or more processors; and a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the malicious group account identification method described above.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the malicious group account identification method described above.

Compared with the prior art, in the malicious group account identification method, apparatus, terminal and storage medium provided by the embodiments of the invention, each feature data set is first standardized to obtain the standard data set corresponding to each account to be identified; prediction and feature extraction are then performed on each standard data set to obtain the corresponding first weight coefficient and second weight coefficient; finally, the first weight coefficient and the second weight coefficient are combined to identify malicious group accounts. The factors influencing the relationships among the accounts to be identified are thus considered from multiple dimensions, each factor is weighted according to its degree of influence on the account relationship, and all the weights are integrated into one comprehensive weight that is used to identify malicious group accounts, which improves the accuracy of identification.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a block diagram illustrating a terminal according to an embodiment of the present invention.

Fig. 2 shows a flowchart of a malicious group account identification method provided by an embodiment of the present invention.

Fig. 3 is a flowchart illustrating sub-steps of step S102 shown in fig. 2.

Fig. 4 is a flowchart illustrating sub-steps of step S104 shown in fig. 2.

Fig. 5 shows a schematic diagram of a relational network diagram provided by an example of the present invention.

Fig. 6 is a flowchart illustrating sub-steps of step S105 shown in fig. 2.

Fig. 7 is a block diagram illustrating a malicious group account identification apparatus according to an embodiment of the present invention.

Reference numerals: 100 - terminal; 101 - memory; 102 - memory controller; 103 - processor; 200 - malicious group account identification apparatus; 201 - feature data set acquisition module; 202 - data standardization module; 203 - data prediction module; 204 - feature extraction module; 205 - malicious account division module.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

Referring to fig. 1, fig. 1 is a block diagram illustrating a terminal 100 according to an embodiment of the present invention. The terminal 100 may be, but is not limited to, a smart phone, a tablet computer, a Personal Computer (PC), a server, and the like. The operating system of the terminal 100 may be, but is not limited to, Android, iOS, Windows Phone, Windows, and the like. The terminal 100 includes a malicious group account identification apparatus 200, a memory 101, a memory controller 102, and a processor 103.

The memory 101, the memory controller 102 and the processor 103 are electrically connected to each other, directly or indirectly, to enable data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The malicious group account identification apparatus 200 includes at least one software functional module that may be stored in the memory 101 in the form of software or firmware, or solidified in the Operating System (OS) of the terminal 100. The processor 103 is used for executing the executable modules stored in the memory 101, such as the software functional modules and computer programs included in the malicious group account identification apparatus 200.

The memory 101 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 101 is configured to store a program, and the processor 103 executes the program after receiving an execution instruction.

The processor 103 may be an integrated circuit chip having signal processing capabilities. The processor 103 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a voice processor, a video processor, and the like; it may also be a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The processor 103 can implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor 103 may be any conventional processor or the like.

First embodiment

Referring to fig. 2, fig. 2 is a flowchart illustrating a malicious group account identification method according to an embodiment of the present invention. The method comprises the following steps:

Step S101, acquiring feature data sets of all accounts to be identified.

In the embodiment of the present invention, an account to be identified may be a user who has registered on a transaction platform for the purpose of trading and who is suspected of conducting malicious transactions on that platform. For example, on an online ride-hailing platform the user may be a driver or a passenger; if the user is suspected of obtaining illegal benefits on the platform by improper means, such as exploiting a platform bug or illegally intruding into the platform to deliberately fabricate transaction data in order to obtain high subsidies by fraud, the user involved in the malicious transaction is an account to be identified. Feature attributes are attributes that characterize the transaction information between each account to be identified and other accounts, and may include, but are not limited to, the risk level, location, number, and amount of the transactions between each account to be identified and other accounts. The feature data set is the set of values of all feature attributes of an account to be identified and is extracted from the raw data of that account. The raw data may be derived from, but is not limited to, data tables of the account to be identified such as the daily transaction summary table, daily transfer summary table, refund record table, transaction device information table, basic information table, order record detail table, daily summary table, acquisition resource table, attrition prediction table, call record table, consultation record table, and complaint table. The raw data is acquired on a 30-day cycle, that is, once every 30 days.
The feature data set may be represented as {x_{1_1} = 3, x_{1_2} = 1, x_{1_3} = 10, …, x_{1_125} = 37}, where the first number in the subscript is the serial number of the feature data set and the second number is the serial number of the feature attribute within that feature data set. In one embodiment of the present invention, the first attribute represents the transaction risk level, the second the transaction location, the third the number of transactions, and the 125th the transaction amount. For example, x_{1_1} = 3 means that the first feature attribute of the first feature data set, i.e., the transaction risk level, is 3; x_{1_2} = 1 means that the second feature attribute of the first feature data set, i.e., the transaction location, is 1, where 1 denotes a city; x_{1_3} = 10 means that the third feature attribute of the first feature data set, i.e., the number of transactions, is 10; and x_{1_125} = 37 means that the 125th feature attribute of the first feature data set, i.e., the transaction amount, is 37 (in units of ten thousand). The feature data set of all accounts to be identified is the collection of the feature data sets of the individual accounts to be identified.

Step S102, standardizing each feature data set to obtain a corresponding standard data set.

In the embodiment of the present invention, first, the feature attribute set corresponding to the feature data sets obtained in step S101 is acquired. The feature attribute set is a set of feature attributes, each of which corresponds to a column in the feature data sets. For example, suppose there are 3 feature data sets: the feature attributes of the 1st include the transaction risk level and the transaction location, those of the 2nd include the number of transactions and the transaction amount, and those of the 3rd include the transaction risk level and the number of transactions. The feature attribute set of these 3 feature data sets then includes the transaction risk level, the transaction location, the number of transactions and the transaction amount. Second, the missing samples and valid samples of each feature attribute are counted across the feature data sets to obtain a missing sample value and a valid sample value. A missing sample of a feature attribute is a feature data set that either does not include the attribute or includes it with a missing value; a valid sample of a feature attribute is a feature data set that includes the attribute with a valid value. For example, if the 1st feature data set does not have the number-of-transactions attribute, it is a missing sample of that attribute; if the 2nd feature data set includes the transaction location attribute with a valid value, it is a valid sample of that attribute.
The missing sample value of a feature attribute may be the total number of missing samples of that attribute across all feature data sets divided by the total number of feature data sets. For example, if there are 10 feature data sets and 8 of them lack the number-of-transactions attribute, the missing sample value of that attribute is 0.8. Likewise, the valid sample value of a feature attribute may be the total number of valid samples of that attribute across all feature data sets divided by the total number of feature data sets. Then, the feature data that actually contains valid information is screened out according to the missing sample values and the valid sample values; the screening may be done by deleting invalid feature attribute columns from the feature data sets of all accounts to be identified. An invalid feature attribute column is a column whose feature attribute has a missing sample value that reaches a first threshold, or a valid sample value that does not reach a second threshold. The first threshold may be, but is not limited to, an empirical value. For example, if the first threshold is 0.7, columns whose feature attributes have a missing sample value of 0.7 or more are invalid feature attribute columns and need to be deleted from the feature data sets. The second threshold may likewise be, but is not limited to, an empirical value. For example, if the second threshold is 0.8, columns whose feature attributes have a valid sample value below 0.8 are invalid feature attribute columns and need to be deleted from the feature data sets. Next, the sample mean and the sample standard deviation of each feature attribute in the screened feature data sets are calculated.
The sample mean of each feature attribute may be obtained by summing the corresponding feature attribute values in the screened feature data sets and dividing the sum by the total number of screened feature data sets, and the sample standard deviation is calculated according to formula (1):

\sigma_{u_i} = \sqrt{\frac{1}{N-1} \sum_{j=1}^{N} \left(u_{ij} - \bar{u}_i\right)^2} (1)

where u_{ij} is a value of the feature attribute u_i, \bar{u}_i is the sample mean of the feature attribute u_i, \sigma_{u_i} is the sample standard deviation of the feature attribute u_i, and N is the total number of screened feature data sets. Finally, each feature data set is standardized with the standardization formula (2), using the sample mean and the sample standard deviation of each feature attribute, to obtain the corresponding standard data set:

u'_{ij} = \frac{u_{ij} - \bar{u}_i}{\sigma_{u_i}} (2)

where u_{ij} is a value of the feature attribute u_i, u'_{ij} is the standardized value of the feature attribute u_i, \bar{u}_i is the sample mean of the feature attribute u_i, and \sigma_{u_i} is the sample standard deviation of the feature attribute u_i.

It should be noted that, in this step, the missing attribute value and the valid attribute value of each feature data set may also be counted over its feature attribute set, and the feature data sets of all accounts to be identified may be screened accordingly. The missing attribute value may be the number of attributes in the feature attribute set that the feature data set either lacks or contains with a missing value, divided by the total number of attributes in the feature attribute set. The valid attribute value may be the number of attributes in the feature attribute set whose values in the feature data set are valid, divided by the total number of attributes in the feature attribute set. For example, if the feature attribute set of the 1st feature data set includes 100 attributes, of which 90 attribute values are missing and 10 are valid, then the missing attribute value of the 1st feature data set is 0.9 and its valid attribute value is 0.1. When the missing attribute value reaches a third threshold, the feature data set is considered to contain insufficient information and is deleted from the feature data sets of all accounts to be identified. The third threshold may be, but is not limited to, an empirical value such as 0.7; when the missing attribute value is greater than or equal to 0.7, the corresponding feature data set is deleted from the feature data sets of all accounts to be identified. When the valid attribute value does not reach a fourth threshold, the valid information contained in the feature data set is considered to be of low value, and the feature data set is deleted from the feature data sets of all accounts to be identified.
The fourth threshold may be, but is not limited to, an empirical value, such as 0.8, and when the valid attribute value is less than 0.8, the corresponding feature data set is deleted from the feature data sets of all accounts to be identified.
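
The per-data-set screening described in this note can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names, the dictionary representation of a feature data set, and the simplification that every non-missing value counts as valid are all hypothetical; the thresholds 0.7 and 0.8 are the example values given in the text.

```python
def attribute_ratios(record, attribute_names):
    """Return (missing attribute value, valid attribute value) for one feature data set.

    Assumption: a value is "missing" when its key is absent or None, and every
    non-missing value counts as valid.
    """
    total = len(attribute_names)
    missing = sum(1 for name in attribute_names if record.get(name) is None)
    return missing / total, (total - missing) / total


def screen_feature_data_sets(records, attribute_names,
                             third_threshold=0.7, fourth_threshold=0.8):
    """Drop feature data sets with too many missing or too few valid attributes."""
    kept = []
    for record in records:
        missing_value, valid_value = attribute_ratios(record, attribute_names)
        if missing_value >= third_threshold or valid_value < fourth_threshold:
            continue  # insufficient information: delete this feature data set
        kept.append(record)
    return kept


# Example from the text: 100 attributes, of which 90 are missing and 10 valid.
names = [f"attr_{i}" for i in range(100)]
sparse = {name: 1 for name in names[:10]}
dense = {name: 1 for name in names}
print(attribute_ratios(sparse, names))                        # (0.9, 0.1)
print(len(screen_feature_data_sets([sparse, dense], names)))  # 1
```

With these example thresholds, the sparse feature data set is removed on both criteria, while the fully populated one is kept.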

Referring to fig. 3, step S102 may further include the following sub-steps:

Step S1021, acquiring the feature attribute set of all accounts to be identified, wherein the feature attribute set comprises a plurality of feature attributes.

In embodiments of the present invention, the characteristic attribute may be, but is not limited to, a transaction risk level, a transaction location, a transaction number, a transaction amount, and the like.

Step S1022, counting the missing samples and valid samples of each feature attribute in each feature data set to obtain the missing sample value and the valid sample value.

In the embodiment of the invention, the sum of the numbers of missing samples and valid samples of each feature attribute is less than or equal to the total number of feature data sets of all accounts to be identified, so the sum of the missing sample value and the valid sample value is less than or equal to 1. For example, if the total number of feature data sets of all accounts to be identified is 10, of which 5 lack the number-of-transactions attribute and 4 have this attribute with a valid value, then the missing sample value of the number of transactions is 0.5 and its valid sample value is 0.4.
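
The counting in step S1022 and the column screening built on it might look like the sketch below. The names `sample_values` and `invalid_attribute_columns`, the dictionary layout, and the `non_negative` validity rule are assumptions made for illustration; the source does not say how validity is decided. The thresholds are the example values 0.7 and 0.8 from the description of step S102.

```python
def sample_values(records, attribute, is_valid):
    """Return (missing sample value, valid sample value) for one feature attribute."""
    total = len(records)
    # Missing sample: the attribute is absent, or present with a missing value.
    missing = sum(1 for r in records if r.get(attribute) is None)
    # Valid sample: the attribute is present and its value passes the validity rule.
    valid = sum(1 for r in records
                if r.get(attribute) is not None and is_valid(r[attribute]))
    return missing / total, valid / total


def invalid_attribute_columns(records, attributes, is_valid,
                              first_threshold=0.7, second_threshold=0.8):
    """List the attribute columns that the screening step would delete."""
    invalid = []
    for attribute in attributes:
        missing_value, valid_value = sample_values(records, attribute, is_valid)
        if missing_value >= first_threshold or valid_value < second_threshold:
            invalid.append(attribute)
    return invalid


def non_negative(value):
    """Hypothetical validity rule: a count must not be negative."""
    return value >= 0


# 10 feature data sets: 5 lack the attribute, 4 hold a valid value, 1 an invalid one.
records = ([{} for _ in range(5)]
           + [{"transaction_count": 10} for _ in range(4)]
           + [{"transaction_count": -1}])
print(sample_values(records, "transaction_count", non_negative))  # (0.5, 0.4)
```

The one remaining feature data set holds a value that is neither missing nor valid, which is why the two ratios sum to less than 1.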

Step S1023, obtaining the sample mean and the sample standard deviation of each feature attribute according to the missing sample value and the valid sample value.

In the embodiment of the invention, the feature data that actually contains valid information is screened out according to the missing sample values and the valid sample values, and the sample mean and sample standard deviation of each feature attribute in the screened feature data sets are calculated. Before screening, missing attribute values in each feature data set may be filled in according to the actual situation. The filling method may be, but is not limited to, mean imputation: the values of the feature attribute concerned in the other feature data sets of all accounts to be identified are averaged, and the resulting mean is used as the missing attribute value. For example, suppose there are 5 feature data sets and the value of the 125th feature attribute, the transaction amount, is missing in the 3rd feature data set. The mean of the 125th feature attribute over the feature data sets is calculated to be 33, and this mean is used to fill in the value of the 125th feature attribute in the 3rd feature data set.

Step S1024, calculating the standard data set corresponding to each feature data set by using a standardization algorithm according to the sample mean and the sample standard deviation.

In the embodiment of the present invention, the data in each feature data set is on different scales. For example, the total number of transactions between A and B is 10, the total transaction amount is 1000 yuan, and the transaction risk level of A and B is 3; these 3 feature attributes differ in magnitude. The feature data set therefore cannot be fed directly to an algorithm as input data; it must first be standardized with the standardization formula to obtain the standard data set corresponding to each feature data set.
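
Mean imputation followed by standardization of one feature attribute column can be sketched as below. The function names and the sample values are hypothetical (the values were chosen only so that the mean of the present entries is 33, as in the example of step S1023), and the use of the n-1 sample standard deviation is an assumption.

```python
import statistics


def mean_fill(values):
    """Replace missing entries (None) with the mean of the present entries."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Subtract the sample mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # assumption: n-1 (sample) standard deviation
    if std == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mean) / std for v in values]


# Hypothetical transaction amounts for 5 feature data sets; the 3rd is missing.
amounts = [37, 30, None, 35, 30]
filled = mean_fill(amounts)
print(filled[2])  # 33.0
standardized = standardize(filled)
```

After standardization each column has zero mean and unit sample standard deviation, so attributes such as the number of transactions and the transaction amount become comparable in magnitude.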

Step S103, predicting each standard data set by using a preset account relation prediction model to obtain a first weight coefficient corresponding to each standard data set.

In the embodiment of the invention, the account relation prediction model is a multiple linear regression equation obtained by performing multiple linear regression analysis on historical accounts, where the historical accounts are confirmed malicious group accounts. Each standard data set is substituted into the multiple linear regression equation to obtain its malicious relationship depth prediction value, and the prediction value is then normalized to obtain the first weight coefficient corresponding to that standard data set.

As an embodiment, the method of obtaining the first weight coefficient may include:

First, a multiple linear regression equation is defined as h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + … + θ_n x_n, where n is the number of feature attributes and x_j is the j-th feature attribute value in each feature data set. For example, x_j may be the number of transactions, the amount of transactions, etc. for each pair of transactions. For ease of processing, the multiple linear regression equation is simplified to h_θ(x) = θ^T x = a + b x_j, where θ and x both denote (n+1, 1)-dimensional column vectors. For example, the transaction amounts of 2000 transactions in the historical data may be written as (2000, 1), denoting a single column vector holding the transaction amounts of the 2000 transactions.

Second, a loss function is defined as in formula (3),

where n is the number of feature attributes (columns), m is the number of historical accounts (rows), y_i is the actual result value known from the historical account database, and h_θ(x_i) is the estimated value.

Third, the calculation formula for the regression coefficients is derived using the least squares method, as in formula (4).
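
Formulas (3) and (4) are not reproduced in this text. For orientation only, a common textbook form of the least-squares loss and its closed-form solution that is consistent with the symbol definitions given here is shown below. The scaling factor 1/(2m) and the design-matrix notation X (an m × (n+1) matrix whose i-th row is x_i^T, with a leading 1 for the intercept) are assumptions and may differ from the original formulas.

```latex
% Assumed form of the loss function (3): mean squared error over m historical accounts
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2

% Assumed form of the regression coefficient formula (4): the normal equation
\theta = \left( X^{T} X \right)^{-1} X^{T} y
```

The constant factor in front of the sum does not change the minimizer, so the normal equation is the same whichever scaling the original formula uses.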

Fourth, dummy (virtual) variable conversion is performed on the value of each feature attribute in each feature data set of each historical account to obtain dummy attribute variables. The conversion works as follows: obtain the full value range of each feature attribute in the feature data sets of the historical accounts, convert every value in the range into its own dummy attribute variable, and use 1 or 0 to indicate whether that value is hit. For example, if the value range of x_2 is {1, 2, 3}, representing the 3 cities Beijing, Shanghai and Guangzhou, it is converted into the 3 columns Beijing, Shanghai and Guangzhou in one-to-one correspondence. Then x_2 = 1 is converted into: x_{2_Beijing} = 1, indicating that the Beijing dummy attribute variable takes the value "yes"; x_{2_Shanghai} = 0, indicating that the Shanghai dummy attribute variable takes the value "no"; and x_{2_Guangzhou} = 0, indicating that the Guangzhou dummy attribute variable takes the value "no".
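
The conversion of the city code in the example can be sketched as follows. The function name `to_dummy_variables`, the `CITY_CODES` mapping and the key format are hypothetical choices made for illustration.

```python
def to_dummy_variables(value, categories, prefix):
    """Expand one categorical value into 0/1 dummy variables, one per category."""
    return {f"{prefix}_{name}": 1 if value == code else 0
            for code, name in categories.items()}


# Value range {1, 2, 3} of x2 and the cities it stands for, as in the example.
CITY_CODES = {1: "Beijing", 2: "Shanghai", 3: "Guangzhou"}

print(to_dummy_variables(1, CITY_CODES, "x2"))
# {'x2_Beijing': 1, 'x2_Shanghai': 0, 'x2_Guangzhou': 0}
```

Encoding each city as its own 0/1 column keeps the regression from treating the city codes 1, 2 and 3 as ordered quantities.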

Fifth, the dummy (virtual) attribute values are substituted into the regression equation h_θ(x) = θ^T x = a + b x_j defined in the first step, whose regression coefficients a and b are unknown, and the equation is solved for a and b to obtain the complete regression equation.

Sixth, each standard data set is substituted into the complete regression equation obtained in the fifth step to obtain the corresponding malicious relationship depth prediction value. For example, substituting the standard data set {x_{200_1} = 0, x_{200_2} = 1, x_{200_3} = 40, …, x_{200_125} = 55} into the regression equation h_θ(x) = θ^T x = a + b x_j, where the values of a and b were obtained in the fifth step, yields the corresponding malicious relationship depth {y_200 = 98.233}.
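
As a rough illustration of the fifth and sixth steps, the sketch below fits the simplified form a + b x_j by ordinary least squares for a single predictor and then uses it for prediction. This is a deliberately reduced stand-in: the patent describes a multiple regression over many attributes, which would require a linear solver. The function names and the data are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates of a and b in y = a + b * x (single predictor)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_mean - b * x_mean  # intercept
    return a, b


def predict(a, b, x):
    """Evaluate the fitted regression equation for one attribute value."""
    return a + b * x


# Hypothetical historical data: one attribute value and a known result per account.
history_x = [1, 2, 3, 4]
history_y = [3, 5, 7, 9]
a, b = fit_simple_regression(history_x, history_y)
print(a, b)               # 1.0 2.0
print(predict(a, b, 10))  # 21.0
```

The value returned by `predict` plays the role of the malicious relationship depth prediction value that is normalized in the next step.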

Seventh, the malicious relationship depth prediction value obtained in the previous step is normalized to obtain the corresponding first weight coefficient. The normalization formula is formula (5):

w_i' = \frac{w_i - w_{\min}}{w_{\max} - w_{\min}} (5)

where w_i is each characteristic value, w_min is the minimum of the characteristic values, w_max is the maximum of the characteristic values, and w_i' is the normalized value. For example, for the above standard data set {x_{200_1} = 0, x_{200_2} = 1, x_{200_3} = 40, …, x_{200_125} = 55}, whose malicious relationship depth is y_200 = 98.233, linear normalization yields the corresponding first weight coefficient {w_{1_200}' = 0.878804}.
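
Formula (5) is ordinary min-max normalization, which can be sketched as below. The function name and the sample prediction values are hypothetical; the figures 98.233 and 0.878804 in the text depend on the minimum and maximum of the full set of predictions, which are not given, so they are not reproduced here.

```python
def min_max_normalize(values):
    """Linearly rescale values to the range [0, 1] as in formula (5)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # all values equal: no spread to rescale
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical malicious relationship depth prediction values.
depths = [10.0, 20.0, 30.0]
print(min_max_normalize(depths))  # [0.0, 0.5, 1.0]
```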

And step S104, performing feature extraction on each standard data set to obtain a second weight coefficient corresponding to each standard data set.

In this embodiment of the present invention, the second weight coefficient corresponding to each standard data set may be calculated as follows. First, a standard data aggregate, which comprises the standard data set of each account to be identified, is assembled, and principal component analysis is performed on the standard data aggregate to obtain a plurality of principal component data in the standard data aggregate. For example, if the characteristic attribute set comprises the 4 attributes transaction danger level, transaction place, transaction times and transaction amount, the principal component analysis may yield transaction danger level, transaction place and transaction times as the principal components. Then, taking the variance contribution rate of each principal component as a weight, the principal components corresponding to each standard data set are weighted to obtain a comprehensive affinity value for each standard data set, and the comprehensive affinity value is normalized to obtain the second weight coefficient corresponding to that standard data set. The variance contribution rate is a weight reflecting the influence of a principal component on each standard data set.

Referring to fig. 4, step S104 may further include the following sub-steps:

Substep S1041: principal component analysis is performed on the standard data aggregate to obtain a plurality of principal component data in the standard data aggregate, wherein the standard data aggregate comprises the standard data set of each account to be identified.

In the embodiment of the invention, principal component analysis is performed on the standard data aggregate, so that the dimension of the characteristic attribute set is reduced and the most relevant characteristic attributes are obtained; the characteristic equation is then solved for these attributes to finally obtain the principal components.

As an embodiment, the method of principal component analysis may include:

first, a normalization matrix of a standard data set is obtained, each row in the normalization matrix representing a pair of transactions and each column in the normalization matrix representing a characteristic attribute of one transaction.

Secondly, the correlation coefficient matrix of the normalization matrix obtained in the previous step is solved using formula (6).

The correlation coefficient is a statistical index reflecting how closely variables are correlated. It is calculated by the product-moment method: starting from the deviations of the two variables from their respective means, the two deviations are multiplied together to reflect the degree of correlation between the two variables. The value of the correlation coefficient lies between −1 and +1, i.e., −1 ≤ r ≤ 1. Its properties are as follows:

When r > 0, the two variables are positively correlated; when r < 0, the two variables are negatively correlated.

When |r| = 1, the two variables are completely linearly correlated, i.e., they are in a functional relationship.

When r = 0, there is no linear correlation between the two variables.

When 0 < |r| < 1, there is some degree of linear correlation between the two variables. The closer |r| is to 1, the closer the linear relationship between the two variables; the closer |r| is to 0, the weaker the linear correlation between the two variables. Generally, three grades are distinguished: |r| ≤ 0.4 is low linear correlation; 0.4 < |r| < 0.7 is significant correlation; 0.7 ≤ |r| < 1 is high linear correlation. For example, in this embodiment, x₁ indicates the danger level of a transaction between accounts, x₃ indicates the number of transactions between accounts, and r₁₃ = r₃₁ = −0.0423465, indicating that the transaction danger level is only weakly correlated with the number of transactions between accounts.
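As a non-limiting illustration, the product-moment correlation coefficient and the three grades described above can be sketched as follows. The function names and the sample vectors are hypothetical, and the standard Pearson form is assumed for the coefficient.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation: summed products of deviations from the means,
    divided by the product of the root sums of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sx * sy)


def describe_strength(r):
    """Grade |r| with the thresholds quoted in the text (0.4 and 0.7).

    In this sketch |r| = 1 also falls into the 'high' grade.
    """
    magnitude = abs(r)
    if magnitude <= 0.4:
        return "low"
    if magnitude < 0.7:
        return "significant"
    return "high"


print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(describe_strength(-0.0423465))  # value of r13 quoted in the text
```

Applying `pearson_r` to every pair of columns of the normalization matrix gives the symmetric correlation coefficient matrix used in the next step.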

Thirdly, the characteristic equation |R − λI_p| = 0 is solved according to the correlation coefficient matrix R obtained in the second step, giving p characteristic roots λ.

Fourthly, the number m of characteristic roots to be retained among the p characteristic roots obtained in the third step is determined using formula (7): the m largest characteristic roots are selected so that the utilization rate of the information reaches 95 percent. Then, for each retained characteristic root λ_j, j = 1, 2, …, m, the system of equations Rb = λ_j·b is solved to obtain the corresponding unit eigenvector.

Fifthly, the eigenvectors obtained in the fourth step are converted into principal components using formula (8), wherein U₁ is referred to as the first principal component, U₂ as the second principal component, and U_m as the m-th principal component. In the embodiment of the invention, the first principal component is U₁ = {−12.044529, 11.927115, …, −5.901484}, the second principal component is U₂ = {−0.427374, −0.079803, …, −0.793191}, and so on, for a total of 67 principal components.
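As a non-limiting illustration, the five steps of the principal component analysis can be sketched with NumPy as follows. The function name `principal_components`, the parameter `info_threshold` and the synthetic data are hypothetical; the 95 percent criterion is implemented here as the cumulative share of the characteristic roots, which is an assumption about the form of formula (7), and projecting the standardized data onto the unit eigenvectors is an assumption about the form of formula (8).

```python
import numpy as np


def principal_components(data, info_threshold=0.95):
    """Return (scores, contribution_rates) for the retained principal components.

    Assumes every column of `data` has non-zero variance.
    """
    X = np.asarray(data, dtype=float)
    # Step 1: standardize each column (zero mean, unit sample standard deviation).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: p x p correlation coefficient matrix R.
    R = np.corrcoef(X, rowvar=False)
    # Step 3: characteristic roots and unit eigenvectors of the symmetric matrix R.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # sort roots from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: smallest m whose cumulative share of the roots reaches the threshold.
    rates = eigvals / eigvals.sum()
    m = int(np.argmax(np.cumsum(rates) >= info_threshold)) + 1
    # Step 5: principal component scores U_1 ... U_m.
    scores = Z @ eigvecs[:, :m]
    return scores, rates[:m]


# Synthetic data: 60 records, 4 attributes, two of them nearly redundant.
rng = np.random.default_rng(0)
base = rng.normal(size=(60, 2))
data = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.1 * rng.normal(size=60),
    2.0 * base[:, 1],
])
scores, rates = principal_components(data)
print(scores.shape, rates)
```

Because two of the four synthetic attributes are nearly copies of the other two, fewer than four components are needed to reach the threshold, which is the dimension reduction described in the text.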

Substep S1042: a second weight coefficient corresponding to each standard data set is obtained according to the contribution rate of each principal component data to each standard data set.

In the embodiment of the present invention, the second weight coefficient may be solved as follows: taking the variance contribution rate of each principal component data in each standard data set as a weight, the m principal components of each standard data set are weighted and summed to obtain a comprehensive weight coefficient, and the comprehensive weight coefficient is normalized to obtain the corresponding second weight coefficient. For example, the second weight coefficient of the 1st standard data set is Result_Weight = {w_{2_1} = 0.143221642}, and the second weight coefficient of the 200th standard data set is Result_Weight = {w_{2_200} = −0.136319985}.
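As a non-limiting illustration, the weighted summation and normalization of substep S1042 can be sketched as follows. The function name `second_weights` and the sample scores are hypothetical, and the linear normalization of formula (5) is assumed for the final step, since the text does not state which normalization is applied here.

```python
def second_weights(scores, contribution_rates):
    """Weight each record's principal component scores by the variance
    contribution rates, then linearly normalize the composite values."""
    composite = [
        sum(rate * score for rate, score in zip(contribution_rates, row))
        for row in scores
    ]
    lo, hi = min(composite), max(composite)
    if hi == lo:
        # Degenerate case (all composites equal); returning zeros is an assumption.
        return [0.0 for _ in composite]
    return [(c - lo) / (hi - lo) for c in composite]


# Hypothetical scores of 3 records on m = 2 principal components.
example_scores = [[2.0, 1.0], [0.0, 0.0], [-2.0, 1.0]]
example_rates = [0.75, 0.25]  # hypothetical variance contribution rates
weights = second_weights(example_scores, example_rates)
print(weights)
```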

And step S105, obtaining malicious group accounts in all accounts to be identified according to the first weight coefficient and the second weight coefficient corresponding to all the standard data sets.

In the embodiment of the present invention, after the first weight coefficient and the second weight coefficient corresponding to each standard data set are obtained, the two coefficients are first weighted and averaged to obtain a third weight coefficient corresponding to each standard data set. Then, the relationship network graph among all accounts to be identified is restored according to the third weight coefficients corresponding to all the standard data sets. For example, suppose there are 4 standard data sets: the 1st represents a transaction initiated by account 5 to account 8, with a malicious weight coefficient of 0.325; the 2nd represents a transaction initiated by account 3 to account 5, with a malicious weight coefficient of 0.56; the 3rd represents a transaction initiated by account 6 to account 3, with a malicious weight coefficient of 0.84; and the 4th represents a transaction initiated by account 6 to account 5, with a malicious weight coefficient of 0.66. The relationship network graph restored from this information is shown in fig. 5. Finally, clustering calculation is performed on the relationship network graph using a hierarchical clustering greedy algorithm to obtain the malicious group accounts among all accounts to be identified.

Referring to fig. 6, step S105 may further include the following sub-steps:

in the substep S1051, the first weight coefficient and the second weight coefficient corresponding to each standard data set are weighted and averaged to obtain a third weight coefficient corresponding to each standard data set.

In the embodiment of the present invention, after the first weight coefficient and the second weight coefficient corresponding to each standard data set are weighted and averaged, if the resulting weighted average is too small to be convenient for the subsequent community classification algorithm, the weighted average corresponding to each standard data set may be enlarged by a suitable factor and used as the corresponding third weight coefficient. For example, the first weight coefficient of the 1st standard data set is Result_Weight = {w_{1_1}' = 0.104926} and the first weight coefficient of the 200th standard data set is Result_Weight = {w_{1_200}' = 0.878804}; the second weight coefficient of the 1st standard data set is Result_Weight = {w_{2_1}' = 0.595968} and the second weight coefficient of the 200th standard data set is Result_Weight = {w_{2_200}' = 0.294188}. After weighted averaging, and with the weighted average enlarged by 10 times for the convenience of the subsequent algorithm, the third weight coefficient of the 1st standard data set is Result_Weight = {W₁ = 3.50447} and the third weight coefficient of the 200th standard data set is Result_Weight = {W₂₀₀ = 5.86496}. This means that, after processing with the first and second weights, the malicious relationship affinity of the 1st transaction pair is 3.50447 and that of the 200th transaction pair is 5.86496.
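As a non-limiting illustration, the calculation of the third weight coefficient can be sketched as follows. The function name `third_weight` and the parameters `alpha` and `scale` are hypothetical; equal weighting of the two coefficients is an assumption, chosen because it reproduces the figures quoted above.

```python
def third_weight(w1, w2, alpha=0.5, scale=10.0):
    """Weighted average of the first and second weight coefficients,
    enlarged by `scale` for the subsequent community algorithm."""
    return scale * (alpha * w1 + (1.0 - alpha) * w2)


# Coefficients of the 1st and 200th standard data sets quoted in the text.
W_1 = third_weight(0.104926, 0.595968)
W_200 = third_weight(0.878804, 0.294188)
print(round(W_1, 5), round(W_200, 5))  # 3.50447 5.86496
```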

And a substep S1052, restoring the relationship network diagram among all accounts to be identified according to the third weight coefficients corresponding to all the standard data sets.

In the embodiment of the present invention, a node in the relationship network graph is an account corresponding to a standard data set, and the weight of an edge in the relationship network graph is the third weight coefficient of the standard data set corresponding to the accounts. The relationship network graph among all accounts to be identified is restored according to the information of the nodes and the edges. For example, Result_Weight = {W₁ = 3.50447} represents the relationship data of a transaction initiated by account 0 to account 9; there is therefore a directed edge from the node representing account 0 to the node representing account 9, and the weight of the edge is 3.50447.
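As a non-limiting illustration, restoring the relationship network graph from (initiating account, receiving account, weight) records can be sketched as follows. The function name `build_relation_graph` and the nested-dictionary representation are hypothetical choices for this sketch; the four records are the example given in the description of step S105.

```python
def build_relation_graph(records):
    """Return a directed weighted graph as {source: {target: weight}}.

    Every account appears as a key, even if it initiates no transaction.
    """
    graph = {}
    for source, target, weight in records:
        graph.setdefault(source, {})[target] = weight
        graph.setdefault(target, {})
    return graph


# (initiating account, receiving account, malicious weight coefficient)
records = [(5, 8, 0.325), (3, 5, 0.56), (6, 3, 0.84), (6, 5, 0.66)]
graph = build_relation_graph(records)
print(graph)
```

Each outer key is a node, and each inner entry is a directed edge carrying the third weight coefficient of the corresponding standard data set.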

And a substep S1053 of identifying malicious group accounts in all the accounts to be identified according to the relational network graph by using a hierarchical cluster greedy algorithm.

In an embodiment of the present invention, a malicious community account corresponds to a cluster in the algorithm. The hierarchical clustering greedy algorithm is to continuously add new nodes to existing clusters under the condition that the modularity is not reduced, and finally obtain the cluster with the maximum modularity and the maximum number of nodes, so as to obtain malicious group accounts in all accounts to be identified, wherein the clusters in the algorithm are also called communities, groups and the like. Communities and clusters described below represent the same meaning.

As an embodiment, the method of hierarchical clustering greedy algorithm may include:

First, an initial modularity Q is calculated by formula (9):

Q = (1/(2m)) Σ_{i,j} [A_{i,j} − k_i·k_j/(2m)] δ(c_i, c_j) (9)

wherein m = (1/2) Σ_{i,j} A_{i,j} is the sum of all weights in the network, A_{i,j} is the weight between node i and node j, k_i = Σ_j A_{i,j} is the sum of the weights of the edges connected to vertex i, c_i is the community to which vertex i is assigned, and δ(c_i, c_j) judges whether vertex i and vertex j are assigned to the same community, returning 1 if so and 0 otherwise.

For ease of calculation, formula (9) is simplified to formula (10):

Q = Σ_c [Σin/(2m) − (Σtot/(2m))²] (10)

where Σin is the internal weight of community c, and Σtot is the weight of the edges connected to the points inside community c, including both the edges inside the community and the edges leading outside the community.
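As a non-limiting illustration, the pairwise form of the modularity and its per-community simplification can be sketched as follows and checked against each other. The sketch assumes the standard weighted modularity on an undirected graph with a symmetric adjacency map, so that Σin counts each internal edge in both directions; the function names and the six-node example graph are hypothetical.

```python
def undirected_adjacency(edges):
    """Build a symmetric weighted adjacency map {node: {neighbour: weight}}."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def modularity_pairwise(adj, community):
    """Pairwise form: sum over all node pairs that share a community."""
    two_m = sum(sum(row.values()) for row in adj.values())      # 2m
    degree = {i: sum(row.values()) for i, row in adj.items()}   # k_i
    total = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:                    # delta(c_i, c_j) = 1
                total += adj[i].get(j, 0.0) - degree[i] * degree[j] / two_m
    return total / two_m


def modularity_by_community(adj, community):
    """Simplified form: one term per community from sigma_in and sigma_tot."""
    two_m = sum(sum(row.values()) for row in adj.values())
    sigma_in, sigma_tot = {}, {}
    for i, row in adj.items():
        c = community[i]
        sigma_tot[c] = sigma_tot.get(c, 0.0) + sum(row.values())
        sigma_in[c] = sigma_in.get(c, 0.0) + sum(
            w for j, w in row.items() if community[j] == c
        )
    return sum(sigma_in[c] / two_m - (sigma_tot[c] / two_m) ** 2 for c in sigma_tot)


# Two triangles joined by a single bridge edge, all weights 1 (hypothetical graph).
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 1.0)]
adj = undirected_adjacency(edges)
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q_pairwise = modularity_pairwise(adj, community)
q_simplified = modularity_by_community(adj, community)
print(q_pairwise, q_simplified)
```

The pairwise form visits every pair of nodes, while the simplified form makes one pass over the edges, which is why the simplification shortens the calculation time on large graphs.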

Secondly, dividing any node into communities where the adjacent nodes are located, calculating the new modularity at this time according to the formula (10), if the new modularity is not less than the initial modularity calculated in the first step, attributing the node to the communities, otherwise, not attributing the node to the communities.

And thirdly, taking the new modularity as the initial modularity, and continuing the iteration of the second step until the modularity reaches the maximum or all the nodes are completely divided, thereby finally obtaining a community.

Fourthly, repeating the first to third steps for the rest nodes to finally obtain all communities in the relational network graph, namely all corresponding malicious communities.
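As a non-limiting illustration, the greedy procedure of the first to fourth steps can be sketched as follows. Every node starts in its own community and is moved into a neighbouring community whenever the move raises the modularity. Two simplifications are assumptions of this sketch: the graph is treated as undirected, and a move is accepted only on a strict increase (rather than "not less than") so that the loop is guaranteed to terminate. The function names and the example graph are hypothetical.

```python
def modularity(adj, community):
    """Per-community modularity for a symmetric weighted adjacency map."""
    two_m = sum(sum(row.values()) for row in adj.values())
    sigma_in, sigma_tot = {}, {}
    for i, row in adj.items():
        c = community[i]
        sigma_tot[c] = sigma_tot.get(c, 0.0) + sum(row.values())
        sigma_in[c] = sigma_in.get(c, 0.0) + sum(
            w for j, w in row.items() if community[j] == c
        )
    return sum(sigma_in[c] / two_m - (sigma_tot[c] / two_m) ** 2 for c in sigma_tot)


def greedy_communities(adj):
    """Move nodes between neighbouring communities while modularity increases."""
    community = {node: node for node in adj}      # each node starts alone
    best_q = modularity(adj, community)
    improved = True
    while improved:
        improved = False
        for node in sorted(adj):
            original = community[node]
            best_target = original
            for neighbour in sorted(adj[node]):
                target = community[neighbour]
                if target == original:
                    continue
                community[node] = target          # tentatively move the node
                q = modularity(adj, community)
                if q > best_q + 1e-12:            # keep only strict improvements
                    best_q, best_target = q, target
            community[node] = best_target         # commit the best move, or undo
            if best_target != original:
                improved = True
    groups = {}
    for node, c in community.items():
        groups.setdefault(c, []).append(node)
    return sorted(sorted(members) for members in groups.values()), best_q


# Two triangles joined by one bridge edge, all weights 1 (hypothetical graph).
adj = {
    0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0, 5: 1.0}, 4: {3: 1.0, 5: 1.0}, 5: {3: 1.0, 4: 1.0},
}
groups, q = greedy_communities(adj)
print(groups, q)
```

On this small graph the procedure separates the two triangles. Recomputing the modularity for every tentative move keeps the sketch short; a practical implementation would update Σin and Σtot incrementally instead.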

In the embodiment of the invention, communities are divided through a hierarchical clustering greedy algorithm, and finally 7 malicious communities are obtained. The first group of related transaction accounts is a total of 12 accounts, including {1,9,21,22,26,30,41,43,44,48,53,61 }; a second group of associated transaction accounts having 7 accounts, comprising {8,12,19,35,46,52,62 }; a third group of related transaction accounts having 9 accounts, including {3,4,5,11,29,38,47,55,64 }; a fourth group of related transaction accounts having 10 accounts, including {0,10,13,15,16,17,20,25,49,54 }; a fifth group of 11 accounts associated with the transaction account, including {7,14,23,27,31,32,33,34,37,59,65 }; a sixth group of related transaction accounts consisting of 8 accounts, including {18,28,40,42,45,50,57,63 }; a seventh group of associated transaction accounts having 9 accounts, including {2,6,24,36,39,51,56,58,60 }; a total of 66 accounts. The specific meaning of each group is, for example, that the first group of associated transaction accounts is a total of 12 accounts, including {1,9,21,22,26,30,41,43,44,48,53,61}, which means: the 12 accounts have similar danger degree and have too close mutual transaction degree, and the probability of malicious group crime is higher.

In the embodiment of the invention, firstly, the feature data sets of all accounts to be identified are obtained, and the feature data sets are extracted from the original data, so that on one hand, the transaction attributes related to suspected malicious transactions are extracted as comprehensively as possible, and the subsequently calculated first and second weight coefficients are ensured to contain enough malicious evaluation standards with multiple dimensions, thereby being beneficial to improving the accuracy of identifying the malicious group accounts; on the other hand, information which is not related to malicious transactions is stripped, the scale of subsequent data needing to be processed is reduced, and the efficiency of data processing is improved. Secondly, when feature extraction is carried out on each standard data set, on one hand, the information of the account to be identified is fully utilized, and the utilization rate of the information reaches 95 percent, so that the obtained second weight coefficient can reflect the feature attribute information with the maximum malicious correlation in the account to be identified as far as possible, and the accuracy of final identification is further improved; on the other hand, the characteristic attribute with smaller malicious relevance is removed, so that the scale of data needing to be processed subsequently is reduced, and the data processing efficiency is improved. And finally, when the hierarchical cluster greedy algorithm is used for calculating the malicious group accounts in all the accounts to be identified, the calculation formula of the modularity is simplified, the calculation time of the algorithm is shortened, and the efficiency of the algorithm is improved.

Second embodiment

Referring to fig. 7, fig. 7 is a block diagram illustrating a malicious group account identification apparatus 200 according to an embodiment of the present invention. The malicious group account identification apparatus 200 is applied to the terminal 100 and comprises a feature data set obtaining module 201, a data standardization module 202, a data prediction module 203, a feature extraction module 204, and a malicious account dividing module 205.

A feature data set obtaining module 201, configured to obtain feature data sets of all accounts to be identified.

In this embodiment of the present invention, the feature data set obtaining module 201 may be configured to execute step S101.

And the data standardization module 202 is configured to standardize each feature data set to obtain a corresponding standard data set.

In this embodiment of the present invention, the data normalization module 202 may be configured to perform step S102.

In the embodiment of the present invention, the data normalization module 202 may be further configured to perform sub-steps S1021-S1024 of step S102.

The data prediction module 203 is configured to predict each standard data set by using a preset account relation prediction model to obtain a first weight coefficient corresponding to each standard data set.

In this embodiment of the present invention, the data prediction module 203 may be configured to execute step S103.

The feature extraction module 204 is configured to perform feature extraction on each standard data set to obtain a second weight coefficient corresponding to each standard data set.

In this embodiment of the present invention, the feature extraction module 204 may be configured to execute step S104.

In this embodiment of the present invention, the feature extraction module 204 may be further configured to perform sub-steps S1041-S1042 of step S104.

And the malicious account dividing module 205 is configured to obtain malicious group accounts in all the accounts to be identified according to the first weight coefficient and the second weight coefficient corresponding to all the standard data sets.

In this embodiment of the present invention, the malicious account dividing module 205 may be configured to execute step S105.

In this embodiment of the present invention, the malicious account dividing module 205 may be further configured to perform substeps S1051-S1053 of step S105.

An embodiment of the present invention further discloses a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by the processor 103, implements the malicious community account identification method disclosed in the foregoing embodiment of the present invention.

In summary, the present invention provides a malicious group account identification method, apparatus, terminal and storage medium, where the method includes: acquiring feature data sets of all accounts to be identified; standardizing each characteristic data set to obtain a corresponding standard data set; predicting each standard data set by using a preset account relation prediction model to obtain a first weight coefficient corresponding to each standard data set; extracting features of each standard data set to obtain a second weight coefficient corresponding to each standard data set; and obtaining malicious group accounts in all the accounts to be identified according to the first weight coefficient and the second weight coefficient corresponding to all the standard data sets. Compared with the prior art, the method and the device consider various factors influencing the relationship between the accounts to be identified from multiple dimensions, give different weights according to the influence degrees of the factors on the account relationship, finally integrate all the weights into one comprehensive weight, and use the comprehensive weight to identify the malicious group accounts, thereby improving the accuracy of identifying the malicious group accounts.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Claims

1. A malicious group account identification method, the method comprising:

acquiring feature data sets of all accounts to be identified;

standardizing each characteristic data set to obtain a corresponding standard data set;

predicting each standard data set by using a preset account relation prediction model to obtain a first weight coefficient corresponding to each standard data set and subjected to normalization processing, wherein the preset account relation prediction model is obtained by performing multiple regression analysis on a historical account feature data set, and the historical account is a malicious group account;

performing principal component analysis on the standard data aggregate to obtain a plurality of principal component data in the standard data aggregate, wherein the standard data aggregate comprises the standard data set of each account to be identified;

obtaining a second weight coefficient which corresponds to each standard data set and is subjected to normalization processing according to the contribution rate of each principal component data to each standard data set;

and obtaining malicious group accounts in all the accounts to be identified according to the first weight coefficient and the second weight coefficient corresponding to all the standard data sets.

2. The method of claim 1, wherein the step of normalizing each feature data set to obtain a corresponding standard data set comprises:

acquiring a characteristic attribute set of all accounts to be identified, wherein the characteristic attribute set comprises a plurality of characteristic attributes;

counting the lack samples and the effective samples of each characteristic attribute in each characteristic data set to obtain lack sample values and effective sample values;

obtaining a sample mean value and a sample standard deviation of each characteristic attribute according to the lack sample value and the effective sample value;

and calculating a standard data set corresponding to each characteristic data set by using a standardization algorithm according to the sample mean value and the sample standard deviation.

3. The method according to claim 1, wherein the step of obtaining the malicious community accounts in all the accounts to be identified according to the first weight coefficient and the second weight coefficient corresponding to all the standard data sets comprises:

carrying out weighted average on the first weight coefficient and the second weight coefficient corresponding to each standard data set to obtain a third weight coefficient corresponding to each standard data set;

restoring a relationship network diagram among all accounts to be identified according to the third weight coefficients corresponding to all the standard data sets;

and identifying malicious group accounts in all the accounts to be identified according to the relation network graph by utilizing a hierarchical cluster greedy algorithm.

4. A malicious group account identification apparatus, the apparatus comprising:

the characteristic data set acquisition module is used for acquiring the characteristic data sets of all accounts to be identified;

the data standardization module is used for carrying out standardization processing on each characteristic data set to obtain a corresponding standard data set;

the data prediction module is used for predicting each standard data set by using a preset account relation prediction model to obtain a first weight coefficient corresponding to each standard data set and subjected to normalization processing, wherein the preset account relation prediction model is obtained by performing multiple regression analysis on a historical account feature data set, and the historical account is a malicious group account;

a feature extraction module, used for: performing principal component analysis on the standard data aggregate to obtain a plurality of principal component data in the standard data aggregate, wherein the standard data aggregate comprises the standard data set of each account to be identified; and obtaining a second weight coefficient which corresponds to each standard data set and is subjected to normalization processing according to the contribution rate of each principal component data to each standard data set;

and the malicious account division module is used for obtaining malicious group accounts in all the accounts to be identified according to the first weight coefficient and the second weight coefficient corresponding to all the standard data sets.

5. The apparatus of claim 4, wherein the data normalization module is further to:

6. A terminal, characterized in that the terminal comprises:

one or more processors;

memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-3.

7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-3.