CN108197795A

CN108197795A - Malicious group account identification method, device, terminal and storage medium

Info

Publication number: CN108197795A
Application number: CN201711460104.0A
Authority: CN
Inventors: 郭秀军; 吴建; 徐勋; 罗顺风; 杨弢
Original assignee: Hangzhou Yu Hang Science And Technology Co Ltd; Zhejiang Geely Holding Group Co Ltd
Current assignee: Hangzhou Yu Hang Science And Technology Co Ltd; Zhejiang Geely Holding Group Co Ltd
Priority date: 2017-12-28
Filing date: 2017-12-28
Publication date: 2018-06-22
Anticipated expiration: 2037-12-28
Also published as: CN108197795B

Abstract

The present invention relates to the technical field of big data mining and provides a malicious group account identification method, device, terminal and storage medium. The method includes: obtaining the characteristic data sets of all accounts to be identified; standardizing each characteristic data set to obtain a corresponding standard data set; predicting each standard data set using a preset account relationship prediction model to obtain a first weight coefficient corresponding to each standard data set; performing feature extraction on each standard data set to obtain a second weight coefficient corresponding to each standard data set; and obtaining the malicious group accounts among all accounts to be identified according to the first weight coefficients and second weight coefficients corresponding to all standard data sets. The invention considers the association weights between accounts from multiple dimensions, which improves the accuracy of the classification results when a community discovery algorithm is applied to practical problems.

Description

Malicious group account identification method, device, terminal and storage medium

Technical field

The present invention relates to the technical field of big data mining, and in particular to a malicious group account identification method, device, terminal and storage medium.

Background art

With the development of Internet technology, China's online travel market has entered a period of rapid growth. To promote themselves in the market, travel platforms often adopt aggressive subsidy strategies to attract more drivers to use the platform. At the same time, various illegal cheating methods have emerged for the purpose of earning high subsidies. Moreover, current trends show that illegal cheating is gradually evolving from individual offences into group offences. On the other hand, with the development of technologies such as machine learning and artificial intelligence, methods that use data mining to identify groups intelligently are also maturing. However, in practical scenarios current group identification methods consider only a single factor of the relationship between group accounts, so the identification accuracy is not high.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a malicious group account identification method, device, terminal and storage medium, so as to improve on the above problem.

To achieve the above purpose, the technical solutions adopted by the embodiments of the present invention are as follows:

In a first aspect, an embodiment of the present invention provides a malicious group account identification method. The method includes: obtaining the characteristic data sets of all accounts to be identified; standardizing each characteristic data set to obtain a corresponding standard data set; predicting each standard data set using a preset account relationship prediction model to obtain a first weight coefficient corresponding to each standard data set; performing feature extraction on each standard data set to obtain a second weight coefficient corresponding to each standard data set; and obtaining the malicious group accounts among all accounts to be identified according to the first weight coefficients and second weight coefficients corresponding to all standard data sets.

In a second aspect, an embodiment of the present invention further provides a malicious group account identification device. The device includes a characteristic data set acquisition module, a data standardization module, a data prediction module, a feature extraction module and a malicious account division module. The characteristic data set acquisition module is used to obtain the characteristic data sets of all accounts to be identified. The data standardization module is used to standardize each characteristic data set to obtain a corresponding standard data set. The data prediction module is used to predict each standard data set using a preset account relationship prediction model to obtain a first weight coefficient corresponding to each standard data set. The feature extraction module is used to perform feature extraction on each standard data set to obtain a second weight coefficient corresponding to each standard data set. The malicious account division module is used to obtain the malicious group accounts among all accounts to be identified according to the first weight coefficients and second weight coefficients corresponding to all standard data sets.

In a third aspect, an embodiment of the present invention further provides a terminal. The terminal includes: one or more processors; and a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the above malicious group account identification method.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the above malicious group account identification method.

Compared with the prior art, the malicious group account identification method, device, terminal and storage medium provided by the embodiments of the present invention work as follows. First, each characteristic data set is standardized to obtain the standard data set corresponding to each account to be identified. Then, prediction and feature extraction are performed separately on each standard data set to obtain its first weight coefficient and second weight coefficient. Finally, the first weight coefficient and second weight coefficient are combined to identify malicious group accounts. Compared with the prior art, the embodiments of the present invention consider, from multiple dimensions, the various factors that influence the relationships between accounts to be identified, assign different weights according to the degree to which each factor influences the account relationship, and finally integrate all weights into one comprehensive weight. Malicious group accounts are identified using this comprehensive weight, which improves the accuracy of malicious group account identification.

To make the above objects, features and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

Description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and therefore should not be regarded as limiting its scope. Those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

Fig. 1 shows a block diagram of the terminal provided by an embodiment of the present invention.

Fig. 2 shows a flow chart of the malicious group account identification method provided by an embodiment of the present invention.

Fig. 3 is a flow chart of the sub-steps of step S102 shown in Fig. 2.

Fig. 4 is a flow chart of the sub-steps of step S104 shown in Fig. 2.

Fig. 5 shows a schematic diagram of the relational network graph provided by an embodiment of the present invention.

Fig. 6 is a flow chart of the sub-steps of step S105 shown in Fig. 2.

Fig. 7 shows a block diagram of the malicious group account identification device provided by an embodiment of the present invention.

Reference numerals: 100 - terminal; 101 - memory; 102 - storage controller; 103 - processor; 200 - malicious group account identification device; 201 - characteristic data set acquisition module; 202 - data standardization module; 203 - data prediction module; 204 - feature extraction module; 205 - malicious account division module.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments of the present invention that are generally described and illustrated in the drawings herein can be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

It should be noted that similar reference numerals and letters denote similar items in the following drawings, so once an item has been defined in one drawing it does not need to be further defined or explained in subsequent drawings. In addition, in the description of the present invention, the terms "first", "second" and the like are used only to distinguish descriptions and should not be understood as indicating or implying relative importance.

Referring to Fig. 1, Fig. 1 shows a block diagram of the terminal 100 provided by an embodiment of the present invention. The terminal 100 may be, but is not limited to, a smart phone, a tablet computer, a personal computer (PC), a server and so on. The operating system of the terminal 100 may be, but is not limited to, the Android system, the IOS (iPhone operating system) system, the Windows phone system, the Windows system and so on. The terminal 100 includes a malicious group account identification device 200, a memory 101, a storage controller 102 and a processor 103.

The memory 101, the storage controller 102 and the processor 103 are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these elements may be electrically connected to one another through one or more communication buses or signal lines. The malicious group account identification device 200 includes at least one software function module that can be stored in the memory 101 in the form of software or firmware, or built into the operating system (OS) of the terminal 100. The processor 103 is used to execute the executable modules stored in the memory 101, such as the software function modules and computer programs included in the malicious group account identification device 200.

The memory 101 may be, but is not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read Only Memory, ROM), a programmable read-only memory (Programmable Read-Only Memory, PROM), an erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), an electrically erasable programmable read-only memory (Electric Erasable Programmable Read-Only Memory, EEPROM) and so on. The memory 101 is used to store programs, and the processor 103 executes a program after receiving an execution instruction.

The processor 103 may be an integrated circuit chip with signal processing capability. The processor 103 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), a speech processor, a video processor and so on; it may also be a digital signal processor, an application-specific integrated circuit, a field programmable gate array or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor 103 may be any conventional processor.

First embodiment

Referring to Fig. 2, Fig. 2 shows a flow chart of the malicious group account identification method provided by an embodiment of the present invention. The method includes the following steps:

Step S101: obtain the characteristic data sets of all accounts to be identified.

In the embodiments of the present invention, an account to be identified may be a user who has registered on a trading platform for the purpose of trading and who is suspected of carrying out malicious transactions on that platform. For example, a user registered on an online ride-hailing platform may be a driver or a passenger. If such users are suspected of obtaining illegitimate benefits on the platform by improper means, such as exploiting platform loopholes, deliberately cheating to obtain high subsidies, or intruding into the platform by illegal means to falsify transaction data, then these users suspected of malicious transactions are accounts to be identified. A characteristic attribute is an attribute that characterizes the transaction information between an account to be identified and other accounts. Characteristic attributes may include, but are not limited to, the transaction risk level, transaction location, transaction count and transaction amount between each account to be identified and other accounts. A characteristic data set is the set of values of all characteristic attributes of one account to be identified. It is extracted from the raw data of the account to be identified, which may come from, but is not limited to, data tables such as the daily full transaction table, daily full transfer table, refund record table, transaction device information table, basic information table, daily full order record table, resource acquisition table, churn prediction table, call record table and consultation and complaint table of the account to be identified. The raw data is obtained with a period of 30 days, that is, once every 30 days. A characteristic data set may be expressed as {x_{1_1}=3, x_{1_2}=1, x_{1_3}=10, ..., x_{1_125}=37}, where the first number in the subscript is the serial number of the characteristic data set and the second number is the serial number of the characteristic attribute within that characteristic data set. In the embodiments of the present invention, the 1st characteristic attribute represents the transaction risk level, the 2nd the transaction location, the 3rd the transaction count, and the 125th the transaction amount. For example, x_{1_1}=3 means that the 1st characteristic attribute of the 1st characteristic data set, the transaction risk level, is level 3; x_{1_2}=1 means that the 2nd characteristic attribute of the 1st characteristic data set, the transaction location, is 1, where 1 denotes a particular city; x_{1_3}=10 means that the 3rd characteristic attribute of the 1st characteristic data set, the transaction count, is 10; and so on, up to x_{1_125}=37, which means that the 125th characteristic attribute of the 1st characteristic data set, the transaction amount, is 370,000 yuan (the value 37 being expressed in units of 10,000 yuan). The characteristic data sets of all accounts to be identified are the collection of the characteristic data sets of each account to be identified, for example:

Step S102: standardize each characteristic data set to obtain the corresponding standard data set.

In the embodiments of the present invention, first, the characteristic attribute set corresponding to the characteristic data sets obtained in step S101 is determined. The characteristic attribute set is a set of multiple characteristic attributes, and each characteristic attribute corresponds to a column in the characteristic data sets. For example, suppose there are 3 characteristic data sets: the characteristic attributes of the 1st include transaction risk level and transaction location, those of the 2nd include transaction count and transaction amount, and those of the 3rd include transaction risk level and transaction count. The characteristic attribute set of these 3 characteristic data sets then includes transaction risk level, transaction location, transaction count and transaction amount.

Secondly, the missing samples and effective samples of each characteristic attribute across the characteristic data sets are counted to obtain a missing sample value and an effective sample value. A missing sample of a characteristic attribute is a characteristic data set that either does not contain the attribute or contains it but lacks its value. An effective sample of a characteristic attribute is a characteristic data set that contains the attribute with a valid value. For example, if the 1st characteristic data set has no transaction count attribute, the 1st characteristic data set is a missing sample of the transaction count attribute; if the 2nd characteristic data set contains the transaction location attribute and its value is valid, the 2nd characteristic data set is an effective sample of the transaction location attribute. The missing sample value of a characteristic attribute may be the total number of its missing samples over all characteristic data sets divided by the total number of characteristic data sets. For example, if there are 10 characteristic data sets and 8 of them have no transaction count attribute, the missing sample value of the transaction count attribute is 0.8. Likewise, the effective sample value of a characteristic attribute may be the total number of its effective samples over all characteristic data sets divided by the total number of characteristic data sets.

Then, according to the missing sample value and the effective sample value, the characteristic data that actually contain effective information are screened out. Screening may be done by deleting invalid characteristic attribute columns from the characteristic data sets of all accounts to be identified. An invalid characteristic attribute column may be the column of a characteristic attribute whose missing sample value reaches a first threshold, or the column of a characteristic attribute whose effective sample value does not reach a second threshold. The first threshold may be, but is not limited to, an empirical value. For example, if the first threshold is 0.7, the column of any characteristic attribute whose missing sample value is greater than or equal to 0.7 is an invalid characteristic attribute column and needs to be deleted from the characteristic data sets. The second threshold may also be, but is not limited to, an empirical value. For example, if the second threshold is 0.8, the column of any characteristic attribute whose effective sample value is less than 0.8 is an invalid characteristic attribute column and needs to be deleted from the characteristic data sets.

Next, the sample mean and sample standard deviation of each characteristic attribute in the screened characteristic data sets are calculated. The sample mean of a characteristic attribute may be obtained by summing the values of that attribute over the screened characteristic data sets and dividing the sum by the total number of screened characteristic data sets. The sample standard deviation is calculated according to formula (1). Formula (1) is as follows:

Here, u_{ij} is a value of characteristic attribute u_i, \bar{u}_i is the sample mean of characteristic attribute u_i, and σ_{u_i} is the sample standard deviation of characteristic attribute u_i. Finally, according to the sample mean and sample standard deviation of each characteristic attribute, each characteristic data set is standardized using standardization formula (2) to obtain the corresponding standard data set. Formula (2) is as follows:

Here, u_{ij} is a value of characteristic attribute u_i, u'_{ij} is the value of characteristic attribute u_i after standardization, \bar{u}_i is the sample mean of characteristic attribute u_i, and σ_{u_i} is the sample standard deviation of characteristic attribute u_i.
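Assuming that formula (1) is the usual sample standard deviation and formula (2) the usual z-score, u'_{ij} = (u_{ij} - \bar{u}_i) / σ_{u_i}, the standardization of one characteristic attribute column can be sketched as follows. The n - 1 denominator (Python's statistics.stdev), the function name standardize_column and the sample values are assumptions made for illustration and are not taken from the source.

```python
from statistics import mean, stdev


def standardize_column(values):
    """Z-score standardization of one characteristic attribute column.

    Assumption: the sample standard deviation uses the n - 1 denominator
    (statistics.stdev); the source does not state which denominator it uses.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical transaction counts of five accounts to be identified.
transaction_counts = [10, 12, 8, 15, 5]
standardized = standardize_column(transaction_counts)
```

After standardization the column has zero mean and unit sample standard deviation, so attributes measured in different units become comparable.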

It should be noted that in this step it is also possible to count the missing attribute value and effective attribute value of the characteristic attribute set of each characteristic data set, and to screen the characteristic data sets of all accounts to be identified according to these two values. The missing attribute value may be the number of attributes that are absent from the characteristic attribute set of a characteristic data set, or present but without a value, divided by the total number of attributes in the characteristic attribute set. The effective attribute value may be the number of attributes in the characteristic attribute set of a characteristic data set whose values are valid, divided by the total number of attributes in the characteristic attribute set. For example, if the characteristic attribute set of the 1st characteristic data set includes 100 attributes, of which 90 have missing values and 10 have valid values, the missing attribute value of the 1st characteristic data set is 0.9 and its effective attribute value is 0.1. When the missing attribute value reaches a third threshold, the characteristic data set is considered to contain insufficient information and is deleted from the characteristic data sets of all accounts to be identified. The third threshold may be, but is not limited to, an empirical value such as 0.7: when the missing attribute value is greater than or equal to 0.7, the corresponding characteristic data set is deleted from the characteristic data sets of all accounts to be identified. When the effective attribute value does not reach a fourth threshold, the effective information contained in the characteristic data set is considered to be of low value, and the characteristic data set is deleted from the characteristic data sets of all accounts to be identified. The fourth threshold may be, but is not limited to, an empirical value such as 0.8: when the effective attribute value is less than 0.8, the corresponding characteristic data set is deleted from the characteristic data sets of all accounts to be identified.
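The per-data-set screening described above can be sketched as follows. This is a minimal illustration, assuming each characteristic data set is a Python dict and that a value is "valid" whenever it is present; the function names, the dict representation and the sample records are hypothetical.

```python
def attribute_ratios(record, attribute_names):
    """Return (missing attribute value, effective attribute value) for one data set.

    Assumption: an attribute counts as missing when it is absent or None,
    and as effective whenever a value is present.
    """
    total = len(attribute_names)
    missing = sum(1 for name in attribute_names if record.get(name) is None)
    effective = total - missing
    return missing / total, effective / total


def screen_records(records, attribute_names, third_threshold=0.7, fourth_threshold=0.8):
    """Keep only the data sets that pass both threshold checks."""
    kept = []
    for record in records:
        missing_value, effective_value = attribute_ratios(record, attribute_names)
        if missing_value >= third_threshold:
            continue  # too little information
        if effective_value < fourth_threshold:
            continue  # effective information of low value
        kept.append(record)
    return kept


# Ten hypothetical attributes a0 .. a9.
names = [f"a{i}" for i in range(10)]
full = {name: 1 for name in names}     # no attribute missing
nearly_full = {name: 1 for name in names[:9]}  # one attribute missing
sparse = {"a0": 1}                     # nine attributes missing
kept = screen_records([full, nearly_full, sparse], names)
```

With the example thresholds 0.7 and 0.8, the effective-attribute check is the stricter of the two, so any data set with more than 20% of its attributes missing is removed.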

Referring to Fig. 3, step S102 may further include the following sub-steps:

Step S1021: obtain the characteristic attribute set of all accounts to be identified, where the characteristic attribute set includes multiple characteristic attributes.

In the embodiments of the present invention, a characteristic attribute may be, but is not limited to, the transaction risk level, transaction location, transaction count, transaction amount and so on.

Step S1022: count the missing samples and effective samples of each characteristic attribute across the characteristic data sets to obtain the missing sample value and the effective sample value.

In the embodiments of the present invention, for each characteristic attribute, the sum of its missing samples and effective samples across the characteristic data sets is less than or equal to the total number of characteristic data sets of all accounts to be identified, and the sum of its missing sample value and effective sample value is less than or equal to 1. For example, if the total number of characteristic data sets of all accounts to be identified is 10, of which 5 lack the transaction count attribute and 4 have the transaction count attribute with a valid value, then the missing sample value of transaction count is 0.5 and the effective sample value is 0.4.
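The counting in step S1022 can be sketched as follows, reproducing the example of 10 characteristic data sets with a missing sample value of 0.5 and an effective sample value of 0.4. The dict representation, the function names and the validity rule (a non-negative count) are assumptions made for illustration; the source does not define what makes a value valid.

```python
def sample_values(records, attribute, is_valid):
    """Return (missing sample value, effective sample value) for one attribute."""
    total = len(records)
    missing = 0
    effective = 0
    for record in records:
        value = record.get(attribute)
        if value is None:
            missing += 1      # attribute absent, or present without a value
        elif is_valid(value):
            effective += 1    # attribute present with a valid value
    return missing / total, effective / total


def is_invalid_column(missing_value, effective_value,
                      first_threshold=0.7, second_threshold=0.8):
    """A column is invalid if either threshold condition from step S102 holds."""
    return missing_value >= first_threshold or effective_value < second_threshold


records = (
    [{"transaction_risk_level": 3} for _ in range(5)]  # 5 data sets lack the attribute
    + [{"transaction_count": 10} for _ in range(4)]    # 4 data sets hold a valid value
    + [{"transaction_count": -1}]                      # present but invalid (hypothetical rule)
)
missing_value, effective_value = sample_values(
    records, "transaction_count", lambda v: v >= 0
)
```

The tenth record shows why the two values can sum to less than 1: a value that is present but not valid counts toward neither.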

Step S1023: according to the missing sample value and the effective sample value, obtain the sample mean and sample standard deviation of each characteristic attribute.

In the embodiments of the present invention, the characteristic data that actually contain effective information are screened out according to the missing sample value and effective sample value, and the sample mean and sample standard deviation of each characteristic attribute in the screened characteristic data sets are calculated. Before screening, the missing values of characteristic attributes in each characteristic data set may also be filled in according to the actual situation. The filling method may be, but is not limited to, average-value filling. Average-value filling works as follows: first, the values of the corresponding characteristic attribute in the other characteristic data sets of all accounts to be identified are averaged; then this average is used as the missing attribute value. For example, suppose there are 5 characteristic data sets whose values for the transaction amount attribute are:

Here, the value of the 125th characteristic attribute in the 3rd characteristic data set is missing. The average of the 125th characteristic attribute over these characteristic data sets is calculated to be 33, so the value of the 125th characteristic attribute in the 3rd characteristic data set is filled with the average 33. The attribute values after filling are:
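As a minimal sketch of average-value filling, the snippet below uses five hypothetical transaction-amount values. They are not the values of the example above; they were chosen only so that the mean of the four present values is 33. The function name fill_with_mean and the use of None for a missing value are likewise assumptions.

```python
def fill_with_mean(values):
    """Replace each missing value (None) with the mean of the present values."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to average over; leave the column unchanged
    average = sum(present) / len(present)
    return [average if v is None else v for v in values]


# Hypothetical values of the 125th characteristic attribute (transaction amount)
# in 5 characteristic data sets; the 3rd one is missing.
amounts = [37, 30, None, 35, 30]
filled = fill_with_mean(amounts)  # the mean of 37, 30, 35 and 30 is 33.0
```

Only the missing entry changes; the present values are passed through untouched.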

Step S1024: according to the sample mean and sample standard deviation, calculate the standard data set corresponding to each characteristic data set using a standardization algorithm.

In the embodiments of the present invention, the data in each characteristic data set have different dimensions. For example, the total number of transactions between A and B is 10, the total transaction amount is 1000 yuan, and the transaction risk level of A and B is 3. These 3 characteristic attributes are of different orders of magnitude, so the characteristic data set cannot be used directly as the input data of an algorithm. It must first be standardized using the standardization formula to obtain the standard data set corresponding to each characteristic data set.

Step S103: predict each standard data set using a preset account relationship prediction model to obtain the first weight coefficient corresponding to each standard data set.

In the embodiments of the present invention, the account relationship prediction model is a prediction model in the form of a multiple linear regression equation, obtained by performing multiple linear regression analysis on historical accounts, where the historical accounts are accounts already confirmed as malicious group accounts. By substituting each standard data set into the multiple linear regression equation, the malicious relationship depth prediction value of each standard data set is obtained. This prediction value is then normalized to obtain the first weight coefficient corresponding to each standard data set.

As one implementation, the method of obtaining the first weight coefficient may include:

First, define the multiple linear regression equation h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n, where n is the number of characteristic attributes and x_j is the value of the j-th characteristic attribute in each characteristic data set. For example, x_j may be the transaction count or the transaction amount of each pair of transacting accounts. To simplify program processing, the multiple linear regression equation is simplified to h_θ(x) = θ^T x = a + b x_j, where θ and x denote (n+1, 1)-dimensional vectors. For example, the transaction amounts of 2000 pairs of transactions in the historical data can be denoted as (2000, 1), that is, a column vector holding the transaction amounts of the 2000 pairs of transactions.

Secondly, define the loss function as in formula (3):

Here, n is the number of characteristic attributes (columns), m is the number of historical accounts (rows), y_i is the known true result value in the historical account database, and h_θ(x_i) is the estimated value.

Third, derive the calculation formula of the regression coefficients using the least squares method, as in formula (4):

Fourth, convert the value of each characteristic attribute in each characteristic data set of each historical account into dummy variables to obtain virtual attribute variables. The conversion method is as follows: obtain the whole value range of each characteristic attribute in the characteristic data sets of the historical accounts, then convert every value in the range into a virtual attribute variable that uses 1 or 0 to indicate whether that value is hit. For example, the value range {1, 2, 3} of x_2 denotes 3 cities, Beijing, Shanghai and Guangzhou respectively, and needs to be converted into 3 columns, "is Beijing", "is Shanghai" and "is Guangzhou", followed by a one-to-one conversion. Thus x_2=1 is converted to: x_{2_Beijing}=1, meaning the virtual attribute variable "is Beijing" takes the value "yes"; x_{2_Shanghai}=0, meaning the virtual attribute variable "is Shanghai" takes the value "no"; and x_{2_Guangzhou}=0, meaning the virtual attribute variable "is Guangzhou" takes the value "no".
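The dummy-variable conversion of the city example can be sketched as follows. The function name to_dummy_variables and the key format x2_Beijing are hypothetical; only the value range {1, 2, 3} and the three city names come from the text.

```python
def to_dummy_variables(value, domain, prefix):
    """Convert one categorical value into 0/1 virtual attribute variables.

    domain maps each code in the value range to a label; exactly one of the
    resulting variables is 1 when the value belongs to the value range.
    """
    return {
        f"{prefix}_{label}": 1 if value == code else 0
        for code, label in domain.items()
    }


# Value range of x_2 from the example: 1 = Beijing, 2 = Shanghai, 3 = Guangzhou.
city_domain = {1: "Beijing", 2: "Shanghai", 3: "Guangzhou"}
encoded = to_dummy_variables(1, city_domain, "x2")
```

Encoding the cities this way avoids treating the codes 1, 2 and 3 as magnitudes in the regression.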

Fifth, substitute the virtual attribute values into the regression equation h_θ(x) = θ^T x = a + b x_j defined in the first step, which has unknown regression coefficients a and b, and solve for the values of a and b to obtain the complete regression equation.
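For the simplified single-attribute form h_θ(x) = a + b x_j, the least-squares estimates have a well-known closed form: b = S_xy / S_xx and a = mean(y) - b * mean(x). The sketch below uses that closed form. Whether formula (4) in the source is this form or the general matrix form is not stated, so this is an assumption, as are the function names and the sample data.

```python
def fit_simple_least_squares(xs, ys):
    """Least-squares estimates (a, b) for the model y = a + b * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_mean - b * x_mean
    return a, b


def predict(a, b, x):
    """Evaluate the fitted regression equation h(x) = a + b * x."""
    return a + b * x


# Hypothetical historical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_simple_least_squares(xs, ys)
```

Once a and b are known, predict(a, b, x) plays the role of the complete regression equation used in the next step.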

Sixth, substitute each standard data set into the complete regression equation obtained in the fifth step to obtain the corresponding malicious relationship depth prediction value. For example, the standard data set {x_{200_1}=0, x_{200_2}=1, x_{200_3}=40, ..., x_{200_125}=55} is substituted into the regression equation h_θ(x) = θ^T x = a + b x_j, where the values of a and b were obtained in the fifth step, and the corresponding malicious relationship depth is obtained as {y_200=98.233}.

Seventh, normalize the malicious relationship depth prediction value obtained in the previous step to obtain the corresponding first weight coefficient. The normalization formula is formula (5):

w_i' = (w_i - w_min) / (w_max - w_min)    (5)

Here, w_i is each characteristic value, w_min is the minimum of the characteristic values, w_max is the maximum of the characteristic values, and w_i' is the normalized value. For example, the malicious relationship depth of the above standard data set {x_{200_1}=0, x_{200_2}=1, x_{200_3}=40, ..., x_{200_125}=55} is {y_200=98.233}; after linear normalization, the corresponding first weight coefficient is obtained as {w_{1_200}'=0.878804}.
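Formula (5) is ordinary min-max normalization and can be sketched as follows. The list of depth values is hypothetical apart from 98.233; because the minimum and maximum used in the source example are not given, the sketch does not attempt to reproduce the value 0.878804.

```python
def min_max_normalize(values):
    """Apply w' = (w - w_min) / (w_max - w_min) to every value."""
    w_min, w_max = min(values), max(values)
    if w_max == w_min:
        # All values are equal; the formula is undefined, so return zeros.
        return [0.0 for _ in values]
    return [(w - w_min) / (w_max - w_min) for w in values]


# Hypothetical malicious relationship depth prediction values.
depths = [12.5, 98.233, 55.0, 110.0]
first_weights = min_max_normalize(depths)
```

The smallest depth maps to 0, the largest to 1, and every other value falls strictly between them.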

Step S104, perform feature extraction on each standard data set to obtain the second weight coefficient corresponding to each standard data set.

In embodiments of the present invention, the second weight coefficient corresponding to each standard data set may be calculated as follows. First, a total standard data set is obtained from the standard data sets of the accounts to be identified; the total standard data set includes the standard data set of each account to be identified. Principal component analysis is performed on the total standard data set to obtain multiple principal component data in it. For example, if the characteristic attribute set includes the 4 attributes transaction danger level, transaction location, transaction count and transaction amount, principal component analysis may yield transaction danger level, transaction location and transaction count as its principal components. Then, taking the variance contribution rate of each principal component of each standard data set as the weight, the multiple principal components corresponding to each standard data set are weighted to obtain one comprehensive closeness value per standard data set, and this comprehensive closeness value is normalized to obtain the second weight coefficient corresponding to each standard data set. The variance contribution rate reflects how strongly a principal component influences each standard data set.

Referring to Fig. 4, step S104 may further include the following sub-steps:

Sub-step S1041, perform principal component analysis on the total standard data set to obtain multiple principal component data in the total standard data set, where the total standard data set includes the standard data set of each account to be identified.

In embodiments of the present invention, principal component analysis is performed on the total standard data set so as to reduce the dimensionality of the characteristic attribute set and obtain the most relevant characteristic attributes; the characteristic equation is then solved for these most relevant characteristic attributes to finally obtain the principal components.

As one embodiment, the principal component analysis method may include:

First, obtain the standardized matrix of the standard data sets. Each row of the standardized matrix represents one transaction pair, and each column represents one characteristic attribute of a transaction.

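Second, solve the correlation coefficient matrix of the standardized matrix from the previous step. The correlation coefficient matrix is obtained using formula (6), the product-moment correlation:

$$R = (r_{ij})_{p \times p} \tag{6}$$

where

$$r_{ij} = \frac{\sum_{k=1}^{n}(x_{ki}-\bar{x}_i)(x_{kj}-\bar{x}_j)}{\sqrt{\sum_{k=1}^{n}(x_{ki}-\bar{x}_i)^2 \, \sum_{k=1}^{n}(x_{kj}-\bar{x}_j)^2}}$$

with n the number of rows of the standardized matrix, p the number of characteristic attributes, x_{ki} the value of the i-th characteristic attribute in row k, and x̄_i the mean of the i-th characteristic attribute.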

The correlation coefficient is a statistical index that reflects how closely variables are related. It is calculated by the product-moment method: based on the deviations of two variables from their respective means, the two deviations are multiplied to reflect the degree of correlation between the two variables. The correlation coefficient lies between -1 and +1, i.e. -1≤r≤1. Its properties are as follows:

When r>0, the two variables are positively correlated; when r<0, the two variables are negatively correlated.

When |r|=1, the two variables are perfectly linearly correlated, i.e. related by a function.

When r=0, there is no linear relationship between the two variables.

When 0<|r|<1, there is some degree of linear correlation between the two variables. The closer |r| is to 1, the closer the linear relationship between the two variables; the closer |r| is to 0, the weaker the linear correlation between the two variables. Three levels are generally distinguished: |r|≤0.4 is low linear correlation; 0.4<|r|<0.7 is significant correlation; 0.7≤|r|<1 is high linear correlation. For example, in this embodiment of the present invention, x₁ represents the danger level of transactions between accounts and x₃ represents the number of transactions between accounts; r₁₃=r₃₁=-0.0423465, which indicates that the danger level and the transaction count between accounts are only weakly correlated.

Third, from the correlation coefficient matrix obtained in the second step, solve the characteristic equation |R-λI_p|=0 to obtain p characteristic roots λ.

Fourth, to determine the number m of the most relevant characteristic roots among the p characteristic roots from the third step, formula (7) is used, with the characteristic roots sorted in descending order:

$$\frac{\sum_{j=1}^{m}\lambda_j}{\sum_{j=1}^{p}\lambda_j} \ge 0.95 \tag{7}$$

This determines m; that is, the m most relevant characteristic roots are chosen from the p characteristic roots of the third step so that the information utilization rate reaches 95%. For each characteristic root λ_j, j=1,2,...,m, the system of equations Rb=λ_jb is solved to obtain the corresponding unit eigenvector b_j.

Fifth, convert the eigenvectors obtained in the fourth step into principal components using formula (8):

$$U_j = Z\,b_j, \quad j = 1, 2, \dots, m \tag{8}$$

where Z is the standardized matrix from the first step and b_j is the unit eigenvector of the characteristic root λ_j; U₁ is called the first principal component, U₂ the second principal component, and U_m the m-th principal component. In this embodiment of the present invention, the first principal component is U₁={12.044529, 11.927115, ..., -5.901484}, the second principal component is U₂={-0.427374, -0.079803, ..., -0.793191}, and so on, for a total of 67 principal components.
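The second and fourth steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `pearson`, `correlation_matrix` and `choose_m`, the toy matrix `rows` and the eigenvalue list are all invented here, and the eigenvalues are supplied by hand rather than solved from the characteristic equation, because the text does not say which solver the embodiment uses.

```python
import math


def pearson(xs, ys):
    """Product-moment correlation coefficient between two equally long columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(rows):
    """Correlation coefficient matrix R; rows are transactions, columns are attributes."""
    columns = list(zip(*rows))
    p = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(p)] for i in range(p)]


def choose_m(eigenvalues, threshold=0.95):
    """Smallest m whose largest m characteristic roots reach the cumulative threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for m, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= threshold:
            return m
    return len(ordered)


# Toy standardized matrix (hypothetical): 4 transactions, 3 characteristic attributes.
rows = [[1, 2, 10], [2, 4, 8], [3, 6, 7], [4, 8, 3]]
R = correlation_matrix(rows)

# Hypothetical characteristic roots of a 5-attribute correlation matrix (they sum to 5).
m = choose_m([3.0, 1.5, 0.3, 0.15, 0.05])
```

In the toy matrix the second column is exactly twice the first, so their coefficient is 1, while the third column falls as the first rises and gives a negative coefficient. For the hypothetical characteristic roots, the cumulative ratios are 0.60, 0.90 and 0.96, so three roots are kept.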

Sub-step S1042, obtain the second weight coefficient corresponding to each standard data set according to the contribution rate of each principal component data to each standard data set.

In embodiments of the present invention, the second weight coefficient may be solved as follows: taking the variance contribution rate of each principal component data in each standard data set as the weight, a weighted sum is computed over the m principal components of each standard data set to finally obtain one comprehensive weight coefficient, which is then normalized to obtain the corresponding second weight coefficient. For example, the second weight coefficient of the 1st standard data set is Result_Weight={w_{2_1}=0.143221642}, and the second weight coefficient of the 200th standard data set is Result_Weight={w_{2_200}=-0.136319985}.
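The weighting described above might look like the following sketch. Several points are assumptions rather than statements from the text: the variance contribution rate is taken as each retained characteristic root divided by the sum of all characteristic roots, the final normalization reuses the min-max form of formula (5), and the names `contribution_rates`, `comprehensive_scores`, `normalize` and all numeric values are hypothetical.

```python
def contribution_rates(all_eigenvalues, m):
    """Variance contribution rate of each of the m largest characteristic roots."""
    ordered = sorted(all_eigenvalues, reverse=True)
    total = sum(ordered)
    return [lam / total for lam in ordered[:m]]


def comprehensive_scores(component_scores, rates):
    """Weighted sum of the m principal component values of every standard data set."""
    return [sum(r * s for r, s in zip(rates, row)) for row in component_scores]


def normalize(values):
    """Min-max normalization (assumed here to be the same form as formula (5))."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


# Hypothetical characteristic roots; the 3 largest are retained.
rates = contribution_rates([3.0, 1.5, 0.3, 0.15, 0.05], m=3)

# Hypothetical principal component values: one row per standard data set.
scores = [[1.0, 2.0, 0.0], [-1.0, 0.5, 1.0], [0.0, 0.0, 0.0]]
combined = comprehensive_scores(scores, rates)
second_weights = normalize(combined)
```

Because the weights are contribution rates, the first principal component dominates the comprehensive value, which matches the statement that the contribution rate reflects the influence of a principal component.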

Step S105, obtain the malice group accounts among all accounts to be identified according to the first weight coefficients and second weight coefficients corresponding to all standard data sets.

In embodiments of the present invention, after the first weight coefficient and the second weight coefficient corresponding to each standard data set are obtained, first, a weighted average of the first weight coefficient and the second weight coefficient of each standard data set is taken to obtain the third weight coefficient corresponding to each standard data set. Then, the relational network graph between all accounts to be identified is restored according to the third weight coefficients of all standard data sets. For example, suppose there are 4 standard data sets: the 1st represents the transaction initiated by account 5 to account 8, with a malice weight coefficient of 0.325; the 2nd represents the transaction initiated by account 3 to account 5, with a malice weight coefficient of 0.56; the 3rd represents the transaction initiated by account 6 to account 3, with a malice weight coefficient of 0.84; and the 4th represents the transaction initiated by account 6 to account 5, with a malice weight coefficient of 0.66. The relational network graph restored from this information is shown in Fig. 5. Finally, the hierarchical clustering greedy algorithm is applied to the relational network to perform the clustering calculation and obtain the malice group accounts among all accounts to be identified.

Referring to Fig. 6, step S105 may further include the following sub-steps:

Sub-step S1051, take a weighted average of the first weight coefficient and the second weight coefficient corresponding to each standard data set to obtain the third weight coefficient corresponding to each standard data set.

In embodiments of the present invention, after the weighted average of the first weight coefficient and the second weight coefficient of each standard data set is taken, the resulting weighted average may be too small, which is unfavorable for the subsequent community classification algorithm; in that case the weighted average of each standard data set may be enlarged by a suitable multiple and used as the corresponding third weight coefficient. For example, the first weight coefficient of the 1st standard data set is Result_Weight={w_{1_1}'=0.104926} and the first weight coefficient of the 200th standard data set is Result_Weight={w_{1_200}'=0.878804}; the second weight coefficient of the 1st standard data set is Result_Weight={w_{2_1}'=0.595968} and the second weight coefficient of the 200th standard data set is Result_Weight={w_{2_200}'=0.294188}. After weighted averaging, the third weight coefficient of the 1st standard data set is Result_Weight={W₁=3.50447} and the third weight coefficient of the 200th standard data set is Result_Weight={W₂₀₀=5.86496}; to facilitate the subsequent algorithm, the weighted average is enlarged 10 times here. The concrete meaning is that, after the first weight and the second weight are combined, the malice relationship familiarity depth of the first transaction pair is 3.50447 and that of the 200th transaction pair is 5.86496.
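The figures in this example can be reproduced with a short calculation. The sketch below assumes that the two weight coefficients are weighted equally; the text only says "weighted average", but equal weights together with the stated 10-fold enlargement match the published numbers. The function name `third_weight` and its parameters `alpha` and `scale` are hypothetical.

```python
def third_weight(w1, w2, alpha=0.5, scale=10.0):
    """Weighted average of the first and second weight coefficients, then enlarged.

    alpha is the share given to the first weight coefficient (assumed 0.5 here),
    scale is the enlargement multiple (10 in the example of the text).
    """
    return scale * (alpha * w1 + (1.0 - alpha) * w2)


W_1 = third_weight(0.104926, 0.595968)    # 1st standard data set
W_200 = third_weight(0.878804, 0.294188)  # 200th standard data set
```

Under these assumptions W_1 and W_200 agree with the values 3.50447 and 5.86496 given above, up to floating-point rounding.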

Sub-step S1052, restore the relational network graph between all accounts to be identified according to the third weight coefficients corresponding to all standard data sets.

In embodiments of the present invention, the nodes of the relational network graph are the accounts corresponding to the standard data sets, and the edges of the relational network graph carry the third weight coefficients of the standard data sets corresponding to those accounts. The relational network graph between all accounts to be identified is restored from the node and edge information. For example, if Result_Weight={W₁=3.50447} represents the relation data of the transaction initiated by account 0 to account 9, then there is a directed edge from the node representing account 0 to the node representing account 9, and the weight of this edge is 3.50447.
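One simple way to hold such a graph in memory is an adjacency dictionary keyed by the initiating account. The sketch below is an assumption about representation, not the data structure of the embodiment; the function name `build_relation_graph` is hypothetical, and it assumes at most one standard data set per ordered pair of accounts (a later edge for the same pair would overwrite the earlier one). The edge list combines the account 0 to account 9 example with the four transactions described for Fig. 5.

```python
from collections import defaultdict


def build_relation_graph(edges):
    """Build a weighted directed graph: graph[src][dst] = third weight coefficient."""
    graph = defaultdict(dict)
    for src, dst, weight in edges:
        graph[src][dst] = weight
        graph.setdefault(dst, {})  # make sure receiving accounts are nodes too
    return dict(graph)


edges = [
    (0, 9, 3.50447),  # account 0 initiates a transaction to account 9
    (5, 8, 0.325),    # the four transactions of the Fig. 5 example
    (3, 5, 0.56),
    (6, 3, 0.84),
    (6, 5, 0.66),
]
graph = build_relation_graph(edges)
```

Accounts that only receive transactions, such as accounts 8 and 9 here, appear as nodes with no outgoing edges.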

Sub-step S1053, using the hierarchical clustering greedy algorithm, identify the malice group accounts among all accounts to be identified according to the relational network graph.

In embodiments of the present invention, one malice group corresponds to one cluster in the algorithm. The hierarchical clustering greedy algorithm keeps adding new nodes to an existing cluster as long as the modularity does not decrease, and finally obtains the cluster with the largest modularity and the largest number of nodes, thereby obtaining the malice group accounts among all accounts to be identified. A cluster in the algorithm is also called a community, a group and so on; those skilled in the art will recognize these as different names with the same meaning. In the following, "community" and "cluster" have the same meaning.

As one embodiment, the hierarchical clustering greedy algorithm may include:

First, calculate the initial modularity Q using formula (9):

$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{i,j} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j) \tag{9}$$

where $m = \frac{1}{2}\sum_{i,j} A_{i,j}$ is the total weight in the network, A_{i,j} is the weight between node i and node j, $k_i = \sum_{j} A_{i,j}$ is the weight of the edges connected to vertex i, c_i is the community to which vertex i is assigned, and δ(c_i,c_j) judges whether vertex i and vertex j are assigned to the same community, returning 1 if so and 0 otherwise;

For ease of calculation, formula (9) is simplified to formula (10), as follows:

$$Q = \sum_{c}\left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^{2}\right] \tag{10}$$

where Σin is the weight inside community c, and Σtot is the weight of the edges connected to the points inside community c, including both the edges inside the community and the edges leading outside the community.

Second, move any one node into the community of one of its adjacent nodes and recompute the modularity according to formula (10). If the new modularity is not less than the initial modularity calculated in the first step, the node is assigned to that community; otherwise the node cannot be assigned to that community.

Third, take the new modularity as the initial modularity and continue iterating the second step until the modularity reaches its maximum or all nodes have been assigned, finally obtaining one community.

Fourth, repeat the first to third steps on the remaining nodes to finally obtain all communities in the relational network graph, which correspond to all the malice groups.
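The steps above can be illustrated with a simplified local-moving sketch. It is not the implementation of the embodiment and differs from the text in three labelled ways: edges are treated as undirected by making the adjacency symmetric, all nodes are processed in repeated sweeps instead of growing one community at a time, and a move is accepted only when the modularity strictly increases (rather than "not less than") so that the loop is guaranteed to terminate. Σin is taken as the sum of A_{i,j} over ordered pairs inside a community. All names and the toy edge list are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Symmetric weighted adjacency (assumption: edges are treated as undirected)."""
    adj = defaultdict(lambda: defaultdict(float))
    for u, v, w in edges:
        adj[u][v] += w
        adj[v][u] += w
    return adj


def modularity(adj, community_of):
    """Formula (10): sum over communities of sigma_in/2m - (sigma_tot/2m)^2."""
    two_m = sum(w for u in adj for w in adj[u].values())  # every edge counted twice
    sigma_in = defaultdict(float)
    sigma_tot = defaultdict(float)
    for u in adj:
        for v, w in adj[u].items():
            sigma_tot[community_of[u]] += w
            if community_of[u] == community_of[v]:
                sigma_in[community_of[u]] += w
    return sum(sigma_in[c] / two_m - (sigma_tot[c] / two_m) ** 2 for c in sigma_tot)


def greedy_communities(edges):
    """Move nodes into neighbouring communities while the modularity increases."""
    adj = build_adjacency(edges)
    community_of = {node: node for node in adj}  # start with one community per node
    best_q = modularity(adj, community_of)
    improved = True
    while improved:
        improved = False
        for node in sorted(adj):
            original = community_of[node]
            best_target = original
            for neighbour in adj[node]:
                candidate = community_of[neighbour]
                if candidate == original:
                    continue
                community_of[node] = candidate  # tentative move
                q = modularity(adj, community_of)
                if q > best_q:  # strict increase, unlike the ">=" of the text
                    best_q, best_target = q, candidate
            community_of[node] = best_target
            if best_target != original:
                improved = True
    groups = defaultdict(set)
    for node, community in community_of.items():
        groups[community].add(node)
    return sorted(sorted(group) for group in groups.values()), best_q


# Hypothetical graph: two tightly connected triangles joined by one weak edge.
toy_edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
             (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
             (2, 3, 0.1)]
communities, q = greedy_communities(toy_edges)
```

On this toy graph the two triangles end up as separate communities, because moving a node across the weak edge would lower the modularity. Recomputing the full modularity for every tentative move keeps the sketch short but is slow on large graphs; an incremental update of Σin and Σtot would be the usual optimization.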

In embodiments of the present invention, the communities are divided by the hierarchical clustering greedy algorithm, and 7 malice groups are finally obtained. The first connected transaction account group has 12 accounts, namely {1,9,21,22,26,30,41,43,44,48,53,61}; the second connected transaction account group has 7 accounts, namely {8,12,19,35,46,52,62}; the third connected transaction account group has 9 accounts, namely {3,4,5,11,29,38,47,55,64}; the fourth connected transaction account group has 10 accounts, namely {0,10,13,15,16,17,20,25,49,54}; the fifth connected transaction account group has 11 accounts, namely {7,14,23,27,31,32,33,34,37,59,65}; the sixth connected transaction account group has 8 accounts, namely {18,28,40,42,45,50,57,63}; the seventh connected transaction account group has 9 accounts, namely {2,6,24,36,39,51,56,58,60}; in total there are 66 accounts. The concrete meaning of each group is as follows, taking the first connected transaction account group of 12 accounts {1,9,21,22,26,30,41,43,44,48,53,61} as an example: the degree of danger among these 12 accounts is similar and their mutual dealings are excessively close, so the probability of malicious group crime is relatively high.

In embodiments of the present invention, first, the characteristic data sets of all accounts to be identified are obtained by extracting them from the raw data. On the one hand, this ensures that the transaction attributes related to suspected malicious transactions are extracted as comprehensively as possible, so that the subsequently calculated first and second weight coefficients cover sufficiently many dimensions of malice evaluation criteria, which helps improve the accuracy of malice group account identification. On the other hand, information that is entirely unrelated to malicious transactions is removed, which reduces the scale of the data to be processed subsequently and improves the efficiency of data processing. Second, when feature extraction is performed on each standard data set, on the one hand the information of the accounts to be identified is fully used, with the information utilization rate reaching 95%, so that the resulting second weight coefficient reflects as much as possible the characteristic attribute information most related to malice in the accounts to be identified, further improving the accuracy of the final identification; on the other hand, the characteristic attributes with little relation to malice are eliminated, which reduces the scale of the data to be processed subsequently and improves the efficiency of data processing. Finally, when the hierarchical clustering greedy algorithm is used to calculate the malice group accounts among all accounts to be identified, the modularity formula is simplified, which shortens the calculation time and improves the efficiency of the algorithm.

Second embodiment

Referring to Fig. 7, Fig. 7 shows a block diagram of the malice group account identification device 200 provided by an embodiment of the present invention. The malice group account identification device 200 is applied to the terminal 100 and includes a characteristic data set acquisition module 201, a data normalization module 202, a data prediction module 203, a characteristic extracting module 204 and a malice account division module 205.

Characteristic data set acquisition module 201, for obtaining the characteristic data set of all accounts to be identified.

In the embodiment of the present invention, characteristic data set acquisition module 201 can be used for performing step S101.

Data normalization module 202, for standardizing each characteristic data set to obtain the corresponding standard data set.

In the embodiment of the present invention, data normalization module 202 can be used for performing step S102.

In the embodiment of the present invention, data normalization module 202 can also be used for performing sub-steps S1021-S1024 of step S102.

Data prediction module 203, for predicting each standard data set using the preset account relationship prediction model to obtain the first weight coefficient corresponding to each standard data set.

In the embodiment of the present invention, data prediction module 203 can be used for performing step S103.

Characteristic extracting module 204, for performing feature extraction on each standard data set to obtain the second weight coefficient corresponding to each standard data set.

In the embodiment of the present invention, characteristic extracting module 204 can be used for performing step S104.

In the embodiment of the present invention, characteristic extracting module 204 can also be used for performing sub-steps S1041-S1042 of step S104.

Malice account division module 205, for obtaining the malice group accounts among all accounts to be identified according to the first weight coefficients and second weight coefficients corresponding to all standard data sets.

In the embodiment of the present invention, malice account division module 205 can be used for performing step S105.

In the embodiment of the present invention, malice account division module 205 can also be used for performing sub-steps S1051-S1053 of step S105.

An embodiment of the present invention further discloses a computer-readable storage medium on which a computer program is stored; when the computer program is executed by the processor 103, it implements the malice group account recognition method disclosed in the foregoing embodiments of the present invention.

In conclusion, the present invention provides a malice group account recognition method, device, terminal and storage medium. The method includes: obtaining the characteristic data sets of all accounts to be identified; standardizing each characteristic data set to obtain the corresponding standard data set; predicting each standard data set using a preset account relationship prediction model to obtain the first weight coefficient corresponding to each standard data set; performing feature extraction on each standard data set to obtain the second weight coefficient corresponding to each standard data set; and obtaining the malice group accounts among all accounts to be identified according to the first weight coefficients and second weight coefficients corresponding to all standard data sets. Compared with the prior art, the present invention considers, from multiple dimensions, the many factors that influence the relationship between the accounts to be identified, assigns different weights according to how strongly these factors influence the account relationship, finally integrates all the weights into one comprehensive weight, and uses the comprehensive weight to identify malice group accounts, thereby improving the accuracy of malice group account identification.

In the several embodiments provided in this application, it should be understood that the disclosed device and method may also be implemented in other ways. The device embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the drawings show the possible architectures, functions and operations of devices, methods and computer program products according to multiple embodiments of the present invention. In this regard, each box in a flowchart or block diagram may represent a module, a program segment or a part of code, and that module, program segment or part of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two consecutive boxes may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and each combination of boxes in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

In addition, the functional modules in the embodiments of the present invention may be integrated together to form one independent part, each module may exist separately, or two or more modules may be integrated to form one independent part.

If the functions are implemented in the form of software function modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disc. It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention. It should be noted that similar reference numerals and letters denote similar items in the drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.

Claims

1. A malice group account recognition method, characterized in that the method includes:

obtaining the characteristic data sets of all accounts to be identified;

standardizing each characteristic data set to obtain the corresponding standard data set;

predicting each standard data set using a preset account relationship prediction model to obtain the first weight coefficient corresponding to each standard data set;

performing feature extraction on each standard data set to obtain the second weight coefficient corresponding to each standard data set;

obtaining the malice group accounts among all accounts to be identified according to the first weight coefficients and second weight coefficients corresponding to all standard data sets.

2. The method according to claim 1, characterized in that the step of standardizing each characteristic data set to obtain the corresponding standard data set includes:

obtaining the characteristic attribute set of all accounts to be identified, wherein the characteristic attribute set includes multiple characteristic attributes;

counting, for each characteristic attribute, the missing samples and valid samples in each characteristic data set to obtain a missing sample value and a valid sample value;

obtaining the sample mean and sample standard deviation of each characteristic attribute according to the missing sample value and the valid sample value;

calculating the standard data set corresponding to each characteristic data set using a standardization algorithm according to the sample mean and the sample standard deviation.

3. The method according to claim 1, characterized in that the preset account relationship prediction model is obtained by performing multiple regression analysis on the characteristic data sets of historical accounts, wherein the historical accounts are malice group accounts.

4. The method according to claim 1, characterized in that the step of performing feature extraction on each standard data set to obtain the second weight coefficient corresponding to each standard data set includes:

performing principal component analysis on a total standard data set to obtain multiple principal component data in the total standard data set, wherein the total standard data set includes the standard data set of each account to be identified;

obtaining the second weight coefficient corresponding to each standard data set according to the contribution rate of each principal component data to each standard data set.

5. The method according to claim 1, characterized in that the step of obtaining the malice group accounts among all accounts to be identified according to the first weight coefficients and second weight coefficients corresponding to all standard data sets includes:

taking a weighted average of the first weight coefficient and the second weight coefficient corresponding to each standard data set to obtain the third weight coefficient corresponding to each standard data set;

restoring the relational network graph between all accounts to be identified according to the third weight coefficients corresponding to all standard data sets;

using a hierarchical clustering greedy algorithm, identifying the malice group accounts among all accounts to be identified according to the relational network graph.

6. A malice group account identification device, characterized in that the device includes:

a characteristic data set acquisition module, for obtaining the characteristic data sets of all accounts to be identified;

a data normalization module, for standardizing each characteristic data set to obtain the corresponding standard data set;

a data prediction module, for predicting each standard data set using a preset account relationship prediction model to obtain the first weight coefficient corresponding to each standard data set;

a characteristic extracting module, for performing feature extraction on each standard data set to obtain the second weight coefficient corresponding to each standard data set;

a malice account division module, for obtaining the malice group accounts among all accounts to be identified according to the first weight coefficients and second weight coefficients corresponding to all standard data sets.

7. The device according to claim 6, characterized in that the data normalization module is further used for:

8. The device according to claim 6, characterized in that the preset account relationship prediction model is obtained by performing multiple regression analysis on the characteristic data sets of historical accounts, wherein the historical accounts are malice group accounts.

9. A terminal, characterized in that the terminal includes:

one or more processors;

a memory, for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-5.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1-5.