CN110347669A - Risk prevention method based on streaming big data analysis - Google Patents

Info

Publication number
CN110347669A
Authority
CN
China
Prior art keywords
data
feature
user
credit
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910641683.1A
Other languages
Chinese (zh)
Inventor
马涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Weijia Software Co Ltd
Original Assignee
Chengdu Weijia Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Weijia Software Co Ltd filed Critical Chengdu Weijia Software Co Ltd
Priority to CN201910641683.1A priority Critical patent/CN110347669A/en
Publication of CN110347669A publication Critical patent/CN110347669A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01Social networking

Abstract

The present invention provides a risk prevention method based on streaming big data analysis. The method comprises: screening multidimensional data features from a credit-reporting data feature set; loading the data features and class labels of a training set into a deep belief network (DBN) classifier and training the DBN; then loading the features of a test set and predicting their classes according to the training result, thereby obtaining the classification prediction result for the test set and completing the classification of user credit features based on feature selection. The proposed method reduces the number of user behavior features to be analyzed, eliminates redundancy between features, and uses a more efficient classification model. It thereby improves both the speed and the accuracy of credit evaluation and is better suited to streaming computation over massive data.

Description

Risk prevention method based on streaming big data analysis
Technical field
The present invention relates to network security, and in particular to a risk prevention method based on streaming big data analysis.
Background art
The development of Internet communication and big data technology provides a solid data and technical basis for determining a user's credit rating. Studies have found that a user's online behavior is the manifestation of human behavior on the Internet as a carrier. It is essentially consistent with social behavior, and changes in a user's assets and business status are reflected in that user's network behavior. The social relationships revealed by network behavior data are also considered to be strongly correlated with a user's credit. User credit is therefore present not only in financial statements and mortgage business information but also in unstructured data such as user behavior data and social relationships. Such data are generated continuously and fed into data analysis and mining engines. Compared with traditional data, streaming data is real-time, volatile, bursty, random, and unbounded. Because Internet services demand short system response times, these data generally need to be analyzed and computed in real time. How to improve the accuracy and timeliness of user credit computation in a massive Internet streaming data environment has therefore become a major open problem in big data analysis. As network scale grows geometrically, the volume of data to be inspected becomes extremely large, and traditional network analysis and monitoring tools and platforms can hardly cope; storing and processing large amounts of social network data also consumes considerable resources and time. Moreover, as user behavior and social relationships grow more complex, existing methods cannot identify the behavioral features of risky users or control and manage untrustworthy users, and they introduce computation lag.
Summary of the invention
To solve the above problems in the prior art, the present invention proposes a risk prevention method based on streaming big data analysis, comprising:
screening multidimensional data features from a credit-reporting data feature set, and loading the data features and class labels of a training set into a deep belief network (DBN) classifier, wherein the DBN classifier has multiple hidden layers and different activation functions are used for computation between the hidden layers;
training the DBN, then loading the features of a test set and predicting their classes according to the training result, to obtain the classification prediction result for the test set;
loading the class labels of the test set and comparing them with the values predicted by the DBN classifier for evaluation;
wherein, when the selected credit-reporting data features are classified, a hyperplane is defined that attempts to divide the data set into two classes, positive samples and negative samples. Assume a set of data samples from two linearly separable classes, $(x_i, y_i)$, $i = 1, 2, \dots, n$, where $n$ is the number of samples and $y_i \in \{+1, -1\}$, satisfying the condition:
$y_i(\omega \cdot x_i) - 1 \ge 0$,
where $\omega$ is the feature weight adjustment parameter. The classifier that minimizes $\|\omega\|^2 / 2$ is the optimal classifier, and the problem of solving for the optimal classifier of the credit-reporting data is converted into a quadratic programming optimization problem:
where $a_i$ is a Lagrange multiplier with $a_i \ge 0$, subject to the constraint:
Solving the above yields the optimal classifier function:
where sgn is the sign function;
the classification of user credit features based on feature selection is completed according to the value of $f(x)$;
if the optimal classifier cannot separate the two classes of points, a slack (fault-tolerance) factor $\xi_i \ge 0$ is introduced, such that:
where $\Lambda$ denotes the discrimination threshold of the generalized classifier and CP denotes the penalty factor, which gives the generalized optimal classifier, wherein the condition on $a_i$ above is changed to:
$0 \le a_i \le CP$, $i = 1, 2, \dots, n$
For classification problems that are not linearly separable, the associated data groups are mapped into a higher-dimensional space, and the problem is then solved by linear classification of the associated features. The classification function corresponding to the features in this case is:
where $\Phi$ denotes the function $\Phi(x_i, x) = [(x \cdot x_i) + 1]^{\xi_i}$.
Compared with the prior art, the present invention has the following advantages:
The invention proposes a risk prevention method based on streaming big data analysis that reduces the number of user behavior features to be analyzed, eliminates redundancy between features, and uses a more efficient classification model. It thereby improves both the speed and the accuracy of credit evaluation and is better suited to streaming computation over massive data.
Brief description of the drawings
Fig. 1 is a flowchart of the risk prevention method based on streaming big data analysis according to an embodiment of the present invention.
Detailed description of embodiments
A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.
One aspect of the present invention provides a risk prevention method based on streaming big data analysis. Fig. 1 is a flowchart of the method according to an embodiment of the invention.
The invention detects social network behavior by monitoring the behavior records of social network users and generates behavioral risk warning information. User behavior records include social network session information and, optionally, user transaction records. For session information, port mirroring is performed at the egress of the cluster nodes and the session traffic is directed to a host used for security detection. The raw streaming data packets are captured, then decoded and preprocessed before being forwarded to the detection engine. Preprocessing includes session classification, fragment reassembly, and session reconstruction. After preprocessing, packet headers and payloads are checked against the detection rules stored in the database and the predefined risk signatures, so that risky behavior is identified and intercepted.
The text content of the social network session records includes, but is not limited to, instant messaging chat logs, self-media posts, microblog or forum messages, comments on news websites, and reviews on e-commerce websites. These records are only examples; the specific social network behavior may differ in practice and is not specifically limited here.
In analyzing social network behavior records, rule matching and a user-based behavior model are used to extract risk behavior features. The social network behavior records are first obtained from the server. Pattern matching is then performed on the log files according to the decision rules in the database, and the redundant records produced by normal behavior are eliminated before credit evaluation, so that the violations present in the records are identified and extracted.
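The rule-matching step can be illustrated with a minimal sketch. The rule names and regular expressions below are hypothetical and stand in for the decision rules that the method loads from its database; the sketch only shows how records that match no rule are discarded before credit evaluation.

```python
import re

# Hypothetical decision rules (name -> pattern). In the described method
# these would come from the rule database; the patterns are illustrative.
RULES = {
    "payment_bait": re.compile(r"\b(transfer|wire)\b.*\b(urgent|now)\b", re.I),
    "short_link": re.compile(r"https?://(bit\.ly|t\.cn)/\S+", re.I),
}

def match_rules(record):
    """Return the names of all rules that the record text triggers."""
    return [name for name, pattern in RULES.items() if pattern.search(record)]

def filter_suspicious(records):
    """Drop records that match no rule; keep (record, matched_rules) pairs."""
    hits = []
    for record in records:
        matched = match_rules(record)
        if matched:
            hits.append((record, matched))
    return hits

log = [
    "see you at lunch tomorrow",
    "please transfer the deposit now, it is urgent",
    "win a prize at http://bit.ly/abc123",
]
suspicious = filter_suspicious(log)
```

Only the second and third records survive the filter, each tagged with the rule it triggered.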
Users generally navigate between pieces of information through network relationships. These navigation paths represent the operations by which a user accesses a community website. By scanning the social network graph structure, a pair (2-tuple) consisting of the user currently under analysis and an associated user is created to represent the navigation relationship between the two. Then, by analyzing the server logs, the path and behavior of the user in actually visiting pages are established.
Based on the behavior information and warning information obtained above, statistics are compiled using risk statistics, vulnerability analysis, and availability analysis. The streaming data training set and test set of social network user behavior are read separately. After standardization preprocessing, the training and test data are reduced in dimension by principal component analysis (PCA), which removes redundant data and yields the credit-reporting data feature set.
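The standardization and PCA step might look like the following sketch. It assumes NumPy, synthetic data with two deliberately redundant columns, and the convention (not stated in the text) that the scaling statistics and the projection are fitted on the training set only and then applied to the test set.

```python
import numpy as np

def standardize(train, test):
    """Scale both sets with the mean and standard deviation of the training set."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std

def fit_pca(train, n_components):
    """Return the top principal directions (as columns) of standardized data."""
    cov = np.cov(train, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]

# Synthetic behavior features: columns 2 and 3 are redundant combinations
# of columns 0 and 1, so two components carry all the information.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
raw = np.column_stack([base[:, 0], base[:, 1],
                       2 * base[:, 0], base[:, 0] + base[:, 1]])
train, test = raw[:150], raw[150:]

train_s, test_s = standardize(train, test)
components = fit_pca(train_s, n_components=2)
train_feat = train_s @ components  # reduced training features
test_feat = test_s @ components    # reduced test features
```

Because the synthetic data have rank 2, projecting onto two components loses no information, which is the sense in which PCA removes redundant features here.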
The credit-reporting data features may be one or more selected from the following streaming data sets: historical credit features, such as the user's payment and repayment history on financial websites and purchase, exchange, and order cancellation records on shopping websites; social relationship features, namely the credit data of other users with whom the user is associated in the social network, together with the closeness, depth, and breadth of the user's contact with those users, such as relationship duration and session ratio; behavioral preference features, namely regularities in user behavior derived from the type, time period, and frequency of the web pages or applications the user accesses and from social network evaluation information; and identity attribute features, namely personal attributes such as age, occupation, marital status, and education level that are predicted from the user's network behavior and checked for consistency with the basic information the user entered. These features are only examples; a sample in practice may contain more or fewer features than listed, and the specific features may differ and are not specifically limited here.
Multidimensional data features are screened from the credit-reporting data feature set, and the data features and class labels of the training set are loaded into a deep belief network (DBN) classifier. The DBN classifier has multiple hidden layers, and different activation functions are used for computation between the hidden layers. After the DBN has been trained on the training data, the test data are loaded and the classes of the test set are predicted according to the training result. Obtaining the prediction result completes the machine-learning-based detection of credit grade. Finally, the class labels of the test set are loaded and compared with the values predicted by the DBN classifier for evaluation.
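The text does not specify how the DBN is trained. A common construction, shown here only as an assumption, stacks restricted Boltzmann machines (RBMs) and pretrains them layer by layer with one-step contrastive divergence (CD-1). The sketch uses sigmoid units throughout, although the text calls for different activation functions between hidden layers, and it omits the supervised output layer that would produce the class prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RBM:
    """Bernoulli RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, lr=0.1):
        # Positive phase, one Gibbs step, then negative phase.
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)

def pretrain_dbn(X, layer_sizes, epochs=5, seed=0):
    """Greedy layer-wise pretraining; returns the RBMs and the top-layer features."""
    rng = np.random.default_rng(seed)
    rbms, data = [], X
    for n_hidden in layer_sizes:
        rbm = RBM(data.shape[1], n_hidden, rng)
        for _ in range(epochs):
            rbm.cd1_step(data)
        rbms.append(rbm)
        data = rbm.hidden_probs(data)  # input for the next layer
    return rbms, data

# Toy binary feature matrix standing in for the screened credit features.
X_demo = (np.random.default_rng(1).random((20, 6)) < 0.5).astype(float)
rbms, top_features = pretrain_dbn(X_demo, layer_sizes=[4, 3])
```

In a full classifier the top-layer features would feed a supervised layer trained on the class labels, and the same forward pass would then be applied to the test set.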
When the selected credit-reporting data features are classified, preferably a hyperplane is defined that attempts to divide the data set into two classes, positive samples and negative samples. Assume a set of data samples from two linearly separable classes, $(x_i, y_i)$, $i = 1, 2, \dots, n$, where $n$ is the number of samples and $y_i \in \{+1, -1\}$, satisfying the condition:
$y_i(\omega \cdot x_i) - 1 \ge 0$,
where $\omega$ is the feature weight adjustment parameter. The classifier that minimizes $\|\omega\|^2 / 2$ is the optimal classifier, and the problem of solving for the optimal classifier of the credit-reporting data is converted into a quadratic programming optimization problem:
where $a_i$ is a Lagrange multiplier with $a_i \ge 0$, subject to the constraint:
Solving the above yields the optimal classifier function:
where sgn is the sign function.
If the optimal classifier cannot separate the two classes of points, a slack (fault-tolerance) factor $\xi_i \ge 0$ is introduced, such that:
where $\Lambda$ denotes the discrimination threshold of the generalized classifier and CP denotes the penalty factor, which gives the generalized optimal classifier. The dual problem of the generalized optimal classifier is the same as in the linearly separable case, except that the condition on $a_i$ becomes:
$0 \le a_i \le CP$, $i = 1, 2, \dots, n$
For classification problems that are not linearly separable, the associated data groups are mapped into a higher-dimensional space, and the problem is then solved by linear classification of the associated features. The classification function corresponding to the features in this case is:
where $\Phi$ denotes the function $\Phi(x_i, x) = [(x \cdot x_i) + 1]^{\xi_i}$.
The classification of user credit features based on feature selection is thus completed according to the value of $f(x)$.
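The classification function is not reproduced in the text above. The sketch below assumes the standard kernel form, the sign of $\sum_i a_i y_i \Phi(x_i, x) + b$, with the polynomial kernel $[(x \cdot x_i) + 1]^q$. The support vectors, multipliers, bias, and degree are illustrative values rather than the solution of an actual quadratic program.

```python
def poly_kernel(u, v, degree=2):
    """Polynomial kernel [(u . v) + 1] ** degree."""
    return (sum(a * b for a, b in zip(u, v)) + 1.0) ** degree

def decision_value(x, support_vectors, labels, multipliers, bias=0.0, degree=2):
    """Weighted kernel sum over the support vectors, plus the bias term."""
    total = sum(a * y * poly_kernel(sv, x, degree)
                for sv, y, a in zip(support_vectors, labels, multipliers))
    return total + bias

def classify(x, support_vectors, labels, multipliers, bias=0.0, degree=2):
    """Return +1 or -1 according to the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, multipliers, bias, degree)
    return 1 if value >= 0 else -1

# Illustrative (assumed) solution with one support vector per class.
svs = [[1.0, 1.0], [-1.0, -1.0]]
ys = [1, -1]
alphas = [0.5, 0.5]

good = classify([2.0, 2.0], svs, ys, alphas)    # point on the positive side
bad = classify([-2.0, -2.0], svs, ys, alphas)   # point on the negative side
```

In practice the multipliers would come from solving the dual problem under the box constraint $0 \le a_i \le CP$, for example with an off-the-shelf QP or SVM solver.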
Optionally, after the credit feature information has been extracted, deep learning is used to map the behavioral data features of identified risky users and of the user currently under analysis into a newly built feature space with risk-discriminating power. The similarity scores of the two are weighted and averaged in this feature space, which yields the similarity between the behavioral features of the current user and those of the risky users.
In the training stage for the streaming data, a deep neural network trained in advance on a large library of credit-related behavior is used to extract the credit feature information. The training sample set is drawn from a social network session data set and a financial transaction data set. In this deep neural network, each convolutional layer is immediately followed by an activation function layer that converts the linear input into a nonlinear output. The output of a hidden-layer node is:
$h_i(x) = \max_{j \in [1, k]} \left( x^{T} W_{ij} + b_{ij} \right)$
where $W_{ij}$ is the node value in column $i$, row $j$ of the feature matrix and $b_{ij}$ is the balance factor (bias) of that node. Each hidden-layer unit corresponds to $k$ sub-hidden-layer nodes, and the largest of their $k$ output values is taken as the output of the activation function. With 2 nodes per activation unit, the number of channels after each convolutional layer is reduced to 1/2 of the original.
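The activation described by $h_i(x)$ takes the maximum over $k$ affine pieces, which matches what is usually called a maxout unit. A minimal NumPy sketch follows; the array layout (input dimension, units, pieces) and the toy weights are assumptions made for illustration.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout activation.

    x: input vector of shape (d,)
    W: weights of shape (d, units, k), one affine piece per (unit, j)
    b: biases of shape (units, k)
    Returns the per-unit maximum over the k pieces, shape (units,).
    """
    z = np.einsum("d,duk->uk", x, W) + b  # all k affine responses per unit
    return z.max(axis=1)

# One unit with k = 2 pieces: piece 0 computes x[0], piece 1 computes -x[1] + 0.5.
W = np.zeros((2, 1, 2))
W[:, 0, 0] = [1.0, 0.0]
W[:, 0, 1] = [0.0, -1.0]
b = np.array([[0.0, 0.5]])

out_a = maxout(np.array([1.0, 2.0]), W, b)    # pieces give 1.0 and -1.5
out_b = maxout(np.array([-3.0, -2.0]), W, b)  # pieces give -3.0 and 2.5
```

With $k = 2$, every pair of linear responses collapses into one output, which is why the channel count halves after each such layer.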
In the pooling layer, the maximum and the mean of the features in a neighborhood are taken as the new feature values of that neighborhood, in order to retain the contextual information of the behavior. Specifically, after the convolution operation and activation, the behavioral feature data undergo max pooling and average pooling separately, and the two pooling results are concatenated and output as the new feature.
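The two-branch pooling can be sketched as follows. The sketch assumes a one-dimensional sequence of feature vectors (time steps by channels) and non-overlapping windows; the text fixes neither the window size nor the stride.

```python
import numpy as np

def pool_concat(features, window):
    """Max-pool and average-pool over non-overlapping windows, then concatenate.

    features: array of shape (time_steps, channels)
    Returns an array of shape (time_steps // window, 2 * channels).
    """
    t, c = features.shape
    usable = (t // window) * window           # drop an incomplete trailing window
    blocks = features[:usable].reshape(-1, window, c)
    max_pooled = blocks.max(axis=1)
    avg_pooled = blocks.mean(axis=1)
    return np.concatenate([max_pooled, avg_pooled], axis=1)

activated = np.array([[1.0, 4.0],
                      [3.0, 2.0],
                      [5.0, 0.0],
                      [7.0, 8.0]])
pooled = pool_concat(activated, window=2)
```

Each output row holds the max-pooled channels followed by the average-pooled channels of one window, so the channel dimension doubles while the time dimension shrinks.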
To obtain a network model whose representation of credit features is good enough for feature extraction, the mapping space that the deep neural network learns from a limited number of data samples should make the intra-class distance as small as possible and the inter-class distance as large as possible. A clustering constraint is therefore added to the computation of the cost function $L$, so that data of the same class gather together and data of different classes move apart:
where $m$ is the number of clusters, $x_i$ is a sample feature vector in the $i$-th cluster, $W^{T}$ is the transpose of the regression matrix, $\lambda$ is the weight decay parameter, and $c_{x_i}$ is the adjustment coefficient of the feature vector $x_i$.
Risk prevention involves not only identifying behaviors such as a user's default on transactions but also the risk of fraud between individual users in the social network. A fraudulent user typically poses as a normal user to gain the trust of others and spreads false publicity over the network. Such a user invests a certain amount of time and real benefit early on to win the victim's trust and then quickly disappears after obtaining illegal gains, or accumulates money illegally under pretexts such as fake orders or crowdfunding. The key to identifying users who commit fraud lies in extracting and representing their behavioral features. The invention represents user behavior by a credit sequence cluster that contains a set of feature sequences. The feature sequence sets are classified and learned automatically, without any annotation of or prior knowledge about the behavior structure.
First, the user's network behavior is decomposed into basic feature sequences; second, the feature sequences are transformed into index sequences. This gives the training behavior set $\{(V_n, y_n)\}$, $n = 1, 2, \dots, N$, where $V_n$ is the behavior set of a user and $y_n \in \{1, 2, \dots, C\}$ is the operation feature class label; $N$ is the number of user operations and $C$ is the number of classes. For example, the features used to analyze fraudulent behavior include the topology parameters of the user's social network, the duration for which friendships are maintained, the number of friends added and the number of friends deleted within a preset time, the ratio of friendship duration to the time of fund transfers, and, for the same order, the ratio of the number of fund transfers made to friends who were then deleted to the total number of fund transfers.
The behavior $V_n$ is then expressed as a feature sequence $X_n$, defined as follows:
$X_n = [X_{1,n}, \dots, X_{i,n}, \dots, X_{l_n,n}]$
where $X_{i,n}$ is the feature set computed for the $i$-th time period and $l_n$ is the number of time periods in $V_n$.
A feature sequence set is expressed as $\mu = \{p_i \mid i = 1, \dots, N_p\}$, where $N_p$ is the number of feature sequence sets. The $i$-th feature sequence $p_i$ is defined as $\{X_i, \tau_i\}$,
where $\tau_i$ is the detection threshold.
To compute $X_i$, a matrix transformation is first applied to all training feature sequences $\{X_1, \dots, X_N\}$ in order to obtain representative time periods and an index that clusters all time periods. The transformation matrix $A$ is expressed as:
where the two vectors in the formula are the Fisher vectors of descriptor type $t$ in $X_{i,n}$ and $X_{k,m}$, respectively.
Then, for the $i$-th credit sequence cluster, the training data sequences are established by setting the detection threshold $\tau_i$, which prevents noisy sequence patterns from being mined.
A feature sequence $X_n$ is converted into an index sequence, expressed as
$I_n = [I_{1,n}, \dots, I_{i,n}, \dots, I_{l_n,n}]$
where $I_{i,n}$ is the index of the $i$-th feature sequence. $X_n$ is processed with the feature sequence detection model, and the index $I_{i,n}$ is chosen so that the SVM response is maximized.
From the trained index sequences $[I_1, I_2, \dots, I_N]$, the feature sequence set $R$ is obtained with a data mining algorithm. The feature sequence set $R$ represents the local feature structure of a user operation, and its $j$-th sequence $R_j$ is defined as follows:
$R_j = \{c_j, s_j, x_j, w_j\}$
where $c_j \in \{1, 2, \dots, C\}$ is the operation feature class label, $s_j$ is the sequence pattern, $x_j$ is the feature of the feature sequence set, and $w_j$ is the weight of $R_j$ in operation feature class $c_j$.
To compute $s_j$, the training data indexes are collected first, and the sequence patterns are then computed from the collected training index sequences. Because the same sequence pattern may be mined from two operation feature classes, a weight $w_j$ is set; for a pattern $s_j$, $w_j$ represents the relative support of $s_j$. If the same pattern occurs in two or more operation feature classes, the weights of the corresponding feature sequence sets are reduced. If a pattern occurs in only one class, its weight takes the maximum value 1.
Each feature sequence set represents one specific operation type and preserves the temporal relationships among its feature sequences. Because the types are diverse, complex features can be modeled effectively. Given a test behavior $V_T$ and the feature sequence set $R$, the evaluation function of an operation feature $c$ may be expressed as:
where $\alpha_{j,c}$, $\beta_{j,c}$, and $\gamma_{j,c}$ are the parameters of the $j$-th feature sequence set in operation feature class $c$, $N_R$ is the number of feature sequence sets, $I_T$ is the sequence index, $X_T$ is the feature sequence of $V_T$, and $\sigma(I_T, s_j)$ is the sequence credit feature. $\sigma(I_T, s_j)$ is used to compute the structural similarity between the test behavior and the feature sequence set. Let the initial values be $F(n, 0) = 0$, $n \in [0, L]$, and $F(0, m) = -m$, $m \in [0, m_j]$, where $L$ is the number of time periods in $I_T$ and $m_j$ is the length of the sequence pattern $s_j$. The matching matrix $F$ is then defined as follows:
$F(n, m) = \max\{-1 + A(X_{n,T}, X_{m,j}),\ F(n-1, m),\ F(n, m-1)\}$
The sequence credit feature aligns a long sequence, namely the test sequence that represents the entire operation feature structure, with a short sequence that describes part of the structure of an operation feature. When $s_j$ matches the test sequence, $\sigma(I_T, s_j)$ attains the maximum credit score:
$\sigma(I_T, s_j) = \max\left(F(n, m_j) / m_j\right)$
In the understanding and recognition stage of the credit-reporting data features, a hierarchical analysis algorithm is used to identify the sequence credit feature $\sigma(I_T, s_j)$ accurately and quickly. The algorithm makes the rank of the within-class scatter matrix $S_w$ as small as possible and the rank of the between-class scatter matrix $S_b$ as large as possible, in order to reach optimal classification performance. The Fisher function $J$ is computed:
where the argument of $J$ is an $n$-dimensional vector. The vector that maximizes $J$ is chosen as the projection direction, which gives the largest $S_b$ and the smallest $S_w$ after projection. A group of optimal discriminant vectors is selected to build the projection matrix $W$, expressed as:
Finally, in the learning based on hierarchical analysis, PCA is applied to the projection matrix $W$ to reduce its dimension and eliminate redundant feature information, which completes the identification of risky user features.
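The Fisher function $J$ is not reproduced in the text. The sketch below assumes the standard two-class Fisher criterion, the ratio of between-class scatter to within-class scatter along a candidate projection direction, and evaluates it on toy data. It illustrates the criterion only, not the hierarchical analysis algorithm itself.

```python
import numpy as np

def fisher_criterion(class_a, class_b, direction):
    """Ratio of between-class to within-class scatter along `direction`."""
    a = np.asarray(class_a, dtype=float)
    b = np.asarray(class_b, dtype=float)
    phi = np.asarray(direction, dtype=float)
    mean_a, mean_b = a.mean(axis=0), b.mean(axis=0)
    diff = mean_a - mean_b
    s_b = np.outer(diff, diff)                       # between-class scatter
    s_w = ((a - mean_a).T @ (a - mean_a)
           + (b - mean_b).T @ (b - mean_b))          # within-class scatter
    return float(phi @ s_b @ phi) / float(phi @ s_w @ phi)

# Two toy classes that differ only along the first axis.
normal_users = [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
risky_users = [[3.0, 0.0], [5.0, 0.0], [4.0, 1.0], [4.0, -1.0]]

j_first_axis = fisher_criterion(normal_users, risky_users, [1.0, 0.0])
j_second_axis = fisher_criterion(normal_users, risky_users, [0.0, 1.0])
j_diagonal = fisher_criterion(normal_users, risky_users, [1.0, 1.0])
```

The first axis scores highest because it separates the class means while keeping each class compact, which is the property a projection direction is chosen for.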
Once credit risk identification has been completed for some users, the credit evaluation of other, new users can use a deep network to analyze whether the behavior pattern of the current new user is similar to that of risky users, thereby identifying the features of the risky users and of the user currently under analysis. Specifically, the feature sample pair $(x_f, x_c)$ of a confirmed risky user and the user under analysis is first recorded, where $x_f$ and $x_c$ denote the credit feature vectors of the risky user and of the user under analysis, respectively. The goal of deep learning is to find a mapping function $f$ such that $f(x_f)$ and $f(x_c)$ satisfy the following relationship in the newly built feature space: when the new user under analysis has behavior pattern features similar to those of the risky user, the distance between $f(x_c)$ and $f(x_f)$ is as small as possible; when the user does not have behavior pattern features similar to those of the risky user, the distance between $f(x_f)$ and $f(x_c)$ is as large as possible.
To simplify the problem further, a convolutional network is trained before the deep learning algorithm is applied. By learning a set of hierarchical nonlinear transformations, the feature sample pairs are projected into the newly built feature space, in which positive sample pairs are greater than a preset threshold and negative sample pairs are less than that threshold, so that the deep network can make an accurate judgment.
Assume the deep network has $M$ layers in total and the $m$-th layer has $p(m)$ neurons, $m = 1, 2, 3, \dots, M$. For a given user behavior feature vector, the output of the $m$-th layer is:
$h^{(m)} = \tanh\left(W^{(m)} h^{(m-1)} + b^{(m)}\right)$;
where $W^{(m)}$ is the weight parameter of the $m$-th layer and $b^{(m)}$ is the bias of the $m$-th layer. Passing $x_f$ and $x_c$ through the above $M$-layer nonlinear transformation gives:
$F(x_f) = h_f^{(M)}$, $F(x_c) = h_c^{(M)}$, and the distance between the risky user and the user under analysis in the new feature space is $d^2_{fc}(x_f, x_c) = \|F(x_f) - F(x_c)\|^2$.
The behavior pattern similarity measure between the user and the risky user should then satisfy:
if $d^2_{fc}(x_f, x_c) < \tau - 1$, then $x_f$ and $x_c$ have similar behavior patterns;
if $d^2_{fc}(x_f, x_c) > \tau + 1$, then $x_f$ and $x_c$ do not have similar behavior patterns;
where $\tau$ is the preset risk distance threshold, so that positive and negative samples are well separated in the newly built feature space.
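The forward mapping and the two threshold tests can be sketched as follows. The weights here are placeholders (an identity layer) rather than trained parameters, and the "undecided" outcome for distances between $\tau - 1$ and $\tau + 1$ is an assumption, since the text does not say how that band is handled.

```python
import numpy as np

def embed(x, weights, biases):
    """Apply the M tanh layers: h(m) = tanh(W(m) h(m-1) + b(m))."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        h = np.tanh(W @ h + b)
    return h

def squared_distance(x_f, x_c, weights, biases):
    """Squared Euclidean distance between the two embedded feature vectors."""
    diff = embed(x_f, weights, biases) - embed(x_c, weights, biases)
    return float(diff @ diff)

def similarity_decision(d2, tau):
    """Apply the margin rule around the risk distance threshold tau."""
    if d2 < tau - 1:
        return "similar"
    if d2 > tau + 1:
        return "dissimilar"
    return "undecided"  # assumption: the margin band is left unclassified

# Placeholder single-layer network; real weights would come from training.
weights = [np.eye(2)]
biases = [np.zeros(2)]
tau = 2.0

d_same = squared_distance([0.0, 0.0], [0.0, 0.0], weights, biases)
d_far = squared_distance([10.0, 10.0], [-10.0, -10.0], weights, biases)
label_same = similarity_decision(d_same, tau)
label_far = similarity_decision(d_far, tau)
```

The margin of 1 on either side of $\tau$ keeps similar and dissimilar pairs from sitting right at the boundary, which is what the training objective is meant to enforce.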
The optimization objective function is then set as:
where
$\beta$ is an adjustment factor. The weight parameters $W$ and biases $b$ are obtained from the above formula using the stochastic gradient descent algorithm.
The new feature representation pair $(x'_f, x'_c)$ is obtained through deep learning, and a similarity algorithm then gives the behavior pattern similarity $S_{fc}(x'_f, x'_c)$ between the user under analysis and a given risky user:
This is the final user similarity estimate, where $x'_{fi}$ and $x'_{ci}$ are the $i$-th components of the feature vectors $x'_f$ and $x'_c$, respectively, and $D$ is the dimension of the feature vectors.
If the behavior pattern similarity between the user under analysis and a risky user is greater than a preset threshold, the user under analysis is identified as an untrustworthy user.
Except it is above-mentioned based on the feature of social networks figure in addition to, the semanteme of dialogue-based content is also needed to user credit assessment Analysis.Such as certain advertisement type users, by the frequency for repeating to send similar content to attract legitimate user to access, and one is used A little tools issue content again, express identical semanteme using different word expression.They are distinguished from normal users Become more difficult.Based on this, office of the embodiment of the present invention to user each in social networks in social networks network topology The centripetal degree feature in portion is calculated, and identifies the risk subscribers of disguise as normal users.
In the social network, nodes represent users and edges represent social relationships. An edge a = (i, j) directed from node $V_i$ to node $V_j$ indicates that users i and j share at least one session. Even if an untrustworthy user changes their own attributes, it is hard for them to change their position in the social network topology. The following features of each user node are therefore computed from this topology.
The local centrality (centripetal degree) of a node is the amount by which the associated energy of the network drops when that node is removed from the network. Local centrality takes into account both local density information and bottleneck information. The associated energy of a topological graph is defined as:
$E_L(G) = \sum_i \theta_i^2$
where $\theta_i$ are the eigenvalues of the Kirchhoff (Laplacian) matrix of graph G, whose sum equals the sum of all vertex out-degrees. Let A(G) be the adjacency matrix of G and D(G) the diagonal matrix of vertex out-degrees. The Kirchhoff matrix of G is L(G) = D(G) − A(G).
For a topological graph G with n vertices whose out-degrees are $d_1, d_2, \dots, d_n$, the associated energy $E_L(G)$ reflects the connectivity within the graph. Removing a vertex from the graph reduces the associated energy of the graph, and the amount by which $E_L(G)$ is reduced reflects the importance of that vertex in the graph. Let H be the graph obtained by removing vertex v from graph G. The local centrality of vertex v is:
$C_v = E_L(G) - E_L(H)$
Untrustworthy users have unstable social network structures and only weak relationships with their neighbor nodes. Removing such socially unimportant untrustworthy users from the network therefore reduces the energy of the network only slightly.
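The computation of $C_v$ can be sketched as follows. This is a minimal pure-Python illustration, not the implementation of the invention. It relies on the identity that the sum of the squared eigenvalues of a square matrix equals the trace of its square, so no eigenvalue solver is needed. The graph representation (a dictionary mapping each node to the set of nodes it points to) and all function names are assumptions of the sketch.

```python
from typing import Dict, Hashable, Set

# Assumed representation: node -> set of successor nodes.
# Every node must appear as a key, even if it has no outgoing edges.
Graph = Dict[Hashable, Set[Hashable]]


def laplacian_energy(graph: Graph) -> float:
    """E_L(G) = sum of squared eigenvalues of L = D - A, computed as trace(L @ L)."""
    nodes = list(graph)
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    lap = [[0.0] * n for _ in range(n)]
    for v, successors in graph.items():
        i = index[v]
        for w in successors:
            lap[i][i] += 1.0          # out-degree on the diagonal (D)
            lap[i][index[w]] -= 1.0   # minus the adjacency entry (A)
    # trace(L^2) = sum over i, j of L[i][j] * L[j][i]
    return sum(lap[i][j] * lap[j][i] for i in range(n) for j in range(n))


def remove_node(graph: Graph, v: Hashable) -> Graph:
    """Return graph H: a copy of G without vertex v and its incident edges."""
    return {u: {w for w in succ if w != v} for u, succ in graph.items() if u != v}


def local_centrality(graph: Graph, v: Hashable) -> float:
    """C_v = E_L(G) - E_L(H), where H is G with vertex v removed."""
    return laplacian_energy(graph) - laplacian_energy(remove_node(graph, v))


if __name__ == "__main__":
    # Hypothetical three-user chain a - b - c with edges in both directions.
    chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    for user in chain:
        print(user, local_centrality(chain, user))
```

In this toy chain the middle user has the largest energy drop, which matches the intuition in the text that weakly connected users contribute little to the network energy.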
Because untrustworthy users pursue specific commercial interests, the session content they publish tends to be highly similar: it contains large amounts of repeated session content, harmful links, and similar information. The session text in the streaming data is therefore first decomposed into phrases, and the semantic distance between these phrases is then computed using bag-of-words analysis. Closed bag-of-words feature sets are used to compute content similarity. Each feature set contains a list of words with similar meaning. Checking the similarity of these words gives the similarity of the whole content, from which the similarity between the session contents a user publishes on each occasion is computed.
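The bag-of-words comparison described above might look like the following sketch. The feature sets, the tokenizer, and the use of cosine similarity over feature-set counts are all assumptions made for illustration; the description does not specify which similarity measure is applied to the feature sets.

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import Dict, List

# Hypothetical closed feature sets: each groups words of similar meaning.
FEATURE_SETS: Dict[str, List[str]] = {
    "discount": ["discount", "sale", "cheap", "bargain"],
    "click": ["click", "visit", "open"],
    "link": ["link", "url", "site"],
}
WORD_TO_FEATURE = {w: name for name, words in FEATURE_SETS.items() for w in words}


def feature_counts(text: str) -> Counter:
    """Map each known word to its feature set and count occurrences per set."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(WORD_TO_FEATURE[t] for t in tokens if t in WORD_TO_FEATURE)


def content_similarity(a: str, b: str) -> float:
    """Cosine similarity of feature-set counts (assumed measure); 0.0 if either is empty."""
    ca, cb = feature_counts(a), feature_counts(b)
    dot = sum(ca[k] * cb[k] for k in ca)
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(messages: List[str]) -> float:
    """Average similarity over all pairs of one user's messages."""
    pairs = list(combinations(messages, 2))
    if not pairs:
        return 0.0
    return sum(content_similarity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    posts = ["Click this link for a cheap sale",
             "Visit our site, big discount bargain"]
    print(mean_pairwise_similarity(posts))
```

Because both example posts map to the same feature sets with the same counts, they are treated as the same content even though they share no words, which is the rewording case the text is concerned with.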
Once the local centrality (centripetal degree) of each user and the similarity between the session contents each user publishes have been obtained, risk discrimination thresholds are set. User nodes whose local centrality is below the preset centrality threshold and whose session content similarity is above the preset similarity threshold are filtered out and identified as untrustworthy users.
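The combined filter can be expressed in a few lines. The sketch below assumes that the two per-user scores have already been computed and are held in dictionaries keyed by user ID; the function name and the parameter names are hypothetical.

```python
from typing import Dict, List


def flag_untrustworthy(centrality: Dict[str, float],
                       similarity: Dict[str, float],
                       centrality_threshold: float,
                       similarity_threshold: float) -> List[str]:
    """Return users with low local centrality AND high content similarity.

    Users missing from either dictionary are skipped (an assumption).
    """
    return sorted(
        user for user in centrality
        if user in similarity
        and centrality[user] < centrality_threshold
        and similarity[user] > similarity_threshold
    )


if __name__ == "__main__":
    # Hypothetical scores for three users.
    c = {"u1": 2.0, "u2": 40.0, "u3": 1.5}
    s = {"u1": 0.95, "u2": 0.97, "u3": 0.20}
    print(flag_untrustworthy(c, s, centrality_threshold=5.0, similarity_threshold=0.8))
```

Both conditions must hold: a well-connected user who repeats content, or a weakly connected user with varied content, is not flagged by this rule.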
Among the defaulted transactions that bring risk to the system, a large proportion are caused by the low credit of a user's superior node in the transaction chain. The present invention therefore further performs risk diffusion identification for risk diffusion behavior in the transaction chain. Each user's passive credit reduction threshold is set according to the average value of all of that user's transactions over a past period. When multiple transactions exist at the same time, the influence of the network structure on diffusion is also considered. A network G(V, E) is built from real transaction data. The node set V is the set of all transacting users, where S(x) is the set of risk-free users and I(x) is the set of risky users. The edge set E is the set of transactions between users in the network. The weight on edge $E_{ij}$ is denoted $\{a_{ij}\}$ and is the number of transactions between the users. The state of user i is $n_i$, where $n_i = 1$ indicates default and $n_i = 0$ indicates no default. The state of the transaction $E_{ij}$ between users is $e_{ij}$, where $e_{ij} = 1$ indicates that the transaction between the users is abnormal and $e_{ij} = 0$ indicates that it is normal. The number of passive credit reductions of user j is $d_j = \sum_i A_{ij} a_{ij} e_{ij}$. Let the users' passive credit reduction thresholds be $\{\delta_i\}$, the numbers of passive credit reductions be $\{F_i\}$, and the set of users whose credit is passively reduced through risk diffusion be Risk(x). The diffusion process is described as follows:
a) Initialize all users to the normal state (S), then randomly switch some users to the risk state (I), i.e. randomly change some $n_i$ from 0 to 1. A transaction $E_{ij}$ of each of these users defaults, so $e_{ij} = 1$.
b) Add the number of defaults to the superior user. Once the superior user's number of passive credit reductions exceeds the given threshold, i.e. when $d_j = \sum_i A_{ij} a_{ij} e_{ij} > \delta_j$, that user changes from S to I.
c) Record the number of times $F_i$ that each user i is reached by diffusion and enters the risk state. The diffusion process ends when the set Risk(x) of users whose credit is passively reduced is the same after two consecutive diffusion rounds.
Defaulted transactions are extracted from the transaction network to form a sub-network, and the passive credit reduction counts $F_i$ are sorted in descending order. The X users with the highest $F_i$ in the sub-network are selected as the identified high-risk propagation users, where X is a preset quantity threshold.
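Steps a) to c) and the final ranking can be sketched as a small simulation. Several points are assumptions of this sketch rather than statements of the description: every transaction from a risky user to a superior is treated as abnormal ($e_{ij} = 1$), the initial risky users are supplied as explicit seed sets instead of being drawn at random (to keep the example deterministic), $F_i$ is accumulated over the supplied seed sets, and users without a threshold never change state. All names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (user i, superior user j)


def diffuse(weights: Dict[Edge, int],
            thresholds: Dict[str, float],
            seeds: Iterable[str]) -> Set[str]:
    """Run one diffusion and return Risk(x): users switched from S to I by diffusion."""
    risky = set(seeds)          # step a): initial users in state I
    reduced: Set[str] = set()   # users whose credit was passively reduced
    while True:
        # step b): d_j = sum of a_ij over abnormal transactions into superior j
        pressure: Dict[str, int] = {}
        for (i, j), a_ij in weights.items():
            if i in risky:      # assumption: such a transaction has e_ij = 1
                pressure[j] = pressure.get(j, 0) + a_ij
        newly = {j for j, d in pressure.items()
                 if j not in risky and d > thresholds.get(j, float("inf"))}
        if not newly:           # step c): Risk(x) unchanged between rounds
            return reduced
        risky |= newly
        reduced |= newly


def top_risk_spreaders(weights: Dict[Edge, int],
                       thresholds: Dict[str, float],
                       seed_sets: Iterable[Iterable[str]],
                       top_x: int) -> List[str]:
    """Count F_i over all runs and return the top_x users by descending F_i."""
    counts: Dict[str, int] = {}
    for seeds in seed_sets:
        for user in diffuse(weights, thresholds, seeds):
            counts[user] = counts.get(user, 0) + 1
    ranked = sorted(counts, key=lambda u: (-counts[u], u))
    return ranked[:top_x]


if __name__ == "__main__":
    # Hypothetical transaction chain: a and b trade with c, c and e trade with d.
    w = {("a", "c"): 2, ("b", "c"): 1, ("c", "d"): 3, ("e", "d"): 1}
    delta = {"c": 2, "d": 2}
    print(top_risk_spreaders(w, delta, [{"a"}, {"a", "b"}, {"c"}], top_x=1))
```

In a streaming setting the seed sets would presumably come from observed defaults rather than fixed lists, and the thresholds would be derived from each user's transaction history as described above.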
To further restrict violations by risk users and untrustworthy users, and to reduce the adverse effect of illegal operations on normal users and the social network environment, the present invention applies specific restriction or control strategies after the identification phase on the streaming data is complete. These strategies include, but are not limited to, the following:
1: Restrict the user's ability to expand their social range. When the user tries to search for other new users, or when the system recommends new friends to the user, reduce the number of objects visible to the user or pushed by the system. Specifically, sort the new users that would originally be recommended to the user from low to high by risk value and hide the preset proportion of new users with the highest system trust level, thereby limiting the influence of the risk user on normal users.
2: Mark the user. If the user commits no violation within a certain period but defaults again after that period, identify the user as an untrustworthy user again and place them in a risk level higher than their previous control level.
3: If a user is determined to be a malicious defaulter, apply mandatory control measures, including freezing the account, to prevent the user from continuing to affect the social network environment.
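Strategy 1 can be illustrated with a short sketch. It assumes that the lowest risk value corresponds to the highest system trust level and that the number of hidden candidates is rounded up; neither detail is fixed by the description, and the function and parameter names are hypothetical.

```python
import math
from typing import Dict, List


def visible_recommendations(candidate_risk: Dict[str, float],
                            hidden_ratio: float) -> List[str]:
    """Hide the most trusted share of candidates from a restricted user.

    candidate_risk maps each recommendable new user to a risk value.
    Candidates are sorted from low to high risk, and the first
    ceil(hidden_ratio * n) of them (assumed to be the most trusted) are hidden.
    """
    if not 0.0 <= hidden_ratio <= 1.0:
        raise ValueError("hidden_ratio must be between 0 and 1")
    ranked = sorted(candidate_risk, key=lambda u: (candidate_risk[u], u))
    hidden = math.ceil(len(ranked) * hidden_ratio)
    return ranked[hidden:]


if __name__ == "__main__":
    # Hypothetical candidates with their risk values.
    risk = {"u1": 0.1, "u2": 0.5, "u3": 0.9, "u4": 0.3}
    print(visible_recommendations(risk, hidden_ratio=0.5))
```

With a ratio of 0.5, the two lowest-risk candidates are withheld, so the restricted user can no longer reach the most trusted users through search or recommendation.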
In summary, the present invention proposes a risk prevention method based on streaming big data analysis that reduces the number of user behavior features to be analyzed, eliminates redundancy between features, and uses a more efficient classification model. This improves both the speed and the accuracy of credit evaluation and is better suited to streaming computation over massive data.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The present invention is thus not limited to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention serve only to illustrate or explain the principles of the invention and do not limit it. Any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the present invention shall therefore fall within its scope of protection. Furthermore, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the appended claims, or the equivalents of such scope and boundaries.

Claims (1)

1. A risk prevention method based on streaming big data analysis, characterized by comprising:
screening multidimensional data features from a credit-reporting data feature set, and loading the data features and types of the training set into a DBN (deep belief network) classifier, the DBN classifier having multiple hidden layers, with different activation functions used for computation between the hidden layers;
training the DBN, then loading the features of the test set and predicting their types, the test data set types being predicted according to the trained result, to obtain the data classification prediction result for the test set;
loading the test set type labels and comparing them with the values predicted by the DBN classifier for evaluation;
wherein, when classifying the selected credit-reporting data features, a hyperplane is defined to divide the data set into two classes, positive samples and negative samples; assume there is a set of data samples in two linearly separable classes: $(x_i, y_i)$, $i = 1, 2, \dots, n$, where n is the number of samples and $y_i \in \{+1, -1\}$, satisfying the following condition:
$y_i(\omega \cdot x_i) - 1 \ge 0$,
where $\omega$ is the feature weight adjustment parameter; the classifier that minimizes $\lVert \omega \rVert^2 / 2$ is the optimal classifier, and the problem of solving for the optimal classifier of the credit-reporting data is converted into a quadratic programming optimization problem:
where $a_i$ are the Lagrange multipliers with $a_i \ge 0$, and the constraint is:
from the above solution, the optimal classifier function is:
where sgn is the sign function;
the classification of user credit features based on feature selection is completed according to the value of f(x);
if the optimal classifier cannot separate the two classes of points, a fault-tolerance (slack) factor $\xi_i \ge 0$ is introduced such that:
where Λ denotes the discrimination threshold of the generalized classifier and CP denotes the penalty factor; the generalized optimal classifier is obtained, in which the above constraint on $a_i$ is changed to:
$0 \le a_i \le CP$, $i = 1, 2, \dots, n$
for the linear classification problem, the associated data groups are mapped into a higher-dimensional space and the problem is then solved by linear classification of the associated features; the classification function corresponding to the features is then:
where Φ denotes the function $\Phi(x_i, x) = [(x \cdot x_i) + 1]^{\xi_i}$
CN201910641683.1A 2019-07-16 2019-07-16 Risk prevention method based on streaming big data analysis Withdrawn CN110347669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910641683.1A CN110347669A (en) 2019-07-16 2019-07-16 Risk prevention method based on streaming big data analysis

Publications (1)

Publication Number Publication Date
CN110347669A true CN110347669A (en) 2019-10-18

Family

ID=68176516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910641683.1A Withdrawn CN110347669A (en) 2019-07-16 2019-07-16 Risk prevention method based on streaming big data analysis

Country Status (1)

Country Link
CN (1) CN110347669A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20191018