CN110347669A

CN110347669A - Risk prevention method based on streaming big data analysis

Info

Publication number: CN110347669A
Application number: CN201910641683.1A
Authority: CN
Inventors: 马涛
Original assignee: Chengdu Weijia Software Co Ltd
Current assignee: Chengdu Weijia Software Co Ltd
Priority date: 2019-07-16
Filing date: 2019-07-16
Publication date: 2019-10-18

Abstract

The present invention provides a risk prevention method based on streaming big data analysis. The method comprises: screening multidimensional data features from a credit-reporting data feature set; loading the data features and class labels of the training set into a DBN (deep belief network) classifier and training it; then loading the features of the test set and predicting the classes of the test data set according to the trained model, obtaining the classification prediction results for the test set and thereby completing the classification of user credit features based on feature selection. The proposed method reduces the number of user behavior features to be analyzed, eliminates redundancy between features, uses a more efficient classification model, effectively improves the speed and accuracy of credit evaluation, and is better suited to streaming computation over massive data.

Description

Risk prevention method based on streaming big data analysis

Technical field

The present invention relates to network security, and in particular to a risk prevention method based on streaming big data analysis.

Background technique

The development of Internet communication and big data technology provides a solid data and technical basis for determining user credit ratings. Studies have found that a user's Internet behavior is the realization of human behavior on the Internet as a carrier and is essentially consistent with social behavior; changes in a user's assets and business status are reflected in that user's network behavior. The social relationships revealed by network behavior data are also considered to be strongly correlated with user credit. User credit is therefore present not only in financial statements and mortgage business information, but also in unstructured data such as user behavior data and social relationships. Such data is generated continuously and fed into data analysis and mining engines. Compared with traditional data, streaming data is real-time, volatile, bursty, random, and unbounded. Because Internet services place high demands on system response time, this data generally has to be analyzed and computed in real time. How to improve the accuracy and timeliness of user credit computation in a massive Internet streaming data environment has therefore become a major problem to be solved in the field of big data analysis. Now that network scale is growing geometrically, the volume of data to be inspected is extremely large, and traditional network analysis and monitoring tools and platforms struggle to cope; storing and processing large amounts of social network data also consumes considerable resources and time. Moreover, as user behavior and social relationships grow more complex, existing methods cannot identify the behavioral features of risk users or control and manage dishonest users, and they introduce computational lag.

Summary of the invention

To solve the above problems of the prior art, the present invention proposes a risk prevention method based on streaming big data analysis, comprising:

screening multidimensional data features from a credit-reporting data feature set, and loading the data features and class labels of the training set into a DBN (deep belief network) classifier, wherein the DBN classifier has multiple hidden layers and different activation functions are used between the hidden layers;

training the DBN, then loading the features of the test set and predicting the classes of the test data set according to the trained model, thereby obtaining the classification prediction results for the test set;

loading the class labels of the test set and comparing them with the values predicted by the DBN classifier for evaluation;

wherein, when the selected credit-reporting data features are classified, a hyperplane is defined in an attempt to divide the data set into two classes, positive samples and negative samples. Assume a set of linearly separable two-class data samples (x_i, y_i), i = 1, 2, ..., n, where n is the number of samples and y_i ∈ {+1, -1}, satisfying the following condition:

$$y_i(\omega \cdot x_i) - 1 \ge 0,$$

where ω is the feature weight adjustment parameter. The classifier that minimizes ||ω||²/2 is the optimal classifier, so the problem of solving for the optimal classifier of the credit-reporting data is converted into a quadratic programming optimization problem:

where a_i is a Lagrange multiplier with a_i ≥ 0, subject to the constraint:

From the above solution, the optimal classifier function is obtained as:

where sgn is the sign function;

the classification of user credit features based on feature selection is completed according to the value of f(x);

if the optimal classifier cannot separate the two classes of points, a slack factor ξ_i ≥ 0 is introduced such that:

where Λ denotes the discrimination threshold of the generalized classifier and CP denotes the penalty factor. The generalized optimal classifier is then obtained, in which the above constraint on a_i is changed to:

$$0 \le a_i \le CP, \quad i = 1, 2, \ldots, n$$

For nonlinear classification problems, the associated data groups are mapped into a higher-dimensional space, and the problem is then solved by linear classification of the associated features. The corresponding classification function is:

where Φ denotes the function: Φ(x_i, x) = [(x·x_i) + 1]ξ_i.
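For orientation, the steps above follow the shape of the standard support vector machine derivation. The block below gives the textbook form written with the symbols already defined (ω, a_i, ξ_i, CP, Φ). It is a sketch under assumptions, not necessarily the exact expressions of the invention: the bias term b and the objective name Q(a) are notation introduced here for illustration only.

$$
\max_{a}\; Q(a) = \sum_{i=1}^{n} a_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j y_i y_j (x_i \cdot x_j),
\qquad \text{s.t.}\; \sum_{i=1}^{n} y_i a_i = 0,\; a_i \ge 0
$$

$$
f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} a_i y_i (x_i \cdot x) + b\right)
$$

$$
\min_{\omega,\,\xi}\; \frac{1}{2}\lVert\omega\rVert^2 + CP \sum_{i=1}^{n} \xi_i,
\qquad \text{s.t.}\; y_i(\omega \cdot x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0
$$

$$
f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} a_i y_i\, \Phi(x_i, x) + b\right)
$$

In this textbook form the soft-margin dual differs from the separable case only in the box constraint 0 ≤ a_i ≤ CP, which matches the change to a_i described above.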

Compared with the prior art, the present invention has the following advantages:

The invention proposes a risk prevention method based on streaming big data analysis that reduces the number of user behavior features to be analyzed, eliminates redundancy between features, uses a more efficient classification model, effectively improves the speed and accuracy of credit evaluation, and is better suited to streaming computation over massive data.

Brief description of the drawings

Fig. 1 is a flow chart of the risk prevention method based on streaming big data analysis according to an embodiment of the present invention.

Specific embodiment

A detailed description of one or more embodiments of the present invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with these embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.

One aspect of the present invention provides a risk prevention method based on streaming big data analysis. Fig. 1 is a flow chart of the risk prevention method based on streaming big data analysis according to an embodiment of the present invention.

The present invention detects social network behavior by monitoring the behavior records of social network users and generates behavior risk warning information. The user behavior records include social network session information and, optionally, user transaction records. For session information, port mirroring is performed at the egress of the cluster nodes and the session text is imported into a host used for security detection. The raw streaming data packets are captured, then decoded and preprocessed before being forwarded to the detection engine. Preprocessing includes session classification, fragment reassembly, and session restoration. After preprocessing, the headers and payloads of the streaming data packets are checked against the detection rules stored in the database and the predefined risk signature codes, so that risky behavior is identified and intercepted.

The text content of the social network session records includes, but is not limited to, instant messaging chat records, self-media posts, microblog or forum messages, comments on news websites, and reviews on e-commerce websites. These social network behavior records are only examples; the specific social network behaviors may differ in practice and are not specifically limited here.

In analyzing social network behavior records, risk behavior features are extracted using rule matching and a user-based behavior model. Social network behavior records are first obtained from the server; pattern matching is then performed on the log files according to the decision rules in the database, and redundant records generated by normal behavior are eliminated before credit evaluation, so that violations present in the records are identified and extracted.

Users generally jump between pieces of information through network relationships. These jump paths can represent the operations by which a user accesses a community website. By scanning the social network graph structure, a pair consisting of the user currently being analyzed and an associated user is established to represent the jump relationship between the two. The server logs are then analyzed to establish the paths and behaviors of the user across the pages actually visited.

Based on the behavior information and warning information provided above, statistics are compiled using risk statistics, vulnerability analysis, and availability analysis. The streaming-data training set and test set of social network user behavior are read separately. After standardization preprocessing, the training data and test data are reduced in dimension using principal component analysis to remove redundant data, forming the credit-reporting data feature set.
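The standardization and principal component analysis step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation of the invention: the function names are hypothetical, only the first principal component is extracted (by power iteration), and the standardization statistics are fitted on the training set and reused for the test set, which is an assumption about how the two sets are treated.

```python
import math

def fit_standardizer(rows):
    """Return per-column means and standard deviations of the training rows."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def standardize(rows, means, stds):
    """Scale every column to zero mean and unit variance."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def covariance(rows):
    """Covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]

def top_component(cov, iterations=200):
    """Leading eigenvector of the covariance matrix via power iteration."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-uniform start
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def project(rows, component):
    """Project each row onto the principal component (d columns -> 1)."""
    return [sum(a * b for a, b in zip(r, component)) for r in rows]

# Hypothetical toy data: two perfectly correlated behavior features.
train = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
test = [[2.5, 5.0]]
means, stds = fit_standardizer(train)
component = top_component(covariance(standardize(train, means, stds)))
train_reduced = project(standardize(train, means, stds), component)
test_reduced = project(standardize(test, means, stds), component)
```

Because the two toy features carry the same information, a single component retains all of the variance, which is the kind of redundancy the dimension reduction is meant to remove.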

The credit-reporting data features may be selected from one or more of the following streaming data sets: historical credit features, such as the user's payment and repayment history on financial websites and purchase, exchange, and order cancellation records on shopping websites; social relationship features, namely the credit data of other users associated with the user in the social network, together with the closeness, depth, and breadth of the user's contacts with those users, such as relationship duration and session ratio; behavior preference features, namely statistics on user behavior patterns compiled from the types, time periods, and frequency of the web pages or applications the user accesses and from social network evaluation information; and identity attribute features, namely personal identity attributes predicted from the user's network behavior, including age, occupation, marital status, and education level, which are checked for consistency with the basic information entered by the user. The above feature information is only an example; the number of features contained in a sample in practice may be more or fewer than shown, the specific features may differ, and they are not specifically limited here.

Multidimensional data features are screened from the credit-reporting data feature set, and the data features and class labels of the training set are loaded into the DBN (deep belief network) classifier for training; the features of the test set are then loaded and their classes predicted, giving the classification prediction results for the test set. The DBN classifier has multiple hidden layers, and different activation functions are used between the hidden layers. The training data yields a trained model after passing through the DBN classifier; the test data is then loaded and the classes of the test data set are predicted according to the trained model. Obtaining the prediction results completes the machine-learning-based detection of credit grade. Finally, the class labels of the test set are loaded and compared with the values predicted by the DBN classifier for evaluation.

When the selected credit-reporting data features are classified, preferably a hyperplane is defined in an attempt to divide the data set into two classes, positive samples and negative samples. Assume a set of linearly separable two-class data samples (x_i, y_i), i = 1, 2, ..., n, where n is the number of samples and y_i ∈ {+1, -1}, satisfying the following condition:

$$y_i(\omega \cdot x_i) - 1 \ge 0,$$

From the above solution, the optimal classifier function is obtained as:

where sgn is the sign function.

Λ denotes the discrimination threshold of the generalized classifier and CP denotes the penalty factor, from which the generalized optimal classifier can be obtained. The dual problem of the generalized optimal classifier is the same as in the linearly separable case, except that the constraint on a_i is changed to:

$$0 \le a_i \le CP, \quad i = 1, 2, \ldots, n$$

For nonlinear classification problems, the associated data groups are mapped into a higher-dimensional space, and the problem is then solved by linear classification of the associated features. The corresponding classification function is:

where Φ denotes the function: Φ(x_i, x) = [(x·x_i) + 1]ξ_i.

The classification of user credit features based on feature selection is thus completed according to the value of f(x).
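To make the role of f(x) concrete, the following sketch evaluates a kernel classifier of the textbook form sgn(Σ a_i y_i Φ(x_i, x) + b). Several things here are assumptions for illustration: the kernel is taken to be a polynomial kernel [(x·x_i) + 1]^degree, the bias b and the parameter name degree are not from the text, and the multipliers in the example are hand-picked values rather than the solution of the quadratic program.

```python
def poly_kernel(u, v, degree=1):
    """Assumed polynomial kernel: [(u . v) + 1] ** degree."""
    return (sum(a * b for a, b in zip(u, v)) + 1.0) ** degree

def classify(x, support_vectors, labels, multipliers, bias=0.0, degree=1):
    """Return +1 or -1 from the sign of the kernel expansion."""
    score = sum(a * y * poly_kernel(sv, x, degree)
                for sv, y, a in zip(support_vectors, labels, multipliers))
    return 1 if score + bias >= 0 else -1

# Hypothetical support vectors with illustrative (untrained) multipliers.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [+1, -1]
multipliers = [0.5, 0.5]

good_credit = classify([2.0, 2.0], support_vectors, labels, multipliers)
bad_credit = classify([-2.0, -2.0], support_vectors, labels, multipliers)
```

A sample is assigned to the positive or negative credit class purely by the sign of the score, so only the samples with non-zero multipliers influence the result.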

Optionally, after the credit-reporting feature information has been extracted, the features of the behavior data of identified risk users and of the user currently being analyzed are mapped by deep learning into a newly created feature space suited to risk identification. The similarity scores of the two are weighted and averaged in this space, yielding the behavioral feature similarity between the user being analyzed and the risk users.

In the training stage on streaming data, a deep neural network trained in advance on a large credit-reporting behavior library extracts the credit-reporting feature information; the training sample set comes from a social network session data set and a financial transaction data set. In this deep neural network structure, each convolutional layer is immediately followed by an activation function layer, which converts linear inputs into nonlinear outputs. The output of a hidden layer node is:

$$h_i(x) = \max_{j \in [1, k]} \left[ x^T W_{ij} + b_{ij} \right]$$

where W_ij denotes the value of the node in column i, row j of the feature matrix and b_ij denotes the balance factor of that node. Each hidden layer unit corresponds to k sub-hidden-layer nodes, and the largest of their k output values is taken as the output of the activation function. With 2 activation function nodes, the number of channels is halved after each convolutional layer.
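The activation described above has the shape of a maxout unit. The sketch below shows both the per-unit form, in which each hidden unit keeps the largest of its k affine outputs, and the channel-wise form with k = 2, which halves the channel count. The function names and the nested-list layout of the weights are assumptions made for this illustration.

```python
def maxout(x, weights, biases):
    """weights[i][j] is the weight vector of sub-node j of hidden unit i;
    biases[i][j] is the matching scalar. Returns one value per hidden unit."""
    outputs = []
    for unit_weights, unit_biases in zip(weights, biases):
        pieces = [sum(xk * wk for xk, wk in zip(x, w)) + b
                  for w, b in zip(unit_weights, unit_biases)]
        outputs.append(max(pieces))
    return outputs

def channel_maxout(channels, k=2):
    """Take the maximum over each group of k consecutive channels."""
    return [max(channels[i:i + k]) for i in range(0, len(channels), k)]

# Two hidden units, each with k = 2 sub-nodes (hypothetical values).
x = [1.0, 2.0]
weights = [[[1.0, 0.0], [0.0, 1.0]],
           [[-1.0, -1.0], [0.5, 0.0]]]
biases = [[0.0, 0.0], [0.0, 1.0]]
hidden = maxout(x, weights, biases)

# Four channels reduced to two.
reduced = channel_maxout([1.0, 5.0, 3.0, 2.0], k=2)
```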

In the pooling layer, the maximum feature mean within a neighborhood is taken as the new feature value output for that neighborhood, preserving behavioral context information. Specifically, after the convolution operation and the activation function, the behavior feature data undergoes max pooling and average pooling separately, and the two pooling results are concatenated and output as the new feature.
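A one-dimensional sketch of this pooling scheme follows. The window size, the non-overlapping stride, and the function names are assumptions chosen for illustration; the text does not specify them.

```python
def pool1d(values, size, reducer):
    """Apply reducer to consecutive non-overlapping windows of length size."""
    return [reducer(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

def max_avg_pool(values, size=2):
    """Concatenate the max-pooled and average-pooled outputs."""
    pooled_max = pool1d(values, size, max)
    pooled_avg = pool1d(values, size, lambda w: sum(w) / len(w))
    return pooled_max + pooled_avg

# Hypothetical activated feature values.
features = max_avg_pool([1.0, 3.0, 2.0, 6.0], size=2)
```

Concatenating both results keeps the strongest response of each neighborhood alongside its average level, so the output has twice as many values as either pooling alone.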

To obtain a network model whose feature representation is good enough for extracting credit-reporting features, the mapping space that the deep neural network learns from a limited number of data samples should make intra-class distances as small as possible and inter-class distances as large as possible. A clustering constraint is therefore added to the computation of the cost function L, so that data of the same class gathers together and data of different classes moves apart:

where m is the number of clusters, x_i is a sample feature vector in the i-th cluster, W^T is the transpose of the regression matrix, λ is the weight decay parameter, and c_xi is the adjustment factor of the feature vector x_i.

Risk prevention involves not only identifying behaviors such as transaction defaults by users, but also the risk of fraud between individual users in the social network. Fraudulent users typically disguise themselves as normal users to gain the trust of others: they carry out false promotion over the network, invest a certain amount of time and real benefits early on to win the victim's trust, and then quickly disappear after obtaining illegal gains, or accumulate money illegally under pretexts such as fake-order brushing and crowdfunding. The key to identifying users who commit fraud lies in extracting and representing their behavioral features. The present invention represents user behavior by a credit-reporting sequence cluster that contains a set of feature sequences; no annotation or prior knowledge of the behavioral structure is needed, and the feature sequence set is classified and learned automatically.

First, the user's network behavior is decomposed into basic feature sequences; second, the feature sequences are transformed into index sequences. A training behavior set {(V_n, y_n)}, n = 1, 2, ..., N, is obtained, where V_n is the behavior set of a user and y_n ∈ [1, 2, ..., C] is the operation feature type label. N is the number of user operations and C is the number of types. For example, the features used to analyze user fraud include the user's social network topology parameters, the friendship duration, the number of friends added and the number of friends deleted within a preset time, the ratio of friendship duration to the time of fund transfers, and, for transfers of the same order of magnitude, the ratio of the number of fund-transfer friends that were deleted to the total number of fund transfers.

The behavior V_n is then expressed as a feature sequence X_n, defined as follows:

$$X_n = [X_{1,n}, \ldots, X_{i,n}, \ldots, X_{l_n,n}]$$

where X_i,n is the feature set computed for the i-th time period and l_n is the number of time periods in V_n.

A feature sequence set is expressed as μ = {p_i | i = 1, ..., N_p}, where N_p is the number of feature sequences in the set. The i-th feature sequence p_i is defined as {X_i, τ_i},

where τ_i is the detection threshold.

To compute X_i, a matrix transformation is first applied to all training feature sequences {X_1, ..., X_N} to obtain representative time periods and an index that clusters all time periods. The transformation matrix A is expressed as:

where the two vectors in the formula are the Fisher vectors describing type t in X_i,n and X_k,m, respectively.

Then, for the i-th credit-reporting sequence cluster, the training data sequences are established by setting the detection threshold τ_i, which prevents noisy sequence patterns from being mined.

A feature sequence X_n is converted into an index sequence, expressed as

$$I_n = [I_{1,n}, \ldots, I_{i,n}, \ldots, I_{l_n,n}]$$

where I_i,n is the index of the i-th feature sequence. X_n is processed with the feature sequence detection model, and the index I_i,n is chosen so that the SVM response is maximized.

From the trained index sequences [I_1, I_2, ..., I_N], a feature sequence set R is obtained by a data mining algorithm. R represents the local feature structure of a user operation, and its j-th sequence R_j is defined as follows:

R_j={ c_j, s_j, x_j, w_j}

where c_j ∈ [1, 2, ..., C] is the operation feature type label; s_j is the sequence pattern; x_j is the feature of the feature sequence set; and w_j is the weight of R_j in operation feature type c_j.

To compute s_j, the training data indexes are first collected, and the sequence patterns are then computed from the collected training index sequences. Because the same sequence pattern may be mined from two operation feature classes, a weight w_j is set: for pattern s_j, w_j denotes the relative support rate of s_j. If the same pattern occurs in two or more operation feature types, the weights of the corresponding feature sequence sets are reduced. If a pattern appears in only one type, its weight takes the maximum value 1.

Each feature sequence set represents a specific operation type and preserves the temporal relationships among its feature sequences. Owing to the diversity of types, complex features can be modeled effectively. Given a test behavior V_T and the feature sequence set R, the evaluation function of an operation feature c can be expressed as:

where α_j,c, β_j,c, and γ_j,c are the parameters of the j-th feature sequence set in operation feature type c; N_R is the number of feature sequence sets; I_T is the sequence index; X_T is the feature sequence of V_T; and σ(I_T, s_j) is the sequence credit-reporting feature. σ(I_T, s_j) is used to compute the structural similarity between the test behavior and the feature sequence set. Let the initial values be F(n, 0) = 0 for n ∈ [0, L] and F(0, m) = -m for m ∈ [0, m_j], where L is the number of time periods in I_T and m_j is the length of the sequence pattern s_j. The matching matrix F is then defined as follows:

$$F(n, m) = \max\{-1 + A(X_{n,T}, X_{m,j}),\ F(n-1, m),\ F(n, m-1)\}$$

The sequence credit-reporting feature aligns a long sequence, namely the test sequence representing the whole operation feature structure, with a short sequence that describes part of the structure of an operation feature. When s_j matches the test sequence, σ(I_T, s_j) attains its maximum score:

$$\sigma(I_T, s_j) = \max_{n} \left( F(n, m_j) / m_j \right)$$
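The matching matrix can be filled by dynamic programming. The sketch below is one possible reading and rests on stated assumptions: it uses the initial values given above, but interprets the recurrence as the common semi-global alignment form, in which a match extends the diagonal cell F(n-1, m-1) by the similarity A, skipping a test period is free, and skipping a pattern element costs 1. It also assumes a hypothetical similarity A that is 1 for equal indexes and 0 otherwise, and a non-empty pattern.

```python
def match_score(test_seq, pattern, similarity=None):
    """Return sigma = max_n F(n, m_j) / m_j for a test index sequence
    and a sequence pattern (assumed semi-global alignment recurrence)."""
    if similarity is None:
        # Hypothetical stand-in for A: exact index match.
        similarity = lambda u, v: 1.0 if u == v else 0.0
    L, m_j = len(test_seq), len(pattern)
    F = [[0.0] * (m_j + 1) for _ in range(L + 1)]   # F(n, 0) = 0
    for m in range(m_j + 1):
        F[0][m] = -float(m)                          # F(0, m) = -m
    for n in range(1, L + 1):
        for m in range(1, m_j + 1):
            F[n][m] = max(
                F[n - 1][m - 1] + similarity(test_seq[n - 1], pattern[m - 1]),
                F[n - 1][m],          # skip a period of the test sequence
                F[n][m - 1] - 1.0,    # skip an element of the pattern
            )
    return max(F[n][m_j] for n in range(L + 1)) / m_j

full = match_score([3, 1, 2, 5], [1, 2])     # pattern fully contained
partial = match_score([1, 9], [1, 2])        # only the first element matches
none = match_score([7, 8, 9], [1, 2])        # nothing matches
```

Under these assumptions the score is 1 when the whole pattern occurs in order inside the test sequence and falls toward 0 as fewer pattern elements can be aligned.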

In the understanding and recognition stage for credit-reporting data features, to recognize the sequence credit-reporting feature σ(I_T, s_j) accurately and quickly, a hierarchical analysis algorithm is used that makes the rank of the within-class scatter matrix S_w as small as possible and the rank of the between-class scatter matrix S_b as large as possible, so as to achieve the best classification performance. The Fisher function J is computed as:

where the argument of J is an n-dimensional vector. The vector that maximizes J is chosen as the projection direction, so that after projection S_b is maximal and S_w is minimal. A group of optimal discriminant vectors is selected to build the projection matrix W, expressed as:

Finally, in the learning based on hierarchical analysis, PCA is used to reduce the dimensionality of the projection matrix W and eliminate redundant feature information, completing the identification of risk user features.
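For orientation, the classical Fisher criterion built from the two scatter matrices has the following form. This is the textbook expression, given as an assumption about the shape of J rather than as the exact formula of the invention; the symbols φ (the n-dimensional projection vector) and d (the number of discriminant vectors kept) are introduced here for illustration.

$$
J(\varphi) = \frac{\varphi^{T} S_b\, \varphi}{\varphi^{T} S_w\, \varphi},
\qquad
W = [\varphi_1, \varphi_2, \ldots, \varphi_d]
$$

Maximizing this ratio favors directions along which the classes are far apart relative to their internal spread, which is the behavior the passage above describes.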

After credit-reporting risk identification has been completed for some users, the credit evaluation of other, new users can use a deep network to analyze whether the behavior pattern of the new user is similar to that of risk users, thereby identifying the features of risk users and of the user currently being analyzed. Specifically, the feature sample pair (x_f, x_c) of a confirmed risk user and the user being analyzed is first recorded, where x_f and x_c denote the credit-reporting feature vectors of the risk user and of the user being analyzed, respectively. The goal of deep learning is to find a mapping function f such that f(x_f) and f(x_c) satisfy the following relationship in the newly created feature space: when the new user being analyzed has behavior pattern features similar to those of the risk user, the distance between f(x_c) and f(x_f) is as small as possible; when the user does not, the distance between f(x_f) and f(x_c) is as large as possible.

To simplify the problem further, a convolutional network is trained before the deep learning algorithm. By learning a set of hierarchical nonlinear transformations, the feature sample pairs are projected into the newly created feature space, in which positive sample pairs are greater than a preset threshold and negative sample pairs are less than that threshold, so that the deep network can make accurate judgments.

Assume the deep network has M layers in total and that layer m has p(m) neurons, where m = 1, 2, 3, ..., M. For a given user behavior feature vector, the output of layer m is:

$$h^{(m)} = \tanh\left(W^{(m)} h^{(m-1)} + b^{(m)}\right)$$

where W^(m) is the weight parameter of layer m and b^(m) is the bias of layer m. Passing x_f and x_c through the above M layers of nonlinear transformation gives:

F(x_f) = h_f^(M) and F(x_c) = h_c^(M). The distance between the risk user and the user being analyzed in the new feature space is: d²_fc(x_f, x_c) = ||F(x_f) - F(x_c)||²

The behavior pattern similarity measure between the user and the risk user should then satisfy:

if d²_fc(x_f, x_c) < τ - 1, then x_f and x_c have similar behavior patterns;

if d²_fc(x_f, x_c) > τ + 1, then x_f and x_c do not have similar behavior patterns;

where τ is the configured risk distance threshold, so that positive and negative samples are well separated in the newly created feature space.
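The mapping and the distance test can be sketched as follows. The layer weights, the value of τ, and the "undecided" outcome for distances that fall between τ - 1 and τ + 1 are assumptions made for this illustration; in the method the weights would come from training.

```python
import math

def forward(x, layers):
    """Apply h = tanh(W h + b) for each (W, b) pair in layers."""
    h = x
    for W, b in layers:
        h = [math.tanh(sum(w * v for w, v in zip(row, h)) + bias)
             for row, bias in zip(W, b)]
    return h

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def compare(x_f, x_c, layers, tau):
    """Classify a (risk user, analyzed user) pair by mapped distance."""
    d2 = squared_distance(forward(x_f, layers), forward(x_c, layers))
    if d2 < tau - 1:
        return "similar"
    if d2 > tau + 1:
        return "dissimilar"
    return "undecided"  # inside the margin around tau

# Hypothetical single-layer network with identity weights.
layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
tau = 1.5

same = compare([0.5, 0.5], [0.5, 0.5], layers, tau)
far = compare([3.0, 3.0], [-3.0, -3.0], layers, tau)
between = compare([1.0, 0.0], [-0.5, 0.0], layers, tau)
```

The band of width 2 around τ acts as a margin: pairs are only labeled when their mapped distance is clearly below or clearly above the threshold.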

The optimization objective function is then set as

where

β is the adjustment operator. The weight parameters W and biases b are obtained from the above formula using the stochastic gradient descent algorithm.

Deep learning yields the new feature representation pair (x'_f, x'_c), and a similarity algorithm then gives the behavior pattern similarity S_fc(x'_f, x'_c) between the user being analyzed and a given risk user:

This is the final user similarity estimate, where x'_fi and x'_ci are the i-th components of the feature vectors x'_f and x'_c, respectively, and D is the dimension of the feature vectors.

If the behavior pattern similarity between the user being analyzed and a risk user is greater than the preset threshold, the user being analyzed is identified as a dishonest user.

In addition to the above features based on the social network graph, user credit evaluation also requires semantic analysis of session content. Some advertising users, for example, repeatedly send similar content at high frequency to attract legitimate users, and use tools to republish content that expresses the same meaning in different words. This makes them harder to distinguish from normal users. For this reason, the embodiment of the present invention computes the local centrality of each user in the social network topology and identifies risk users disguised as normal users.

In the social network, users are represented by nodes and social relationships by edges. An edge a = (i, j) directed from node V_i to node V_j indicates that there is at least one session between users i and j. Even if dishonest users change their own attributes, it is difficult for them to change their position in the social network topology. The following features of user nodes are therefore computed on the basis of this topology.

The local centrality of a node is the degree to which the associated energy of the network decreases after the node is removed from the network. Local centrality takes into account not only local density information but also bottleneck information. The associated energy of a topological graph is defined as:

E_L(G)=∑ θ²

where θ ranges over the eigenvalues of the Kirchhoff (Laplacian) matrix of graph G, whose sum equals the sum of all vertex out-degrees. Let A(G) be the adjacency matrix of G and D(G) the diagonal matrix of vertex out-degrees. The Kirchhoff matrix of G is L(G) = D(G) - A(G).

For a topological graph G with n vertices whose out-degrees are d_1, d_2, ..., d_n, the associated energy can be expressed in terms of these out-degrees and reflects the connectivity within the graph. Removing a vertex from the graph reduces its associated energy, and the amount by which E_L(G) decreases reflects the importance of that vertex in the graph. Let H be the graph obtained by removing vertex v from G. The local centrality of vertex v is:

C_v=E_L(G)-E_L(H)
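The drop in energy can be computed without an eigenvalue solver in a simplified setting. The sketch below assumes an undirected, unweighted graph without self-loops (the text speaks of out-degrees, so this is a simplification). In that setting the sum of squared Laplacian eigenvalues equals the trace of L², which works out to the sum over vertices of d_i² + d_i. The adjacency-dictionary layout and the function names are hypothetical.

```python
def laplacian_energy(adj):
    """adj maps each node to the set of its neighbours (undirected graph).
    Sum of squared Laplacian eigenvalues = sum of d_i^2 + d_i."""
    return sum(len(nb) ** 2 + len(nb) for nb in adj.values())

def remove_node(adj, v):
    """Return the graph H obtained by deleting v and its incident edges."""
    return {u: nb - {v} for u, nb in adj.items() if u != v}

def local_centrality(adj, v):
    """C_v = E_L(G) - E_L(H)."""
    return laplacian_energy(adj) - laplacian_energy(remove_node(adj, v))

# Hypothetical star-shaped network: one hub connected to three leaves.
graph = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}
hub_score = local_centrality(graph, "hub")
leaf_score = local_centrality(graph, "a")
```

A weakly connected node such as a leaf produces a much smaller energy drop than the hub, which is the property used to separate peripheral accounts from well-embedded ones.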

The social network structure of dishonest users is unstable, and their relationships with neighboring nodes are weak. Removing these dishonest users, whose social relationships are unimportant, from the network therefore reduces the network energy only slightly.

Dishonest users act out of specific commercial interests, and the session content they publish tends to be highly similar, containing large amounts of repeated content and harmful links. The session text in the streaming data is therefore first decomposed into phrases, and the semantic distances between these phrases are then computed using bag-of-words analysis. Closed bag-of-words feature sets are used to compute content similarity. Each feature set contains a list of words with similar meanings. By checking the similarity of these words, the similarity of the whole content is obtained, and the similarity between the session contents published by each user is then tallied.

After the local centrality of each user and the similarity between the session contents published by each user have been obtained, risk discrimination thresholds are set to filter out the user nodes whose local centrality is below the preset centrality threshold and whose session content similarity is above the preset similarity threshold; these nodes are identified as dishonest users.
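One way to realize a closed bag-of-words comparison is to map each word to the feature set that contains it and compare the resulting count vectors. In the sketch below the synonym lists are invented examples, whitespace splitting stands in for phrase decomposition, and cosine similarity is an assumed choice of measure; none of these specifics come from the text.

```python
import math
from collections import Counter

# Hypothetical closed feature sets, each listing words of similar meaning.
FEATURE_SETS = [
    {"cheap", "discount", "bargain"},
    {"buy", "purchase", "order"},
    {"click", "visit", "open"},
]

def to_bag(text):
    """Count how often each feature set is hit by the words of a message."""
    bag = Counter()
    for word in text.lower().split():
        for index, words in enumerate(FEATURE_SETS):
            if word in words:
                bag[index] += 1
    return bag

def content_similarity(text_a, text_b):
    """Cosine similarity between the two bag-of-words vectors."""
    a, b = to_bag(text_a), to_bag(text_b)
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

reworded = content_similarity("buy cheap now", "purchase discount today")
unrelated = content_similarity("buy cheap now", "hello world")
```

Because words are compared at the level of feature sets, two messages that say the same thing in different words still score as near-identical.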

In the promise breaking transaction for bringing risk to system, significant proportion is because of superior node of the user in chain transaction Credit it is lower caused by.The present invention carries out risk diffusion identification further directed to the risk diffusion behavior in chain transaction.According to The credit that the average value of the past period user's All Activity sets the user passively reduces threshold value.There are more transactions when simultaneously When consider influence of the network structure to diffusion.Network G (V, E) is established with real trade data.Node V indicates All Activity user Set.Wherein S (x) is the set of devoid of risk user, and I (x) is the set of risky user.Node E indicates user in network Between the set traded.Side E_ijOn weight be denoted as { a_ij, indicate the number traded between user.The state for remembering user i is n_i, n_i=1 indicates promise breaking, n_i=0 indicates not break a contract；Trade E between note user_ijState be e_ij, e_ij=1 indicate this user it Between transaction it is abnormal, e_ij=0 indicates that transaction is normal.It is d that the credit of user j, which passively reduces number,_j=Σ A_ija_ije_ijIf user Credit passively reduces threshold value distribution and is denoted as { δ_i, and credit passively reduces number { F_i, risk diffusion to credit passively reduce The collection of user is combined into Risk (x).Diffusion process description are as follows:

A) All users are initialized in the normal state (S), and a portion of the users are randomly switched to the risk state (I); that is, a portion of the n_i are randomly changed from 0 to 1, and a certain transaction E_ij of each of these users defaults, so that e_ij = 1.

B) The number of defaults is added to the superior user. Once the superior user's passive credit reduction count exceeds the given threshold, that is, when d_j = Σ_i A_ij a_ij e_ij > δ_j, that user changes from S to I.

C) The number of times F_i that each user i is reached by diffusion and enters the risk state is recorded. When the set Risk(x) of users whose credit has been passively reduced is the same after two consecutive diffusion rounds, the diffusion process ends.
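
Steps A) to C) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the initially risky users are passed in explicitly rather than drawn at random so that the run is reproducible, A_ij is taken to be 1 for every recorded transaction pair, and every transaction of a risky user is assumed to default. The function name `diffuse` and the sample data are hypothetical.

```python
def diffuse(transactions, thresholds, seeds):
    """Spread risk along transactions until the passively reduced set is stable.

    transactions: dict mapping (i, j) -> a_ij, the number of transactions from
                  user i to its superior user j (A_ij = 1 for every key).
    thresholds:   dict mapping user j -> delta_j.
    seeds:        users initially switched from state S to state I.
    """
    risky = set(seeds)          # users currently in state I (n_i = 1)
    passively_reduced = set()   # Risk(x): users that became risky via diffusion
    while True:
        # Assumption: every transaction of a risky user i defaults (e_ij = 1),
        # so d_j accumulates a_ij over all risky subordinates i of user j.
        d = {}
        for (i, j), a_ij in transactions.items():
            if i in risky:
                d[j] = d.get(j, 0) + a_ij
        newly = {j for j, d_j in d.items()
                 if d_j > thresholds[j] and j not in risky}
        if not newly:           # Risk(x) unchanged between two rounds: stop
            return passively_reduced
        risky |= newly
        passively_reduced |= newly


# Hypothetical transaction chain: a -> b -> c -> d, plus a direct a -> c edge.
transactions = {("a", "b"): 3, ("b", "c"): 2, ("a", "c"): 1, ("c", "d"): 1}
thresholds = {"a": 1, "b": 2, "c": 2, "d": 5}
risk_set = diffuse(transactions, thresholds, seeds={"a"})
```

In this example user c is not affected in the first round, because the single default from a stays below its threshold, but it is pushed over the threshold in the second round once b has also entered the risk state. Repeating such runs with different random seeds and counting how often each user ends up in the returned set would be one way to obtain the counts F_i.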

Default transactions are extracted from the transaction network to form a sub-network, and the passive credit reduction counts F_i are sorted in descending order. The top X users with the highest counts F_i in the sub-network are selected as the identification result of high-risk propagating users, where X is a preset quantity threshold.
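
The selection of the top X users can be sketched as below. The counts F_i are assumed to be already available as a mapping from user to count; the function name and the sample values are hypothetical.

```python
import heapq


def top_risk_propagators(reduction_counts, x):
    """Return the x users with the highest passive credit reduction count F_i."""
    return heapq.nlargest(x, reduction_counts, key=reduction_counts.get)


# Hypothetical counts F_i for users in the default sub-network.
counts = {"u1": 4, "u2": 9, "u3": 1, "u4": 7}
top_two = top_risk_propagators(counts, 2)
```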

To further restrict violations by risky users and untrustworthy users, and to reduce the adverse effect of illegal operations on normal users and on the social network environment, the present invention further implements specific restriction or control strategies after the identification stage of the streaming data is completed. The restriction or control strategies include, but are not limited to, the following:

1: Restricting the user from extending their social circle. When the user attempts to search for other new users, or when the system recommends new friends to the user, the number of objects visible to the user or pushed by the system is reduced. Specifically, the new users that would originally be recommended to the user are sorted by risk value from low to high, and the preset proportion of new users with the highest system trust level is hidden. This limits the influence of risky users on normal users.

2: Marking the user. If the user produces no violations within a certain period but defaults again after that period, the user is again identified as an untrustworthy user and is placed in a risk level higher than the original control level.

3: If a user is determined to be a malicious defaulter, mandatory control measures are taken, including freezing the account, to prevent the user from continuing to affect the social network environment.
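
Strategy 1 can be illustrated with a short Python sketch. It assumes that the lowest risk value corresponds to the highest system trust level, and that the hidden share is rounded up to a whole number of users; the function name, the ratio and the sample risk values are hypothetical.

```python
import math


def visible_recommendations(candidates, hide_ratio):
    """Hide the most trusted share of candidates from a restricted user.

    candidates: dict mapping a recommendable new user -> risk value.
    hide_ratio: preset proportion (0..1) of the most trusted users to hide.
    """
    ordered = sorted(candidates, key=candidates.get)   # ascending risk value
    hidden = math.ceil(len(ordered) * hide_ratio)      # most trusted come first
    return ordered[hidden:]


# Hypothetical recommendation candidates for a restricted user.
candidates = {"n1": 0.05, "n2": 0.40, "n3": 0.10, "n4": 0.75}
shown = visible_recommendations(candidates, hide_ratio=0.5)
```

Here the two most trusted candidates, n1 and n3, are withheld, so the restricted user only sees n2 and n4.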

In conclusion reducing the invention proposes a kind of risk prevention method based on streaming big data analysis wait divide The quantity for analysing user behavior characteristics, eliminates the redundancy between feature, the disaggregated model of use more highly effective effectively improves The speed of credit evaluation and the accuracy of credit evaluation have better adapted to the streaming computing scene of mass data.

Obviously, it should be appreciated by those skilled in the art, each module of the above invention or each steps can be with general Computing system realize that they can be concentrated in single computing system, or be distributed in multiple computing systems and formed Network on, optionally, they can be realized with the program code that computing system can be performed, it is thus possible to they are stored It is executed within the storage system by computing system.In this way, the present invention is not limited to any specific hardware and softwares to combine.

It should be understood that above-mentioned specific embodiment of the invention is used only for exemplary illustration or explains of the invention Principle, but not to limit the present invention.Therefore, that is done without departing from the spirit and scope of the present invention is any Modification, equivalent replacement, improvement etc., should all be included in the protection scope of the present invention.In addition, appended claims purport of the present invention Covering the whole variations fallen into attached claim scope and boundary or this range and the equivalent form on boundary and is repairing Change example.

Claims

1. A risk prevention method based on streaming big data analysis, characterized by comprising:

screening multidimensional data features from a credit reporting data feature set, and loading the data features and types of a training set into a deep belief network (DBN) classifier, the DBN classifier having multiple hidden layers, with different activation functions used for computation between the hidden layers;

training with the DBN, then loading the features of a test set and predicting their types, predicting the types of the test data set according to the result of the training, and obtaining the data classification prediction result of the test set;

wherein, when classifying the selected credit reporting data features, a hyperplane is defined in an attempt to divide the data set into two classes, positive samples and negative samples; it is assumed that there is a data sample set of two linearly separable classes: (x_i, y_i), i = 1, 2, ..., n, where n is the number of samples and y_i ∈ {+1, -1}, satisfying the following condition:

y_i(ω·x_i) - 1 ≥ 0,

where ω is the feature weight adjustment parameter; the classifier that minimizes ||ω||²/2 is the optimal classifier, and the problem of solving the optimal classifier for the credit reporting data is converted into a quadratic programming optimization problem:

sgn is the sign function;

Λ denotes the discrimination threshold of the generalized classifier; CP denotes the penalty factor; the generalized optimal classifier is obtained, wherein the above constraint on a_i is changed to:

0 ≤ a_i ≤ CP, i = 1, 2, ..., n

for a linear classification problem, the relevant associated data groups are mapped; after being mapped to a higher-dimensional space, the problem is solved through linear classification of the associated features; the classification function corresponding to the features is then:

Φ denotes the function: Φ(x_i, x) = [(x·x_i) + 1]ξ_i.