CN110334780A - Streaming big data security processing - Google Patents
- Publication number
- CN110334780A, CN201910641669.1A
- Authority
- CN
- China
- Prior art keywords
- user
- node
- degree
- similarity
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- General Business, Economics & Management (AREA)
- Tourism & Hospitality (AREA)
- Strategic Management (AREA)
- Primary Health Care (AREA)
- Marketing (AREA)
- Economics (AREA)
- Computing Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a streaming big data security processing method. The method comprises: calculating the local centrality feature of each user node based on the social network topology; calculating the similarity between the session contents that each user publishes; and, by setting risk discrimination thresholds, identifying as untrustworthy users those user nodes whose local centrality is below a preset threshold and whose session content similarity is above a preset threshold. The proposed method reduces the number of user behavior features to be analyzed, eliminates redundancy between features, and uses a more efficient classification model, which improves both the speed and the accuracy of credit evaluation and makes the method better suited to streaming computation over massive data.
Description
Technical field
The present invention relates to network security, and in particular to a streaming big data security processing method.
Background
The development of Internet communication and big data technology provides a solid data and technical foundation for determining user credit ratings. Studies have found that a user's online behavior is the realization of human behavior on the Internet as a carrier; it is essentially consistent with social behavior, and changes in a user's assets and business status are reflected in that user's network behavior. The social relationships revealed by network behavior data are also considered to be strongly correlated with user credit. User credit is therefore present not only in financial statements and mortgage business information, but also in unstructured data such as user behavior data and social relationships. Such data is generated continuously and fed into data analysis and mining engines. Compared with traditional data, streaming data is real-time, volatile, bursty, random, and unbounded. Because Internet services place high demands on system response time, this data generally has to be analyzed and computed in real time. In a massive Internet streaming data environment, improving the accuracy and timeliness of user credit calculation has therefore become a major open problem in big data analysis. As network scale grows geometrically, the volume of data to be inspected is enormous, and traditional network analysis and monitoring tools and platforms can hardly cope; storing and processing large amounts of social network data also consumes considerable resources and time. As user behavior and social relationships become increasingly complex, existing methods cannot identify the behavioral features of risky users or control and manage untrustworthy users, and they introduce computation lag.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes a streaming big data security processing method, comprising:
Calculating the local centrality feature of each user node based on the social network topology, in which nodes represent users and edges represent the social relationships between users;
The local centrality of a node indicates the degree to which the associated energy of the network decreases after the node is removed from the network;
The associated energy of a social network graph G is defined as:
$E_L(G) = \sum_i \theta_i^2$
where $\theta_i$ denotes the eigenvalues of the Kirchhoff matrix of the graph G;
The Kirchhoff matrix of the graph G is $L(G) = D(G) - A(G)$;
where $A(G)$ is the adjacency matrix of G and $D(G)$ is the diagonal matrix of vertex out-degrees.
For a topology graph G with n nodes whose out-degrees are $d_1, d_2, \ldots, d_n$, the associated energy is:
[formula not reproduced in the source text]
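The definitions above imply a closed form in terms of the degrees. As a hedged reconstruction from those definitions alone (not necessarily the expression in the original filing), the sum of squared eigenvalues of any square matrix equals the trace of its square, which gives, for a graph without self-loops and with adjacency entries $a_{ij}$:

```latex
E_L(G) = \sum_{i=1}^{n} \theta_i^2
       = \operatorname{tr}\!\left(L(G)^2\right)
       = \sum_{i=1}^{n} d_i^2 + \sum_{i \neq j} a_{ij}\, a_{ji}
```

In the special case of an undirected, unweighted graph, $a_{ij} = a_{ji} \in \{0, 1\}$, so the second term equals $\sum_i d_i$ and the energy reduces to $\sum_i d_i (d_i + 1)$.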
In the social network, users are represented by nodes and social relationships by edges. An edge a = (i, j) directed from node $V_i$ to node $V_j$ indicates that user i has had at least one session with user j.
Let H denote the graph obtained by removing vertex v from G. The local centrality of vertex v is then:
$C_v = E_L(G) - E_L(H)$
The local centrality of each user is calculated and compared with a preselected centrality threshold;
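A minimal sketch of this calculation follows. It relies on the identity that the sum of squared eigenvalues of a matrix equals the trace of its square, so no eigenvalue solver is needed. The function names, the example graph, and the threshold value 8 are illustrative assumptions, not part of the patent.

```python
def laplacian(adj):
    """Return L = D - A, where D holds the out-degree (row sum) of each vertex."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]

def associated_energy(adj):
    """Sum of squared eigenvalues of L, computed as trace(L @ L)."""
    lap = laplacian(adj)
    n = len(lap)
    return sum(lap[i][j] * lap[j][i] for i in range(n) for j in range(n))

def remove_vertex(adj, v):
    """Adjacency matrix of the graph H obtained by deleting vertex v."""
    return [[adj[i][j] for j in range(len(adj)) if j != v]
            for i in range(len(adj)) if i != v]

def local_centrality(adj, v):
    """C_v = E_L(G) - E_L(H)."""
    return associated_energy(adj) - associated_energy(remove_vertex(adj, v))

# Example: the path graph 0 - 1 - 2 (undirected, so the matrix is symmetric).
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
scores = [local_centrality(path, v) for v in range(3)]
# Hypothetical centrality threshold of 8: keep the nodes that fall below it.
low_centrality = [v for v, c in enumerate(scores) if c < 8]
```

The middle node of the path scores highest because removing it disconnects the graph; the two end nodes fall below the example threshold.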
The session text in the streaming data is then decomposed into phrases, and the semantic distance between these phrases is calculated using bag-of-words analysis;
Closed bag-of-words feature sets are used to calculate session content similarity. Each feature set contains a list of words with similar semantics. By checking the similarity of these words, the similarity of the entire content is obtained, and from this the similarity between the session contents published by each user is computed;
After the local centrality of each user and the similarity between the session contents published by each user have been calculated, risk discrimination thresholds are set, and the user nodes whose local centrality is below the preset centrality threshold and whose session content similarity is above the preset similarity threshold are filtered out and identified as untrustworthy users.
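The two-threshold filter can be sketched as follows. The feature sets, the use of cosine similarity over feature-set counts, the averaging over message pairs, and all names and threshold values are assumptions made for illustration; the text above does not fix a particular similarity formula.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical closed feature sets: each lists words treated as semantically similar.
FEATURE_SETS = {
    "discount": {"sale", "discount", "cheap", "bargain"},
    "action": {"buy", "order", "purchase"},
    "greeting": {"hello", "hi"},
}
WORD_TO_SET = {w: name for name, words in FEATURE_SETS.items() for w in words}

def to_vector(text):
    """Count how often each feature set is hit; words outside every set are ignored."""
    return Counter(WORD_TO_SET[w] for w in text.lower().split() if w in WORD_TO_SET)

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def content_similarity(messages):
    """Mean pairwise similarity of one user's messages (0.0 for fewer than two)."""
    vecs = [to_vector(m) for m in messages]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def flag_untrustworthy(centrality, messages, c_threshold, s_threshold):
    """Users with low local centrality AND highly similar session content."""
    return [u for u in centrality
            if centrality[u] < c_threshold
            and content_similarity(messages.get(u, [])) > s_threshold]

centrality = {"alice": 10, "bob": 2}
messages = {
    "alice": ["hello there", "buy milk today"],
    "bob": ["cheap sale buy now", "bargain discount order today"],
}
flagged = flag_untrustworthy(centrality, messages, c_threshold=5, s_threshold=0.9)
```

In this example the second user rewords the same advertisement, so both messages map to the same feature-set counts even though they share no words.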
Compared with the prior art, the present invention has the following advantages:
The proposed streaming big data security processing method reduces the number of user behavior features to be analyzed, eliminates redundancy between features, and uses a more efficient classification model, which improves the speed and accuracy of credit evaluation and is better suited to streaming computation over massive data.
Brief description of the drawings
Fig. 1 is a flowchart of the streaming big data security processing method according to an embodiment of the present invention.
Detailed description of the embodiments
A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of them.
One aspect of the present invention provides a streaming big data security processing method. Fig. 1 is a flowchart of the streaming big data security processing method according to an embodiment of the present invention.
The present invention detects social network behavior by monitoring social network user behavior records and generates behavioral risk warning information. The user behavior records include social network session information and, optionally, customer transaction records. For session information, port mirroring is performed at the egress of the cluster nodes, and the session text is imported into a host used for security detection, which captures the raw streaming data messages and decodes and preprocesses them before forwarding them to the detection engine. Preprocessing includes session classification, fragment reassembly, and session reconstruction. After preprocessing, the headers and payloads of the streaming data messages are checked against the detection rules stored in the database and the predefined risk signature codes, so that risky behavior is identified and intercepted.
The text content in the social network session records includes, but is not limited to, instant messaging chat records, self-media posts, microblog or forum messages, comments on news websites, and reviews on e-commerce websites. These social network behavior records are only examples; the specific social network behaviors may differ in practice and are not specifically limited here.
In analyzing social network behavior records, rule matching and a user-based behavior model are used to extract risky behavior features. The social network behavior records are first obtained from the server; pattern matching is then performed on the log files according to the decision rules in the database, and the redundant records generated by normal behavior are eliminated before credit evaluation, so that the violations present in the records can be identified and extracted.
Information generally jumps between users through network relationships. These jump paths can represent the operations of a user accessing a community website. By scanning the social network graph structure, a two-tuple of the current user to be analyzed and an associated user is established to represent the jump relationship between the two. The server logs are then analyzed to establish the path and behavior of the user on the pages actually visited.
Based on the behavior information and warning information above, statistics are compiled using risk statistics, vulnerability analysis, and availability analysis. The streaming data training set and test set of social network user behavior are read separately. The training and test data, preprocessed by standardization, are reduced in dimension using principal component analysis to remove redundant data, forming the credit-reference data feature set.
The credit-reference data features may be one or more selected from the following streaming data sets: credit history features, such as a user's payment and repayment history on financial websites and purchase, exchange, and order cancellation records on shopping websites; social relationship features, namely the credit data of other users with whom the user has established associations in the social network, as well as the closeness, depth, and breadth of the user's contacts with those users, such as relationship duration and session ratio; behavioral preference features, namely statistics of user behavior patterns based on the types, times, and frequency of the web pages or applications the user accesses and on social network evaluation information; and identity attribute features, namely personal identity attributes predicted from the user's network behavior, including age, occupation, marital status, and education level, which are checked for consistency with the basic information entered by the user. This feature information is only an example; the number of features in a sample may be more or fewer than shown in practice, and the specific features may differ, without specific limitation here.
Multidimensional data features are selected from the credit-reference data feature set. The data features and class labels of the training set are loaded into a deep belief network (DBN) classifier, which is then trained; the features of the test set are loaded next and their classes predicted, yielding the classification prediction result for the test set. The DBN classifier has multiple hidden layers, and different activation functions are used between the hidden layers. The classifier produces a trained result from the training data; the test data is then loaded and its classes are predicted from that result. Obtaining the prediction result completes the machine-learning-based detection of credit grades. Finally, the class labels of the test set are loaded and compared with the predictions of the DBN classifier for evaluation.
When classifying the selected credit-reference data features, a hyperplane is preferably defined that attempts to divide the data set into two classes, positive samples and negative samples. Assume a set of linearly separable data samples of two classes: $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where n is the number of samples and $y_i \in \{+1, -1\}$, satisfying:
$y_i(\omega \cdot x_i) - 1 \ge 0$,
where $\omega$ is the feature weight adjustment parameter. The classifier that minimizes $\|\omega\|^2 / 2$ is the optimal classifier. The problem of solving for the optimal classifier of the credit-reference data is converted into a quadratic programming optimization problem:
[formula not reproduced in the source text]
where $a_i$ is a Lagrange multiplier with $a_i \ge 0$, subject to the constraint:
[formula not reproduced in the source text]
From this solution, the optimal classifier function is obtained:
[formula not reproduced in the source text]
where sgn is the sign function.
If the optimal classifier cannot separate the two classes of points, a tolerance factor $\xi_i \ge 0$ is introduced such that:
[formula not reproduced in the source text]
$\Lambda$ denotes the discrimination threshold of the generalized classifier and CP denotes the penalty factor, from which the generalized optimal classifier is obtained.
The dual problem of the generalized optimal classifier is the same as in the linearly separable case, except that the constraint on $a_i$ becomes:
$0 \le a_i \le CP,\quad i = 1, 2, \ldots, n$
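For reference, the textbook form of the soft-margin quadratic program, its dual, and the resulting decision function is sketched below in the symbols used here. That the formulas of the original filing follow this standard form is an assumption; in particular, the bias term $b$ does not appear in the separability condition stated in the text and is included only because the standard derivation carries it.

```latex
\begin{aligned}
&\min_{\omega,\,b,\,\xi}\ \tfrac{1}{2}\|\omega\|^2 + CP \sum_{i=1}^{n} \xi_i
  \quad \text{s.t.}\quad y_i(\omega \cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \\
&\max_{a}\ \sum_{i=1}^{n} a_i
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j \,(x_i \cdot x_j)
  \quad \text{s.t.}\quad \sum_{i=1}^{n} a_i y_i = 0,\ \ 0 \le a_i \le CP \\
&f(x) = \operatorname{sgn}\!\Big( \sum_{i=1}^{n} a_i y_i \,(x_i \cdot x) + b \Big)
\end{aligned}
```

Setting all $\xi_i = 0$ and dropping the upper bound on $a_i$ recovers the linearly separable case.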
For classification problems that are not linearly separable, the associated data are mapped into a higher-dimensional space, where the problem is solved by linear classification of the associated features. The corresponding classification function of the features is then:
[formula not reproduced in the source text]
where $\Phi$ denotes the function $\Phi(x_i, x) = [(x \cdot x_i) + 1]\xi_i$;
The user credit features obtained by feature selection are thus classified according to the value of f(x).
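A minimal sketch of classifying by the sign of a kernel expansion is given below. The polynomial kernel `(u . v + 1) ** degree`, the support vectors, the multipliers, and the zero bias are all assumptions chosen for illustration; the exact form of Φ in the original filing may differ.

```python
def poly_kernel(u, v, degree=2):
    """Assumed polynomial kernel (u . v + 1) ** degree."""
    return (sum(a * b for a, b in zip(u, v)) + 1) ** degree

def decide(x, support, bias=0.0, kernel=poly_kernel):
    """Return +1 or -1 from sgn(sum_i a_i * y_i * K(x_i, x) + bias)."""
    score = sum(a * y * kernel(xi, x) for xi, y, a in support) + bias
    return 1 if score >= 0 else -1

# Hypothetical support vectors: (feature vector, label y_i, multiplier a_i).
support = [((1.0, 1.0), +1, 0.5), ((-1.0, -1.0), -1, 0.5)]
labels = [decide(x, support) for x in [(2.0, 1.0), (-1.5, -2.0)]]
```

In practice the multipliers would come from solving the dual problem rather than being set by hand.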
Optionally, after the credit-reference feature information has been extracted, deep learning is used to map the behavioral data features of identified risky users and of the current user to be analyzed into a newly created feature space suited to risk identification. In this new feature space, the similarity scores of the two are combined by weighted averaging to obtain the behavioral feature similarity between the current user to be analyzed and the risky users.
In the training stage for the streaming data, a deep neural network trained in advance on a large credit-reference behavior library is used to extract the credit-reference feature information; the training sample set is drawn from social network session data sets and financial transaction data sets. In this deep neural network structure, each convolutional layer is immediately followed by an activation function layer that converts linear inputs into nonlinear outputs. The output of a hidden layer node is:
$h_i(x) = \max_{j \in [1, k]} \left( x^{T} W_{ij} + b_{ij} \right)$
where $W_{ij}$ denotes the node value in column i, row j of the feature matrix and $b_{ij}$ denotes the balance factor of the node in column i, row j. Each hidden layer unit corresponds to k sub-hidden-layer nodes, and the largest of their k output values is taken as the output of the activation function. With 2 activation function nodes, the number of channels after each convolutional layer is halved.
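The activation described here, where each unit outputs the largest of k affine pieces, can be sketched as below. The nested-list layout of `weights` and `biases` and the example values are assumptions for illustration.

```python
def maxout(x, weights, biases):
    """weights[i][j] is the weight vector of piece j of unit i; biases[i][j] is its bias.

    Each unit i returns max over j of (x . weights[i][j] + biases[i][j]).
    """
    return [max(sum(xv * wv for xv, wv in zip(x, w)) + b
                for w, b in zip(unit_w, unit_b))
            for unit_w, unit_b in zip(weights, biases)]

x = [1.0, -2.0]
weights = [[[1.0, 0.0], [0.0, 1.0]],     # unit 0 with k = 2 pieces
           [[-1.0, 0.0], [0.5, 0.5]]]    # unit 1 with k = 2 pieces
biases = [[0.0, 0.5], [0.0, 1.0]]
out = maxout(x, weights, biases)
```

With k = 2, four affine pieces collapse into two outputs, which is the halving of channels mentioned in the text.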
In the pooling layer, the maximum and the mean of the features in each neighborhood are taken as the new feature values of that neighborhood, which preserves the contextual information of the behavior. Specifically, after the convolution operation and activation, the behavioral feature data undergo max pooling and average pooling separately, and the two pooling results are concatenated and output as the new feature.
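A one-dimensional sketch of this dual pooling is shown below; the window size, the non-overlapping stride, and the function name are assumptions for illustration.

```python
def pool_concat(features, size=2):
    """Max-pool and average-pool non-overlapping windows, then concatenate both results."""
    windows = [features[i:i + size] for i in range(0, len(features) - size + 1, size)]
    return [max(w) for w in windows] + [sum(w) / len(w) for w in windows]

pooled = pool_concat([1.0, 3.0, 2.0, 6.0])
```

The first half of the result holds the per-window maxima and the second half the per-window means.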
To obtain a network model whose representation of credit-reference features is good enough for feature extraction, the mapping space that the deep neural network learns from limited data samples should have intra-class distances as small as possible and inter-class distances as large as possible. A clustering constraint is therefore added to the cost function L, so that data of the same class gather together and data of different classes move apart:
[formula not reproduced in the source text]
where m is the number of clusters, $x_i$ is a sample feature vector in the i-th cluster, $W^T$ is the transpose of the regression matrix, $\lambda$ is the weight decay parameter, and $c_{x_i}$ is the adjustment coefficient of the feature vector $x_i$.
In risk prevention, besides identifying behaviors such as a user's transaction defaults, the risk of fraud between individual users in the social network must also be addressed. Fraudulent users typically pose as normal users to gain the trust of others: they spread false publicity over the network, invest some time and real benefits early on to win the victim's trust, and then vanish quickly after obtaining illegal gains, or amass wealth illegally under pretexts such as fake orders or crowdfunding. The key to identifying a fraudulent user lies in extracting and representing the user's behavioral features. The present invention represents user behavior by a credit-reference sequence cluster, which contains a set of feature sequences; without any annotation or prior knowledge of the behavior structure, the feature sequence set directly supports automatic classification and learning.
First, the user's network behavior is decomposed into basic feature sequences; second, the feature sequences are converted into index sequences. This yields the training behavior set $\{(V_n, y_n)\}$, $n = 1, 2, \ldots, N$, where $V_n$ is the behavior set of a user and $y_n \in [1, 2, \ldots, C]$ is the operation feature type label. N is the number of user operations and C is the number of types. For example, the features used to analyze user fraud include the user's social network topology parameters, the duration for which friendships are maintained, the number of friends added within a preset time, the number of friends deleted, the ratio of friendship duration to fund transfer time, and, within the same period, the ratio of the number of fund transfers to deleted friends to the total number of fund transfers.
The behavior $V_n$ is then expressed as a feature sequence $X_n$, defined as:
$X_n = [X_{1,n}, \ldots, X_{i,n}, \ldots, X_{l_n,n}]$
where $X_{i,n}$ is the feature set calculated for the i-th period and $l_n$ is the number of periods in $V_n$.
A set of feature sequences is expressed as $\mu = \{p_i \mid i = 1, \ldots, N_p\}$, where $N_p$ is the number of feature sequences in the set. The i-th feature sequence $p_i$ is defined as $\{X_i, \tau_i\}$,
where $\tau_i$ is the detection threshold.
To calculate $X_i$, a matrix transformation is first applied to all training feature sequences $\{X_1, \ldots, X_N\}$ to obtain the representative periods and an index that clusters all periods. The transformation matrix A is expressed as:
[formula not reproduced in the source text]
where the two vectors in the formula (their symbols are not reproduced in the source text) are the Fisher vectors describing type t in $X_{i,n}$ and $X_{k,m}$, respectively.
Then, for the i-th credit-reference sequence cluster, the training data sequences are established by setting the detection threshold $\tau_i$, which prevents noisy sequence patterns from being mined.
A feature sequence $X_n$ is converted into an index sequence, expressed as
$I_n = [I_{1,n}, \ldots, I_{i,n}, \ldots, I_{l_n,n}]$
where $I_{i,n}$ is the index of the i-th feature sequence. $X_n$ is processed with the feature sequence detection model, and the index $I_{i,n}$ is chosen so that the SVM response reaches its maximum.
From the trained index sequences $[I_1, I_2, \ldots, I_N]$, the feature sequence set R is obtained by a data mining algorithm. Each feature sequence set represents a local feature structure of a user operation, and the j-th sequence $R_j$ is defined as:
$R_j = \{c_j, s_j, x_j, w_j\}$
where $c_j \in [1, 2, \ldots, C]$ is the operation feature type label, $s_j$ is the sequence pattern, $x_j$ is the feature of the feature sequence set, and $w_j$ is the weight of $R_j$ in operation feature type $c_j$.
To calculate $s_j$, the training data indexes are first collected, and the sequence patterns are then computed from the collected training index sequences. Because the same sequence pattern may be mined from two operation feature classes, a weight $w_j$ is set: for pattern $s_j$, $w_j$ represents the relative support rate of $s_j$. If the same pattern occurs in two or more operation feature types, the weights of the corresponding feature sequence sets are reduced. If a pattern appears in only one type, its weight reaches the maximum value of 1.
Each feature sequence set represents a specific operation type and preserves the temporal relationships of the feature sequences. Because of the diversity of types, complex features can be modeled effectively. Given a test behavior $V_T$ and the feature sequence set R, the evaluation function of an operation feature c can be expressed as:
[formula not reproduced in the source text]
where $\alpha_{j,c}$, $\beta_{j,c}$, $\gamma_{j,c}$ are the parameters of the j-th feature sequence set in operation feature type c, $N_R$ is the number of feature sequence sets, $I_T$ is the sequence index, $X_T$ is the feature sequence of $V_T$, and $\sigma(I_T, s_j)$ is the sequence credit-reference feature. $\sigma(I_T, s_j)$ is used to calculate the structural similarity between the test behavior and the feature sequence set. Let the initial values be $F(n, 0) = 0$, $n \in [0, L]$, and $F(0, m) = -m$, $m \in [0, m_j]$, where L is the number of periods in $I_T$ and $m_j$ is the length of the sequence pattern $s_j$. The matching matrix F is then defined as:
$F(n, m) = \max\{-1 + A(X_{n,T}, X_{m,j}),\ F(n-1, m),\ F(n, m-1)\}$
The sequence credit-reference feature matches a long sequence, namely the test sequence representing the entire operation feature structure, against a short sequence that describes part of the structure of an operation feature. When $s_j$ matches the test sequence, $\sigma(I_T, s_j)$ attains the maximum credit-reference score:
$\sigma(I_T, s_j) = \max_n \left( F(n, m_j) / m_j \right)$
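One common way to realize a matching matrix with these boundary values is a semi-global sequence alignment, sketched below. The recurrence used in the code, `F[n][m] = max(F[n-1][m-1] + match, F[n-1][m], F[n][m-1] - 1)`, is an assumption and is not the recurrence printed in the text; the match score of +1 for equal indexes and -1 otherwise stands in for the similarity term A and is likewise hypothetical.

```python
def sigma(test_idx, pattern):
    """Best normalized alignment score of `pattern` inside `test_idx`."""
    L, m_j = len(test_idx), len(pattern)
    # F[n][0] = 0: a match may start at any period of the test sequence.
    F = [[0.0] * (m_j + 1) for _ in range(L + 1)]
    # F[0][m] = -m: every unmatched pattern step costs 1.
    for m in range(m_j + 1):
        F[0][m] = -float(m)
    for n in range(1, L + 1):
        for m in range(1, m_j + 1):
            match = 1.0 if test_idx[n - 1] == pattern[m - 1] else -1.0
            F[n][m] = max(F[n - 1][m - 1] + match,  # align period n with pattern step m
                          F[n - 1][m],              # skip a test period at no cost
                          F[n][m - 1] - 1.0)        # skip a pattern step at cost 1
    return max(F[n][m_j] for n in range(L + 1)) / m_j

full_match = sigma([5, 1, 2, 9, 3], [1, 2, 3])
no_match = sigma([7, 8, 9], [1, 2, 3])
```

Under these assumptions the score reaches 1 when the whole pattern occurs in order within the test sequence, even with unrelated periods in between, and falls to -1 when nothing matches.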
In the understanding and recognition stage of the credit-reference data features, a hierarchical analysis algorithm is used to identify the sequence credit-reference feature $\sigma(I_T, s_j)$ quickly and accurately, making the rank of the within-class scatter matrix $S_w$ as small as possible and the rank of the between-class scatter matrix $S_b$ as large as possible, so as to achieve the best classification performance. The Fisher function J is calculated:
[formula not reproduced in the source text]
where the argument of J is an n-dimensional vector (its symbol is not reproduced in the source text). The vector that maximizes J is chosen as the projection direction, which yields the maximum $S_b$ and the minimum $S_w$ after projection. A set of optimal discriminant vectors is selected to build the projection matrix W, expressed as:
[formula not reproduced in the source text]
Finally, in the learning based on hierarchical analysis, PCA is used to reduce the dimension of the projection matrix W and eliminate redundant feature information, completing the identification of risky user features.
Once the credit-reference risk identification of some users has been completed, the credit evaluation of other, new users can use a deep network to analyze whether the behavior pattern of the current new user is similar to that of risky users, thereby performing feature identification between risky users and the current user to be analyzed. Specifically, a feature sample pair $(x_f, x_c)$ of a confirmed risky user and the current user to be analyzed is first recorded, where $x_f$ and $x_c$ denote the credit-reference feature vectors of the risky user and of the user to be analyzed, respectively. The goal of deep learning is to find a mapping function f such that $f(x_f)$ and $f(x_c)$ satisfy the following relationship in the new feature space: when the new user to be analyzed has behavior pattern features similar to those of the risky user, the distance between $f(x_c)$ and $f(x_f)$ is as small as possible; when the user does not have behavior pattern features similar to those of the risky user, the distance between $f(x_f)$ and $f(x_c)$ is as large as possible.
To simplify the problem further, a convolutional network is trained before the deep learning algorithm is applied. By learning a set of hierarchical nonlinear transformations, the feature sample pairs are projected into the new feature space, in which positive sample pairs lie above a preset threshold and negative sample pairs lie below it, so that the deep network can make an accurate judgment.
Assume the deep network has M layers in total and that layer m has p(m) neurons, where m = 1, 2, 3, ..., M. For a given user behavior feature vector, the output of layer m is:
$h^{(m)} = \tanh\left(W^{(m)} h^{(m-1)} + b^{(m)}\right)$;
In the formula, $W^{(m)}$ is the weight parameter of layer m and $b^{(m)}$ is the bias of layer m. Passing x_f and x_c through the above M layers of nonlinear transformation gives:
$F(x_f) = h_f^{(M)}$, $F(x_c) = h_c^{(M)}$, where F is the mapping function realized by the M-layer network. The distance between the risk user and the user currently being analyzed in the new feature space is:
$d_{fc}^2(x_f, x_c) = \lVert F(x_f) - F(x_c) \rVert^2$
The behavior pattern similarity measure between the user and the risk user should then satisfy:
if $d_{fc}^2(x_f, x_c) < \tau - 1$, then x_f and x_c have similar behavior patterns;
if $d_{fc}^2(x_f, x_c) > \tau + 1$, then x_f and x_c do not have similar behavior patterns;
where τ denotes the preset risk distance threshold. In this way the positive and negative samples are well separated in the newly built feature space.
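As a non-limiting illustration, the layer-wise mapping and the threshold test described above might be sketched as follows. The sketch assumes that the input to the first layer is the feature vector itself (h(0) = x). The toy weights and the helper names `forward`, `squared_distance`, and `classify_pair` are hypothetical; in practice W and b would be learned rather than fixed by hand.

```python
import math

def forward(x, layers):
    """Apply h(m) = tanh(W(m) h(m-1) + b(m)) layer by layer, assuming h(0) = x."""
    h = list(x)
    for W, b in layers:
        h = [math.tanh(sum(w * v for w, v in zip(row, h)) + bias)
             for row, bias in zip(W, b)]
    return h

def squared_distance(u, v):
    """Squared Euclidean distance between two mapped feature vectors."""
    return sum((a - c) ** 2 for a, c in zip(u, v))

def classify_pair(x_f, x_c, layers, tau):
    """Compare the squared distance in the mapped space with tau - 1 and tau + 1."""
    d2 = squared_distance(forward(x_f, layers), forward(x_c, layers))
    if d2 < tau - 1:
        return "similar"
    if d2 > tau + 1:
        return "dissimilar"
    return "undecided"  # falls inside the margin between the two thresholds

# Hypothetical two-layer network with hand-picked weights (illustration only).
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),
]
print(classify_pair([1.0, 1.0], [1.0, 1.0], layers, tau=1.5))
print(classify_pair([1.0, 1.0], [-1.0, -1.0], layers, tau=1.5))
```

Pairs whose distance falls between τ − 1 and τ + 1 are reported as undecided here; how such pairs are handled is not stated in the text and is left open in this sketch.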
The optimization objective function is then set as
where
β is an adjustment parameter. Based on the above formula, the weight parameters W and the biases b are obtained using the stochastic gradient descent algorithm.
Deep learning yields a new feature representation pair $(x'_f, x'_c)$. A similarity algorithm can then be used to obtain the behavior pattern similarity $S_{fc}(x'_f, x'_c)$ between the user currently being analyzed and a given risk user:
This is the final user similarity estimate, where $x'_{fi}$ and $x'_{ci}$ are the i-th components of the feature vectors $x'_f$ and $x'_c$, respectively, and D is the dimension of the feature vectors.
If the behavior pattern similarity between the user currently being analyzed and any risk user is greater than a preset threshold, the user being analyzed is identified as an untrustworthy user.
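As a non-limiting illustration of the similarity test, the sketch below assumes cosine similarity over the D components of the learned feature vectors. The text names only "a similarity algorithm", so the choice of cosine similarity, the function names, and the threshold value are assumptions made for the example.

```python
import math

def cosine_similarity(x_f, x_c):
    """Cosine similarity over the components of two feature vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(x_f, x_c))
    norm_f = math.sqrt(sum(a * a for a in x_f))
    norm_c = math.sqrt(sum(b * b for b in x_c))
    if norm_f == 0.0 or norm_c == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_f * norm_c)

def is_untrustworthy(x_c, risk_vectors, threshold):
    """Flag the analyzed user if similarity to any known risk user exceeds the threshold."""
    return any(cosine_similarity(x_f, x_c) > threshold for x_f in risk_vectors)

# Hypothetical learned feature vectors of confirmed risk users.
risk_vectors = [[1.0, 0.0]]
print(is_untrustworthy([1.0, 0.1], risk_vectors, threshold=0.9))
print(is_untrustworthy([0.0, 1.0], risk_vectors, threshold=0.9))
```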
In addition to the above features based on the social relationship graph, user credit assessment also requires semantic analysis of session content. For example, some advertising users attract legitimate users by repeatedly sending similar content at high frequency, and use tools to republish content so that the same meaning is expressed with different wording. This makes them harder to distinguish from normal users. For this reason, the embodiment of the present invention calculates the local centripetal degree feature of each user in the social network topology and identifies risk users disguised as normal users.
In the social network, users are represented by nodes and social relationships by edges. An edge a = (i, j) pointing from node V_i to node V_j indicates that there is at least one session between user i and user j. Even if an untrustworthy user changes their own attributes, it is difficult for them to change their position in the social network topology. Therefore, the following features of the user nodes are calculated based on the social network topology.
The local centripetal degree of a node is the amount by which the associated energy of the network drops after that node is removed from the network. The local centripetal degree takes into account not only local density information but also bottleneck information. The associated energy of a topological graph is defined as:
$E_L(G) = \sum_i \theta_i^2$
where $\theta_i$ ranges over the eigenvalues of the Kirchhoff (Laplacian) matrix of graph G; the sum of these eigenvalues equals the sum of all vertex out-degrees. Let A(G) be the adjacency matrix of graph G and D(G) the diagonal matrix of vertex out-degrees. The Kirchhoff matrix of graph G is L(G) = D(G) - A(G).
For a topological graph G with n vertices whose out-degrees are $d_1, d_2, \ldots, d_n$, the associated energy can be expressed in terms of these degrees and reflects the connectivity inside the graph. When a vertex is removed from the graph, the associated energy of the graph decreases. The amount by which $E_L(G)$ decreases reflects the importance of that vertex in the graph. Let H be the graph obtained after removing vertex v from graph G. The local centripetal degree of vertex v is:
$C_v = E_L(G) - E_L(H)$
Untrustworthy users have an unstable social network structure and only weak relationships with their neighbor nodes. Removing such users, who are unimportant in terms of social relationships, from the network therefore reduces the energy of the network only slightly.
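As a non-limiting illustration, the local centripetal degree might be computed as sketched below. The sketch relies on the identity that the sum of the squared eigenvalues of a square matrix equals the trace of its square, so no eigenvalue solver is needed. The graph is assumed to be given as an adjacency matrix without self-loops, and the function names are hypothetical.

```python
def kirchhoff_matrix(adj):
    """L = D - A, where D holds the out-degree (row sum) of each vertex."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]

def associated_energy(adj):
    """Sum of squared eigenvalues of L, computed as trace(L * L)."""
    L = kirchhoff_matrix(adj)
    n = len(L)
    return sum(L[i][j] * L[j][i] for i in range(n) for j in range(n))

def remove_vertex(adj, v):
    """Adjacency matrix of the graph H obtained by deleting vertex v."""
    return [[x for j, x in enumerate(row) if j != v]
            for i, row in enumerate(adj) if i != v]

def local_centripetal_degree(adj, v):
    """C_v = E_L(G) - E_L(H)."""
    return associated_energy(adj) - associated_energy(remove_vertex(adj, v))

# Hypothetical star graph: vertex 0 is connected to vertices 1, 2 and 3.
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
print(local_centripetal_degree(star, 0))  # hub
print(local_centripetal_degree(star, 1))  # leaf
```

In this toy graph the hub has a much larger local centripetal degree than a leaf, which matches the observation that removing a weakly connected user changes the energy of the network only slightly.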
Because untrustworthy users pursue specific commercial interests, the session content they publish tends to be very similar: it contains a large amount of repeated session content, harmful links, and similar information, all of which is highly alike. Therefore, the session text in the streaming data is first decomposed into phrases, and the semantic distance between these phrases is then calculated using bag-of-words analysis. Closed bag-of-words feature sets are used to calculate content similarity. Each feature set contains a list of words with similar meanings. Checking the similarity of these words gives the similarity of the entire content, from which the similarity between the session contents published by each user on different occasions is computed.
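As a non-limiting illustration, the content similarity step might be sketched as follows. The feature sets, the whitespace tokenization, the use of cosine similarity over feature-set counts, and the averaging over message pairs are all assumptions made for the example; the text does not fix these details.

```python
import math
from itertools import combinations

# Hypothetical closed feature sets: each lists words treated as semantically similar.
FEATURE_SETS = {
    "discount": ["discount", "sale", "cheap", "deal"],
    "link": ["click", "visit", "link"],
    "greeting": ["hello", "hi", "hey"],
}
WORD_TO_SET = {w: name for name, words in FEATURE_SETS.items() for w in words}

def to_vector(text):
    """Count how many words of the text fall into each feature set."""
    counts = dict.fromkeys(FEATURE_SETS, 0)
    for token in text.lower().split():
        name = WORD_TO_SET.get(token.strip(".,!?"))
        if name is not None:
            counts[name] += 1
    return [counts[name] for name in FEATURE_SETS]

def content_similarity(text_a, text_b):
    """Cosine similarity between the feature-set count vectors of two texts."""
    a, b = to_vector(text_a), to_vector(text_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def user_content_similarity(messages):
    """Average pairwise similarity of the messages published by one user."""
    pairs = list(combinations(messages, 2))
    if not pairs:
        return 0.0
    return sum(content_similarity(a, b) for a, b in pairs) / len(pairs)

print(content_similarity("Cheap deal, click now!", "Big sale! Visit the link"))
print(user_content_similarity(["cheap deal", "sale"]))
```

Two messages that use different words from the same feature sets still score as similar, which is the property needed to catch republished content that rephrases the same meaning.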
After the local centripetal degree of each user and the similarity between the session contents published by each user have been obtained, risk discrimination thresholds are set. User nodes whose local centripetal degree is lower than a preset centripetal degree threshold and whose session content similarity is higher than a preset similarity threshold are filtered out and identified as untrustworthy users.
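As a non-limiting illustration, the two-threshold filter might look as follows. The per-user values, the threshold values, and the function name `flag_untrustworthy` are hypothetical.

```python
def flag_untrustworthy(users, centripetal_threshold, similarity_threshold):
    """Return ids of users with low local centripetal degree and high content similarity.

    users maps a user id to a (local_centripetal_degree, content_similarity) pair.
    """
    return [uid for uid, (centripetal, similarity) in users.items()
            if centripetal < centripetal_threshold
            and similarity > similarity_threshold]

# Hypothetical per-user features.
users = {
    "u1": (2.0, 0.95),   # weakly connected, repetitive content
    "u2": (40.0, 0.97),  # well connected, repetitive content
    "u3": (1.5, 0.10),   # weakly connected, varied content
}
print(flag_untrustworthy(users, centripetal_threshold=5.0, similarity_threshold=0.8))
```

Only users that satisfy both conditions are flagged, so a well-connected user who repeats content, or a weakly connected user with varied content, is not flagged in this sketch.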
Among the defaulted transactions that bring risk to the system, a large proportion are caused by the low credit of the user's upstream node in the transaction chain. The present invention therefore further performs risk diffusion identification on the risk diffusion behavior in the transaction chain. The passive credit-reduction threshold of a user is set according to the average over all of that user's transactions in a past period. When multiple transactions exist at the same time, the influence of the network structure on diffusion is taken into account. A network G(V, E) is built from real transaction data. The node set V is the set of all transacting users, where S(x) is the set of risk-free users and I(x) is the set of risky users. The edge set E is the set of transactions between users in the network. The weight on edge E_ij is denoted {a_ij} and represents the number of transactions between the users. The state of user i is denoted n_i, where n_i = 1 means default and n_i = 0 means no default. The state of the transaction E_ij between users is denoted e_ij, where e_ij = 1 means the transaction between the users is abnormal and e_ij = 0 means the transaction is normal. The passive credit-reduction count of user j is $d_j = \sum_i A_{ij} a_{ij} e_{ij}$. Let the distribution of the users' passive credit-reduction thresholds be {δ_i}, the passive credit-reduction counts be {F_i}, and the set of users whose credit is passively reduced through risk diffusion be Risk(x). The diffusion process is described as follows:
A) All users are initialized to the normal state (S). A randomly chosen subset of users is then switched to the risk state (I), that is, some n_i are changed from 0 to 1, and a transaction E_ij of each of these users defaults, so that e_ij = 1.
B) The number of defaults is added to the upstream user. Once the passive credit-reduction count of the upstream user exceeds the given threshold, that is, when $d_j = \sum_i A_{ij} a_{ij} e_{ij} > \delta_j$, that user changes from S to I.
C) The number of times F_i that each user i is reached by diffusion and switched to the risk state is recorded. When the set Risk(x) of users whose credit is passively reduced is the same after two consecutive diffusion rounds, the diffusion process ends.
The defaulted transactions are extracted from the transaction network to form a sub-network, and the passive credit-reduction counts F_i are sorted in descending order. The X users with the highest counts F_i in the sub-network are selected as the identified high-risk propagating users, where X is a preset quantity threshold.
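As a non-limiting illustration, steps A) to C) and the final ranking might be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the text: the initially risky users are passed in explicitly rather than drawn at random, every transaction issued by a risky user is treated as defaulted, an edge (i, j) means that j is the upstream user of i, and the final accumulated count per user is used as F. All names are hypothetical.

```python
def simulate_diffusion(edges, thresholds, seeds):
    """Propagate risk along transaction edges until the risk set stops changing.

    edges: {(i, j): a_ij}, where j is assumed to be the upstream user of i.
    thresholds: {user: delta}, the passive credit-reduction threshold per user.
    seeds: users placed in the risk state at the start (step A).
    """
    state = {u: (1 if u in seeds else 0) for u in thresholds}  # n_i
    abnormal = set()                                           # edges with e_ij = 1
    risk_set = set()                                           # Risk(x)
    while True:
        # Assumption: all transactions issued by a risky user default.
        abnormal |= {(i, j) for (i, j) in edges if state[i] == 1}
        # Step B: add defaulted transaction counts to the upstream user.
        reductions = {u: 0 for u in thresholds}
        for (i, j) in abnormal:
            reductions[j] += edges[(i, j)]
        newly = {j for j, d in reductions.items()
                 if state[j] == 0 and d > thresholds[j]}
        # Step C: stop when the set of affected users no longer changes.
        if not newly:
            return risk_set, reductions, abnormal
        for j in newly:
            state[j] = 1
        risk_set |= newly

def top_spreaders(reductions, abnormal, x):
    """Rank users of the defaulted sub-network by count, descending, and keep the top x."""
    involved = {u for edge in abnormal for u in edge}
    return sorted(involved, key=lambda u: (-reductions[u], u))[:x]

# Hypothetical transaction network.
edges = {("a", "b"): 3, ("b", "c"): 2, ("d", "c"): 1, ("c", "e"): 1}
thresholds = {"a": 1, "b": 2, "c": 1, "d": 1, "e": 5}
risk_set, reductions, abnormal = simulate_diffusion(edges, thresholds, seeds={"a"})
print(sorted(risk_set))
print(top_spreaders(reductions, abnormal, x=2))
```

The loop ends when a round adds no new users, which corresponds to Risk(x) being identical after two consecutive diffusion rounds.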
To further restrict violations by risk users and untrustworthy users, and to reduce the adverse effect of illegal operations on normal users and on the social network environment, the present invention applies specific restriction or control strategies after the identification stage on the streaming data is complete. These restriction or control strategies include, but are not limited to, the following:
1: Limit the user's ability to expand their social scope. When the user tries to search for other new users, or when the system recommends new friends to the user, the number of objects visible to the user or pushed by the system is reduced. Specifically, the new users that could originally be recommended to the user are sorted by risk value from low to high, and the preset proportion of new users with the highest system trust level is hidden. This limits the influence of risk users on normal users.
2: Mark the user. If the user commits no violation within a certain period but defaults again after that period, the user is again identified as an untrustworthy user and placed in a risk level higher than the previous control level.
3: If a user is determined to be a malicious defaulter, compulsory control measures, including freezing the account, are taken to prevent the user from continuing to affect the social network environment.
In conclusion reducing user behavior to be analyzed the invention proposes a kind of streaming big data security processing
The quantity of feature, eliminates the redundancy between feature, and the disaggregated model of use more highly effective effectively improves credit evaluation
Speed and the accuracy of credit evaluation have better adapted to the streaming computing scene of mass data.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing system. They can be concentrated on a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The present invention is therefore not limited to any specific combination of hardware and software.
It should be understood that the above specific embodiments of the present invention are used only to illustrate or explain the principles of the present invention and do not limit it. Therefore, any modification, equivalent replacement, improvement, or the like made without departing from the spirit and scope of the present invention shall fall within the protection scope of the present invention. In addition, the appended claims of the present invention are intended to cover all variations and modifications that fall within the scope and boundaries of the appended claims, or the equivalents of such scope and boundaries.
Claims (1)
1. A streaming big data security processing method, characterized by comprising:
calculating the local centripetal degree feature of each user node based on the social network topology, wherein the nodes of the social network topology represent users and the edges represent the social relationships between users;
the local centripetal degree of a node represents the amount by which the associated energy of the network drops after the node is removed from the network;
wherein the associated energy of the social network graph G is defined as:
$E_L(G) = \sum_i \theta_i^2$
$\theta_i$ denotes an eigenvalue of the Kirchhoff matrix of graph G;
the Kirchhoff matrix of graph G is L(G) = D(G) - A(G);
A(G) is the adjacency matrix of graph G, and D(G) is the diagonal matrix of vertex out-degrees;
for a topological graph G with n nodes whose out-degrees are $d_1, d_2, \ldots, d_n$, the associated energy is expressed in terms of these degrees;
in the social network, users are represented by nodes and social relationships by edges; an edge a = (i, j) pointing from node V_i to node V_j indicates that there is at least one session between user i and user j;
letting H be the graph obtained after removing vertex v from graph G, the local centripetal degree of vertex v is:
$C_v = E_L(G) - E_L(H)$
calculating the local centripetal degree of each user and comparing it with a preselected centripetal degree threshold;
then decomposing the session text in the streaming data into phrases, and calculating the semantic distance between these phrases using bag-of-words analysis;
using closed bag-of-words feature sets to calculate session content similarity, wherein each feature set contains a list of words with similar meanings; obtaining the similarity of the entire content by checking the similarity of these words, and from it computing the similarity between the session contents published by each user on different occasions;
after the local centripetal degree of each user and the similarity between the session contents published by each user have been calculated, setting risk discrimination thresholds, filtering out the user nodes whose local centripetal degree is lower than the preset centripetal degree threshold and whose session content similarity is higher than the preset similarity threshold, and identifying them as untrustworthy users.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910641669.1A CN110334780A (en) | 2019-07-16 | 2019-07-16 | Streaming big data security processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910641669.1A CN110334780A (en) | 2019-07-16 | 2019-07-16 | Streaming big data security processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110334780A true CN110334780A (en) | 2019-10-15 |
Family
ID=68145255
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910641669.1A Withdrawn CN110334780A (en) | 2019-07-16 | 2019-07-16 | Streaming big data security processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110334780A (en) |
-
2019
- 2019-07-16 CN CN201910641669.1A patent/CN110334780A/en not_active Withdrawn
Similar Documents
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20191015 |