CN108509793A

CN108509793A - Method and device for detecting abnormal user behavior based on user behavior log data

Info

Publication number: CN108509793A
Application number: CN201810306815.0A
Authority: CN
Inventors: 曾毅; 彭洪涛; 喻波; 王志海; 董爱华; 安鹏
Original assignee: Beijing Wondersoft Technology Co Ltd
Current assignee: Beijing Wondersoft Technology Co Ltd
Priority date: 2018-04-08
Filing date: 2018-04-08
Publication date: 2018-09-07

Abstract

The invention discloses a method and device for detecting abnormal user behavior based on user behavior log data. The method comprises the following steps: collecting user log data and normalizing it; scoring newly collected user behavior logs with an LDA analysis model; when the score is below a predetermined score, determining that the newly collected user behavior log is a suspicious user behavior log; and determining the user terminal and application software corresponding to the suspicious user behavior log and generating alarm information. With this technical solution, abnormal user behavior can be found quickly and reported promptly to an administrator or user, which improves the efficiency of threat discovery and handling. The analysis completes at near real-time speed, which improves the timeliness of the system's audit and alarm functions.

Description

Method and device for detecting abnormal user behavior based on user behavior log data

Technical field

The present invention relates to the field of data security, and in particular to a method for detecting abnormal user behavior based on user behavior log data.

Background art

LDA (Latent Dirichlet Allocation) is a generative document topic model, also described as a three-layer Bayesian probability model whose layers are words, topics, and documents. "Generative" means that each word of a document is regarded as produced by the process "select a topic with a certain probability, then select a word from that topic with a certain probability." The document-to-topic distribution is multinomial, and the topic-to-word distribution is multinomial.

LDA is an unsupervised machine learning technique that can be used to identify latent topic information in a large document collection or corpus. It uses the bag-of-words method, which treats each document as a word-frequency vector and thereby converts text into numerical information that is easy to model.

User behavior analysis means compiling statistics on and analyzing the relevant data, once the basic data has been obtained, in order to discover patterns in user behavior.

As shown in Fig. 1, the prior-art A-NIDS framework mainly comprises three stages:

1. Parameterization stage: the system formats or preprocesses the collected information in a predetermined manner.

2. Training stage: normal behavior is classified according to its characteristics, and a corresponding model is then built.

3. Detection stage: once the model has been trained and is available, it is compared against the observed traffic data; if the deviation exceeds a given threshold, the system issues a warning and generates a detection report.

With respect to the prior art, the following technical problems need to be solved:

1. Collection and normalization of user behavior data.

2. Building a Spark-based machine learning LDA model.

3. Alarm display of abnormal behavior results.

Summary of the invention

To solve the above technical problems, the present invention provides a method for detecting abnormal user behavior based on user behavior log data, characterized in that the method comprises the following steps:

1) collecting user log data and normalizing it;

2) scoring newly collected user behavior logs with an LDA analysis model;

3) when the score is below a predetermined score, determining that the newly collected user behavior log is a suspicious user behavior log;

4) determining the user terminal and application software corresponding to the suspicious user behavior log, and generating alarm information.

In the method of the present invention, preferably, in the LDA analysis model the user behavior log data comprises the following terms: user ID, user terminal ID, application software code, operation time, and operation type. The documents and topics required as input to the LDA analysis model are built from these terms; the probability of occurrence of each user behavior log is then calculated according to the LDA algorithm, and that probability is used as the score of that user behavior log.

In the method of the present invention, preferably, the probability that each word in the user behavior log occurs in the document collection is expressed as:

p(w | d) = Σ_{k=1}^{K} p(w | z_k) * p(z_k | d)

where d is a document, w is a word, and z_k is the k-th of K topics. The score of a newly collected user behavior log is determined from this probability.

In the method of the present invention, preferably, before step 1), the LDA analysis model is trained on user behavior log data;

the user log data serves as the documents for training the LDA analysis model, the words formed by processing the user operation data serve as its words, and topics relating to user operation types serve as its topics.

In the method of the present invention, preferably, the user behavior log data is divided into two words;

one word comprises: user ID, user terminal ID, application software type, and operation time;

the other word comprises: operation type, operation duration, the number corresponding to the request field, and the number corresponding to the response field.

To solve the above technical problems, the present invention provides a device for detecting abnormal user behavior based on user behavior log data, characterized in that the device comprises:

a data collection and processing module, which collects user log data and normalizes it;

a score evaluation module, which scores newly collected user behavior logs with an LDA analysis model;

a score judgment module, which determines that a newly collected user behavior log is a suspicious user behavior log when its score is below a predetermined score;

an alarm module, which determines the user terminal and application software corresponding to the suspicious user behavior log and generates alarm information.

In the device of the present invention, preferably, in the LDA analysis model the user behavior log data comprises the following terms: user ID, user terminal ID, application software code, operation time, and operation type. The documents and topics required as input to the LDA analysis model are built from these terms; the probability of occurrence of each user behavior log is then calculated according to the LDA algorithm, and that probability is used as the score of that user behavior log.

In the device of the present invention, preferably, the probability that each word in the user behavior log occurs in the document collection is expressed as:

p(w | d) = Σ_{k=1}^{K} p(w | z_k) * p(z_k | d)

where d is a document, w is a word, and z_k is the k-th of K topics. The score of a newly collected user behavior log is determined from this probability.

In the device of the present invention, preferably, the device further comprises a model training module, which trains the LDA analysis model on user behavior log data.

In the device of the present invention, preferably, the user behavior log data is divided into two words.

To solve the above technical problems, the present invention provides a computer-readable storage medium storing computer program instructions which, when executed, implement any one of the methods described above.

The technical solution of the present invention achieves the following technical effects:

1. Functional extension: the machine learning method for analyzing abnormal behavior based on user behavior logs can quickly find abnormal user behavior and report it promptly to an administrator or user, which improves the efficiency of threat discovery and handling.

2. Real-time performance: the Spark-based machine learning LDA analysis model completes data analysis at near real-time speed, which improves the timeliness of the system's audit and alarm functions.

Brief description of the drawings

Fig. 1 is a flowchart of prior-art data analysis;

Fig. 2 is a flowchart of the user behavior anomaly detection of the present invention.

Detailed description

LDA is an unsupervised machine learning technique that can be used to identify latent topic information in a large document collection or corpus. It uses the bag-of-words method, which treats each document as a word-frequency vector and thereby converts text into numerical information that is easy to model. The bag-of-words method ignores the order of words, which simplifies the problem and also leaves room for improving the model. Each document represents a probability distribution over topics, and each topic represents a probability distribution over words.

LDA generative process

For each document in the corpus, LDA defines the following generative process:

1. For each document, draw a topic from the document's topic distribution;

2. Draw a word from the word distribution of the topic just drawn;

3. Repeat the above until every word in the document has been traversed.

Each document in the corpus corresponds to a multinomial distribution over T topics (T is given in advance, for example by repeated trials); this distribution is denoted θ. Each topic in turn corresponds to a multinomial distribution over the V words in the vocabulary; this distribution is denoted φ.
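The generative process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary, the values of θ and φ, and the function name `generate_document` are hypothetical and do not come from the invention.

```python
import random

def generate_document(theta, phi, vocabulary, length, rng):
    """Sample `length` words: draw a topic from theta, then a word from that topic's row of phi."""
    topics = list(range(len(theta)))
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=theta, k=1)[0]          # step 1
        word = rng.choices(vocabulary, weights=phi[topic], k=1)[0]  # step 2
        words.append(word)                                          # step 3: repeat per word
    return words

# Hypothetical toy values: 2 topics over a 4-word vocabulary.
vocabulary = ["login", "query", "delete", "export"]
theta = [0.7, 0.3]                      # document-to-topic distribution
phi = [[0.5, 0.4, 0.05, 0.05],          # topic 0: word distribution
       [0.05, 0.05, 0.4, 0.5]]          # topic 1: word distribution
doc = generate_document(theta, phi, vocabulary, length=8, rng=random.Random(0))
print(doc)
```

A seeded `random.Random` is passed in so that the sampled document is reproducible between runs.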

LDA overall flows

First, some notation: the document collection is D and the topic set is T.

Each document d in D is regarded as a word sequence <w1, w2, ..., wn>, where wi is the i-th word and d has n words. (LDA calls this a bag of words; the position at which each word appears has no effect on the LDA algorithm.)

All distinct words appearing in D form a set VOCABULARY (VOC for short). LDA takes the document collection D as input and is trained to produce two result vectors (assume k topics and m words in VOC):

For each document d in D, the probabilities of its topics, θd = <pt1, ..., ptk>, where pti is the probability that d corresponds to the i-th topic in T. The calculation is intuitive: pti = nti / n, where nti is the number of words in d assigned to the i-th topic and n is the total number of words in d.

For each topic t in T, the probabilities with which it generates each word, φt = <pw1, ..., pwm>, where pwi is the probability that t generates the i-th word in VOC. The calculation is equally intuitive: pwi = Nwi / N, where Nwi is the number of occurrences of the i-th word in VOC assigned to topic t, and N is the total number of word occurrences assigned to topic t.
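The two count-based estimates, pti = nti / n and pwi = Nwi / N, can be written out as follows. This is a minimal sketch that assumes every word already carries a topic assignment; the function names and the toy corpus are hypothetical.

```python
from collections import Counter

def estimate_theta(doc_topics, k):
    """theta_d[i] = n_ti / n for one document, given the topic assigned to each of its words."""
    counts = Counter(doc_topics)
    n = len(doc_topics)
    return [counts[t] / n for t in range(k)]

def estimate_phi(corpus_words, corpus_topics, k, voc):
    """phi_t[i] = N_wi / N: the share of topic t's word occurrences that are the i-th word of voc."""
    per_topic = [Counter() for _ in range(k)]
    for words, topics in zip(corpus_words, corpus_topics):
        for w, t in zip(words, topics):
            per_topic[t][w] += 1
    phi = []
    for t in range(k):
        total = sum(per_topic[t].values())
        phi.append([per_topic[t][w] / total if total else 0.0 for w in voc])
    return phi

# Hypothetical toy corpus: two documents, with a topic id (0 or 1) per word.
voc = ["a", "b", "c"]
corpus_words = [["a", "b", "a"], ["b", "c"]]
corpus_topics = [[0, 0, 1], [1, 1]]
theta_0 = estimate_theta(corpus_topics[0], k=2)
phi = estimate_phi(corpus_words, corpus_topics, k=2, voc=voc)
print(theta_0, phi)
```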

The core formula of LDA is:

p(w | d) = p(w | t) * p(t | d)

Intuitively, with the topic as an intermediate layer, the current θd and φt give the probability that word w appears in document d. Here p(t | d) is computed from θd and p(w | t) from φt.

In fact, using the current θd and φt, we can compute p(w | d) for a word in a document under each candidate topic, and then use these results to update the topic that the word should correspond to. If this update changes the topic assigned to the word, it in turn affects θd and φt. [2]

LDA learning process

When the LDA algorithm starts, θd and φt are assigned random values (for all d and t). The above process is then repeated, and the result it finally converges to is the output of LDA. One iteration of learning is described in more detail below:

1. For the i-th word wi in a specific document ds, if the topic corresponding to that word is tj, the above formula can be rewritten as:

pj(wi | ds) = p(wi | tj) * p(tj | ds)

2. We can now enumerate the topics in T to obtain all pj(wi | ds), for j = 1 to k. A topic can then be selected for the i-th word wi in ds on the basis of these probabilities. The simplest approach is to take the tj that maximizes pj(wi | ds) (note that only j varies in this expression), i.e. argmax[j] pj(wi | ds).

3. If the i-th word wi in ds has now been assigned a topic different from its previous one, θd and φt are affected (as is easily seen from the formulas for these two vectors given above), and this in turn affects the calculation of p(w | d) described above. Computing p(w | d) and reselecting the topic for every w in every d in D counts as one iteration. After N such iterations, the process converges to the result required of LDA.
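The iteration described in steps 1 to 3 can be sketched as below. It implements only the simplified "take the argmax" scheme described here, not Gibbs sampling or variational inference, and it assumes that the random initialization is done by assigning each word a random topic, from which θd and φt follow by the count formulas. All names and the toy corpus are hypothetical.

```python
import random
from collections import Counter

def counts_to_params(docs, assign, k, voc):
    """Recompute theta_d (n_ti / n) and phi_t (N_wi / N) from the current topic assignments."""
    theta, topic_word = [], [Counter() for _ in range(k)]
    for words, topics in zip(docs, assign):
        c = Counter(topics)
        theta.append([c[t] / len(topics) for t in range(k)])
        for w, t in zip(words, topics):
            topic_word[t][w] += 1
    phi = []
    for t in range(k):
        total = sum(topic_word[t].values())
        phi.append({w: (topic_word[t][w] / total if total else 0.0) for w in voc})
    return theta, phi

def iterate_once(docs, assign, k, voc):
    """One pass: for every word pick argmax_j p(w_i | t_j) * p(t_j | d_s)."""
    theta, phi = counts_to_params(docs, assign, k, voc)
    changed, new_assign = 0, []
    for d, words in enumerate(docs):
        row = []
        for i, w in enumerate(words):
            scores = [phi[t][w] * theta[d][t] for t in range(k)]
            best = max(range(k), key=lambda t: scores[t])
            changed += best != assign[d][i]   # count words whose topic moved
            row.append(best)
        new_assign.append(row)
    return new_assign, changed

# Hypothetical toy corpus with k = 2 topics.
docs = [["a", "b", "a", "c"], ["c", "d", "d"], ["a", "b"]]
voc = sorted({w for d in docs for w in d})
k = 2
rng = random.Random(1)
assign = [[rng.randrange(k) for _ in d] for d in docs]
for _ in range(50):                           # cap on the number of iterations
    assign, changed = iterate_once(docs, assign, k, voc)
    if changed == 0:                          # no topic moved: assignments are stable
        break
theta, phi = counts_to_params(docs, assign, k, voc)
print(assign, theta)
```

The loop stops early when a full pass changes no assignment; the cap of 50 passes is an arbitrary safeguard, since the greedy update is not guaranteed to settle.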

The present invention is further described below with reference to the accompanying drawings and specific embodiments, but the scope of protection of the invention is not limited thereto.

<Method for detecting abnormal user behavior>

Referring to Fig. 2, the behavior detection steps are as follows:

(1) The service logs of each system are collected by a log collection module. Through correlation analysis, the collected data is formed into user behavior logs that describe user behavior. The generated user behavior logs are normalized. A normalized log describes which operation a user performed, at what time, on which terminal, and through which application (app). The processed message fields include:

Time the data was received

Operation duration

Terminal ID

User ID

Application software code

Operation type (add, delete, query, modify)

Length of the request parameters (in bytes)

Length of the response results (in bytes)
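As an illustration of what a normalized record could look like, the sketch below models the fields listed above as a Python dataclass. The class and function names are hypothetical; the key names (`userid`, `hardwareid`, `appcode`, `trhour`, `duration`, `reqLen`, `resLen`, `actionType`) and the operation type codes are borrowed from the word generation example and rules given later in this description, and the validation rules are assumptions.

```python
from dataclasses import dataclass

# Operation type codes as listed in the word segmentation rules.
ACTION_TYPES = {0: "add", 1: "delete", 2: "query", 3: "modify"}

@dataclass(frozen=True)
class UserActionLog:
    userid: str        # user ID
    hardwareid: str    # terminal ID
    appcode: str       # application software code
    trhour: int        # hour of day at which the operation occurred
    duration: int      # operation duration in seconds
    action_type: int   # key of ACTION_TYPES
    req_len: int       # length of the request parameters in bytes
    res_len: int       # length of the response results in bytes

    @property
    def action_name(self) -> str:
        return ACTION_TYPES[self.action_type]

def normalize(raw: dict) -> UserActionLog:
    """Coerce a raw key/value record into the normalized form, rejecting invalid values."""
    action = int(raw["actionType"])
    hour = int(raw["trhour"])
    if action not in ACTION_TYPES:
        raise ValueError(f"unknown action type: {action}")
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    return UserActionLog(
        userid=str(raw["userid"]), hardwareid=str(raw["hardwareid"]),
        appcode=str(raw["appcode"]), trhour=hour, duration=int(raw["duration"]),
        action_type=action, req_len=int(raw["reqLen"]), res_len=int(raw["resLen"]),
    )

rec = normalize({"userid": 1200211123456789, "hardwareid": "000426", "duration": 20,
                 "trhour": 10, "appcode": 100026, "resLen": 100, "reqLen": 200,
                 "actionType": 1})
print(rec)
```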

(2) Determine whether an LDA analysis model exists. If it does not, build a Spark-based machine learning LDA (document topic model) analysis model, use the input user behavior data as the input documents, train the model on a large volume of document data until it converges, and save the trained model. If it does exist, analyze and score the newly collected user behavior data against a set threshold; when the score is below the threshold, the behavior is considered suspicious.

(3) Alarm information is generated from the suspicious behavior data and shown on the alarm display board of the front-end page to alert the administrator.
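The scoring and alarm steps can be sketched as follows. This is not the Spark implementation referred to above; it is a plain Python illustration that assumes a trained model is already available as θ (topic proportions for the document being scored) and φ (per-topic word probabilities). The rule of multiplying the probabilities of a log's two words, the floor value for unseen words, the threshold, and all names are assumptions.

```python
def word_probability(word, theta_d, phi, floor=1e-9):
    """p(w | d) = sum_k p(w | z_k) * p(z_k | d); unseen words get a small floor probability."""
    p = sum(phi_k.get(word, 0.0) * t for phi_k, t in zip(phi, theta_d))
    return max(p, floor)

def score_log(words, theta_d, phi):
    """Score of one log: product of the probabilities of its words (an assumed combination rule)."""
    score = 1.0
    for w in words:
        score *= word_probability(w, theta_d, phi)
    return score

def check_log(log, words, theta_d, phi, threshold):
    """Return an alarm record if the score is below the threshold, otherwise None."""
    score = score_log(words, theta_d, phi)
    if score >= threshold:
        return None
    return {"userid": log["userid"], "hardwareid": log["hardwareid"],
            "appcode": log["appcode"], "score": score}

# Hypothetical trained parameters: 2 topics, showing only the words used below.
theta_d = [0.8, 0.2]
phi = [{"u1_t1_app1_10": 0.6, "2_1_1_1": 0.5},
       {"u1_t1_app1_10": 0.1, "2_1_1_1": 0.2}]
THRESHOLD = 0.01   # hypothetical; in practice it would be tuned on historical data

usual = {"userid": "u1", "hardwareid": "t1", "appcode": "app1"}
unusual = {"userid": "u1", "hardwareid": "t9", "appcode": "app7"}
ok = check_log(usual, ["u1_t1_app1_10", "2_1_1_1"], theta_d, phi, THRESHOLD)
alarm = check_log(unusual, ["u1_t9_app7_3", "1_5_4_4"], theta_d, phi, THRESHOLD)
print(ok, alarm)
```

The alarm record carries the terminal and application identifiers so that the front end can show which terminal and application the suspicious log belongs to.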

Clustering algorithm (LDA):

LDA can be used to identify latent topic information in a large document collection or corpus. It uses the bag-of-words method, which treats each document as a word-frequency vector and thereby converts text into numerical information that is easy to model. The bag-of-words method ignores the order of words, which simplifies the problem and also leaves room for improving the model. Each document represents a probability distribution over topics, and each topic represents a probability distribution over words.

LDA can be regarded as the following clustering process:

(1) Each topic corresponds to the "centroid" of a cluster, and each document is regarded as a sample in the data set.

(2) Topics and documents are both regarded as lying in one vector space, in which each feature vector is a word-frequency vector (bag of words).

(3) Unlike traditional clustering methods, which measure similarity with a distance formula, LDA uses a function based on a statistical model, and this statistical model describes how the documents are generated.

LDA rests on a common-sense assumption: all texts in a document collection share a certain number of latent topics. On this assumption, it characterizes the whole document collection as a set of latent topics, and each text is represented as a mixture of these latent topics in specific proportions. The core formula of LDA is:

p(w | d) = Σ_{k=1}^{K} p(w | z_k) * p(z_k | d)

Here d denotes a document, w denotes a word, and z_k denotes the k-th topic, with K topics in total. In plain terms: document d belongs to topic z_k with probability p(z_k | d), and the prior probability of word w under topic z_k is p(w | z_k). Under topic z_k, therefore, the probability that word w appears in the document is p(w | z_k) * p(z_k | d). Summing this probability over all K topics gives p(w | d), the probability (word frequency) that word w appears in document d.
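A small numeric check of this summation over topics is given below. The vocabulary and the values of p(z_k | d) and p(w | z_k) are hypothetical; the point is only that mixing K word distributions with weights that sum to 1 again yields a distribution over the vocabulary.

```python
def word_given_document(p_topic_given_doc, p_word_given_topic):
    """p(w | d) for every word: sum over k of p(w | z_k) * p(z_k | d)."""
    n_words = len(p_word_given_topic[0])
    return [
        sum(p_word_given_topic[k][i] * p_topic_given_doc[k]
            for k in range(len(p_topic_given_doc)))
        for i in range(n_words)
    ]

# Hypothetical values: K = 2 topics over a 4-word vocabulary.
vocabulary = ["login", "query", "delete", "export"]
p_z_d = [0.7, 0.3]                       # p(z_k | d)
p_w_z = [[0.5, 0.4, 0.05, 0.05],         # p(w | z_1)
         [0.05, 0.05, 0.4, 0.5]]         # p(w | z_2)
p_w_d = word_given_document(p_z_d, p_w_z)
for word, p in zip(vocabulary, p_w_d):
    print(f"{word}: {p:.3f}")
```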

LDA is a hierarchical Bayesian model: the model parameters are themselves treated as random variables, so that parameters controlling the parameters can be introduced and the model is thoroughly "randomized". In the LDA model, a Dirichlet distribution serves as the prior of the multinomial topic distribution of document d. Parameter estimation is currently the most important task in LDA, and there are two main methods: Gibbs sampling (computationally heavy, but relatively simple and accurate) and variational Bayesian inference (computationally light, but less accurate).

Building the LDA model for user behavior analysis comprises the following steps:

(1) Brief introduction to the LDA (Latent Dirichlet Allocation) document topic model

LDA is an unsupervised machine learning technique that can be used to identify latent topic information in a large document collection or corpus. It uses the bag-of-words method, which treats each document as a word-frequency vector and thereby converts text into numerical information that is easy to model. Each document represents a probability distribution over topics, and each topic represents a probability distribution over words. If we generate a document, the probability with which each word in it occurs is:

p(w | d) = Σ_{k=1}^{K} p(w | z_k) * p(z_k | d)

where d is the document, w is a word, and z_k is the k-th of K topics.

(2) In the present invention, the concepts of the LDA document topic model are mapped as follows:

Model       User behavior log
Document    UserAction data
Word        Words formed by processing the UserAction data
Topic       Topics relating to user behavior

The essence of LDA model training is to obtain the probability distribution function of the words in a document, and then to generate one word at a time according to that distribution. Therefore, for LDA training on user behavior logs to yield meaningful results, the collected user behavior logs must first be segmented into words: the normalized data contains many fields and the raw records do not repeat, so a model that converges meaningfully cannot be trained on them directly.

(3) Word segmentation of user behavior logs

Each user behavior log is divided into two independent words: userid_hardwareid_appcode_trhour and actionType_duration_resLen_reqLen. The rules for building the words are as follows:

Time of day (time)

Uses the trhour field of the data: the hour value of the time at which the operation was generated.

Request Bytes (size of the request parameters)

Uses the number of the bin into which the value of the reqLen field falls. The bin edges are [0, 512, 1024, 2048, 4096, ...], in bytes. If reqLen equals 256 bytes, the value is 1; if reqLen equals 760 bytes, the value is 2.

Response Bytes (size of the response results)

Uses the number of the bin into which the value of the resLen field falls. The bin edges are [0, 512, 1024, 2048, 4096, ...], in bytes. If resLen equals 256 bytes, the value is 1; if resLen equals 760 bytes, the value is 2.

ActionType (operation type)

0 corresponds to add; 1 corresponds to delete; 2 corresponds to query; 3 corresponds to modify.

Duration (operation duration)

The number of the bin into which the time of the whole operation, from request to response, falls. The bin edges are [0, 10, 20, 30, 40, 50, 60, 70, ...], in seconds. If duration equals 10 seconds, the value is 2.

Word generation example

a. A user behavior log is userid:1200211123456789, hardwareid:000426, duration:20, trhour:10, appcode:100026, resLen:100, reqLen:200, actionType:1. The generated words are:

I. The first word is: "1200211123456789_000426_100026_10".

II. The second word is: "1_1_2_2_4".
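The segmentation rules can be sketched in Python as below. The sketch makes several assumptions that should be read as such: the bin edge lists are cut off where the description elides them with "...", a bin number is taken to be the 1-based index of the interval whose lower edge is inclusive (which reproduces "256 bytes gives 1", "760 bytes gives 2" and "10 seconds gives 2"), and the second word follows the stated field order actionType_duration_resLen_reqLen literally, so its output for the example record is derived from those rules rather than copied from the printed example. The function names are hypothetical.

```python
import bisect

# Bin edges as listed in the description, truncated where the text elides them.
BYTE_EDGES = [0, 512, 1024, 2048, 4096]
DURATION_EDGES = [0, 10, 20, 30, 40, 50, 60, 70]

def bin_number(value, edges):
    """1-based number of the interval [edges[i-1], edges[i]) that contains value."""
    return bisect.bisect_right(edges, value)

def make_words(log):
    """Build the two words for one normalized user behavior log."""
    first = "_".join(str(log[key]) for key in ("userid", "hardwareid", "appcode", "trhour"))
    second = "_".join(str(part) for part in (
        log["actionType"],
        bin_number(log["duration"], DURATION_EDGES),
        bin_number(log["resLen"], BYTE_EDGES),
        bin_number(log["reqLen"], BYTE_EDGES),
    ))
    return first, second

example = {"userid": "1200211123456789", "hardwareid": "000426", "duration": 20,
           "trhour": 10, "appcode": "100026", "resLen": 100, "reqLen": 200,
           "actionType": 1}
first_word, second_word = make_words(example)
print(first_word)
print(second_word)
```

Binning the continuous fields is what makes words repeat across logs, which the training step relies on.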

The threat detection system based on user behavior logs of the present invention has been applied in the big data analysis system of an enterprise, where it effectively raises alarms on abnormal behavior.

With the solution of the present invention, the machine learning method for analyzing abnormal behavior based on user behavior logs can quickly find abnormal user behavior and report it promptly to an administrator or user, which improves the efficiency of threat discovery and handling. The Spark-based machine learning LDA analysis model completes data analysis at near real-time speed, which improves the timeliness of the system's audit and alarm functions.

The above embodiments are only examples of the claimed solution and do not limit the specific implementation of the present invention.

Claims

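1. A method for detecting abnormal user behavior based on user behavior log data, characterized in that the method comprises the following steps:

1) collecting user log data and normalizing it;

2) scoring newly collected user behavior logs with an LDA analysis model;

3) when the score is below a predetermined score, determining that the newly collected user behavior log is a suspicious user behavior log;

4) determining the user terminal and application software corresponding to the suspicious user behavior log, and generating alarm information.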

2. The method according to claim 1, wherein in the LDA analysis model the user behavior log data comprises the following terms: user ID, user terminal ID, application software code, operation time, and operation type; the documents and topics required as input to the LDA analysis model are built from these terms; the probability of occurrence of each user behavior log is then calculated according to the LDA algorithm, and that probability is used as the score of that user behavior log.

3. The method according to claim 2, wherein the probability that each word in the user behavior log occurs in the document collection is expressed as:

p(w | d) = Σ_{k=1}^{K} p(w | z_k) * p(z_k | d)

where d is a document, w is a word, and z_k is the k-th of K topics; the score of a newly collected user behavior log is determined from this probability.

4. The method according to claim 1, wherein before step 1), the LDA analysis model is trained on user behavior log data;

the user log data serves as the documents for training the LDA analysis model, the words formed by processing the user operation data serve as its words, and topics relating to user operation types serve as its topics.

5. The method according to claim 1, wherein the user behavior log data is divided into two words.

6. A device for detecting abnormal user behavior based on user behavior log data, characterized in that the device comprises:

a data collection and processing module, which collects user log data and normalizes it;

a score evaluation module, which scores newly collected user behavior logs with an LDA analysis model;

a score judgment module, which determines that a newly collected user behavior log is a suspicious user behavior log when its score is below a predetermined score;

an alarm module, which determines the user terminal and application software corresponding to the suspicious user behavior log and generates alarm information.

7. The device according to claim 6, wherein in the LDA analysis model the user behavior log data comprises the following terms: user ID, user terminal ID, application software code, operation time, and operation type; the documents and topics required as input to the LDA analysis model are built from these terms; the probability of occurrence of each user behavior log is then calculated according to the LDA algorithm, and that probability is used as the score of that user behavior log.

8. The device according to claim 7, wherein the probability that each word in the user behavior log occurs in the document collection is expressed as:

p(w | d) = Σ_{k=1}^{K} p(w | z_k) * p(z_k | d)

where d is a document, w is a word, and z_k is the k-th of K topics; the score of a newly collected user behavior log is determined from this probability.

9. The device according to claim 6, wherein the device further comprises a model training module, which trains the LDA analysis model on user behavior log data.

10. The device according to claim 6, wherein the user behavior log data is divided into two words.

11. A computer-readable storage medium storing computer program instructions which, when executed, implement any one of the methods described above.