CN108011809A

CN108011809A - Anti-data-leakage analysis method and system based on user behavior and document content

Info

Publication number: CN108011809A
Application number: CN201711262779.4A
Authority: CN
Inventors: 魏效征; 王志海; 喻波; 安鹏
Original assignee: Beijing Wondersoft Technology Co Ltd
Current assignee: Beijing Wondersoft Technology Co Ltd
Priority date: 2017-12-04
Filing date: 2017-12-04
Publication date: 2018-05-08

Abstract

The invention discloses an anti-data-leakage analysis method and system based on user behavior and document content. The method comprises the following steps: outgoing-mail behavior data of a user is obtained for a predetermined long time period and for a predetermined short time period, and the data is averaged and normalized to obtain the user's long-term behavior data vector and short-term behavior data vector, respectively; whether the user's outgoing-mail behavior is abnormal is determined by comparing the vector distance between the long-term behavior data vector and the short-term behavior data vector with a predetermined vector distance threshold; for outgoing mail of a user whose behavior is abnormal, the mail content document is extracted and the topic category of the document is determined; and the exact text-matching policy rule associated with the topic category is selected and used to determine whether the document contains sensitive data. The technical solution can significantly improve the accuracy with which sensitive-data leakage events are judged and effectively reduces the false alarm rate of judgments made by content matching alone.

Description

Anti-data-leakage analysis method and system based on user behavior and document content

Technical field

The present invention relates to the field of data security, and in particular to an anti-data-leakage analysis method and system based on user behavior and document content.

Background art

The main function of an enterprise anti-data-leakage system is to prevent enterprise employees from sending sensitive data outside the enterprise. Accurately judging whether the data an employee sends out is sensitive is therefore the key to such a system. The traditional approach relies on exact matching, for example the number of hits of keywords or regular expressions, and tends to produce many false alarms. An anti-data-leakage system therefore urgently needs to take more factors into account when judging whether an employee's outgoing-data behavior constitutes a security incident.

Reference document 1

Publication number: 105357217A; title of invention: Data theft risk assessment method and system based on user behavior analysis

This prior art analyzes the network behavior of internal-network terminal users to discover terminals that potentially perform risky operations, thereby protecting data security and improving the security of the internal network.

This prior art obtains the operation behavior pairs of a terminal user; from the operation behavior pairs, it obtains the dangerous operation behavior pairs and their number and calculates a first risk coefficient; from the dangerous operation behavior pairs, it obtains the number of matches and mismatches between the business type of the visited website and the registered business type and calculates a second risk coefficient; from the copy behavior, it obtains the dangerous copy behaviors and the number of dangerously copied files and calculates a third risk coefficient and a fourth risk coefficient; and from the first, second, third, and fourth risk coefficients, it calculates the terminal risk coefficient using a preset risk assessment model.

The above prior art calculates the risk coefficient from the operation pairs of the terminal, including: intercepting a network data stream; performing protocol parsing on the network data stream to obtain a character stream; obtaining preset detection character strings and/or syntax analysis library functions corresponding to a programming language; and judging, according to the detection character strings and/or syntax analysis library functions, whether the parsed character stream contains source code and, if so, blocking the network data stream.

The above patent document has the following disadvantages:

(1) Risk is assessed from the operation pairs of the user at the terminal, and the danger judgment is made from the resulting risk value without considering the content of the data itself, which easily produces a very high false alarm rate.

(2) An anomaly in actual terminal operation behavior is not necessarily equivalent to a data-theft security incident. Abnormal operation behavior is related to multiple factors, such as the operator's mood or a temporary change of work duties; without fusing other factors into the judgment, the approach is of little practical use.

Summary of the invention

To solve the above technical problems, the present invention provides an anti-data-leakage analysis method based on user behavior and document content, characterized in that the method comprises the following steps:

1) Obtain the user's outgoing-mail behavior data for a predetermined long time period and for a predetermined short time period, and average and normalize the data to obtain the user's long-term behavior data vector and short-term behavior data vector, respectively;

2) Calculate the vector distance between the user's long-term behavior data vector and short-term behavior data vector, and determine, by comparing the calculated vector distance with a predetermined vector distance threshold, whether the user's outgoing-mail behavior is abnormal; if it is abnormal, go to step 3), otherwise go to step 5);

3) For outgoing mail of a user whose behavior is abnormal, extract the mail content document and determine the topic category of the document;

4) Select, according to the topic category of the document, the exact text-matching policy rule associated with that category, and use the matching policy rule to determine whether the document contains sensitive data;

5) End.
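The control flow of steps 1) to 5) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `analyze_outgoing_mail` and its callable parameters (`distance`, `classify_topic`, `has_sensitive_data`) are hypothetical placeholders for the distance measure, topic classifier, and rule matcher described later in the text.

```python
from typing import Callable, Sequence


def analyze_outgoing_mail(
    long_term: Sequence[float],
    short_term: Sequence[float],
    threshold: float,
    distance: Callable[[Sequence[float], Sequence[float]], float],
    classify_topic: Callable[[str], str],
    has_sensitive_data: Callable[[str, str], bool],
    document_text: str,
) -> bool:
    """Return True only when behavior is abnormal AND the content rules match."""
    # Step 2: compare the vector distance with the predetermined threshold.
    if distance(long_term, short_term) <= threshold:
        return False  # normal behavior: go straight to step 5 (end)
    # Step 3: determine the topic category of the mail content document.
    topic = classify_topic(document_text)
    # Step 4: apply the exact-matching rules associated with that topic.
    return has_sensitive_data(topic, document_text)
```

Passing the three stages in as callables keeps the gating logic separate from the choice of distance measure, classifier, and rule set, none of which this sketch fixes.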

According to an embodiment of the invention, preferably, the outgoing-mail behavior data in step 1) includes: mail sending time, mail sender address, mail sender domain, mail recipient address, mail recipient domain, mail recipient top-level domain, mail subject type, number of mails sent, number of mails received, mail size, mail client IP address, and mail server IP address.

According to an embodiment of the invention, preferably, the vector distance between the user's long-term behavior data vector and short-term behavior data vector in step 2) is the Mahalanobis distance, and the vector distance threshold is determined by the chi-square test method; if the vector distance is greater than the vector distance threshold, the user's outgoing-mail behavior is judged to be abnormal.

According to an embodiment of the invention, preferably, in step 3), the extracted mail document content is segmented into words, and the linear discriminant analysis (LDA) method is then used to determine the topic category of the document from the words it contains.

According to an embodiment of the invention, preferably, the exact matching policy rules in step 4) include regular expression matching policy rules and keyword matching policy rules.

To solve the above technical problems, the present invention provides an anti-data-leakage analysis system based on user behavior and document content, characterized in that the system comprises:

a data vector establishing module, which obtains the user's outgoing-mail behavior data for a predetermined long time period and for a predetermined short time period, and averages and normalizes the data to obtain the user's long-term behavior data vector and short-term behavior data vector, respectively;

an anomaly determination module, which calculates the vector distance between the user's long-term behavior data vector and short-term behavior data vector and determines, by comparing the calculated vector distance with a predetermined vector distance threshold, whether the user's outgoing-mail behavior is abnormal;

a document topic category determination module, which, for outgoing mail of a user whose behavior is abnormal, extracts the mail content document and determines the topic category of the document;

an exact analysis module, which selects, according to the topic category of the document, the exact text-matching policy rule associated with that category and uses the matching policy rule to determine whether the document contains sensitive data.

According to an embodiment of the invention, preferably, the outgoing-mail behavior data includes: mail sending time, mail sender address, mail sender domain, mail recipient address, mail recipient domain, mail recipient top-level domain, mail subject type, number of mails sent, number of mails received, mail size, mail client IP address, and mail server IP address.

According to an embodiment of the invention, preferably, the vector distance between the user's long-term behavior data vector and short-term behavior data vector is the Mahalanobis distance, and the vector distance threshold is determined by the chi-square test method;

if the anomaly determination module determines that the vector distance is greater than the vector distance threshold, the user's outgoing-mail behavior is judged to be abnormal.

According to an embodiment of the invention, preferably, the document topic category determination module first uniformly converts the mail document to be inspected to txt plain-text format, segments the extracted mail document content into words, and then uses the linear discriminant analysis (LDA) method to determine the topic category of the document from the words it contains.

To solve the above technical problems, the present invention provides a computer-readable storage medium, characterized in that the medium includes computer program instructions, and any one of the above methods is implemented by executing the computer program instructions.

With the technical solution of the present invention, the dual sensitive-data leakage detection method based on user behavior and content matching can significantly improve the accuracy with which sensitive-data leakage events are judged and enhance an enterprise's ability to manage and control the security of source code data. The method can effectively reduce the false alarm rate of judgments made by content matching alone.

Brief description of the drawings

Fig. 1 is the analysis flow chart of the present invention.

Detailed description of the embodiments

The present invention proposes and implements a data leakage detection method that considers data content and user behavior at the same time. On top of matching data content, the method also considers user behavior, thereby greatly reducing the number of false alarms of an anti-data-leakage system.

The present invention is further described below with reference to the accompanying drawing and specific embodiments, but the protection scope of the present invention is not limited thereto.

The dual monitoring mechanism proposed by the present invention, based on user behavior and data content, addresses the need to detect the sensitivity of enterprise data and effectively reduces the false alarm rate of security incidents in enterprise anti-data-leakage systems. In this patent, data content is monitored according to a topic model and an exact feature matching model; user behavior is monitored mainly in terms of time, quantity, and outgoing-recipient community relationships; finally, the results of content detection and behavior detection are combined through a logical combination.

First-level detection: anomaly analysis of user behavior. The outgoing-mail behavior of each enterprise user is analyzed in the following aspects: sending time, mail sender address, mail sender domain, mail recipient address, mail recipient domain, mail recipient top-level domain, mail subject type, number of mails sent, number of mails received, mail size, mail client IP address, mail server IP address, and so on. By analyzing user data over a long period (typically more than three months), the average of each of these aspects is computed for each user and normalized, yielding the user's daily behavior data vector. Specifically, the average is subtracted from the value of each data item to be counted and the result is divided by the standard deviation; the resulting value is used as the exponent of e, and a softmax function is finally applied to obtain the user's daily behavior data vector. The user's data for a single day, three days, or one week is normalized by the same method to obtain the user's short-term behavior vector.
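The normalization described above (subtract the mean, divide by the standard deviation, exponentiate, apply softmax) can be sketched as below. The function name `behavior_vector` and the `reference` argument are assumptions for illustration; the source does not state which sample supplies the mean and standard deviation of each feature, so the sketch takes them from a caller-supplied reference sample.

```python
import math
from statistics import mean, pstdev


def behavior_vector(period_values, reference):
    """Normalize one period's feature values into a behavior vector.

    period_values[i]: value of feature i in the period being summarized.
    reference[i]: observations of feature i used for the mean and standard
                  deviation (assumption: chosen by the caller).
    """
    z = []
    for value, sample in zip(period_values, reference):
        sigma = pstdev(sample)
        # Assumption: a constant feature contributes a zero deviation.
        z.append((value - mean(sample)) / sigma if sigma > 0 else 0.0)
    # Softmax: exponentiate each z-score and divide by the sum of exponentials.
    # Subtracting max(z) first leaves the result unchanged and avoids overflow.
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

Because softmax already exponentiates its input, taking e to the power of each z-score and then normalizing is treated here as a single softmax step; the output components are positive and sum to 1.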

The distance between the long-term average user behavior vector and the short-term behavior vector is calculated (the Mahalanobis distance is recommended), and the distance threshold is obtained using the chi-square test method. If the distance between the short-term behavior vector and the long-term behavior vector is greater than the threshold, the user's mail-sending behavior for that day is deemed abnormal. A short-term behavior anomaly is not guaranteed to be a data leakage security incident, so the data itself must then be analyzed.
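A minimal sketch of the Mahalanobis distance between two behavior vectors follows, using only the Python standard library. The helper names (`covariance`, `solve`, `mahalanobis`) are hypothetical, and the source does not say how the covariance matrix is estimated; this sketch assumes it is estimated from a set of historical per-period behavior vectors.

```python
import math


def covariance(samples):
    """Sample covariance matrix of a list of equal-length vectors."""
    n, k = len(samples), len(samples[0])
    mu = [sum(s[j] for s in samples) / n for j in range(k)]
    return [
        [sum((s[a] - mu[a]) * (s[b] - mu[b]) for s in samples) / (n - 1)
         for b in range(k)]
        for a in range(k)
    ]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (aug[i][k] - tail) / aug[i][i]
    return x


def mahalanobis(u, v, cov):
    """sqrt((u - v)^T cov^-1 (u - v)), computed without forming the inverse."""
    d = [a - b for a, b in zip(u, v)]
    y = solve(cov, d)  # y = cov^-1 d
    return math.sqrt(sum(a * b for a, b in zip(d, y)))
```

With an identity covariance matrix the result reduces to the Euclidean distance; features with larger variance contribute less to the distance, which is the reason for preferring this measure over a plain Euclidean one.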

The chi-square test is a widely used hypothesis testing method. Its applications in statistical inference on categorical data include: the chi-square test for comparing two rates or two constituent ratios; the chi-square test for comparing multiple rates or multiple constituent ratios; and correlation analysis of categorical data. The chi-square test measures the degree of deviation between the actual observed values of a statistical sample and the theoretically inferred values. This deviation determines the size of the chi-square value: the larger the chi-square value, the less the observations conform to the theory; the smaller the chi-square value, the smaller the deviation and the better they conform. If the two sets of values are exactly equal, the chi-square value is 0, indicating that the observations conform to the theoretical values completely.
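The deviation measure described above corresponds to Pearson's chi-square statistic, the sum over categories of (observed - expected)^2 / expected. A short sketch, with the function name `chi_square_statistic` chosen only for illustration:

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

When every observed count equals its expected count the statistic is 0, and it grows as the observations move away from the expected values.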

For each vector of the sample data to be counted, the distance to the mean is calculated, and the resulting distance values are all mapped onto the chi-square function; the distance threshold is obtained by taking the value at the zero point of the chi-square function.
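The source does not define the "zero point" it uses. As a hedged illustration of one common way to derive such a threshold, and not necessarily the patent's method: if the behavior vectors are assumed to be roughly Gaussian, the squared Mahalanobis distance to the mean follows a chi-square distribution whose degrees of freedom equal the number of features, so a high quantile of that distribution can serve as the cutoff. The function names, the confidence level, and the use of the Wilson-Hilferty approximation (chosen to avoid third-party dependencies) are all assumptions of this sketch.

```python
import math
from statistics import NormalDist


def chi2_quantile(p, dof):
    """Wilson-Hilferty approximation to the chi-square quantile at level p."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * math.sqrt(c)) ** 3


def distance_threshold(p, dof):
    """Cutoff on the (unsquared) Mahalanobis distance for confidence level p."""
    return math.sqrt(chi2_quantile(p, dof))
```

For example, `distance_threshold(0.99, 12)` gives a cutoff for a 12-feature behavior vector such that, under the Gaussian assumption, about 1% of normal periods would exceed it.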

Second-level detection: topic model analysis of data content. The attachments or body of a mail contain a large amount of text; determining, from the perspective of topic model analysis, what type a document (mainly a mail attachment) is matters greatly for the subsequent exact content matching. After the document content is segmented into words, the LDA analysis method is used to determine the topic category of the document from the words it contains.

Third-level detection: exact matching analysis of data content. A document that involves sensitive data must contain definite sensitive features, whether keywords or digit-string features such as regular expressions. If all three detection stages above are satisfied, the document under inspection necessarily contains sensitive data.

With reference to Fig. 1, the processing flow of the dual detection method proposed by the present invention, based on user behavior and data content, is described in detail below. The flow mainly comprises three processes: behavior anomaly analysis, topic analysis, and exact content matching.

(1) anomaly analysis based on user behavior

For a specific sender, the sender's historical sending information is first collected, in particular the number of mails sent, mail size, recipient address information, mail domain category, and so on associated with the sender, finally yielding a normalized mail behavior vector (number of mails sent, number of mails, number of recipients, number of mail domains, ...); next, the mail behavior vector associated with the sender for the current day or the current time interval is calculated; finally, the Mahalanobis distance or the cosine of the included angle between the two vectors is calculated. If the Mahalanobis distance exceeds the threshold, the mail-sending behavior is deemed abnormal.
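The cosine of the included angle mentioned above as an alternative to the Mahalanobis distance can be computed as in this brief sketch; the function name `cosine_similarity` is illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two behavior vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Unlike a distance, a cosine close to 1 indicates similar behavior, so an anomaly check based on it would flag values below a cutoff rather than above one.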

Behavior analysis can be performed by many methods, including judging by the Mahalanobis distance between the long-term behavior vector and the short-term behavior vector, or by the distance between a population mean vector and an individual behavior vector. The various vector distance calculation methods covered by the present invention do not depart from its essence and fall within its protection scope.

(2) Topic determination of document content

The analysis and training of topic categories are carried out using the LDA method and should be completed before mail is sent; that is, the LDA model should be built in advance. When mail is sent, the document to be inspected (for example in doc, xls, or pdf format) is first uniformly converted to txt plain-text format; the content of the text is then segmented into words according to a dictionary, and the LDA method is used to determine the topic category to which the text belongs.
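The source says only that the text is segmented "according to a dictionary" and does not name an algorithm. One simple dictionary-based technique is forward maximum matching, sketched below as an assumption; the function name `segment` and the `max_word_len` parameter are hypothetical.

```python
def segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words
```

For instance, `segment("客户账户余额", {"客户", "账户", "余额"})` splits the string into the three dictionary words; the resulting word list is what a topic classifier would consume.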

(3) Exact content matching

According to the result of the topic category determination, the exact matching policy rules associated with the category are selected. The policy rules include a regular expression matching policy and a keyword threshold matching policy. For regular expression matching, a successful match must be followed by script matching; if matching fails, the document content is normal and contains no sensitive data. If regular expression matching and script matching indicate that sensitive data may be present, keyword matching must additionally be performed; if it fails, the document content is normal and contains no sensitive data. If keyword matching succeeds, the document contains sensitive data and a corresponding judgment result is output, for example by sending a warning to the user and the administrator and writing a log record. This is merely an example of output and is not limiting; all other ways of outputting the result fall within the protection scope of the invention.

Anti-data-leakage content rules usually state that the number of occurrences of certain keywords exceeds some threshold, that the number of kinds of regular expression features appearing exceeds a specific threshold, or a particular logical combination of the two. Exact content matching is a common anti-data-leakage method and is easy to implement.
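The two rule types and their logical combination can be sketched as follows. All function names, parameters, and the `mode` switch are assumptions made for illustration; the source does not prescribe how rules are represented.

```python
import re


def keyword_rule(text, keywords, threshold):
    """True when the listed keywords occur more than `threshold` times in total."""
    return sum(text.count(word) for word in keywords) > threshold


def regex_kind_rule(text, patterns, threshold):
    """True when more than `threshold` distinct patterns occur in the text."""
    kinds = sum(1 for pattern in patterns if re.search(pattern, text))
    return kinds > threshold


def content_rule(text, keywords, kw_threshold, patterns, kind_threshold, mode="and"):
    """Combine both rules with a logical AND (default) or OR."""
    kw = keyword_rule(text, keywords, kw_threshold)
    rx = regex_kind_rule(text, patterns, kind_threshold)
    return (kw and rx) if mode == "and" else (kw or rx)
```

Counting distinct pattern kinds rather than total hits follows the wording above, where the threshold applies to how many kinds of regular expression feature appear.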

When a user sends mail, if the user's mail-sending behavior is first detected to be abnormal, for example a sharp change in the number of mails sent, a sharp increase in sending frequency, or a marked difference in the group of intended recipients, a content check is required. If topic analysis in the content check can determine the topic of the document, and exact content matching for that topic hits a matching rule, the sending behavior can be deemed a data leak.

The present invention provides an anti-data-leakage analysis system based on user behavior and document content, characterized in that the system includes:

The outgoing-mail behavior data includes: mail sending time, mail sender address, mail sender domain, mail recipient address, mail recipient domain, mail recipient top-level domain, mail subject type, number of mails sent, number of mails received, mail size, mail client IP address, and mail server IP address.

The vector distance between the user's long-term behavior data vector and short-term behavior data vector is the Mahalanobis distance, and the vector distance threshold is determined by the chi-square test method;

The document topic category determination module first uniformly converts the mail document to be inspected to txt plain-text format, segments the extracted mail document content into words, and then uses the linear discriminant analysis (LDA) method to determine the topic category of the document from the words it contains.

The exact matching policy rules used by the exact analysis module include regular expression matching policy rules and keyword matching policy rules.

Shortly before applying to resign, an employee of a certain bank frequently sent sensitive internal bank documents out through the mailbox; both the number of outgoing mails and the byte count of outgoing mail increased markedly.

Using the dual monitoring method described in this patent, the user's sending behavior was judged to be a definite sensitive-data leakage security incident, and a blocking control measure was therefore taken, effectively protecting the bank's data assets.

The dual sensitive-data leakage detection method and system proposed here, based on user behavior and content matching, can significantly improve the accuracy with which sensitive-data leakage events are judged and enhance an enterprise's ability to manage and control the security of source code data. The method can effectively reduce the false alarm rate of judgments made by content matching alone.

The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit its protection scope. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. An anti-data-leakage analysis method based on user behavior and document content, characterized in that the method comprises the following steps:

1) obtaining the user's outgoing-mail behavior data for a predetermined long time period and for a predetermined short time period, and averaging and normalizing the data to obtain the user's long-term behavior data vector and short-term behavior data vector, respectively;

2) calculating the vector distance between the user's long-term behavior data vector and short-term behavior data vector, and determining, by comparing the calculated vector distance with a predetermined vector distance threshold, whether the user's outgoing-mail behavior is abnormal; if it is abnormal, going to step 3), otherwise going to step 5);

3) for outgoing mail of a user whose behavior is abnormal, extracting the mail content document and determining the topic category of the document;

4) selecting, according to the topic category of the document, the exact text-matching policy rule associated with that category, and using the matching policy rule to determine whether the document contains sensitive data;

5) ending.

2. The method according to claim 1, wherein the outgoing-mail behavior data in step 1) includes: mail sending time, mail sender address, mail sender domain, mail recipient address, mail recipient domain, mail recipient top-level domain, mail subject type, number of mails sent, number of mails received, mail size, mail client IP address, and mail server IP address.

3. The method according to claim 1, wherein the vector distance between the user's long-term behavior data vector and short-term behavior data vector in step 2) is the Mahalanobis distance, and the vector distance threshold is determined by the chi-square test method; if the vector distance is greater than the vector distance threshold, the user's outgoing-mail behavior is judged to be abnormal.

4. The method according to claim 1, wherein in step 3) the extracted mail document content is segmented into words, and the linear discriminant analysis (LDA) method is then used to determine the topic category of the document from the words it contains.

5. The method according to claim 1, wherein the exact matching policy rules in step 4) include regular expression matching policy rules and keyword matching policy rules.

6. An anti-data-leakage analysis system based on user behavior and document content, characterized in that the system comprises:

a data vector establishing module, which obtains the user's outgoing-mail behavior data for a predetermined long time period and for a predetermined short time period, and averages and normalizes the data to obtain the user's long-term behavior data vector and short-term behavior data vector, respectively;

an anomaly determination module, which calculates the vector distance between the user's long-term behavior data vector and short-term behavior data vector and determines, by comparing the calculated vector distance with a predetermined vector distance threshold, whether the user's outgoing-mail behavior is abnormal;

a document topic category determination module, which, for outgoing mail of a user whose behavior is abnormal, extracts the mail content document and determines the topic category of the document;

an exact analysis module, which selects, according to the topic category of the document, the exact text-matching policy rule associated with that category and uses the matching policy rule to determine whether the document contains sensitive data.

7. The system according to claim 6, wherein the outgoing-mail behavior data includes: mail sending time, mail sender address, mail sender domain, mail recipient address, mail recipient domain, mail recipient top-level domain, mail subject type, number of mails sent, number of mails received, mail size, mail client IP address, and mail server IP address.

8. The system according to claim 6, wherein the vector distance between the user's long-term behavior data vector and short-term behavior data vector is the Mahalanobis distance, and the vector distance threshold is determined by the chi-square test method;

if the anomaly determination module determines that the vector distance is greater than the vector distance threshold, the user's outgoing-mail behavior is judged to be abnormal.

9. The system according to claim 6, wherein the document topic category determination module first uniformly converts the mail document to be inspected to txt plain-text format, segments the extracted mail document content into words, and then uses the linear discriminant analysis (LDA) method to determine the topic category of the document from the words it contains.

10. A computer-readable storage medium, characterized in that the medium includes computer program instructions, and the method according to any one of claims 1 to 5 is implemented by executing the computer program instructions.