CN107196942A

CN107196942A - Insider threat detection method based on user language features

Info

Publication number: CN107196942A
Application number: CN201710374486.9A
Authority: CN
Inventors: 杨光; 王继志; 杨英; 陈丽娟; 陈振娅; 文立强
Original assignee: Shandong Computer Science Center
Current assignee: Shandong Computer Science Center National Super Computing Center in Jinan; Shandong Computer Science Center
Priority date: 2017-05-24
Filing date: 2017-05-24
Publication date: 2017-09-22
Anticipated expiration: 2037-05-24
Also published as: CN107196942B

Abstract

The invention discloses an insider threat detection method based on user language features. The method first analyzes each user's language data, extracts language features, and builds a numerical feature vector that characterizes the user's personality traits. It then constructs and trains a classifier to identify users with abnormal personality traits. Finally, it analyzes the feature-vector deviation of those users to filter out false positives, and reports the remaining users to the security administrator as potential malicious insiders for analysis and response. The invention takes the psychological characteristics of the attacker in an insider attack into account, models them from the perspective of personality, and builds an anomaly detection classifier on that model. This compensates for existing detection methods, which focus only on the attack process and ignore the attacker, so that "abnormal" and "malicious" can be distinguished at a fine granularity, insider threats can be analyzed and detected comprehensively, and the high false-positive and false-negative rates of conventional insider threat detection methods are effectively reduced.

Description

Insider threat detection method based on user language features

Technical field

The present invention relates to an insider threat detection method based on user language features, and belongs to the technical field of information security and network security.

Background art

With the development of networks, the security of networked information has attracted increasing public attention, and security products such as antivirus software, firewalls and intrusion detection systems are widely deployed. These products, however, only defend against intrusion and theft from outside. As understanding of network security and the underlying technology have matured, it has become clear that leaks and intrusions caused by insiders account for a significant share of incidents; the 2013 Snowden "PRISM" disclosure is a typical insider leak. Insider threats should therefore be taken as seriously as outside intrusion, yet in practice there is still no effective insider threat detection mechanism.

Insider attackers are usually employees (current or former), contractors or business partners of the enterprise or organization, and they hold access rights to its systems, networks and data. Insider threats are therefore highly concealed and highly damaging, and a traditional defense-in-depth system built on security devices such as firewalls and IDS cannot handle them effectively.

The key to detecting insider threats is thorough internal security auditing. Its core is user-centric: all of a user's key operations and behavior in the system and network are recorded, forming the user's behavior trail inside the internal network. Current internal security auditing focuses on the following behavior:

File audit: audits file operations such as writing, creating, copying and deleting;

Print audit: audits print events initiated by the user and the content of the printed files;

Login audit: audits user logins to the system, as well as logout, restart and shutdown operations;

Process audit: audits the processes the user creates and terminates;

Network monitoring: audits web access behavior, including target IP/port and page requests;

Device audit: audits the use of removable storage devices such as USB drives, for example the files copied or deleted;

Mail audit: audits the user's mail behavior, such as sender and recipient, subject, part of the body, and the number (and type) of attachments.

Multi-dimensional, fine-grained internal security auditing inevitably produces a huge volume of data, and the resulting sharp increase in detection complexity is a challenge for insider threat detection. Modeling user behavior with big data analytics, and in particular big data security research on internal security audit logs, has therefore become a research hotspot. In practice, however, insider threat detection systems suffer from high false-positive rates and poor practicality, because their data sources cover too few dimensions and their detection architecture is monolithic. A practical insider threat detection system is therefore needed.

Existing insider threat detection systems concentrate on anomaly detection: an insider threat classifier is built from users' internal security audit logs. The main steps are as follows:

Internal security audit collection: an internal security audit system is deployed to collect users' internal system and network behavior, such as document access; the data is formatted and passed to the classifier construction module;

Anomaly detection classifier: an anomaly detection method learns a user behavior model from the received data and builds an anomaly detection classifier;

User behavior detection: the anomaly detection classifier examines the user behavior logs of a given time window and decides whether they indicate an insider threat.

The anomaly-based detection method above can handle most insider attack scenarios in practice, but its underlying assumption has a flaw that cannot be ignored. It assumes that the malicious behavior of a malicious insider necessarily differs from normal work behavior, so that malicious behavior can be identified by anomaly detection. In practice, malicious behavior and abnormal behavior are not equivalent: the two sets of behavior are not equal. Considering abnormal behavior alone therefore inevitably leads to a high false-positive rate (normal users identified as malicious) and a high false-negative rate (malicious users identified as normal). Consider the following two examples:

1. Project managers A and B routinely discuss the progress of a joint project by e-mail. One day A e-mails B project technical material that should have been kept confidential (malicious behavior that is not abnormal);

2. Purchasing agent A normally buys from supplier B but on one occasion suddenly buys from supplier C. This alone does not show that A has taken a kickback from C (abnormal behavior that is not necessarily malicious).

As described above, existing insider threat detection systems build their anomaly classifiers from policy checks and anomaly detection on user behavior during the attack process. The assumption that analyzing attack-process features alone is enough blurs the boundary between "abnormal" and "malicious": in practice a user's malicious behavior may not be abnormal, and abnormal behavior may not be malicious. The system and network behavior data collected in internal security audit logs is not sufficient to separate "abnormal" from "malicious" at a fine granularity, so detection systems based on the existing data dimensions inevitably suffer from high false-positive and false-negative rates. A high false-positive rate lowers alert quality: analysts cannot review every alert, system usability drops, and the detection system ends up existing in name only. A high false-negative rate causes the defense to fail outright and exposes the assets of the enterprise or organization to high risk. High false-positive and false-negative rates are the key factors limiting the practicality of insider threat detection systems and the main problem they currently face.

Summary of the invention

To address the high false-positive and false-negative rates of existing insider threat detection methods based on policy checks and behavioral anomaly detection, the invention provides an insider threat detection method based on user language features. It can analyze and detect insider threats comprehensively and effectively reduces the high false-positive and false-negative rates of conventional methods.

The technical solution adopted by the invention is an insider threat detection method based on user language features, characterized in that: the user's language data is first analyzed, language features are extracted, and a numerical feature vector that characterizes the user's personality traits is built; a classifier is then constructed and trained to identify users with abnormal personality traits; finally, the feature-vector deviation of those users is analyzed to filter out false positives, and the remaining users are reported to the security administrator as potential malicious insiders for analysis and response.

Preferably, the insider threat detection method based on user language features comprises the following steps:

1) Data preprocessing: the user language data from the internal audit system is processed in at least three stages: automated auditing, automated content processing and automated aggregation;

2) Personality feature vector construction: the language data of each user is analyzed to obtain the word frequencies of the relevant word categories, which serve as the Chinese LIWC analysis result; the LIWC word categories are then associated with Big Five personality features, and the values of 18 Big Five sub-dimensions are calculated as the user's personality feature vector;

3) Classifier training: a classifier is built by selecting the audited user language data of an initial period, calculating the personality feature vector of each user, and training a one-class support vector machine to obtain the psychological model of the initial user group. In any later period, the personality feature vectors modeled from the user language data of that period are calculated and judged against the user-group psychological model; the set of users judged abnormal is denoted AbnormalUsers;

4) Threat confidence calculation: a threat confidence is calculated for the users in AbnormalUsers to screen them further. The calculation comprises the following steps:

41) For the users in AbnormalUsers, their 18-dimensional feature vectors form a matrix Matrix_1 whose number of rows is the number of users in AbnormalUsers and whose number of columns is 18;

42) The Z-score of each row of Matrix_1 is calculated column by column to obtain Matrix_2, using the following formula:

$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma_j}$$

where, for the i-th user in Matrix_1, $X_{ij}$ is the value of its j-th dimension, $\bar{X}_j$ is the mean of the j-th column of the matrix, and $\sigma_j$ is the standard deviation of the j-th column;

After the Z-scores of every user in Matrix_1 have been calculated, they form the new matrix Matrix_2;

43) The mean of each column of Matrix_2 is calculated, giving the 18-dimensional mean vector Mean_value;

44) For each user in AbnormalUsers in turn, each of the 18 values in the user's (Z-scored) feature vector is compared with the corresponding value in Mean_value; the resulting 18-dimensional binary vector, with a 1 wherever the user's value exceeds the mean, is the user's threat confidence TCD. If the number of 1s in TCD exceeds the threshold K, the user is marked as a normal user and deleted from AbnormalUsers;

45) Steps 41) to 44) are repeated until every user in AbnormalUsers has been judged. The users remaining in AbnormalUsers are reported to the security administrator as potential malicious insiders for analysis and response.
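
The threat confidence steps 41) to 44) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `z_score_columns` and `filter_abnormal_users` are hypothetical, the population standard deviation is assumed, and a zero deviation is replaced by 1.0 to avoid division by zero.

```python
from statistics import mean, pstdev


def z_score_columns(matrix):
    """Column-wise Z-scores: Z_ij = (X_ij - mean_j) / sigma_j."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    # Assumption: population standard deviation; 0.0 is replaced by 1.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in matrix]


def filter_abnormal_users(features, k):
    """features maps user -> personality vector; returns the users kept as suspects."""
    users = list(features)
    matrix_2 = z_score_columns([features[u] for u in users])
    mean_value = [mean(col) for col in zip(*matrix_2)]
    kept = []
    for user, row in zip(users, matrix_2):
        # Binary threat confidence: 1 where the Z-score exceeds the column mean.
        tcd = [1 if z > mv else 0 for z, mv in zip(row, mean_value)]
        if sum(tcd) <= k:  # more than k ones means the user is re-labelled normal
            kept.append(user)
    return kept
```

Because Matrix_2 already holds column-wise Z-scores, each entry of Mean_value is zero up to rounding, so the comparison effectively asks whether the user lies above the group average in that dimension.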

Preferably, the user language data comprises work e-mail data, electronic document data and social application data. The work e-mail data is the text content of the work e-mails the user sends; the electronic document data is the work-related text the user writes and stores in electronic form; the social application data is the text content crawled from the user's social status posts.

Preferably, the processing of the work e-mail data comprises the following steps:

111) Automated auditing: collect the work e-mail data for a given period;

112) Automated content processing: analyze only the e-mails the user sent; for each e-mail, strip the header information and extract only the text content; for a sent e-mail carrying multiple timestamps, consider only the most recently sent text;

113) Automated aggregation: aggregate the text content of each user's work e-mails, after automated auditing and automated content processing, into one large text and store it.
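
The e-mail processing steps can be illustrated with Python's standard `email` package. The sketch below is only one possible reading of steps 112) and 113): the helper names are hypothetical, and it assumes that older text in replies and forwards is quoted with a leading ">", which the source does not specify.

```python
from collections import defaultdict
from email import message_from_string
from email.utils import parseaddr


def sender_and_body(raw_message):
    """Drop all header fields and return (sender address, plain-text body)."""
    msg = message_from_string(raw_message)
    sender = parseaddr(msg.get("From", ""))[1]
    parts = [part.get_payload() for part in msg.walk()
             if part.get_content_type() == "text/plain"]
    return sender, "\n".join(parts)


def newest_text_only(body):
    # Assumption: earlier (replied-to or forwarded) text is quoted with ">".
    kept = [line for line in body.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()


def aggregate_sent_mail(raw_messages):
    """Merge every sent message of a user into one large text per user."""
    per_user = defaultdict(list)
    for raw in raw_messages:
        sender, body = sender_and_body(raw)
        per_user[sender].append(newest_text_only(body))
    return {user: "\n".join(texts) for user, texts in per_user.items()}
```

Grouping by the sender address keeps only text the user wrote, which matches the requirement to analyze sent mail only.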

Preferably, the processing of the electronic document data comprises the following steps:

121) Automated auditing: collect the work-related electronic document data for a given period;

122) Automated content processing: remove headings at all levels, formatting data, and picture and sound data from each electronic document, and extract only its plain text content;

123) Automated aggregation: aggregate the text content of each user's electronic documents, after automated auditing and automated content processing, into one large text and store it.

Preferably, the processing of the social application data comprises the following steps:

131) Automated auditing: collect the social application status data of internal users for a given period;

132) Automated content processing: remove picture, sound and hyperlink data from the social application status data, and process only the text content that the user wrote in each status;

133) Automated aggregation: aggregate the text content of each user's social application data, after automated auditing and automated content processing, into one large text and store it.
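
A comparable sketch for social application statuses is shown below. The record layout (the keys `user`, `text` and `is_forward`) is a hypothetical format chosen for illustration; the `is_forward` check reflects the embodiment's statement that forwarded content is excluded, and hyperlinks are removed with a simple regular expression.

```python
import re
from collections import defaultdict

URL_PATTERN = re.compile(r"https?://\S+")


def aggregate_statuses(posts):
    """posts: iterable of dicts with hypothetical keys 'user', 'text', 'is_forward'."""
    per_user = defaultdict(list)
    for post in posts:
        if post.get("is_forward"):
            continue  # keep only text the user wrote
        text = URL_PATTERN.sub("", post["text"]).strip()
        if text:
            per_user[post["user"]].append(text)
    return {user: "\n".join(texts) for user, texts in per_user.items()}
```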

Preferably, in the personality feature vector construction, the TextMind Chinese psychological analysis system of the Institute of Psychology, Chinese Academy of Sciences, is used to analyze the e-mail text of each user and obtain the word frequencies of the relevant word categories, which serve as the Chinese LIWC analysis result; the LIWC word categories are associated with Big Five personality features, and the values of 18 Big Five sub-dimensions are calculated as the user's personality feature vector.

Preferably, the 18 Big Five sub-dimensions are: anxiety, anger, depression, self-consciousness, impulsiveness, vulnerability, trust, morality, altruism, cooperation, modesty, sympathy, self-efficacy, orderliness, dutifulness, achievement striving, self-discipline and cautiousness.

Preferably, the 18 sub-dimension feature values are calculated as follows:

For the i-th of the 18 sub-dimensions, the statistical correlation between the sub-dimension and the LIWC word categories is:

$$\mathrm{Feat}_i = \{(q_{i,1}, c_{i,1}), (q_{i,2}, c_{i,2}), \ldots, (q_{i,N_i}, c_{i,N_i})\} \quad (1)$$

where $\mathrm{Feat}_i$ denotes the i-th sub-dimension, $(q_{i,j}, c_{i,j})$ denotes a corresponding LIWC word category $q_{i,j}$ and its statistical correlation $c_{i,j}$, and $N_i$ is the number of LIWC word categories that are statistically significantly correlated with the i-th sub-dimension;

On the basis of formula (1), the personality feature vector of the user is calculated by formula (2):

$$\mathrm{Feat}_i = \sum_{j=1}^{N_i} q_j \, c_j \quad (2)$$

where $\mathrm{Feat}_i$ denotes any one of the 18 dimensions of the user's personality feature vector, and $q_j$ and $c_j$ denote, respectively, the user's word frequency in the j-th LIWC word category associated with the i-th dimension and the corresponding statistical correlation.
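
The calculation can be sketched as a correlation-weighted sum of word frequencies, which is how the description characterizes it. The function names and the toy category labels below are hypothetical and only illustrate the shape of the computation.

```python
def trait_score(word_freq, correlations):
    """Weighted sum over the LIWC categories linked to one sub-dimension."""
    # A category missing from the user's analysis result counts as frequency 0.
    return sum(word_freq.get(category, 0.0) * c
               for category, c in correlations.items())


def personality_vector(word_freq, trait_tables):
    """One score per sub-dimension, in the order of trait_tables."""
    return [trait_score(word_freq, table) for table in trait_tables]
```

With the 18 correlation tables supplied in a fixed order, `personality_vector` would return the 18-dimensional vector used by the later steps.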

Preferably, the threat confidence TCD is calculated as follows:

$$TCD_{ij} = \begin{cases} 1, & Z_{ij} > MV_j \\ 0, & \text{otherwise} \end{cases}$$

$$TCD_i = \{1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,1,1,1\} \quad (5)$$

where $Z_{ij}$ is the entry in row i, column j of Matrix_2, that is, the Z-score of the j-th dimension of the i-th user, and $MV_j$ is the j-th value of the mean vector Mean_value. In the example threat confidence of formula (5) the number of 1s is 14; if 14 exceeds the set threshold K, the user is reclassified as a normal user and removed from the AbnormalUsers set.
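
As a quick check of the example in formula (5), the sketch below counts the 1s and applies the threshold rule. The value K = 12 is a hypothetical threshold chosen only for illustration; the source does not state a value.

```python
# Example threat confidence vector from formula (5).
tcd = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]

K = 12  # hypothetical threshold, not given in the source

ones = sum(tcd)
# More than K ones means the user is reclassified as normal.
is_false_positive = ones > K
```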

Beneficial effects of the invention: by analyzing users' language data, the invention extracts language features and builds a numerical feature vector characterizing each user's personality traits, trains a classifier to identify users with abnormal personality traits, and further analyzes the feature-vector deviation of those users to filter out false positives, reporting the remaining users to the security administrator as potential malicious insiders for analysis and response. The invention takes the psychological characteristics of the attacker in an insider attack into account, models them from the perspective of personality, and builds an anomaly detection classifier on that model. This compensates for existing detection methods, which focus only on the attack process and ignore the attacker, so that "abnormal" and "malicious" can be distinguished at a fine granularity, insider threats can be analyzed and detected comprehensively, and the high false-positive and false-negative rates of conventional insider threat detection methods are effectively reduced.

Compared with the prior art, the invention has the following features:

Modeling attacker features: the invention compensates for existing detection methods, which consider only attack-process features, by modeling features of the attacker, which makes it possible to analyze attack motives and predict attacks. Taking work e-mail as an example, the user's language features in e-mail are analyzed and, drawing on research into the statistical correlation between LIWC word categories and personality traits, an 18-dimensional feature vector characterizing the user's personality is constructed and used to train a classifier by machine learning.

Threat confidence TCD: judging malicious users solely from personality traits modeled on language features would inevitably produce many false positives. For the abnormal users found by the classifier, the invention therefore further analyzes how far each user deviates from the mean in the 18 personality dimensions, judges those with larger deviation to be normal users and deletes them from AbnormalUsers, thereby reducing the false-positive rate of the detection method.

In addition to the main advantages above, the invention also addresses shortcomings of traditional psychological assessment. Traditional methods rely mainly on psychological questionnaires completed by the user and on evaluations by colleagues or managers. They cost considerable time and money and, more importantly, self-assessment and third-party evaluation can hardly avoid subjective bias and may also conflict with privacy protection laws and regulations. The detection method of the invention builds on the internal audit system; the entire analysis runs automatically without human involvement, and the original content files are deleted automatically after the LIWC word-category analysis. Malicious insiders are thus detected while employee privacy is protected, which reduces the time and cost of traditional assessment, lowers legal and ethical risk, and reduces the insider threat risk faced by enterprises and organizations.

Brief description of the drawings

The invention is described below with reference to the accompanying drawings.

Fig. 1 is a flow chart of the method of the invention;

Fig. 2 is a flow chart of insider threat detection according to the invention.

Detailed description of the embodiments

To explain the technical features of this solution clearly, the invention is described in detail below through specific embodiments and with reference to the accompanying drawings. The following disclosure provides many different embodiments or examples for implementing different structures of the invention. To simplify the disclosure, the components and arrangements of specific examples are described below. In addition, the invention may repeat reference numerals and/or letters in different examples. This repetition is for simplicity and clarity and does not in itself indicate a relationship between the embodiments and/or arrangements discussed. It should be noted that the components illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components, processing techniques and processes are omitted to avoid unnecessarily limiting the invention.

An insider attack (or insider threat) is initiated by an insider of an enterprise or organization and is a new kind of threat, distinct from traditional network intrusion attacks. Insiders sit inside the traditional network security perimeter and have key knowledge of both the defenses and the targets, so they can bypass existing defense mechanisms and attack from within the enterprise or organization (for example by stealing technical patents or customer lists), causing heavy losses. The object of the invention is to use the internal information audit system of an enterprise or organization to collect users' language data, analyze its features, build a psychological model over all users, and from it identify potential malicious insiders, that is, high-risk users who are very likely to become insider attackers. The list of high-risk users is then submitted to internal security administrators for analysis, so that they can act to prevent or stop insider attacks.

The insider threat detection method based on user language features of the invention is characterized in that: the user's language data is first analyzed, language features are extracted, and a numerical feature vector that characterizes the user's personality traits is built; a classifier is then constructed and trained to identify users with abnormal personality traits; finally, the feature-vector deviation of those users is analyzed to filter out false positives, and the remaining users are reported to the security administrator as potential malicious insiders for analysis and response.

Preferably, as shown in Fig. 1, the insider threat detection method based on user language features comprises the following steps:

The main idea of the invention is to analyze users' language data, extract language features, and build a numerical feature vector that characterizes each user's personality traits; then to train a classifier that identifies users with abnormal personality traits; and further to analyze the feature-vector deviation of those users so as to filter out false positives and report the remaining users to the security administrator as potential malicious insiders. As shown in Fig. 2, the overall technical solution can be divided into four main steps: data preprocessing, personality feature vector construction, classifier training and threat confidence calculation. Each is described in detail below.

I. Data preprocessing

The user language data analyzed comes from the internal audit system and mainly comprises three types:

1. Work e-mail audit: the text content of the work e-mails sent by the audited user;

2. Electronic document content audit: the text content of work-related electronic documents written by the user, such as project proposals and work reports, including documents, spreadsheets and PPT slides;

3. Social application content audit: the text content crawled from the user's social status posts, such as Weibo posts and WeChat Moments.

The three language data sources are processed in a similar way; for convenience of description, the preprocessing of each is described in turn below.

For work e-mail, the main processing steps are:

11) Automated auditing: collect the work e-mail data for a given period (several months or one year);

12) Automated content processing: analyze only the e-mails the user sent; for each e-mail, strip the header information (subject, sender, recipient and so on) and extract only the text content; for a sent e-mail carrying multiple timestamps, consider only the most recent text (for example, for forwarded and replied e-mails, only the text written when replying or forwarding);

13) Automated aggregation: for each user, aggregate all text content processed by the steps above into one large text and store it for analysis in the next step.

For electronic documents:

21) Automated auditing: collect the work-related electronic document data for a given period (several months or one year);

22) Automated content processing: remove headings at all levels, formatting data, and multimedia data such as pictures and sound from each electronic document, and extract only its plain text content;

23) Automated aggregation: for each user, aggregate all electronic document content processed by the steps above into one large text and store it for analysis in the next step.

For social applications:

31) Automated auditing: collect the social application status data (such as Weibo posts and WeChat Moments) of internal users for a given period (several months or one year);

32) Automated content processing: remove unstructured data such as pictures, sound and hyperlinks from the social application status data, and process only the text content that the user wrote in each status, that is, excluding forwarded content;

33) Automated aggregation: for each user, aggregate all social application status text processed by the steps above into one large text and store it for analysis in the next step.

II. Psychological feature construction

The analysis described below applies equally to the three data sources (work e-mail, electronic documents and social application statuses), which are not distinguished one by one in what follows.

The TextMind Chinese psychological analysis system of the Institute of Psychology, Chinese Academy of Sciences (http://ccpl.psych.ac.cn/textmind/) [1] is used to analyze the e-mail text of each user and obtain the word frequencies of the relevant word categories, which serve as the Chinese LIWC [2] analysis result. LIWC (Linguistic Inquiry and Word Count) is a widely used, open analysis system for inferring subjective factors such as thoughts, emotions and personality from language; TextMind is a principled extension of the original English system to the Chinese lexicon. After this step, the original content files of each user are deleted to protect privacy;

The LIWC word categories are associated with Big Five personality [3] features, and the values of 18 Big Five sub-dimensions are calculated as the user's personality feature vector [4].

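The 18 Big Five sub-dimensions are: anxiety, anger, depression, self-consciousness, impulsiveness, vulnerability, trust, morality, altruism, cooperation, modesty, sympathy, self-efficacy, orderliness, dutifulness, achievement striving, self-discipline and cautiousness.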

Taking vulnerability as an example, the following shows how each sub-dimension value is calculated for a user from the LIWC word-category analysis result. The statistical correlations between sub-dimension traits and LIWC word categories can be obtained from [4]. For vulnerability they are: sensation words (0.18), anxiety words (0.16), articles (-0.16), first-person singular words (0.14), reflexive pronouns (0.13), causation words (0.11), discrepancy words (0.11), cognitive process words (0.1), qualifier words (0.1) and second-person words (-0.1).

These are the 10 LIWC word categories most strongly correlated with vulnerability; the values in parentheses are correlation coefficients. From these correlations and the user's LIWC word-category analysis result, a numerical vulnerability score is obtained for the user and used as one of the 18 sub-dimension values.
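
Written out for this trait, and assuming the score is the correlation-weighted sum of word frequencies that the description refers to, the calculation would look as follows. Each $q$ is the user's word frequency in the named LIWC category; the subscript labels are shorthand introduced only for this illustration.

```latex
\begin{aligned}
\mathrm{Feat}_{\text{vulnerability}} ={}& 0.18\,q_{\text{sensation}} + 0.16\,q_{\text{anxiety}} - 0.16\,q_{\text{article}} \\
&+ 0.14\,q_{\text{first-person singular}} + 0.13\,q_{\text{reflexive}} + 0.11\,q_{\text{causation}} \\
&+ 0.11\,q_{\text{discrepancy}} + 0.10\,q_{\text{cognitive}} + 0.10\,q_{\text{qualifier}} - 0.10\,q_{\text{second-person}}
\end{aligned}
```

Categories with negative coefficients (articles, second-person words) lower the score as their frequency rises.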

For the i-th of the 18 sub-dimensions, the statistical correlation between the sub-dimension and the LIWC word categories is obtained from the study [4]:

$$\mathrm{Feat}_i = \{(q_{i,1}, c_{i,1}), (q_{i,2}, c_{i,2}), \ldots, (q_{i,N_i}, c_{i,N_i})\} \quad (1)$$

where $\mathrm{Feat}_i$ denotes the i-th sub-dimension, $(q_{i,j}, c_{i,j})$ denotes a corresponding LIWC word category $q_{i,j}$ and its statistical correlation $c_{i,j}$, and $N_i$ is the number of LIWC word categories that are statistically significantly correlated with the i-th sub-dimension. On the basis of formula (1), formula (2) is calculated:

$$\mathrm{Feat}_i = \sum_{j=1}^{N_i} q_j \, c_j \quad (2)$$

Formula (2) gives, for any user, the method of calculating each of the 18 dimensions of the personality feature vector. $q_j$ and $c_j$ denote, respectively, the user's word frequency in the j-th LIWC word category associated with the i-th dimension and the corresponding statistical correlation. The remaining values are calculated from the correlations in the same way. Each user thus obtains an 18-dimensional feature vector representing their personality traits, in which each value is the sum of the LIWC word-category analysis results weighted by the personality trait correlations.

3. Classifier training

To build a classifier with a machine learning algorithm, we suggest selecting an initial period (for example, 1 month). From the user work emails audited in that period, the personality feature vector of each user is calculated following the psychological feature construction process, and a one-class support vector machine (One-Class SVM, from the sklearn 0.19 algorithm library) is then trained to obtain the psychological model PsyModel of the initial user group.

After any new period (for example, some later month), the personality feature vectors modelled from the users' work email content in that period are calculated following the psychological feature construction process. The user-group psychological model PsyModel obtained in the previous step is used to judge whether each user is abnormal, and the set of users judged abnormal is denoted AbnormalUsers.
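The training and detection steps can be sketched with scikit-learn's `OneClassSVM`. This is only an illustration under stated assumptions: the feature vectors are random placeholder data rather than real LIWC-derived vectors, the user identifiers are hypothetical, and the `nu` and `gamma` settings are illustrative choices that the description does not specify.

```python
import random

from sklearn.svm import OneClassSVM

random.seed(0)
N_DIMS = 18  # one value per personality sub-dimension


def fake_vectors(n_users):
    """Stand-in for real personality feature vectors (random placeholder data)."""
    return [[random.gauss(0.0, 1.0) for _ in range(N_DIMS)] for _ in range(n_users)]


# Initial period: one 18-dimensional vector per user.
train_vectors = fake_vectors(50)

# Train the user-group model; nu and gamma are illustrative, not from the source.
psy_model = OneClassSVM(kernel="rbf", nu=0.1, gamma="auto")
psy_model.fit(train_vectors)

# New period: recompute the vectors and let the model judge each user.
user_ids = ["user_%d" % k for k in range(20)]  # hypothetical identifiers
new_vectors = fake_vectors(len(user_ids))
labels = psy_model.predict(new_vectors)  # +1 = inlier, -1 = outlier

# Users predicted as outliers form the AbnormalUsers set.
abnormal_users = [uid for uid, lab in zip(user_ids, labels) if lab == -1]
```

`predict` returns +1 for samples the model considers consistent with the training group and -1 for outliers, so `abnormal_users` plays the role of AbnormalUsers.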

4. Threat confidence calculation

The set AbnormalUsers of users judged abnormal by the classifier obtained in the classifier training step may contain some normal users, so the threat confidence must be calculated to screen the users further. Specifically:

1) For the users in AbnormalUsers, their corresponding 18-dimensional feature vectors form a matrix Matrix_1, whose number of rows is the number of users in AbnormalUsers and whose number of columns is 18;

2) The Z-score of each row of Matrix_1 is computed column by column to obtain Matrix_2, i.e. formula (3):

$$\mathrm{ZScore}_{i,j} = \frac{|X_{i,j} - \bar{X}_j|}{\sigma_j}, \quad j = 1 \sim 18 \quad (3)$$

where, for the i-th user in Matrix_1, $X_{i,j}$ denotes the value of its j-th dimension, $\bar{X}_j$ denotes the mean of the j-th column of the matrix, and $\sigma_j$ denotes the standard deviation of the j-th column. After the Z-scores have been calculated for every user (i.e. every row) of Matrix_1, they form a new matrix Matrix_2;

3) The mean of each column of Matrix_2 is calculated to obtain the 18-dimensional mean vector Mean_value;

4) For each user in AbnormalUsers, each value of the user's 18-dimensional feature vector (i.e. its row of Z-scores in Matrix_2) is compared in turn with the corresponding value in the mean vector Mean_value, and the resulting 18-dimensional binary vector is taken as the user's threat confidence (TCD). If the number of '1' entries in TCD exceeds the threshold K, the user is marked as a normal user and deleted from AbnormalUsers. A value of K = 12 is suggested and may be adjusted between 12 and 16 according to circumstances. The specific formulas are as follows:

$$\mathrm{TCD}_{i,j} = \begin{cases} 1, & \text{if } Z_{i,j} > MV_j \\ 0, & \text{otherwise} \end{cases}, \quad j = 1 \sim 18 \quad (4)$$

TCD_i={ 1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,1,1,1 } (5)

where $Z_{i,j}$ denotes the entry in row i and column j of Matrix_2, i.e. the Z-score of the j-th dimension feature of the i-th user, and $MV_j$ denotes the j-th value of the mean vector Mean_value. In the threat confidence of the user in formula (5), the number of '1' entries is 14, which exceeds the given threshold K = 12; the user is therefore corrected to a normal user and removed from the set AbnormalUsers.

5) Steps 1) to 4) are repeated until all users in AbnormalUsers have been judged; the users finally remaining in AbnormalUsers are reported to the security administrator as potential internal malicious users for analysis and response.
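Steps 1) to 5) can be sketched in plain Python as follows. This is a hedged sketch, not the implementation of the invention: it assumes the population standard deviation, assigns a Z-score of 0 to any column with zero variance, and computes the matrices once for the whole set rather than after each deletion; the description does not settle these points. The function and variable names are hypothetical.

```python
from statistics import mean, pstdev


def filter_false_positives(vectors, k=12):
    """Return the user ids that remain flagged after the TCD check.

    vectors: dict mapping user id -> list of feature values (rows of Matrix_1).
    k: threshold K on the number of '1' entries in a user's TCD vector.
    """
    users = list(vectors)
    n_dims = len(vectors[users[0]])
    columns = [[vectors[u][j] for u in users] for j in range(n_dims)]
    col_mean = [mean(col) for col in columns]
    col_std = [pstdev(col) for col in columns]  # population std: an assumption

    # Formula (3): absolute Z-score per user and dimension (rows of Matrix_2).
    z = {
        u: [
            abs(vectors[u][j] - col_mean[j]) / col_std[j] if col_std[j] else 0.0
            for j in range(n_dims)
        ]
        for u in users
    }

    # Step 3): column means of Matrix_2 (the vector Mean_value).
    mean_value = [mean(z[u][j] for u in users) for j in range(n_dims)]

    remaining = []
    for u in users:
        # Formula (4): 1 where the user's Z-score exceeds the column mean.
        tcd = [1 if z[u][j] > mean_value[j] else 0 for j in range(n_dims)]
        # Per the description, more than K ones marks the user as normal,
        # so only users with at most K ones stay in the flagged set.
        if sum(tcd) <= k:
            remaining.append(u)
    return remaining


# The TCD vector of formula (5): 14 ones, which exceeds K = 12.
tcd_example = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
ones = sum(tcd_example)
```

With the suggested K = 12, the user of formula (5) has 14 ones and would be removed from the flagged set, matching the worked example in the text.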

The present invention addresses the high false-positive and high false-negative rates of existing insider threat detection methods based on policy checking and behavioral-data anomaly detection. It proposes a psychological-modeling detection method that, based on the language data features of users (employees), constructs psychological feature vectors characterizing user personality traits, builds an overall user-group psychological model with a machine learning algorithm, and finally identifies internal abnormal users from it. On this basis, for the abnormal users identified in the previous step, the invention analyzes their overall deviation across the feature dimensions so as to remove normal users that may have been falsely reported, finally obtaining the potential internal malicious users, who are submitted to the security administrator for further analysis and response. The invention fully considers the psychological characteristics of the attacker in an insider attack, performs psychological modeling from the perspective of personality, and builds an anomaly detection classifier on that basis. This compensates for the shortcoming of existing detection methods, which focus only on the attack process and ignore the attacking subject, so that "abnormal" and "malicious" can be distinguished at a fine granularity, insider threats can be analyzed and detected comprehensively, and the high false-positive and false-negative rates of traditional insider threat detection methods are effectively reduced.

The present invention applies to user language data from audited work emails, audited work documents and audited social media (microblogs, WeChat Moments, etc.). After identity-identifying information (such as mail headers, work-document metadata and social IDs) is deleted, the text data are aggregated into one large file, which is then analyzed by the TextMind Chinese psychological analysis system 【1】 to obtain the LIWC word-category results;

Based on the research results 【4】 on the statistical correlations between LIWC word categories 【2】 and personality traits 【3】, the present invention builds an 18-dimensional personality feature vector, beginning with the anxiety trait;

For the set AbnormalUsers of users judged abnormal by the classifier, the present invention analyzes the users' Z-scores column by column, calculates the column means as a reference vector, and takes the number of column features in which each user exceeds the corresponding mean as that user's threat confidence; if this number exceeds the previously given threshold K, the user is judged normal and removed from AbnormalUsers.

References cited in the present invention:

【1】TextMind Chinese psychological analysis system：http://ccpl.psych.ac.cn/textmind/

【2】LIWC Program：http://liwc.wpengine.com/

【3】Big Five personality model：http://www.baike.com/wiki/%E5%A4%A7%E4%BA%94%E4%BA%BA%E6%A0%BC%E7%90%86%E8%AE%BA

【4】Correlations between LIWC word categories and the Big Five personality model：https://www.researchgate.net/publication/44687893Personalityin100000WordsAlarge-scaleanalysisofpersonalityandworduseamongbloggers

The above is merely a preferred embodiment of the present invention. Those skilled in the art may make several improvements and modifications without departing from the principles of the invention, and such improvements and modifications are also regarded as falling within the protection scope of the present invention.

Claims

1. An insider threat detection method based on user language features, characterized in that: the language data of users are first analyzed, language features are extracted, and a numerical feature vector capable of characterizing each user's personality traits is established; a classifier is then built and trained to identify users with abnormal personality traits; finally, the feature vector deviation of the users with abnormal personality traits is analyzed to filter out falsely reported users, and the remaining users are reported to the security administrator as potential internal malicious users for analysis and response.

2. The insider threat detection method based on user language features according to claim 1, characterized in that the method comprises the following steps:

1) Data preprocessing: the user language data from the internal audit system are subjected to analysis and processing comprising at least three aspects: automated audit, automated content processing and automated aggregation;

2) Personality feature vector construction: the user language data of each user are first analyzed, and the resulting word frequencies of the relevant word categories are taken as the Chinese LIWC analysis result; then, through the association between LIWC word categories and Big Five personality traits, the values of the 18 Big Five sub-dimension traits are calculated and taken as the personality feature vector of that user;

3) Classifier training: a classifier is first built; the user language data audited within a selected initial period are used to calculate the personality feature vector of each user, and a one-class support vector machine is then trained to obtain the psychological model of the initial user group; finally, after any new period, the personality feature vectors modelled from the user language data content in that period are calculated, the user-group psychological model is used to judge whether each user is abnormal, and the set of users judged abnormal is denoted AbnormalUsers;

4) Threat confidence calculation: the threat confidence is calculated for the set AbnormalUsers of users judged abnormal in order to screen the users further; the threat confidence calculation comprises the following specific steps:

41) For the users in the abnormal user set AbnormalUsers, their corresponding 18-dimensional feature vectors form a matrix Matrix_1, whose number of rows is the number of users in AbnormalUsers and whose number of columns is 18;

42) The Z-score of each row of Matrix_1 is computed column by column to obtain Matrix_2; the calculation formula for Matrix_2 is as follows:

$$\mathrm{ZScore}_{i,j} = \frac{|X_{i,j} - \bar{X}_j|}{\sigma_j}, \quad j = 1 \sim 18 \quad (3)$$

where, for the i-th user in Matrix_1, $X_{i,j}$ denotes the value of its j-th dimension, $\bar{X}_j$ denotes the mean of the j-th column of the matrix, and $\sigma_j$ denotes the standard deviation of the j-th column；

44) For each user in the abnormal user set AbnormalUsers, each value of the user's 18-dimensional feature vector is first compared in turn with the corresponding value in the mean vector Mean_value, and the resulting 18-dimensional binary vector is then taken as the user's threat confidence TCD; if the number of '1' entries in the threat confidence TCD exceeds the threshold K, the user is marked as a normal user and deleted from the abnormal user set AbnormalUsers；

45) Steps 41) to 44) are repeated until all users in the abnormal user set AbnormalUsers have been judged; the users finally remaining in the abnormal user set AbnormalUsers are reported to the security administrator as potential internal malicious users for analysis and response.

3. The insider threat detection method based on user language features according to claim 2, characterized in that the user language data comprise work email data, electronic document data and social application data; the work email data are the text content of work emails sent by the user, the electronic document data are work-related text content written by the user and stored in electronic form, and the social application data are the text content obtained by crawling the user's social status updates.

4. The insider threat detection method based on user language features according to claim 3, characterized in that the analysis and processing of the work email data comprises the following steps:

111) Automated audit: the work email data within a certain period are collected；

112) Automated content processing: only emails sent by the user are analyzed; for each email, the mail header is removed and only the text content is extracted; for a sent email carrying multiple time tags, only the email sent at the latest time is considered；

113) Automated aggregation: the text content of each user's work email data, after automated audit and automated content processing, is aggregated into one large text and stored.

5. The insider threat detection method based on user language features according to claim 3, characterized in that the analysis and processing of the electronic document data comprises the following steps:

122) Automated content processing: headings at all levels, formatting data and picture and sound data are removed from the electronic document, and only the plain text content of the electronic document is extracted；

123) Automated aggregation: the text content of each user's electronic document data, after automated audit and automated content processing, is aggregated into one large text and stored.

6. The insider threat detection method based on user language features according to claim 3, characterized in that the analysis and processing of the social application data comprises the following steps:

132) Automated content processing: picture, sound and hyperlink data are removed from the social application status data, and only the text content written by the user in the status updates is processed；

133) Automated aggregation: the text content of each user's social application data, after automated audit and automated content processing, is aggregated into one large text and stored.

7. The insider threat detection method based on user language features according to claim 4, characterized in that, in the construction of the personality feature vector, the TextMind Chinese psychological analysis system of the Institute of Psychology, Chinese Academy of Sciences is used to analyze each user's email text, obtaining the word frequencies of the relevant word categories as the Chinese LIWC analysis result; through the association between LIWC word categories and Big Five personality traits, the values of the 18 Big Five sub-dimension traits are calculated and taken as the personality feature vector of that user.

8. The insider threat detection method based on user language features according to claim 2, characterized in that the 18 Big Five sub-dimensions are respectively: anxiety trait, anger trait, depression trait, self-consciousness trait, impulsiveness trait, vulnerability trait, trust trait, morality trait, altruism trait, cooperation trait, modesty trait, sympathy trait, self-efficacy, orderliness trait, dutifulness trait, achievement striving, self-discipline trait and cautiousness trait.

9. The insider threat detection method based on user language features according to claim 2, characterized in that the 18 sub-dimension trait values are calculated as follows:

$$\mathrm{Feat}_i \rightarrow \{(q_{i,1}, c_{i,1}), (q_{i,2}, c_{i,2}), \ldots, (q_{i,N_i-1}, c_{i,N_i-1}), (q_{i,N_i}, c_{i,N_i})\} \quad (1)$$

where $\mathrm{Feat}_i$ denotes the i-th sub-dimension, $(q_{i,j}, c_{i,j})$ denotes a related LIWC word category $q_{i,j}$ and its statistical correlation $c_{i,j}$, and $N_i$ is the number of LIWC word categories that have a statistically significant correlation with the i-th sub-dimension；

$$\mathrm{Feat}_i = \sum_{j=1}^{N_i} q_{i,j} \times c_{i,j}, \quad (i = 1 \sim 18,\ j = 1 \sim N_i) \quad (2)$$

where $\mathrm{Feat}_i$ denotes any one of the 18 dimensions of the user's personality feature vector, and $q_{i,j}$ and $c_{i,j}$ denote, respectively, the user's word frequency on the j-th LIWC word category associated with the i-th dimension and the corresponding statistical correlation.

10. The insider threat detection method based on user language features according to claim 2, characterized in that the threat confidence TCD is calculated as follows:

$$\mathrm{TCD}_{i,j} = \begin{cases} 1, & \text{if } Z_{i,j} > MV_j \\ 0, & \text{otherwise} \end{cases}, \quad j = 1 \sim 18 \quad (4)$$

TCD_i={ 1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,1,1,1 } (5)

where $Z_{i,j}$ denotes the entry in row i and column j of Matrix_2, i.e. the Z-score of the j-th dimension feature of the i-th user, and $MV_j$ denotes the j-th value of the mean vector Mean_value; in the threat confidence of the user in formula (5), the number of '1' entries is 14, and if this number exceeds the given threshold K, the user is corrected to a normal user and removed from the set AbnormalUsers.