CN107196942B - Internal threat detection method based on user language features - Google Patents

Internal threat detection method based on user language features

Info

Publication number
CN107196942B
Authority
CN
China
Prior art keywords
user
data
users
trait
personality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710374486.9A
Other languages
Chinese (zh)
Other versions
CN107196942A (en)
Inventor
杨光
王继志
杨英
陈丽娟
陈振娅
文立强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416: Event detection, e.g. attack signature detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/552: Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425: Traffic logging, e.g. anomaly detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses an internal threat detection method based on user language features. The method first analyzes a user's language data, extracts language features, and builds a numerical feature vector that represents the user's personality psychological features. It then constructs and trains a classifier to identify users whose personality psychological features are abnormal. Finally, it analyzes the degree of feature-vector deviation of those users to screen out false alarms, and reports the remaining users to a security administrator as potential internal malicious users for analysis and response. The invention takes full account of the attacker's psychological characteristics in an internal attack, performs psychological modeling from the perspective of personality, and constructs an anomaly detection classifier. It thereby remedies the shortcoming of existing detection methods, which attend only to the attack process and neglect the attacking subject. It can distinguish "anomalous" from "malicious" at a fine granularity, can analyze and detect internal threats comprehensively, and effectively reduces the high false-positive and false-negative rates of traditional internal threat detection methods.

Description

Internal threat detection method based on user language features
Technical Field
The invention relates to an internal threat detection method based on user language features, and belongs to the technical field of information security and network security.
Background
With the development of networks, the security of network information has drawn increasing public attention, and security products such as anti-virus software, firewalls and intrusion detection systems are widely deployed. However, these products only defend against external intrusion and theft. As understanding of network security has matured, it has become clear that leaks and intrusions caused by insiders account for a large proportion of incidents; the Snowden "PRISM" incident of 2013 is a typical case of disclosure by an insider. Internal threats therefore deserve the same attention as external intrusion, yet no effective internal threat detection mechanism exists in practice.
Because internal attackers are generally employees (current or former), contractors or business partners of a company or organization, and hold access rights to the organization's systems, networks and data, internal threats are usually highly concealed and dangerous, and a traditional defense-in-depth system built on security devices such as firewalls and IDS cannot cope with them effectively.
The key to detecting internal threats is thorough internal security auditing. Its core is to take the user as the center and record all of the user's key operations and behaviors in systems and networks, forming a behavior trail of the user within the internal network. Current internal security audits focus on the following behaviors:
Document audit: auditing operations on documents such as writing, creating, copying and deleting;
Print audit: auditing print events initiated by users, the content of printed files, and the like;
Login audit: auditing users' logins to the system and operations such as logging out, restarting and shutting down the system;
Process audit: auditing the processes that users create and terminate;
Network monitoring: auditing web access behavior, including the target IP/port accessed, page requests and the like;
Device audit: auditing the use of USB and other removable storage devices, such as files copied and deleted;
Email audit: auditing users' mail behavior, such as the sender and recipients in the mail header, the subject, part of the body text, and the number (and type) of attachments.
Multi-dimensional, fine-grained internal security auditing inevitably produces huge volumes of data, and the resulting sharp rise in detection complexity challenges internal threat detection. Big-data security research that applies big-data analytics to model user behavior, especially from internal security audit logs, has therefore become a research hotspot. In practice, however, the data sources of internal threat detection systems cover too few dimensions and the systems have a single-layer structure, so their false alarm rates are high and their practicality is poor. It is therefore necessary to design an internal threat detection system with good practicality.
Existing internal threat detection systems are designed mainly by applying anomaly detection to users' internal security audit logs to build an internal threat classifier. The approach comprises the following steps:
Internal security audit collection: deploy an internal security audit system; collect users' internal system and network behaviors, such as document access; format them; and pass them to the classifier construction module;
Anomaly detection classifier: learn a user behavior model from the received data with an anomaly detection method and construct an anomaly detection classifier;
User behavior detection: apply the anomaly detection classifier to the user behavior log for a specific time window and judge whether it constitutes an internal threat.
Internal threat detection based on anomaly detection can cope with most internal attack scenarios in practice. However, its underlying assumption has a flaw that cannot be ignored. The assumption is that the malicious behavior of an internal malicious user differs from normal working behavior and can therefore be identified by anomaly detection. In practice, malicious behavior and anomalous behavior are not equivalent; that is, the two sets of behaviors are not equal. Considering only anomalous behavior therefore inevitably leads to a high false-positive rate (normal users identified as malicious) and a high false-negative rate (malicious users identified as normal), as the following two cases illustrate:
1. Project manager A and project manager B routinely exchange e-mail about the progress of a joint project. One day, A e-mails confidential technical material about the project to B (malicious behavior that is not anomalous);
2. Buyer A has always purchased from supplier B and suddenly purchases from supplier C, but this alone does not show that A received a kickback from C (anomalous behavior that is not necessarily malicious).
As described above, the core of existing internal threat detection systems is to build an anomaly detection classifier through policy checks and anomaly detection on user behavior during an attack. However, analyzing only the features of the attack process blurs the boundary between "anomalous" and "malicious": in practice, a user's malicious behavior may not be anomalous, and anomalous behavior may not be malicious. The system and network behavior data collected in internal security audit logs are insufficient to draw this boundary finely, so internal threat detection systems built on the existing data dimensions inevitably suffer from high false-positive and false-negative rates. A high false-positive rate lowers alert quality: analysts cannot examine every alert thoroughly, system availability drops, and the detection system ends up existing in name only. A high false-negative rate directly defeats the security defense and exposes enterprise or organizational assets to high risk. High false-positive and false-negative rates are the key factors limiting the practicality of internal threat detection systems and are their main problem today.
Disclosure of Invention
To address the high false-positive and false-negative rates of existing internal threat detection methods, which rely on policy checks and anomaly detection on behavior data, the invention provides an internal threat detection method based on user language features. The method can analyze and detect internal threats comprehensively and effectively reduces the high false-positive and false-negative rates of traditional internal threat detection methods.
The technical solution adopted to solve the technical problem is as follows: an internal threat detection method based on user language features, characterized in that it first analyzes a user's language data, extracts language features, and builds a numerical feature vector that represents the user's personality psychological features; then constructs and trains a classifier to identify users whose personality psychological features are abnormal; and finally analyzes the degree of feature-vector deviation of those users to screen out false alarms, reporting the remaining users to a security administrator as potential internal malicious users for analysis and response.
Preferably, the internal threat detection method based on the user language features comprises the following steps:
1) Data preprocessing: analyze and process the user language data from the internal audit system; the processing comprises at least three stages: automatic audit, automated content processing and automated aggregation;
2) Constructing the personality psychological feature vector: first analyze each user's language data and take the word-frequency results for the relevant LIWC word categories as the Chinese LIWC analysis result; then, using the association between LIWC word categories and the Big Five personality traits, calculate the feature values of 18 Big Five sub-dimensions as the user's personality psychological feature vector;
3) Training the classifier: first construct a classifier; select the audited user language data from an initial time period and calculate each user's personality psychological feature vector; then train a one-class support vector machine on these vectors to obtain a psychological model of the initial user group; finally, for any subsequent time period, calculate each user's personality psychological feature vector from the user language data of that period, use the user-group psychological model to judge whether the vector is abnormal, and collect the users judged abnormal into the set AbnormalUsers;
4) Calculating the threat confidence: calculate the threat confidence of the users in the abnormal user set AbnormalUsers to screen them further. The threat confidence calculation comprises the following steps:
41) for the users in AbnormalUsers, form a matrix Matrix_1 from their 18-dimensional feature vectors; the number of rows is the number of abnormal users and the number of columns is 18;
42) calculate the Z-score of each element of Matrix_1 column by column to obtain Matrix_2, according to the following formula:
$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma_j} \qquad (3)$$
where, for the i-th user in Matrix_1, $X_{ij}$ is the value of that user's j-th dimension, $\bar{X}_j$ is the mean of the values in the j-th column of the matrix, and $\sigma_j$ is the standard deviation of the j-th column;
calculating the Z-score for every user in Matrix_1 yields the new matrix Matrix_2;
43) calculate the mean of each column of Matrix_2 to obtain an 18-dimensional mean vector Mean_value;
44) for each user in AbnormalUsers, compare each value in the user's 18-dimensional row of Matrix_2 in turn with the corresponding value of Mean_value, recording 1 where the value exceeds the mean and 0 otherwise; take the resulting 18-dimensional binary vector as the user's threat confidence TCD; if the number of 1s in the TCD exceeds a threshold K, mark the user as normal and delete the user from AbnormalUsers;
45) repeat steps 41) to 44) until every user in AbnormalUsers has been judged, and finally report the users remaining in AbnormalUsers to the security administrator as potential internal malicious users for analysis and response.
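Steps 41) to 44) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names are hypothetical, the population standard deviation is an assumption (the text does not say which estimator is used), a zero-variance column is mapped to Z-scores of 0 by assumption, and the screening is done in a single pass over the set.

```python
from statistics import mean, pstdev


def z_score_matrix(matrix):
    """Column-wise Z-scores of a row-major matrix (rows = users, columns = dimensions)."""
    columns = list(zip(*matrix))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]  # population std dev (assumption)
    return [
        # A constant column has sigma == 0; its Z-scores are set to 0.0 (assumption).
        [(x - mu) / sigma if sigma else 0.0 for x, mu, sigma in zip(row, mus, sigmas)]
        for row in matrix
    ]


def screen_abnormal_users(abnormal_users, k):
    """Return the users that stay in AbnormalUsers after threat-confidence screening.

    abnormal_users maps a user id to that user's feature vector (18 values in the text).
    """
    users = list(abnormal_users)
    matrix_1 = [abnormal_users[u] for u in users]          # step 41
    matrix_2 = z_score_matrix(matrix_1)                    # step 42
    mean_value = [mean(col) for col in zip(*matrix_2)]     # step 43
    remaining = []
    for user, z_row in zip(users, matrix_2):               # step 44
        tcd = [1 if z > mv else 0 for z, mv in zip(z_row, mean_value)]
        if sum(tcd) <= k:  # more than K ones means the user is treated as normal
            remaining.append(user)
    return remaining
```

The function works for any vector length, so it can be tried on short toy vectors before being applied to 18-dimensional data.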
Preferably, the user language data comprises work mail data, electronic document data and social application data. The work mail data is the text content of work e-mails sent by the user; the electronic document data is work-related text written by the user and stored in electronic form; and the social application data is the text content crawled from the user's social status updates.
Preferably, the analysis processing procedure of the working mail data comprises the following steps:
111) Automatic audit: collect the work mail data for a given time period;
112) Automated content processing: analyze only the mails sent by the user; strip the header information from each mail and extract only the body text; where a sent mail carries several timestamps, consider only the most recently sent version;
113) Automated aggregation: aggregate the text content of each user's work mail data, after automatic audit and automated content processing, into one large text file and store it.
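As an illustration of steps 112) and 113), the following Python sketch filters, de-duplicates and aggregates mail records. The record schema (`user`, `direction`, `message_id`, `sent_at`, `raw`) is hypothetical and not taken from the patent; timestamps are assumed to be ISO-8601 strings, which sort chronologically; multipart messages are simply skipped to keep the sketch short.

```python
from email import message_from_string


def aggregate_sent_mail(records):
    """Build one aggregated body-text string per user from raw mail records."""
    latest = {}  # (user, message_id) -> record with the newest sent_at
    for rec in records:
        if rec["direction"] != "sent":
            continue  # only mails sent by the user are analyzed
        key = (rec["user"], rec["message_id"])
        if key not in latest or rec["sent_at"] > latest[key]["sent_at"]:
            latest[key] = rec  # keep only the most recently sent version

    per_user = {}
    for rec in sorted(latest.values(), key=lambda r: r["sent_at"]):
        msg = message_from_string(rec["raw"])  # separates headers from the body
        if msg.is_multipart():
            continue  # attachments and alternatives are out of scope here
        per_user.setdefault(rec["user"], []).append(msg.get_payload().strip())

    # One large text per user; writing it to a file is left to the caller.
    return {user: "\n".join(bodies) for user, bodies in per_user.items()}
```

Keying on a message identifier is one possible reading of "a sent mail with several timestamps"; a real audit system may expose this differently.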
Preferably, the analysis processing procedure of the electronic document data includes the steps of:
121) Automatic audit: collect the work-related electronic document data for a given time period;
122) Automated content processing: remove headings at all levels, formatting data, and image and audio data from the electronic document, and extract only its plain text content;
123) Automated aggregation: aggregate the text content of each user's electronic document data, after automatic audit and automated content processing, into one large text file and store it.
Preferably, the analysis processing process of the social application data comprises the following steps:
131) Automatic audit: collect the social application status data of internal users for a given time period;
132) Automated content processing: remove image, audio and hyperlink data from the social application status data, and process only the text written by the user in the status;
133) Automated aggregation: aggregate the text content of each user's social application data, after automatic audit and automated content processing, into one large text file and store it.
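The hyperlink removal in step 132) and the aggregation in step 133) might look like the following Python sketch. The URL pattern is a deliberately simple assumption (anything starting with `http://`, `https://` or `www.` up to the next whitespace), and the function names are hypothetical. Images and audio are assumed to arrive as separate attachments rather than inside the text.

```python
import re

# Simplistic URL matcher (assumption): scheme or "www." followed by non-space characters.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def clean_status(text):
    """Drop hyperlinks from one status update and normalize whitespace."""
    without_links = URL_RE.sub(" ", text)
    return " ".join(without_links.split())


def aggregate_statuses(statuses_by_user):
    """Join each user's cleaned, non-empty status texts into one large text."""
    aggregated = {}
    for user, statuses in statuses_by_user.items():
        cleaned = [clean_status(s) for s in statuses]
        aggregated[user] = "\n".join(c for c in cleaned if c)
    return aggregated
```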
Preferably, in constructing the personality psychological feature vector, the Chinese-language text psychological analysis system of the Institute of Psychology, Chinese Academy of Sciences, is used to analyze each user's mail text file and obtain the word-frequency results for the relevant LIWC word categories as the Chinese LIWC analysis result; the feature values of the 18 Big Five sub-dimensions are then calculated from the association between LIWC word categories and the Big Five personality traits and used as the user's personality psychological feature vector.
Preferably, the 18 sub-dimensions of the Big Five personality model are: the anxiety trait, anger trait, depression trait, self-consciousness trait, impulsiveness trait, vulnerability trait, trust trait, morality trait, altruism trait, cooperation trait, modesty trait, sympathy trait, self-efficacy, orderliness trait, dutifulness trait, achievement striving, self-discipline trait and cautiousness trait.
Preferably, the 18 sub-dimension feature values are calculated as follows:
for the i-th of the 18 sub-dimensions, the statistical relevance between the sub-dimension and the LIWC word categories is:
$$Feat_i = \{(q_{i,j},\, c_{i,j}) \mid j = 1, 2, \ldots, N_i\} \qquad (1)$$
where $Feat_i$ denotes the i-th sub-dimension, $(q_{i,j}, c_{i,j})$ denotes an associated LIWC word category $q_{i,j}$ together with its statistical relevance $c_{i,j}$, and $N_i$ is the number of LIWC word categories significantly correlated with the i-th sub-dimension;
on the basis of formula (1), the user's personality psychological feature vector is calculated by formula (2):
$$Feat_i = \sum_{j=1}^{N_i} q_j \cdot c_j \qquad (2)$$
where $Feat_i$ is the value of any one of the 18 dimensions of the user's personality psychological feature vector, and $q_j$ and $c_j$ are, respectively, the user's word-frequency value on the j-th LIWC word category associated with the i-th dimension and the corresponding statistical relevance.
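A small Python sketch of this calculation follows, assuming formula (2) is a sum of each associated word category's frequency weighted by its statistical relevance. The function name, the category names and the coefficients in the usage lines are made up for illustration; the patent's actual LIWC categories and relevance values are not given in this text.

```python
def personality_vector(word_freq, associations):
    """Compute one feature value per sub-dimension as sum(q_j * c_j).

    word_freq:    dict mapping an LIWC word category to the user's word frequency q_j
    associations: one list per sub-dimension of (category, relevance c_j) pairs
    """
    return [
        # Categories the user never used contribute a frequency of 0.0.
        sum(word_freq.get(category, 0.0) * relevance for category, relevance in pairs)
        for pairs in associations
    ]


# Hypothetical two-dimensional example (the method in the text uses 18 sub-dimensions).
example_associations = [
    [("negemo", 0.3), ("i", 0.1)],  # made-up associations for sub-dimension 1
    [("social", 0.2)],              # made-up associations for sub-dimension 2
]
example_freq = {"negemo": 0.05, "i": 0.1, "social": 0.2}
example_vector = personality_vector(example_freq, example_associations)
```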
Preferably, the threat confidence TCD is calculated as follows:
$$TCD_{ij} = \begin{cases} 1, & Z_{ij} > MV_j \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$
$$TCD_i = \{1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,1,1,1\} \qquad (5)$$
where $Z_{ij}$ is the entry in row i, column j of Matrix_2, that is, the Z-score of the j-th dimension feature of the i-th user, and $MV_j$ is the j-th value of the mean vector Mean_value. In formula (5) the user's threat confidence contains 14 values of 1; if 14 is greater than the given threshold K, the user is reclassified as a normal user and removed from the AbnormalUsers set.
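The count in formula (5) and its comparison with the threshold can be checked with a few lines of Python. The threshold values 12 and 14 below are arbitrary illustrations, since the text does not fix K, and the function name is hypothetical.

```python
# TCD_i copied from formula (5) in the text.
TCD_I = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]


def is_false_alarm(tcd, k):
    """Return True when the number of 1s in the binary TCD vector exceeds K."""
    return sum(tcd) > k


ones = sum(TCD_I)  # number of dimensions whose Z-score exceeds the column mean
print(ones, is_false_alarm(TCD_I, 12), is_false_alarm(TCD_I, 14))
# prints: 14 True False
```

With K = 12 the user would be removed from AbnormalUsers; with K = 14 the user would stay, because the comparison is strict.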
The beneficial effects of the invention are as follows. The invention analyzes users' language data, extracts language features, and builds numerical feature vectors that represent users' personality psychological features. It then trains a classifier on these vectors to identify users whose personality psychological features are abnormal, and further analyzes the feature-vector deviation of those users to screen out false alarms, reporting the remaining users to a security administrator as potential internal malicious users for analysis and response. The invention takes full account of the attacker's psychological characteristics in an internal attack, performs psychological modeling from the perspective of personality, and constructs an anomaly detection classifier on that basis. It overcomes the shortcoming of existing detection methods, which attend only to the attack process and neglect the attacking subject; it can distinguish anomalous from malicious at a fine granularity, analyzes and detects internal threats comprehensively, and effectively reduces the high false-positive and false-negative rates of traditional internal threat detection methods.
Compared with the prior art, the invention has the following characteristics:
Modeling the attacker's characteristics: the invention remedies the shortcoming that existing detection methods focus only on the features of the attack process; by modeling the characteristics of the attacker, it makes it possible to analyze attack motives and to predict attacks. Taking work mail as an example, the user's language features in the mail are analyzed and combined with research on the statistical correlation between LIWC word categories and personality traits to construct an 18-dimensional feature vector representing the user's personality psychological features, on which a classifier is trained by machine learning.
Proposing the threat confidence TCD: if the personality psychological features modeled from language features were used alone to identify malicious users, the false-positive rate would inevitably be high. For the abnormal users detected by the classifier, the invention therefore further analyzes their average deviation across the 18 personality psychological dimensions; users with larger deviation are finally identified as normal users and deleted from AbnormalUsers, which lowers the false-positive rate of the detection method.
Beyond these main points, the invention also overcomes shortcomings of traditional psychological testing. Traditional testing relies mainly on psychological questionnaires completed by users and on evaluations by colleagues or supervisors. These not only cost considerable time and money; more importantly, subjective bias is hard to avoid in both self-evaluation and third-party evaluation, and such testing may violate privacy-protection laws and regulations. The detection method of the invention is built on the internal audit system: the whole analysis runs automatically without human involvement, and the original content files are deleted automatically after the LIWC word-category analysis, which effectively protects employee privacy while detecting internal malicious users. The method thus reduces the time and economic cost of traditional testing, lowers legal and ethical risk, and effectively reduces the internal threat risk faced by enterprises and organizations.
Drawings
The invention is described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a flow chart of a method for internal threat detection in accordance with the present invention.
Detailed Description
In order to clearly explain the technical features of the present invention, the following detailed description of the present invention is provided with reference to the accompanying drawings. The following disclosure provides many different embodiments, or examples, for implementing different features of the invention. To simplify the disclosure of the present invention, the components and arrangements of specific examples are described below. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. It should be noted that the components illustrated in the figures are not necessarily drawn to scale. Descriptions of well-known components and processing techniques and procedures are omitted so as to not unnecessarily limit the invention.
Internal attacks (or internal threats) are a new type of threat, initiated by insiders of an enterprise or organization, that differs from traditional network intrusion. Insiders sit inside the traditional network security perimeter and hold key knowledge of the security defenses and of the attack targets, so they can bypass existing defense mechanisms and attack from within the enterprise or organization (for example, stealing technical patents or customer lists), causing huge losses. The object of the invention is as follows: based on the internal information audit system of an enterprise or government organization, collect users' language data to analyze their characteristics, build a psychological model for all users, and use it to identify potential internal malicious users, that is, high-risk users who are likely to become internal attackers. On this basis, a list of high-risk users is submitted to internal security managers for analysis so that they can take countermeasures to prevent or stop internal attacks.
The invention discloses an internal threat detection method based on user language features, characterized in that it first analyzes a user's language data, extracts language features, and builds a numerical feature vector that represents the user's personality psychological features; then constructs and trains a classifier to identify users whose personality psychological features are abnormal; and finally analyzes the degree of feature-vector deviation of those users to screen out false alarms, reporting the remaining users to a security administrator as potential internal malicious users for analysis and response.
Preferably, as shown in fig. 1, the internal threat detection method based on the user language features includes the following steps:
1) Data preprocessing: parsing and processing the user language data of an internal audit system, including at least three stages: automatic auditing, automated content processing, and automated aggregation;
2) Constructing the personality psychological feature vector: first analyzing the language data of each user to obtain the word frequencies of the relevant word categories as the Chinese LIWC analysis result; then, using the association between LIWC word categories and the Big Five personality traits, calculating the values of 18 Big Five sub-dimensions as the user's personality psychological feature vector;
3) Training the classifier: first constructing a classifier; selecting the audited user language data from an initial time period and calculating the personality psychological feature vector of each user; then training a one-class support vector machine to obtain a psychological model of the initial user group; finally, for any subsequent time period, calculating the personality psychological feature vectors from the user language data of that period, using the user group psychological model to judge whether each vector is abnormal, and denoting the set of users judged abnormal as AbnormalUsers;
4) Calculating the threat confidence: calculating the threat confidence of the users in the abnormal user set AbnormalUsers to screen them further; the threat confidence calculation comprises the following specific steps:
41) for the users in the abnormal user set AbnormalUsers, forming a matrix Matrix_1 from their 18-dimensional feature vectors, where the number of rows is the number of users in AbnormalUsers and the number of columns is 18;
42) calculating the Z-score of each row of the matrix Matrix_1 column by column to obtain Matrix_2, according to the following formula:
$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma_j} \qquad (3)$$
where $X_{ij}$ is the j-th dimension value of the i-th user in Matrix_1, $\bar{X}_j$ is the mean of the values in the j-th column of the matrix, and $\sigma_j$ is the standard deviation of the j-th column;
after the Z-score has been calculated for each user in Matrix_1, a new matrix Matrix_2 is formed;
43) calculating the mean of each column of the matrix Matrix_2 to obtain an 18-dimensional mean vector Mean_value;
44) for each user in the abnormal user set AbnormalUsers, comparing each value of the user's 18-dimensional feature vector with the corresponding value of the mean vector Mean_value, recording 1 where the value exceeds the mean and 0 otherwise, and taking the resulting 18-dimensional binary vector as the user's threat confidence TCD; if the number of '1's in TCD exceeds a threshold K, marking the user as a normal user and deleting the user from AbnormalUsers;
45) repeating steps 41) to 44) until all users in AbnormalUsers have been judged, and finally reporting the users remaining in AbnormalUsers to a security administrator as potential internal malicious users for analysis and response.
Preferably, the user language data includes work mail data, electronic document data, and social application data. The work mail data is the text content of work mails sent by the user; the electronic document data is work-related text content written by the user and stored in electronic form; the social application data is the text content crawled from the user's social status updates.
Preferably, the 18 Big Five sub-dimensions are: anxiety, anger, depression, self-consciousness, impulsiveness, vulnerability (fragility), trust, morality, altruism, cooperation, modesty, sympathy, self-efficacy, orderliness, dutifulness, achievement striving, self-discipline, and cautiousness.
The method analyzes the language data of users, extracts language features, and establishes a numerical feature vector capable of representing each user's personality psychological characteristics. A classifier is trained on these vectors to identify users with abnormal personality psychological characteristics; the feature vector deviation of those users is then analyzed to screen out false alarms, and the remaining users are reported to a security administrator as potential internal malicious users for analysis and response. As shown in fig. 2, the overall technical solution of the invention is divided into four main steps: data preprocessing, personality psychological feature vector construction, classifier training, and threat confidence calculation. Each is described in detail below.
First, data preprocessing
The analyzed user language data comes from an internal audit system and is mainly of three types:
1. Work mail auditing: the text content of work mails sent by the audited user;
2. Electronic document content auditing: audited work documents and forms in electronic form, such as work-related plans and work reports written by the user, and the text content of multimedia formats such as PPT;
3. Social application content auditing: the text content crawled from the audited user's social status updates, such as microblogs and the WeChat friend circle.
The three types of language data are processed in a similar way; for ease of explanation, the preprocessing of each type is described separately below.
For work mail, the data processing mainly includes:
11) automatic auditing: collecting work mail data for a certain period of time (several months or one year);
12) automated content processing: analyzing only the mails sent by the user; for each mail, stripping the mail header information (subject, sender, recipients, and the like) and extracting only the body text; for a sent mail carrying multiple time tags, considering only the most recent one (for example, for forwarded and replied mails, only the text written at the time of replying or forwarding is considered);
13) automated aggregation: for each user, aggregating all the text content processed in the above steps into one large text file, which is stored for the next stage of analysis.
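Steps 12) and 13) can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names `extract_body` and `aggregate_sent_mail` are hypothetical, mails are assumed to be available as raw RFC 5322 bytes paired with a sender identifier, and earlier messages in a reply or forward are assumed to be quoted with a leading `>`.

```python
from collections import defaultdict
from email import message_from_bytes


def extract_body(raw_mail: bytes) -> str:
    """Return the plain-text body of one mail, without headers or quoted text."""
    msg = message_from_bytes(raw_mail)
    parts = []
    for part in msg.walk():
        # Keep only plain-text parts; attachments and HTML parts are skipped.
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    text = "\n".join(parts)
    # Assumption: text from earlier messages is quoted with a leading '>'.
    kept = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
    return "\n".join(kept).strip()


def aggregate_sent_mail(mails):
    """mails: iterable of (sender, raw_mail) pairs; returns {sender: aggregated text}."""
    per_user = defaultdict(list)
    for sender, raw in mails:
        body = extract_body(raw)
        if body:
            per_user[sender].append(body)
    return {user: "\n".join(bodies) for user, bodies in per_user.items()}
```

In practice the aggregated text of each user would be written to one file per user before the analysis stage; the dictionary here stands in for that storage step.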
For electronic documents:
21) automatic auditing: collecting work-related electronic document data for a certain period of time (several months or one year);
22) automated content processing: removing the headings at every level, the formatting data, and multimedia data such as pictures and sound from each electronic document, and extracting only its plain text content;
23) automated aggregation: for each user, aggregating all the electronic document content processed in the above steps into one large text file, which is stored for the next stage of analysis.
For social applications:
31) automatic auditing: collecting the social application status data (for example, microblogs and friend circle posts) of internal users over a certain period of time (several months or one year);
32) automated content processing: removing non-text data such as pictures, sound, and hyperlinks from the social application status data, and processing only the text written by the user in each status, that is, excluding forwarded text;
33) automated aggregation: for each user, aggregating all the social application status text processed in the above steps into one large text file, which is stored for the next stage of analysis.
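Steps 32) and 33) for social application data might look like the following sketch. The record layout is an assumption made for illustration: each post is a dictionary with the hypothetical keys `text` and `is_forward`, and hyperlinks are recognized with a simple regular expression.

```python
import re

# Simple pattern for http/https hyperlinks embedded in a status text.
URL_RE = re.compile(r"https?://\S+")


def clean_posts(posts):
    """Aggregate the user-written text of a user's posts into one string.

    posts: iterable of dicts with the hypothetical keys 'text' and 'is_forward'.
    """
    kept = []
    for post in posts:
        if post.get("is_forward"):
            continue  # forwarded content was not written by the user
        text = URL_RE.sub("", post.get("text", "")).strip()
        if text:
            kept.append(text)
    return "\n".join(kept)
```

Pictures and sound are assumed to have been dropped already by whatever crawler produced the records, so only text and links remain to be handled here.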
Second, psychological feature construction
The processing and analysis described below apply equally to all three types of data source (work mails, electronic documents, and social application statuses), so they are not described separately for each type.
The TextMind Chinese psychological analysis system (http://ccpl.psych.ac.cn/textmind/) [1] of the Institute of Psychology, Chinese Academy of Sciences, is used to analyze the mail text file of each user, yielding the word frequencies of the relevant word categories as the Chinese LIWC analysis result [2]. LIWC (Linguistic Inquiry and Word Count) is a widely used analysis system for inferring subjective factors such as thoughts, emotions, and personality from language, and the Chinese analysis system is an extension of the original English system built on a Chinese vocabulary. After this step is finished, the original content file of each user is deleted to protect privacy.
Using the association between LIWC word categories and the Big Five personality traits [3], the values of the 18 Big Five sub-dimensions are calculated and used as the user's personality psychological feature vector [4].
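The 18 Big Five sub-dimensions are:
[Table: anxiety, anger, depression, self-consciousness, impulsiveness, vulnerability (fragility), trust, morality, altruism, cooperation, modesty, sympathy, self-efficacy, orderliness, dutifulness, achievement striving, self-discipline, cautiousness]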
The following describes how the value of each sub-dimension is calculated for each user from the LIWC word category analysis result, taking the vulnerability (fragility) trait as an example. The statistical correlation between each sub-dimension trait and the LIWC word categories can be obtained from [4]. For the vulnerability trait the correlations are: sensory words (0.18), anxiety words (0.16), articles (-0.16), first-person singular words (0.14), retrospective words (0.13), causal words (0.11), gap words (0.11), cognitive process words (0.1), modifiers (0.1), and second-person words (-0.1).
These are the 10 LIWC word categories most strongly correlated with the vulnerability trait; the value in parentheses is the correlation coefficient. Combining these correlations with the user's LIWC word category analysis result gives the user's numerical score for the vulnerability trait, which is one of the 18 sub-dimension values.
For the i-th of the 18 sub-dimensions, the statistical correlation between the sub-dimension and the LIWC word categories is taken from the study [4]:
$$Feat_i = \{(q_{i,1}, c_{i,1}), (q_{i,2}, c_{i,2}), \ldots, (q_{i,N_i}, c_{i,N_i})\} \qquad (1)$$
where $Feat_i$ denotes the i-th sub-dimension, $(q_{i,j}, c_{i,j})$ denotes an associated LIWC word category $q_{i,j}$ and its statistical correlation $c_{i,j}$, and $N_i$ is the number of LIWC word categories significantly correlated with the i-th sub-dimension. On the basis of formula (1), formula (2) is calculated:
$$Feat_i = \sum_{j=1}^{N_i} q_j \, c_j \qquad (2)$$
Formula (2) is how each value of the 18-dimensional personality psychological feature vector is calculated for any user: $q_j$ and $c_j$ are, respectively, the user's word frequency on the j-th LIWC word category associated with the i-th dimension and the corresponding statistical correlation. The remaining values are calculated in the same way with their own correlations. In the end, each user has an 18-dimensional feature vector representing his or her personality psychological characteristics, each value being a correlation-weighted sum of the user's LIWC word category results.
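The weighted sum of formula (2) can be illustrated with the vulnerability trait correlations quoted above. The sketch below is an assumption-laden illustration: the category keys are hypothetical English labels for the listed LIWC word categories, and the word frequencies are taken to be whatever numeric values the LIWC analysis reports for a user.

```python
# Correlations for the vulnerability (fragility) trait, as listed in the text.
# The dictionary keys are hypothetical labels, not official LIWC identifiers.
VULNERABILITY_WEIGHTS = {
    "sensory": 0.18,
    "anxiety": 0.16,
    "articles": -0.16,
    "first_person_singular": 0.14,
    "retrospective": 0.13,
    "causal": 0.11,
    "gap": 0.11,
    "cognitive_process": 0.10,
    "modifiers": 0.10,
    "second_person": -0.10,
}


def trait_score(word_freq, weights):
    """Formula (2): sum of word frequency times correlation over the trait's categories."""
    return sum(c * word_freq.get(category, 0.0) for category, c in weights.items())


def feature_vector(word_freq, trait_weights):
    """trait_weights: ordered mapping of trait name -> {category: correlation}."""
    return [trait_score(word_freq, w) for w in trait_weights.values()]
```

A full feature vector would pass 18 such weight tables, one per sub-dimension, in a fixed order; a category missing from a user's results is treated here as frequency 0.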
Third, training the classifier
To build a classifier with a machine learning algorithm, the invention proposes selecting an initial time period (for example, 1 month), calculating the personality psychological feature vector of each user from the audited work mails of that period according to the psychological feature construction process, and training a one-class support vector machine (One-Class SVM, from the scikit-learn 0.19 algorithm library) to obtain a psychological model PsyModel of the initial user group.
For any subsequent time period (for example, a later month), the personality psychological feature vectors are calculated from the users' work mail content in that period according to the psychological feature construction process. The user group psychological model PsyModel obtained in the previous step is then used to judge whether each vector is abnormal, and the set of users judged abnormal is denoted AbnormalUsers.
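A minimal sketch of this training and detection step with scikit-learn's `OneClassSVM` is given below. The kernel, `gamma`, and `nu` settings are assumptions chosen for illustration, since the text does not specify any hyperparameters, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def train_psy_model(train_vectors, nu=0.05):
    """Fit a one-class SVM on the 18-dimensional vectors of the initial period.

    nu (assumed value) bounds the fraction of training users treated as outliers.
    """
    model = OneClassSVM(kernel="rbf", gamma="auto", nu=nu)
    model.fit(np.asarray(train_vectors, dtype=float))
    return model


def find_abnormal_users(model, user_ids, vectors):
    """Return the ids of users whose vectors the model labels as outliers."""
    labels = model.predict(np.asarray(vectors, dtype=float))  # +1 inlier, -1 outlier
    return [uid for uid, label in zip(user_ids, labels) if label == -1]
```

The list returned by `find_abnormal_users` plays the role of AbnormalUsers and is the input to the threat confidence screening.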
Fourth, calculating the threat confidence
The user set AbnormalUsers judged abnormal by the classifier obtained in the training process may still contain some normal users, so the threat confidence needs to be calculated to screen the users further. Specifically:
1) for the users in AbnormalUsers, a matrix Matrix_1 is formed from their 18-dimensional feature vectors, where the number of rows is the number of users in AbnormalUsers and the number of columns is 18;
2) the Z-score of each row of the matrix Matrix_1 is calculated column by column to obtain Matrix_2, according to the formula:
$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma_j} \qquad (3)$$
where $X_{ij}$ is the j-th dimension value of the i-th user in Matrix_1, $\bar{X}_j$ is the mean of the values in the j-th column of the matrix, and $\sigma_j$ is the standard deviation of the j-th column. After the Z-score has been calculated for each user (that is, each row of data) in Matrix_1, a new matrix Matrix_2 is formed;
3) the mean of each column of the matrix Matrix_2 is calculated to obtain an 18-dimensional mean vector Mean_value;
4) for each user in AbnormalUsers, each value of the user's 18-dimensional feature vector is compared with the corresponding value of the mean vector Mean_value, recording 1 where the value exceeds the mean and 0 otherwise, and the resulting 18-dimensional binary vector is taken as the user's threat confidence (TCD); if the number of '1's in TCD exceeds a threshold K, the user is marked as a normal user and deleted from AbnormalUsers; the suggested value of K is 12, and it can be adjusted between 12 and 16 as the situation requires. The specific formula is:
$$TCD_{ij} = \begin{cases} 1, & Z_{ij} > MV_j \\ 0, & Z_{ij} \le MV_j \end{cases} \qquad (4)$$
$$TCD_i = \{1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,1,1,1\} \qquad (5)$$
where $Z_{ij}$ is the data in row i and column j of Matrix_2, that is, the Z-score of the j-th dimension feature of the i-th user, and $MV_j$ is the j-th value of the mean vector Mean_value. In the example of formula (5), the number of '1's in the user's threat confidence is 14, which is greater than the given threshold K = 12, so the user is reclassified as a normal user and removed from the AbnormalUsers set.
5) steps 1) to 4) are repeated until all users in AbnormalUsers have been judged, and the users remaining in AbnormalUsers are finally reported to a security administrator as potential internal malicious users for analysis and response.
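Steps 1) to 5) can be sketched in plain Python as below. Two points are assumptions not fixed by the text: the population standard deviation is used for the Z-score, and a column with zero standard deviation is given a Z-score of 0. The function name `screen_abnormal_users` is hypothetical.

```python
from statistics import mean, pstdev


def screen_abnormal_users(abnormal, K=12):
    """abnormal: {user_id: feature vector}. Returns the users still reported."""
    users = list(abnormal)
    matrix_1 = [abnormal[u] for u in users]
    n_cols = len(matrix_1[0])
    columns = [[row[j] for row in matrix_1] for j in range(n_cols)]
    col_mean = [mean(col) for col in columns]
    col_std = [pstdev(col) for col in columns]  # assumption: population std
    # Matrix_2: column-wise Z-scores (0.0 where the column has no spread).
    matrix_2 = [
        [(row[j] - col_mean[j]) / col_std[j] if col_std[j] else 0.0
         for j in range(n_cols)]
        for row in matrix_1
    ]
    # Mean_value: column means of Matrix_2 (close to 0 by construction).
    mean_value = [mean([row[j] for row in matrix_2]) for j in range(n_cols)]
    remaining = []
    for user, z_row in zip(users, matrix_2):
        tcd = [1 if z > mv else 0 for z, mv in zip(z_row, mean_value)]
        if sum(tcd) > K:
            continue  # more than K ones: treated as a normal user and removed
        remaining.append(user)
    return remaining


# The threat confidence of formula (5) contains 14 ones, which exceeds K = 12.
tcd_example = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
```

Because the Z-scores are computed only over the users in AbnormalUsers, the outcome for one user depends on who else was flagged in the same period.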
Existing internal threat detection methods rely on policy checks and anomaly detection over behavioral data, and suffer from high false positive and false negative rates. To address this, the invention provides a detection method based on psychological modeling: it constructs a feature vector representing the personality characteristics of a user (employee) from the user's language data, builds a psychological model of the whole user group with a machine learning algorithm, and identifies abnormal internal users. On this basis, the invention analyzes the overall deviation of the identified abnormal users across the feature dimensions, removes normal users that may have been falsely flagged, and obtains the potential internal malicious users, who are submitted to a security administrator for further analysis and response. The invention takes into account the psychological characteristics of the attacker in an internal attack, performs psychological modeling from the perspective of personality, and constructs an anomaly detection classifier. This makes up for the shortcoming of existing detection methods, which focus only on the attack process and ignore the attacking subject, so that anomalous and malicious users can be distinguished at a fine granularity, internal threats can be analyzed and detected comprehensively, and the high false positive and false negative rates of traditional internal threat detection methods can be reduced.
In the method, identity information (such as mail headers, work document metadata, and social IDs) is removed from the user language data obtained from audited work mail, audited work documents, and audited social media applications (microblogs, friend circles, and the like); the text data is aggregated into one large file, which is then analyzed with the TextMind Chinese psychological analysis system [1] to obtain the LIWC word category results;
based on the research results [4] on the statistical correlation between LIWC word categories [2] and personality psychological characteristics [3], an 18-dimensional personality psychological feature vector is established, with the anxiety trait as its first component;
for the user set AbnormalUsers judged abnormal by the classifier, the Z-scores of the users are calculated column by column and the column means are taken as a reference vector; for each user, the number of column features that exceed the corresponding mean is counted as the threat confidence, and a user whose count exceeds the preset threshold K is judged to be a normal user and removed from AbnormalUsers.
References cited by the invention:
[1] TextMind Chinese psychological analysis system: http://ccpl.psych.ac.cn/textmind/
[2] LIWC Program: http://liwc.wpengine.com/
[3] Big Five personality model: http://www.baike.com/wiki/%E5%A4%A7%E4%BA%94%E4%BA%BA%E6%A0%BC%E7%90%86%E8%AE%BA
[4] Association between LIWC word categories and the Big Five personality model: https://www.researchgate.net/publication/44687893_Personality_in_100000_Words_A_large-scale_analysis_of_personality_and_word_use_among_bloggers
The foregoing is only a preferred embodiment of the present invention. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the principle of the invention, and such modifications and improvements are also considered to be within the scope of the invention.

Claims (9)

1. An internal threat detection method based on user language features, characterized in that: language data of a user is first analyzed, language features are extracted, and a numerical feature vector capable of representing the user's personality psychological characteristics is established; a classifier is then constructed and trained to identify users with abnormal personality psychological characteristics; finally, the degree of feature vector deviation of the users with abnormal personality psychological characteristics is analyzed to screen out false-alarm users, and the remaining users are reported to a security administrator as potential internal malicious users for analysis and response;
the internal threat detection method based on the user language features comprises the following steps:
1) data preprocessing: parsing and processing the user language data of an internal audit system, including at least three stages: automatic auditing, automated content processing, and automated aggregation;
2) constructing the personality psychological feature vector: first analyzing the language data of each user to obtain the word frequencies of the relevant word categories as the Chinese LIWC analysis result; then, using the association between LIWC word categories and the Big Five personality traits, calculating the values of 18 Big Five sub-dimensions as the user's personality psychological feature vector;
3) training the classifier: first constructing a classifier; selecting the audited user language data from an initial time period and calculating the personality psychological feature vector of each user; then training a one-class support vector machine to obtain a psychological model of the initial user group; finally, for any subsequent time period, calculating the personality psychological feature vectors from the user language data of that period, using the user group psychological model to judge whether each vector is abnormal, and denoting the set of users judged abnormal as AbnormalUsers;
4) calculating the threat confidence: calculating the threat confidence of the users in the abnormal user set AbnormalUsers to screen them further; the threat confidence calculation comprises the following specific steps:
41) for the users in the abnormal user set AbnormalUsers, forming a matrix Matrix_1 from their 18-dimensional feature vectors, where the number of rows is the number of users in AbnormalUsers and the number of columns is 18;
42) calculating the Z-score of each row of the matrix Matrix_1 column by column to obtain Matrix_2, according to the following formula:
$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{\sigma_j} \qquad (3)$$
where $X_{ij}$ is the j-th dimension value of the i-th user in Matrix_1, $\bar{X}_j$ is the mean of the values in the j-th column of the matrix, and $\sigma_j$ is the standard deviation of the j-th column;
after the Z-score has been calculated for each user in Matrix_1, a new matrix Matrix_2 is formed;
43) calculating the mean of each column of the matrix Matrix_2 to obtain an 18-dimensional mean vector Mean_value;
44) for each user in the abnormal user set AbnormalUsers, comparing each value of the user's 18-dimensional feature vector with the corresponding value of the mean vector Mean_value, recording 1 where the value exceeds the mean and 0 otherwise, and taking the resulting 18-dimensional binary vector as the user's threat confidence TCD; if the number of '1's in TCD exceeds a threshold K, marking the user as a normal user and deleting the user from AbnormalUsers;
45) repeating steps 41) to 44) until all users in AbnormalUsers have been judged, and finally reporting the users remaining in AbnormalUsers to a security administrator as potential internal malicious users for analysis and response.
2. The method according to claim 1, wherein the user language data comprises work mail data, electronic document data, and social application data; the work mail data is the text content of work mails sent by the user, the electronic document data is work-related text content written by the user and stored in electronic form, and the social application data is the text content crawled from the user's social status updates.
3. The method of claim 2, wherein the parsing and processing of the work mail data comprises the following steps:
111) automatic auditing: collecting work mail data for a certain time period;
112) automated content processing: analyzing only the mails sent by the user, stripping the mail header information of each mail, and extracting only the body text; for a sent mail carrying multiple time tags, considering only the most recently sent content;
113) automated aggregation: aggregating the text content of each user's work mail data, after automatic auditing and automated content processing, into one large text file and storing it.
4. The method according to claim 2, wherein the parsing and processing of the electronic document data comprises the following steps:
121) automatic auditing: collecting work-related electronic document data for a certain time period;
122) automated content processing: removing the headings at every level, the formatting data, and the picture and sound data from each electronic document, and extracting only its plain text content;
123) automated aggregation: aggregating the text content of each user's electronic document data, after automatic auditing and automated content processing, into one large text file and storing it.
5. The internal threat detection method based on user language features as claimed in claim 2, wherein the parsing and processing of the social application data comprises the following steps:
131) automatic auditing: collecting the social application status data of internal users for a certain time period;
132) automated content processing: removing the picture, sound, and hyperlink data from the social application status data, and processing only the text written by the user in each status;
133) automated aggregation: aggregating the social application text content of each user, after automatic auditing and automated content processing, into one large text file and storing it.
6. The method according to claim 3, wherein in the process of constructing the personality psychological feature vector, the mail text file of each user is analyzed with the Chinese psychological analysis system (TextMind) of the Institute of Psychology, Chinese Academy of Sciences, to obtain the word frequencies of the relevant word categories as the Chinese LIWC analysis result; and the values of the 18 Big Five sub-dimensions are calculated as the user's personality psychological feature vector using the association between LIWC word categories and the Big Five personality traits.
7. The method according to claim 1, wherein the 18 Big Five sub-dimensions are: anxiety, anger, depression, self-consciousness, impulsiveness, vulnerability (fragility), trust, morality, altruism, cooperation, modesty, sympathy, self-efficacy, orderliness, dutifulness, achievement striving, self-discipline, and cautiousness.
8. The method as claimed in claim 1, wherein the 18 sub-dimension feature values are calculated as follows:
for the i-th of the 18 sub-dimensions, the statistical correlation between the sub-dimension and the LIWC word categories is:
$$Feat_i = \{(q_{i,1}, c_{i,1}), (q_{i,2}, c_{i,2}), \ldots, (q_{i,N_i}, c_{i,N_i})\} \qquad (1)$$
where $Feat_i$ denotes the i-th sub-dimension, $(q_{i,j}, c_{i,j})$ denotes an associated LIWC word category $q_{i,j}$ and its statistical correlation $c_{i,j}$, and $N_i$ is the number of LIWC word categories significantly correlated with the i-th sub-dimension;
on the basis of formula (1), the personality psychological feature vector of the user is calculated by formula (2):
$$Feat_i = \sum_{j=1}^{N_i} q_j \, c_j \qquad (2)$$
where $Feat_i$ is the value of any one of the 18 dimensions of the user's personality psychological feature vector, and $q_j$ and $c_j$ are, respectively, the user's word frequency on the j-th LIWC word category associated with the i-th dimension and the corresponding statistical correlation.
9. The method of claim 1, wherein the threat confidence TCD is calculated as follows:
$$TCD_{ij} = \begin{cases} 1, & Z_{ij} > MV_j \\ 0, & Z_{ij} \le MV_j \end{cases} \qquad (4)$$
or
$$TCD_i = \{1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,1,1,1\} \qquad (5)$$
where $Z_{ij}$ is the data in row i and column j of Matrix_2, that is, the Z-score of the j-th dimension feature of the i-th user, and $MV_j$ is the j-th value of the mean vector Mean_value; in formula (5) the number of '1's in the user's threat confidence is 14, and if this number is greater than the given threshold K, the user is reclassified as a normal user and removed from the AbnormalUsers set.
CN201710374486.9A 2017-05-24 2017-05-24 Internal threat detection method based on user language features Active CN107196942B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710374486.9A CN107196942B (en) 2017-05-24 2017-05-24 Internal threat detection method based on user language features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710374486.9A CN107196942B (en) 2017-05-24 2017-05-24 Internal threat detection method based on user language features

Publications (2)

Publication Number Publication Date
CN107196942A CN107196942A (en) 2017-09-22
CN107196942B true CN107196942B (en) 2020-05-15

Family

ID=59874365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710374486.9A Active CN107196942B (en) 2017-05-24 2017-05-24 Internal threat detection method based on user language features

Country Status (1)

Country Link
CN (1) CN107196942B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110290155B (en) * 2019-07-23 2020-11-06 北京邮电大学 Defense method and device for social engineering attack
CN110807468B (en) * 2019-09-19 2023-06-20 平安科技(深圳)有限公司 Method, device, equipment and storage medium for detecting abnormal mail
CN110837604B (en) * 2019-10-16 2020-12-25 贝壳找房(北京)科技有限公司 Data analysis method and device based on housing monitoring platform
CN115022052B (en) * 2022-06-07 2023-05-30 山东省计算中心(国家超级计算济南中心) Internal user abnormal behavior fusion detection method and system based on user binary analysis

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014205421A1 (en) * 2013-06-21 2014-12-24 Arizona Board Of Regents For The University Of Arizona Automated detection of insider threats
CN105005594A (en) * 2015-06-29 2015-10-28 嘉兴慧康智能科技有限公司 Abnormal Weibo user identification method
CN105138570A (en) * 2015-07-26 2015-12-09 吉林大学 Calculation method of crime degree of speech data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120137367A1 (en) * 2009-11-06 2012-05-31 Cataphora, Inc. Continuous anomaly detection based on behavior modeling and heterogeneous information analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
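Research on Insider Threat Detection; Yang Guang et al.; "Research on Insider Threat Detection"; 20160731; Vol. 1 (No. 3); Section 1 and the last paragraph of Section 3.2.1 of the main text *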

Also Published As

Publication number Publication date
CN107196942A (en) 2017-09-22

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant