CN110674288A

CN110674288A - User portrait method applied to network security field

Info

Publication number: CN110674288A
Application number: CN201810602323.6A
Authority: CN
Inventors: 杨育斌; 黄冠寰; 柯宗贵
Original assignee: Blue Shield Information Security Technology Co Ltd
Current assignee: Blue Shield Information Security Technology Co Ltd; Bluedon Information Security Technologies Co Ltd
Priority date: 2018-06-12
Filing date: 2018-06-12
Publication date: 2020-01-10

Abstract

The invention discloses a user portrait method applied to the field of network security. Machine learning methods such as semantic mining, time series fitting, clustering, and correlation analysis are added to the traditional user portrait method based on statistics and rules, so that the user behavior model is mined and analyzed in depth and more accurate and effective anomaly detection is provided.

Description

User portrait method applied to network security field

Technical Field

The invention relates to the technical field of information security, in particular to a user portrait method applied to the field of network security.

Background

The difficulty of user profiling lies in modeling and analyzing large-scale historical data. There are two main difficulties: organizing and analyzing large amounts of semi-structured and unstructured data, and the lack of deep insight into user operation behavior. The traditional user portrait method mainly computes simple statistics on user attributes such as count, frequency, and time period of occurrence, and, assuming a normal distribution, uses the 3-sigma rule to flag samples that deviate severely from the sample mean as anomalies (although no evidence is given that these attributes follow a normal distribution). In addition, the traditional method emphasizes extraction and analysis of structured data and handles unstructured and semi-structured data only by simple rule matching.
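For reference, the 3-sigma rule used by the traditional approach can be sketched as follows. This is only an illustration: the function name and the sample counts are hypothetical, and the mean and standard deviation are taken from historical data so that a new outlier does not distort them.

```python
from statistics import mean, pstdev

def three_sigma_outliers(history, candidates):
    """Return the candidates lying more than 3 standard deviations from the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    return [x for x in candidates if abs(x - mu) > 3 * sigma]

# Hypothetical daily operation counts for one user.
history = [4, 5, 6, 5, 4, 6, 5, 5]
flagged = three_sigma_outliers(history, [5, 7, 40])
```

The rule is only as good as the normality assumption behind it, which is the weakness the background points out.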

Disclosure of Invention

In order to overcome the defects of the prior art, when addressing network security the method attends not only to the traffic generated by the user but also extracts features such as the user's natural attributes and operation habits from multiple dimensions and builds an index system from them. It further extracts deeper information using semantic mining, helping administrators understand users and their security situation quantitatively.

1. Data is collected from multiple data sources; besides traffic and logs, the natural attributes of users are collected.

2. In data preprocessing, besides traditional statistical description, semantic extraction is carried out on unstructured data to obtain a vectorized representation.

3. The user portrait is modeled by combining machine learning algorithms such as clustering, group division, and recommendation.

The technical scheme of the invention has the following beneficial effects:

The invention collects information from multiple dimensions and describes user behavior more comprehensively. Deep mining of text data describes user behavior more accurately, and correlation analysis reduces the false alarm rate.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flow chart of the present invention.

Detailed Description

First, data acquisition

Data for a user is obtained from a plurality of data sources. The structured data includes demographic attributes of the user (such as age, gender, jail, department, education level, etc.) and job characteristics (such as work duration, workload, years of service, shift time, working time period, etc.). The semi-structured data includes system operations (such as authority compliance, commonly used functions, operation frequency, operation time period, number of effective operations, etc.) and sensitive operations (such as unauthorized operation, non-self operation, data leakage, etc.). The unstructured data mainly comprises the specific content of the traffic, in formats including text, pictures, audio, video, and the like.

Second, data preprocessing

Preprocessing mainly handles structured and semi-structured data, in two main forms: statistical description and vectorized description.

In machine learning, non-numeric data that has not been converted can only be embedded as one-hot vectors, which carry little practical meaning. Keyword-based extraction, in turn, requires the extracted keywords to be classified again to reduce the number of categories, which is cumbersome and gives no direct mathematical representation of the data.

To address this problem, doc2vec is used to extract a vectorized representation of each document for unstructured text data. Doc2vec differs slightly from word2vec: when predicting the next word, the doc vector and the word vectors are concatenated to make the prediction. After training, each word is mapped to a unique vector, and each document is also mapped to a unique vector. The doc vector can then be used in subsequent machine learning tasks and, like a word vector, can measure the similarity between two texts to some extent.
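The contrast between one-hot encoding and dense document vectors can be illustrated with cosine similarity. The sketch below does not train doc2vec; the three-dimensional vectors and document names are hypothetical stand-ins for what a trained model might produce.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def one_hot(index, size):
    """One-hot vector with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Any two distinct one-hot categories are orthogonal, so similarity is always 0.
sim_one_hot = cosine_similarity(one_hot(0, 3), one_hot(1, 3))

# Hypothetical doc vectors (assumed output of a trained doc2vec model).
doc_contract_a = [0.9, 0.1, 0.0]
doc_contract_b = [0.8, 0.2, 0.1]
doc_holiday = [0.0, 0.2, 0.9]
sim_related = cosine_similarity(doc_contract_a, doc_contract_b)
sim_unrelated = cosine_similarity(doc_contract_a, doc_holiday)
```

With dense vectors, related documents score close to 1 and unrelated ones close to 0, a distinction that one-hot codes cannot express.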

Third, establishing the user model

1) Application behavior model for a user

Users are divided into groups according to their behavior, and the behavior of all users in a group is taken as the behavior baseline for each member of that group.

The steps are as follows:

① Use HTTP protocol data, splitting the IP and URL, to identify users, applications, and actions.

② Add external conditions, such as time slicing, to define application behaviors.

③ Compute the user behavior matrix (UV matrix).

④ Based on the UV matrix, use a clustering method (K-means, Gaussian clustering, density clustering) to find groups of users with similar behavior.

⑤ Collect statistics on the application behavior types and application behavior counts of all members in the same group.

⑥ Take the statistics from the previous step as the behavior baseline of that group of users.

⑦ For example, a group baseline may be "approval operation (3 times) in the application (OA) accessed in the time period (10-11)"; when a user in the group performs the approval operation 100 times under the same conditions, an exception is reported.
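The steps above can be sketched roughly as follows. This is an illustration under several assumptions not stated in the source: users are keyed by source IP, the time slice is the hour of day, the group membership is taken as given (as if produced by the clustering step), the baseline is the highest count seen in the group, and the `tolerance` factor is hypothetical. All names and sample records are made up.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def to_behavior(url, hour):
    """Split a URL into (application, action) and attach the hourly time slice."""
    parsed = urlparse(url)
    action = parsed.path.strip("/") or "index"
    return (parsed.hostname, action, hour)

def build_uv_matrix(records):
    """Count how often each user (keyed by IP) performs each behavior."""
    counts = defaultdict(Counter)
    for ip, url, hour in records:
        counts[ip][to_behavior(url, hour)] += 1
    users = sorted(counts)
    behaviors = sorted({b for c in counts.values() for b in c})
    matrix = [[counts[u][b] for b in behaviors] for u in users]
    return users, behaviors, matrix

def group_baseline(users, behaviors, matrix, members):
    """Baseline per behavior: the highest count seen among the group's members."""
    rows = [matrix[users.index(u)] for u in members]
    return {b: max(r[j] for r in rows) for j, b in enumerate(behaviors)}

def is_anomalous(baseline, behavior, observed, tolerance=3.0):
    """Flag a count above `tolerance` times the baseline (unseen behaviors have baseline 0)."""
    return observed > tolerance * baseline.get(behavior, 0)

# Hypothetical HTTP records: (source IP, URL, hour of day).
records = (
    [("10.0.0.1", "http://oa.example.com/approve", 10)] * 3
    + [("10.0.0.2", "http://oa.example.com/approve", 10)] * 2
    + [("10.0.0.2", "http://oa.example.com/query", 14)]
)
users, behaviors, matrix = build_uv_matrix(records)
group = ["10.0.0.1", "10.0.0.2"]  # assumed output of the clustering step
baseline = group_baseline(users, behaviors, matrix, group)
approve = ("oa.example.com", "approve", 10)
```

Here `is_anomalous(baseline, approve, 100)` reports an exception, while a count of 3 stays within the group baseline.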

Data to be detected:

2) Data behavior model of a user

Taking each user as the object, the data behavior baseline of that user is computed for each application.

The training steps are as follows:

① For a given user, collect the files involved in each application the user uses.

② Extract content information from the files involved in a particular application used by that user.

③ Segment the content information into words and extract doc vectors with doc2vec.

④ Use an LDA topic model with a built-in keyword library to perform content classification (topic) and sensitive word extraction (keyword). Two results are obtained: "user - application - sensitive word (secondary right) - label (primary right)" and "label (primary right) - sensitive word (secondary right)".

⑤ Based on the doc vectors, use a clustering method (K-means, Gaussian clustering, density clustering) to divide the data into groups of similar data, and build a baseline of the user's data usage behavior from the user's usage of each data group (count, frequency, time period, etc.).
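Step ⑤ can be sketched with a plain K-means over doc vectors followed by a per-group usage count. This is a minimal illustration rather than the patent's implementation: the two-dimensional vectors, the access log, and the choice to seed centroids with the first k points (for determinism) are all assumptions.

```python
import math
from collections import Counter

def kmeans(points, k, iterations=20):
    """Plain K-means; the first k points seed the centroids so the run is deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical doc vectors for five files (assumed doc2vec output).
doc_vectors = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8), (0.85, 0.15)]
labels, centroids = kmeans(doc_vectors, k=2)

# Hypothetical access log: indices of the files the user opened.
access_log = [0, 2, 4, 0, 1]
usage_baseline = Counter(labels[i] for i in access_log)
```

The resulting counter records how often the user touches each data group; frequency and time period could be tracked the same way by extending the key.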

Fourth, prediction based on the user model

1. Predicting user behavior

1) Extracting features describing the behavior

The response text returned by the accessed host is acquired and segmented into words, and an LDA topic model is used to extract keywords or topics, which serve as the features of the behavior.

2) Learning user preference features and recommending preferred behaviors (HOST)

① Classifier mode

The classifier is built from feature data of the HOSTs the user prefers (visited) and does not prefer (not visited). When a new batch of HOSTs appears, the classifier can identify whether the user is interested in each HOST.
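The source does not name a specific classifier. As one possible illustration, the sketch below uses a small multinomial naive Bayes over HOST keywords with add-one smoothing; the keyword lists are hypothetical, and both classes are assumed to have at least one training sample.

```python
import math
from collections import Counter

def train(samples):
    """samples: list of (keywords, preferred) pairs, where preferred is True or False."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for keywords, preferred in samples:
        word_counts[preferred].update(keywords)
        doc_counts[preferred] += 1
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def predict(model, keywords):
    """Return True if the preferred class scores higher under naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for w in keywords:
            # Add-one smoothing keeps unseen words from zeroing the probability.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]

# Hypothetical keyword features for visited (True) and not visited (False) HOSTs.
samples = [
    (["security", "firewall", "patch"], True),
    (["security", "audit"], True),
    (["games", "video"], False),
    (["shopping", "video"], False),
]
model = train(samples)
interested = predict(model, ["firewall", "audit"])
not_interested = predict(model, ["video", "games"])
```

Any other classifier that accepts the same keyword or topic features could be substituted.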

② Nearest neighbor similarity mode

Feature vectors are extracted from the HOSTs the user has visited in the past. For a batch of new behaviors (HOSTs) to be detected, the K most similar behaviors among all of the user's historical behaviors are computed using Euclidean distance or cosine distance, and the behavior with the smallest distance, i.e. the most similar one, is taken as the recommendation result.
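One way to read the nearest neighbor mode is sketched below: each candidate HOST is scored by its mean Euclidean distance to its K nearest historical HOSTs, and the candidate with the smallest score is recommended. The scoring rule, the feature vectors, and the host names are assumptions for illustration, and the history is assumed to be non-empty.

```python
import math

def knn_score(candidate, history, k):
    """Mean Euclidean distance from a candidate to its k nearest visited HOSTs."""
    nearest = sorted(math.dist(candidate, h) for h in history)[:k]
    return sum(nearest) / len(nearest)

def recommend(candidates, history, k=2):
    """Return the name of the candidate HOST closest to the user's history."""
    return min(candidates, key=lambda name: knn_score(candidates[name], history, k))

# Hypothetical feature vectors of HOSTs the user visited in the past.
history = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.3)]

# Hypothetical new HOSTs to be scored.
candidates = {
    "wiki.example.com": (0.95, 0.05),
    "games.example.com": (0.0, 1.0),
}
best = recommend(candidates, history)
```

Cosine distance could replace `math.dist` without changing the rest of the logic.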

2. Predicting the user's data usage behavior

1) Extracting historical behavior features

Doc vectors are extracted from the data the user has used historically and clustered to obtain information such as the categories, frequency, and time periods of frequently used data.

2) Based on this information, a clustering method (K-means, Gaussian clustering, density clustering, etc.) is used to find users whose data behavior is similar to the user's.

3) Within this group of similar users, find categories of data that the user has not used but that are used within the group. The user is considered reasonably likely to use such data, so no anomaly is reported in these cases.
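The suppression logic in step 3) reduces to simple set operations once the similar-user group is known. In the sketch below the group is taken as given, and the user names and data categories are hypothetical.

```python
def likely_categories(user, usage, similar_users):
    """Data categories the user has not used but that peers in the same group have."""
    peers = set().union(*(usage[u] for u in similar_users if u != user))
    return peers - usage[user]

def should_report(user, category, usage, similar_users):
    """Report only categories that neither the user nor the similar users have used."""
    if category in usage[user]:
        return False
    return category not in likely_categories(user, usage, similar_users)

# Hypothetical data categories used by each user.
usage = {
    "alice": {"contracts", "invoices"},
    "bob": {"contracts", "payroll"},
    "carol": {"invoices", "payroll"},
}
group = ["alice", "bob", "carol"]  # assumed output of the clustering step

expected_for_alice = likely_categories("alice", usage, group)
```

For this data, `should_report("alice", "payroll", usage, group)` is false because peers use payroll data, while an access to a category nobody in the group uses would still be reported.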

3. Detecting abnormal data usage by a user

Abnormal data usage behavior of a user is detected as follows:

① Obtain the files the user operated on in the application.

② Extract the content information.

③ Segment the content information into words.

④ Match the word segmentation result against the "label (primary right) - sensitive word (secondary right)" table to obtain the probability of each label.

⑤ Match the resulting label list against the user's label list to determine the degree of coincidence with the user's label list.
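Steps ④ and ⑤ can be sketched as follows. The sketch simplifies the table to a mapping from label to a set of sensitive words and ignores the primary and secondary values attached to them; the label probability is the share of matched words, and the coincidence is the probability mass falling on labels the user normally works with. The table contents, tokens, and the 0.5 threshold are hypothetical.

```python
from collections import Counter

def label_probabilities(tokens, label_table):
    """Share of sensitive-word matches that falls on each label."""
    hits = Counter()
    for label, words in label_table.items():
        hits[label] = sum(1 for t in tokens if t in words)
    total = sum(hits.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in hits.items() if n}

def coincidence(probabilities, user_labels):
    """Probability mass on labels that appear in the user's own label list."""
    return sum(p for label, p in probabilities.items() if label in user_labels)

# Hypothetical "label - sensitive word" table.
label_table = {
    "finance": {"invoice", "salary", "budget"},
    "hr": {"resume", "salary"},
    "r&d": {"source", "design"},
}

# Hypothetical word segmentation result for one file.
tokens = ["please", "send", "the", "source", "design", "and", "budget"]
probs = label_probabilities(tokens, label_table)

user_labels = {"finance"}  # labels the user normally works with
score = coincidence(probs, user_labels)
is_abnormal = score < 0.5  # hypothetical threshold
```

Text in languages without whitespace word boundaries would need a proper word segmenter before the matching step.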

The user portrayal method applied to the network security field provided by the embodiment of the invention is described in detail above, a specific example is applied in the text to explain the principle and the implementation of the invention, and the description of the above embodiment is only used to help understand the method of the invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A user portrait method applied to the network security field, characterized in that an evaluation index system is extracted and established from multiple dimensions of the security field, based on user features such as natural attributes, operation habits, and traffic usage.

2. The user portrait method applied to the network security field as claimed in claim 1, wherein an LDA algorithm is used to perform semantic mining on text data in traffic and to extract topics and corresponding keywords, so as to further classify the data.

3. The user portrait method applied to the network security field as claimed in claim 1, wherein potential associations among users are discovered through grouping and recommendation algorithms based on the users' traffic baselines, data usage baselines, and natural attributes, and potential additional features of the users are mined.