CN110674288A - User portrait method applied to network security field - Google Patents

User portrait method applied to network security field

Info

Publication number
CN110674288A
Authority
CN
China
Prior art keywords
user
data
users
network security
security field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810602323.6A
Other languages
Chinese (zh)
Inventor
杨育斌
黄冠寰
柯宗贵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Blue Shield Information Security Technology Co Ltd
Bluedon Information Security Technologies Co Ltd
Original Assignee
Blue Shield Information Security Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Blue Shield Information Security Technology Co Ltd
Priority to CN201810602323.6A
Publication of CN110674288A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a user portrait method applied to the field of network security. Machine learning methods such as semantic mining, time series fitting, clustering and correlation analysis are added to the traditional user portrait method based on statistics and rules, so that user behavior models are mined and analyzed in depth and anomaly detection becomes more accurate and effective.

Description

User portrait method applied to network security field
Technical Field
The invention relates to the technical field of information security, in particular to a user portrait method applied to the field of network security.
Background
The difficulty of user profiling lies in modeling and analyzing large-scale historical data. There are two main difficulties: organizing and analyzing large amounts of semi-structured and unstructured data, and the lack of deep insight into user operation behavior. The traditional user portrait method mainly computes simple statistics over user attributes such as counts, frequencies and time periods of occurrence, and uses the 3-sigma rule, which assumes a normal distribution, to flag samples that deviate strongly from the sample mean as anomalies (although no proof is given that these attributes are normally distributed). In addition, the traditional method emphasizes the extraction and analysis of structured data and handles unstructured and semi-structured data only by simple rule matching.
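The 3-sigma rule used by the traditional method can be sketched as follows. This is a minimal illustration, not the patent's code: the function name and the login counts are invented for the example, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev

def three_sigma_outliers(values):
    """Return the values lying more than 3 standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) > 3 * sigma]

# Hypothetical daily login counts for one user; the last day is unusual.
logins = [5, 6, 4, 5, 7, 5, 6, 4, 5, 6, 5, 4, 6, 5, 5, 6, 4, 5, 6, 60]
outliers = three_sigma_outliers(logins)
print(outliers)
```

The rule only makes sense if the attribute is roughly normally distributed, which is exactly the assumption the background questions.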
Disclosure of Invention
In order to overcome the defects of the prior art, the method, when dealing with network security, considers not only the traffic generated by a user but also extracts features such as the user's natural attributes and operation habits from multiple dimensions and builds an index system from them. It uses semantic mining to extract deeper information, helping administrators understand users and their security situation quantitatively.
1. Data is collected from multiple data sources; besides traffic and logs, the natural attributes of users are collected.
2. In data preprocessing, besides the traditional statistical description, semantic extraction is performed on unstructured data to obtain a vectorized representation.
3. The user portrait is modeled by combining machine learning algorithms such as clustering, group division and recommendation.
The technical scheme of the invention has the following beneficial effects:
The invention collects information from multiple dimensions and describes user behavior more comprehensively; deep mining of text data describes user behavior more accurately, and correlation analysis reduces the false alarm rate.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the invention, and a person skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of the present invention.
Detailed Description
First, data acquisition
Data for a user is obtained from multiple data sources. The structured data includes the user's demographic attributes (such as age, gender, place of origin, department and education level) and job characteristics (such as working hours, workload, years of service, shift times and working time periods). The semi-structured data includes system operations (such as permission compliance, commonly used functions, operation frequency, operation time periods and number of valid operations) and sensitive operations (such as unauthorized operations, operations not performed by the account owner and data leakage). The unstructured data is mainly the actual content of the traffic, in formats including text, pictures, audio and video.
Second, data preprocessing
Preprocessing mainly handles structured and semi-structured data. It proceeds in two main directions: statistical description and vectorized description.
In machine learning, non-numeric data that is not otherwise converted can only be one-hot encoded, which carries little practical meaning. Keyword extraction, in turn, requires the extracted keywords to be classified again to reduce the number of categories, which is cumbersome and still gives no direct mathematical representation of the data.
To address this problem, doc2vec is used to extract a vectorized representation of documents from unstructured text data. Doc2vec differs slightly from word2vec: when predicting the next word, the doc vector and the word vectors are concatenated to make the prediction. After training, each word is mapped to a unique vector, and each document is also mapped to a unique vector. This doc vector can then be used in subsequent machine learning tasks and, like a word vector, it can measure the similarity between two different texts to some extent.
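The contrast between one-hot codes and dense document vectors can be made concrete with cosine similarity. The sketch below does not train doc2vec; the dense vectors are invented stand-ins for what such a model might produce, and all names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def one_hot(index, size):
    """One-hot code: a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# One-hot codes for three hypothetical document categories.
report, invoice, memo = (one_hot(i, 3) for i in range(3))
# Every distinct pair is orthogonal, so no pair looks more alike than another.
one_hot_sims = [cosine_similarity(report, invoice), cosine_similarity(report, memo)]

# Invented dense vectors standing in for doc2vec output.
doc_vectors = {
    "report_q1": [0.9, 0.1, 0.3],
    "report_q2": [0.8, 0.2, 0.4],
    "holiday_memo": [-0.2, 0.9, -0.5],
}
sim_same = cosine_similarity(doc_vectors["report_q1"], doc_vectors["report_q2"])
sim_diff = cosine_similarity(doc_vectors["report_q1"], doc_vectors["holiday_memo"])
print(one_hot_sims, round(sim_same, 3), round(sim_diff, 3))
```

With dense vectors the two reports score close to 1 while the memo scores below 0, which is the graded similarity that one-hot codes cannot express.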
Third, establishing the user model
1) Application behavior model of the user
Users are divided into groups according to their behavior, and the behavior of all users in a group serves as the behavior baseline for each member of that group.
The steps are as follows:
① Use HTTP protocol data, splitting the IP and URL to identify users, applications and actions.
② Add external conditions, such as time slices, to define the application behavior.
③ Calculate the user behavior matrix (UV matrix).
④ Based on the UV matrix, use a clustering method (K-means, Gaussian mixture clustering, density-based clustering) to find groups of users with similar behavior.
⑤ Count the application behavior types and the number of application behaviors of all members of the same group.
⑥ Take the statistics from the previous step as the behavior baseline of that group of users.
⑦ For example, a group behavior baseline may be "approval operation (3 times) in the application (OA) accessed in the time period (10-11)"; when a user in the group performs the approval operation 100 times under the same conditions, an anomaly is reported.
Data to be detected:
[Figure BDA0001693548540000032: data to be detected (image not reproduced)]
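The application behavior steps can be sketched roughly as follows. Several things here are assumptions made for illustration only: the URL layout "/application/action", the IP-to-user mapping, the use of the per-behavior maximum as the group statistic, and the tolerance factor. The user group is taken as given, as if the clustering step had already produced it.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical mapping from client IP to user.
IP_TO_USER = {"10.0.0.1": "alice", "10.0.0.2": "bob"}

# Hypothetical HTTP records: (client IP, requested URL, hour of day).
records = (
    [("10.0.0.1", "http://intranet.example/oa/approve", 10)] * 3
    + [("10.0.0.2", "http://intranet.example/oa/approve", 10)] * 2
    + [("10.0.0.2", "http://intranet.example/oa/search", 14)]
)

def behavior_key(url, hour):
    """Split the URL into (application, action) and add the time slice."""
    application, action = urlparse(url).path.strip("/").split("/")[:2]
    return (application, action, hour)

def build_uv_matrix(rows):
    """Count how often each user performs each application behavior."""
    matrix = defaultdict(Counter)
    for ip, url, hour in rows:
        matrix[IP_TO_USER[ip]][behavior_key(url, hour)] += 1
    return matrix

def group_baseline(matrix, group):
    """Per behavior, the highest count seen among the group's members."""
    baseline = Counter()
    for user in group:
        for behavior, count in matrix[user].items():
            baseline[behavior] = max(baseline[behavior], count)
    return baseline

def is_anomalous(baseline, behavior, observed, tolerance=3.0):
    """Flag counts far above the baseline, or behaviors the group never showed."""
    return observed > tolerance * baseline.get(behavior, 0)

uv = build_uv_matrix(records)
baseline = group_baseline(uv, ["alice", "bob"])  # group assumed from clustering
approve = ("oa", "approve", 10)
print(baseline[approve], is_anomalous(baseline, approve, 100))
```

With a baseline of 3 approvals in the 10-11 slot, 100 approvals is reported while 4 is not; where to draw that line is a tuning decision the text leaves open.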
2) Data behavior model of the user
Taking each user as the object, calculate the user's data behavior baseline for each application.
The training steps are as follows:
① For a given user, collect the files involved in each application the user uses.
② Extract the content information of the files involved in a specific application for a specific user.
③ Segment the content information into words and extract the doc vector with doc2vec.
④ Using an LDA topic model with a built-in keyword library, perform content classification (topic) and sensitive word extraction (keyword). This yields two results: a "user - application - sensitive word (secondary right) - label (primary right)" table and a "label (primary right) - sensitive word (secondary right)" table.
⑤ Based on the doc vectors, use a clustering method (K-means, Gaussian mixture clustering, density-based clustering) to divide the data into groups of similar data, then build a baseline of the user's data usage behavior from how the user uses each data group (count, frequency, time period, etc.).
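The last training step, clustering doc vectors into data groups and counting a user's use of each group, might look like the sketch below. It uses a plain K-means written out by hand with deterministic initialization; the two-dimensional vectors and the access list are invented, and real doc vectors would have far more dimensions.

```python
import math
from collections import Counter

def kmeans(points, k, iterations=20):
    """Plain K-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels

# Invented doc vectors for six files (index = file id).
file_vectors = [
    [0.9, 0.1], [0.1, 0.9], [0.8, 0.2],
    [0.2, 0.8], [1.0, 0.0], [0.0, 1.0],
]
labels = kmeans(file_vectors, k=2)

# Hypothetical access log: ids of the files one user opened.
accesses = [0, 2, 2, 4, 1]
# Baseline: how often the user touches each data group.
usage_baseline = Counter(labels[file_id] for file_id in accesses)
print(labels, dict(usage_baseline))
```

Frequency and time-period baselines would follow the same pattern with a richer key than the group label alone.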
Fourth, prediction based on the user model
1. Predicting user behavior
1) Extracting features that describe the behavior
Obtain the response text returned by the accessed host, segment it into words, and use the LDA topic model to extract keywords or topics as the features of the behavior.
2) Learning the user's preference features and recommending preferred behaviors (HOST)
① Classifier mode
A classifier is built from the feature data of the HOSTs the user prefers (visited) and does not prefer (not visited). When a new batch of HOSTs appears, the classifier can identify whether the user is interested in each of them.
② Nearest-neighbor similarity mode
For a batch of new behaviors (HOSTs) to be examined, extract feature vectors for them and for the HOSTs the user visited in the past, use Euclidean or cosine distance to find the K behaviors in the user's access history that are most similar to the new behaviors, and take the behavior with the smallest distance (the most similar one) as the recommendation result.
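A minimal sketch of the nearest-neighbor mode, using cosine distance: the host names and feature vectors are invented and stand in for the LDA-derived features described above.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def k_nearest(history, candidate, k):
    """Return the k previously visited hosts closest to the candidate, nearest first."""
    ranked = sorted(history, key=lambda host: cosine_distance(history[host], candidate))
    return ranked[:k]

# Hypothetical feature vectors of hosts the user visited in the past.
history = {
    "mail.example": [0.9, 0.1, 0.0],
    "wiki.example": [0.1, 0.9, 0.1],
    "hr.example": [0.7, 0.3, 0.1],
}
# Feature vector of a new host to be examined.
new_host = [0.8, 0.2, 0.0]

neighbors = k_nearest(history, new_host, k=2)
print(neighbors)
```

Swapping `cosine_distance` for `math.dist` gives the Euclidean variant mentioned in the text.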
2. Predicting the user's data usage behavior
1) Extracting historical behavior features
Extract doc vectors from the data the user has used historically and cluster them to obtain information such as the categories, frequency and time periods of frequently used data.
2) Based on this information, use a clustering method (K-means, Gaussian mixture clustering, density-based clustering, etc.) to find users whose data behavior is similar to this user's.
3) Within this group of similar users, find a category of data that this user has not used but that is likely to be used in the group. The user is then considered reasonably likely to use that category of data, so no anomaly is reported in such cases.
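The third step can be illustrated as follows. The group of similar users is taken as given, and "likely to be used in the group" is interpreted here, as an assumption, as "used by at least half of the peers"; user names and data categories are invented.

```python
def likely_categories(user, peers, usage, min_share=0.5):
    """Categories the user has not used but at least min_share of the peers have."""
    used = usage[user]
    counts = {}
    for peer in peers:
        for category in usage[peer] - used:
            counts[category] = counts.get(category, 0) + 1
    return {c for c, n in counts.items() if n / len(peers) >= min_share}

def should_report(user, category, expected, usage):
    """Report only categories that are neither used before nor expected from peers."""
    return category not in usage[user] and category not in expected

# Hypothetical data categories used by each user.
usage = {
    "alice": {"contracts", "invoices"},
    "bob": {"contracts", "invoices", "payroll"},
    "carol": {"contracts", "payroll"},
    "dave": {"contracts", "source_code"},
}
peers = ["bob", "carol", "dave"]  # users similar to alice, assumed from clustering

expected = likely_categories("alice", peers, usage)
print(expected, should_report("alice", "source_code", expected, usage))
```

Here "payroll" is used by two of three peers, so alice opening payroll data is not reported, while "source_code" still is.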
3. Detecting abnormal user data
To detect abnormal data usage behavior of a user, the steps are as follows:
① Obtain the files the user operated on in the application.
② Extract the content information.
③ Segment the content information into words.
④ Match the word segmentation result against the "label (primary right) - sensitive word (secondary right)" table to find the probability of each label.
⑤ Match the resulting label list against the user's label list to determine the degree of coincidence.
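The matching steps could be sketched like this. The label table, the whitespace tokenizer (Chinese text would need a real word segmenter) and the definition of coincidence as covered probability mass are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical "label - sensitive word" table.
LABEL_WORDS = {
    "finance": {"invoice", "salary", "budget"},
    "hr": {"resume", "salary", "contract"},
    "rnd": {"source", "design", "patent"},
}

def label_probabilities(tokens):
    """Share of sensitive-word hits that fall under each label."""
    hits = Counter()
    for token in tokens:
        for label, words in LABEL_WORDS.items():
            if token in words:
                hits[label] += 1
    total = sum(hits.values())
    return {label: n / total for label, n in hits.items()} if total else {}

def coincidence(file_labels, user_labels):
    """Probability mass of the file's labels covered by the user's label list."""
    return sum(p for label, p in file_labels.items() if label in user_labels)

# Whitespace split stands in for word segmentation.
tokens = "the design patent and source archive with one invoice".split()
file_labels = label_probabilities(tokens)

user_labels = {"finance", "hr"}  # labels this user normally works with
overlap = coincidence(file_labels, user_labels)
print(file_labels, overlap)
```

A low overlap, such as 0.25 here, suggests the file's content lies mostly outside the user's usual labels; the cut-off for reporting an anomaly is not specified in the text.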
The user portrait method applied to the network security field provided by the embodiment of the invention has been described in detail above. A specific example is used herein to explain the principle and implementation of the invention, and the description of the above embodiment is intended only to help in understanding the method and its core idea. For a person skilled in the art, the specific embodiments and the scope of application may vary according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims (3)

1. A user portrait method applied to the network security field, wherein an evaluation index system is extracted and established from multiple dimensions of user characteristics in the security field, such as natural attributes, operation habits and traffic usage.
2. The user portrait method as claimed in claim 1, wherein an LDA algorithm is used to perform semantic mining on text data in the traffic and to extract topics and corresponding keywords, so as to further classify the data.
3. The user portrait method applied to the network security field as claimed in claim 1, wherein potential association relationships among users are discovered through group division and recommendation algorithms based on the users' traffic baselines, data usage baselines and natural attributes, and potential additional features of the users are mined.
CN201810602323.6A 2018-06-12 2018-06-12 User portrait method applied to network security field Pending CN110674288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810602323.6A CN110674288A (en) 2018-06-12 2018-06-12 User portrait method applied to network security field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810602323.6A CN110674288A (en) 2018-06-12 2018-06-12 User portrait method applied to network security field

Publications (1)

Publication Number Publication Date
CN110674288A true CN110674288A (en) 2020-01-10

Family

ID=69065913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810602323.6A Pending CN110674288A (en) 2018-06-12 2018-06-12 User portrait method applied to network security field

Country Status (1)

Country Link
CN (1) CN110674288A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259948A (en) * 2020-01-13 2020-06-09 中孚安全技术有限公司 User safety behavior baseline analysis method based on fusion machine learning algorithm
CN111694952A (en) * 2020-04-16 2020-09-22 国家计算机网络与信息安全管理中心 Big data analysis model system based on microblog and implementation method thereof
CN115242466A (en) * 2022-07-04 2022-10-25 北京华圣龙源科技有限公司 Intrusion active trapping system and method based on high-simulation virtual environment

Similar Documents

Publication Publication Date Title
CN110516067B (en) Public opinion monitoring method, system and storage medium based on topic detection
WO2019227710A1 (en) Network public opinion analysis method and apparatus, and computer-readable storage medium
de Oliveira et al. A sensitive stylistic approach to identify fake news on social networking
Stein et al. Intrinsic plagiarism analysis
Wang et al. Word clustering based on POS feature for efficient twitter sentiment analysis
Qian et al. Identifying multiple userids of the same author
CN108021651B (en) Network public opinion risk assessment method and device
Cerón-Guzmán et al. A sentiment analysis system of Spanish tweets and its application in Colombia 2014 presidential election
CN110413787B (en) Text clustering method, device, terminal and storage medium
CN109165529B (en) Dark chain tampering detection method and device and computer readable storage medium
CN110674288A (en) User portrait method applied to network security field
CN113076735A (en) Target information acquisition method and device and server
KR20210148574A (en) Systems and methods for analyzing the public data of SNS user channel and providing influence report
CN113282754A (en) Public opinion detection method, device, equipment and storage medium for news events
KR20210148573A (en) Systems and methods for gathering public data of SNS user channel and providing influence reports based on the collected public data
Prasad et al. An effective assessment of cluster tendency through sampling based multi-viewpoints visual method
Chua et al. Problem Understanding of Fake News Detection from a Data Mining Perspective
CN108519993A (en) The social networks focus incident detection method calculated based on multiple data stream
WO2023093116A1 (en) Method and apparatus for determining industrial chain node of enterprise, and terminal and storage medium
CN111222032A (en) Public opinion analysis method and related equipment
Al-Dyani et al. Challenges of event detection from social media streams
Rawat et al. Media bias detection using sentimental analysis and clustering algorithms
CN113691525A (en) Traffic data processing method, device, equipment and storage medium
CN114528908A (en) Network request data classification model training method, classification method and storage medium
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination