CN112765660A

CN112765660A - Terminal security analysis method and system based on MapReduce parallel clustering technology

Info

Publication number: CN112765660A
Application number: CN202110092700.8A
Authority: CN
Inventors: 李肯立; 李金娜; 杨志邦; 于思洋; 刘楚波; 唐卓; 肖国庆; 段明星; 阳王东; 李克勤
Original assignee: Hunan Kuangan Network Technology Co ltd; Hunan University
Current assignee: Hunan Kuangan Network Technology Co ltd
Priority date: 2021-01-25
Filing date: 2021-01-25
Publication date: 2021-05-07

Abstract

The invention discloses a terminal security analysis method based on a MapReduce parallel clustering technology, comprising the following steps: acquiring log data from a terminal and processing the log data with a natural language processing library to obtain a plurality of word segments; filtering the obtained word segments to obtain a plurality of filtered word segments; extracting the features of each filtered word segment using a TF-IDF algorithm, wherein all the features form a log vector X corresponding to the log data; and calculating the Euclidean distance between the log vector and each of K preset clustering centers, acquiring the clustering center with the minimum Euclidean distance, and determining the final security level of the terminal according to that clustering center. The method reduces the interference of noisy logs and addresses the problems that existing terminal security assessment has high labor cost and low speed, that its classification results depend on the experience of individual technicians, and that traditional terminal security classification methods are inaccurate.

Description

Terminal security analysis method and system based on MapReduce parallel clustering technology

Technical Field

The invention belongs to the technical field of information security, and particularly relates to a terminal security analysis method and system based on a MapReduce parallel clustering technology.

Background

With the rapid development of computer network technology, all sectors of society depend increasingly on the network, intrusions by viruses, spyware and hackers are increasing, and network and information security problems are becoming more prominent. However, the conventional security defense concept is limited to defense at the gateway level and the network boundary (firewalls, virus prevention, vulnerability scanning), and important security devices are concentrated mainly at the entrance of the machine room or the network. Under the monitoring of these devices, security threats from outside the network are greatly reduced; conversely, security threats from internal computer terminals have become a common problem. After years of network security construction, the focus of network security has shifted from protecting the core and backbone networks to managing the security of internal terminals. Terminal security has become the part of network security management with the largest workload, and risk assessment of terminal security is a key link in ensuring network security. To protect the network from threats, the terminal security condition of the network is analyzed and assessed objectively and effectively according to the log indicators of the terminal, and the security policy is improved and a risk handling plan is made according to the assessment result, so as to protect network information security to the greatest extent.

Traditional network terminal security assessment mainly uses penetration testing technology and relies chiefly on host-based scanning tools: it monitors security vulnerabilities of the network terminal system, derives an assessment value of the terminal's security condition from the vulnerability information obtained, and then analyzes the security of the terminal by a qualitative method, a quantitative method, or a combination of the two.

However, the above network terminal security evaluation method based on penetration testing technology still has non-negligible defects: firstly, because terminal security involves multiple layers and multiple factors and is both uncertain and complex, some traditional quantitative and qualitative analysis methods contradict one another, so that indicators cannot be compared directly; it is therefore difficult to evaluate terminal security comprehensively, and the accuracy of the resulting terminal security evaluation is low; secondly, because the traditional method depends on an existing security vulnerability library that is not updated in a timely manner, new network risks cannot be identified quickly; thirdly, because the traditional method relies mainly on evaluators scoring according to their own experience and knowledge, and requires professional equipment and tools and a certain amount of time, it incurs large labor and material costs and cannot reflect terminal security in a timely and efficient manner.

Disclosure of Invention

The invention provides a terminal security analysis method and system based on a MapReduce parallel clustering technology, aiming to solve the following technical problems of existing network terminal security evaluation methods based on penetration testing technology: the accuracy of the final terminal security evaluation result is low because terminal security cannot be evaluated comprehensively; new network risks cannot be identified quickly because the methods rely on an existing security vulnerability library that is not updated in a timely manner; and the evaluation is time-consuming and labor-intensive and cannot reflect terminal security in a timely and efficient manner because it relies on professional equipment and tools.

In order to achieve the above object, according to an aspect of the present invention, there is provided a terminal security analysis method based on a MapReduce parallel clustering technique, including the steps of:

(1) acquiring log data from a terminal, and processing the log data by using a natural language processing library to obtain a plurality of word segments;

(2) filtering the word segments obtained in step (1) to obtain a plurality of filtered word segments;

(3) extracting the features of each word segment filtered in step (2) by using a term frequency-inverse document frequency (TF-IDF) algorithm, wherein all the features form a log vector X corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, $x_m$ represents the m-th of all the features, num represents the total number of features extracted from all the word segments, and $m \in [1, num]$;

(4) calculating the Euclidean distance between the log vector obtained in step (3) and each of K preset clustering centers, acquiring the clustering center corresponding to the minimum of all the Euclidean distances, and determining the final security level of the terminal according to that clustering center.

Preferably, the log data in step (1) is obtained from the computer by calling a Syslog interface;

the log data comprises the program module, severity, process name, generation time, log content and the like;

the severity includes error, information, warning, critical, and the like;

the program module includes the kernel layer, user layer, mail system, authorization information, and the like.

Preferably, the K cluster centers are established by the following steps:

(4-1) acquiring a plurality of log data from a terminal, and processing each log data by using a natural language processing library to obtain a plurality of word segments corresponding to the log data;

(4-2) for each log data, filtering the word segments corresponding to the log data to obtain a plurality of filtered word segments;

(4-3) for each log data, processing the word segments filtered in step (4-2) by using a TF-IDF algorithm to obtain a log vector corresponding to the log data;

(4-4) selecting K log vectors from all log vectors corresponding to all log data as clustering centers, and putting the clustering centers into a global variable set (which is initially empty);

(4-5) for each log vector in the log vector set, calculating Euclidean distances from the log vector to each clustering center in the global variable set by using a MapReduce model, and establishing a key-value pair by using the clustering center corresponding to the minimum value in all the Euclidean distances as a key and the log vector as a value;

(4-6) for each clustering center that serves as a key in the key-value pairs established in step (4-5), forming a set from all the values (log vectors) corresponding to that key, and calculating the average of all the log vectors in the set as the updated clustering center;

(4-7) judging whether the set formed by all the updated clustering centers is completely the same as the global variable set obtained in the step (4-4), if so, entering the step (4-8), otherwise, replacing the global variable set by the set formed by K updated clustering centers, and returning to the step (4-5);

(4-8) judging whether new log data are acquired from the terminal, if so, returning to the step (4-2), and if not, ending the process.

Preferably, the process of selecting K log vectors from the set of log vectors comprises the sub-steps of:

(4-4-1) calculating the average Euclidean distance $d_{avg}$ over all log vectors in the log vector set;

(4-4-2) calculating the point density of each log vector in the log vector set according to the average Euclidean distance $d_{avg}$ obtained in step (4-4-1);

(4-4-3) for each log vector in the log vector set, obtaining the shortest distance corresponding to the log vector according to the point density of the log vector obtained in step (4-4-2);

(4-4-4) obtaining [formula not reproduced] and taking the K log vectors with the largest values as the clustering centers.

Preferably, the point density of each log vector is equal to:

[formula not reproduced]

wherein $d(X_i, X_j)$ represents the Euclidean distance between log vector $X_i$ and log vector $X_j$, i and j both belong to [1, n], and n represents the total number of log vectors in the log vector set;

preferably, for a log vector $X_i$, if its point density is the maximum of the point densities of all the log vectors, the shortest distance corresponding to the log vector is equal to the maximum of the Euclidean distances from $X_i$ to every other log vector in the log vector set;

for a log vector $X_i$, if its point density is not the maximum of the point densities of all the log vectors, the shortest distance corresponding to the log vector is equal to the minimum of the Euclidean distances from $X_i$ to each log vector whose point density is greater than that of $X_i$.

According to another aspect of the present invention, there is provided a terminal security analysis system based on MapReduce parallel clustering technology, including:

a first module, configured to acquire log data from a terminal and process the log data by using a natural language processing library to obtain a plurality of word segments;

a second module, configured to filter the word segments obtained by the first module to obtain a plurality of filtered word segments;

a third module, configured to extract, by using a term frequency-inverse document frequency (TF-IDF) algorithm, the features of each word segment filtered by the second module, wherein all the features form a log vector X corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, $x_m$ represents the m-th of all the features, num represents the total number of features extracted from all the word segments, and $m \in [1, num]$;

and a fourth module, configured to calculate the Euclidean distance between the log vector obtained by the third module and each of K preset clustering centers, acquire the clustering center corresponding to the minimum of all the Euclidean distances, and determine the final security level of the terminal according to that clustering center.

In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:

(1) because the steps (1) to (4) are adopted, the terminal safety is classified by adopting the clustering technology in combination with the self log characteristics of the terminal, and the technical problem of low accuracy of the terminal safety evaluation result caused by undefined quantitative and qualitative methods in the existing network terminal safety evaluation method based on the penetration test technology can be solved;

(2) because the invention adopts the steps (4-2) to (4-8), the clustering center is updated according to the new log data acquired from the terminal, so the technical problem that the new network risk can not be rapidly identified due to the fact that the security vulnerability library is not updated timely in the existing network terminal security assessment method based on the penetration test technology can be solved;

(3) because the steps (1) to (4) are adopted, the evaluation process is automatically carried out without human intervention, and the technical problems that in the existing network terminal safety evaluation method based on the penetration test technology, an evaluator needs to grade according to own experience knowledge, the evaluation method depends on professional equipment and tools and consumes certain time, a large amount of manpower and material resources cost is consumed, and the terminal safety cannot be reflected timely and efficiently can be solved;

(4) because the steps (4-4) to (4-7) are adopted, the selected value of the initial clustering center is optimized, and a MapReduce parallel programming model is adopted, the convergence speed and the running speed of a clustering algorithm are increased, the efficiency and the performance are improved, and the time cost is saved;

(5) the invention runs fast, is efficient and saves cost, provides an objective and efficient reference for security classification in real environments, and helps improve the security of network space.

Drawings

Fig. 1 is a flowchart of a terminal security analysis method based on a MapReduce parallel clustering technique according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

The basic idea of the invention is as follows: log data is collected from a terminal; text preprocessing such as word segmentation and filtering is performed on the log data; the log data of the terminal is represented as vectors using a term frequency-inverse document frequency (TF-IDF) algorithm; the log data is clustered in parallel using the MapReduce parallel technology, which speeds up the clustering; and the security of the terminal is judged by comparing the Euclidean distances between a newly collected log vector and the clustering centers.

As shown in fig. 1, the present invention provides a terminal security analysis method based on MapReduce parallel clustering technology, which includes the following steps:

specifically, the natural language processing library used in this step may be, for example, jieba, a third-party Python library for Chinese word segmentation.

In this step, the time interval for acquiring log data is 1 minute to 10 minutes, preferably 5 minutes;

specifically, the terminal in the present invention refers to a computer (which may be a personal computer or a notebook computer), and the obtaining of the log data in this step is to obtain the log data of the computer by calling a Syslog interface.

The log data comprises the program module, severity, process name, generation time, log content and the like, wherein the severity mainly includes error, information, warning, critical and the like, and the program module includes the kernel layer, user layer, mail system, authorization information and the like; in theory, the log data of each time period corresponds to one complete physical behavior.
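As an illustration of these fields, the following minimal sketch parses one BSD-style syslog line into the program module, severity, process name, generation time and log content. The line layout, the `parse_syslog_line` helper and the name tables are assumptions made for this example; the patent itself only states that the data is obtained through a Syslog interface. The facility and severity numbering follows the usual syslog convention, in which the priority value equals facility × 8 + severity.

```python
import re

# Conventional syslog facility codes (only the ones named in the text, plus daemon).
FACILITIES = {0: "kernel", 1: "user", 2: "mail", 3: "daemon", 4: "auth"}
# Conventional syslog severity codes 0..7.
SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "info", "debug"]

# Assumed layout: <PRI>Mmm dd hh:mm:ss HOST PROCESS[PID]: CONTENT
LINE_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<time>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<process>[^\[:\s]+)(?:\[\d+\])?: "
    r"(?P<content>.*)$"
)


def parse_syslog_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    # The priority value encodes both facility and severity.
    facility, severity = divmod(int(match.group("pri")), 8)
    return {
        "module": FACILITIES.get(facility, f"facility-{facility}"),
        "severity": SEVERITIES[severity],
        "process": match.group("process"),
        "time": match.group("time"),
        "content": match.group("content"),
    }
```

For a line with priority 34, the sketch yields facility 4 (authorization information) and severity 2 (critical).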

specifically, the filtering process in this step includes, but is not limited to: deleting stop words (such as the auxiliary words 's' and 'o'), special symbols (such as punctuation marks and mathematical operators), and link addresses (such as advertisement links) that are irrelevant to the security judgment;

the advantage of this step is that the filtering removes word segments irrelevant to the terminal security judgment and reduces their interference with the accuracy of the judgment result.
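The filtering described here can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a small hypothetical sample, and a regular-expression tokenizer stands in for the natural language processing library (such as jieba) so that the example needs nothing beyond the Python standard library.

```python
import re

# Hypothetical stop-word sample; a real deployment would load a fuller list.
STOP_WORDS = {"the", "a", "an", "of", "for", "on", "to", "is"}
# Link addresses are removed before tokenizing.
URL_RE = re.compile(r"https?://\S+")
# Stand-in tokenizer: dotted alphanumeric runs, or any single other symbol.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+(?:\.[A-Za-z0-9_]+)*|\S")


def tokenize(text):
    """Split text into word-like tokens and single symbols."""
    return TOKEN_RE.findall(text)


def filter_tokens(text):
    """Drop link addresses, stop words and special symbols from a log message."""
    text = URL_RE.sub(" ", text)
    kept = []
    for token in tokenize(text.lower()):
        if token in STOP_WORDS:
            continue
        if not any(ch.isalnum() for ch in token):
            continue  # punctuation, operators and other special symbols
        kept.append(token)
    return kept
```

Calling `filter_tokens` on a message that contains an advertisement link and punctuation returns only the remaining lower-cased words and addresses.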

(3) Extracting the features of each word segment filtered in step (2) by using a term frequency-inverse document frequency (TF-IDF) algorithm, wherein all the features form a log vector X corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, $x_m$ represents the m-th of all the features, num represents the total number of features extracted from all the word segments, and $m \in [1, num]$;
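A minimal sketch of how filtered word segments could be turned into log vectors is given below. It assumes one common TF-IDF variant (term frequency as count divided by document length, inverse document frequency as the natural logarithm of N / df); the patent does not state which variant it uses, and the function names are hypothetical. Each piece of log data is treated as one document.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of all distinct word segments; its length plays the role of num."""
    return sorted({token for doc in docs for token in doc})


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) where vectors[i] is the log vector of docs[i]."""
    vocab = build_vocabulary(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each word segment occurs.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log(n_docs / df[token]) for token in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[token] / total * idf[token] for token in vocab])
    return vocab, vectors
```

With this variant, a word segment that appears in every document receives weight zero, so only segments that distinguish one log from another contribute to the vector.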

The steps (1) to (4) have the advantages that the terminal safety is evaluated by adopting the clustering result by combining the log characteristics of the terminal, so that the accuracy of the terminal safety evaluation is guaranteed; in addition, by adopting a MapReduce parallel programming model, the running speed of a clustering algorithm is increased, and the efficiency and the performance of the clustering algorithm are improved;

specifically, K is a natural number greater than or equal to 2; the larger the value of K, the more levels are used for the security judgment, and the smaller the value of K, the fewer levels are used.

For example, with K = 10, if the 2nd clustering center has the minimum Euclidean distance in this step, the final security level of the terminal is considered to be the lowest; if the 10th clustering center has the minimum Euclidean distance, the final security level of the terminal is considered to be the highest; and if the 5th clustering center has the minimum Euclidean distance, the final security level of the terminal is considered to be medium.
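The nearest-center decision can be sketched as below. The sketch assumes that the K clustering centers are stored in a list ordered from the lowest to the highest security level, so that the 1-based position of the nearest center can be read as the level; this ordering and the function names are assumptions for illustration, not something the text specifies.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(vector, centers):
    """Index of the clustering center with the minimum Euclidean distance."""
    distances = [euclidean(vector, center) for center in centers]
    return min(range(len(centers)), key=distances.__getitem__)


def security_level(vector, centers):
    """1-based level, assuming centers are ordered from lowest to highest security."""
    return nearest_center(vector, centers) + 1
```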

The K clustering centers in the step are obtained by establishing the following steps:

the process of this step is identical to that of step (1), and is not described herein again.

the process of this step is identical to that of the step (2), and is not described herein again.

specifically, the process of selecting K log vectors from the log vector set in this step specifically includes the following substeps:

wherein $d(X_i, X_j)$ represents the Euclidean distance between log vector $X_i$ and log vector $X_j$, i and j both belong to [1, n], and n represents the total number of log vectors in the log vector set;

specifically, for a log vector $X_i$, if its point density is the maximum of the point densities of all the log vectors, the shortest distance corresponding to the log vector is equal to the maximum of the Euclidean distances from $X_i$ to every other log vector in the log vector set; for a log vector $X_i$, if its point density is not the maximum of the point densities of all the log vectors, the shortest distance corresponding to the log vector is equal to the minimum of the Euclidean distances from $X_i$ to each log vector whose point density is greater than that of $X_i$;

(4-4-4) obtaining [formula not reproduced] and taking the K log vectors with the largest values as the clustering centers;

the advantage of the above substeps (4-4-1) to (4-4-4) is that they optimize the selection of the initial clustering centers and accelerate the convergence of the clustering algorithm.
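The following sketch shows one possible reading of substeps (4-4-1) to (4-4-4). Two parts are assumptions, because the corresponding formulas are not available in this text: the point density is taken as a Gaussian-kernel sum scaled by the average distance, and the ranking value in (4-4-4) is taken as the product of point density and shortest distance, as in density-peak style clustering. Only the shortest-distance rule follows the wording of the description directly. The function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_initial_centers(vectors, k):
    """Pick k initial centers; needs at least two distinct vectors."""
    n = len(vectors)
    dist = [[euclidean(a, b) for b in vectors] for a in vectors]

    # (4-4-1) average Euclidean distance over all pairs of log vectors.
    pair_count = n * (n - 1) / 2
    d_avg = sum(dist[i][j] for i, j in combinations(range(n), 2)) / pair_count

    # (4-4-2) ASSUMPTION: Gaussian-kernel density scaled by d_avg.
    density = [
        sum(math.exp(-(dist[i][j] / d_avg) ** 2) for j in range(n) if j != i)
        for i in range(n)
    ]

    # (4-4-3) shortest distance: minimum distance to any denser vector, or the
    # maximum distance to any other vector when no denser vector exists.
    shortest = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if density[j] > density[i]]
        if denser:
            shortest.append(min(denser))
        else:
            shortest.append(max(dist[i][j] for j in range(n) if j != i))

    # (4-4-4) ASSUMPTION: rank by density * shortest distance, keep the top k.
    ranked = sorted(range(n), key=lambda i: density[i] * shortest[i], reverse=True)
    return [vectors[i] for i in ranked[:k]]
```

Under these assumptions, a vector scores highly only if it lies in a dense region and is far from any denser vector, which tends to place one initial center in each well-separated group.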

(4-5) for each log vector in the log vector set, calculating the Euclidean distance from the log vector to each clustering center in the global variable set by using a MapReduce model, and establishing a Key Value pair by using the clustering center corresponding to the minimum Value in all Euclidean distances as a Key (Key) and the log vector as a Value (Value);

(4-7) judging whether the set formed by all the updated clustering centers is completely the same as the global variable set obtained in the step (4-4), if so, entering the step (4-8), otherwise, replacing the global variable set by the set formed by the updated K clustering centers, and returning to the step (4-5).
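The iteration over steps (4-5) to (4-7) can be sketched in plain Python as a map phase, a reduce phase and a convergence check. This is a single-process illustration of the MapReduce pattern rather than a Hadoop job. As simplifications assumed here, the key is the index of the nearest clustering center instead of the center itself, a center with no assigned vectors is left unchanged, and `max_iter` is a hypothetical safety limit.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_phase(vectors, centers):
    """(4-5) emit (index of nearest center, log vector) key-value pairs."""
    for vector in vectors:
        key = min(range(len(centers)), key=lambda c: euclidean(vector, centers[c]))
        yield key, vector


def reduce_phase(pairs, centers):
    """(4-6) average the log vectors grouped under each key."""
    groups = defaultdict(list)
    for key, vector in pairs:
        groups[key].append(vector)
    updated = []
    for index, old_center in enumerate(centers):
        members = groups.get(index)
        if not members:
            updated.append(list(old_center))  # assumption: keep empty clusters as-is
        else:
            updated.append([sum(column) / len(members) for column in zip(*members)])
    return updated


def cluster(vectors, initial_centers, max_iter=100):
    """(4-7) repeat map and reduce until the centers no longer change."""
    centers = [list(center) for center in initial_centers]
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(vectors, centers), centers)
        if updated == centers:
            break
        centers = updated
    return centers
```

In a distributed setting the map phase would run in parallel over partitions of the log vectors, and each reducer would receive all values for one key.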

The advantage of this step is that the clustering centers are updated with the new log data acquired from the terminal, so that new network risks can be identified quickly, terminal security is reflected in a timely manner, and the security of network space is safeguarded;

it should be noted that, when returning to step (4-2) and executing the subsequent steps, all log data in step (4-4) should include all log data before executing step (4-8) and new log data obtained by (4-8); meanwhile, each log data in steps (4-2) and (4-3) refers to the new log data acquired in step (4-8).

In summary, the terminal security analysis method based on MapReduce parallel clustering provided by the invention performs text preprocessing, text representation, log clustering model generation and terminal security judgment on the terminal logs, and is necessary for judging the terminal security in the network and improving the network security in the actual network.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A terminal security analysis method based on a MapReduce parallel clustering technology is characterized by comprising the following steps:

(2) filtering the word segments obtained in step (1) to obtain a plurality of filtered word segments;

(3) extracting the features of each word segment filtered in step (2) by using a term frequency-inverse document frequency (TF-IDF) algorithm, wherein all the features form a log vector X corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, $x_m$ represents the m-th of all the features, num represents the total number of features extracted from all the word segments, and $m \in [1, num]$;

2. The MapReduce parallel clustering technology-based terminal security analysis method as set forth in claim 1,

the log data acquired in the step (1) is acquired by calling a Syslog interface;

the severity includes error, information, warning, critical, and the like;

3. The terminal security analysis method based on the MapReduce parallel clustering technology as claimed in claim 1 or 2, wherein the K clustering centers are obtained by the following steps:

(4-5) for each log vector in the log vector set, calculating the Euclidean distance from the log vector to each clustering center in the global variable set by using a MapReduce model, and establishing a key-value pair with the clustering center corresponding to the minimum of all the Euclidean distances as the key and the log vector as the value.

4. The MapReduce parallel clustering technology-based terminal security analysis method as claimed in any one of claims 1 to 3, wherein the process of selecting K log vectors from the log vector set comprises the following sub-steps:

(4-4-2) calculating the point density of each log vector in the log vector set according to the average Euclidean distance $d_{avg}$ obtained in step (4-4-1);

(4-4-4) obtaining

5. The MapReduce parallel clustering technology-based terminal security analysis method as set forth in claim 4, wherein the point density of each log vector is equal to:

wherein $d(X_i, X_j)$ represents the Euclidean distance between log vector $X_i$ and log vector $X_j$, i and j both belong to [1, n], and n represents the total number of log vectors in the log vector set.

6. The MapReduce parallel clustering technology-based terminal security analysis method as set forth in claim 5,

for a log vector $X_i$, if its point density is the maximum of the point densities of all the log vectors, the shortest distance corresponding to the log vector is equal to the maximum of the Euclidean distances from $X_i$ to every other log vector in the log vector set;

7. A terminal security analysis system based on a MapReduce parallel clustering technology is characterized by comprising:

a third module, configured to extract, by using a term frequency-inverse document frequency (TF-IDF) algorithm, the features of each word segment filtered by the second module, wherein all the features form a log vector X corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, $x_m$ represents the m-th of all the features, num represents the total number of features extracted from all the word segments, and $m \in [1, num]$;