CN112765660A - Terminal security analysis method and system based on MapReduce parallel clustering technology - Google Patents
Terminal security analysis method and system based on MapReduce parallel clustering technology
- Publication number
- CN112765660A (application number CN202110092700.8A)
- Authority
- CN
- China
- Prior art keywords
- log
- vector
- clustering
- log data
- terminal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Health & Medical Sciences (AREA)
- Computer Security & Cryptography (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Bioethics (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a terminal security analysis method based on a MapReduce parallel clustering technology, comprising the following steps: acquiring log data from a terminal, and processing the log data with a natural language processing library to obtain a plurality of word segments (tokens); filtering the word segments to obtain a plurality of filtered word segments; extracting features of each filtered word segment using the TF-IDF algorithm, wherein all the features form a log vector X corresponding to the log data; and calculating the Euclidean distance between the log vector and each of K preset clustering centers, acquiring the clustering center corresponding to the minimum of these Euclidean distances, and determining the final security level of the terminal according to that clustering center. The method reduces the interference caused by noisy logs and addresses the problems that existing terminal security judgment has a high labor cost and is slow, that classification results are influenced by the personal experience of different technicians, and that traditional terminal security classification methods are inaccurate.
Description
Technical Field
The invention belongs to the technical field of information security, and particularly relates to a terminal security analysis method and system based on a MapReduce parallel clustering technology.
Background
With the rapid development of computer network technology, society depends more and more on networks, intrusion events involving viruses, spyware and hackers are increasing, and network and information security problems are becoming more prominent. However, the conventional security defense concept is limited to defense at the gateway level and the network boundary (firewalls, virus prevention, vulnerability scanning), and the important security protection devices are mainly concentrated at the entrance of the machine room or the network. Under the monitoring of these devices, security threats from outside the network are greatly reduced; conversely, security threats from internal computer terminals have become a common problem. After years of network security construction, the focus of network security has shifted from protecting the core network and the backbone network to the security management of internal terminals. Terminal security has become the part of network security management with the largest workload, and risk assessment of terminal security is a key link in ensuring network security. To protect the network from threats, the security condition of the terminals in the network should be analyzed and assessed objectively and effectively according to the log indexes of the terminals, and the security strategy should be refined and a risk handling plan made according to the assessment result, so as to protect network information security to the greatest extent.
Traditional network terminal security assessment mainly uses penetration testing technology, which relies chiefly on host-based scanning tools: it monitors the security vulnerabilities of the network terminal system, derives an assessment value of the terminal's security condition from the vulnerability information obtained, and then analyzes the security of the terminal by a qualitative method, a quantitative method, or a combination of the two.
However, the above network terminal security assessment method based on penetration testing technology still has non-negligible defects. First, because terminal security involves multiple layers and multiple factors and is both uncertain and complex, some traditional quantitative and qualitative analysis methods contradict each other, so that indexes cannot be compared directly; the method therefore struggles to assess terminal security comprehensively, and the accuracy of the resulting terminal security assessment is low. Second, because the traditional method depends on an existing security vulnerability database that is not updated promptly enough, new network risks cannot be identified quickly. Third, because the traditional method mainly depends on evaluators grading according to their own experience and knowledge, and also depends on professional equipment and tools and takes a certain amount of time, it requires a large amount of manpower and material resources and cannot reflect terminal security in a timely and efficient manner.
Disclosure of Invention
In view of the above defects of the prior art, the invention provides a terminal security analysis method and system based on a MapReduce parallel clustering technology. It aims to solve the following technical problems of the existing network terminal security assessment method based on penetration testing technology: the accuracy of the final terminal security assessment is low because terminal security cannot be assessed comprehensively; new network risks cannot be identified quickly because the method relies on an existing security vulnerability database that is not updated in time; and the method is time-consuming and labor-intensive and cannot reflect terminal security in a timely and efficient manner because it depends on professional equipment and tools.
In order to achieve the above object, according to an aspect of the present invention, there is provided a terminal security analysis method based on a MapReduce parallel clustering technique, including the steps of:
(1) acquiring log data from a terminal, and processing the log data by using a natural language processing library to obtain a plurality of word segments;
(2) filtering the word segments obtained in step (1) to obtain a plurality of filtered word segments;
(3) extracting features of each word segment filtered in step (2) by using the term frequency-inverse document frequency (TF-IDF) algorithm, wherein all the features form a log vector $X$ corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, $x_m$ represents the $m$-th of all the features, $num$ represents the total number of features extracted from all the word segments, and $m \in [1, num]$;
(4) calculating the Euclidean distance between the log vector obtained in step (3) and each of K preset clustering centers, acquiring the clustering center corresponding to the minimum of all the Euclidean distances, and determining the final security level of the terminal according to that clustering center.
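As an illustration of step (4), the distance computation and the choice of center can be written out as follows. The notation $C_k = (c_{k,1}, \ldots, c_{k,num})$ for the $k$-th preset clustering center and $k^{*}$ for the index of the selected center is an assumption introduced here for readability; it does not appear in the original text.

```latex
d(X, C_k) = \sqrt{\sum_{m=1}^{num} \left( x_m - c_{k,m} \right)^2},
\qquad
k^{*} = \arg\min_{k \in [1, K]} d(X, C_k)
```

The final security level of the terminal is then the level associated with the clustering center $C_{k^{*}}$.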
Preferably, the step (1) of obtaining the log data is to obtain the log data of the computer by calling a Syslog interface;
the log data comprises program modules, severity, process names, generation time, log contents and the like;
severity includes errors, information, warnings, criticality, etc.;
the program modules include a kernel layer, a user layer, a mail system, authorization information, and the like.
Preferably, the K cluster centers are established by the following steps:
(4-1) acquiring a plurality of log data from a terminal, and processing each log data by using a natural language processing library to obtain a plurality of word segments corresponding to the log data;
(4-2) for each log data, filtering the word segments corresponding to the log data to obtain a plurality of filtered word segments;
(4-3) for each log data, processing the word segments filtered in step (4-2) by using the TF-IDF algorithm to obtain a log vector corresponding to the log data;
(4-4) selecting K log vectors from all log vectors corresponding to all log data as clustering centers, and putting the clustering centers into a global variable set (which is initially empty);
(4-5) for each log vector in the log vector set, calculating the Euclidean distance from the log vector to each clustering center in the global variable set by using a MapReduce model, and establishing a key-value pair with the clustering center corresponding to the minimum of all the Euclidean distances as the key and the log vector as the value;
(4-6) for each clustering center appearing as a key in the key-value pairs established in step (4-5), forming a set of all the values whose key is that clustering center, and calculating the average of all log vectors in the set as the updated clustering center;
(4-7) judging whether the set formed by all the updated clustering centers is completely the same as the global variable set obtained in the step (4-4), if so, entering the step (4-8), otherwise, replacing the global variable set by the set formed by K updated clustering centers, and returning to the step (4-5);
(4-8) judging whether new log data are acquired from the terminal, if so, returning to the step (4-2), and if not, ending the process.
Preferably, the process of selecting K log vectors from the set of log vectors comprises the sub-steps of:
(4-4-1) calculating the average Euclidean distance $d_{avg}$ between all log vectors in the log vector set;
(4-4-2) calculating the point density of each log vector in the log vector set according to the average Euclidean distance $d_{avg}$ obtained in step (4-4-1);
(4-4-3) for each log vector in the log vector set, obtaining the shortest distance corresponding to the log vector according to the point density of the log vector obtained in step (4-4-2);
(4-4-4) obtaining, for each log vector, the value of [expression missing from the source text], and taking the K log vectors with the largest values as the clustering centers.
Preferably, the point density of each log vector is equal to [formula missing from the source text],
where the distance term denotes the Euclidean distance between log vector $X_i$ and log vector $X_j$, $i$ and $j$ both belong to $[1, n]$, and $n$ represents the total number of log vectors in the log vector set;
preferably, for a log vector $X_i$, if its point density is the maximum of the point densities of all the log vectors, the shortest distance corresponding to the log vector is equal to the maximum of the Euclidean distances from $X_i$ to every other log vector in the log vector set;
for a log vector $X_i$, if its point density is not the maximum of the point densities of all the log vectors, the shortest distance corresponding to the log vector is equal to the minimum of the Euclidean distances from $X_i$ to each log vector whose point density is greater than the point density of $X_i$.
According to another aspect of the present invention, there is provided a terminal security analysis system based on MapReduce parallel clustering technology, including:
a first module, configured to acquire log data from a terminal and process the log data by using a natural language processing library to obtain a plurality of word segments;
a second module, configured to filter the word segments obtained by the first module to obtain a plurality of filtered word segments;
a third module, configured to extract features of each word segment filtered by the second module by using the term frequency-inverse document frequency (TF-IDF) algorithm, wherein all the features form a log vector $X$ corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, $x_m$ represents the $m$-th of all the features, $num$ represents the total number of features extracted from all the word segments, and $m \in [1, num]$;
and a fourth module, configured to calculate the Euclidean distance between the log vector obtained by the third module and each of K preset clustering centers, acquire the clustering center corresponding to the minimum of all the Euclidean distances, and determine the final security level of the terminal according to that clustering center.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) because steps (1) to (4) are adopted, terminal security is classified by a clustering technique applied to the terminal's own log features, which solves the technical problem that the accuracy of the terminal security assessment is low in the existing penetration-test-based method because its quantitative and qualitative methods are not clearly defined;
(2) because steps (4-2) to (4-8) are adopted, the clustering centers are updated according to new log data acquired from the terminal, which solves the technical problem that the existing penetration-test-based method cannot identify new network risks quickly because its security vulnerability database is not updated in time;
(3) because steps (1) to (4) are adopted, the assessment is carried out automatically without human intervention, which solves the technical problems that, in the existing penetration-test-based method, evaluators must grade according to their own experience and knowledge, the method depends on professional equipment and tools and takes a certain amount of time, a large amount of manpower and material resources is consumed, and terminal security cannot be reflected in a timely and efficient manner;
(4) because steps (4-4) to (4-7) are adopted, the selection of the initial clustering centers is optimized and a MapReduce parallel programming model is used, which increases the convergence speed and running speed of the clustering algorithm, improves efficiency and performance, and saves time;
(5) the invention runs quickly and efficiently, saves economic cost, provides an objective and efficient reference for security classification in real environments, and helps improve the security of cyberspace.
Drawings
Fig. 1 is a flowchart of a terminal security analysis method based on a MapReduce parallel clustering technique according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The basic idea of the invention is as follows: log data is collected from a terminal; text preprocessing such as word segmentation and filtering is applied to the log data; the log data of the terminal is represented as vectors by using the term frequency-inverse document frequency (TF-IDF) algorithm; the log vectors are clustered in parallel with the MapReduce technique, which substantially speeds up the clustering; and the security of the terminal is judged by comparing the Euclidean distances between a newly collected log vector and the clustering centers.
As shown in fig. 1, the present invention provides a terminal security analysis method based on MapReduce parallel clustering technology, which includes the following steps:
(1) acquiring log data from a terminal, and processing the log data by using a natural language processing library to obtain a plurality of word segments;
specifically, the natural language processing library used in this step may be, for example, jieba, a third-party Python library for Chinese word segmentation.
In this step, the time interval for acquiring log data is 1 minute to 10 minutes, preferably 5 minutes;
specifically, the terminal in the present invention refers to a computer (which may be a personal computer or a notebook computer), and the obtaining of the log data in this step is to obtain the log data of the computer by calling a Syslog interface.
The log data comprises a program module, a severity, a process name, a generation time, log content and the like, wherein the severity mainly includes error, information, warning, critical and the like, and the program module includes the kernel layer, the user layer, the mail system, authorization information and the like. In theory, the log data of each time period corresponds to one complete physical behavior.
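The fields listed above line up with the standard syslog header, in which a numeric priority value encodes both the program module (facility) and the severity as `facility * 8 + severity`. The following is a minimal sketch of decoding one record under that convention. The BSD-style line layout, the regular expression, the function name `parse_syslog_line` and the returned field names are assumptions made for illustration; the original text only states that a Syslog interface is called.

```python
import re

# Standard syslog facility and severity codes (only the first facilities are listed).
FACILITIES = {0: "kern", 1: "user", 2: "mail", 3: "daemon", 4: "auth"}
SEVERITIES = {0: "emerg", 1: "alert", 2: "crit", 3: "err",
              4: "warning", 5: "notice", 6: "info", 7: "debug"}

# Assumed BSD-style layout: "<PRI>Mmm dd hh:mm:ss host process[pid]: content"
LINE_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<time>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<process>[^\[:\s]+)(?:\[\d+\])?: (?P<content>.*)$"
)

def parse_syslog_line(line):
    """Split one syslog line into the fields named in the text, or return None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # PRI = facility * 8 + severity, so divmod recovers both parts.
    facility, severity = divmod(int(m.group("pri")), 8)
    return {
        "program_module": FACILITIES.get(facility, str(facility)),
        "severity": SEVERITIES.get(severity, str(severity)),
        "process_name": m.group("process"),
        "generation_time": m.group("time"),
        "log_content": m.group("content"),
    }

record = parse_syslog_line(
    "<34>Oct 11 22:14:15 mymachine su[230]: 'su root' failed for lonvick"
)
```

Here the priority 34 decodes to facility 4 (authorization) and severity 2 (critical); only the `log_content` field would normally be passed on to word segmentation.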
(2) filtering the word segments obtained in step (1) to obtain a plurality of filtered word segments;
specifically, the filtering in this step includes, but is not limited to: deleting stop words (such as auxiliary particles), special symbols (such as punctuation marks and mathematical operators), and link addresses irrelevant to security judgment (such as advertisement links);
the advantage of this step is that filtering removes word segments irrelevant to terminal security judgment and reduces their interference with the accuracy of the security judgment result.
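A minimal sketch of such a filter is given below, assuming the word segments arrive as a list of strings. The stop-word list, the two regular expressions and the function name `filter_segments` are illustrative assumptions; the original text does not specify how stop words, symbols or links are recognized.

```python
import re

# Illustrative stop-word list (an assumption); a real deployment would load a fuller one.
STOP_WORDS = {"的", "了", "the", "a", "of", "to"}
# A segment is treated as a link if it starts like a URL.
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)
# A segment is treated as a special symbol if it has no letter, digit or CJK character.
SYMBOL_RE = re.compile(r"^[\W_]+$")

def filter_segments(segments):
    """Drop empty segments, stop words, pure-symbol segments and link addresses."""
    kept = []
    for seg in segments:
        seg = seg.strip()
        if not seg:
            continue
        if seg.lower() in STOP_WORDS:
            continue
        if URL_RE.match(seg) or SYMBOL_RE.match(seg):
            continue
        kept.append(seg)
    return kept

filtered = filter_segments(
    ["登录", "的", "失败", "!", "http://ad.example.com/x", "+", "sshd", " "]
)
```

Only the segments that carry content (`登录`, `失败`, `sshd`) survive and are passed to feature extraction.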
(3) extracting features of each word segment filtered in step (2) by using the term frequency-inverse document frequency (TF-IDF) algorithm, wherein all the features form a log vector $X$ corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, $x_m$ represents the $m$-th of all the features, $num$ represents the total number of features extracted from all the word segments, and $m \in [1, num]$;
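For illustration, one way to turn filtered word segments into log vectors is sketched below. The text names TF-IDF but does not give the exact weighting, so the smoothed inverse document frequency `log((1 + N) / (1 + df)) + 1` used here is an assumption (it is one common variant), as are the function name `tfidf_vectors` and the choice of a sorted vocabulary to fix the order of the features.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of filtered word-segment lists, one list per log record.

    Returns the vocabulary and one TF-IDF vector per record; every vector
    has one component per vocabulary entry.
    """
    vocab = sorted({seg for doc in docs for seg in doc})
    n_docs = len(docs)
    # Document frequency: number of log records that contain each segment.
    df = Counter(seg for doc in docs for seg in set(doc))
    # Smoothed IDF (an assumed variant, see the lead-in).
    idf = {seg: math.log((1 + n_docs) / (1 + df[seg])) + 1.0 for seg in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty record
        vectors.append([counts[seg] / total * idf[seg] for seg in vocab])
    return vocab, vectors

vocab, vectors = tfidf_vectors([["login", "failed", "login"], ["login", "ok"]])
```

Each returned vector plays the role of a log vector $X$, and the length of the vocabulary corresponds to $num$.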
(4) calculating the Euclidean distance between the log vector obtained in step (3) and each of K preset clustering centers, acquiring the clustering center corresponding to the minimum of all the Euclidean distances, and determining the final security level of the terminal according to that clustering center.
The steps (1) to (4) have the advantages that the terminal safety is evaluated by adopting the clustering result by combining the log characteristics of the terminal, so that the accuracy of the terminal safety evaluation is guaranteed; in addition, by adopting a MapReduce parallel programming model, the running speed of a clustering algorithm is increased, and the efficiency and the performance of the clustering algorithm are improved;
specifically, the value range of K is a natural number equal to or greater than 2, and the larger the value of K is, the more levels used for safety determination are, and the smaller the value of K is, the fewer levels used for safety determination are.
For example, with K = 10: if the final 2nd clustering center in this step has the minimum Euclidean distance, the final security level of the terminal is considered to be the lowest; if the final 10th clustering center has the minimum Euclidean distance, the final security level is considered to be the highest; and if the final 5th clustering center has the minimum Euclidean distance, the final security level is considered to be medium.
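The nearest-center rule of step (4) can be sketched as follows. The function names and the three two-dimensional centers are hypothetical values chosen for illustration, and the sketch assumes, as in the example above, that the position of a center in the list determines the security level it stands for.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def security_level(log_vector, centers):
    """Return the 1-based index of the clustering center nearest to log_vector."""
    distances = [euclidean(log_vector, c) for c in centers]
    return distances.index(min(distances)) + 1

# Hypothetical preset centers for K = 3.
centers = [[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]]
level = security_level([4.0, 6.0], centers)
```

In this example the log vector is closest to the 2nd center, so the terminal would be assigned the level associated with that center.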
The K clustering centers in this step are established by the following steps:
(4-1) acquiring a plurality of log data from a terminal, and processing each log data by using a natural language processing library to obtain a plurality of word segments corresponding to the log data;
the process of this step is identical to that of step (1), and is not described herein again.
(4-2) for each log data, filtering the word segments corresponding to the log data to obtain a plurality of filtered word segments;
the process of this step is identical to that of step (2), and is not described herein again.
(4-3) for each log data, processing the word segments filtered in step (4-2) by using the TF-IDF algorithm to obtain a log vector corresponding to the log data;
(4-4) selecting K log vectors from all log vectors corresponding to all log data as clustering centers, and putting the clustering centers into a global variable set (which is initially empty);
specifically, the process of selecting K log vectors from the log vector set in this step specifically includes the following substeps:
(4-4-1) calculating the average Euclidean distance $d_{avg}$ between all log vectors in the log vector set;
(4-4-2) calculating the point density of each log vector in the log vector set according to the average Euclidean distance $d_{avg}$ obtained in step (4-4-1) [formula missing from the source text],
where the distance term denotes the Euclidean distance between log vector $X_i$ and log vector $X_j$, $i$ and $j$ both belong to $[1, n]$, and $n$ represents the total number of log vectors in the log vector set;
(4-4-3) for each log vector in the log vector set, obtaining the shortest distance corresponding to the log vector according to the point density of the log vector obtained in step (4-4-2);
specifically, for a log vector $X_i$, if its point density is the maximum of the point densities of all the log vectors, the shortest distance corresponding to the log vector is equal to the maximum of the Euclidean distances from $X_i$ to every other log vector in the log vector set; if its point density is not the maximum of the point densities of all the log vectors, the shortest distance corresponding to the log vector is equal to the minimum of the Euclidean distances from $X_i$ to each log vector whose point density is greater than the point density of $X_i$.
(4-4-4) obtaining, for each log vector, the value of [expression missing from the source text], and taking the K log vectors with the largest values as the clustering centers.
The advantage of sub-steps (4-4-1) to (4-4-4) is that they optimize the selection of the initial clustering centers and thereby accelerate the convergence of the clustering algorithm.
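The sub-steps can be sketched as below, with two caveats. The point-density formula and the quantity ranked in sub-step (4-4-4) are not preserved in this text, so the sketch substitutes a Gaussian-kernel density with bandwidth $d_{avg}$ and ranks by the product of density and distance, as is commonly done in density-peak clustering; both choices are assumptions, not the patented formulas. The function names are likewise hypothetical, and the sketch expects at least two distinct log vectors.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_initial_centers(vectors, k):
    """Pick k initial clustering centers from the log vectors."""
    n = len(vectors)
    dist = [[euclidean(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # (4-4-1) average pairwise Euclidean distance d_avg.
    pairs = list(combinations(range(n), 2))
    d_avg = sum(dist[i][j] for i, j in pairs) / len(pairs)
    # (4-4-2) ASSUMED density: Gaussian kernel with bandwidth d_avg.
    density = [
        sum(math.exp(-((dist[i][j] / d_avg) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]
    # (4-4-3) distance rule exactly as stated in the text.
    max_density = max(density)
    delta = []
    for i in range(n):
        if density[i] == max_density:
            delta.append(max(dist[i][j] for j in range(n) if j != i))
        else:
            delta.append(min(dist[i][j] for j in range(n) if density[j] > density[i]))
    # (4-4-4) ASSUMED ranking score: density * distance; keep the k largest.
    order = sorted(range(n), key=lambda i: density[i] * delta[i], reverse=True)
    return [vectors[i] for i in order[:k]]

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
initial_centers = select_initial_centers(points, 2)
```

On this toy data the two selected centers fall in different groups of points, which is the behavior that lets the subsequent iteration converge in fewer rounds than a random start typically would.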
(4-5) for each log vector in the log vector set, calculating the Euclidean distance from the log vector to each clustering center in the global variable set by using a MapReduce model, and establishing a Key Value pair by using the clustering center corresponding to the minimum Value in all Euclidean distances as a Key (Key) and the log vector as a Value (Value);
(4-6) for each clustering center appearing as a key in the key-value pairs established in step (4-5), forming a set of all the values whose key is that clustering center, and calculating the average of all log vectors in the set as the updated clustering center;
(4-7) judging whether the set formed by all the updated clustering centers is completely the same as the global variable set obtained in the step (4-4), if so, entering the step (4-8), otherwise, replacing the global variable set by the set formed by the updated K clustering centers, and returning to the step (4-5).
(4-8) judging whether new log data are acquired from the terminal, if so, returning to the step (4-2), and if not, ending the process.
The advantage of this step is that the clustering centers are updated with the new log data acquired from the terminal, so that new network risks can be identified quickly, terminal security can be reflected in time, and the security of cyberspace can be safeguarded;
it should be noted that, when returning to step (4-2) and executing the subsequent steps, all log data in step (4-4) should include all log data before executing step (4-8) and new log data obtained by (4-8); meanwhile, each log data in steps (4-2) and (4-3) refers to the new log data acquired in step (4-8).
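The map and reduce roles of steps (4-5) to (4-7) can be illustrated with a single-process simulation. This is only a sketch: it does not use Hadoop or any MapReduce framework, it uses the index of a clustering center as the key instead of the center itself (the two are equivalent for grouping), and the `max_iter` safeguard and the handling of a center that attracts no vectors are assumptions added here.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def map_phase(vectors, centers):
    """(4-5) Emit (index of nearest center, log vector) key-value pairs."""
    for v in vectors:
        key = min(range(len(centers)), key=lambda k: euclidean(v, centers[k]))
        yield key, v

def reduce_phase(pairs, centers):
    """(4-6) Average the vectors grouped under each key to update the centers."""
    groups = defaultdict(list)
    for key, v in pairs:
        groups[key].append(v)
    updated = []
    for k, old in enumerate(centers):
        members = groups.get(k)
        if not members:
            updated.append(old)  # assumption: keep a center that got no vectors
        else:
            updated.append([sum(col) / len(members) for col in zip(*members)])
    return updated

def cluster(vectors, centers, max_iter=100):
    """(4-7) Repeat map and reduce until the centers stop changing."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(vectors, centers), centers)
        if updated == centers:
            break
        centers = updated
    return centers

final_centers = cluster(
    [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]],
    [[0.0, 0.0], [10.0, 10.0]],
)
```

In a real deployment the map calls would run in parallel over partitions of the log vector set, and when step (4-8) yields new log data the same loop would be rerun over the old and new log vectors together, as noted above.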
In summary, the terminal security analysis method based on MapReduce parallel clustering provided by the invention performs text preprocessing, text representation, log clustering model generation and terminal security judgment on terminal logs, and is of practical value for judging the security of terminals in a real network and for improving network security.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (7)
1. A terminal security analysis method based on a MapReduce parallel clustering technology is characterized by comprising the following steps:
(1) acquiring log data from a terminal, and processing the log data by using a natural language processing library to obtain a plurality of word segments;
(2) filtering the word segments obtained in step (1) to obtain a plurality of filtered word segments;
(3) extracting features of each word segment filtered in step (2) by using the term frequency-inverse document frequency (TF-IDF) algorithm, wherein all the features form a log vector $X$ corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, $x_m$ represents the $m$-th of all the features, $num$ represents the total number of features extracted from all the word segments, and $m \in [1, num]$;
(4) calculating the Euclidean distance between the log vector obtained in step (3) and each of K preset clustering centers, acquiring the clustering center corresponding to the minimum of all the Euclidean distances, and determining the final security level of the terminal according to that clustering center.
2. The MapReduce parallel clustering technology-based terminal security analysis method as set forth in claim 1,
the log data acquired in the step (1) is acquired by calling a Syslog interface;
the log data comprises program modules, severity, process names, generation time, log contents and the like;
severity includes errors, information, warnings, criticality, etc.;
the program modules include a kernel layer, a user layer, a mail system, authorization information, and the like.
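The log fields listed in claim 2 could be held in a small record type, sketched below. The sketch assumes the logs arrive as BSD-style syslog lines whose `<PRI>` prefix encodes facility × 8 + severity, as in the standard syslog format; the `LogRecord` type, the `parse_syslog_line` helper, and the line layout are assumptions for illustration, since the source only states that a Syslog interface is called.

```python
import re
from dataclasses import dataclass

# Standard syslog facility and severity codes, limited to those named in claim 2.
FACILITIES = {0: "kernel", 1: "user", 2: "mail", 4: "auth"}
SEVERITIES = {2: "critical", 3: "error", 4: "warning", 6: "info"}

# Assumed line layout: <PRI>Mmm dd hh:mm:ss host process[pid]: content
_LINE = re.compile(
    r"<(\d+)>(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) ([^:\[\s]+)(?:\[\d+\])?: (.*)"
)


@dataclass
class LogRecord:
    program_module: str   # syslog facility
    severity: str
    process_name: str
    generated_at: str     # timestamp text as it appears in the line
    content: str


def parse_syslog_line(line):
    """Parse one syslog line into a LogRecord; raise ValueError if it does not match."""
    match = _LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized syslog line: {line!r}")
    pri, timestamp, _host, process, content = match.groups()
    facility, severity = divmod(int(pri), 8)
    return LogRecord(
        program_module=FACILITIES.get(facility, f"facility-{facility}"),
        severity=SEVERITIES.get(severity, f"severity-{severity}"),
        process_name=process,
        generated_at=timestamp,
        content=content,
    )
```

Only the `content` field would normally feed the word segmentation of step (1); the other fields can be kept as metadata.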
3. The terminal security analysis method based on the MapReduce parallel clustering technology as claimed in claim 1 or 2, wherein the K clustering centers are obtained by the following steps:
(4-1) acquiring a plurality of log data from a terminal, and processing each log data by using a natural language processing library to obtain a plurality of participles corresponding to the log data;
(4-2) for each log data, filtering a plurality of participles corresponding to the log data to obtain a plurality of filtered participles;
(4-3) for each log data, processing the plurality of participles filtered in the step (4-2) by using a TF-IDF algorithm to obtain a log vector corresponding to the log data;
(4-4) selecting K log vectors from all log vectors corresponding to all log data as clustering centers, and putting the clustering centers into a global variable set (which is initially empty);
(4-5) for each log vector in the log vector set, calculating the Euclidean distance from the log vector to each clustering center in the global variable set by using a MapReduce model, and establishing a key-value pair with the clustering center corresponding to the minimum of all the Euclidean distances as the key and the log vector as the value;
(4-6) for each clustering center that appears as a key in the key-value pairs established in step (4-5), forming a set from all values corresponding to that key, and calculating the average of all log vectors in the set as the updated clustering center;
(4-7) judging whether the set formed by all the updated clustering centers is completely the same as the global variable set obtained in the step (4-4), if so, entering the step (4-8), otherwise, replacing the global variable set by the set formed by K updated clustering centers, and returning to the step (4-5);
(4-8) judging whether new log data are acquired from the terminal, if so, returning to the step (4-2), and if not, ending the process.
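The iteration in steps (4-5) to (4-7) can be sketched as a single-process simulation of the map and reduce phases. This is an illustrative sketch rather than a Hadoop job: the function names are assumptions, the key is the index of the clustering center rather than the center itself, a center that attracts no log vectors is kept unchanged, and `max_iter` is an added safety bound that the source does not mention.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_phase(vectors, centers):
    """Step (4-5): emit (index of nearest center, log vector) key-value pairs."""
    for vector in vectors:
        key = min(range(len(centers)), key=lambda k: euclidean(vector, centers[k]))
        yield key, vector


def reduce_phase(pairs, centers):
    """Step (4-6): average the log vectors grouped under each key."""
    groups = defaultdict(list)
    for key, vector in pairs:
        groups[key].append(vector)
    updated = []
    for k, old_center in enumerate(centers):
        members = groups.get(k)
        if not members:
            updated.append(old_center)  # assumption: keep a center with no members
        else:
            updated.append([sum(column) / len(members) for column in zip(*members)])
    return updated


def cluster(vectors, initial_centers, max_iter=100):
    """Step (4-7): repeat map and reduce until the centers stop changing."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(vectors, centers), centers)
        if updated == centers:
            break
        centers = updated
    return centers
```

In a real MapReduce deployment the map calls would run in parallel over partitions of the log vector set, with the current centers distributed to every mapper as the global variable set.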
4. The MapReduce parallel clustering technology-based terminal security analysis method as claimed in any one of claims 1 to 3, wherein the process of selecting K log vectors from the log vector set comprises the following sub-steps:
(4-4-1) calculating the average Euclidean distance $d_{avg}$ of all log vectors in the log vector set;
(4-4-2) calculating the point density of each log vector in the log vector set according to the average Euclidean distance $d_{avg}$ obtained in step (4-4-1);
(4-4-3) for each log vector in the log vector set, obtaining the shortest distance corresponding to the log vector according to the point density of the log vector obtained in the step (4-4-2);
5. The MapReduce parallel clustering technology-based terminal security analysis method as set forth in claim 4, wherein the point density of each log vector is equal to:
6. The MapReduce parallel clustering technology-based terminal security analysis method as set forth in claim 5,
for a log vector $X_i$, if its point density is the maximum of the point densities of all log vectors, the shortest distance corresponding to the log vector is equal to the maximum of the Euclidean distances from $X_i$ to each of the remaining log vectors in the log vector set;
for a log vector $X_i$, if its point density is not the maximum of the point densities of all log vectors, the shortest distance corresponding to the log vector is equal to the minimum of the Euclidean distances from $X_i$ to each log vector whose point density is greater than that of $X_i$.
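The two cases of claim 6 can be sketched as below. Because the point-density formula of claim 5 is not reproduced in this text, the sketch takes the densities as an input rather than computing them; the function name `shortest_distances` is an assumption, and the code assumes at least two log vectors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shortest_distances(vectors, densities):
    """Return the claim-6 'shortest distance' for every log vector.

    densities[i] is the point density of vectors[i], computed elsewhere.
    """
    top_density = max(densities)
    result = []
    for i, (x_i, rho_i) in enumerate(zip(vectors, densities)):
        if rho_i == top_density:
            # Densest vector: maximum distance to any of the remaining vectors.
            result.append(max(
                euclidean(x_i, x_j) for j, x_j in enumerate(vectors) if j != i
            ))
        else:
            # Otherwise: minimum distance to any vector with a higher density.
            result.append(min(
                euclidean(x_i, x_j)
                for x_j, rho_j in zip(vectors, densities)
                if rho_j > rho_i
            ))
    return result
```

With hypothetical densities `[2, 3, 1]` for the vectors `[0, 0]`, `[3, 4]` and `[0, 1]`, the densest vector `[3, 4]` receives its largest distance to the others, while `[0, 1]` receives its distance to the nearest denser vector, `[0, 0]`.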
7. A terminal security analysis system based on a MapReduce parallel clustering technology is characterized by comprising:
a first module, configured to acquire log data from a terminal and process the log data by using a natural language processing library to obtain a plurality of participles;
a second module, configured to filter the plurality of participles obtained by the first module to obtain a plurality of filtered participles;
a third module, configured to extract, by using a term frequency-inverse document frequency (TF-IDF) algorithm, the features of each participle filtered by the second module, wherein all the features form a log vector X corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, $x_m$ represents the m-th of all features, num represents the total number of features of all the participles extracted, and $m \in [1, num]$;
and a fourth module, configured to calculate the Euclidean distance between the log vector corresponding to the log data obtained by the third module and each of K preset clustering centers, acquire the clustering center corresponding to the minimum of all the Euclidean distances, and determine the final security level of the terminal according to that clustering center.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110092700.8A CN112765660A (en) | 2021-01-25 | 2021-01-25 | Terminal security analysis method and system based on MapReduce parallel clustering technology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112765660A true CN112765660A (en) | 2021-05-07 |
Family
ID=75706955
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110092700.8A Pending CN112765660A (en) | 2021-01-25 | 2021-01-25 | Terminal security analysis method and system based on MapReduce parallel clustering technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112765660A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104965846A (en) * | 2014-12-31 | 2015-10-07 | 深圳市华傲数据技术有限公司 | Virtual human establishing method on MapReduce platform |
CN105740397A (en) * | 2016-01-28 | 2016-07-06 | 广州市讯飞樽鸿信息技术有限公司 | Big data parallel operation-based voice mail business data analysis method |
US20160196174A1 (en) * | 2015-01-02 | 2016-07-07 | Tata Consultancy Services Limited | Real-time categorization of log events |
CN109284371A (en) * | 2018-09-03 | 2019-01-29 | 平安证券股份有限公司 | Anti- fraud method, electronic device and computer readable storage medium |
CN111489030A (en) * | 2020-04-09 | 2020-08-04 | 河北利至人力资源服务有限公司 | Text word segmentation based job leaving prediction method and system |
Non-Patent Citations (1)
Title |
---|
贾淑芳 (Jia Shufang): "Query Expansion Based on User Log Clustering" (基于用户日志聚类的查询扩展), Wanfang Data Dissertation Database (《万方数据学位论文库》), 22 December 2010 (2010-12-22), pages 1 - 62 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113407656A (en) * | 2021-06-24 | 2021-09-17 | 上海上讯信息技术股份有限公司 | Method and equipment for fast online log clustering |
CN113407656B (en) * | 2021-06-24 | 2023-03-07 | 上海上讯信息技术股份有限公司 | Method and equipment for fast online log clustering |
CN113505823A (en) * | 2021-07-02 | 2021-10-15 | 中国联合网络通信集团有限公司 | Supply chain security analysis method and computer-readable storage medium |
CN113505823B (en) * | 2021-07-02 | 2023-06-23 | 中国联合网络通信集团有限公司 | Supply chain security analysis method and computer readable storage medium |
CN113486354A (en) * | 2021-08-20 | 2021-10-08 | 国网山东省电力公司电力科学研究院 | Firmware safety evaluation method, system, medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110958220B (en) | Network space security threat detection method and system based on heterogeneous graph embedding | |
CN109347801B (en) | Vulnerability exploitation risk assessment method based on multi-source word embedding and knowledge graph | |
CN111428231B (en) | Safety processing method, device and equipment based on user behaviors | |
CN112765660A (en) | Terminal security analysis method and system based on MapReduce parallel clustering technology | |
CN111695597B (en) | Credit fraud group identification method and system based on improved isolated forest algorithm | |
CN107273752B (en) | Vulnerability automatic classification method based on word frequency statistics and naive Bayes fusion model | |
Xiao et al. | From patching delays to infection symptoms: Using risk profiles for an early discovery of vulnerabilities exploited in the wild | |
CN102045358A (en) | Intrusion detection method based on integral correlation analysis and hierarchical clustering | |
WO2017152877A1 (en) | Network threat event evaluation method and apparatus | |
CN113609261B (en) | Vulnerability information mining method and device based on knowledge graph of network information security | |
CN105072214A (en) | C&C domain name identification method based on domain name feature | |
CN109376537B (en) | Asset scoring method and system based on multi-factor fusion | |
CN110011976B (en) | Network attack destruction capability quantitative evaluation method and system | |
CN115622738A (en) | RBF neural network-based safety emergency disposal system and method | |
CN111159702B (en) | Process list generation method and device | |
CN114091042A (en) | Risk early warning method | |
CN115879017A (en) | Automatic classification and grading method and device for power sensitive data and storage medium | |
CN112637108B (en) | Internal threat analysis method and system based on anomaly detection and emotion analysis | |
CN109344913B (en) | Network intrusion behavior detection method based on improved MajorCluster clustering | |
Harbola et al. | Improved intrusion detection in DDoS applying feature selection using rank & score of attributes in KDD-99 data set | |
Kim et al. | Comparative experiment on TTP classification with class imbalance using oversampling from CTI dataset | |
CN110719278A (en) | Method, device, equipment and medium for detecting network intrusion data | |
CN114024761A (en) | Network threat data detection method and device, storage medium and electronic equipment | |
CN113055368B (en) | Web scanning identification method and device and computer storage medium | |
CN114866297A (en) | Network data detection method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | Effective date of registration: 20220701. Address after: 410000 No. 102, Heguang Road, Xianghu street, Furong district, Changsha City, Hunan Province. Applicant after: Hunan Kuangan Network Technology Co.,Ltd. Address before: Yuelu District City, Hunan province 410082 Changsha Lushan Road No. 1. Applicant before: HUNAN University; Hunan kuang'an Network Technology Co., Ltd |