CN112765660A - Terminal security analysis method and system based on MapReduce parallel clustering technology - Google Patents

Info

Publication number
CN112765660A
Authority
CN
China
Prior art keywords
log
vector
clustering
log data
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110092700.8A
Other languages
Chinese (zh)
Inventor
李肯立
李金娜
杨志邦
于思洋
刘楚波
唐卓
肖国庆
段明星
阳王东
李克勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Kuangan Network Technology Co ltd
Original Assignee
Hunan Kuangan Network Technology Co ltd
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Kuangan Network Technology Co ltd, Hunan University
Priority to CN202110092700.8A
Publication of CN112765660A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/62: Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218: Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245: Protecting personal data, e.g. for financial or medical purposes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57: Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577: Assessing vulnerabilities and evaluating computer system security
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a terminal security analysis method based on a MapReduce parallel clustering technology, comprising the following steps: acquiring log data from a terminal, and processing the log data with a natural language processing library to obtain a plurality of participles; filtering the obtained participles to obtain a plurality of filtered participles; extracting the features of each filtered participle using the TF-IDF algorithm, wherein all the features form a log vector X corresponding to the log data; and calculating the Euclidean distance between the log vector and each of K preset clustering centers, acquiring the clustering center corresponding to the minimum of these Euclidean distances, and determining the final security level of the terminal according to that clustering center. The invention reduces the interference caused by noisy logs, and addresses the problems that existing terminal security judgment has high labor cost and low speed, that classification results are influenced by the individual experience of different technicians, and that traditional terminal security classification methods are inaccurate.

Description

Terminal security analysis method and system based on MapReduce parallel clustering technology
Technical Field
The invention belongs to the technical field of information security, and particularly relates to a terminal security analysis method and system based on a MapReduce parallel clustering technology.
Background
With the rapid development of computer network technology, all sectors of society increasingly depend on the network, network intrusion events involving viruses, spyware and hackers are increasing, and network and information security problems are increasingly prominent. However, the conventional security defense concept is limited to defense at the gateway level and the network boundary (firewalls, virus prevention, vulnerability scanning). Important security protection devices are mainly concentrated at the entrance of the machine room or the network; under the monitoring of these devices, security threats from outside the network are greatly reduced, whereas security threats from internal computer terminals have become a common problem. After years of network security construction, the focus of network security has shifted from the protection of the core network and the backbone network to the security management of internal terminals. Terminal security has become the part of network security management with the largest workload, and risk assessment of terminal security is a key link in ensuring network security. To protect the network from threats, the terminal security condition of the network should be analyzed and assessed objectively and effectively according to the log indicators of the terminals, and the security policy should be improved and a risk handling plan made according to the assessment result, so as to protect network information security to the greatest extent.
Traditional network terminal security assessment mainly uses penetration testing technology and relies chiefly on host-based scanning tools: it monitors the security vulnerabilities of the network terminal system, derives an assessment value of the terminal's security condition from the vulnerability information obtained, and then analyzes the security of the terminal by a qualitative method, a quantitative method, or a combination of the two.
However, the above network terminal security assessment method based on penetration testing technology still has non-negligible defects. First, because terminal security involves multiple layers and multiple factors and is both uncertain and complex, some traditional quantitative and qualitative analysis methods contradict one another, so that indicators cannot be compared directly; it is therefore difficult to evaluate terminal security comprehensively, and the accuracy of the resulting terminal security assessment is low. Second, because the traditional method depends on an existing security vulnerability library that is not updated in a timely manner, new network risks cannot be identified rapidly. Third, because the traditional method mainly relies on evaluators scoring according to their own experience and knowledge, and requires professional equipment and tools and a certain amount of time, it consumes considerable manpower and material resources and cannot reflect terminal security in a timely and efficient manner.
Disclosure of Invention
The invention provides a terminal security analysis method and system based on a MapReduce parallel clustering technology, aiming to solve the following technical problems of the existing network terminal security assessment method based on penetration testing technology: the accuracy of the resulting terminal security assessment is low because terminal security cannot be evaluated comprehensively; new network risks cannot be identified rapidly because the method relies on an existing security vulnerability library that is not updated in a timely manner; and the method is time-consuming and labor-intensive and cannot reflect terminal security in a timely and efficient manner because it relies on professional equipment and tools.
In order to achieve the above object, according to an aspect of the present invention, there is provided a terminal security analysis method based on a MapReduce parallel clustering technique, including the steps of:
(1) acquiring log data from a terminal, and processing the log data by using a natural language processing library to obtain a plurality of participles;
(2) filtering the multiple participles obtained in the step (1) to obtain multiple filtered participles;
(3) extracting the features of each participle filtered in step (2) using the term frequency-inverse document frequency (TF-IDF) algorithm, wherein all the features form a log vector X corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, where $x_m$ represents the m-th of all features, num represents the total number of features extracted from all the participles, and $m \in [1, num]$;
(4) calculating the Euclidean distance between the log vector obtained in step (3) and each of K preset clustering centers, acquiring the clustering center corresponding to the minimum of these Euclidean distances, and determining the final security level of the terminal according to that clustering center.
Preferably, the step (1) of obtaining the log data is to obtain the log data of the computer by calling a Syslog interface;
the log data comprises program modules, severity, process names, generation time, log contents and the like;
severity includes errors, information, warnings, criticality, etc.;
the program modules include a kernel layer, a user layer, a mail system, authorization information, and the like.
Preferably, the K cluster centers are established by the following steps:
(4-1) acquiring a plurality of log data from a terminal, and processing each log data by using a natural language processing library to obtain a plurality of participles corresponding to the log data;
(4-2) for each log data, filtering a plurality of participles corresponding to the log data to obtain a plurality of filtered participles;
(4-3) for each log data, processing the plurality of participles filtered in the step (4-2) by using a TF-IDF algorithm to obtain a log vector corresponding to the log data;
(4-4) selecting K log vectors from all log vectors corresponding to all log data as clustering centers, and putting the clustering centers into a global variable set (which is initially empty);
(4-5) for each log vector in the log vector set, calculating Euclidean distances from the log vector to each clustering center in the global variable set by using a MapReduce model, and establishing a key-value pair by using the clustering center corresponding to the minimum value in all the Euclidean distances as a key and the log vector as a value;
(4-6) for each clustering center that appears as a key in the key-value pairs established in step (4-5), forming a set from all the values (log vectors) whose key is that clustering center, and calculating the average of all log vectors in the set as the updated clustering center;
(4-7) judging whether the set formed by all the updated clustering centers is completely the same as the global variable set obtained in the step (4-4), if so, entering the step (4-8), otherwise, replacing the global variable set by the set formed by K updated clustering centers, and returning to the step (4-5);
(4-8) judging whether new log data are acquired from the terminal, if so, returning to the step (4-2), and if not, ending the process.
Preferably, the process of selecting K log vectors from the set of log vectors comprises the sub-steps of:
(4-4-1) calculating the average Euclidean distance $d_{avg}$ of all log vectors in the log vector set;
(4-4-2) calculating, according to the average Euclidean distance $d_{avg}$ obtained in step (4-4-1), the point density of each log vector in the log vector set (symbol given in Figure BDA0002913284750000043);
(4-4-3) for each log vector in the log vector set, obtaining the shortest distance corresponding to the log vector according to the point density of the log vector obtained in step (4-4-2);
(4-4-4) obtaining, for each log vector, the quantity given in Figure BDA0002913284750000044, and taking the K log vectors with the largest values as the clustering centers.
Preferably, the point density of each log vector is equal to:
Figure BDA0002913284750000041
wherein the quantity given in Figure BDA0002913284750000042 represents the Euclidean distance between log vector $X_i$ and log vector $X_j$, i and j both belong to [1, n], and n represents the total number of log vectors in the log vector set;
preferably, for log vector $X_i$, if its point density is the maximum of the point densities of all the log vectors, its shortest distance (Figure BDA0002913284750000045) is equal to the maximum of the Euclidean distances from $X_i$ to every other log vector in the log vector set;
for log vector $X_i$, if its point density is not the maximum of the point densities of all the log vectors, its shortest distance (Figure BDA0002913284750000046) is equal to the minimum of the Euclidean distances from $X_i$ to each log vector whose point density is greater than that of $X_i$.
According to another aspect of the present invention, there is provided a terminal security analysis system based on MapReduce parallel clustering technology, including:
a first module, configured to acquire log data from a terminal and process the log data with a natural language processing library to obtain a plurality of participles;
a second module, configured to filter the participles obtained by the first module to obtain a plurality of filtered participles;
a third module, configured to extract, using the term frequency-inverse document frequency (TF-IDF) algorithm, the features of each participle filtered by the second module, wherein all the features form a log vector X corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, where $x_m$ represents the m-th of all features, num represents the total number of features extracted from all the participles, and $m \in [1, num]$;
a fourth module, configured to calculate the Euclidean distance between the log vector obtained by the third module and each of K preset clustering centers, acquire the clustering center corresponding to the minimum of these Euclidean distances, and determine the final security level of the terminal according to that clustering center.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) because the steps (1) to (4) are adopted, the terminal safety is classified by adopting the clustering technology in combination with the self log characteristics of the terminal, and the technical problem of low accuracy of the terminal safety evaluation result caused by undefined quantitative and qualitative methods in the existing network terminal safety evaluation method based on the penetration test technology can be solved;
(2) because the invention adopts the steps (4-2) to (4-8), the clustering center is updated according to the new log data acquired from the terminal, so the technical problem that the new network risk can not be rapidly identified due to the fact that the security vulnerability library is not updated timely in the existing network terminal security assessment method based on the penetration test technology can be solved;
(3) because the steps (1) to (4) are adopted, the evaluation process is automatically carried out without human intervention, and the technical problems that in the existing network terminal safety evaluation method based on the penetration test technology, an evaluator needs to grade according to own experience knowledge, the evaluation method depends on professional equipment and tools and consumes certain time, a large amount of manpower and material resources cost is consumed, and the terminal safety cannot be reflected timely and efficiently can be solved;
(4) because the steps (4-4) to (4-7) are adopted, the selected value of the initial clustering center is optimized, and a MapReduce parallel programming model is adopted, the convergence speed and the running speed of a clustering algorithm are increased, the efficiency and the performance are improved, and the time cost is saved;
(5) the invention has the advantages of high operation speed, high efficiency, economic cost saving, objective and efficient safety division reference for the actual environment and contribution to improving the safety of network space.
Drawings
Fig. 1 is a flowchart of a terminal security analysis method based on a MapReduce parallel clustering technique according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The basic idea of the invention is as follows: log data is collected from a terminal; text preprocessing such as word segmentation and filtering is performed on the log data; the log data is represented as vectors using the term frequency-inverse document frequency (TF-IDF) algorithm; the log vectors are clustered in parallel using the MapReduce parallel technology, which substantially accelerates the clustering; and the security of the terminal is judged by comparing the Euclidean distances between a newly collected log vector and the clustering centers.
As shown in fig. 1, the present invention provides a terminal security analysis method based on MapReduce parallel clustering technology, which includes the following steps:
(1) acquiring log data from a terminal, and processing the log data by using a natural language processing library to obtain a plurality of participles;
specifically, the natural language processing library used in this step may be, for example, the jieba library, a third-party Python library for Chinese word segmentation.
In this step, the time interval for acquiring log data is 1 minute to 10 minutes, preferably 5 minutes;
specifically, the terminal in the present invention refers to a computer (which may be a personal computer or a notebook computer), and the obtaining of the log data in this step is to obtain the log data of the computer by calling a Syslog interface.
The log data comprises the program module, severity, process name, generation time, log content and the like, wherein the severity mainly includes errors, information, warnings, criticality and the like, and the program module includes the kernel layer, user layer, mail system, authorization information and the like; in principle, the log data collected in each time period corresponds to one complete physical behavior.
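As an illustration of step (1), the following minimal sketch parses one syslog-style record into the fields listed above and splits its content into tokens. The line layout assumed by `LINE_RE`, the function names `parse_log_line` and `tokenize`, and the regex tokenizer are assumptions made for this example; the tokenizer is only a stand-in for a segmentation library such as jieba, and the patent does not specify the record format.

```python
import re

# Syslog encodes the priority as PRI = facility * 8 + severity.
# Only the facilities named in the text are listed here.
FACILITIES = {0: "kern", 1: "user", 2: "mail", 4: "auth"}
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

# ASSUMED layout: "<PRI>Mmm dd hh:mm:ss host process[pid]: content"
LINE_RE = re.compile(
    r"<(?P<pri>\d{1,3})>(?P<time>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<process>[^\[:\s]+)(?:\[\d+\])?: (?P<content>.*)"
)


def parse_log_line(line):
    """Return the fields of one log record, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    facility, severity = divmod(int(m.group("pri")), 8)
    return {
        "module": FACILITIES.get(facility, str(facility)),
        "severity": SEVERITIES[severity],
        "process": m.group("process"),
        "time": m.group("time"),
        "content": m.group("content"),
    }


def tokenize(text):
    """Stand-in tokenizer: runs of word characters, or single symbols."""
    return re.findall(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]", text)
```

With a real segmentation library, `tokenize` would be replaced by that library's segmentation call; the rest of the sketch is unaffected.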
(2) Filtering the multiple participles obtained in the step (1) to obtain multiple filtered participles;
specifically, the filtering process in this step includes, but is not limited to: deleting stop words (such as auxiliary particles), special symbols (such as punctuation marks and mathematical operators), and link addresses (such as advertisement links) that are irrelevant to the security judgment;
the advantage of this step is that the filtering removes participles irrelevant to the terminal security judgment and reduces their interference with the accuracy of the security judgment result.
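The filtering in step (2) could be sketched as follows. The stop-word set, the URL pattern, the lower-casing, and the name `filter_tokens` are illustrative assumptions; the patent does not give a concrete stop-word list or matching rules.

```python
import re

# ASSUMED, illustrative stop-word list; a real deployment would load a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "for", "is"}
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)
WORD_RE = re.compile(r"\w")


def filter_tokens(tokens):
    """Drop link addresses, pure-symbol tokens, and stop words."""
    kept = []
    for tok in tokens:
        t = tok.strip().lower()  # lower-casing is an added normalization choice
        if not t:
            continue
        if URL_RE.match(t):        # link addresses
            continue
        if not WORD_RE.search(t):  # punctuation marks, mathematical operators
            continue
        if t in STOP_WORDS:        # stop words
            continue
        kept.append(t)
    return kept
```

Because `\w` matches Unicode letters in Python 3, Chinese tokens pass the symbol check and are only removed if they appear in the stop-word set.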
(3) Extracting the features of each participle filtered in step (2) using the term frequency-inverse document frequency (TF-IDF) algorithm, wherein all the features form a log vector X corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, where $x_m$ represents the m-th of all features, num represents the total number of features extracted from all the participles, and $m \in [1, num]$;
(4) Calculating the Euclidean distance between the log vector obtained in step (3) and each of K preset clustering centers, acquiring the clustering center corresponding to the minimum of these Euclidean distances, and determining the final security level of the terminal according to that clustering center.
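One way to picture the log vector of step (3) is the small TF-IDF sketch below. It assumes the common unsmoothed variant, with term frequency as count divided by document length and inverse document frequency as log(N / df); the patent does not state which TF-IDF variant is used, and the names `build_vocabulary` and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of all distinct tokens; its length plays the role of num."""
    return sorted({tok for doc in docs for tok in doc})


def tfidf_vectors(docs):
    """Turn each token list into a vector (x_1, ..., x_num) of TF-IDF weights."""
    vocab = build_vocabulary(docs)
    n_docs = len(docs)
    # Document frequency: number of logs that contain the token at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty log
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors
```

In this variant a token that occurs in every log gets weight zero, which matches the intent of down-weighting terms that carry no discriminating information.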
The steps (1) to (4) have the advantages that the terminal safety is evaluated by adopting the clustering result by combining the log characteristics of the terminal, so that the accuracy of the terminal safety evaluation is guaranteed; in addition, by adopting a MapReduce parallel programming model, the running speed of a clustering algorithm is increased, and the efficiency and the performance of the clustering algorithm are improved;
specifically, the value range of K is a natural number equal to or greater than 2, and the larger the value of K is, the more levels used for safety determination are, and the smaller the value of K is, the fewer levels used for safety determination are.
For example, if K is 10, if the final 2 nd clustering center in this step has the minimum euclidean distance, the final security level of the terminal is considered to be the lowest; if the final 10 th clustering center in the step has the minimum Euclidean distance, the final security level of the terminal is considered to be the highest; and if the final 5 th clustering center in the step has the minimum Euclidean distance, the final security level of the terminal is considered to be medium.
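The decision in step (4) amounts to a nearest-center lookup. The sketch below assumes the K clustering centers are already available as lists of numbers and that a hypothetical `level_names` list assigns one security level to each center position; how centers are ordered by level is not specified in the text.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(vector, centers):
    """Return (index, distance) of the clustering center closest to vector."""
    distances = [euclidean(vector, c) for c in centers]
    best = min(range(len(centers)), key=distances.__getitem__)
    return best, distances[best]


def security_level(vector, centers, level_names):
    """Map a log vector to the level associated with its nearest center.

    level_names is an ASSUMED lookup table with one entry per center.
    """
    index, _ = nearest_center(vector, centers)
    return level_names[index]
```

For instance, with centers `[[0, 0], [1, 1], [5, 5]]` and levels `["low", "medium", "high"]`, the vector `[0.9, 1.2]` is closest to the second center and is therefore labelled `"medium"`.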
The K clustering centers in the step are obtained by establishing the following steps:
(4-1) acquiring a plurality of log data from a terminal, and processing each log data by using a natural language processing library to obtain a plurality of participles corresponding to the log data;
the process of this step is identical to that of step (1), and is not described herein again.
(4-2) for each log data, filtering a plurality of participles corresponding to the log data to obtain a plurality of filtered participles;
the process of this step is identical to that of the step (2), and is not described herein again.
(4-3) for each log data, processing the plurality of participles filtered in the step (4-2) by using a TF-IDF algorithm to obtain a log vector corresponding to the log data;
(4-4) selecting K log vectors from all log vectors corresponding to all log data as clustering centers, and putting the clustering centers into a global variable set (which is initially empty);
specifically, the process of selecting K log vectors from the log vector set in this step specifically includes the following substeps:
(4-4-1) calculating the average Euclidean distance $d_{avg}$ of all log vectors in the log vector set;
(4-4-2) calculating, according to the average Euclidean distance $d_{avg}$ obtained in step (4-4-1), the point density of each log vector in the log vector set (symbol given in Figure BDA0002913284750000082):
Figure BDA0002913284750000081
wherein the quantity given in Figure BDA0002913284750000091 represents the Euclidean distance between log vector $X_i$ and log vector $X_j$, i and j both belong to [1, n], and n represents the total number of log vectors in the log vector set;
(4-4-3) for each log vector in the log vector set, obtaining the shortest distance corresponding to the log vector according to the point density of the log vector obtained in step (4-4-2);
specifically, for log vector $X_i$, if its point density is the maximum of the point densities of all the log vectors, its shortest distance (Figure BDA0002913284750000092) is equal to the maximum of the Euclidean distances from $X_i$ to every other log vector in the log vector set; if its point density is not the maximum of the point densities of all the log vectors, its shortest distance (Figure BDA0002913284750000093) is equal to the minimum of the Euclidean distances from $X_i$ to each log vector whose point density is greater than that of $X_i$.
(4-4-4) obtaining, for each log vector, the quantity given in Figure BDA0002913284750000094, and taking the K log vectors with the largest values as the clustering centers;
the above substeps (4-4-1) to (4-4-4) have the advantage that they optimize the selection value of the initial clustering center, and accelerate the convergence rate of the clustering algorithm.
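Sub-steps (4-4-1) to (4-4-4) can be sketched as below, with two loudly stated assumptions, because the formula images are not available in this text: the point density of a vector is taken to be the number of other vectors closer than $d_{avg}$, and the ranking quantity in (4-4-4) is taken to be the product of point density and shortest distance. Both choices follow the usual density-peak heuristic and may differ from the patent's actual formulas; `select_initial_centers` is a hypothetical name.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_initial_centers(vectors, k):
    """Pick k initial clustering centers from the log vectors."""
    n = len(vectors)
    dist = [[euclidean(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

    # (4-4-1) average Euclidean distance over all pairs of vectors
    pairs = list(combinations(range(n), 2))
    d_avg = sum(dist[i][j] for i, j in pairs) / len(pairs)

    # (4-4-2) ASSUMED density: number of other vectors closer than d_avg
    density = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_avg) for i in range(n)
    ]

    # (4-4-3) shortest distance, as described in the text
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if density[j] > density[i]]
        if denser:
            delta.append(min(denser))  # nearest vector with higher density
        else:
            # density is maximal: use the largest distance to any other vector
            delta.append(max(dist[i][j] for j in range(n) if j != i))

    # (4-4-4) ASSUMED score: density * shortest distance; keep the top k
    score = [density[i] * delta[i] for i in range(n)]
    ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
    return [vectors[i] for i in ranked[:k]]
```

The intent of the heuristic is that a good initial center is both surrounded by many vectors and far from any denser vector, so the chosen centers tend to fall in different dense regions.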
(4-5) for each log vector in the log vector set, calculating the Euclidean distance from the log vector to each clustering center in the global variable set by using a MapReduce model, and establishing a Key Value pair by using the clustering center corresponding to the minimum Value in all Euclidean distances as a Key (Key) and the log vector as a Value (Value);
(4-6) for each clustering center that appears as a key in the key-value pairs established in step (4-5), forming a set from all the values (log vectors) whose key is that clustering center, and calculating the average of all log vectors in the set as the updated clustering center;
(4-7) judging whether the set formed by all the updated clustering centers is completely the same as the global variable set obtained in the step (4-4), if so, entering the step (4-8), otherwise, replacing the global variable set by the set formed by the updated K clustering centers, and returning to the step (4-5).
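Steps (4-5) to (4-7) follow the familiar map/reduce shape of k-means, which the single-process sketch below imitates. It is not a Hadoop job: `map_phase` and `reduce_phase` are hypothetical names, the key is the index of the clustering center rather than the center itself, an empty cluster keeps its old center, and `max_iter` is an added safety cap; all of these are assumptions of the example.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_phase(vectors, centers):
    """(4-5) emit one (nearest center index, log vector) pair per log vector."""
    for v in vectors:
        key = min(range(len(centers)), key=lambda c: euclidean(v, centers[c]))
        yield key, v


def reduce_phase(pairs, centers):
    """(4-6) average the vectors grouped under each key to update its center."""
    groups = defaultdict(list)
    for key, v in pairs:
        groups[key].append(v)
    updated = []
    for idx, old in enumerate(centers):
        members = groups.get(idx)
        if not members:
            updated.append(list(old))  # ASSUMPTION: empty cluster keeps its center
        else:
            updated.append([sum(col) / len(members) for col in zip(*members)])
    return updated


def cluster(vectors, centers, max_iter=100):
    """(4-7) repeat map and reduce until the centers no longer change."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(vectors, centers), centers)
        if updated == centers:
            break
        centers = updated
    return centers
```

In a distributed setting the map calls would run in parallel over partitions of the log vectors and the grouping by key would be done by the framework's shuffle stage; only the per-key averaging belongs to the reducer.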
(4-8) judging whether new log data are acquired from the terminal, if so, returning to the step (4-2), and if not, ending the process.
The advantage of steps (4-2) to (4-8) is that the clustering centers are updated with the new log data acquired from the terminal, so that new network risks can be identified rapidly, terminal security can be reflected in a timely manner, and the security of the network space can be safeguarded;
it should be noted that, when returning to step (4-2) and executing the subsequent steps, all log data in step (4-4) should include all log data before executing step (4-8) and new log data obtained by (4-8); meanwhile, each log data in steps (4-2) and (4-3) refers to the new log data acquired in step (4-8).
In summary, the terminal security analysis method based on MapReduce parallel clustering provided by the invention performs text preprocessing, text representation, log clustering model generation and terminal security judgment on the terminal logs, and is necessary for judging the terminal security in the network and improving the network security in the actual network.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (7)

1. A terminal security analysis method based on a MapReduce parallel clustering technology is characterized by comprising the following steps:
(1) acquiring log data from a terminal, and processing the log data by using a natural language processing library to obtain a plurality of participles;
(2) filtering the multiple participles obtained in step (1) to obtain multiple filtered participles;
(3) extracting the features of each participle filtered in step (2) using the term frequency-inverse document frequency (TF-IDF) algorithm, wherein all the features form a log vector X corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, where $x_m$ represents the m-th of all features, num represents the total number of features extracted from all the participles, and $m \in [1, num]$;
(4) calculating the Euclidean distance between the log vector obtained in step (3) and each of K preset clustering centers, acquiring the clustering center corresponding to the minimum of these Euclidean distances, and determining the final security level of the terminal according to that clustering center.
2. The MapReduce parallel clustering technology-based terminal security analysis method as set forth in claim 1,
the log data acquired in the step (1) is acquired by calling a Syslog interface;
the log data comprises program modules, severity, process names, generation time, log contents and the like;
severity includes errors, information, warnings, criticality, etc.;
the program modules include a kernel layer, a user layer, a mail system, authorization information, and the like.
3. The terminal security analysis method based on the MapReduce parallel clustering technology as claimed in claim 1 or 2, wherein the K clustering centers are obtained by the following steps:
(4-1) acquiring a plurality of log data from a terminal, and processing each log data using a natural language processing library to obtain a plurality of word segments corresponding to the log data;
(4-2) for each log data, filtering the plurality of word segments corresponding to the log data to obtain a plurality of filtered word segments;
(4-3) for each log data, processing the plurality of word segments filtered in step (4-2) using the TF-IDF algorithm to obtain a log vector corresponding to the log data;
(4-4) selecting K log vectors from the log vector set formed by all the log vectors corresponding to all the log data as clustering centers, and putting the clustering centers into a global variable set (which is initially empty);
(4-5) for each log vector in the log vector set, calculating the Euclidean distance from the log vector to each clustering center in the global variable set using the MapReduce model, and establishing a key-value pair with the clustering center corresponding to the minimum of all the Euclidean distances as the key and the log vector as the value;
(4-6) for each clustering center that appears as a key in the key-value pairs established in step (4-5), forming a set from all the values whose key is that clustering center, and calculating the average of all the log vectors in the set as the updated clustering center;
(4-7) judging whether the set formed by all the updated clustering centers is identical to the global variable set obtained in step (4-4); if so, entering step (4-8); otherwise, replacing the global variable set with the set formed by the K updated clustering centers, and returning to step (4-5);
(4-8) judging whether new log data is acquired from the terminal; if so, returning to step (4-2); otherwise, ending the process.
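Steps (4-5) to (4-7) amount to one k-means iteration expressed as a map phase and a reduce phase. The following is a minimal single-process Python sketch of that loop, not a Hadoop job: the function names, the use of a center index rather than the center itself as the key, and the `max_iter` safeguard are assumptions introduced for this example.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_phase(vectors, centers):
    """Step (4-5): emit (index of nearest center, log vector) key-value pairs."""
    for vector in vectors:
        key = min(range(len(centers)), key=lambda k: euclidean(vector, centers[k]))
        yield key, vector


def reduce_phase(pairs, centers):
    """Step (4-6): average the vectors grouped under each key to update that center."""
    groups = defaultdict(list)
    for key, vector in pairs:
        groups[key].append(vector)
    updated = list(centers)  # a center that received no vectors is kept unchanged
    for key, members in groups.items():
        dim = len(members[0])
        updated[key] = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
    return updated


def cluster(vectors, initial_centers, max_iter=100):
    """Step (4-7): repeat map and reduce until the set of centers stops changing."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(vectors, centers), centers)
        if updated == centers:
            break
        centers = updated
    return centers
```

In a real MapReduce deployment each mapper would process one split of the log vector set and read the current centers from shared state (the "global variable set" of step (4-4)), while the grouping by key would be done by the framework's shuffle stage rather than by a dictionary.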
4. The MapReduce parallel clustering technology-based terminal security analysis method as claimed in any one of claims 1 to 3, wherein the process of selecting K log vectors from the log vector set comprises the following sub-steps:
(4-4-1) calculating the average Euclidean distance $d_{avg}$ of all log vectors in the log vector set;
(4-4-2) calculating, according to the average Euclidean distance $d_{avg}$ obtained in step (4-4-1), the point density of each log vector in the log vector set (Figure FDA0002913284740000021);
(4-4-3) for each log vector in the log vector set, obtaining the shortest distance corresponding to the log vector according to the point density of the log vector obtained in step (4-4-2);
(4-4-4) obtaining, for each log vector, the value given by Figure FDA0002913284740000035, and taking the K log vector points with the largest values as the clustering centers.
5. The MapReduce parallel clustering technology-based terminal security analysis method as set forth in claim 4, wherein the point density of each log vector is equal to:
Figure FDA0002913284740000031
wherein
Figure FDA0002913284740000032
$d(X_i, X_j)$ represents the Euclidean distance between log vector $X_i$ and log vector $X_j$, $i$ and $j$ both belong to $[1, n]$, and $n$ represents the total number of log vectors in the log vector set.
6. The MapReduce parallel clustering technology-based terminal security analysis method as set forth in claim 5, wherein:
for a log vector $X_i$, if its point density is the maximum of the point densities corresponding to all the log vectors, the shortest distance corresponding to the log vector (Figure FDA0002913284740000033) is equal to the maximum of the Euclidean distances from the log vector $X_i$ to each of the remaining log vectors in the log vector set;
for a log vector $X_i$, if its point density is not the maximum of the point densities corresponding to all the log vectors, the shortest distance corresponding to the log vector (Figure FDA0002913284740000034) is equal to the minimum of the Euclidean distances from the log vector $X_i$ to each log vector whose point density is greater than the point density of $X_i$.
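The selection of initial clustering centers in claims 4 to 6 can be sketched in Python as follows. The two rules for the shortest distance follow the wording of claim 6. The density and ranking formulas, however, appear only as images in this text, so the sketch substitutes two loudly labelled assumptions: the point density is taken to be the number of other log vectors closer than the average distance, and the ranking value is taken to be the product of point density and shortest distance, as in density-peak clustering. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(vectors):
    return [[euclidean(a, b) for b in vectors] for a in vectors]


def average_distance(dist):
    """Step (4-4-1): mean Euclidean distance over all distinct pairs of log vectors."""
    n = len(dist)
    total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def point_densities(dist, d_avg):
    """Step (4-4-2). ASSUMPTION: density = count of other vectors closer than d_avg."""
    n = len(dist)
    return [sum(1 for j in range(n) if j != i and dist[i][j] < d_avg) for i in range(n)]


def shortest_distances(dist, density):
    """Step (4-4-3), following the two cases described in claim 6."""
    n = len(dist)
    top = max(density)
    result = []
    for i in range(n):
        if density[i] == top:
            # Maximum-density vector: largest distance to any other vector.
            result.append(max(dist[i][j] for j in range(n) if j != i))
        else:
            # Otherwise: smallest distance to any vector of greater density.
            result.append(min(dist[i][j] for j in range(n) if density[j] > density[i]))
    return result


def select_initial_centers(vectors, k):
    """Step (4-4-4). ASSUMPTION: rank vectors by density * shortest distance."""
    dist = distance_matrix(vectors)
    density = point_densities(dist, average_distance(dist))
    delta = shortest_distances(dist, density)
    ranked = sorted(range(len(vectors)), key=lambda i: density[i] * delta[i], reverse=True)
    return [vectors[i] for i in ranked[:k]]
```

The sketch requires at least two log vectors. With the assumed cutoff density, several vectors can share the maximum density, in which case all of them fall under the first rule of claim 6; a smoother density function would change which vectors are ranked highest.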
7. A terminal security analysis system based on a MapReduce parallel clustering technology is characterized by comprising:
a first module, configured to acquire log data from a terminal and process the log data using a natural language processing library to obtain a plurality of word segments;
a second module, configured to filter the plurality of word segments obtained by the first module to obtain a plurality of filtered word segments;
a third module, configured to extract, using the term frequency-inverse document frequency (TF-IDF) algorithm, features of each word segment filtered by the second module, wherein all the features form a log vector $X$ corresponding to the log data, $X = (x_1, x_2, \ldots, x_{num})$, $x_m$ represents the $m$-th of all the features, $num$ represents the total number of features extracted from all the word segments, and $m \in [1, num]$;
a fourth module, configured to calculate the Euclidean distance between the log vector obtained by the third module and each of K preset clustering centers, acquire the clustering center corresponding to the minimum of all the Euclidean distances, and determine the final security level of the terminal according to that clustering center.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110092700.8A CN112765660A (en) 2021-01-25 2021-01-25 Terminal security analysis method and system based on MapReduce parallel clustering technology

Publications (1)

Publication Number Publication Date
CN112765660A (en) 2021-05-07

Family

ID=75706955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110092700.8A Pending CN112765660A (en) 2021-01-25 2021-01-25 Terminal security analysis method and system based on MapReduce parallel clustering technology

Country Status (1)

Country Link
CN (1) CN112765660A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965846A (en) * 2014-12-31 2015-10-07 深圳市华傲数据技术有限公司 Virtual human establishing method on MapReduce platform
US20160196174A1 (en) * 2015-01-02 2016-07-07 Tata Consultancy Services Limited Real-time categorization of log events
CN105740397A (en) * 2016-01-28 2016-07-06 广州市讯飞樽鸿信息技术有限公司 Big data parallel operation-based voice mail business data analysis method
CN109284371A (en) * 2018-09-03 2019-01-29 平安证券股份有限公司 Anti- fraud method, electronic device and computer readable storage medium
CN111489030A (en) * 2020-04-09 2020-08-04 河北利至人力资源服务有限公司 Text word segmentation based job leaving prediction method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jia Shufang (贾淑芳): "Query Expansion Based on User Log Clustering" (基于用户日志聚类的查询扩展), Wanfang Data Dissertation Database (万方数据学位论文库), 22 December 2010 (2010-12-22), pages 1-62 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113407656A (en) * 2021-06-24 2021-09-17 上海上讯信息技术股份有限公司 Method and equipment for fast online log clustering
CN113407656B (en) * 2021-06-24 2023-03-07 上海上讯信息技术股份有限公司 Method and equipment for fast online log clustering
CN113505823A (en) * 2021-07-02 2021-10-15 中国联合网络通信集团有限公司 Supply chain security analysis method and computer-readable storage medium
CN113505823B (en) * 2021-07-02 2023-06-23 中国联合网络通信集团有限公司 Supply chain security analysis method and computer readable storage medium
CN113486354A (en) * 2021-08-20 2021-10-08 国网山东省电力公司电力科学研究院 Firmware safety evaluation method, system, medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN110958220B (en) Network space security threat detection method and system based on heterogeneous graph embedding
CN109347801B (en) Vulnerability exploitation risk assessment method based on multi-source word embedding and knowledge graph
CN111428231B (en) Safety processing method, device and equipment based on user behaviors
CN112765660A (en) Terminal security analysis method and system based on MapReduce parallel clustering technology
CN111695597B (en) Credit fraud group identification method and system based on improved isolated forest algorithm
CN107273752B (en) Vulnerability automatic classification method based on word frequency statistics and naive Bayes fusion model
Xiao et al. From patching delays to infection symptoms: Using risk profiles for an early discovery of vulnerabilities exploited in the wild
CN102045358A (en) Intrusion detection method based on integral correlation analysis and hierarchical clustering
WO2017152877A1 (en) Network threat event evaluation method and apparatus
CN113609261B (en) Vulnerability information mining method and device based on knowledge graph of network information security
CN105072214A (en) C&C domain name identification method based on domain name feature
CN109376537B (en) Asset scoring method and system based on multi-factor fusion
CN110011976B (en) Network attack destruction capability quantitative evaluation method and system
CN115622738A (en) RBF neural network-based safety emergency disposal system and method
CN111159702B (en) Process list generation method and device
CN114091042A (en) Risk early warning method
CN115879017A (en) Automatic classification and grading method and device for power sensitive data and storage medium
CN112637108B (en) Internal threat analysis method and system based on anomaly detection and emotion analysis
CN109344913B (en) Network intrusion behavior detection method based on improved MajorCluster clustering
Harbola et al. Improved intrusion detection in DDoS applying feature selection using rank & score of attributes in KDD-99 data set
Kim et al. Comparative experiment on TTP classification with class imbalance using oversampling from CTI dataset
CN110719278A (en) Method, device, equipment and medium for detecting network intrusion data
CN114024761A (en) Network threat data detection method and device, storage medium and electronic equipment
CN113055368B (en) Web scanning identification method and device and computer storage medium
CN114866297A (en) Network data detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220701

Address after: 410000 No. 102, Heguang Road, Xianghu street, Furong district, Changsha City, Hunan Province

Applicant after: Hunan Kuangan Network Technology Co.,Ltd.

Address before: Yuelu District City, Hunan province 410082 Changsha Lushan Road No. 1

Applicant before: HUNAN University

Applicant before: Hunan kuang'an Network Technology Co., Ltd
