CN106960153B - Virus type identification method and device - Google Patents

Info

Publication number
CN106960153B
Authority
CN
China
Prior art keywords
virus
behavior data
vocabularies
word frequency
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610018316.2A
Other languages
Chinese (zh)
Other versions
CN106960153A (en)
Inventor
程利军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610018316.2A
Publication of CN106960153A
Application granted
Publication of CN106960153B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for identifying virus types. The method comprises the following steps: preprocessing behavior data to be detected of a virus to obtain a word frequency vector; obtaining the distance between the word frequency vector and the cluster center point of each category of viruses to obtain a plurality of distance values; and determining the category of the cluster center point corresponding to the minimum of the plurality of distance values as the type of the virus. The invention solves the technical problem in the related art that identifying viruses by feature value matching may produce false positives and false negatives.

Description

Virus type identification method and device
Technical Field
The invention relates to the field of information security, in particular to a virus type identification method and a virus type identification device.
Background
In the related art, viruses (e.g., worms) are generally identified by feature value matching in network traffic or by behavior-based identification techniques. However, a virus (e.g., a worm) may have variants for which the existing features fail, and matching against such failed features can produce false positives or false negatives. Identifying viruses (e.g., worms) by feature value matching in the related art is therefore problematic.
Disclosure of Invention
According to an aspect of the embodiments of the present application, there is provided a method for identifying a type of a virus, including: preprocessing behavior data to be detected of the virus to obtain a word frequency vector; obtaining the distance between the word frequency vector and the clustering center point of each category of the viruses to obtain a plurality of distance values; and determining the cluster center point corresponding to the minimum distance value in the plurality of distance values as the type of the virus.
According to another aspect of the embodiments of the present application, there is also provided a virus type identification apparatus, including: the first preprocessing module is used for preprocessing behavior data to be detected of the virus to obtain a word frequency vector; the first acquisition module is used for acquiring the distance between the word frequency vector and the clustering center point of each category of the viruses to obtain a plurality of distance values; and the determining module is used for determining the cluster center point corresponding to the minimum distance value in the plurality of distance values as the type of the virus.
In the embodiments of the application, the behavior data to be detected of the virus is preprocessed, the distance values between the resulting word frequency vector and the cluster center points of all categories are calculated, and the category of the cluster center point corresponding to the minimum distance value is determined as the type of the virus. In other words, the category of the virus to which the behavior data to be detected belongs is determined by calculating the distance between the data sample to be detected and the cluster center point of each category. Unknown variant samples can therefore be identified well, false positives and false negatives in virus identification are largely avoided, and the technical problem in the related art that virus identification by feature value matching may produce false positives and false negatives is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a block diagram of the hardware configuration of a computer terminal for a virus type identification method according to an embodiment of the present application;
FIG. 2 is a first flowchart of a virus type identification method according to embodiment 1 of the present application;
FIG. 3 is a second flowchart of a virus type identification method according to embodiment 1 of the present application;
FIG. 4 is a flowchart of obtaining worm behavior data according to an alternative embodiment of the present application;
FIG. 5 is a flowchart of classifying collected data according to an alternative embodiment of the present application;
FIG. 6 is a first block diagram of the structure of a virus type identification device according to an embodiment of the present application;
FIG. 7 is a second block diagram of the structure of a virus type identification device according to an embodiment of the present application;
FIG. 8 is a third block diagram of the structure of a virus type identification device according to an embodiment of the present application;
FIG. 9 is a fourth block diagram of the structure of a virus type identification device according to an embodiment of the present application;
FIG. 10 is a block diagram of the structure of a computer terminal according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, terms related to the present application are explained as follows:
Honeypot: a computer system running on the Internet that contains certain vulnerabilities and can lure viruses into intruding.
Example 1
According to an embodiment of the present application, a method embodiment for virus type identification is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.
The method provided by the embodiment 1 of the present application can be executed in a mobile terminal, a computer terminal or a similar computing device. Taking an example of the method running on a computer terminal, fig. 1 is a hardware structure block diagram of a computer terminal of the virus type identification method according to the embodiment of the present application. As shown in fig. 1, the computer terminal 10 may include one or more (only one shown) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission device 106 for communication functions. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the virus type identification method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, implementing the above-mentioned virus type identification method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network Interface Controller (NIC) that can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 can be a Radio Frequency (RF) module, which is used to communicate with the Internet wirelessly.
Under the above operating environment, the present application provides a method for identifying the type of a virus as shown in fig. 2. Fig. 2 is a first flowchart of a virus type identification method according to embodiment 1 of the present application, and as shown in fig. 2, the method includes steps S202 to S206:
step S202, preprocessing behavior data to be detected of the virus to obtain a word frequency vector;
In an optional embodiment of the present application, each element of the word frequency vector represents the word frequency of a word. The words may be obtained directly by performing word segmentation on the behavior data to be detected, or by performing word segmentation and then filtering the result. In the latter case, step S202 may be expressed as: performing word segmentation on the behavior data to be detected to obtain a plurality of words; filtering the plurality of words according to a preset rule to obtain filtered words; acquiring the word frequencies of the filtered words; and forming the word frequencies of the filtered words into the word frequency vector, where the number of filtered words equals the dimension of the word frequency vector. That is, preprocessing consists of segmenting and filtering the behavior data to be detected to obtain the word frequencies of the required words and assembling those word frequencies into a single vector, the word frequency vector. For example, if the word frequency vector is (1, 2, 3), there are 3 words, whose word frequencies are 1, 2, and 3, respectively.
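The construction of a word frequency vector can be illustrated with a minimal Python sketch. It assumes whitespace-delimited behavior data and raw occurrence counts, matching the (1, 2, 3) example; the function names, the sample command string, and the keyword list are hypothetical and are not taken from the patent.

```python
from collections import Counter

def segment(behavior_data):
    """Split raw behavior data into words (assumption: whitespace-delimited)."""
    return behavior_data.split()

def word_frequency_vector(behavior_data, vocabulary):
    """Return one count per vocabulary word, in vocabulary order.

    The vector dimension equals the number of (filtered) words kept.
    """
    counts = Counter(segment(behavior_data))
    return tuple(counts[word] for word in vocabulary)

# Hypothetical sample: three kept words, so the vector has three dimensions.
sample = "wget busybox chmod busybox wget busybox"
print(word_frequency_vector(sample, ["chmod", "wget", "busybox"]))  # (1, 2, 3)
```

Dividing each count by the total number of words would give relative frequencies instead of raw counts; either choice works as long as training and identification use the same one.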
The virus may be a trojan horse virus or a worm virus, but is not limited thereto.
It should be noted that the preset rule may filter out words that have little influence on the subsequent similarity calculation. Specifically, the preset rule may be implemented in one of the following ways: filtering the plurality of words according to word type; or acquiring the tf-idf (term frequency-inverse document frequency) value of each word and filtering the plurality of words according to the tf-idf values.
It should be noted that words may be of two types, regular and irregular. Irregular words may be words containing special symbols, such as words beginning with '-' or '>', but are not limited thereto; which words count as irregular may be preset according to actual needs. Because irregular words have little influence on the subsequent similarity calculation, they may be filtered out directly.
The word frequency of a word is the number of occurrences of the word divided by the total number of words obtained by segmenting the behavior data to be detected, and is a numerical value. The larger this value is for a word, the greater the influence of the word on the subsequent similarity calculation, so words with smaller values should be filtered out. Specifically, filtering the plurality of words according to tf-idf values may include: arranging the tf-idf values of the plurality of words in ascending or descending order; in the case of ascending order, taking the last N words as the filtered words; in the case of descending order, taking the first N words as the filtered words; where N equals the dimension of the word frequency vector. Filtering the plurality of words according to tf-idf values may also include: randomly selecting N tf-idf values and taking the N words corresponding to the selected tf-idf values as the filtered words, where N equals the dimension of the word frequency vector. The filtering is not limited to these approaches.
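The two filtering rules can be sketched together as follows. This is an illustrative sketch only: the patent does not state how per-document tf-idf values are combined into one score per word, so summing them across documents is an assumption, as are the function names and the plain `log(N / df)` form of idf.

```python
import math
from collections import Counter

def is_irregular(word):
    """Irregular words per the description: those beginning with '-' or '>'."""
    return word.startswith(("-", ">"))

def select_keywords(documents, n):
    """Return the n words with the largest summed tf-idf score.

    documents: list of token lists, one per piece of behavior data.
    Summing tf-idf over documents is an assumption made for this sketch.
    """
    docs = [[w for w in doc if not is_irregular(w)] for doc in documents]
    docs = [doc for doc in docs if doc]
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = Counter()
    for doc in docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)                       # term frequency
            idf = math.log(len(docs) / doc_freq[word])  # inverse document frequency
            scores[word] += tf * idf
    # Descending order of score; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n]

# Hypothetical tokenized samples.
docs = [["wget", "-O", "busybox", "wget"],
        ["ls", "busybox", ">", "out"],
        ["wget", "chmod", "busybox"]]
print(select_keywords(docs, 2))
```

A word that appears in every sample, such as `busybox` here, gets an idf of zero and therefore ranks last, which is the usual reason for preferring tf-idf over raw frequency when picking keywords.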
In an embodiment of the present application, fig. 3 is a flowchart of a virus type identification method according to embodiment 1 of the present application, and as shown in fig. 3, before step S202, the method further includes:
Step S302, acquiring the behavior data to be detected from a honeypot that provides an operating system containing vulnerabilities.
It should be noted that the honeypot is a computer system running on the internet, and includes a certain vulnerability, and can trap viruses to invade. The intrusion virus can be analyzed and processed by collecting intrusion data. Of course, the honeypot may be used for other purposes, such as finding an attack, generating an alarm, etc., but is not limited thereto. In the embodiment of the present application, the behavior data to be tested may be understood as intrusion data obtained after a virus intrudes into a honeypot, and the behavior data to be tested has operation behavior characteristics of the virus.
It should be noted that the behavior data to be detected need not be acquired from a honeypot; it may also be extracted from network traffic or obtained from a sandbox at the network end. Specifically, obtaining the behavior data from a network-end sandbox may proceed as follows: a virus entity is captured from network traffic and placed in the sandbox, and the behavior data of the virus (i.e., the behavior data to be detected) is then obtained from the sandbox.
It should be noted that, after step S302 and before step S202, the method may further include normalizing (standardizing) the acquired behavior data to be detected. Normalization may take the form of treating data that share the same five-tuple and fall within a preset time period as one group of data; it may also take the form of treating data from the same honeypot, the same IP address, and the same time as one whole piece of data; but it is not limited thereto.
The above quintuple may include: source IP address, source port, destination IP address, destination port, and transport layer protocol.
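Grouping by five-tuple within a preset time period might be sketched as below. The record field names, the fixed 60-second bucketing, and the joining of payloads with a space are assumptions made for illustration; the patent does not specify a record layout or how the time period is delimited.

```python
from collections import defaultdict

def group_by_five_tuple(records, window_seconds=60):
    """Merge payloads that share a five-tuple and fall in the same time window.

    records: iterable of dicts with the hypothetical keys src_ip, src_port,
    dst_ip, dst_port, protocol, timestamp (seconds) and payload.
    """
    groups = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        five_tuple = (rec["src_ip"], rec["src_port"], rec["dst_ip"],
                      rec["dst_port"], rec["protocol"])
        bucket = int(rec["timestamp"] // window_seconds)  # fixed-size window index
        groups[(five_tuple, bucket)].append(rec["payload"])
    return {key: " ".join(payloads) for key, payloads in groups.items()}

# Hypothetical records using documentation-range addresses.
base = {"src_ip": "192.0.2.1", "src_port": 4000, "dst_ip": "198.51.100.7",
        "dst_port": 23, "protocol": "tcp"}
records = [dict(base, timestamp=5, payload="a"),
           dict(base, timestamp=30, payload="b"),
           dict(base, timestamp=70, payload="c"),
           dict(base, src_port=4001, timestamp=10, payload="d")]
for key, merged in group_by_five_tuple(records).items():
    print(key, merged)
```

Each merged string can then be passed to word segmentation as one piece of behavior data.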
Step S204, obtaining the distance between the word frequency vector and the clustering center point of each category of the viruses to obtain a plurality of distance values;
The distance between the word frequency vector and the cluster center point of each category may be a Euclidean distance, an absolute distance, a Mahalanobis distance, or the like, but is not limited thereto. The way the distance value is calculated depends on the distance type used. For example, for the Euclidean distance, the distance value can be calculated as
$$d = \sqrt{\sum_{i=1}^{N} \left(x_{i1} - x_{i2}\right)^{2}}$$
where d represents the distance, i = 1, 2, …, N, N represents the dimension of the word frequency vector, and x_{i1} and x_{i2} denote the i-th coordinates of the two points between which the distance is calculated, i.e., the corresponding elements of the word frequency vectors.
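As a minimal sketch, the Euclidean distance and the absolute (sum of coordinate differences) distance between two N-dimensional word frequency vectors could be computed as follows. The function names are hypothetical; the Mahalanobis distance is omitted because it additionally requires a covariance matrix estimated from the data.

```python
import math

def _check_dims(x1, x2):
    if len(x1) != len(x2):
        raise ValueError("vectors must have the same dimension")

def euclidean_distance(x1, x2):
    """Square root of the summed squared coordinate differences."""
    _check_dims(x1, x2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def absolute_distance(x1, x2):
    """Sum of absolute coordinate differences."""
    _check_dims(x1, x2)
    return sum(abs(a - b) for a, b in zip(x1, x2))

print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0
print(absolute_distance((1, 2, 3), (4, 6, 3)))   # 7
```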
In an embodiment of the present application, the above categories are obtained by applying a clustering algorithm to a large amount of collected virus behavior data. Specifically, before step S204, the method may further include: collecting statistics on behavior data of the virus; applying the same processing as the preprocessing to the behavior data to obtain processed behavior data; and classifying the processed behavior data with a preset clustering algorithm to obtain each category.
It should be noted that the behavior data of the virus may also be obtained from the honeypot, and the preset clustering algorithm may be a K-means clustering algorithm, a hierarchical clustering algorithm, a self-organizing map (SOM) clustering algorithm, or a fuzzy C-means (FCM) clustering algorithm, but is not limited thereto.
Collecting, preprocessing, and then training on and classifying the behavior data of the virus yields the categories. In other words, a large amount of virus behavior sample data is trained on and analyzed by machine learning, so that the false negatives and false positives that arise when viruses are identified by feature value matching can be avoided to a great extent.
Taking classification of the behavior data with the K-means clustering algorithm as an example, the classification process can be expressed as follows: A, assuming there are k classes, randomly generating k cluster center points (the initial cluster center points); B, calculating the distance from each item in the data set to each of the k points, and assigning each item to the class of its nearest point; C, recalculating the cluster center of each resulting class; and D, repeating B and C until each new centroid (cluster center) equals the previous centroid or differs from it by less than a specified threshold, at which point the algorithm ends and the classification is complete.
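Steps A to D can be sketched in plain Python as below. This is an illustrative sketch, not the patented implementation: drawing the initial centers from the data itself, keeping the old center for an empty class, and the `threshold`, `max_iter`, and `seed` parameters are all assumptions.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(data, k, threshold=1e-9, max_iter=100, seed=0):
    """Return (centers, clusters) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step A: pick k initial centers (assumption: sampled from the data).
    centers = [tuple(p) for p in rng.sample(data, k)]
    clusters = []
    for _ in range(max_iter):
        # Step B: assign every vector to the class of its nearest center.
        clusters = [[] for _ in range(k)]
        for point in data:
            nearest = min(range(k), key=lambda j: euclidean(point, centers[j]))
            clusters[nearest].append(point)
        # Step C: recompute each center; an empty class keeps its old center.
        new_centers = [centroid(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        # Step D: stop once no center moved by more than the threshold.
        shift = max(euclidean(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= threshold:
            break
    return centers, clusters

# Hypothetical 2-dimensional word frequency vectors forming two groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(data, k=2)
print(sorted(centers))
```

The returned centers play the role of the cluster center points of the categories against which new samples are later compared.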
Step S206, determining the cluster center point corresponding to the minimum distance value in the distance values as the type of the virus.
Through the above steps, the behavior data to be detected of the virus is preprocessed, the distance values between the resulting word frequency vector and the cluster center points of all categories are calculated, and the category of the cluster center point corresponding to the minimum distance value is determined as the type of the virus. In other words, the category of the virus to which the behavior data to be detected belongs is determined by calculating the distance between the data sample to be detected and the cluster center point of each category. Unknown variant samples can therefore be identified well, false positives and false negatives in virus identification are largely avoided, and the technical problem in the related art that virus identification by feature value matching may produce false positives and false negatives is solved.
After step S206, the method may further include: after the type of the virus is determined, an alert message is generated. And related personnel can be informed to correspondingly process the type of virus through alarm information.
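Steps S204 to S206 and the subsequent alert might look like the sketch below. The category labels, the center coordinates, and the wording of the alert message are hypothetical; the Euclidean distance is used, but any of the distance types mentioned above could be substituted.

```python
import math

def classify(vector, centers):
    """Return (label, distance) for the nearest cluster center.

    centers: dict mapping a category label to its cluster center point.
    """
    distances = {
        label: math.sqrt(sum((a - b) ** 2 for a, b in zip(vector, center)))
        for label, center in centers.items()
    }
    label = min(distances, key=distances.get)  # minimum distance value wins
    return label, distances[label]

def make_alert(label, distance):
    """Build a human-readable alert message (format is an assumption)."""
    return "ALERT: sample classified as %s (distance %.3f)" % (label, distance)

# Hypothetical cluster centers obtained from a previous training run.
centers = {"type_a": (0.0, 0.0, 0.0), "type_b": (5.0, 5.0, 5.0)}
label, distance = classify((4, 5, 6), centers)
print(make_alert(label, distance))
```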
For example, when the virus is a worm virus, the worm generally propagates over the telnet protocol, which is a plaintext transmission protocol; the method is therefore applicable to intrusion detection on plaintext traffic at the network end.
It should be noted that, since a given kind of virus may have many types, the method can be used to determine which type the virus in the data to be analyzed belongs to. It is not limited to this, however, and can also be used to determine whether the data to be analyzed is normal.
For a better understanding of the present application, the following further explains embodiments of the present application in conjunction with alternative embodiments.
The embodiment of the present application provides an optional virus type identification method, and a flow of the optional virus type identification method is described below by taking the virus as a worm virus as an example.
The alternative virus type identification method comprises the following steps: intercepting traffic from the router; normalizing the data, i.e., treating data in the traffic that share the same five-tuple and fall within one minute as one group of data; processing the data and building a vector (corresponding to the word frequency vector in the above embodiment) based on the keywords obtained in the training process; calculating the distance between the vector and each of the K known center points (corresponding to the cluster center points of the respective categories in the above embodiment); determining that the worm belongs to the class whose center point is at the smallest distance; and issuing an alarm.
In particular, the preferred method may comprise the steps of:
(1) acquisition of worm behavior data:
There are various schemes for acquiring worm behavior; in this preferred embodiment, acquisition is performed in honeypots. An example of the acquired behavior data is as follows:
[Figure GDA0002735898450000071: example of acquired behavior data]
fig. 4 is a flowchart of acquiring worm behavior data according to an alternative embodiment of the present application, and as shown in fig. 4, the acquiring worm behavior data process includes:
step S402, Attack login (ssh/telnet), i.e., the worm logs in to the honeypot via ssh or telnet;
step S404, Exec * bash, i.e., the worm enters the honeypot;
step S406, Get the pid, i.e., obtaining the ID number of the process;
step S408, Find the IP (this pid), i.e., finding the IP of the worm according to the process ID number;
step S410, Store in the db (IP, time, content), i.e., storing in the database the IP where the worm is located, the time at which the worm attacked the honeypot, and the content of the worm's attack on the honeypot.
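The final storage step (S410) could be sketched with Python's built-in `sqlite3` module as follows. The table name, column names, and the choice of SQLite are assumptions for illustration; the patent only states that the IP, time, and content are stored in a database.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a behavior store; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS behavior "
        "(ip TEXT, attack_time REAL, content TEXT)"
    )
    return conn

def store_behavior(conn, ip, attack_time, content):
    """Record one attack: source IP, attack time, and attack content."""
    conn.execute(
        "INSERT INTO behavior (ip, attack_time, content) VALUES (?, ?, ?)",
        (ip, attack_time, content),
    )
    conn.commit()

conn = open_store()
store_behavior(conn, "192.0.2.10", 1452556800.0, "wget http://example.invalid/x; chmod +x x")
print(conn.execute("SELECT ip, attack_time, content FROM behavior").fetchall())
```

Parameterized queries are used so that attacker-controlled content is stored verbatim rather than interpreted as SQL.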
(2) Preprocessing worm behavior data:
In the honeypot, a large amount of data of this kind is obtained, and the data is preprocessed first. Data from the same IP, the same honeypot, and the same time are collected into one whole piece of data. Then, in preparation for calculating the similarity of the data, word segmentation is performed and irregular words, such as those beginning with '-' or '>', are removed. The tf-idf values of the remaining words are then calculated and compared, the top 20 words are taken as the dimensions of a 20-dimensional vector, and the word frequency of each of these words in each piece of data is calculated to form a 20-dimensional word frequency vector.
(3) Classifying the data set with a clustering algorithm:
Taking the k-means clustering algorithm as an example, the data are classified, where the distance between two points is the Euclidean distance, although other distance formulas may be used instead. (Note: the clustering algorithm is not limited to k-means.) The classification process is as follows:
a, assuming there are k classes, randomly generating k center points;
b, calculating the distance from each item in the data set to each of the k points, and assigning each item to the class of its nearest point;
c, recalculating the center of each resulting class;
d, repeating b and c until each new centroid equals the previous centroid or differs from it by less than a specified threshold, at which point the algorithm ends.
(4) The identification process (corresponding to step S202 to step S206 in embodiment 1 described above).
After the center points are obtained, when new data appears, it is preprocessed as described above, the distance between it and each center point is calculated, and the class of the nearest center point is found; that class is the type of the worm.
Items (1) to (3) of this alternative embodiment correspond to the following process in embodiment 1 above: collecting statistics on behavior data of the virus; applying the same processing as the preprocessing to the behavior data to obtain processed behavior data; and classifying the processed behavior data with a preset clustering algorithm to obtain each category.
It can thus be seen that the above alternative embodiment is mainly divided into two stages: a training and classification stage and an identification stage. Training and classification stage: after data is collected by means of honeypots, the collected data needs to be trained on (i.e., classified). Fig. 5 is a flowchart of classifying collected data according to an alternative embodiment of the present application; as shown in Fig. 5, the process mainly includes:
step S501, standardizing the data (namely, standardizing the behavior data of worms);
step S502, performing word segmentation on the data and removing stop-word-like words;
step S503, selecting keywords (corresponding to the filtered words in embodiment 1 above) by calculating tf-idf values;
step S504, building N-dimensional word frequency vectors from the selected keywords;
step S505, processing the data with a clustering algorithm;
step S506, obtaining k cluster center points.
Wherein, step S505 includes:
step S505-1, randomly generating k initial cluster center points;
step S505-2, calculating the distance from each piece of data to each of the k cluster center points, and assigning each piece of data to the class of its nearest center point, thereby dividing the data into k classes;
step S505-3, recalculating the cluster center point of each class;
step S505-4, judging whether the recalculated cluster center points no longer change, and if so, finishing the classification; otherwise, returning to step S505-2.
Identification stage: after data is obtained, it is standardized, word segmentation is performed, and a word frequency vector is built; the distances between the vector and the k center points are then calculated, and the worm corresponding to the data is of the type whose center point is nearest.
It should be noted that the approach used in process (2) above and in steps S501 to S504 also applies to normalizing the data, segmenting it into words, and building word frequency vectors in the identification stage. Steps S502 to S504 correspond to step S202 in embodiment 1 described above.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the virus type identification method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
According to an embodiment of the present invention, there is further provided an apparatus for implementing the method for identifying a type of a virus, where fig. 6 is a first structural block diagram of the apparatus for identifying a type of a virus according to the embodiment of the present application, and as shown in fig. 6, the apparatus includes:
the first preprocessing module 62 is configured to preprocess behavior data to be detected of a virus to obtain a word frequency vector;
the virus may be a trojan horse virus or a worm virus, but is not limited thereto.
The preprocessing may be completed by a single module (e.g., the first preprocessing module 62) or by several sub-modules or units included in that module. In the latter case, the first preprocessing module 62 may include only a word segmentation unit, configured to perform word segmentation on the behavior data to be detected, obtain the word frequencies of the resulting words, and form those word frequencies into the word frequency vector. Alternatively, it may include both a word segmentation unit and a filtering unit. Specifically, as shown in fig. 7, the first preprocessing module 62 includes: a word segmentation unit 72, configured to perform word segmentation on the behavior data to be detected to obtain a plurality of words; and a filtering unit 74, connected to the word segmentation unit 72, configured to filter the plurality of words according to a preset rule to obtain filtered words, obtain the word frequencies of the filtered words, and form those word frequencies into the word frequency vector, where the number of filtered words is equal to the dimension of the word frequency vector. That is, the word segmentation unit 72 segments the behavior data to be detected and the filtering unit 74 filters the resulting words, which completes the preprocessing performed by the first preprocessing module 62 and yields a word frequency vector that represents the behavior data to be detected.
It should be noted that the elements in the word frequency vector represent the word frequencies of the words obtained through the above preprocessing, and for example, if the word frequency vector is (1,2,3), the number of the words is 3, and the word frequencies of the 3 words are 1,2, and 3, respectively.
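As a small illustration of how such a vector could be built, the following Python sketch counts how often each word of a fixed vocabulary occurs in a token list. The function name `word_frequency_vector`, the sample tokens, and the use of raw counts (rather than counts divided by the total number of words) are assumptions for the example.

```python
from collections import Counter


def word_frequency_vector(tokens, vocabulary):
    """Return one word frequency per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    # Words of the vocabulary that never occur get a frequency of 0.
    return [counts[word] for word in vocabulary]


# Hypothetical segmented behavior data and a 3-word vocabulary.
tokens = ["sh", "chmod", "chmod", "wget", "wget", "wget"]
vocabulary = ["sh", "chmod", "wget"]
vector = word_frequency_vector(tokens, vocabulary)  # [1, 2, 3]
```

The vocabulary fixes both the dimension of the vector and the meaning of each position, so vectors built from different samples remain comparable.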
It should be noted that the preset rule may be to filter out words that have little influence on the subsequent similarity calculation. Specifically, the filtering unit 74 may complete the filtering through a plurality of sub-units; for example, as shown in fig. 7, the filtering unit 74 may include at least one of the following: a first filtering subunit 720, configured to filter the plurality of words according to the types of the words; and a second filtering subunit 722, configured to obtain the term frequency-inverse document frequency (tf-idf) value of each word and filter the plurality of words according to the tf-idf values.
It should be noted that words may be classified as regular or irregular. Irregular words may be words containing special symbols, such as words beginning with '-' or '>', but are not limited thereto; which words count as irregular can be preset according to actual needs. Since irregular words have little influence on the subsequent similarity calculation, they can be filtered out directly by the first filtering subunit 720.
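A type-based filter of this kind might look like the following Python sketch. The function name `drop_irregular` and the default prefix list are assumptions; the text only gives '-' and '>' as examples and leaves the actual rule to be preset as needed.

```python
def drop_irregular(words, prefixes=("-", ">")):
    """Remove words that start with any of the given special-symbol prefixes."""
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of prefixes
    return [word for word in words if not word.startswith(prefixes)]


# Hypothetical segmented command line.
words = ["wget", "-O", ">", "/tmp/x", ">>log"]
regular = drop_irregular(words)  # ["wget", "/tmp/x"]
```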
The word frequency of a word is its number of occurrences divided by the total number of words obtained by segmenting the behavior data to be detected; it is a numerical value. The larger the word frequency of a word, the greater its influence on the subsequent similarity calculation, so words with smaller word frequencies should be filtered out. Specifically, the second filtering subunit 722 may complete the filtering through a plurality of subunits, and may include: a sorting unit 7220, configured to sort the tf-idf values of the plurality of words in ascending or descending order; and a filtering subunit 7222, connected to the sorting unit 7220, configured to take the first to Nth words as the filtered words in the case of ascending order, or to take the last to Nth words as the filtered words in the case of descending order; where N is equal to the dimension of the word frequency vector.
It should be noted that the second filtering subunit 722 may be further configured to randomly select N tf-idf values, and use N vocabularies corresponding to the selected N tf-idf values as filtered vocabularies, where N is equal to the dimension of the word frequency vector. But is not limited thereto.
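The tf-idf based selection can be sketched in Python as below. Several points are assumptions rather than details taken from the text: the function names `tf_idf` and `select_keywords`, the plain `log(number of documents / document frequency)` form of idf (one common variant among several), and the decision to keep the N highest-scoring words.

```python
import math
from collections import Counter


def tf_idf(document, corpus):
    """Map each word of `document` to tf * idf.

    `document` is a list of words; `corpus` is a list of such documents and
    is assumed to contain `document` itself.
    """
    counts = Counter(document)
    total = len(document)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        # Number of documents containing the word; fall back to 1 to avoid
        # division by zero if `document` is not part of `corpus`.
        df = sum(1 for doc in corpus if word in doc) or 1
        scores[word] = tf * math.log(len(corpus) / df)
    return scores


def select_keywords(scores, n):
    """Keep the n words with the highest tf-idf (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:n]


# Hypothetical corpus of three segmented behavior records.
corpus = [["wget", "chmod", "sh"], ["wget", "ls"], ["wget", "cat"]]
scores = tf_idf(corpus[0], corpus)
keywords = select_keywords(scores, 2)
```

In this toy corpus "wget" occurs in every record, so its idf and therefore its tf-idf is zero and it is not selected; N, the number of selected words, becomes the dimension of the word frequency vector.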
In an embodiment of the present application, fig. 8 is a third structural block diagram of the virus type identification apparatus according to an embodiment of the present application, and as shown in fig. 8, the apparatus may further include:
a second obtaining module 80, connected to the first preprocessing module 62, for obtaining the behavior data of the sample to be tested from a honeypot that provides an operating system having vulnerabilities.
It should be noted that the honeypot is a computer system running on the internet, and includes a certain vulnerability, and can trap viruses to invade. The intrusion virus can be analyzed and processed by collecting intrusion data. Of course, the honeypot may be used for other purposes, such as finding an attack, generating an alarm, etc., but is not limited thereto. In the embodiment of the present application, the behavior data to be tested may be understood as intrusion data obtained after a virus intrudes into a honeypot, and the behavior data to be tested has operation behavior characteristics of the virus.
It should be noted that the second obtaining module 80 is not limited to obtaining the behavior data of the sample to be tested from a honeypot; the data may also be extracted from network traffic or obtained from a sandbox at the network end. Specifically, to obtain the behavior data of the virus (i.e., the behavior data of the sample to be tested) from a sandbox at the network end, the virus entity may be captured from the network traffic and placed into the sandbox, where its behavior data are collected.
It should be noted that the first preprocessing module 62 may also be configured to normalize the acquired behavior data of the sample to be tested. Normalization may take the following forms: data that share the same five-tuple and fall within a preset time are processed as one group of data; or data from the same honeypot, the same IP address and the same time are processed as one whole piece of data; but normalization is not limited thereto.
The above quintuple may include: source IP address, source port, destination IP address, destination port, and transport layer protocol.
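One way to realize the five-tuple based normalization is sketched below in Python. The record layout (dictionaries with the keys `src_ip`, `src_port`, `dst_ip`, `dst_port`, `proto`, `ts` and `payload`), the function name `group_by_five_tuple`, and the use of fixed, non-overlapping time windows for the "preset time" are all assumptions for illustration.

```python
from collections import defaultdict


def group_by_five_tuple(records, window_seconds=60):
    """Merge payloads that share a five-tuple and fall in the same time window."""
    groups = defaultdict(list)
    for rec in records:
        five_tuple = (rec["src_ip"], rec["src_port"],
                      rec["dst_ip"], rec["dst_port"], rec["proto"])
        # Assumption: the preset time is modelled as a fixed window index.
        window = int(rec["ts"] // window_seconds)
        groups[(five_tuple, window)].append(rec["payload"])
    # Each group becomes one piece of data for later word segmentation.
    return {key: " ".join(payloads) for key, payloads in groups.items()}
```

Records that arrive on the same connection within one window are concatenated in arrival order, so a multi-packet command sequence is segmented and vectorized as a single sample.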
A first obtaining module 64, connected to the first preprocessing module 62, for obtaining distances between the word frequency vectors and the clustering center points of each category of viruses to obtain a plurality of distance values;
the distance between the word frequency vector and the clustering center point of each classification may be a Euclidean distance, an absolute distance, a Mahalanobis distance, or the like, but is not limited thereto. Correspondingly, different distance types are calculated in different ways; for example, if the distance is the Euclidean distance, the distance value can be calculated as
$$d = \sqrt{\sum_{i=1}^{N} \left(x_{i1} - x_{i2}\right)^{2}}$$
where $d$ represents the distance, $i = 1, 2, \ldots, N$, $N$ represents the dimension of the word frequency vector, and $x_{i1}$ and $x_{i2}$ respectively represent the $i$-th coordinates of the two points between which the distance is calculated, i.e., the corresponding elements of the two word frequency vectors.
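For concreteness, two of the distance types mentioned above can be computed in Python as follows. The function names are hypothetical, and reading "absolute distance" as the sum of absolute coordinate differences (city-block distance) is an assumption.

```python
import math


def euclidean_distance(x1, x2):
    """Square root of the summed squared differences of corresponding elements."""
    if len(x1) != len(x2):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def absolute_distance(x1, x2):
    """Sum of absolute differences of corresponding elements (assumed meaning)."""
    if len(x1) != len(x2):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(a - b) for a, b in zip(x1, x2))
```

For the vectors (0, 0) and (3, 4), the Euclidean distance is 5 and the absolute distance is 7.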
In an embodiment of the present application, the above classifications are obtained by applying a clustering algorithm to a large amount of collected virus behavior data. Specifically, fig. 9 is a fourth structural block diagram of the virus type identification apparatus according to an embodiment of the present application, and as shown in fig. 9, the apparatus further includes: a statistics module 90, configured to collect statistics on the behavior data of the virus; a second preprocessing module 92, connected to the statistics module 90, configured to apply the same processing as the above preprocessing to the behavior data to obtain processed behavior data; and a classification module 94, connected to the second preprocessing module 92, configured to classify the processed behavior data by using a predetermined clustering algorithm to obtain the classifications.
It should be noted that the behavior data of the virus is also obtained by the honeypot, and the predetermined clustering algorithm may be a K-means clustering algorithm, a hierarchical clustering algorithm, a self-organizing map SOM clustering algorithm, or a fuzzy C-means FCM clustering algorithm, but is not limited thereto.
Through the statistics module 90, the second preprocessing module 92 and the classification module 94, a large amount of virus behavior sample data are trained on and analyzed by machine learning, which can largely avoid the false positives and false negatives that occur when viruses are identified by characteristic value matching.
And a determining module 66, connected to the first obtaining module 64, for determining the classification of the cluster center point corresponding to the minimum distance value among the plurality of distance values as the type of the virus.
With the above apparatus, the first preprocessing module 62, the first obtaining module 64 and the determining module 66 preprocess the behavior data to be detected of the virus, calculate the distance values between the resulting word frequency vector and the clustering center point of each classification, and determine the classification of the clustering center point corresponding to the minimum distance value as the type of the virus. In other words, the apparatus determines which classification the virus corresponding to the behavior data to be detected belongs to by calculating the distance between the data sample to be detected and the clustering center point of each classification. Samples of unknown variants can therefore be identified well, which reduces the possibility of false positives and false negatives in virus identification and solves the technical problem in the related art that identifying viruses by characteristic value matching produces false positives and false negatives.
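The decision made by the first obtaining module 64 and the determining module 66 amounts to a nearest-center lookup, which can be sketched in Python as follows. The function name `classify`, the dictionary layout of `centers`, the type labels, and the choice of Euclidean distance are assumptions for the example.

```python
import math


def classify(vector, centers):
    """Return the label of the nearest cluster center and all distance values.

    `centers` maps a virus type label to its cluster center point, which has
    the same dimension as `vector`.
    """
    distances = {
        label: math.sqrt(sum((a - b) ** 2 for a, b in zip(vector, center)))
        for label, center in centers.items()
    }
    # The type is the classification whose center has the minimum distance value.
    nearest = min(distances, key=distances.get)
    return nearest, distances


# Hypothetical cluster centers obtained from training.
centers = {"type_a": [1, 0, 0], "type_b": [0, 5, 5]}
virus_type, distances = classify([1, 1, 0], centers)
```

Here the sample vector lies at distance 1 from the center of "type_a" and much farther from "type_b", so it is assigned to "type_a".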
It should be noted that the above apparatus may further include an alarm module, configured to generate alarm information after the type of the virus is determined. The alarm information can notify the relevant personnel to handle the virus accordingly.
It should be noted that the above apparatus can be applied to intrusion detection on the network side. For example, when the virus is a worm virus, since worm viruses generally propagate over the telnet protocol, which is a plaintext transmission protocol, the apparatus can be applied to network-side intrusion detection for plaintext transmissions.
It should be noted that, since a given type of virus may have many variants, the apparatus can be used to determine which type the virus in the data to be analyzed belongs to; the invention is not limited thereto, however, and the apparatus can also be used to determine whether the data to be analyzed is normal.
The above modules may be included in one processor or distributed across a plurality of processors, but are not limited thereto.
Example 3
The embodiment of the invention can provide a computer terminal which can be any computer terminal device in a computer terminal group. Optionally, in this embodiment, the computer terminal may also be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the computer terminal may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the computer terminal may execute the program code of the following steps in the method for identifying a type of a virus of an application program: preprocessing behavior data to be detected of the virus to obtain a word frequency vector; obtaining the distance between the word frequency vector and the clustering center point of each category of the viruses to obtain a plurality of distance values; and determining the cluster center point corresponding to the minimum distance value in the plurality of distance values as the type of the virus.
Optionally, fig. 10 is a block diagram of a computer terminal according to an embodiment of the present invention. As shown in fig. 10, the computer terminal A may include: one or more processors 1002 (only one of which is shown), a memory 1004, a transmission device 1006, and a server 1008.
The memory 1004 may be used to store software programs and modules, such as program instructions/modules corresponding to the virus type identification method and apparatus in the embodiment of the present application, and the processor 1002 executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, implementing the virus type identification method. The memory 1004 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 1004 may further include memory located remotely from the processor, which may be connected to terminal a through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The server 1008 interacts with the computer terminal a via a network.
The processor 1002 may invoke the information and applications stored in the memory via the transmission device 1006 to perform the following steps: preprocessing behavior data to be detected of the virus to obtain a word frequency vector; obtaining the distance between the word frequency vector and the clustering center point of each classification of the virus to obtain a plurality of distance values; and determining the classification of the cluster center point corresponding to the minimum distance value among the plurality of distance values as the type of the virus.
The embodiment of the present application thus provides a scheme for identifying the type of a virus. The behavior data to be detected of the virus are preprocessed, the distance values between the resulting word frequency vector and the clustering center point of each classification are calculated, and the classification of the clustering center point corresponding to the minimum distance value is determined as the type of the virus. In other words, the classification to which the virus corresponding to the behavior data to be detected belongs is determined by calculating the distance between the data sample to be detected and the clustering center point of each classification. Samples of unknown variants can therefore be identified well, which reduces the possibility of false positives and false negatives in virus identification and solves the technical problem in the related art that identifying viruses by characteristic value matching produces false positives and false negatives.
It can be understood by those skilled in the art that the structure shown in fig. 10 is only an illustration, and the computer terminal may also be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, etc. Fig. 10 does not limit the structure of the electronic device. For example, the computer terminal may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 10, or have a different configuration from that shown in fig. 10.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Example 4
The embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store a program code executed by the virus type identification method provided in embodiment 1.
Optionally, in this embodiment, the storage medium may be located in any one of computer terminals in a computer terminal group in a computer network, or in any one of mobile terminals in a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: preprocessing behavior data to be detected of the virus to obtain a word frequency vector; obtaining the distance between the word frequency vector and the clustering center point of each category of the viruses to obtain a plurality of distance values; and determining the cluster center point corresponding to the minimum distance value in the plurality of distance values as the type of the virus.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A method for identifying a type of a virus, comprising:
preprocessing behavior data to be detected of viruses to obtain word frequency vectors, wherein elements in the word frequency vectors represent word frequencies of vocabularies, and the vocabularies are obtained by segmenting the behavior data to be detected;
obtaining the distance between the word frequency vector and the clustering center point of each category of the viruses to obtain a plurality of distance values;
determining the classification of the clustering center point corresponding to the minimum distance value in the plurality of distance values as the type of the virus;
preprocessing the behavior data to be detected of the virus to obtain a word frequency vector comprises the following steps:
performing word segmentation processing on the behavior data to be detected to obtain a plurality of words;
filtering the vocabularies according to a preset rule to obtain filtered vocabularies, acquiring the word frequencies of the filtered vocabularies, and forming the word frequencies of the filtered vocabularies into the word frequency vectors; and the number of the filtered words is equal to the dimension of the word frequency vector.
2. The method of claim 1, wherein filtering the plurality of words according to a predetermined rule comprises at least one of:
filtering the plurality of vocabularies according to the types of the vocabularies;
and acquiring a word frequency tf-idf value of the vocabulary, and filtering a plurality of vocabularies according to the tf-idf value.
3. The method of claim 2, wherein filtering a plurality of the words according to the tf-idf value comprises:
sorting the tf-idf values of a plurality of the words in ascending order or descending order;
under the condition of ascending arrangement, taking the vocabulary from the first to the Nth as the vocabulary after the filtering processing;
under the condition of descending order, taking the vocabulary from the last to the Nth as the vocabulary after the filtering processing;
wherein N is equal to the dimension of the word frequency vector.
4. The method of claim 1, wherein prior to preprocessing the behavior data of the virus to be tested, the method further comprises:
acquiring the behavior data to be detected from a honeypot that provides an operating system having vulnerabilities.
5. The method of claim 1, wherein the respective classifications are derived by:
counting behavior data of the virus;
the behavior data is processed in the same way as the preprocessing, so that the processed behavior data is obtained;
and classifying the preprocessed behavior data by adopting a preset clustering algorithm to obtain each classification.
6. The method of claim 5, wherein the predetermined clustering algorithm comprises at least one of:
a K-means clustering algorithm, a hierarchical clustering algorithm, a self-organizing map (SOM) clustering algorithm, and a fuzzy C-means (FCM) clustering algorithm.
7. An apparatus for identifying a type of a virus, comprising:
a first preprocessing module, configured to preprocess behavior data to be detected of a virus to obtain a word frequency vector, wherein elements in the word frequency vector represent word frequencies of vocabularies, and the vocabularies are obtained by segmenting the behavior data to be detected;
the first acquisition module is used for acquiring the distance between the word frequency vector and the clustering center point of each category of the viruses to obtain a plurality of distance values;
a determining module, configured to determine, as the type of the virus, a classification where a cluster center point corresponding to a minimum distance value of the multiple distance values is located;
the first preprocessing module is further used for performing word segmentation processing on the behavior data to be detected to obtain a plurality of words; filtering the vocabularies according to a preset rule to obtain filtered vocabularies, acquiring the word frequencies of the filtered vocabularies, and forming the word frequencies of the filtered vocabularies into the word frequency vectors; and the number of the filtered words is equal to the dimension of the word frequency vector.
8. The apparatus of claim 7, wherein the first preprocessing module is further configured to filter a plurality of the vocabularies according to types of the vocabularies; or acquiring a word frequency tf-idf value of the vocabulary, and filtering a plurality of vocabularies according to the tf-idf value.
CN201610018316.2A 2016-01-12 2016-01-12 Virus type identification method and device Active CN106960153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610018316.2A CN106960153B (en) 2016-01-12 2016-01-12 Virus type identification method and device

Publications (2)

Publication Number Publication Date
CN106960153A CN106960153A (en) 2017-07-18
CN106960153B true CN106960153B (en) 2021-01-29

Family

ID=59481394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610018316.2A Active CN106960153B (en) 2016-01-12 2016-01-12 Virus type identification method and device

Country Status (1)

Country Link
CN (1) CN106960153B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562618A (en) * 2017-08-07 2018-01-09 北京奇安信科技有限公司 A kind of shellcode detection method and device
CN109522915B (en) * 2017-09-20 2022-08-23 腾讯科技(深圳)有限公司 Virus file clustering method and device and readable medium
CN107609400A (en) * 2017-09-28 2018-01-19 深信服科技股份有限公司 Computer virus classification method, system, equipment and computer-readable recording medium
CN109117635B (en) * 2018-09-06 2023-07-04 腾讯科技(深圳)有限公司 Virus detection method and device for application program, computer equipment and storage medium
CN109558467B (en) * 2018-12-07 2020-09-15 国网江苏省电力有限公司常州供电分公司 Method and system for identifying user category of electricity utilization
CN112395612A (en) * 2019-08-15 2021-02-23 中兴通讯股份有限公司 Malicious file detection method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604363A (en) * 2009-07-10 2009-12-16 珠海金山软件股份有限公司 Computer rogue program categorizing system and sorting technique based on the file instruction frequency
CN101604365A (en) * 2009-07-10 2009-12-16 珠海金山软件股份有限公司 Determine the system and method for number of computer rogue program sample families
US20110219002A1 (en) * 2010-03-05 2011-09-08 Mcafee, Inc. Method and system for discovering large clusters of files that share similar code to develop generic detections of malware
CN102968591A (en) * 2012-11-21 2013-03-13 中国人民解放军国防科学技术大学 Malicious-software characteristic clustering analysis method and system based on behavior segment sharing
CN103761477A (en) * 2014-01-07 2014-04-30 北京奇虎科技有限公司 Method and equipment for acquiring virus program samples
US20140165198A1 (en) * 2012-10-23 2014-06-12 Verint Systems Ltd. System and method for malware detection using multidimensional feature clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
聚类算法在手机病毒入侵检测中的研究与实现;范茂;《中国优秀硕士学位论文全文数据库 信息科技辑》;20120815(第08期);第18-19,38-49页 *

Also Published As

Publication number Publication date
CN106960153A (en) 2017-07-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant