CN106713335B - Malicious software identification method and device - Google Patents

Info

Publication number
CN106713335B
Authority
CN
China
Prior art keywords
malware
url
family
software
specified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611265807.3A
Other languages
Chinese (zh)
Other versions
CN106713335A (en
Inventor
於大维
董浩
谢军
陆骋怀
尚进
蒋东毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hillstone Networks Co Ltd
Original Assignee
Hillstone Networks Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hillstone Networks Co Ltd
Priority to CN201611265807.3A
Publication of CN106713335A
Application granted
Publication of CN106713335B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2463/00 Additional details relating to network architectures or network communication protocols for network security covered by H04L63/00
    • H04L2463/144 Detection or countermeasures against botnets

Abstract

The invention discloses a method and a device for identifying malicious software. Under the technical scheme of the application, the network behaviors of a large number of malware samples are collected in advance, and URL features are extracted from them to establish a preset rule. The URL features of the network behavior of the designated software to be detected are then collected and compared against the URL features of the malware to determine whether the designated software is malware. With this scheme, malware can be identified rapidly and accurately, which solves the technical problem in the related art that malware cannot be identified conveniently and effectively.

Description

Malicious software identification method and device
Technical Field
The invention relates to the field of single-chip microcomputers, in particular to a method and a device for identifying malicious software.
Background
In the related art, malware refers to viruses, worms, and Trojan horses that perform malicious tasks on computer systems and take control of infected hosts by subverting software processes. Malware combines multiple threats. An infected host is often controlled by a hacker's command and control server and becomes part of a botnet (BotNet), a group of computers on the internet that is centrally controlled by the hacker. Botnets are often used to launch large-scale network attacks, such as distributed denial of service (DDoS) attacks and mass spam, and the hacker can also obtain the information stored on the controlled computers. Botnets are therefore a severe threat to both the safe operation of networks and the protection of user data.
A computer infected with a Trojan horse is under the hacker's control, and the hacker can manipulate it at will, like a puppet. At the heart of a botnet is the command and control (C&C) mechanism. There is a communication channel between the controlled host and the hacker of which the host's user is unaware. Through this channel the hacker sends commands to the controlled host, uploads files, initiates attacks, and so on. C&C uses various communication mechanisms, with HTTP as the main communication protocol.
Based on analysis of the network behavior of a large amount of malware, technicians can build an effective model to identify communication between a controlled host and a botnet, thereby discovering hosts infected by malware. Once infected hosts are found through their network behavior, they can be isolated and cleaned in time, reducing the loss caused by the threat. This is a challenging area, and a number of techniques address the problem.
In the related art, there are roughly two ways to solve the above problem:
The first is to build a blacklist database of C&C connection websites. Infected hosts are detected by monitoring and reporting each host's network usage through Uniform Resource Locator (URL) filtering. The technique builds a library of feature fields for known C&C connections, and a connection is judged to be a C&C connection by matching the parameters of the actual connection website against the library. Its advantages are simplicity, accuracy, and a low false alarm rate. However, it has the following disadvantages. The feature field library is static and cannot cope with slight changes in the parameters of the connected website. If every feature field is added to the library, the library becomes large and inefficient. The library must be updated promptly to stay current. Although the feature fields are derived from the network behavior of malware samples, not all of those connections are malicious, and some connection features are very close to those of normal, harmless connections, so features must be screened and selected to reduce false positives. The effort required to maintain and update the feature library is therefore relatively large.
The second is to build a model with supervised machine learning from the network behavior of malware samples and from normal, harmless network behavior. A large number of positive and negative samples are collected and labeled, and a model is then built with a supervised learning method such as regression or random forest. The model can judge the parameters of a connected website. Because it uses machine learning, the technique has some dynamic adaptability and some ability to identify malware that has not been encountered before but resembles known samples. However, it has the following problems: 1) An effective supervised machine learning model needs a large number of representative normal samples to learn from. Normal samples are so numerous and so varied in form that comprehensive, representative sampling is difficult, while malicious samples are comparatively few, so the resulting model produces many false alarms. 2) Supervised machine learning places high demands on the accuracy of sample labels. If positive samples are labeled inaccurately, the accuracy of the model is seriously affected.
For the technical problem in the related art that malware cannot be identified conveniently and effectively, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the invention provide a method and a device for identifying malicious software, so as to at least solve the technical problem in the related art that malware cannot be identified conveniently and effectively.
According to an embodiment of the present invention, there is provided a malware identification method, including: acquiring a Uniform Resource Locator (URL) corresponding to a network behavior generated by running of specified software; and determining whether the designated software is the malicious software according to a preset rule and the characteristic dimension of the URL, wherein the preset rule is determined according to the characteristic dimension of the URL corresponding to the network behaviors generated by the malicious software.
Optionally, the preset rule is: when the similarity between the feature dimension of the URL of the designated software and that of the malware is greater than a preset threshold, the designated software is determined to be malware, wherein the URL is the URL corresponding to the network behavior generated by the software, and the feature dimension of the URL of the malware is the URL feature dimension shared by all malware in the designated malware family.
Optionally, the similarity of the feature dimensions of the URLs among the members in the designated malicious family is higher than a preset value, the designated malicious family is a preset set of malicious software, and the feature dimensions of the URLs of the network behaviors of the software are acquired by the following method: acquiring respective network behaviors of the plurality of software, and analyzing and acquiring a URL (uniform resource locator) in each network behavior; splitting parameters of the URL according to key-value pairs to obtain a plurality of parameter segments, assigning values to the parameter segments of the key-value pairs to map to an n-dimensional space, and then obtaining integer vectors of the URL, wherein the integer vectors of the URL are characteristic dimensions of the URL, the parameter segments are codes of the key-value pairs, n represents the dimension number of the characteristic space, and n is an integer.
Optionally, when the similarity between the URL feature dimension of the specified software and that of the malware in a family cluster is greater than a preset threshold, the specified software is determined to be malware of that family cluster, wherein the family cluster is a cluster within the specified malware family and is obtained as follows: the malware in the specified malware family is divided into a plurality of family clusters by aggregating (clustering) the feature dimensions of the URLs, and, where a plurality of malware families exist, family clusters with high similarity across different malware families are merged into one family cluster.
Optionally, after the malware in the specified malware family is divided into a plurality of family clusters by aggregating the feature dimensions of the URLs, the category of the network behavior of the malware in each family cluster is determined, and the category of the family cluster is determined accordingly, where the category includes one of: C&C connection, file download, and advertisement click.
Optionally, malware in the specified malware family is updated periodically.
According to another embodiment of the present invention, there is also provided a malware identification apparatus including:
the acquisition module is used for acquiring a Uniform Resource Locator (URL) corresponding to a network behavior generated by the specified software during operation;
the determining module is used for determining whether the designated software is the malicious software according to preset rules and the characteristic dimension of the URL, wherein the preset rules are determined according to the characteristic dimension of the URL corresponding to the network behaviors generated by the malicious software.
Optionally, the determining module is further configured to determine that the specified software is malware when the similarity of the feature dimension of the URL between the specified software and the malware is greater than a preset threshold, where the URL is a URL corresponding to a network behavior generated by the software, and the feature dimension of the URL of the malware is a feature dimension of a URL that all malware in a specified malware family have in common.
Optionally, the similarity of the feature dimensions of URLs among members in the specified malicious family is higher than a preset value, the specified malicious family is a preset set of malicious software, and the obtaining module is further configured to obtain the feature dimensions of URLs of network behaviors of the software by: acquiring respective network behaviors of a plurality of software, and analyzing and acquiring a URL (uniform resource locator) in each network behavior; splitting parameters of the URL according to key-value pairs to obtain a plurality of parameter segments, assigning values to the parameter segments of the key-value pairs to map to an n-dimensional space, and then obtaining integer vectors of the URL, wherein the integer vectors of the URL are characteristic dimensions of the URL, the parameter segments are codes of the key-value pairs, n represents the dimension number of the characteristic space, and n is an integer.
Optionally, the determining module is further configured to determine that the specified software is malware of a family cluster when the similarity between the URL feature dimension of the specified software and that of the malware in the family cluster is greater than a preset threshold, where the family cluster is a cluster within the specified malware family and is obtained as follows: the malware in the specified malware family is divided into a plurality of family clusters by aggregating (clustering) the feature dimensions of the URLs, and, where a plurality of malware families exist, family clusters with high similarity across different malware families are merged into one family cluster.
In the embodiments of the invention, the network behaviors of a large number of malware samples are collected in advance and URL features are extracted from them to establish the preset rule. The URL features of the network behavior of the designated software to be detected are then collected and compared against the URL features of the malware to determine whether the designated software is malware. With this scheme, malware can be identified rapidly and accurately, which solves the technical problem in the related art that malware cannot be identified conveniently and effectively.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of a method of malware identification according to an embodiment of the present invention;
FIG. 2 is a flow chart of establishing a model for detecting malicious intent, in accordance with a preferred embodiment of the present invention;
FIG. 3 is a block diagram of a malware recognition apparatus according to a preferred embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present invention, a method for malware identification is provided. It is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.
Fig. 1 is a flowchart of a method for identifying malware according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step S102, acquiring a Uniform Resource Locator (URL) corresponding to a network behavior generated when the specified software runs;
Step S104, determining whether the designated software is malware according to a preset rule and a feature dimension of the URL, wherein the preset rule is determined according to the feature dimension of the URL corresponding to the network behavior generated by the malware.
Through these steps, the network behaviors of a large number of malware samples are collected in advance and URL features are extracted from them to establish the preset rule. The URL features of the network behavior of the designated software to be detected are then collected and compared against the URL features of the malware to determine whether the designated software is malware. With this scheme, malware can be identified rapidly and accurately, which solves the technical problem in the related art that malware cannot be identified conveniently and effectively.
Optionally, the preset rule is: when the similarity between the feature dimension of the URL of the designated software and that of the malware is greater than a preset threshold, the designated software is determined to be malware, wherein the URL is the URL corresponding to the network behavior generated by the software, and the feature dimension of the URL of the malware is the URL feature dimension shared by all malware in the designated malware family. It should be added that the feature dimension of a URL is an integer vector (explained further below), so the similarity between the feature dimensions of two URLs may be calculated with any technique for computing similarity between vectors in the related art. Whether two vectors are close is judged with a distance formula. The formula may be chosen according to the characteristics of the application, but the chosen formula must be used consistently in all calculation steps.
Optionally, the similarity of the feature dimensions of the URLs among the members in the specified malicious family is higher than a preset value, the specified malicious family is a preset set of malicious software, and the feature dimensions of the URLs of the network behaviors of the software are acquired by the following method: acquiring respective network behaviors of the plurality of software, and analyzing and acquiring a URL (uniform resource locator) in each network behavior; splitting the parameter of the URL according to the key-value pair to obtain a plurality of parameter segments, assigning the parameter segments of the key-value pair to an n-dimensional space, and then obtaining an integer vector of the URL, wherein the integer vector of the URL is the characteristic dimension of the URL, the parameter segments are codes of the key-value pair, n represents the dimension number of the characteristic space, and n is an integer.
Optionally, when the similarity between the URL feature dimension of the specified software and that of the malware in a family cluster is greater than a preset threshold, the specified software is determined to be malware of that family cluster, wherein the family cluster is a cluster within the specified malware family and is obtained as follows: the malware in the specified malware family is divided into a plurality of family clusters by aggregating (clustering) the feature dimensions of the URLs, and, where a plurality of malware families exist, family clusters with high similarity across different malware families are merged into one family cluster.
Optionally, after the malware in the specified malware family is divided into a plurality of family clusters by aggregating the feature dimensions of the URLs, the category of the network behavior of the malware in each family cluster is determined, and the category of the family cluster is determined accordingly, where the category includes one of: C&C connection, file download, and advertisement click. It should be added that the categories of family clusters are not limited to these examples.
Optionally, malware in the specified malware family is updated periodically. Sources of updates may include new malware samples, feedback of device detection results, and adaptation to the device deployment environment.
The following detailed description is given with reference to preferred embodiments of the present invention.
The preferred embodiment of the invention adopts a novel method to extract feature dimensions from the parameters of network connection URLs, based on the network behavior characteristics of malware families. An unsupervised machine learning aggregation (clustering) method is applied to the extracted dimensions to derive accurate features and effectively remove incorrect samples and noise, so that the generated model has very high accuracy and a low false alarm rate.
The principle of the method is to make full use of some essential characteristics of how malware develops and evolves. One such characteristic is the reuse of code modules. Many malware samples are variants created by modifying existing malware, and some malware source code is even sold on underground black markets. As a result, the website features of malicious connections are largely preserved across variants of the same malware family. While this commonality is retained, there is also a degree of drift and divergence. A reliable and efficient detection method needs to extract the common similarities and remove the non-common noise.
Because modules are reused, certain network connection behaviors of malware in one family may also closely resemble those of other families. Similar features of cross-family malware are identified and combined into one feature, which shrinks the feature identification model and improves detection efficiency.
By adopting the technical scheme in the preferred embodiment of the invention, the following three technical problems are mainly solved:
1. Distinguishing malicious connections from non-malicious connections in malware samples: the collected network behavior of a large amount of malware is the basis of our analysis and modeling. Not all of a sample's network behavior consists of malicious connections; many of the connections also appear in normal software. Determining which connections are malicious is a technical challenge.
2. Effective feature engineering: the evolution of malware produces a large number of variants, which makes it difficult to extract identifying features. To solve this problem, the feature dimensions and the matching method need a degree of fuzziness, so that unknown variants can be identified in addition to known samples.
3. Machine learning modeling that does not rely on benign samples: the method has little need for benign samples, which are used only for model cleaning to reduce the false alarm rate.
Fig. 2 is a flowchart of a method for establishing a malware detection model according to a preferred embodiment of the present invention. As shown in Fig. 2, it includes the following steps:
Step one, collecting and classifying malware, running it in a sandbox, and collecting its network behavior.
Step two, parsing the network behavior across the various protocols and extracting the URL features of the HTTP protocol.
Step three, splitting the parameters of each URL into key-value pairs, where each key-value pair is encoded as integer values in a space of fixed size n. A URL can be converted into an integer vector in this way, and this vector is the feature dimension of the URL. The number of feature dimensions produced by this mathematical conversion is fixed. This step is a process of dimension extraction and dimension reduction.
Step four, using an aggregation (clustering) method to group the feature dimensions of all URLs of one malware family into a plurality of family clusters of similar URLs. Sample URLs that do not satisfy the aggregation condition are treated as noise and removed. Each family cluster is a feature signature of this malware family, and a malware family may have one or more such cluster features. Unlike the traditional string matching method, the aggregation similarity is calculated with a formula specific to the preferred embodiment of the invention, which can effectively reflect the physical meaning of the URL parameters.
Controlling the conditions of aggregation allows distinguishing truly generic malicious URLs from false malicious URLs that are not statistically significant. In the process of later updating, as the samples increase, the false malicious URLs can also become meaningful true malicious URLs due to the addition of new samples.
Step five, aggregating the family clusters of multiple malware families together across families. Similar clusters from different families may be combined into one cluster. Clusters shared across families reflect the modularization and code reuse that characterize malware.
Step six, cleaning the model with a large number of normal traffic connections in order to reduce false alarms. This step uses the same similarity calculation method, and clusters similar to benign connections are removed from the model. This step also assigns category labels to clusters. For some malicious connections, our analysis can determine the type of connection, such as C&C connection, file download, or advertisement click. These categorized URLs can be matched to clusters with the same similarity matching method. For a connection that matches such a cluster, the specific type of malicious connection can therefore be reported, giving the user a more accurate understanding.
Step seven, issuing the mathematical feature model of the clusters generated in step six to the deployed intelligent firewall devices. Each firewall device uses the model to inspect the HTTP URLs of user traffic. A detected malicious connection generates a threat event that is shown to the user on the firewall device. At the same time, metadata of the inspection result is uploaded to the cloud for further analysis.
Step eight, updating the model. The method continuously updates the malware samples and periodically repeats steps one to six to build a new model. Meanwhile, the cloud can perform big data analysis on the detection data uploaded by all devices and correct the model according to the feedback. The updated model is issued to the devices as an upgrade.
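Steps five and six can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the preferred embodiment: the Manhattan distance, the running-mean merge, the `radius` parameter, the `"unknown"` label, and the function names `merge_across_families` and `clean_and_label` are all introduced here for illustration only.

```python
def manhattan(u, v):
    """Assumed distance formula; the text leaves the choice to the application."""
    return sum(abs(a - b) for a, b in zip(u, v))


def merge_across_families(family_centers, radius):
    """Step five (sketch): merge cluster centers that lie within `radius`."""
    merged = []  # each entry: {"center": [...], "families": set, "count": int}
    for family, centers in family_centers.items():
        for c in centers:
            for m in merged:
                if manhattan(c, m["center"]) <= radius:
                    k = m["count"]
                    # Running mean keeps the merged center representative.
                    m["center"] = [(x * k + y) / (k + 1)
                                   for x, y in zip(m["center"], c)]
                    m["count"] = k + 1
                    m["families"].add(family)
                    break
            else:
                merged.append({"center": list(c), "families": {family}, "count": 1})
    return merged


def clean_and_label(centers, benign_vectors, labelled_examples, radius):
    """Step six (sketch): drop clusters near benign traffic, label the rest."""
    model = []
    for c in centers:
        if any(manhattan(c, b) <= radius for b in benign_vectors):
            continue  # too close to normal traffic: remove to limit false alarms
        label = "unknown"
        for vec, category in labelled_examples:
            if manhattan(c, vec) <= radius:
                label = category
                break
        model.append((c, label))
    return model


# Toy 2-dimensional data; real vectors would have n dimensions.
family_centers = {"A": [[1, 0], [5, 5]], "B": [[1, 1], [9, 9]]}
merged = merge_across_families(family_centers, radius=1)
model = clean_and_label(
    [m["center"] for m in merged],
    benign_vectors=[[5, 5]],
    labelled_examples=[([1, 1], "C&C connection")],
    radius=1,
)
```

In this toy run the clusters `[1, 0]` (family A) and `[1, 1]` (family B) merge into one cross-family cluster, the cluster at `[5, 5]` is removed because it coincides with a benign vector, and the merged cluster receives the category of the labelled example it matches.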
The model generated by the preferred embodiment of the invention is small, fast to compute, accurate, and covers a large share of malware. Preliminary data shows that the applicants extracted a model of about nine thousand clusters from over one million malware samples. The model detects 85% of known malicious samples, and its coverage of malware families exceeds 90%, while the false alarm rate on benign samples is negligibly low. More importantly, the method of the preferred embodiment also has a high detection rate for unknown malware variants, so that threats can be stopped before they cause harm.
The following are specific examples of preferred embodiments of the invention.
The preferred embodiment of the invention is a novel method that performs big data analysis on the URLs to which malware connects and extracts the commonalities of malicious connections in order to build a mathematical model. The resulting model is small, has low computational complexity, and detects efficiently and quickly.
To use the method of the preferred embodiment, the user first needs to collect the network behavior of a large number of malware samples. A common practice is to run the malware in a sandbox environment while capturing its network packets. Malware is typically classified into malicious families, each containing different variants. We group malicious connections by family, including the variants under the same family, which allows the commonalities between different variants of the same family to be extracted. The following example shows two malicious connections under the family Trojan[Rootkit]/Win32.Small. (In actual data, a family may have tens of thousands of connections.)
Family: Trojan[Rootkit]/Win32.Small
First malicious connection: http://domain.com/conn?user=joe&ver=2.0&key=123abc
Second malicious connection: http://cc.domain/connpath?key=123456&user=jane&ver=3.5
For the first connection, three parameter pairs are extracted: user=joe, ver=2.0, key=123abc
For the second connection, three parameter pairs are extracted: key=123456, user=jane, ver=3.5
(Each parameter is in key=value format.)
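The parameter extraction above can be sketched with Python's standard library, assuming the example connections are written in the usual `key=value&key=value` query-string form. The helper name `extract_params` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def extract_params(url):
    """Return the (key, value) pairs found in the query string of a URL."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps parameters that carry an empty value.
    return parse_qsl(query, keep_blank_values=True)


first = extract_params("http://domain.com/conn?user=joe&ver=2.0&key=123abc")
second = extract_params("http://cc.domain/connpath?key=123456&user=jane&ver=3.5")
```

Note that the pairs come back in the order they appear in the URL, so the two connections list the same keys in different orders.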
Millions of parameter key-value pairs may be extracted for actual data. From these parameter pairs, feature dimension extraction is required. The extraction of features reflects the nature of the malicious connection changes. For the parameter of ver 2.0, it is converted into ver 2.0, ver numerical (type of 2.0), string (type of ver) 2.0. Each transformed value is then mapped onto a finite space of fixed dimension n. Assuming that n is 100, the four values map to (5,32,91, 99). The specific mapping formula may be determined by the actual application.
The mapping from a connection URL to a vector in n-dimensional space, that is, the method for extracting feature dimensions and reducing their number, is one of the key technical points of the preferred embodiment of the present invention. Its function is to convert hundreds of millions of parameters into a form that facilitates machine learning over big data.
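One possible way to realize this mapping is a hashing scheme, sketched below. The text leaves the mapping formula to the application, so the MD5-based bucket function, the regular expression used to decide whether a token is numeric, and all function names here are assumptions made for illustration. The sketch generates only the three transformed forms that the text spells out for each parameter.

```python
import hashlib
import re

N_DIMS = 100  # n, the fixed number of feature dimensions (assumed value)

def value_type(token):
    """Classify a token as 'numeric' or 'string' (hypothetical type rule)."""
    return "numeric" if re.fullmatch(r"[+-]?\d+(\.\d+)?", token) else "string"

def expand_features(key, value):
    """Turn one key=value pair into its transformed feature strings."""
    return [
        f"{key}={value}",               # literal pair, e.g. ver=2.0
        f"{key}={value_type(value)}",   # key with value type, e.g. ver=numeric
        f"{value_type(key)}={value}",   # key type with value, e.g. string=2.0
    ]

def bucket(feature, n=N_DIMS):
    """Map a feature string to a dimension index in [0, n) with a stable hash."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def vectorize(params, n=N_DIMS):
    """Build an n-dimensional integer count vector from (key, value) pairs."""
    vector = [0] * n
    for key, value in params:
        for feature in expand_features(key, value):
            vector[bucket(feature, n)] += 1
    return vector
```

A cryptographic digest is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would make vectors from different runs incomparable.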
Assuming that a malicious family has M connection URLs, we obtain an M × n matrix. Clustering is performed on this matrix, and similar vectors are merged into clusters. This reduces the M × n matrix to p clusters. Each cluster is represented by its center point, an n-dimensional vector. By adjusting the clustering parameters, p can be kept within a small and effective range.
Whether two vectors are close is judged by a distance formula. The distance formula may be chosen according to the specific characteristics of the application, but the chosen formula must be used consistently in all calculation steps.
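The clustering step can be illustrated with a simple greedy, radius-based procedure. The text does not name a clustering algorithm or a distance formula, so both the "leader"-style clustering and the Euclidean distance below are assumptions chosen for brevity; the `radius` parameter stands in for the clustering parameter that controls p.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed formula)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(vectors, radius, distance=euclidean):
    """Greedily group vectors; return the list of cluster center points."""
    clusters = []  # each entry: {"center": [...], "members": [...]}
    for v in vectors:
        best, best_d = None, radius
        for c in clusters:
            d = distance(v, c["center"])
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            # No existing center is within the radius: start a new cluster.
            clusters.append({"center": list(v), "members": [v]})
        else:
            # Join the nearest cluster and recompute its center as the mean.
            best["members"].append(v)
            m = len(best["members"])
            best["center"] = [sum(col) / m for col in zip(*best["members"])]
    return [c["center"] for c in clusters]
```

A larger `radius` merges more vectors per cluster and so yields a smaller p. The same `distance` function should be passed to every later step, in line with the consistency requirement stated above.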
The p n-dimensional cluster center vectors constitute the mathematical model for threat detection. For an unknown connection, for example http://somewhere/connection?user&version=2.3, we extract feature dimensions from the connection parameters and vectorize them by the same method. The distance between the vector under test and the clusters of the model is then calculated, and if it satisfies the distance threshold, the connection is judged to be malicious.
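The detection step amounts to a nearest-center lookup followed by a threshold test, as in the sketch below. The Euclidean distance, the function names, and the example centers and threshold are illustrative assumptions; in practice the distance formula must be the one used when the clusters were built.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed formula)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_cluster(vector, centers, distance=euclidean):
    """Return (index, distance) of the closest cluster center."""
    best_index, best_distance = None, math.inf
    for i, center in enumerate(centers):
        d = distance(vector, center)
        if d < best_distance:
            best_index, best_distance = i, d
    return best_index, best_distance

def is_malicious(vector, centers, threshold, distance=euclidean):
    """Flag the vector if it lies within `threshold` of any cluster center."""
    _, d = nearest_cluster(vector, centers, distance)
    return d <= threshold

# Hypothetical two-dimensional model with p = 2 cluster centers.
model_centers = [[0.0, 0.5], [10.0, 10.5]]
suspicious = is_malicious([0, 1], model_centers, threshold=1.0)
benign = is_malicious([5, 5], model_centers, threshold=1.0)
```

Because only p center vectors are compared, the cost per URL is proportional to p × n, which is what keeps the model small and detection fast.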
The detection model provided by the preferred embodiment of the invention can be deployed in the egress firewall of a company network to detect outbound connections from intranet hosts. The preferred embodiment of the invention can detect an intranet host downloading malware, plug-ins and the like from a malicious website; requests from infected hosts to a botnet control center can also be detected. The detection result can prompt the company's IT department to analyze the suspicious host further, for example with an antivirus scan. If the infection is confirmed, further protective measures such as isolation can be taken. The firewall may also be configured with automatic policies that mitigate the harm from the suspicious host, such as limiting its network connections or blocking file uploads, so as to avoid damage caused by trojans and viruses.
The unique and novel clustering technique of the preferred embodiment of the invention makes feedback to the model and updating of the model simple. The user may mark a detection as a false positive so that similar URLs are no longer reported. Cloud-based model updates can also quickly propagate user feedback to more devices.
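One way such false-positive feedback could work is to remember the vectors that users have marked and to suppress reports for anything close to them. The text does not describe the feedback mechanism, so the `FeedbackFilter` class, its `radius` parameter, and the Euclidean distance below are purely hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed formula)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class FeedbackFilter:
    """Suppress reports for URLs similar to user-marked false positives."""

    def __init__(self, radius, distance=euclidean):
        self.radius = radius
        self.distance = distance
        self.false_positives = []

    def mark_false_positive(self, vector):
        """Record a vector that the user has identified as a false positive."""
        self.false_positives.append(list(vector))

    def should_report(self, vector):
        """Report only if the vector is not close to any marked false positive."""
        return all(
            self.distance(vector, fp) > self.radius
            for fp in self.false_positives
        )
```

The list of marked vectors is small and self-contained, so it is the kind of data that a cloud update could distribute to other devices.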
In the related art, advanced threats pose a significant challenge to enterprise information security because they are diverse, persistent and difficult to detect. The technical scheme in the preferred embodiment of the invention can provide protection for enterprises and governments by building on the existing firewall to detect hosts that are infected and controlled by malware in the shortest possible time. Enterprises can then take effective measures before harm occurs, reducing losses in every respect.
By adopting the technical scheme in the preferred embodiment of the invention, the following effects are realized:
1. Correct and effective malicious URL connections are extracted from the network behaviors of all malware samples, benign noise is removed, and a fast, efficient detection model is established. The detection model uses similarity matching, so it adapts effectively to variations in the network connections of malware and helps to detect unknown variants.
2. The detection method is fast and efficient. The preferred embodiment of the invention uses a novel conversion formula to convert parameter strings, which in theory have infinitely many possibilities, into a fixed number of mathematical dimensions, thereby effectively addressing the curse of dimensionality and making fast, efficient detection possible.
Example 2
Fig. 3 is a block diagram of a malware identification apparatus according to a preferred embodiment of the present invention. As shown in Fig. 3, the apparatus includes:
an obtaining module 32, configured to obtain a uniform resource locator URL corresponding to a network behavior generated by the specified software during running;
a determining module 34, connected to the obtaining module 32, for determining whether the specified software is malware according to a preset rule and a feature dimension of the URL, where the preset rule is determined according to the feature dimension of the URL corresponding to network behaviors generated by a plurality of malware.
Optionally, the determining module 34 is further configured to determine that the specified software is malware if the similarity of the feature dimension of the URL between the specified software and the malware is greater than a preset threshold, where the URL is a URL corresponding to a network behavior generated by the software, and the feature dimension of the URL of the malware is a feature dimension of a URL that all malware in a specified malware family have in common.
Optionally, the similarity of the feature dimension of the URL between the members in the specified malicious family is higher than a preset value, the specified malicious family is a set of preset malicious software, and the obtaining module 32 is further configured to obtain the feature dimension of the URL of the network behavior of the software by: acquiring respective network behaviors of the malicious software, and analyzing and acquiring a URL (uniform resource locator) in each network behavior; splitting the parameter of the URL according to the key-value pair to obtain a plurality of parameter segments, assigning the parameter segments of the key-value pair to an n-dimensional space, and then obtaining an integer vector of the URL, wherein the integer vector of the URL is the characteristic dimension of the URL, the parameter segments are codes of the key-value pair, n represents the dimension number of the characteristic space, and n is an integer.
Optionally, the determining module 34 is further configured to determine that the specified software is malware in a family cluster if the similarity between the feature dimension of the URL of the specified software and that of the malware in the family cluster is greater than a preset threshold, where the family cluster is a family cluster in the specified malware family, and the family cluster is obtained by: dividing the malware in the specified malware family into a plurality of family clusters by aggregating the feature dimensions of the URLs, wherein, when a plurality of malware families exist, family clusters with high similarity in different malware families are merged into the specified family cluster.
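The merging of similar clusters across malware families can be pictured with the sketch below. It is a sketch under stated assumptions, not the claimed procedure: the text does not say how similarity between clusters is measured or how a merged center is computed, so the Euclidean distance, the `merge_radius` parameter, the mean-of-centers rule, and the function name are all hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed formula)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def merge_family_clusters(family_centers, merge_radius, distance=euclidean):
    """Merge cluster centers that lie within merge_radius of each other.

    family_centers maps a family name to its list of cluster centers.
    Returns a list of (sorted family names, merged center) tuples.
    """
    merged = []  # each entry: {"families": set, "centers": [...], "center": [...]}
    for family, centers in family_centers.items():
        for center in centers:
            for entry in merged:
                if distance(center, entry["center"]) <= merge_radius:
                    # Close to an existing merged cluster: fold this center in.
                    entry["families"].add(family)
                    entry["centers"].append(center)
                    k = len(entry["centers"])
                    entry["center"] = [sum(col) / k for col in zip(*entry["centers"])]
                    break
            else:
                # No merged cluster is close enough: keep this one separate.
                merged.append({
                    "families": {family},
                    "centers": [center],
                    "center": list(center),
                })
    return [(sorted(e["families"]), e["center"]) for e in merged]
```

Note that this simple rule also merges two close clusters from the same family, which may or may not be desirable in a given deployment.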
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. A method for identifying malware, comprising:
acquiring a Uniform Resource Locator (URL) corresponding to a network behavior generated by running of specified software;
determining whether the designated software is malware or not according to preset rules and the characteristic dimension of the URL, wherein the preset rules are determined according to the characteristic dimension of the URL corresponding to the network behaviors generated by a plurality of malware;
wherein, the preset rule is as follows: determining the designated software as malware when the similarity of the feature dimensions of the URLs between the designated software and the malware is larger than a preset threshold, wherein the URL is a URL corresponding to a network behavior generated by the software, and the feature dimensions of the URLs of the malware are feature dimensions of URLs which are common to all malware in a designated malware family;
the similarity of the feature dimensions of the URLs among the members in the specified malware family is higher than a preset value, the specified malware family is a preset set of malware, and the feature dimensions of the URLs of the network behaviors of the software are obtained in the following mode:
acquiring respective network behaviors of a plurality of software, and analyzing and acquiring a URL (uniform resource locator) in each network behavior;
splitting parameters of the URL according to key-value pairs to obtain a plurality of parameter segments, assigning values to the parameter segments of the key-value pairs to map to an n-dimensional space, and then obtaining integer vectors of the URL, wherein the integer vectors of the URL are characteristic dimensions of the URL, the parameter segments are codes of the key-value pairs, n represents the dimension number of the characteristic space, and n is an integer.
2. The method according to claim 1, wherein in case that the similarity between the designated software and the feature dimension of the URL in the malware in the family cluster is greater than a preset threshold, determining that the designated software is the malware in the family cluster, wherein the family cluster is the family cluster in the designated malware family, and the family cluster is obtained by:
dividing the malware in the specified malware family into a plurality of family clusters by aggregating the feature dimensions of the URLs, wherein, when a plurality of malware families exist, family clusters with high similarity in different malware families are merged into the specified family cluster.
3. The method of claim 2, wherein after dividing the malware in the specified malware family into a plurality of family clusters by aggregating the feature dimensions of the URLs, categories of the network behaviors of the malware in the family clusters are determined, and categories of the family clusters are determined according to those categories, wherein the categories comprise one of: C&C connection, file downloading and advertisement clicking.
4. The method of claim 1, further comprising: periodically updating malware in the specified malware family.
5. An apparatus for identifying malware, the apparatus comprising:
the acquisition module is used for acquiring a Uniform Resource Locator (URL) corresponding to a network behavior generated by the specified software during operation;
the determining module is used for determining whether the designated software is the malicious software or not according to preset rules and the characteristic dimension of the URL, wherein the preset rules are determined according to the characteristic dimension of the URL corresponding to the network behaviors generated by a plurality of malicious software;
the determining module is further configured to determine that the designated software is malware when the similarity of the feature dimensions of the URLs between the designated software and the malware is greater than a preset threshold, where the URL is a URL corresponding to a network behavior generated by the software, and the feature dimension of the URL of the malware is a feature dimension of a URL common to all malware in a designated malware family;
the similarity of the feature dimensions of the URLs among the members in the specified malware family is higher than a preset value, the specified malware family is a preset set of malware, and the obtaining module is further used for obtaining the feature dimensions of the URLs of the network behaviors of the software in the following modes:
acquiring respective network behaviors of a plurality of software, and analyzing and acquiring a URL (uniform resource locator) in each network behavior; splitting parameters of the URL according to key-value pairs to obtain a plurality of parameter segments, assigning values to the parameter segments of the key-value pairs to map to an n-dimensional space, and then obtaining integer vectors of the URL, wherein the integer vectors of the URL are characteristic dimensions of the URL, the parameter segments are codes of the key-value pairs, n represents the dimension number of the characteristic space, and n is an integer.
6. The apparatus of claim 5, wherein the determining module is further configured to determine that the specified software is a malware in a family cluster if the similarity between the specified software and a feature dimension of a URL in the malware in the family cluster is greater than a preset threshold, wherein the family cluster is the family cluster in the specified malware family, and the family cluster is obtained by:
dividing the malware in the specified malware family into a plurality of family clusters by aggregating the feature dimensions of the URLs, wherein, when a plurality of malware families exist, family clusters with high similarity in different malware families are merged into the specified family cluster.
CN201611265807.3A 2016-12-30 2016-12-30 Malicious software identification method and device Active CN106713335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611265807.3A CN106713335B (en) 2016-12-30 2016-12-30 Malicious software identification method and device

Publications (2)

Publication Number Publication Date
CN106713335A CN106713335A (en) 2017-05-24
CN106713335B true CN106713335B (en) 2020-10-30

Family

ID=58905647

Country Status (1)

Country Link
CN (1) CN106713335B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107222511B (en) * 2017-07-25 2021-08-13 深信服科技股份有限公司 Malicious software detection method and device, computer device and readable storage medium
CN107609400A (en) * 2017-09-28 2018-01-19 深信服科技股份有限公司 Computer virus classification method, system, equipment and computer-readable recording medium
CN110399722B (en) * 2019-02-20 2024-03-26 腾讯科技(深圳)有限公司 Virus family generation method, device, server and storage medium
CN109951484B (en) * 2019-03-20 2021-01-26 四川长虹电器股份有限公司 Test method and system for attacking machine learning product
CN110765393A (en) * 2019-09-17 2020-02-07 微梦创科网络科技(中国)有限公司 Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression
CN112580027A (en) * 2020-12-15 2021-03-30 北京天融信网络安全技术有限公司 Malicious sample determination method and device, storage medium and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708186A (en) * 2012-05-11 2012-10-03 上海交通大学 Identification method of phishing sites
CN104239582A (en) * 2014-10-14 2014-12-24 北京奇虎科技有限公司 Method and device for identifying phishing webpage based on feature vector model
CN104537303A (en) * 2014-12-30 2015-04-22 中国科学院深圳先进技术研究院 Distinguishing system and method for phishing website
CN104579773A (en) * 2014-12-31 2015-04-29 北京奇虎科技有限公司 Domain name system analysis method and device
CN104794051A (en) * 2014-01-21 2015-07-22 中国科学院声学研究所 Automatic Android platform malicious software detecting method
CN105825129A (en) * 2015-01-04 2016-08-03 中国移动通信集团设计院有限公司 Converged communication malicious software identification method and system
CN106131071A (en) * 2016-08-26 2016-11-16 北京奇虎科技有限公司 A kind of Web method for detecting abnormality and device
US9531736B1 (en) * 2012-12-24 2016-12-27 Narus, Inc. Detecting malicious HTTP redirections using user browsing activity trees

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100583738C (en) * 2007-08-17 2010-01-20 东南大学 Fishing webpage detection method based on image processing
US8019700B2 (en) * 2007-10-05 2011-09-13 Google Inc. Detecting an intrusive landing page
CN102340424B (en) * 2010-07-21 2013-12-04 中国移动通信集团山东有限公司 Bad message detection method and bad message detection device
US9288220B2 (en) * 2013-11-07 2016-03-15 Cyberpoint International Llc Methods and systems for malware detection
EP3731458A1 (en) * 2014-01-24 2020-10-28 McAfee, LLC Automatic placeholder finder-filler
CN104331436B (en) * 2014-10-23 2017-06-06 西安交通大学 The quick classifying method of malicious code based on family gene code
US9398047B2 (en) * 2014-11-17 2016-07-19 Vade Retro Technology, Inc. Methods and systems for phishing detection
CN105893848A (en) * 2016-04-27 2016-08-24 南京邮电大学 Precaution method for Android malicious application program based on code behavior similarity matching
CN106131016B (en) * 2016-07-13 2019-05-03 北京知道创宇信息技术有限公司 Malice URL detects interference method, system and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215163 No. 181 Jingrun Road, Suzhou High-tech Zone, Jiangsu Province

Applicant after: SHANSHI NETWORK COMMUNICATION TECHNOLOGY CO., LTD.

Address before: 215163 No. 181 Jingrun Road, Suzhou High-tech Zone, Jiangsu Province

Applicant before: HILLSTONE NETWORKS

GR01 Patent grant