CN106713335B

CN106713335B - Malicious software identification method and device

Info

Publication number: CN106713335B
Application number: CN201611265807.3A
Authority: CN
Inventors: 於大维; 董浩; 谢军; 陆骋怀; 尚进; 蒋东毅
Original assignee: Hillstone Networks Co Ltd
Current assignee: Hillstone Networks Co Ltd
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2020-10-30
Anticipated expiration: 2036-12-30
Also published as: CN106713335A

Abstract

The invention discloses a method and a device for identifying malicious software. Network behaviors of a large number of malicious software samples are collected in advance, URL features are extracted from those network behaviors, and a preset rule is established. The URL features of the network behaviors of designated software to be detected are then collected and compared with the URL features of the malicious software to determine whether the designated software is malicious software. With this technical scheme, malicious software can be identified rapidly and accurately, which solves the technical problem in the related art that malicious software cannot be identified conveniently and effectively.

Description

Malicious software identification method and device

Technical Field

The invention relates to the field of single-chip microcomputers, in particular to a method and a device for identifying malicious software.

Background

In the related art, malware refers to viruses, worms, and Trojan horses that perform malicious tasks on computer systems and gain control of infected hosts by disrupting software processes. Malware combines multiple threats. An infected host is often controlled by a hacker's command and control server and becomes part of a botnet (BotNet), a group of computers on the Internet that is centrally controlled by the hacker. Botnets are often used to launch large-scale network attacks, such as distributed denial of service (DDoS) attacks and massive spam campaigns, and the information stored on the controlled computers can be acquired by the hacker. Botnets are therefore an extremely serious threat to both safe network operation and user data security.

A computer infected with a Trojan horse is under the hacker's control: the hacker can manipulate it at will and use it for any purpose, like a puppet. At the heart of a botnet is the Command and Control (C&C) mechanism. A communication channel exists between the controlled host and the hacker without the knowledge of the host's user. Through this channel the hacker sends commands to the controlled host, uploads files, initiates attacks, and so on. C&C uses various communication mechanisms, and HTTP is the main communication protocol.

Based on analysis of the network behavior of a large amount of malware, technicians can build an effective model to identify communication between a controlled host and a botnet, thereby discovering hosts infected by malware. Infected hosts are found through network behaviors, isolation and cleaning can be carried out in time, and loss caused by threats is reduced. This is a challenging area and there are a number of techniques for solving this problem.

In the related art, there are about two ways to solve the above-mentioned problems:

The first is to build a blacklist database of C&C connection websites. Infected hosts are detected by using Uniform Resource Locator (URL) filtering to control and report the network usage of hosts. The technique builds a library of feature fields for known C&C connections, and a connection is judged to be a C&C connection by matching the parameters of the actual connection website against the library. The technique is simple and accurate and has a low false-positive rate. However, it has the following disadvantages: the feature field library is static and cannot cope with slight changes in the parameters of the connection website. If all feature fields are included, the library becomes large and inefficient. The feature library must be updated promptly to remain current. Although the feature fields are derived from the network behavior of malware samples, not all of those connections are malicious; some connection features are very close to those of normal, harmless connections. Feature screening and selection are therefore required to reduce false positives, and the effort needed to maintain and update the feature library is relatively large.

The second is to build a model with supervised machine learning from the network behaviors of malware samples and normal, harmless network behaviors. A large number of positive and negative samples are collected and labeled, and a model is then built with a supervised machine learning method such as regression or random forest. The model can make a judgment on the parameters of the connection website. Because it uses machine learning, the technique has a degree of dynamic adaptability, and it has some ability to identify malware that has not been encountered before but resembles known samples. However, this method has the following problems: 1) An effective supervised machine learning model requires a large number of representative normal samples to learn from. Because normal samples are extremely numerous and take varied and complex forms, comprehensive and representative sampling is difficult to achieve. The number of malicious samples is also comparatively small, so the resulting model has a high false-positive rate. 2) Supervised machine learning places high demands on the accuracy of sample labels. If the positive samples are labeled inaccurately, the accuracy of the machine learning model is seriously affected.

No effective solution has yet been proposed for the technical problem in the related art that malicious software cannot be identified conveniently and effectively.

Disclosure of Invention

The embodiment of the invention provides a method and a device for identifying malicious software, which are used for at least solving the technical problem that the malicious software is not conveniently and effectively identified in the related technology.

According to an embodiment of the present invention, there is provided a malware identification method, including: acquiring a Uniform Resource Locator (URL) corresponding to a network behavior generated by running of specified software; and determining whether the designated software is the malicious software according to a preset rule and the characteristic dimension of the URL, wherein the preset rule is determined according to the characteristic dimension of the URL corresponding to the network behaviors generated by the malicious software.

Optionally, the preset rule is: when the similarity between the feature dimension of the URL of the designated software and that of the malware is greater than a preset threshold, the designated software is determined to be malware, wherein the URL is a URL corresponding to the network behavior generated by the software, and the feature dimension of the URL of the malware is the feature dimension of the URL shared by all the malware in a designated malware family.

Optionally, the similarity of the feature dimensions of the URLs among the members in the designated malicious family is higher than a preset value, the designated malicious family is a preset set of malicious software, and the feature dimensions of the URLs of the network behaviors of the software are acquired by the following method: acquiring respective network behaviors of the plurality of software, and analyzing and acquiring a URL (uniform resource locator) in each network behavior; splitting parameters of the URL according to key-value pairs to obtain a plurality of parameter segments, assigning values to the parameter segments of the key-value pairs to map to an n-dimensional space, and then obtaining integer vectors of the URL, wherein the integer vectors of the URL are characteristic dimensions of the URL, the parameter segments are codes of the key-value pairs, n represents the dimension number of the characteristic space, and n is an integer.

Optionally, when the similarity between the feature dimension of the URL of the specified software and that of the malware in a family cluster is greater than a preset threshold, the specified software is determined to be malware of that family cluster. The family cluster is a cluster within the specified malware family and is obtained as follows: the malware in the specified malware family is divided into a plurality of family clusters by clustering the feature dimensions of the URLs; where a plurality of malware families exist, family clusters with high similarity across different malware families are merged into one family cluster.

Optionally, after the malware in the specified malware family is divided into a plurality of family clusters by clustering the feature dimensions of the URLs, the category of the network behavior of the malware in each family cluster is determined, and the category of the family cluster is determined accordingly, where the category includes one of: C&C connection, file download, and advertisement click.

Optionally, malware in the specified malware family is updated periodically.

According to another embodiment of the present invention, there is also provided a malware identification apparatus including:

the acquisition module is used for acquiring a Uniform Resource Locator (URL) corresponding to a network behavior generated by the specified software during operation;

the determining module is used for determining whether the designated software is the malicious software according to preset rules and the characteristic dimension of the URL, wherein the preset rules are determined according to the characteristic dimension of the URL corresponding to the network behaviors generated by the malicious software.

Optionally, the determining module is further configured to determine that the specified software is malware when the similarity of the feature dimension of the URL between the specified software and the malware is greater than a preset threshold, where the URL is a URL corresponding to a network behavior generated by the software, and the feature dimension of the URL of the malware is a feature dimension of a URL that all malware in a specified malware family have in common.

Optionally, the similarity of the feature dimensions of URLs among members in the specified malicious family is higher than a preset value, the specified malicious family is a preset set of malicious software, and the obtaining module is further configured to obtain the feature dimensions of URLs of network behaviors of the software by: acquiring respective network behaviors of a plurality of software, and analyzing and acquiring a URL (uniform resource locator) in each network behavior; splitting parameters of the URL according to key-value pairs to obtain a plurality of parameter segments, assigning values to the parameter segments of the key-value pairs to map to an n-dimensional space, and then obtaining integer vectors of the URL, wherein the integer vectors of the URL are characteristic dimensions of the URL, the parameter segments are codes of the key-value pairs, n represents the dimension number of the characteristic space, and n is an integer.

Optionally, the determining module is further configured to determine that the specified software is malware of a family cluster when the similarity between the feature dimension of the URL of the specified software and that of the malware in the family cluster is greater than a preset threshold. The family cluster is a cluster within the specified malware family and is obtained as follows: the malware in the specified malware family is divided into a plurality of family clusters by clustering the feature dimensions of the URLs; where a plurality of malware families exist, family clusters with high similarity across different malware families are merged into one family cluster.

In the embodiment of the invention, the network behaviors of a large number of malware samples are collected in advance, URL features are extracted from those network behaviors, and a preset rule is established. The URL features of the network behaviors of the designated software to be detected are then collected and compared with the URL features of the malware to determine whether the designated software is malware. With this technical scheme, malware can be identified rapidly and accurately, which solves the technical problem in the related art that malware cannot be identified conveniently and effectively.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of a method of malware identification according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for establishing a malware detection model, in accordance with a preferred embodiment of the present invention;

FIG. 3 is a block diagram of a malware identification apparatus according to a preferred embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

In accordance with an embodiment of the present invention, an embodiment of a malware identification method is provided. It is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.

FIG. 1 is a flowchart of a malware identification method according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

step S102, acquiring a Uniform Resource Locator (URL) corresponding to a network behavior generated when the specified software runs;

step S104, determining whether the designated software is malware according to a preset rule and a feature dimension of the URL, wherein the preset rule is determined according to the feature dimension of the URL corresponding to the network behavior generated by the malware.

Through the above steps, the network behaviors of a large number of malware samples are collected in advance, URL features are extracted from those network behaviors, and a preset rule is established. The URL features of the network behaviors of the designated software to be detected are then collected and compared with the URL features of the malware to determine whether the designated software is malware. With this technical scheme, malware can be identified rapidly and accurately, which solves the technical problem in the related art that malware cannot be identified conveniently and effectively.

Optionally, the preset rule is: when the similarity between the feature dimension of the URL of the designated software and that of the malware is greater than a preset threshold, the designated software is determined to be malware, wherein the URL is a URL corresponding to the network behavior generated by the software, and the feature dimension of the URL of the malware is the feature dimension of the URL shared by all the malware in the designated malware family. It should be added that the feature dimension of a URL is an integer vector (explained further below), and the similarity between the feature dimensions of two URLs may be calculated with any technique for calculating the similarity between vectors in the related art. Whether two vectors are close is judged with a distance formula. The distance formula may be chosen according to the specific characteristics of the application, and the chosen formula must be used consistently in all calculation steps.
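As an illustration of a pluggable distance formula, the following minimal Python sketch defines two common vector distances and a similarity derived from a distance. The function names, the choice of Euclidean and Manhattan distance, and the conversion 1 / (1 + distance) are assumptions made for illustration; the text above leaves the actual formula to the application.

```python
import math
from typing import Callable, Sequence

Metric = Callable[[Sequence[float], Sequence[float]], float]

def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def similarity(a: Sequence[float], b: Sequence[float],
               metric: Metric = euclidean) -> float:
    """Hypothetical conversion of a distance into a similarity in (0, 1];
    identical vectors give 1.0."""
    return 1.0 / (1.0 + metric(a, b))

def exceeds_threshold(a: Sequence[float], b: Sequence[float],
                      threshold: float, metric: Metric = euclidean) -> bool:
    """Apply the preset rule: similarity greater than the preset threshold."""
    return similarity(a, b, metric) > threshold
```

Whichever metric is passed in, the same one should be used when the model is built and when it is applied, in line with the consistency requirement stated above.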

Optionally, the similarity of the feature dimensions of the URLs among the members in the specified malicious family is higher than a preset value, the specified malicious family is a preset set of malicious software, and the feature dimensions of the URLs of the network behaviors of the software are acquired by the following method: acquiring respective network behaviors of the plurality of software, and analyzing and acquiring a URL (uniform resource locator) in each network behavior; splitting the parameter of the URL according to the key-value pair to obtain a plurality of parameter segments, assigning the parameter segments of the key-value pair to an n-dimensional space, and then obtaining an integer vector of the URL, wherein the integer vector of the URL is the characteristic dimension of the URL, the parameter segments are codes of the key-value pair, n represents the dimension number of the characteristic space, and n is an integer.

Optionally, when the similarity between the feature dimension of the URL of the specified software and that of the malware in a family cluster is greater than a preset threshold, the specified software is determined to be malware of that family cluster. The family cluster is a cluster within the specified malware family and is obtained as follows: the malware in the specified malware family is divided into a plurality of family clusters by clustering the feature dimensions of the URLs; where a plurality of malware families exist, family clusters with high similarity across different malware families are merged into one family cluster.

Optionally, after the malware in the specified malware family is divided into a plurality of family clusters by clustering the feature dimensions of the URLs, the category of the network behavior of the malware in each family cluster is determined, and the category of the family cluster is determined accordingly, where the category includes one of: C&C connection, file download, and advertisement click. It should be added that the categories of family clusters are not limited to the above examples.

Optionally, malware in the specified malware family is updated periodically. Sources of updates may include new malware samples, feedback of device detection results, and adaptation to the device deployment environment.

The following detailed description is given with reference to preferred embodiments of the present invention.

The preferred embodiment of the invention adopts a novel method to extract feature dimensions from the parameters of network connection URLs, based on the network behavior characteristics of malware families. An unsupervised machine learning clustering (aggregation) method is applied to the extracted dimensions to obtain accurate features and to effectively remove incorrect samples and noise, so that the generated model has very high accuracy and a low false-positive rate.

The principle of the method is to make full use of some essential characteristics of how malware develops and evolves. One such characteristic is the reuse of code modules. Many malware samples are variants created by modifying existing malware, and some malware source code is sold on underground black markets. As a result, the website features of malicious connections are preserved to a large extent across variants of the same malware family. While the common features are retained, there is also a degree of drift and variation. A reliable and efficient detection method needs to extract the common similarities and remove the non-common noise.

Because modules are reused, certain network connection behaviors of different malware families may also be very similar to one another. Similar features of malware across families are identified and merged into one feature, which shrinks the feature identification model and improves detection efficiency.

By adopting the technical scheme in the preferred embodiment of the invention, the following three technical problems are mainly solved:

1. Distinguishing malicious connections from non-malicious connections in malware samples: the collected network behavior of a large number of malware samples is the basis for the analytical modeling. Not all connections in the network behavior of a sample are malicious; many of them also occur in normal software. Determining that a connection is malicious is a technical challenge.

2. Effective feature engineering: the evolution of malware produces a large number of variants, which makes it difficult to extract identifying features. To solve this problem, the feature dimensions and the matching method need a degree of fuzziness, so that unknown variants can be identified in addition to existing known samples.

3. Machine learning modeling techniques that do not rely on benign samples. The method has less requirements on benign samples, and the benign samples are only used for model cleaning to reduce the false alarm rate.

FIG. 2 is a flowchart of a method for establishing a malware detection model according to a preferred embodiment of the present invention. As shown in FIG. 2, the method includes the following steps:

Step one: malware is collected and classified, run in a sandbox, and its network behaviors are collected.

Step two: the network behaviors of the various protocols are analyzed, and the URL features of the HTTP protocol are extracted.

Step three: the parameters of each URL are split into key-value pairs, and each key-value pair is encoded as integer values in a space of dimension n (n is a fixed value). In this way a URL is converted into an integer vector, and this vector is the feature dimension of the URL. The number of feature dimensions produced by this conversion is fixed. This step performs dimension extraction and dimension reduction.

Step four: using a clustering (aggregation) method, the feature dimensions of all URLs of one malware family are grouped into several clusters of similar URLs. Sample URLs that do not satisfy the clustering condition are treated as noise and removed. Each cluster is a feature signature of the malware family, and a malware family may have one or more cluster features. Unlike traditional string matching, the similarity used for clustering is computed with a formula specific to the preferred embodiment of the invention, which effectively reflects the physical meaning of the URL parameters.

Controlling the conditions of aggregation allows distinguishing truly generic malicious URLs from false malicious URLs that are not statistically significant. In the process of later updating, as the samples increase, the false malicious URLs can also become meaningful true malicious URLs due to the addition of new samples.

Step five: the clusters of multiple malware families are clustered together across families. Similar clusters from different families may be merged into one cluster. Clusters common to several families reflect the modularization and code reuse typical of malware.

Step six: to reduce false positives, the model is cleaned using a large number of normal traffic connections. The same similarity calculation method is used in this step. Clusters that are similar to benign connections are removed from the model. This step also assigns category labels to the clusters. For some malicious connections, analysis can determine the type of connection, such as C&C connection, file download, or advertisement click. These categorized URLs are matched to clusters using the same similarity matching method. Thus, for a cluster that matches a malicious connection, the specific type of the malicious connection can be provided, giving the user a more accurate understanding.

Step seven: the mathematical feature model of the clusters generated in step six is distributed to the deployed intelligent firewall devices. The firewall device uses the model to inspect the HTTP URLs of user traffic. A detected malicious connection generates a threat event that is displayed to the user on the firewall device. At the same time, metadata of the inspection result is uploaded to the cloud for further analysis.

Step eight: the model is updated. The malware samples are continuously refreshed, and steps one through six are repeated periodically to build a new model. Meanwhile, the cloud can perform big data analysis on the detection data uploaded by all devices and correct the model according to the feedback. The updated model is distributed to the devices for upgrading.
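Steps five and six can be sketched in Python as follows. This is a minimal illustration under stated assumptions: clusters are represented only by their center vectors, Euclidean distance stands in for the unspecified similarity formula, merging uses a simple running mean, and the names merge_across_families and clean_with_benign are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance (an assumed choice of distance formula)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def merge_across_families(family_clusters, radius):
    """Step five: merge cluster centers from different families that lie
    within `radius` of an already accepted center."""
    merged = []  # entries: {"center": [...], "families": set, "count": int}
    for family, centers in family_clusters.items():
        for center in centers:
            for entry in merged:
                if distance(center, entry["center"]) <= radius:
                    k = entry["count"]
                    # Running mean of all centers merged into this entry.
                    entry["center"] = [(c * k + x) / (k + 1)
                                       for c, x in zip(entry["center"], center)]
                    entry["count"] = k + 1
                    entry["families"].add(family)
                    break
            else:
                merged.append({"center": list(center),
                               "families": {family}, "count": 1})
    return merged

def clean_with_benign(model, benign_vectors, radius):
    """Step six: drop every cluster that lies within `radius` of a benign vector."""
    return [entry for entry in model
            if all(distance(entry["center"], b) > radius for b in benign_vectors)]

family_clusters = {
    "famA": [[0.0, 0.0], [10.0, 10.0]],
    "famB": [[0.0, 1.0], [5.0, 5.0]],
}
model = merge_across_families(family_clusters, radius=1.5)
model = clean_with_benign(model, benign_vectors=[[5.0, 5.5]], radius=1.5)
```

In this toy input the first clusters of famA and famB are close enough to merge into one cross-family cluster, and the famB cluster near the benign vector is removed during cleaning.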

The model generated by the preferred embodiment of the invention is small, fast to compute, highly accurate, and covers a high proportion of malware. Preliminary data shows that the applicant extracted a model of about nine thousand clusters from over one million malware samples. The model detects 85% of known malicious samples, and its coverage of malware families exceeds 90%, while the false-positive rate on benign samples is negligibly low. More importantly, the method of the preferred embodiment of the present invention also has a high detection rate for unknown malware variants, so that threats can be stopped before they cause harm.

The following are specific examples of preferred embodiments of the invention.

The preferred embodiment of the invention is a novel method that performs big data analysis on the URLs to which malware connects and extracts the common features of malicious connections in order to establish a mathematical model. The model established by the method is small, has low computational complexity, and detects quickly and efficiently.

To use the method of the preferred embodiment of the present invention, the user first needs to collect the network behavior of a large number of malware samples. The common practice is to run malware in a sandbox environment while capturing its network packets. Malware is typically classified into malware families, each of which contains different variants. Malicious connections are grouped by family, including the variants within the same family, which allows the common features of different variants of the same family to be extracted. The following is an example with two malicious connections under the family Trojan[Rootkit]/Win32.Small. (In actual data there may be tens of thousands of connections under one family.)

Family: Trojan[Rootkit]/Win32.Small

First malicious connection: http://domain.com/conn?user=joe&ver=2.0&key=123abc

Second malicious connection: http://cc.domain/connpath?key=123456&user=jane&ver=3.5

For the first connection, three parameter pairs are extracted: user=joe, ver=2.0, key=123abc

For the second connection, three parameter pairs are extracted: key=123456, user=jane, ver=3.5

(Each parameter is in key=value format.)
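The parameter extraction in this example can be reproduced with Python's standard urllib.parse module. The sketch assumes the example connections use the usual key=value query syntax separated by &; the helper name extract_params is hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit

def extract_params(url: str) -> list[tuple[str, str]]:
    """Return the query parameters of a URL as (key, value) pairs, in order."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps keys that arrive with an empty value.
    return parse_qsl(query, keep_blank_values=True)

first = extract_params("http://domain.com/conn?user=joe&ver=2.0&key=123abc")
second = extract_params("http://cc.domain/connpath?key=123456&user=jane&ver=3.5")
```

Only the query string is used here; the host and path differ between the two connections, while the parameter keys are the same.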

In actual data, millions of parameter key-value pairs may be extracted. Feature dimensions must be extracted from these parameter pairs, and the extracted features should reflect the nature of how malicious connections change. For example, the parameter ver=2.0 is converted into ver=2.0, ver=numerical (the type of 2.0), and string (the type of ver)=2.0. Each converted value is then mapped onto a finite space of fixed dimension n. Assuming that n is 100, the four values map to (5, 32, 91, 99). The specific mapping formula may be determined by the actual application.
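One possible way to realize such a mapping is feature hashing, sketched below in Python. This is an assumption and not the mapping formula of the embodiment, so it does not reproduce the values (5, 32, 91, 99). The text names three converted forms but speaks of four values; the fourth form used here (type of key = type of value) is a guess made for illustration, as are the function names and the use of SHA-256.

```python
import hashlib

def value_type(text: str) -> str:
    """Coarse type of a token: 'numeric' if float() accepts it, else 'string'."""
    try:
        float(text)
        return "numeric"
    except ValueError:
        return "string"

def bucket(token: str, n: int) -> int:
    """Map a token to a stable integer in [0, n) (assumed hashing scheme)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def encode_pair(key: str, value: str, n: int = 100) -> tuple[int, ...]:
    """Encode one key=value pair as four integers in [0, n)."""
    forms = [
        f"{key}={value}",                          # exact pair
        f"{key}={value_type(value)}",              # key with the value's type
        f"{value_type(key)}={value}",              # key's type with the value
        f"{value_type(key)}={value_type(value)}",  # assumed fourth form
    ]
    return tuple(bucket(form, n) for form in forms)

codes_v2 = encode_pair("ver", "2.0")
codes_v3 = encode_pair("ver", "3.5")
```

Because ver=2.0 and ver=3.5 share the form ver=numeric, the second integer of both encodings is identical, which gives the matching the tolerance to small parameter changes that the method relies on.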

The mapping from a connection URL to a vector in n-dimensional space, that is, the method for extracting feature dimensions and reducing their number, is one of the key technical points of the preferred embodiment of the present invention. Its function is to convert hundreds of millions of parameters into a form that facilitates machine learning on big data.

Assuming that a malware family has M connection URLs, an M × n matrix is obtained. Clustering is performed on this matrix, and similar vectors are combined to generate clusters. This reduces the M × n matrix to p clusters. Each cluster is represented by the n-dimensional vector of its center point. By adjusting the clustering parameters, p can be kept within a small and effective range.
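The reduction of the M × n matrix to p cluster centers can be sketched in Python as follows. The text does not name a clustering algorithm, so a greedy leader-style clustering is swapped in here as an assumption, with Euclidean distance, a radius parameter, and a minimum cluster size standing in for the clustering parameters; clusters below the minimum size are dropped as noise.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(rows):
    """Coordinate-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def cluster(rows, radius, min_size=2):
    """Greedy clustering: each row joins the nearest existing group whose
    centroid is within `radius`, otherwise it starts a new group.
    Groups smaller than `min_size` are treated as noise and discarded."""
    groups = []
    for row in rows:
        best, best_d = None, radius
        for group in groups:
            d = euclidean(row, centroid(group))
            if d <= best_d:
                best, best_d = group, d
        if best is None:
            groups.append([row])
        else:
            best.append(row)
    return [centroid(group) for group in groups if len(group) >= min_size]

# M = 5 URL vectors with n = 3 dimensions (toy values).
matrix = [[1, 0, 0], [1, 0, 1], [0, 5, 5], [0, 5, 4], [9, 9, 9]]
centers = cluster(matrix, radius=1.5)
```

With these toy values the five rows reduce to p = 2 centers, and the isolated row [9, 9, 9] is discarded as noise. A larger radius or a smaller min_size yields fewer, broader clusters.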

The method for judging whether the vectors are close is a distance formula. The distance formula may be selected by the specific characteristics of the application. The selected distance formula is to be kept consistent in all calculation steps.

These p n-dimensional vectors constitute the mathematical model for threat detection. For an unknown connection, for example http://somewhere/connection?user&version=2.3, the connection parameters are converted into dimensions and vectorized by the same method. The distance between the vector to be detected and the clusters of the model is then calculated, and the connection is judged to be malicious if the distance satisfies the distance threshold.
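The detection step can be sketched end to end in Python. Everything specific in this sketch is an assumption made for illustration: the hashed count vector used to vectorize a URL, the four derived forms per parameter, Euclidean distance via math.dist, the threshold value, and the single stand-in cluster center built from one known URL.

```python
import hashlib
import math
from urllib.parse import parse_qsl, urlsplit

def kind(text: str) -> str:
    """Coarse type of a token: 'numeric' if float() accepts it, else 'string'."""
    try:
        float(text)
        return "numeric"
    except ValueError:
        return "string"

def vectorize(url: str, n: int = 100) -> list[int]:
    """Turn the query parameters of a URL into an n-dimensional count vector."""
    vec = [0] * n
    for key, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        forms = (f"{key}={value}", f"{key}={kind(value)}",
                 f"{kind(key)}={value}", f"{kind(key)}={kind(value)}")
        for form in forms:
            digest = hashlib.sha256(form.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:8], "big") % n] += 1
    return vec

def is_malicious(url: str, centers: list[list[float]],
                 threshold: float, n: int = 100) -> bool:
    """Flag the URL if its vector lies within `threshold` of any cluster center."""
    vec = vectorize(url, n)
    nearest = min((math.dist(vec, c) for c in centers), default=math.inf)
    return nearest <= threshold

known = "http://domain.com/conn?user=joe&ver=2.0&key=123abc"
model = [vectorize(known)]  # stand-in for real cluster centers
variant = "http://cc.domain/connpath?user=joe&ver=2.1&key=123abc"
verdict = is_malicious(variant, model, threshold=3.0)
```

The variant changes the host, the path, and the version value, yet most of its hashed forms coincide with those of the known connection, so it stays within the threshold. A URL without query parameters produces a zero vector and falls outside it.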

The detection model provided by the preferred embodiment of the invention can be deployed on the egress firewall of a company network to inspect the outbound connections of intranet hosts. The preferred embodiment of the invention can detect an intranet host downloading malware, plug-ins, and the like from a malicious website, and can also detect requests from infected hosts to a botnet control center. The detection result can prompt the company's IT department to analyze the suspicious host further, for example with an antivirus scan. If the infection is confirmed, further protective measures can be taken to isolate the host. The firewall may also be configured with automatic policies to mitigate the harm from a suspicious host, such as limiting its network connections or blocking file uploads, so as to avoid the damage caused by Trojans and viruses.

The unique and novel clustering technique of the preferred embodiment of the invention makes feedback on the model and updating of the model simple. A user may mark a result of the model as a false positive so that similar URLs are no longer reported. Cloud-based model updates can also quickly propagate user feedback to more devices.

In the related art, advanced threats pose a significant challenge to enterprise information security because they are diverse, persistent and difficult to detect. The technical scheme in the preferred embodiment of the invention can protect enterprises and governments by detecting, on the basis of the existing firewall, controlled hosts infected by malicious software in the shortest possible time. Enterprises can then take effective measures before harm occurs, reducing losses in all respects.

By adopting the technical scheme in the preferred embodiment of the invention, the following effects are realized:

1. Correct and effective malicious website connections are extracted from the network behaviors of all the malicious software samples, benign noise is removed, and a rapid and efficient detection model is established. The detection model adopts a similarity matching method, adapts effectively to variations in the network connections of the malicious software, and helps to detect unknown variants.

2. The detection method is rapid and efficient. The preferred embodiment of the invention adopts a novel conversion formula to convert parameter strings, which in theory have infinitely many possibilities, into a fixed number of mathematical dimensions, thereby effectively solving the curse of dimensionality and making quick and efficient detection possible.

Example 2

Fig. 3 is a block diagram of a malware identification apparatus according to a preferred embodiment of the present invention. As shown in Fig. 3, the apparatus includes:

an obtaining module 32, configured to obtain a uniform resource locator URL corresponding to a network behavior generated by the specified software during running;

a determining module 34, connected to the obtaining module 32, for determining whether the specified software is malware according to a preset rule and a feature dimension of the URL, where the preset rule is determined according to the feature dimension of the URL corresponding to network behaviors generated by a plurality of malware.

Optionally, the determining module 34 is further configured to determine that the specified software is malware if the similarity of the feature dimension of the URL between the specified software and the malware is greater than a preset threshold, where the URL is a URL corresponding to a network behavior generated by the software, and the feature dimension of the URL of the malware is a feature dimension of a URL that all malware in a specified malware family have in common.

Optionally, the similarity of the feature dimensions of the URLs among the members of the specified malware family is higher than a preset value, the specified malware family being a preset set of malware, and the obtaining module 32 is further configured to obtain the feature dimension of the URL of the network behavior of the software as follows: acquiring the respective network behaviors of the plurality of malware, and parsing each network behavior to obtain its URL; splitting the parameters of the URL by key-value pair to obtain a plurality of parameter segments, assigning values to the parameter segments so as to map them into an n-dimensional space, and thereby obtaining an integer vector of the URL, wherein the integer vector of the URL is the feature dimension of the URL, the parameter segments are encodings of the key-value pairs, n denotes the number of dimensions of the feature space, and n is an integer.
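
The splitting and mapping described above can be illustrated with the following sketch. The conversion formula of the embodiment is not reproduced in this passage, so the sketch substitutes a simple, clearly hypothetical scheme: each key is hashed with CRC32 to pick one of n dimensions, and the length of the value is added to that dimension. The function name `url_to_vector`, the default n = 8 and the sample parameter values are assumptions.

```python
import zlib
from urllib.parse import parse_qsl, urlsplit

def url_to_vector(url, n=8):
    """Map the key-value parameters of a URL to an n-dimensional integer vector.

    Assumed scheme (not the patent's formula): the key selects the
    dimension via CRC32 modulo n, and the value contributes its length.
    """
    vector = [0] * n
    query = urlsplit(url).query
    for key, value in parse_qsl(query, keep_blank_values=True):
        index = zlib.crc32(key.encode("utf-8")) % n
        vector[index] += len(value)
    return vector

# Hypothetical URLs: same keys, different hosts, values and parameter order.
first = url_to_vector("http://somewhere/connection?user=alice&version=2.3")
second = url_to_vector("http://other/x?version=9.9&user=bobby")
print(first)
print(first == second)
```

Because the vector depends on the keys and on coarse properties of the values rather than on the literal strings, two URLs with the same parameter structure land on the same point, which is the property a similarity-based model relies on.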

Optionally, the determining module 34 is further configured to determine that the specified software is malware in a family cluster if the similarity between the feature dimension of the URL of the specified software and that of the malware in the family cluster is greater than a preset threshold, where the family cluster is a family cluster in the specified malware family and is obtained as follows: dividing the malware in the specified malware family into a plurality of family clusters by aggregating the feature dimensions of the URLs, wherein, in the case where a plurality of malware families exist, family clusters with high similarity in different malware families are merged into a specified family cluster.
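
The merging of similar family clusters across malware families might look like the sketch below. It is an illustration only: Euclidean distance, the `radius` parameter, the simplification of keeping the first center as the representative of a merged cluster, and the family names and vectors are all assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def merge_family_clusters(families, radius):
    """Merge cluster centers from different families that lie within `radius`.

    `families` maps a family name to its list of cluster centers. Returns a
    list of dicts, each holding a representative center and the set of
    families that contributed a cluster to it.
    """
    merged = []
    for name, centers in families.items():
        for center in centers:
            for entry in merged:
                if euclidean(center, entry["center"]) <= radius:
                    entry["families"].add(name)
                    break
            else:
                merged.append({"center": list(center), "families": {name}})
    return merged

# Hypothetical cluster centers for two malware families.
families = {
    "familyA": [[1, 0, 4], [9, 8, 0]],
    "familyB": [[1, 1, 4], [0, 0, 20]],
}
merged = merge_family_clusters(families, radius=2.0)
print(len(merged))
```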

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program code.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for identifying malware, comprising:

acquiring a Uniform Resource Locator (URL) corresponding to a network behavior generated by running of specified software;

determining whether the designated software is malware or not according to preset rules and the characteristic dimension of the URL, wherein the preset rules are determined according to the characteristic dimension of the URL corresponding to the network behaviors generated by a plurality of malware;

wherein, the preset rule is as follows: determining the designated software as malware when the similarity of the feature dimensions of the URLs between the designated software and the malware is larger than a preset threshold, wherein the URL is a URL corresponding to a network behavior generated by the software, and the feature dimensions of the URLs of the malware are feature dimensions of URLs which are common to all malware in a designated malware family;

the similarity of the feature dimensions of the URLs among the members in the specified malware family is higher than a preset value, the specified malware family is a preset set of malware, and the feature dimensions of the URLs of the network behaviors of the software are obtained in the following mode:

acquiring respective network behaviors of a plurality of software, and analyzing and acquiring a URL (uniform resource locator) in each network behavior;

splitting parameters of the URL according to key-value pairs to obtain a plurality of parameter segments, assigning values to the parameter segments of the key-value pairs to map to an n-dimensional space, and then obtaining integer vectors of the URL, wherein the integer vectors of the URL are characteristic dimensions of the URL, the parameter segments are codes of the key-value pairs, n represents the dimension number of the characteristic space, and n is an integer.

2. The method according to claim 1, wherein in case that the similarity between the designated software and the feature dimension of the URL in the malware in the family cluster is greater than a preset threshold, determining that the designated software is the malware in the family cluster, wherein the family cluster is the family cluster in the designated malware family, and the family cluster is obtained by:

dividing the malware in the specified malware family into a plurality of family clusters by aggregating the feature dimensions of the URLs, wherein, in the case where a plurality of malware families exist, family clusters with high similarity in different malware families are merged into a specified family cluster.

3. The method of claim 2, wherein after dividing the malware in the specified malware family into a plurality of family clusters by aggregating the feature dimensions of the URLs, the categories of the network behaviors of the malware in each family cluster are determined, and the category of the family cluster is determined according to those categories, wherein the categories comprise one of: C&C connection, file downloading and advertisement clicking.

4. The method of claim 1, further comprising: periodically updating malware in the specified malware family.

5. An apparatus for identifying malware, the apparatus comprising:

the determining module is used for determining whether the designated software is the malicious software or not according to preset rules and the characteristic dimension of the URL, wherein the preset rules are determined according to the characteristic dimension of the URL corresponding to the network behaviors generated by a plurality of malicious software;

the determining module is further configured to determine that the designated software is malware when the similarity of the feature dimensions of the URLs between the designated software and the malware is greater than a preset threshold, where the URL is a URL corresponding to a network behavior generated by the software, and the feature dimension of the URL of the malware is a feature dimension of a URL common to all malware in a designated malware family;

the similarity of the feature dimensions of the URLs among the members in the specified malware family is higher than a preset value, the specified malware family is a preset set of malware, and the obtaining module is further used for obtaining the feature dimensions of the URLs of the network behaviors of the software in the following modes:

acquiring respective network behaviors of a plurality of software, and analyzing and acquiring a URL (uniform resource locator) in each network behavior; splitting parameters of the URL according to key-value pairs to obtain a plurality of parameter segments, assigning values to the parameter segments of the key-value pairs to map to an n-dimensional space, and then obtaining integer vectors of the URL, wherein the integer vectors of the URL are characteristic dimensions of the URL, the parameter segments are codes of the key-value pairs, n represents the dimension number of the characteristic space, and n is an integer.

6. The apparatus of claim 5, wherein the determining module is further configured to determine that the specified software is a malware in a family cluster if the similarity between the specified software and a feature dimension of a URL in the malware in the family cluster is greater than a preset threshold, wherein the family cluster is the family cluster in the specified malware family, and the family cluster is obtained by: