WO2017128868A1 - Classification method, classification device, and classification system for program files - Google Patents

Classification method, classification device, and classification system for program files

Info

Publication number
WO2017128868A1
Authority
WO
WIPO (PCT)
Prior art keywords
program file
file
path
behavior information
program
Prior art date
Application number
PCT/CN2016/108901
Other languages
English (en)
French (fr)
Inventor
刘振华
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to EP16887739.7A (patent EP3306493B1, en)
Publication of WO2017128868A1 (zh)
Priority to US15/870,545 (patent US10762194B2, en)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/52 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow
    • G06F21/53 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow by executing in a restricted environment, e.g. sandbox or secure virtual machine
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures

Definitions

  • Embodiments of the present invention relate to the field of computers, and, more particularly, to a method, a classification device, and a classification system for classifying a program file.
  • Program files inside an enterprise can be classified by using the behavior of each program file during operation as the feature vector for machine clustering and classification, thereby dividing the large number of program files inside the enterprise into several program file categories. An analyst then only needs to randomly select a few program files from each category for in-depth analysis to learn whether the program files in that category are suspicious. It also becomes possible to promptly discover the program files that are unclassified, are difficult to classify, or fall into categories containing only a small number of program files, to focus analysis on this small number of program files, and to discover malicious program files in time. This effectively reduces the workload of analyzing massive numbers of program files and improves the efficiency of analysis.
  • In existing approaches, program files are classified by directly using the behavior information in the behavior sequence of the program file, such as string information, to form a feature vector for calculation. Because the parameter information of the paths involved in the various pieces of behavior information is highly random, the feature vectors differ widely and the classification effect is poor, so the similarity of the behavior information cannot be used effectively for clustering and classification.
  • Embodiments of the present invention provide a classification method, a classification device, and a classification system for program files, which can improve the effect of classifying program files and thereby reduce the workload of identifying malicious program files.
  • According to a first aspect, the present invention provides a method for classifying a program file, comprising: acquiring behavior information corresponding to at least two behaviors performed by a program file during operation, where each piece of behavior information includes a behavior identifier and a path involved in executing the corresponding behavior; normalizing the path in each piece of behavior information, where the normalization is used to reduce the diversity of paths; generating a feature vector according to the at least two pieces of behavior information after path normalization, where each element of the feature vector corresponds to one piece of behavior information after path normalization, elements corresponding to the same behavior information after path normalization have the same value, and the same behavior information refers to behavior information in which the behavior identifiers are the same and the normalized paths are also the same; and determining, according to the feature vector, the category to which the program file belongs.
  • The method for classifying a program file of the present invention acquires behavior information corresponding to at least two behaviors of the program file, normalizes the path in each piece of behavior information to reduce the diversity of paths, generates a feature vector according to the behavior information after path normalization, and determines the category to which the program file belongs according to the feature vector. Because the randomness of the normalized paths is reduced, the classification effect for program files improves, which in turn reduces the workload of staff in identifying malicious program files.
  • the specific normalization process may include the following.
  • The path in the behavior information includes an M-level directory of a file, and normalizing the path in each piece of behavior information includes: if the M-level directory conforms to a first regular expression, replacing the first N levels of the M-level directory with a first identifier and replacing the remaining M-N levels with the number M-N, where the mapping between the first regular expression and the first identifier is one mapping in a first type of mapping relationship, the first type of mapping relationship includes at least one mapping formed by a regular expression and an identifier, M is a positive integer, and N is a positive integer less than or equal to M; and if the M-level directory does not match any mapping in the first type of mapping relationship, retaining the first L levels of the M-level directory and replacing the remaining M-L levels with the number M-L, where L is a positive integer less than or equal to M.
  • That the M-level directory conforms to the first regular expression can be equivalently understood as the M-level directory including a special directory, such as the first directory.
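  • The directory rule above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the regular expressions, the identifiers `<TEMP>` and `<SYSTEM>`, the values of N and L, and the choice to count the drive letter as one directory level are all assumptions made for the example.

```python
import re

# Hypothetical first-type mapping: (regular expression, identifier, N).
# Patterns, identifiers, and N values are illustrative assumptions.
DIR_RULES = [
    (re.compile(r"^C:\\Users\\[^\\]+\\AppData\\Local\\Temp(\\|$)", re.I), "<TEMP>", 6),
    (re.compile(r"^C:\\Windows\\System32(\\|$)", re.I), "<SYSTEM>", 3),
]
DEFAULT_KEEP = 2  # L: leading levels kept when no rule matches (assumed value)


def normalize_directory(directory: str) -> str:
    """Reduce the diversity of an M-level directory."""
    levels = [part for part in directory.split("\\") if part]
    m = len(levels)
    for pattern, identifier, n in DIR_RULES:
        if pattern.search(directory):
            n = min(n, m)
            # First N levels become the identifier, the rest become the number M-N.
            return f"{identifier}\\{m - n}"
    keep = min(DEFAULT_KEEP, m)
    # No rule matched: keep the first L levels, replace the rest with the number M-L.
    return "\\".join(levels[:keep] + [str(m - keep)])
```

  • With these assumed rules, two temporary directories that differ only in the user name and in randomly named subdirectories normalize to the same string, which is the effect the rule is meant to achieve.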
  • The path in the behavior information includes a file main name and a file extension, and normalizing the path in each piece of behavior information includes: if the file main name conforms to a predetermined file main name feature, replacing the file main name with a second identifier, where the mapping between the predetermined file main name feature and the second identifier is one mapping in a second type of mapping relationship, and the second type of mapping relationship includes at least one mapping between a feature that a file main name conforms to and an identifier; and if the file extension does not belong to a preset file extension list, replacing the file extension with a preset third identifier.
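  • A minimal sketch of the file name rule, under stated assumptions: the two main-name features (a 32-character hexadecimal string, an all-digit name), the identifiers `<HEX32>`, `<DIGITS>`, and `<EXT>`, and the contents of the extension list are hypothetical and are not given in the source.

```python
import re

# Hypothetical second-type mapping: (feature of the file main name, identifier).
NAME_RULES = [
    (re.compile(r"^[0-9a-f]{32}$", re.I), "<HEX32>"),  # looks like a hex digest
    (re.compile(r"^\d+$"), "<DIGITS>"),                # consists only of digits
]
KNOWN_EXTENSIONS = {"exe", "dll", "sys", "bat", "txt"}  # assumed preset list
OTHER_EXTENSION = "<EXT>"                               # assumed third identifier


def normalize_file_name(file_name: str) -> str:
    """Normalize the file main name and the file extension separately."""
    main, dot, ext = file_name.rpartition(".")
    if not dot:  # no dot at all: the whole string is the main name
        main, ext = file_name, ""
    for pattern, identifier in NAME_RULES:
        if pattern.match(main):
            main = identifier
            break
    if ext and ext.lower() not in KNOWN_EXTENSIONS:
        ext = OTHER_EXTENSION
    return f"{main}.{ext}" if ext else main
```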
  • The path in the behavior information includes an S-level subkey of the registry, and normalizing the path in each piece of behavior information includes: if the S-level subkey conforms to a third regular expression, retaining the first T levels of the S-level subkey and deleting the remaining S-T levels, where the mapping between the third regular expression and T is one mapping in a third type of mapping relationship, the third type of mapping relationship includes at least one mapping formed by a regular expression and a number of levels to retain, S is a positive integer, and T is a positive integer less than or equal to S; and if a first subkey in the S-level subkey is a class identifier (CLSID), which is a globally unique identifier, replacing the first subkey with a fourth identifier.
  • That the S-level subkey conforms to the third regular expression can be equivalently understood as the S-level subkey including a special subkey.
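  • The registry rule can be sketched in the same way. The pattern, the value of T, the identifier `<CLSID>`, the decision to count the root key (for example HKLM) as one level, and the decision to replace any subkey that has the CLSID form are assumptions made only for illustration.

```python
import re

# Hypothetical third-type mapping: (regular expression, T = leading levels to keep).
REG_RULES = [
    (re.compile(r"^HKLM\\SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\", re.I), 5),
]
# Textual form of a CLSID: {8-4-4-4-12 hexadecimal digits}.
CLSID = re.compile(r"^\{[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}$")
CLSID_IDENTIFIER = "<CLSID>"  # assumed fourth identifier


def normalize_subkeys(subkey_path: str) -> str:
    """Truncate well-known subkey prefixes and mask CLSID subkeys."""
    subkeys = [key for key in subkey_path.split("\\") if key]
    for pattern, t in REG_RULES:
        if pattern.search(subkey_path):
            subkeys = subkeys[:t]  # keep the first T levels, drop the remaining S-T
            break
    subkeys = [CLSID_IDENTIFIER if CLSID.match(key) else key for key in subkeys]
    return "\\".join(subkeys)
```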
  • The above normalization rules are distilled from long-term observation of the behavior of malicious programs, or are based on experience.
  • The normalization of paths described above reduces the number of distinct features of a program file. On the one hand, it can reduce the number of elements in the feature vector of a program file. On the other hand, it can increase the number of identical features shared by any two program files, so that program files are classified more effectively.
  • Determining, according to the feature vector, the category to which the program file belongs includes: determining, by using a classification algorithm, the similarity between the elements of the feature vector and the cluster center data of each of a plurality of program file categories, where the cluster center data represents the characteristics of the program files in the corresponding program file category and includes elements of the feature vectors of the program files in that category; and assigning the program file to a first program file category, where the similarity between the elements of the feature vector and the cluster center data of the first program file category is higher than the similarity between the elements of the feature vector and the cluster center data of every other category among the plurality of program file categories.
  • The cluster center data in the present invention may be obtained by clustering the feature vectors of a plurality of program files with a clustering algorithm, and is used to represent the characteristics of the program files in the program file category to which it belongs.
  • Alternatively, determining, according to the feature vector, the category to which the program file belongs includes: clustering, by using a clustering algorithm, the feature vector together with the feature vectors of a plurality of other program files to form at least one program file category, where the similarity between the elements of the feature vectors of the program files within each program file category is higher than a threshold, and the program file is assigned to a first program file category among the at least one program file category.
  • Generating the feature vector according to the at least two pieces of behavior information after path normalization includes: representing each piece of behavior information after path normalization as an integer, where the same behavior information after path normalization is represented as the same integer; and using each of the integers as an element of the feature vector, so that the plurality of integers constitute the feature vector.
  • Acquiring the behavior information corresponding to the at least two behaviors performed by the program file during operation includes: acquiring the program file by using a monitoring program; and obtaining a behavior sequence of the program file by using a sandbox server, where the behavior sequence includes the behavior information corresponding to each of the at least two behaviors.
  • A second aspect provides a device for classifying a program file, comprising: an acquisition module, configured to acquire behavior information corresponding to at least two behaviors performed by a program file during operation, where each piece of behavior information includes a behavior identifier and a path involved in executing the corresponding behavior; a normalization module, configured to normalize the path in each piece of behavior information acquired by the acquisition module, where the normalization is used to reduce the diversity of paths; a generating module, configured to generate a feature vector according to the at least two pieces of behavior information whose paths have been normalized by the normalization module, where each element of the feature vector corresponds to one piece of behavior information after path normalization, elements corresponding to the same behavior information after path normalization have the same value, and the same behavior information refers to behavior information with the same behavior identifier and the same normalized path; and a classification module, configured to determine, according to the feature vector generated by the generating module, the category to which the program file belongs.
  • the classification device of the program file of the second aspect is a classification server.
  • A third aspect provides a classification system for program files, including a client, a sandbox server, and a classification device for program files, where the client includes a monitoring program and acquires a program file in the client by using the monitoring program; the sandbox server is configured to receive the program file sent by the monitoring program, generate a behavior sequence of the program file, and provide the generated behavior sequence to the classification device, where the behavior sequence includes behavior information corresponding to each of at least two behaviors; and the classification device is configured to execute the method described in the first aspect and its corresponding implementations.
  • a program file may be referred to as an executable program file.
  • FIG. 1 is a schematic diagram of behavior information corresponding to a plurality of behaviors performed by a program file during operation.
  • FIG. 2A and FIG. 2B are schematic diagrams showing the hash values of two sets of program files and the feature vectors obtained by abstracting the behavior sequences of the program files during operation.
  • FIG. 3 is a schematic flowchart of a method for classifying a program file according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a classification system of a program file according to an embodiment of the present invention.
  • FIG. 5 is a schematic flow chart of generating a feature vector according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of implementing a clustering function according to an embodiment of the present invention.
  • FIG. 7 is a schematic flowchart of implementing a clustering function according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of implementing a classification function according to an embodiment of the present invention.
  • FIG. 9 is a schematic flow chart of implementing a classification function according to an embodiment of the present invention.
  • FIG. 10 is a schematic flowchart of a method for classifying a program file according to another embodiment of the present invention.
  • FIG. 11 is a schematic block diagram of a classification device for program files according to an embodiment of the present invention.
  • FIG. 12 is a schematic block diagram of a classification device for program files according to another embodiment of the present invention.
  • program files in this document are also referred to as executable program files.
  • the program file initiates a system service request to the operating system during runtime.
  • a system service request can also be called an application programming interface (API) call.
  • the API call may include reading and writing of a file, allocation of a memory, input and output (IO) of a network, operation of a hardware device, reading and writing of a system configuration, and the like, which are not limited by the embodiment of the present invention.
  • the API call of a program file during runtime is called the "behavior" of the program file in the industry; the sequence of API calls is called the "behavior sequence" of the program file.
  • the behavior sequence includes a plurality of behavior information, and each behavior information corresponds to one behavior, and each behavior information includes a path involved in performing the corresponding behavior.
  • FIG. 1 is a schematic diagram of behavior information corresponding to a plurality of behaviors performed by a program file during operation.
  • the path involved in executing the behavior is HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Drivers32\msacm.msgsm610, where the subkey of the registry is HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Drivers32 and the registry value name is msacm.msgsm610.
  • the behavior is identified as RegQueryValue, and the specific behavior is to execute the registry operation function RegQueryValue.
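  • As an illustration of the structure just described, one piece of behavior information can be held as a behavior identifier plus a path, and a registry path can be split into its subkey and its value name. The class name `BehaviorInfo`, the helper `split_registry_path`, and the assumption that levels are separated by backslashes are hypothetical and serve only this sketch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BehaviorInfo:
    """One entry of a behavior sequence (hypothetical representation)."""
    behavior_id: str  # for example the name of the API that was called
    path: str         # path involved in executing the behavior


def split_registry_path(path: str) -> Tuple[str, str]:
    """Split a registry path into (subkey, value name) at the last backslash."""
    subkey, _, value_name = path.rpartition("\\")
    return subkey, value_name


info = BehaviorInfo(
    behavior_id="RegQueryValue",
    path=r"HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Drivers32\msacm.msgsm610",
)
subkey, value_name = split_registry_path(info.path)
```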
  • a program file usually has several different versions.
  • the new version may be a bug fix or a feature improvement, but the basic functionality to be implemented by these different versions of the program is consistent. Therefore, the flow and behavior of these different versions of the program files during their operation in the same environment are very close.
  • FIG. 2A and FIG. 2B are schematic diagrams of the hash values of two sets of program files, for example Message Digest Algorithm 5 (MD5) values, and of the feature vectors obtained by abstracting the behavior sequences of the program files during operation.
  • The left column contains MD5 values, where each MD5 value represents one program file, and the right column contains the feature vector corresponding to each program file. It can be seen that within each of the two sets the MD5 values of the program files differ, but the feature vectors of the program files in the same set are very close.
  • In the prior art, program files are classified by forming the feature vector directly from the behavior information in the behavior sequence of the program file. Because the paths involved in the behaviors are highly random, the feature vectors of program files differ greatly and the classification effect is poor, so the similarity of the behavior information cannot be used effectively for clustering and classification.
  • FIG. 3 is a schematic flowchart of a method 300 for classifying a program file according to an embodiment of the present invention.
  • the method 300 can include:
  • S310. Acquire behavior information corresponding to at least two behaviors performed by a program file during operation, where each piece of behavior information includes a behavior identifier and a path involved in executing the corresponding behavior;
  • S320. Normalize the path in each piece of behavior information, where the normalization is used to reduce the diversity of paths;
  • S330. Generate a feature vector according to the at least two pieces of behavior information after path normalization, where each element of the feature vector corresponds to one piece of behavior information after path normalization, elements corresponding to the same behavior information after path normalization have the same value, and the same behavior information refers to behavior information with the same behavior identifier and the same normalized path;
  • In S310, acquiring the behavior information corresponding to the at least two behaviors performed by the program file during operation may include: acquiring the program file by using a monitoring program; and obtaining a behavior sequence of the program file by using a sandbox server, where the behavior sequence includes the behavior information corresponding to each of the at least two behaviors.
  • The classification system 400 for program files may deploy a monitoring program, such as an Agent program, in each of a plurality of enterprise clients, such as the client 402, the client 404, and the client 406.
  • The Agent programs are responsible for real-time monitoring of executable program files in the enterprise clients, especially newly added executable files. Obtaining executable program files by using a monitoring program can be based on commonly used techniques and is not described in detail here.
  • The Agent program in a client sends the program file to a remote sandbox server 414.
  • the sandbox server 414 continuously receives program files from different clients and generates a sequence of behaviors of the program files. As shown in FIG. 1 , the behavior sequence includes a plurality of behavior information, and each behavior information corresponds to one behavior, and each behavior information includes a path involved in performing the corresponding behavior.
  • the sandbox server 414 outputs a sequence of behaviors of each program file during runtime to the classification server 416 to facilitate classification server 416 to classify the program files.
  • Sandbox technology can also be based on commonly used techniques and will not be described in detail here.
  • The classification server 416 abstracts the behavior of the program file and generates a feature vector based on the behavior information included in the behavior sequence of each program file. In S320, the classification server 416 normalizes the path in each piece of behavior information, where the normalization is used to reduce the diversity of paths. In S330, the classification server 416 generates a feature vector according to the plurality of pieces of behavior information after path normalization, where each element of the feature vector corresponds to one piece of behavior information after path normalization, and elements corresponding to the same behavior information after path normalization have the same value.
  • the specific normalization process is described below.
  • In S330, generating the feature vector according to the at least two pieces of behavior information after path normalization may include: representing each piece of behavior information after path normalization as an integer, where the same behavior information after path normalization is represented as the same integer; and using each of the integers as an element of the feature vector, so that the plurality of integers constitute the feature vector.
  • That is, each piece of behavior information, after its path is normalized, is mapped to one integer.
  • the behavior information corresponding to multiple behaviors in a sequence of behaviors corresponding to a program file may correspond to a plurality of integers, and the plurality of integers constitute an integer sequence, and the sequence of integers is a feature vector.
  • the elements in the feature vector are the above integers, and each integer can also be called a feature, indicating a behavior of the program file.
  • The method for classifying a program file in the embodiment of the present invention acquires behavior information corresponding to at least two behaviors of the program file, normalizes the path in each piece of behavior information to reduce the diversity of paths, generates a feature vector according to the behavior information after path normalization, and determines the category to which the program file belongs according to the feature vector. Because the randomness of the normalized paths is reduced, the classification effect for program files improves considerably.
  • For example, when feature vectors are generated from behavior information that has not been normalized, the effect of clustering or classifying program files according to those feature vectors is poor.
  • With normalization, 100 program files to be classified correspond to 100 feature vectors, and the differences among the elements of these 100 feature vectors are smaller than the differences among the elements of the 100 feature vectors obtained in the prior art, so that clustering or classification may, for example, yield 5 program file categories. This greatly improves the classification effect for program files and thereby reduces the workload of subsequently identifying malicious program files.
  • FIG. 5 is a schematic flowchart of generating a feature vector in an embodiment of the present invention.
  • S501 is executed to read one piece of behavior information from the behavior sequence of the program file.
  • The path in the behavior information is then normalized. The behavior information includes a path involved in executing the corresponding behavior. That path may include an M-level directory, a file main name, and a file extension of a file, or it may include an S-level subkey of the registry and a registry value name. The path in each piece of behavior information is normalized, that is, simplified, according to preset normalization rules.
  • It is then determined whether the behavior information after path normalization and its corresponding integer are already stored in the feature set. The feature set includes a plurality of pieces of behavior information and the integer corresponding to each.
  • When it exists, S504 is executed: the integer corresponding to the behavior information is taken as one element of the feature vector, and then S507 is executed.
  • When it does not exist, S505 is executed: a new integer representing the behavior information is generated and used as an element of the feature vector; S506 is then executed to store the behavior information and the corresponding integer in the feature set, and then S507 is executed.
  • S507. Determine whether behavior information remains in the behavior sequence; if so, execute S501 again, and if not, end. The feature vector corresponding to the program file is thus obtained, and S508 is executed to store the MD5 value and the feature vector of the program file.
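  • The flow above can be sketched as a loop over a shared feature set. This is only an illustration under assumptions: behavior information is modeled as a (behavior identifier, normalized path) tuple whose path has already been normalized, the feature set is a Python dictionary, and new integers are assigned as consecutive numbers starting from 1; the source does not say how new integers are chosen.

```python
from typing import Dict, Iterable, List, Tuple

# (behavior identifier, normalized path); hypothetical representation.
NormalizedBehavior = Tuple[str, str]


def build_feature_vector(
    behavior_sequence: Iterable[NormalizedBehavior],
    feature_set: Dict[NormalizedBehavior, int],
) -> List[int]:
    """Map each normalized behavior to an integer, reusing the shared feature set."""
    vector: List[int] = []
    for info in behavior_sequence:  # S501 and S507: read entries until none remain
        if info not in feature_set:
            # S505 and S506: create a new integer and remember it in the feature set.
            feature_set[info] = len(feature_set) + 1
        vector.append(feature_set[info])  # S504 or S505: integer becomes an element
    return vector


feature_set: Dict[NormalizedBehavior, int] = {}
first = build_feature_vector(
    [("RegQueryValue", "A"), ("CreateFile", "B"), ("RegQueryValue", "A")], feature_set
)
second = build_feature_vector([("CreateFile", "B"), ("WriteFile", "C")], feature_set)
```

  • Because the feature set is shared across program files, identical normalized behavior in two different files maps to the same integer, which is what makes the resulting vectors comparable.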
  • the classification server 416 uses the sequence of integers as a feature vector for machine learning clustering and classification.
  • Internally, the classification server 416 can be divided into two functions: clustering and classification.
  • For clustering, the feature vector is clustered together with the feature vectors of a plurality of other program files by a clustering algorithm to form at least one program file category, where the similarity between the elements of the feature vectors of the program files within each program file category is higher than a threshold, and the program file is assigned to a first program file category among the at least one program file category.
  • FIG. 6 shows a schematic diagram of implementing a clustering function according to an embodiment of the present invention.
  • FIG. 7 shows a schematic flow chart of implementing a clustering function according to an embodiment of the present invention.
  • the clustering function can be implemented by the clusterer.
  • The clusterer reads the original, not yet clustered feature vectors, or reads the feature vectors that were not successfully clustered or were left unclassified within a certain period of time.
  • The clusterer performs a clustering operation on the feature vectors of the plurality of program files by using a clustering algorithm, generating a plurality of clusters and the cluster center data of each cluster.
  • Each cluster is a program file category, and each program file category has cluster center data, which includes elements of the feature vectors of the program files in that category.
  • S703. Output and save the cluster center data of the successfully clustered categories for reference in subsequent clustering or classification.
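  • A minimal sketch of threshold-based clustering with cluster center data. It deliberately uses a simple greedy grouping with Jaccard similarity over the sets of feature-vector elements rather than any specific algorithm from the source; the similarity measure, the function names, and the `min_share` rule for choosing center elements are assumptions.

```python
from collections import Counter
from typing import Dict, List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Share of elements two feature sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(vectors: Dict[str, List[int]], threshold: float) -> List[List[str]]:
    """Greedy grouping: a file joins a cluster only if it is similar to every member."""
    clusters: List[List[str]] = []
    for md5, vec in vectors.items():
        for members in clusters:
            if all(jaccard(set(vec), set(vectors[m])) > threshold for m in members):
                members.append(md5)
                break
        else:
            clusters.append([md5])  # no existing cluster fits: start a new one
    return clusters


def cluster_center(members: List[str], vectors: Dict[str, List[int]],
                   min_share: float = 0.5) -> Set[int]:
    """Center data: elements that occur in at least min_share of the member files."""
    counts = Counter(e for m in members for e in set(vectors[m]))
    return {e for e, c in counts.items() if c / len(members) >= min_share}


vectors = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 5], "c": [7, 8, 9]}
groups = cluster(vectors, threshold=0.5)
center = cluster_center(groups[0], vectors, min_share=1.0)
```

  • Keeping only frequently occurring elements in the center data mirrors the option, mentioned in the description, of using a part of the elements that appear often among the program files of a category.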
  • For classification, a classification algorithm is used to determine the similarity between the elements of the feature vector and the cluster center data of each of a plurality of program file categories, where the cluster center data represents the characteristics of the program files in the corresponding program file category and includes elements of the feature vectors of the program files in that category. The program file is assigned to a first program file category, where the similarity between the elements of the feature vector and the cluster center data of the first program file category is higher than the similarity between the elements of the feature vector and the cluster center data of every other category among the plurality of program file categories.
  • The cluster center data in the embodiment of the present invention may be obtained by clustering the feature vectors of multiple program files with a clustering algorithm, and is used to represent the characteristics of the program files in the program file category to which it belongs.
  • The cluster center data includes a plurality of elements, which may be all the elements of the feature vectors of the program files in the corresponding program file category, or only a part of those elements, for example the elements that appear frequently in the feature vectors of all program files in the category.
  • the clustering of the plurality of program files is performed by using different clustering algorithms, and the elements included in the clustering center data obtained after the clustering are also different, which is not limited by the embodiment of the present invention.
  • The clustering algorithm used in the embodiments of the present invention may be based on existing algorithms, including but not limited to: partitioning methods such as the K-Means algorithm, the K-Medoids algorithm, and the CLARANS algorithm; hierarchical methods such as the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, the Clustering Using Representatives (CURE) algorithm, and the Chameleon algorithm; density-based methods such as the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, the OPTICS algorithm, and the DENCLUE algorithm; graph-theory clustering methods; grid-based methods such as the STING algorithm, the CLIQUE algorithm, and the WaveCluster algorithm; and model-based methods based on statistics or on neural networks. These algorithms are not described in detail here.
  • The program file is assigned to the first program file category, with whose cluster center data the elements of the feature vector of the program file have the highest similarity.
  • Highest similarity means that, among the similarities between the elements of the feature vector of the program file and the cluster center data of all program file categories, the similarity with the cluster center data of the first program file category is the highest.
  • FIG. 8 is a diagram showing the implementation of the classification function of one embodiment of the present invention.
  • FIG. 9 shows a schematic flow chart of implementing a classification function according to an embodiment of the present invention.
  • the classification function can be implemented by a category calibrator.
  • the category calibrator obtains a feature vector of a program file.
  • the category calibrator uses a classification algorithm to compare the feature vector with the existing cluster center data, determines which cluster center data is closest to the feature vector, and divides the program file corresponding to the feature vector into the corresponding program file category.
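  • the comparison performed by the category calibrator can be sketched as follows. This is a minimal illustration, not the patented implementation: the Jaccard similarity, the `min_similarity` cutoff, and the names `jaccard`, `classify`, and `centers` are assumptions chosen for the example, since the text does not fix a particular similarity measure.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of integer features (assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(feature_vector, cluster_centers, min_similarity=0.5):
    """Return (category, similarity) for the closest cluster center.

    The category is None when no center reaches min_similarity, which models
    a program file that does not belong to any known category.
    """
    best_category, best_similarity = None, 0.0
    for category, center in cluster_centers.items():
        similarity = jaccard(feature_vector, center)
        if similarity > best_similarity:
            best_category, best_similarity = category, similarity
    if best_similarity < min_similarity:
        return None, best_similarity
    return best_category, best_similarity


# Hypothetical cluster center data for two known categories.
centers = {"category_a": [1, 2, 3, 4], "category_b": [7, 8, 9]}
print(classify([1, 2, 3, 5], centers))
```

  • a feature vector that shares no elements with any center, such as `[100]`, is returned with category `None`, which corresponds to the "unknown category" branch of the flow described later.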
  • the system 400 for program file classification may also include a web portal server 418. Through the web portal server 418, the system 400 can display the classification information of the current program file to the relevant staff of the enterprise, such as an IT administrator, in real time. The relevant staff can select typical program files in each program file category, and perform in-depth analysis on these program files and the small-scale program files that fail to be classified.
  • a method 1000 for classifying a program file according to an embodiment of the present invention will be specifically described below with reference to FIG.
  • the method 1000 is performed by a system for classifying program files, including:
  • the Agent program in the client sends the program file to the sandbox server.
  • the program file can be a new program file in the client.
  • the sandbox server runs the program, and outputs the behavior sequence of the program file to the classification server.
  • the behavior sequence may also be referred to as a behavior log, which is not limited by the embodiment of the present invention.
  • the classification server generates a feature vector composed of a sequence of integers according to the sequence of behaviors.
  • S1005 Determine, by the category calibrator, whether the program file belongs to a known program file category according to the feature vector. When it is, S1009 may be performed; when it is not, S1006 is executed, and at the same time, S1009 may be executed.
  • S1006 Determine whether the number of program files of the unknown category reaches a predetermined threshold. When the threshold is not reached, the process ends; when the threshold is reached, S1007 is executed.
  • S1007 Perform clustering operations on multiple feature vectors of a program file of an unknown category.
  • the information of the program file category may include the program files in each program file category, and may also include niche program files for which classification or clustering failed.
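  • the decision logic of steps S1005 to S1007 can be sketched as below. The callbacks `classify` and `cluster`, the list `unknown_pool`, and the choice to empty the pool after clustering are assumptions made for illustration; the text only specifies the branch conditions.

```python
def handle_program_file(feature_vector, classify, cluster, unknown_pool, threshold):
    """Process one feature vector following S1005 to S1007.

    classify(vector) -> category name or None   (hypothetical callback)
    cluster(vectors) -> list of new categories  (hypothetical callback)
    Returns (category, new_categories).
    """
    # S1005: does the program file belong to a known category?
    category = classify(feature_vector)
    if category is not None:
        return category, []

    # Unknown category: remember the vector for later clustering.
    unknown_pool.append(feature_vector)

    # S1006: only cluster once enough unknown files have accumulated.
    if len(unknown_pool) < threshold:
        return None, []

    # S1007: cluster the accumulated unknown vectors (pool reset is an assumption).
    new_categories = cluster(list(unknown_pool))
    unknown_pool.clear()
    return None, new_categories


pool = []
known = lambda v: "known" if 1 in v else None   # toy classifier
group_all = lambda vectors: [vectors]           # toy clustering: one category
print(handle_program_file([1, 2], known, group_all, pool, threshold=2))
```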
  • each behavior information in the behavior sequence of the program file is uniquely marked as an integer.
  • the behavior identifier is RegWriteValue
  • its corresponding behavior is to execute the registry operation function RegWriteValue.
  • the path involved in this behavior is HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\BootDriverFlags.
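  • marking each piece of behavior information as an integer can be sketched as a lookup table keyed by the pair (behavior identifier, path). The class name `BehaviorEncoder`, the sequential numbering from 1, and the `CreateFile` identifier are assumptions for illustration; the text only requires that identical behavior information receives the same integer.

```python
class BehaviorEncoder:
    """Assign one integer to each distinct (behavior identifier, path) pair."""

    def __init__(self):
        self._ids = {}

    def encode(self, behavior_id, path):
        key = (behavior_id, path)
        if key not in self._ids:
            # First time this behavior information is seen: give it the next integer.
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]

    def feature_vector(self, behaviors):
        """Turn a behavior sequence into a sequence of integers."""
        return [self.encode(behavior_id, path) for behavior_id, path in behaviors]


encoder = BehaviorEncoder()
behaviors = [
    ("RegWriteValue", r"HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\BootDriverFlags"),
    ("CreateFile", r"windir\sysname.exe"),  # hypothetical second behavior
    ("RegWriteValue", r"HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\BootDriverFlags"),
]
vector = encoder.feature_vector(behaviors)
print(vector)
```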
  • the embodiment of the present invention may normalize the path in the behavior information by using a preset normalization rule. More specifically, the path involved in the behavior information is simplified according to a preset normalization rule, thereby reducing the number of independent features.
  • the specific process of normalizing the path may be as follows.
  • the path includes the path of the file and the path of the registry.
  • the path of the file is normalized or simplified.
  • the path of the file includes the directory of the file, such as an M-level directory, a file main name, and a file extension.
  • At least one set of mapping relationships formed by regular expressions and identifiers may be pre-stored in the classification server; these are referred to as the first type of mapping relationship.
  • the mapping relationship between the first regular expression and the first identifier is one group in the first type of mapping relationship.
  • the special directories include the directories shown in Table 1, but are not limited to these.
  • each special directory may correspond to an abstracted identifier or name. During normalization, if the directory of the path being normalized matches the regular expression of a special directory in the table, the directory levels that match the regular expression, from the first level through the last matching level, are replaced as a whole with the identifier, and the number of remaining directory levels is placed after the identifier.
  • when the M-level directory in the path does not match any mapping relationship in the first type of mapping relationship, that is, when the directory is a non-special directory, the first L levels of the M-level directory are retained and the remaining M-L levels are replaced by the number M-L, where L is a positive integer less than or equal to M.
  • for example, L is equal to one. That is, only the first-level directory of the M-level directory is kept, followed by the number of subsequent sub-directories.
  • the directory C:\aaa\bbb\ccc\ddd is normalized to C:\AAA\:3.
  • saying that the M-level directory conforms to the first regular expression is equivalent to saying that the M-level directory includes a special directory, such as the first directory.
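  • the directory rule can be sketched as follows. The two entries in `SPECIAL_DIRECTORIES` are hypothetical stand-ins for Table 1 (only the `windir` identifier appears in the examples of this description), the function name and the `keep_levels` parameter (playing the role of L) are illustrative, and the sketch assumes an absolute Windows path and keeps the original letter case.

```python
import re

# Hypothetical stand-in for Table 1: (regular expression, identifier).
SPECIAL_DIRECTORIES = [
    (re.compile(r"^c:\\windows(?=\\|$)", re.IGNORECASE), "windir"),
    (re.compile(r"^c:\\program files(?=\\|$)", re.IGNORECASE), "programfiles"),
]


def normalize_directory(directory, keep_levels=1):
    """Normalize a directory such as C:\\aaa\\bbb into prefix plus a level count."""
    directory = directory.rstrip("\\")
    for pattern, identifier in SPECIAL_DIRECTORIES:
        match = pattern.match(directory)
        if match:
            # Special directory: replace the matched levels with the identifier
            # and append the number of remaining levels (omitted when zero).
            rest = directory[match.end():].strip("\\")
            remaining = len(rest.split("\\")) if rest else 0
            return identifier if remaining == 0 else f"{identifier}\\:{remaining}"
    # Non-special directory: keep the drive and the first keep_levels levels.
    parts = directory.split("\\")
    drive, levels = parts[0], parts[1:]
    kept = levels[:keep_levels]
    remaining = len(levels) - len(kept)
    prefix = "\\".join([drive] + kept)
    return prefix if remaining == 0 else f"{prefix}\\:{remaining}"


print(normalize_directory(r"C:\aaa\bbb\ccc\ddd"))
print(normalize_directory(r"C:\Windows\System32\drivers"))
```

  • omitting the count when no levels remain is a choice made so that `c:\windows` maps to plain `windir`, matching the `windir\sysname.exe` example given for file names.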
  • the file name includes the file main name and the file extension, separated by the separator ".".
  • when the file main name matches a predetermined file main name feature, the file main name is replaced with the second identifier, where the mapping relationship formed by the predetermined file main name feature and the second identifier is one group in the second type of mapping relationship.
  • the second type of mapping relationship includes at least one mapping relationship formed by a feature that a file main name matches and an identifier.
  • file main names can be divided into two categories: system file main names and non-system file main names.
  • when the main name of a file that is read or generated is the same as the main name of a file of the operating system itself, the file main name is replaced with the identifier "sysname"; otherwise it is replaced with the identifier "normal".
  • other features can also be used as the criterion. For example, if the file main name contains spaces, it is replaced with the identifier "space". If the accessed file was created by the program file itself while it was running, the file main name is replaced with the identifier "SelfCreated". In other cases, such as when the file main name contains no spaces and the file was not created by the program file itself, the file main name is replaced with the identifier "any".
  • when the file extension is not in the preset file extension list, it is replaced with the preset third identifier.
  • for example, if the file extension is in the preset file extension list, no replacement is required; if it is not, the file extension is replaced with the identifier "UnStd".
  • for example, notepad is a system file main name and exe is a file extension in the preset list, so the path c:\windows\notepad.exe can be normalized to windir\sysname.exe.
  • aaa is a non-system file main name and rty is not in the preset file extension list, so the path c:\windows\aaa.rty can be normalized to windir\normal.UnStd.
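  • the file name rule with the "sysname"/"normal" criterion can be sketched as below. The contents of `SYSTEM_FILE_NAMES` and `STANDARD_EXTENSIONS` are hypothetical placeholders for the operating system's own file names and the preset extension list, and the handling of a name without any extension is an assumption.

```python
# Hypothetical placeholders; real lists would come from the operating system
# and from the preset file extension list.
SYSTEM_FILE_NAMES = {"notepad", "cmd", "explorer"}
STANDARD_EXTENSIONS = {"exe", "dll", "sys", "txt"}


def normalize_file_name(file_name):
    """Replace the main name with sysname/normal and unknown extensions with UnStd."""
    main_name, dot, extension = file_name.rpartition(".")
    if not dot:
        # No separator found: the whole string is the main name (assumption).
        main_name, extension = file_name, ""
    main_id = "sysname" if main_name.lower() in SYSTEM_FILE_NAMES else "normal"
    if extension == "":
        return main_id
    ext_id = extension if extension.lower() in STANDARD_EXTENSIONS else "UnStd"
    return f"{main_id}.{ext_id}"


print(normalize_file_name("notepad.exe"))
print(normalize_file_name("aaa.rty"))
```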
  • the path of the registry is normalized, where the path of the registry includes sub-keys of the registry, such as S levels of sub-keys, and a registry value name (ValueName).
  • when the S-level sub-key matches the third regular expression, the first T levels of the S-level sub-key are retained and the remaining S-T levels are deleted, where the mapping relationship formed by the third regular expression and T is one group in the third type of mapping relationship, the third type of mapping relationship includes at least one mapping relationship formed by a regular expression and an identifier, S is a positive integer, and T is a positive integer less than or equal to S. This applies to the case where the S-level sub-key contains a special sub-key, either as the root sub-key or as part of the sub-key path.
  • the special sub-keys include the sub-keys shown in Table 2, but are not limited thereto.
  • how many sub-key levels after the matching special sub-key are finally retained is determined by the "reserved sub-key levels" value.
  • that is, if the S-level sub-key matches the regular expression of a special sub-key, the first T levels of the S-level sub-key are retained and the remaining S-T levels are deleted.
  • saying that the S-level sub-key conforms to the third regular expression is equivalent to saying that the S-level sub-key includes a special sub-key.
  • for example, if the sub-keys of the registry path are SOFTWARE\Adobe\Acrobat Reader, the path matches item 20 of Table 2, and the number of sub-key levels to be retained is 1, then the normalized registry sub-key is SOFTWARE\Adobe.
  • when a first sub-key in the S-level sub-key is a globally unique identifier (CLSID), the first sub-key is replaced with the fourth identifier.
  • a large number of CLSID sub-keys often appear in registry paths. Such a sub-key has the form {8591DA08-F8AD-333D-83FE-599CDACEB1A0}, which can be expressed by the regular expression {[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}}.
  • for example, if the S-level sub-key of the registry path is CLSID\{8591DA08-F8AD-333D-83FE-599CDACEB1A0}\ProgId, it becomes CLSID\$clsid\ProgId after normalization.
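  • the two registry rules can be sketched together as follows. The single `SOFTWARE` entry in `SPECIAL_SUBKEYS` is a hypothetical stand-in for Table 2, the braces in the CLSID pattern are escaped and matched case-insensitively for use with Python's `re` module, and applying the CLSID replacement before truncation is an assumption about ordering.

```python
import re

# CLSID sub-key such as {8591DA08-F8AD-333D-83FE-599CDACEB1A0}.
CLSID_PATTERN = re.compile(
    r"^\{[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\}$",
    re.IGNORECASE,
)

# Hypothetical stand-in for Table 2:
# (regular expression for the special sub-key, sub-key levels retained after it).
SPECIAL_SUBKEYS = [
    (re.compile(r"^SOFTWARE(?=\\|$)", re.IGNORECASE), 1),
]


def normalize_registry_path(path):
    """Replace CLSID sub-keys with $clsid and truncate below special sub-keys."""
    keys = ["$clsid" if CLSID_PATTERN.match(key) else key for key in path.split("\\")]
    joined = "\\".join(keys)
    for pattern, retained in SPECIAL_SUBKEYS:
        match = pattern.match(joined)
        if match:
            # Levels covered by the special sub-key itself, plus the retained levels.
            prefix_levels = match.group(0).count("\\") + 1
            return "\\".join(keys[:prefix_levels + retained])
    return joined


print(normalize_registry_path(r"SOFTWARE\Adobe\Acrobat Reader"))
print(normalize_registry_path(r"CLSID\{8591DA08-F8AD-333D-83FE-599CDACEB1A0}\ProgId"))
```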
  • the path of the behavior may also include a window name, a process name, and the like; the window name and the process name may also be normalized to reduce the diversity of the path, which is not limited by the embodiment of the present invention.
  • the above normalization methods were obtained by summarizing the behavior of malicious programs over a long period, in other words, from experience.
  • the above normalization of paths can reduce the number of features of a program file.
  • on the one hand, the method of the embodiment of the present invention may reduce the number of elements in the feature vector of one program file.
  • for example, before normalization the feature vector of program file X includes 100 elements, that is, the integer sequence corresponding to program file X includes 100 integers; after normalization, the number of elements may be reduced to 70, that is, the integer sequence corresponding to program file X becomes 70 integers.
  • because some similar behaviors are merged into one feature, the number of features is reduced, the features of the file become more prominent, and the file is easier to cluster or classify together with other files.
  • on the other hand, the method of the embodiment of the present invention may increase the number of identical features shared by two program files.
  • for example, before normalization the feature vector of program file X includes 100 elements and the feature vector of program file Y also includes 100 elements, that is, each corresponding integer sequence includes 100 integers, and the 100 elements of X are assumed to be completely different from the 100 elements of Y. After normalization, each feature vector may still have 100 elements, but some elements of the feature vector of X may now be the same as elements of the feature vector of Y.
  • because some similar behaviors are merged into one feature, the total number of features is also reduced, making it easier to cluster or classify the two program files into one program file category.
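  • both effects can be seen in a toy example. The behavior identifier `CreateFile`, the sample paths, and the simplified rule in `toy_normalize` (keep the drive and first directory, count everything after it) are assumptions made only to illustrate the argument; they are not the full rule set described above.

```python
def toy_normalize(path):
    """Keep the drive and first directory; replace everything after with a level count."""
    parts = path.split("\\")
    return "\\".join(parts[:2]) + "\\:" + str(len(parts) - 2)


def features(behaviors, normalize=None):
    """Build the set of (behavior identifier, path) features for one program file."""
    return {(name, normalize(path) if normalize else path) for name, path in behaviors}


# Two hypothetical program files that write into randomly named sub-directories.
file_x = [("CreateFile", r"C:\temp\a1f3\out.dat"), ("CreateFile", r"C:\temp\9c2e\log.txt")]
file_y = [("CreateFile", r"C:\temp\77b0\out.dat"), ("CreateFile", r"C:\temp\e5d4\log.txt")]

shared_raw = features(file_x) & features(file_y)
shared_normalized = features(file_x, toy_normalize) & features(file_y, toy_normalize)
print(len(shared_raw), len(shared_normalized))
```

  • without normalization the two files share no features; with it, the two behaviors of each file collapse into one feature, and that feature is common to both files.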
  • the method for classifying a program file in the embodiment of the present invention acquires behavior information corresponding to at least two behaviors of the program file, normalizes the path in the behavior information to reduce path diversity, generates a feature vector from the behavior information after path normalization, and determines the category to which the program file belongs according to the feature vector. Because normalization reduces the randomness of the paths, the classification effect for program files is improved.
  • the analyst only needs to randomly select several program files for each category of the program file for in-depth analysis to know whether the program files in this category are suspicious. At the same time, it is also possible to timely discover small program files in categories that are not classified, difficult to be classified, or have a small number of program files, and focus on analyzing the small program files and discovering malicious program files in time. This can effectively reduce the workload of analyzing massive program files and improve the efficiency of analysis.
  • FIG. 11 is a schematic block diagram of a classification apparatus 1100 for a program file of one embodiment of the present invention.
  • the device 1100 can include:
  • the obtaining module 1110 is configured to acquire behavior information corresponding to at least two behaviors performed by the program file while it runs, where each piece of behavior information includes a behavior identifier and a path involved in performing the corresponding behavior;
  • the normalization module 1120 is configured to perform normalization processing on the path in each piece of behavior information acquired by the obtaining module 1110, where the normalization processing is used to reduce path diversity;
  • a generating module 1130 is configured to generate a feature vector according to the at least two pieces of behavior information whose paths have been normalized by the normalization module 1120, where each element of the feature vector corresponds to one piece of behavior information after path normalization, elements corresponding to the same behavior information after path normalization have the same value, and the same behavior information refers to behavior information that has the same behavior identifier and the same path after normalization;
  • the classification module 1140 is configured to determine, according to the feature vector generated by the generating module 1130, the category to which the program file belongs.
  • the device for classifying a program file of the embodiment of the present invention acquires behavior information corresponding to at least two behaviors of the program file, normalizes the path in the behavior information to reduce path diversity, generates a feature vector from the behavior information after path normalization, and determines the category to which the program file belongs according to the feature vector. Because normalization reduces the randomness of the paths, the classification effect for program files is improved, which in turn reduces the workload of staff when identifying malicious program files.
  • the path in the behavior information includes an M-level directory of the file, and the normalization module 1120 may be specifically configured to:
  • when the M-level directory matches the first regular expression, replace the first N levels of the M-level directory with the first identifier and replace the remaining M-N levels with the number M-N, where the mapping relationship formed by the first regular expression and the first identifier is one group in the first type of mapping relationship, the first type of mapping relationship includes at least one mapping relationship formed by a regular expression and an identifier, M is a positive integer, and N is a positive integer less than or equal to M;
  • when the M-level directory does not match any mapping relationship in the first type of mapping relationship, retain the first L levels of the M-level directory and replace the remaining M-L levels with the number M-L, where L is a positive integer less than or equal to M.
  • the path in the behavior information includes a file main name and a file extension, and the normalization module 1120 is specifically configured to:
  • when the file main name matches a predetermined file main name feature, replace the file main name with the second identifier, where the mapping relationship formed by the predetermined file main name feature and the second identifier is one group in the second type of mapping relationship, and the second type of mapping relationship includes at least one mapping relationship formed by a feature that a file main name matches and an identifier;
  • when the file extension is not in the preset file extension list, replace the file extension with the preset third identifier.
  • the path in the behavior information includes an S-level sub-key of the registry, and the normalization module 1120 is specifically configured to:
  • when the S-level sub-key matches the third regular expression, retain the first T levels of the S-level sub-key and delete the remaining S-T levels, where the mapping relationship formed by the third regular expression and T is one group in the third type of mapping relationship, the third type of mapping relationship includes at least one mapping relationship formed by a regular expression and an identifier, S is a positive integer, and T is a positive integer less than or equal to S;
  • when a first sub-key in the S-level sub-key is a globally unique identifier (CLSID), replace the first sub-key with the fourth identifier.
  • the normalization module 1120 of the above three embodiments may apply normalization processing that is based on a summary of the behavior of malicious programs.
  • the classification module 1140 is specifically configured to:
  • determine, by using a classification algorithm, the similarity between the elements in the feature vector and the cluster center data of each of a plurality of program file categories, where the cluster center data is used to represent the characteristics of the program files in the program file category to which it belongs and includes elements of the feature vectors of the program files in that category;
  • divide the program file into a first program file category, where the similarity between the elements of the feature vector and the cluster center data of the first program file category is higher than the similarity between the elements of the feature vector and the cluster center data of any other program file category among the plurality of program file categories.
  • alternatively, the classification module 1140 is specifically configured to:
  • cluster the feature vector with the feature vectors of a plurality of other program files by using a clustering algorithm to form at least one program file category, where the similarity between elements of the feature vectors of the program files in each program file category is higher than a threshold, and the program file is divided into a first program file category in the at least one program file category.
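  • threshold-based clustering of feature vectors can be sketched with a simple "leader" scheme, in which each vector joins the first category whose leader it resembles closely enough. This is one possible algorithm chosen for brevity, not one named in this description; the Jaccard similarity, the default threshold, and the function names are likewise assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of integer features (assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_cluster(vectors, threshold=0.4):
    """Group feature vectors so that each member is similar to its category's leader.

    Returns a list of categories, each a dict with the leader's features
    ("center") and the indexes of its member vectors ("members").
    """
    categories = []
    for index, vector in enumerate(vectors):
        features = set(vector)
        for category in categories:
            if jaccard(features, category["center"]) > threshold:
                category["members"].append(index)
                break
        else:
            # No existing category is similar enough: start a new one.
            categories.append({"center": features, "members": [index]})
    return categories


result = leader_cluster([[1, 2, 3], [1, 2, 4], [9, 10]])
print([category["members"] for category in result])
```

  • the leader's feature set plays the role of the cluster center data here; a production system would more likely use one of the algorithms listed earlier, such as K-Means or DBSCAN.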
  • the generating module 1130 may be specifically configured to:
  • Each piece of behavior information after path normalization is represented as an integer, where pieces of behavior information that are the same after path normalization are represented as the same integer;
  • Each of the integers is taken as an element of the feature vector, and a plurality of the integers constitute the feature vector.
  • the obtaining module 1110 is specifically configured to:
  • the behavior sequence of the program file is obtained by the sandbox server, and the behavior sequence includes behavior information corresponding to the at least two behaviors respectively.
  • the device 1100 can be the classification server described above.
  • the obtaining module 1110 may be implemented by a network interface, and the normalization module 1120, the generating module 1130, and the classification module 1140 may be implemented by a processor.
  • apparatus 1200 can include a processor 1210, a network interface 1220, and a memory 1230.
  • the memory 1230 can be used to store code and the like executed by the processor 1210.
  • Apparatus 1200 can also include an output device or an output interface 1240 coupled to the output device for outputting a classification result of the program file.
  • Output devices include displays, printers, and the like.
  • the components of apparatus 1200 are coupled together by a bus system 1250 that includes, in addition to the data bus, a power bus, a control bus, and a status signal bus.
  • the apparatus 1100 shown in FIG. 11 or the apparatus 1200 shown in FIG. 12 can implement the various processes implemented in the foregoing embodiments of FIG. 1 to FIG. 10. To avoid repetition, details are not described herein again.
  • the processor may be an integrated circuit chip with signal processing capabilities.
  • each step of the foregoing method embodiment may be completed by an integrated logic circuit of hardware in a processor or an instruction in a form of software.
  • the processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • such a processor may implement or carry out the methods, steps, and logical block diagrams disclosed in the embodiments of the present invention.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present invention may be directly implemented by the hardware decoding processor, or may be performed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a conventional storage medium such as random access memory, flash memory, read only memory, programmable read only memory or electrically erasable programmable memory, registers, and the like.
  • the storage medium is located in the memory, and the processor reads the information in the memory and combines the hardware to complete the steps of the above method.
  • the memory in the embodiments of the present invention may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory.
  • the volatile memory may be a random access memory (RAM) that acts as an external cache.
  • many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
  • the network interface is configured to receive a sequence of behaviors of at least one program file sent from a sandbox server in an enterprise network. Specifically, the network interface may receive the MD5 value corresponding to the program file sent by the sandbox server and the behavior sequence of the program file.
  • the network interface 1220 can be one network interface or multiple network interfaces. The network interface 1220 can receive a behavior sequence sent by one sandbox server, and can also receive behavior sequences sent by multiple sandbox servers.
  • the network interface may be a wired interface, such as a Fiber Distributed Data Interface (FDDI) or a Gigabit Ethernet (GE) interface; the network interface may also be a wireless interface.
  • the embodiment of the invention further provides a classification system for a program file, which can be specifically shown in FIG. 4 .
  • the system for classifying program files may include a client, a sandbox server, and the program file classification device of the embodiment of the present invention, which corresponds to the classification server 416 of FIG. 4. The client includes a monitoring program, which is used to acquire a program file in the client; the sandbox server is configured to receive the program file sent by the monitoring program, generate the behavior sequence of the program file, and provide the generated behavior sequence to the program file classification device, where the behavior sequence includes at least two pieces of behavior information; and the program file classification device is used to perform the program file classification method of the embodiment of the present invention.
  • the system of the program file classification in the embodiment of the present invention can be used in the process of classifying the program file described in the embodiment of the present invention. To avoid repetition, details are not described herein again.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division into units is only a division by logical function.
  • in actual implementation there may be other ways of dividing; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as a standalone product, may be stored in a computer-readable storage medium.
  • the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions used to cause a computer device, such as a personal computer, a server, or a network device, to perform all or part of the steps of the methods described in the embodiments of the present invention.
  • the foregoing storage medium includes: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

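A program file classification method, classification apparatus, and classification system. The system uses an Agent program in a client and a sandbox server to acquire behavior information corresponding to at least two behaviors performed by a program file while it runs, each piece of behavior information including a behavior identifier and the path involved in performing the corresponding behavior (S310). A classification server normalizes the path in each piece of behavior information, the normalization being used to reduce path diversity (S320), generates a feature vector from the at least two pieces of behavior information after path normalization (S330), and determines the category to which the program file belongs according to the feature vector (S340). By normalizing paths, the classification method, apparatus, and system reduce the randomness of the normalized paths, which improves the classification of program files and in turn reduces the workload of identifying malicious program files.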

Description

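Program file classification method, classification apparatus, and classification system
This application claims priority to Chinese Patent Application No. 201610052993.6, filed with the Chinese Patent Office on January 26, 2016 and entitled "Program file classification method, classification apparatus, and classification system", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of the present invention relate to the computer field, and more specifically, to a program file classification method, classification apparatus, and classification system.
Background
In advanced persistent threat (APT) attacks inside enterprises, malicious program files account for a large proportion. Therefore, in APT defense, identifying and discovering malicious program files in time is a very important technique. However, for medium and large enterprises, the number of program files on the internal network is so large that in-depth analysis of each one would require an enormous amount of work and is basically infeasible. To solve this problem, the prior art proposes classifying the program files within an enterprise by using the behaviors of the program files while they run. Analyzing each category of program files, rather than each individual program file, can reduce the workload to some extent.
Classifying the program files within an enterprise specifically means using the behaviors of program files while they run as the feature vectors for machine clustering and classification, so that the large number of program files in the enterprise are divided into several program file categories. An analyst then only needs to randomly select a few program files from each category for in-depth analysis to learn whether the program files in that category are suspicious. In addition, niche program files that are unclassified, hard to classify, or in categories containing few program files can be discovered in time and analyzed with priority, so that malicious program files are found promptly. This effectively reduces the workload of analyzing massive numbers of program files and improves analysis efficiency.
In the prior art, program files are classified by directly using the behavior information in the behavior sequence of a program file, such as string information, to form a feature vector for computation. Because the parameter information of the paths involved in the various behavior information is highly random, the feature vectors differ greatly from one another and the classification effect is poor, so the similarity of behavior information cannot be used effectively for clustering and classification.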
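Summary
Embodiments of the present invention provide a program file classification method, classification apparatus, and classification system that can improve the effect of program file classification and thereby reduce the workload of identifying malicious program files.
According to a first aspect, the present invention provides a program file classification method, including: acquiring behavior information respectively corresponding to at least two behaviors performed by a program file while it runs, where each piece of behavior information includes a behavior identifier and a path involved in performing the corresponding behavior; normalizing the path in each piece of behavior information, where the normalization is used to reduce path diversity; generating a feature vector according to the at least two pieces of behavior information after path normalization, where each element of the feature vector corresponds to one piece of behavior information after path normalization, elements corresponding to the same behavior information after path normalization have the same value, and the same behavior information refers to behavior information that has the same behavior identifier and the same path after normalization; and determining, according to the feature vector, the category to which the program file belongs.
The program file classification method of the present invention acquires behavior information respectively corresponding to at least two behaviors of a program file, normalizes the path in the behavior information to reduce path diversity, generates a feature vector from the behavior information after path normalization, and determines the category to which the program file belongs according to the feature vector. Because normalization reduces the randomness of the paths, the classification effect for program files is improved, which in turn reduces the workload of staff when identifying malicious program files.
Specific normalization processing may include the following.
The path in the behavior information includes an M-level directory of a file, and normalizing the path in each piece of behavior information includes: if the M-level directory matches a first regular expression, replacing the first N levels of the M-level directory with a first identifier and replacing the remaining M-N levels of the M-level directory with the number M-N, where the mapping relationship formed by the first regular expression and the first identifier is one group in a first type of mapping relationship, the first type of mapping relationship includes at least one mapping relationship formed by a regular expression and an identifier, M is a positive integer, and N is a positive integer less than or equal to M; and if the M-level directory does not match any mapping relationship in the first type of mapping relationship, retaining the first L levels of the M-level directory and replacing the remaining M-L levels of the M-level directory with the number M-L, where L is a positive integer less than or equal to M.
It should be understood that saying the M-level directory matches the first regular expression is equivalent to saying that the M-level directory includes a special directory, for example a first directory.
The path in the behavior information includes a file main name and a file extension, and normalizing the path in each piece of behavior information includes: if the file main name matches a predetermined file main name feature, replacing the file main name with a second identifier, where the mapping relationship formed by the predetermined file main name feature and the second identifier is one group in a second type of mapping relationship, and the second type of mapping relationship includes at least one mapping relationship formed by a feature that a file main name matches and an identifier; and if the file extension is not in a preset file extension list, replacing the file extension with a preset third identifier.
The path in the behavior information includes an S-level sub-key of the registry, and normalizing the path in each piece of behavior information includes: if the S-level sub-key matches a third regular expression, retaining the first T levels of the S-level sub-key and deleting the remaining S-T levels of the S-level sub-key, where the mapping relationship formed by the third regular expression and T is one group in a third type of mapping relationship, the third type of mapping relationship includes at least one mapping relationship formed by a regular expression and an identifier, S is a positive integer, and T is a positive integer less than or equal to S; and if a first sub-key in the S-level sub-key is a globally unique identifier (CLSID), replacing the first sub-key with a fourth identifier.
It should be understood that saying the S-level sub-key matches the third regular expression is equivalent to saying that the S-level sub-key includes a special sub-key.
The above normalization methods were obtained by summarizing the behavior of malicious programs over a long period, in other words, from experience. The above normalization of paths can reduce the number of features of a program file. On the one hand, it may reduce the number of elements in the feature vector of one program file. On the other hand, it may increase the number of identical features shared by any two program files, so that program files are classified more effectively.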
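With reference to the first aspect, in a possible implementation, determining, according to the feature vector, the category to which the program file belongs includes: determining, by a classification algorithm, the similarity between the elements of the feature vector and the cluster-center data of each of multiple program file categories, where the cluster-center data represents the features of the program files in its category and includes elements of the feature vectors of the program files in that category; and assigning the program file to a first program file category, where the similarity between the elements of the feature vector and the cluster-center data of the first program file category is higher than the similarity between the elements of the feature vector and the cluster-center data of any other of the multiple program file categories.
It should be understood that the cluster-center data in the present invention may be obtained, when multiple program files are clustered, by applying a clustering algorithm to the feature vectors of those program files, and represents the features of the program files in the corresponding category.
With reference to the first aspect, in another possible implementation, determining, according to the feature vector, the category to which the program file belongs includes: clustering, by a clustering algorithm, the feature vector together with the feature vectors of multiple other program files to form at least one program file category, where the similarity among the elements of the feature vectors of the program files in each category is higher than a threshold, and the program file is assigned to a first program file category among the at least one program file category.
With reference to the first aspect, in a possible implementation, generating the feature vector according to the at least two pieces of behavior information whose paths have been normalized includes: representing each piece of behavior information whose path has been normalized as an integer, where identical behavior information after path normalization is represented as the same integer; and using each such integer as one element of the feature vector, the multiple integers forming the feature vector.
With reference to the first aspect, in a possible implementation, obtaining the behavior information corresponding to each of at least two behaviors performed by the program file during running includes: obtaining the program file by using a monitoring program; and obtaining a behavior sequence of the program file by using a sandbox server, where the behavior sequence includes the behavior information corresponding to each of the at least two behaviors.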
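A second aspect provides a program file classification apparatus, including: an obtaining module, configured to obtain behavior information corresponding to each of at least two behaviors performed by a program file during running, where each piece of behavior information includes a behavior identifier and a path involved when the corresponding behavior is performed; a normalization module, configured to normalize the path in each piece of behavior information obtained by the obtaining module, where the normalization is used to reduce the diversity of paths; a generation module, configured to generate a feature vector according to the at least two pieces of behavior information whose paths have been normalized by the normalization module, where each element of the feature vector corresponds to one piece of behavior information whose path has been normalized, elements corresponding to identical behavior information after path normalization have the same value, and identical behavior information means behavior information that has the same behavior identifier and the same normalized path; and a classification module, configured to determine, according to the feature vector generated by the generation module, the category to which the program file belongs.
It should be understood that the modules of the program file classification apparatus of the second aspect may be used to implement the program file classification method of the first aspect and its possible implementations; details are not repeated here.
In a specific implementation of the second aspect, the program file classification apparatus of the second aspect is a classification server.
A third aspect provides a program file classification system, including a client, a sandbox server, and a program file classification apparatus. The client includes a monitoring program, by which program files on the client are obtained. The sandbox server is configured to receive the program file sent by the monitoring program, generate a behavior sequence of the program file, and provide the generated behavior sequence to the program file classification apparatus, where the behavior sequence includes behavior information corresponding to each of at least two behaviors. The program file classification apparatus is configured to perform the method according to the first aspect and its implementations.
In the present invention, a program file may also be called an executable program file.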
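Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of behavior information corresponding to multiple behaviors performed by a program file during running.
FIG. 2A and FIG. 2B are schematic diagrams of the hash values of two groups of program files and of the results of abstracting the behavior sequences of those program files during running into feature vectors.
FIG. 3 is a schematic flowchart of a program file classification method according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of a program file classification system according to an embodiment of the present invention.
FIG. 5 is a schematic flowchart of generating a feature vector according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of implementing a clustering function according to an embodiment of the present invention.
FIG. 7 is a schematic flowchart of implementing a clustering function according to an embodiment of the present invention.
FIG. 8 is a schematic diagram of implementing a classification function according to an embodiment of the present invention.
FIG. 9 is a schematic flowchart of implementing a classification function according to an embodiment of the present invention.
FIG. 10 is a schematic flowchart of a program file classification method according to another embodiment of the present invention.
FIG. 11 is a schematic block diagram of a program file classification apparatus according to an embodiment of the present invention.
FIG. 12 is a schematic block diagram of a program file classification apparatus according to another embodiment of the present invention.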
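Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Some techniques involved in program file classification are first introduced briefly. It should be understood that a program file in this document is also called an executable program file.
During running, a program file issues system service requests to the operating system. A system service request may also be called an application programming interface (API) call. API calls may include file reads and writes, memory allocation, network input/output (IO), operations on hardware devices, reads and writes of system configuration, and so on; the embodiments of the present invention place no limitation on them. In the industry, the API calls made by a program file during running are called the "behaviors" of the program file, and the sequence of API calls is called its "behavior sequence". A behavior sequence includes multiple pieces of behavior information, each corresponding to one behavior and each including the path involved when the corresponding behavior is performed.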
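FIG. 1 is a schematic diagram of behavior information corresponding to multiple behaviors performed by a program file during running. For example, for the behavior information <action arg1="HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Drivers32\msacm.msgsm610">RegQueryValue</action>, the path involved in performing the behavior is HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Drivers32\msacm.msgsm610, in which the registry sub-keys at each level are HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Drivers32 and the registry value name is msacm.msgsm610. The behavior identifier is RegQueryValue, and the specific behavior is execution of the registry operation function RegQueryValue.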
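The following is a minimal sketch of how one such record might be split into a behavior identifier and a path. It assumes that every record has the same shape as the single example quoted from FIG. 1; the names `BehaviorInfo`, `parse_behavior`, and `split_registry_path` are hypothetical and do not come from the source.

```python
import re
from typing import NamedTuple, Tuple


class BehaviorInfo(NamedTuple):
    behavior_id: str  # e.g. the API name RegQueryValue
    path: str         # path involved when the behavior is performed


# Assumed record layout: <action arg1="PATH">BEHAVIOR_ID</action>.
# Both straight and typographic closing quotes are accepted around the path.
_ACTION_RE = re.compile(
    r'<action\s+arg1=["\u201d](?P<path>.*?)["\u201d]\s*>(?P<id>[^<]+)</action>'
)


def parse_behavior(record: str) -> BehaviorInfo:
    """Extract (behavior identifier, path) from one behavior record."""
    m = _ACTION_RE.search(record)
    if m is None:
        raise ValueError(f"unrecognized behavior record: {record!r}")
    return BehaviorInfo(m.group("id").strip(), m.group("path"))


def split_registry_path(path: str) -> Tuple[str, str]:
    """Split a registry path into (sub-keys, value name) at the last backslash."""
    subkeys, _, value_name = path.rpartition("\\")
    return subkeys, value_name
```

Applied to the record quoted above, `parse_behavior` yields the identifier `RegQueryValue`, and `split_registry_path` separates the sub-keys ending in `Drivers32` from the value name `msacm.msgsm610`.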
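A program file usually has several different versions. A new version may fix bugs or improve functions, but the basic functions that the different versions implement are the same. The flows and behaviors of these different versions when run in the same environment are therefore also very close.
The embodiments of the present invention exploit this property: the behaviors of program files during running are used to analyze the similarity of program files statistically, and program files with similar behaviors are placed in one category. FIG. 2A and FIG. 2B each show, for a group of program files, the hash values, for example MD5 (Message Digest Algorithm 5) values, and the results of abstracting the behavior sequences of the program files during running into feature vectors. The left column holds the MD5 values, each representing one program file, and the right side holds the feature vector of that program file. It can be seen that within each of the two groups the MD5 values all differ, but the feature vectors are very close.
In the prior art, however, program files are classified by directly using the behavior information in the behavior sequence to form the feature vector used in computation. Because the paths involved in the behaviors are highly random, the feature vectors of program files differ greatly and the classification result is poor, so the similarity of behavior information cannot be used effectively for clustering and classification.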
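FIG. 3 is a schematic flowchart of a program file classification method 300 according to an embodiment of the present invention. The method 300 may include:
S310: Obtain behavior information corresponding to each of at least two behaviors performed by a program file during running, where each piece of behavior information includes a behavior identifier and a path involved when the corresponding behavior is performed.
S320: Normalize the path in each piece of behavior information, where the normalization is used to reduce the diversity of paths.
S330: Generate a feature vector according to the at least two pieces of behavior information whose paths have been normalized, where each element of the feature vector corresponds to one piece of behavior information whose path has been normalized, elements corresponding to identical behavior information after path normalization have the same value, and identical behavior information means behavior information that has the same behavior identifier and the same normalized path.
S340: Determine, according to the feature vector, the category to which the program file belongs.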
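Specifically, obtaining the behavior information in S310 may include: obtaining the program file by using a monitoring program; and obtaining a behavior sequence of the program file by using a sandbox server, where the behavior sequence includes the behavior information corresponding to each of the at least two behaviors.
In a specific example, as shown in FIG. 4, the program file classification system 400 may deploy a monitoring program, such as an Agent program, on each of multiple enterprise clients: Agent program 408, Agent program 410, and Agent program 412 on client 402, client 404, and client 406, respectively. These Agent programs monitor the executable program files on the enterprise clients in real time, especially newly added executable program files. Obtaining executable program files by means of a monitoring program can follow common techniques and is not described in detail here.
The Agent program on a client sends program files to a remote sandbox server 414. The sandbox server 414 continuously receives program files from different clients and generates the behavior sequence of each program file. As shown in FIG. 1, the behavior sequence includes multiple pieces of behavior information, each corresponding to one behavior and each including the path involved when the corresponding behavior is performed. The sandbox server 414 outputs the behavior sequence of each program file during running to a classification server 416, so that the classification server 416 can classify the program files. Sandbox technology can likewise follow common techniques and is not described in detail here.
The classification server 416 abstracts the behaviors of a program file and generates one feature vector from the behavior information in the behavior sequence of each program file. In S320, the classification server 416 may normalize the path in each piece of behavior information, where the normalization is used to reduce the diversity of the behavior information. A feature vector is generated according to the multiple pieces of behavior information whose paths have been normalized, where each element of the feature vector corresponds to one piece of behavior information whose path has been normalized, and elements corresponding to identical behavior information after path normalization have the same value. The specific normalization is described further below.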
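Generating the feature vector in S330 according to the at least two pieces of behavior information whose paths have been normalized may include: representing each piece of behavior information whose path has been normalized as an integer, where identical behavior information after path normalization is represented as the same integer; and using each such integer as one element of the feature vector, the multiple integers forming the feature vector.
Specifically, each piece of behavior information whose path has been normalized yields one integer. The behavior information of the multiple behaviors in the behavior sequence of a program file therefore yields multiple integers, which form an integer sequence; this integer sequence is the feature vector. The elements of the feature vector are these integers, and each integer may also be called a feature, representing one behavior of the program file.
In the program file classification method of this embodiment, behavior information corresponding to each of at least two behaviors of a program file is obtained, the paths in the behavior information are normalized to reduce path diversity, a feature vector is generated from the behavior information with normalized paths, and the category of the program file is determined according to the feature vector. Because normalization reduces the randomness of the paths, the classification result is greatly improved.
Specifically, in the prior art the feature vector is generated from behavior information that has not been normalized, and clustering or classifying program files by such feature vectors works poorly. For example, in the prior art, 100 program files to be classified correspond to 100 feature vectors whose elements differ greatly, and clustering or classification yields 99 program file categories. With the method of this embodiment, the 100 program files again correspond to 100 feature vectors, but the differences among their elements are much smaller than in the prior art, so clustering or classification may yield, for example, 5 program file categories. The classification result is thus greatly improved, which reduces the subsequent workload of identifying malicious program files.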
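FIG. 5 is a schematic flowchart of generating a feature vector according to an embodiment of the present invention.
For a program file, S501 is performed: one piece of behavior information in the behavior sequence of the program file is read.
In S502, the path in the behavior information is normalized. The behavior information includes the path involved when the corresponding behavior is performed. That path may include M levels of directories of a file, a main file name, and a file extension; or it may include S levels of sub-keys of a registry and a registry value name. Normalizing the path in each piece of behavior information may mean simplifying the path according to preset normalization rules.
In S503, the feature set is queried to determine whether the behavior information with the normalized path and its corresponding integer already exist; the feature set includes multiple pieces of behavior information and the integer corresponding to each. If they exist, S504 is performed: the integer corresponding to the behavior information is used as an element of the feature vector, and then S507 is performed. If they do not exist, S505 is performed: a new integer representing the behavior information is generated and used as an element of the feature vector; S506 is also performed: the behavior information and its integer are stored in the feature set, and then S507 is performed.
S507: Determine whether any behavior information remains in the behavior sequence. If so, S501 is performed again; if not, the loop ends and the feature vector of the program file has been obtained. S508 is then performed: the MD5 value and the feature vector of the program file are stored.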
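The loop of FIG. 5 can be sketched as a lookup table that hands out integers on first sight. This is only an illustration under stated assumptions: the class name `FeatureSet`, the function `build_feature_vector`, and the choice of consecutive integers starting at 1 are hypothetical; the source only requires that identical normalized behavior information receives the same integer.

```python
from typing import Dict, Iterable, List, Tuple

# A piece of behavior information after path normalization:
# (behavior identifier, normalized path).
Behavior = Tuple[str, str]


class FeatureSet:
    """Shared mapping from normalized behavior information to integers."""

    def __init__(self) -> None:
        self._ids: Dict[Behavior, int] = {}

    def feature_of(self, behavior: Behavior) -> int:
        # S503: look the behavior up; S504: reuse an existing integer;
        # S505/S506: otherwise allocate a new integer and store it.
        if behavior not in self._ids:
            self._ids[behavior] = len(self._ids) + 1
        return self._ids[behavior]


def build_feature_vector(behaviors: Iterable[Behavior],
                         feature_set: FeatureSet) -> List[int]:
    """S501..S507: map every behavior in the sequence to its integer."""
    return [feature_set.feature_of(b) for b in behaviors]
```

Because the feature set is shared across program files, two files that perform the same normalized behavior end up with the same integer in their vectors, which is what makes the vectors comparable.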
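The classification server 416 uses this integer sequence as the feature vector for machine-learning clustering and classification. Internally, the classification server 416 can be divided into two functional parts.
One part implements clustering. Specifically, a clustering algorithm clusters the feature vector together with the feature vectors of multiple other program files to form at least one program file category, where the similarity among the elements of the feature vectors of the program files in each category is higher than a threshold, and the program file is assigned to a first program file category among the at least one program file category.
FIG. 6 is a schematic diagram of implementing the clustering function according to an embodiment of the present invention, and FIG. 7 is the corresponding schematic flowchart. The clustering function may be implemented by a clusterer. In S701, the clusterer reads multiple original, not yet clustered feature vectors, or reads multiple feature vectors that were not successfully clustered or classified within a certain period. In S702, the clusterer applies a clustering algorithm to the feature vectors of the multiple program files, generating multiple clusters and the cluster-center data of each cluster. Each cluster is one program file category, and each program file category has cluster-center data that includes the relevant elements of the feature vectors of the program files in that category. In S703, the cluster-center data of the successfully formed clusters is output and saved for reference in later clustering or classification.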
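As an illustration of S701 to S703, the sketch below uses a simple threshold-based ("leader") clustering with Jaccard similarity over the sets of vector elements. The source does not prescribe this algorithm or this similarity measure; both are assumptions chosen for brevity, as are the names `jaccard`, `centre_of`, and `leader_cluster` and the rule that a centre keeps the elements present in at least half of a cluster's members.

```python
from collections import Counter
from typing import Dict, List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Similarity of two element sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def centre_of(members: List[Set[int]]) -> Set[int]:
    """Cluster-center data: elements present in at least half of the members."""
    counts = Counter(e for m in members for e in m)
    return {e for e, c in counts.items() if 2 * c >= len(members)}


def leader_cluster(vectors: Dict[str, List[int]], threshold: float = 0.5):
    """Assign each vector to the most similar cluster whose centre reaches
    the threshold, otherwise open a new cluster (S702)."""
    clusters = []  # each: {"members": [md5, ...], "sets": [...], "centre": set}
    for md5, vec in vectors.items():
        elems = set(vec)
        best, best_sim = None, threshold
        for c in clusters:
            sim = jaccard(elems, c["centre"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            best = {"members": [], "sets": [], "centre": set()}
            clusters.append(best)
        best["members"].append(md5)
        best["sets"].append(elems)
        best["centre"] = centre_of(best["sets"])  # S703: data to be saved
    return clusters
```

A production system would more likely use one of the established algorithms listed later in the text; the point here is only that each resulting cluster carries centre data that later classification can compare against.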
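The other part implements classification. Specifically, a classification algorithm determines the similarity between the elements of the feature vector and the cluster-center data of each of multiple program file categories, where the cluster-center data represents the features of the program files in its category and includes elements of the feature vectors of the program files in that category. The program file is assigned to a first program file category, where the similarity between the elements of the feature vector and the cluster-center data of the first program file category is higher than the similarity between the elements of the feature vector and the cluster-center data of any other of the multiple program file categories.
It should be understood that the cluster-center data in the embodiments of the present invention may be obtained, when multiple program files are clustered, by applying a clustering algorithm to the feature vectors of those program files, and represents the features of the program files in the corresponding category. The cluster-center data includes multiple elements. These may be all the elements of the feature vectors of the program files in the category, or only some of them, for example the elements that occur most frequently in the feature vectors of all program files in the category. Different clustering algorithms produce cluster-center data containing different elements; the embodiments of the present invention place no limitation on this.
The clustering algorithm used in the embodiments of the present invention may be based on existing algorithms, including but not limited to: partitioning methods such as the K-Means algorithm, the K-Medoids algorithm, and the CLARANS algorithm; hierarchical methods such as the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm, the Clustering Using Representatives (CURE) algorithm, and the Chameleon algorithm; density-based methods such as the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, the OPTICS algorithm, and the DENCLUE algorithm; graph-theoretic clustering; grid-based methods such as the STING algorithm, the CLIQUE algorithm, and the WAVE-CLUSTER algorithm; and model-based methods that are statistical or neural-network based. Details are not repeated here.
It should also be understood that, in the embodiments of the present invention, the program file is assigned to the first program file category because the similarity between the elements of its feature vector and the cluster-center data of the first program file category is the highest. "Highest" means that the similarity between the elements of the feature vector and the cluster-center data of every program file category is computed, and the value for the first program file category is the largest.
FIG. 8 is a schematic diagram of implementing the classification function according to an embodiment of the present invention, and FIG. 9 is the corresponding schematic flowchart. The classification function may be implemented by a category labeler. In S901, the category labeler obtains the feature vector of a program file. Then, in S902, the category labeler uses a classification algorithm to compare the feature vector with the existing cluster-center data, determines which cluster-center data the feature vector is closest to, and assigns the corresponding program file to that program file category.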
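S901 and S902 amount to a nearest-centre lookup. The sketch below assumes Jaccard similarity between the set of vector elements and each centre, plus an optional minimum similarity below which the file is reported as belonging to no known category; the similarity measure, the threshold, and the names `classify` and `jaccard` are assumptions, not part of the source.

```python
from typing import Dict, List, Optional, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Similarity of two element sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def classify(vector: List[int],
             centres: Dict[str, Set[int]],
             min_similarity: float = 0.0) -> Optional[str]:
    """Return the category whose cluster-center data is most similar to the
    vector, or None if no category is similar enough."""
    elems = set(vector)
    best_name, best_sim = None, -1.0
    for name, centre in centres.items():
        sim = jaccard(elems, centre)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_name is None or best_sim < min_similarity:
        return None
    return best_name
```

With two hypothetical centres `{"office": {1, 2, 3, 4, 5}, "updater": {10, 11, 12, 13}}`, the vector `[1, 2, 3, 9]` is closest to `office`, while `[20, 21]` with `min_similarity=0.3` matches no category.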
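The program file classification system 400 may further include a web portal server 418. Through the web portal server 418, the system 400 can show the current classification information of program files in real time to the relevant enterprise staff, for example IT administrators. The staff can pick typical program files from each program file category and analyze these files, together with the niche program files that could not be classified, in depth.
A program file classification method 1000 according to an embodiment of the present invention is described below with reference to FIG. 10. The method 1000 is performed by the program file classification system and includes:
S1001: The Agent program on a client sends a program file to the sandbox server. The program file may be a newly added program file on the client.
S1002: The sandbox server runs the program and outputs the behavior sequence of the program file to the classification server. It should be understood that the behavior sequence may also be called a behavior log; the embodiments of the present invention place no limitation on this.
S1003: The classification server generates, according to the behavior sequence, a feature vector consisting of an integer sequence.
S1004: The MD5 value of the program file and its feature vector are saved for later clustering or classification.
S1005: The category labeler determines, according to the feature vector, whether the program file belongs to a known program file category. If it does, S1009 may be performed; if it does not, S1006 is performed, and S1009 may be performed at the same time.
S1006: Determine whether the number of program files of unknown category has reached a predetermined threshold. If not, the procedure ends; if so, S1007 is performed.
S1007: Perform a clustering operation on the feature vectors of the program files of unknown category.
S1008: Output and save the cluster-center data for use in later classification.
S1009: Show the program file category information through the web portal server. The category information may include the program files in each program file category, and may also include niche program files for which classification or clustering failed.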
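The decision logic of S1005 to S1008 can be sketched as a small stateful object: known categories are tried first, unknown files are queued, and clustering runs once the queue reaches a threshold. Everything named here (`Pipeline`, `submit`, `cluster_fn`, `batch_size`, the `category-N` labels, Jaccard similarity) is a hypothetical illustration; the sandbox, storage (S1004), and portal (S1009) steps are left out.

```python
from typing import Callable, Dict, List, Optional, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class Pipeline:
    def __init__(self,
                 cluster_fn: Callable[[Dict[str, List[int]]], List[Set[int]]],
                 min_similarity: float = 0.5,
                 batch_size: int = 3) -> None:
        self.cluster_fn = cluster_fn          # returns cluster-center data
        self.min_similarity = min_similarity
        self.batch_size = batch_size          # threshold checked in S1006
        self.centres: Dict[str, Set[int]] = {}
        self.unknown: Dict[str, List[int]] = {}

    def submit(self, md5: str, vector: List[int]) -> Optional[str]:
        elems = set(vector)
        # S1005: does the file belong to a known category?
        best = max(self.centres,
                   key=lambda n: jaccard(elems, self.centres[n]),
                   default=None)
        if best is not None and jaccard(elems, self.centres[best]) >= self.min_similarity:
            return best
        # S1006: queue the file and check the threshold.
        self.unknown[md5] = vector
        if len(self.unknown) >= self.batch_size:
            # S1007/S1008: cluster the queued vectors and keep the centres.
            for centre in self.cluster_fn(self.unknown):
                self.centres[f"category-{len(self.centres) + 1}"] = centre
            self.unknown.clear()
        return None
```

Passing the clustering step in as `cluster_fn` keeps the sketch independent of any particular clustering algorithm.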
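In the embodiments of the present invention, each piece of behavior information in the behavior sequence of a program file is uniquely marked as an integer. For example, for one piece of behavior information, the behavior identifier is RegWriteValue, the corresponding behavior is execution of the registry operation function RegWriteValue, and the path involved is HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Control\BootDriverFlags. This behavior information is treated as one independent feature and given an integer number. In this way, a behavior sequence can be converted into an integer sequence, also called a feature vector, for example [1,2,3,4,5].
The paths involved in the behaviors are highly random, which makes the independent features very numerous and too scattered. To solve this problem, the embodiments of the present invention may apply a set of preset normalization rules to the paths in the behavior information. More specifically, the paths involved in the behaviors are simplified according to the preset rules, which reduces the number of independent features.
The specific procedure for normalizing a path may be as follows. Paths include file paths and registry paths.
I. Normalizing or simplifying a file path, where the file path includes the directories of the file (for example, M levels of directories), the main file name, and the file extension.
a) Normalization of the M levels of directories
The classification server may pre-store at least one mapping between a regular expression and an identifier; these mappings are referred to as the first type of mappings. The mapping between a first regular expression and a first identifier is one of the first type of mappings. When the M levels of directories match the first regular expression, the first N levels of directories are replaced with the first identifier and the remaining M-N levels are replaced with the number M-N, where M is a positive integer and N is a positive integer less than or equal to M. This applies when the directories of the file have a special directory as the root or as part of the path.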
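Special directories include, but are not limited to, those shown in Table 1.
Table 1
[Table 1 appears as an image in the original publication (Figure PCTCN2016108901-appb-000001) and is not reproduced here.]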
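Specifically, in one embodiment, each special directory corresponds to an abstracted identifier or name. During normalization, if the directories of the path being normalized match the regular expression of a special directory in the table, the corresponding leading directory levels are replaced with the identifier. In other words, the last directory level that matches the regular expression, together with all levels before it, is replaced as a whole with the identifier, and the number of remaining directory levels is placed after the identifier.
For example, according to the mappings given in Table 1, the directory C:\Documents and Setting\administrator\Temporary Internet Files\aaa\bbb can be converted to ietemp\:2.
The reason is that, when examining a program file, we usually care only about which key system directories it accesses, not about the specific subdirectories. The directory levels in a file path are simplified accordingly.
When the M levels of directories in a path match none of the first type of mappings, that is, when the directory is not a special directory, the first L levels of directories are retained and the remaining M-L levels of directories are replaced with the number M-L, where L is a positive integer less than or equal to M. Preferably, L equals 1; that is, only the first-level directory and the number of subsequent subdirectory levels are kept. For example, the directory C:\aaa\bbb\ccc\ddd is normalized to C:\AAA\:3.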
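A minimal sketch of rule a) follows. Table 1 is not available in this text, so the two entries in `SPECIAL_DIRS` are hypothetical stand-ins: only the identifiers `ietemp` and `windir` appear in the source's examples, and the regular expressions are guesses that happen to cover those examples. The sketch keeps the original letter case of the retained directory, whereas the source's example writes it in upper case; how case is handled is not specified.

```python
import re

# Hypothetical stand-in for Table 1: (regular expression, identifier).
# The lookahead makes sure a match ends at a directory boundary.
SPECIAL_DIRS = [
    (re.compile(r"^[a-z]:\\documents and settings?\\[^\\]+"
                r"\\temporary internet files(?=\\|$)", re.IGNORECASE), "ietemp"),
    (re.compile(r"^[a-z]:\\windows(?=\\|$)", re.IGNORECASE), "windir"),
]


def normalize_directory(directory: str, keep_levels: int = 1) -> str:
    """Replace a special prefix by its identifier, or keep the first
    `keep_levels` levels (L in the text); append the count of dropped levels."""
    directory = directory.rstrip("\\")
    for pattern, ident in SPECIAL_DIRS:
        m = pattern.match(directory)
        if m:
            remaining = [p for p in directory[m.end():].split("\\") if p]
            return f"{ident}\\:{len(remaining)}"
    parts = directory.split("\\")
    drive, levels = parts[0], parts[1:]
    kept = levels[:keep_levels]
    return "\\".join([drive] + kept) + f"\\:{len(levels) - len(kept)}"
```

With these assumed rules, the Temporary Internet Files example from the text becomes `ietemp\:2`, and `C:\aaa\bbb\ccc\ddd` becomes `C:\aaa\:3`.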
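It should be understood that the M levels of directories matching the first regular expression can equivalently be understood as the M levels of directories including a particular special directory, for example a first directory.
It should be understood that a file name consists of a main file name and a file extension, separated by the delimiter ".".
b) Normalization of the main file name
When the main file name matches a predetermined main-file-name feature, the main file name is replaced with a second identifier, where the mapping between the predetermined main-file-name feature and the second identifier is one of a second type of mappings. The second type of mappings includes at least one mapping between a feature matched by a main file name and an identifier.
Specifically, in one embodiment, main file names fall into two classes: system main file names and non-system main file names.
For example, if the main file name of a file that the program file reads or creates during running is the same as the main file name of a file belonging to the operating system itself, the main file name is replaced with the identifier "sysname"; otherwise it is replaced with the identifier "normal".
Other features may also be used as criteria. For example, if the main file name contains a space, it is replaced with the identifier "space". If the accessed file was created by the program file itself during running, the main file name is replaced with the identifier "SelfCreated". In other cases, for example when the main file name contains no space and the file was not created by the program file itself, the main file name is replaced with the identifier "any".
c) Normalization of the file extension
When the file extension is not in the preset file extension list, the file extension is replaced with a preset third identifier.
Specifically, in one embodiment, if the file extension is in the preset file extension list, no replacement is needed. If it is not in the list, the file extension is replaced with the identifier "UnStd".
For example, notepad is a system main file name and exe is in the preset file extension list, so the path c:\windows\notepad.exe can be normalized to windir\sysname.exe. As another example, aaa is a non-system main file name and rty is not in the preset file extension list, so the path c:\windows\aaa.rty can be normalized to windir\normal.UnStd.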
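Rules b) and c) can be sketched as two set lookups. The contents of `SYSTEM_MAIN_NAMES` and `KNOWN_EXTENSIONS` below are hypothetical placeholders, since the source does not list them; only the identifiers `sysname`, `normal`, and `UnStd` come from the text. Returning the main-name identifier alone for a file without an extension is also an assumption.

```python
# Hypothetical placeholder lists; the real lists are not given in the source.
SYSTEM_MAIN_NAMES = {"notepad", "explorer", "kernel32"}
KNOWN_EXTENSIONS = {"exe", "dll", "sys", "txt"}


def normalize_file_name(file_name: str) -> str:
    """Normalize 'main.ext' using the sysname/normal and UnStd identifiers."""
    main, dot, ext = file_name.rpartition(".")
    if not dot:
        # No "." separator: the whole name is the main file name.
        main, ext = file_name, ""
    main_id = "sysname" if main.lower() in SYSTEM_MAIN_NAMES else "normal"
    if not ext:
        return main_id
    ext_id = ext if ext.lower() in KNOWN_EXTENSIONS else "UnStd"
    return f"{main_id}.{ext_id}"
```

Under these assumed lists, `notepad.exe` becomes `sysname.exe` and `aaa.rty` becomes `normal.UnStd`, matching the two examples in the text.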
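II. Normalizing a registry path, where the registry path includes the sub-keys of the registry (for example, S levels of sub-keys) and the registry value name (ValueName).
d) Normalization of the S levels of registry sub-keys
When the S levels of sub-keys match a third regular expression, the first T levels of sub-keys are retained and the remaining S-T levels of sub-keys are deleted, where the mapping between the third regular expression and T is one of a third type of mappings, the third type of mappings includes at least one mapping between a regular expression and an identifier, S is a positive integer, and T is a positive integer less than or equal to S. This applies when the S levels of sub-keys have a special sub-key as the root sub-key or as part of the sub-keys.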
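Special sub-keys include, but are not limited to, those shown in Table 2.
Table 2
[Table 2 appears as images in the original publication (Figures PCTCN2016108901-appb-000002 to PCTCN2016108901-appb-000004) and is not reproduced here.]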
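Specifically, in one embodiment, when the S levels of sub-keys include one of the special sub-keys above, the "number of sub-key levels to retain" determines how many levels of sub-keys following that special sub-key are finally retained. In other words, when the S levels of sub-keys match a regular expression for sub-keys, the first T levels of sub-keys are retained and the remaining S-T levels of sub-keys are deleted.
It should be understood that the S levels of sub-keys matching the third regular expression can equivalently be understood as the S levels of sub-keys including a particular special sub-key.
For example, the registry sub-keys are SOFTWARE\Adobe\Acrobat Reader. This path matches item 20 of Table 2, which requires retaining 1 sub-key level, so the normalized registry sub-keys are SOFTWARE\Adobe.
Sub-keys that do not appear in Table 2 are non-special sub-keys, and their full path is always retained.
In other cases, for example when a first sub-key among the S levels of sub-keys is a globally unique identifier CLSID, the first sub-key is replaced with a fourth identifier.
Specifically, in one embodiment, registry paths often contain a large number of CLSID-type sub-keys. Such a sub-key may take the form {8591DA08-F8AD-333D-83FE-599CDACEB1A0}, which can be expressed by the regular expression \{[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\} (as written, this expression covers lowercase hexadecimal digits, so matching the uppercase example requires case-insensitive matching). When a sub-key at some level among the S levels is of this form, the name of that sub-key is replaced with the identifier "$clsid".
For example, when the S levels of sub-keys of a registry path are CLSID\{8591DA08-F8AD-333D-83FE-599CDACEB1A0}\ProgId, the normalized result may be CLSID\$clsid\ProgId.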
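Rule d) can be sketched as a truncation step followed by a CLSID substitution. Table 2 is not available in this text, so `SPECIAL_SUBKEYS` holds a single hypothetical entry that reproduces the `SOFTWARE\Adobe` example (retain one level after the special sub-key). The CLSID pattern is the one quoted in the text, compiled case-insensitively so that it matches the uppercase example; the function name `normalize_subkeys` is an assumption.

```python
import re

# Pattern quoted in the text, anchored to a whole sub-key name.
CLSID_RE = re.compile(
    r"^\{[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\}$",
    re.IGNORECASE,
)

# Hypothetical stand-in for Table 2:
# (regular expression for the special sub-key, levels to retain after it).
SPECIAL_SUBKEYS = [
    (re.compile(r"^SOFTWARE(?=\\|$)", re.IGNORECASE), 1),
]


def normalize_subkeys(subkeys: str) -> str:
    """Truncate below special sub-keys and replace CLSID-shaped sub-keys."""
    for pattern, keep_after in SPECIAL_SUBKEYS:
        m = pattern.match(subkeys)
        if m:
            following = [p for p in subkeys[m.end():].split("\\") if p]
            subkeys = "\\".join([m.group(0)] + following[:keep_after])
            break
    # Non-special paths are kept in full; only CLSID sub-keys are replaced.
    parts = ["$clsid" if CLSID_RE.match(p) else p for p in subkeys.split("\\")]
    return "\\".join(parts)
```

Under these assumptions, `SOFTWARE\Adobe\Acrobat Reader` is reduced to `SOFTWARE\Adobe`, and the CLSID example from the text becomes `CLSID\$clsid\ProgId`.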
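e) The registry value name (ValueName) is retained.
It should be understood that the number of "\" characters in the regular expressions in the embodiments of the present invention is merely illustrative, as is the form of the regular expressions. The presentation of Table 1 and Table 2 is also illustrative, and their content may be expressed in other equivalent forms; the embodiments of the present invention place no limitation on this.
It should also be understood that the path of a behavior may further include a window name, a process name, and the like, which may be normalized in a corresponding way to reduce path diversity; the embodiments of the present invention place no limitation on this.
It should also be understood that the normalization measures above were obtained by summarizing the behaviors of malicious programs over a long period, that is, they are based on experience. The normalization of paths described above reduces the number of features of program files.
First, the method of the embodiments of the present invention may reduce the number of elements in the feature vector of a single program file. For example, without normalization the feature vector of program file X includes 100 elements, that is, the integer sequence of program file X includes 100 integers; after normalization, the number of elements may drop to 70, that is, the integer sequence includes 70 integers. For a single program file, some similar behaviors are thus merged into one feature, so the file has fewer features, its characteristics stand out more, and it is easier to cluster or classify with other files.
Second, the method of the embodiments of the present invention may increase the number of features that two program files have in common. For example, without normalization the feature vectors of program file X and program file Y each include 100 elements, that is, each integer sequence includes 100 integers, and suppose the 100 elements of X and the 100 elements of Y are all different. After normalization, the feature vectors of X and Y still include 100 elements each, but some of the 100 elements of X are the same as some of the 100 elements of Y. For the two program files, some similar behaviors are thus merged into one feature, the total number of features also decreases, and the two program files are more easily clustered or classified into the same program file category.
It should be understood that identical elements in the embodiments of the present invention means that the values of the elements are equal, regardless of the positions of the elements in the feature vector or integer sequence.
In the program file classification method of this embodiment, behavior information corresponding to each of at least two behaviors of a program file is obtained, the paths in the behavior information are normalized to reduce path diversity, a feature vector is generated from the behavior information with normalized paths, and the category of the program file is determined according to the feature vector. Because normalization reduces the randomness of the paths, the classification result is greatly improved.
An analyst only needs to randomly sample a few program files from each category for in-depth analysis to learn whether the program files in that category are suspicious. In addition, niche program files can be found promptly: those that are unclassified, hard to classify, or in categories containing few program files. These niche program files are analyzed with priority so that malicious program files are discovered in time. This effectively reduces the workload of analyzing massive numbers of program files and improves analysis efficiency.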
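FIG. 11 is a schematic block diagram of a program file classification apparatus 1100 according to an embodiment of the present invention. The apparatus 1100 may include:
an obtaining module 1110, configured to obtain behavior information corresponding to each of at least two behaviors performed by a program file during running, where each piece of behavior information includes a behavior identifier and a path involved when the corresponding behavior is performed;
a normalization module 1120, configured to normalize the path in each piece of behavior information obtained by the obtaining module 1110, where the normalization is used to reduce the diversity of paths;
a generation module 1130, configured to generate a feature vector according to the at least two pieces of behavior information whose paths have been normalized by the normalization module 1120, where each element of the feature vector corresponds to one piece of behavior information whose path has been normalized, elements corresponding to identical behavior information after path normalization have the same value, and identical behavior information means behavior information that has the same behavior identifier and the same normalized path; and
a classification module 1140, configured to determine, according to the feature vector generated by the generation module 1130, the category to which the program file belongs.
The program file classification apparatus of this embodiment obtains behavior information corresponding to each of at least two behaviors of a program file, normalizes the paths in the behavior information to reduce path diversity, generates a feature vector from the behavior information with normalized paths, and determines the category of the program file according to the feature vector. Because normalization reduces the randomness of the paths, the classification result improves, which in turn reduces the workload of staff identifying malicious program files.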
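Optionally, in an embodiment, the path in the behavior information includes M levels of directories of a file, and the normalization module 1120 may be specifically configured to:
if the M levels of directories match a first regular expression, replace the first N levels of directories with a first identifier and replace the remaining M-N levels with the number M-N, where the mapping between the first regular expression and the first identifier is one of a first type of mappings, the first type of mappings includes at least one mapping between a regular expression and an identifier, M is a positive integer, and N is a positive integer less than or equal to M; and
if the M levels of directories match none of the first type of mappings, retain the first L levels of directories and replace the remaining M-L levels of directories with the number M-L, where L is a positive integer less than or equal to M.
Optionally, in an embodiment, the path in the behavior information includes a main file name and a file extension, and the normalization module 1120 may be specifically configured to:
if the main file name matches a predetermined main-file-name feature, replace the main file name with a second identifier, where the mapping between the predetermined main-file-name feature and the second identifier is one of a second type of mappings, and the second type of mappings includes at least one mapping between a feature matched by a main file name and an identifier; and
if the file extension is not in a preset file extension list, replace the file extension with a preset third identifier.
Optionally, in another embodiment, the path in the behavior information includes S levels of sub-keys of a registry, and the normalization module 1120 may be specifically configured to:
if the S levels of sub-keys match a third regular expression, retain the first T levels of sub-keys and delete the remaining S-T levels of sub-keys, where the mapping between the third regular expression and T is one of a third type of mappings, the third type of mappings includes at least one mapping between a regular expression and an identifier, S is a positive integer, and T is a positive integer less than or equal to S; and
if a first sub-key among the S levels of sub-keys is a globally unique identifier CLSID, replace the first sub-key with a fourth identifier.
The normalization module 1120 in the three embodiments above may deploy normalization based on a summary of the behaviors of malicious programs. The normalization of paths described above reduces the number of features of program files, so that program files can be classified more effectively.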
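Optionally, in an embodiment of the present invention, the classification module 1140 may be specifically configured to:
determine, by a classification algorithm, the similarity between the elements of the feature vector and the cluster-center data of each of multiple program file categories, where the cluster-center data represents the features of the program files in its category and includes elements of the feature vectors of the program files in that category; and
assign the program file to a first program file category, where the similarity between the elements of the feature vector and the cluster-center data of the first program file category is higher than the similarity between the elements of the feature vector and the cluster-center data of any other of the multiple program file categories.
Optionally, in another embodiment of the present invention, the classification module 1140 may be specifically configured to:
cluster, by a clustering algorithm, the feature vector together with the feature vectors of multiple other program files to form at least one program file category, where the similarity among the elements of the feature vectors of the program files in each category is higher than a threshold, and the program file is assigned to a first program file category among the at least one program file category.
Optionally, in the embodiments of the present invention, the generation module 1130 may be specifically configured to:
represent each piece of behavior information whose path has been normalized as an integer, where identical behavior information after path normalization is represented as the same integer; and
use each such integer as one element of the feature vector, the multiple integers forming the feature vector.
Optionally, in the embodiments of the present invention, the obtaining module 1110 may be specifically configured to:
obtain the program file by using a monitoring program; and
obtain a behavior sequence of the program file by using a sandbox server, where the behavior sequence includes the behavior information corresponding to each of the at least two behaviors.
It should be understood that the apparatus 1100 may be the classification server described above.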
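It should be noted that, in the embodiments of the present invention, the obtaining module 1110 may be implemented by a network interface, and the normalization module 1120, the generation module 1130, and the classification module 1140 may be implemented by a processor. As shown in FIG. 12, an apparatus 1200 may include a processor 1210, a network interface 1220, and a memory 1230. The memory 1230 may be used to store code executed by the processor 1210, among other things. The apparatus 1200 may further include an output device, or an output interface 1240 connected to an output device, for outputting the classification result of program files. Output devices include displays, printers, and the like.
The components of the apparatus 1200 are coupled together by a bus system 1250, which includes a power bus, a control bus, and a status signal bus in addition to a data bus.
The apparatus 1100 shown in FIG. 11 or the apparatus 1200 shown in FIG. 12 can implement the processes of the embodiments of FIG. 1 to FIG. 10; to avoid repetition, details are not described again here.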
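It should be noted that the method embodiments of the present invention may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, the steps of the method embodiments may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the methods in combination with its hardware.
It can be understood that the memory in the embodiments of the present invention may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically EPROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described in this document is intended to include, without being limited to, these and any other suitable types of memory.
The network interface is configured to receive the behavior sequence of at least one program file sent by a sandbox server in the enterprise network. Specifically, the network interface may receive the MD5 value of a program file and the behavior sequence of the program file sent by the sandbox server. The network interface 1220 may be a single network interface or multiple network interfaces. It may receive behavior sequences from one sandbox server or from several sandbox servers. The network interface may be a wired interface, such as a Fiber Distributed Data Interface (FDDI) or a Gigabit Ethernet (GE) interface, or it may be a wireless interface.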
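An embodiment of the present invention further provides a program file classification system, which may be as shown in FIG. 4. The system may include a client, a sandbox server, and the program file classification apparatus of the embodiments of the present invention, which corresponds to the classification server 416 in FIG. 4. The client includes a monitoring program, by which program files on the client are obtained. The sandbox server is configured to receive the program file sent by the monitoring program, generate a behavior sequence of the program file, and provide the generated behavior sequence to the program file classification apparatus, where the behavior sequence includes at least two pieces of behavior information. The program file classification apparatus is configured to perform the program file classification method of the embodiments of the present invention.
The program file classification system of this embodiment can implement the processes of the program file classification method described in the embodiments of the present invention; to avoid repetition, details are not described again here.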
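A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed in this document can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on the particular application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementation should not be considered to go beyond the scope of the present invention.
A person skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described again here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present invention in essence, or the part that contributes to the prior art, or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be, for example, a personal computer, a server, or a network device) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited to them. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.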

Claims (16)

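  1. A program file classification method, comprising:
    obtaining behavior information corresponding to each of at least two behaviors performed by a program file during running, wherein each piece of behavior information comprises a behavior identifier and a path involved when the corresponding behavior is performed;
    normalizing the path in each piece of behavior information, wherein the normalization is used to reduce the diversity of paths;
    generating a feature vector according to the at least two pieces of behavior information whose paths have been normalized, wherein each element of the feature vector corresponds to one piece of behavior information whose path has been normalized, elements corresponding to identical behavior information after path normalization have the same value, and identical behavior information means behavior information that has the same behavior identifier and the same normalized path; and
    determining, according to the feature vector, the category to which the program file belongs.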
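  2. The method according to claim 1, wherein the path in the behavior information comprises M levels of directories of a file, and normalizing the path in each piece of behavior information comprises:
    if the M levels of directories match a first regular expression, replacing the first N levels of directories with a first identifier and replacing the remaining M-N levels with the number M-N, wherein the mapping between the first regular expression and the first identifier is one of a first type of mappings, the first type of mappings comprises at least one mapping between a regular expression and an identifier, M is a positive integer, and N is a positive integer less than or equal to M; and
    if the M levels of directories match none of the first type of mappings, retaining the first L levels of directories and replacing the remaining M-L levels of directories with the number M-L, wherein L is a positive integer less than or equal to M.
  3. The method according to claim 1 or 2, wherein the path in the behavior information comprises a main file name and a file extension, and normalizing the path in each piece of behavior information comprises:
    if the main file name matches a predetermined main-file-name feature, replacing the main file name with a second identifier, wherein the mapping between the predetermined main-file-name feature and the second identifier is one of a second type of mappings, and the second type of mappings comprises at least one mapping between a feature matched by a main file name and an identifier; and
    if the file extension is not in a preset file extension list, replacing the file extension with a preset third identifier.
  4. The method according to claim 1, wherein the path in the behavior information comprises S levels of sub-keys of a registry, and normalizing the path in each piece of behavior information comprises:
    if the S levels of sub-keys match a third regular expression, retaining the first T levels of sub-keys and deleting the remaining S-T levels of sub-keys, wherein the mapping between the third regular expression and T is one of a third type of mappings, the third type of mappings comprises at least one mapping between a regular expression and an identifier, S is a positive integer, and T is a positive integer less than or equal to S; and
    if a first sub-key among the S levels of sub-keys is a globally unique identifier CLSID, replacing the first sub-key with a fourth identifier.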
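  5. The method according to any one of claims 1 to 4, wherein determining, according to the feature vector, the category to which the program file belongs comprises:
    determining, by a classification algorithm, the similarity between the elements of the feature vector and the cluster-center data of each of multiple program file categories, wherein the cluster-center data represents the features of the program files in its category and comprises elements of the feature vectors of the program files in that category; and
    assigning the program file to a first program file category, wherein the similarity between the elements of the feature vector and the cluster-center data of the first program file category is higher than the similarity between the elements of the feature vector and the cluster-center data of any other of the multiple program file categories.
  6. The method according to any one of claims 1 to 4, wherein determining, according to the feature vector, the category to which the program file belongs comprises:
    clustering, by a clustering algorithm, the feature vector together with the feature vectors of multiple other program files to form at least one program file category, wherein the similarity among the elements of the feature vectors of the program files in each category is higher than a threshold, and the program file is assigned to a first program file category among the at least one program file category.
  7. The method according to any one of claims 1 to 6, wherein generating the feature vector according to the at least two pieces of behavior information whose paths have been normalized comprises:
    representing each piece of behavior information whose path has been normalized as an integer, wherein identical behavior information after path normalization is represented as the same integer; and
    using each such integer as one element of the feature vector, the multiple integers forming the feature vector.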
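  8. A program file classification apparatus, comprising:
    an obtaining module, configured to obtain behavior information corresponding to each of at least two behaviors performed by a program file during running, wherein each piece of behavior information comprises a behavior identifier and a path involved when the corresponding behavior is performed;
    a normalization module, configured to normalize the path in each piece of behavior information obtained by the obtaining module, wherein the normalization is used to reduce the diversity of paths;
    a generation module, configured to generate a feature vector according to the at least two pieces of behavior information whose paths have been normalized by the normalization module, wherein each element of the feature vector corresponds to one piece of behavior information whose path has been normalized, elements corresponding to identical behavior information after path normalization have the same value, and identical behavior information means behavior information that has the same behavior identifier and the same normalized path; and
    a classification module, configured to determine, according to the feature vector generated by the generation module, the category to which the program file belongs.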
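  9. The apparatus according to claim 8, wherein the path in the behavior information comprises M levels of directories of a file, and the normalization module is specifically configured to:
    if the M levels of directories match a first regular expression, replace the first N levels of directories with a first identifier and replace the remaining M-N levels with the number M-N, wherein the mapping between the first regular expression and the first identifier is one of a first type of mappings, the first type of mappings comprises at least one mapping between a regular expression and an identifier, M is a positive integer, and N is a positive integer less than or equal to M; and
    if the M levels of directories match none of the first type of mappings, retain the first L levels of directories and replace the remaining M-L levels of directories with the number M-L, wherein L is a positive integer less than or equal to M.
  10. The apparatus according to claim 8 or 9, wherein the path in the behavior information comprises a main file name and a file extension, and the normalization module is specifically configured to:
    if the main file name matches a predetermined main-file-name feature, replace the main file name with a second identifier, wherein the mapping between the predetermined main-file-name feature and the second identifier is one of a second type of mappings, and the second type of mappings comprises at least one mapping between a feature matched by a main file name and an identifier; and
    if the file extension is not in a preset file extension list, replace the file extension with a preset third identifier.
  11. The apparatus according to claim 8, wherein the path in the behavior information comprises S levels of sub-keys of a registry, and the normalization module is specifically configured to:
    if the S levels of sub-keys match a third regular expression, retain the first T levels of sub-keys and delete the remaining S-T levels of sub-keys, wherein the mapping between the third regular expression and T is one of a third type of mappings, the third type of mappings comprises at least one mapping between a regular expression and an identifier, S is a positive integer, and T is a positive integer less than or equal to S; and
    if a first sub-key among the S levels of sub-keys is a globally unique identifier CLSID, replace the first sub-key with a fourth identifier.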
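  12. The apparatus according to any one of claims 8 to 11, wherein the classification module is specifically configured to:
    determine, by a classification algorithm, the similarity between the elements of the feature vector and the cluster-center data of each of multiple program file categories, wherein the cluster-center data represents the features of the program files in its category and comprises elements of the feature vectors of the program files in that category; and
    assign the program file to a first program file category, wherein the similarity between the elements of the feature vector and the cluster-center data of the first program file category is higher than the similarity between the elements of the feature vector and the cluster-center data of any other of the multiple program file categories.
  13. The apparatus according to any one of claims 8 to 11, wherein the classification module is specifically configured to:
    cluster, by a clustering algorithm, the feature vector together with the feature vectors of multiple other program files to form at least one program file category, wherein the similarity among the elements of the feature vectors of the program files in each category is higher than a threshold, and the program file is assigned to a first program file category among the at least one program file category.
  14. The apparatus according to any one of claims 8 to 13, wherein the generation module is specifically configured to:
    represent each piece of behavior information whose path has been normalized as an integer, wherein identical behavior information after path normalization is represented as the same integer; and
    use each such integer as one element of the feature vector, the multiple integers forming the feature vector.
  15. The apparatus according to any one of claims 8 to 14, wherein the apparatus is a classification server.
  16. A program file classification system, comprising a client, a sandbox server, and a program file classification apparatus, wherein
    the client comprises a monitoring program, by which program files on the client are obtained;
    the sandbox server is configured to receive the program file sent by the monitoring program, generate a behavior sequence of the program file, and provide the generated behavior sequence to the program file classification apparatus, wherein the behavior sequence comprises behavior information corresponding to each of at least two behaviors; and
    the program file classification apparatus is configured to perform the method according to any one of claims 1 to 7.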
PCT/CN2016/108901 2016-01-26 2016-12-07 Program file classification method, classification apparatus, and classification system WO2017128868A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP16887739.7A EP3306493B1 (en) 2016-01-26 2016-12-07 Classification method, device and system for program file
US15/870,545 US10762194B2 (en) 2016-01-26 2018-01-12 Program file classification method, program file classification apparatus, and program file classification system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610052993.6A CN106997367B (zh) 2016-01-26 2016-01-26 Program file classification method, classification apparatus, and classification system
CN201610052993.6 2016-01-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/870,545 Continuation US10762194B2 (en) 2016-01-26 2018-01-12 Program file classification method, program file classification apparatus, and program file classification system

Publications (1)

Publication Number Publication Date
WO2017128868A1 true WO2017128868A1 (zh) 2017-08-03

Family

ID=59397332

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/108901 WO2017128868A1 (zh) 2016-01-26 2016-12-07 Program file classification method, classification apparatus, and classification system

Country Status (4)

Country Link
US (1) US10762194B2 (zh)
EP (1) EP3306493B1 (zh)
CN (1) CN106997367B (zh)
WO (1) WO2017128868A1 (zh)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10621349B2 (en) * 2017-01-24 2020-04-14 Cylance Inc. Detection of malware using feature hashing
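CN108256329B (zh) * 2018-02-09 2022-06-17 杭州义盾信息技术有限公司 Fine-grained RAT program detection method and system based on dynamic behavior, and corresponding APT attack detection method
CN109901869B (zh) * 2019-01-25 2022-03-18 中国电子科技集团公司第三十研究所 Computer program classification method based on a bag-of-words model
CN111666404A (zh) * 2019-03-05 2020-09-15 腾讯科技(深圳)有限公司 File clustering method, apparatus, and device
CN112100618B (zh) * 2019-06-18 2023-12-29 深信服科技股份有限公司 Virus file detection method, system, device, and computer storage medium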
US10846407B1 (en) * 2020-02-11 2020-11-24 Calypso Ai Corp Machine learning model robustness characterization
CN111444961B (zh) * 2020-03-26 2023-08-18 国家计算机网络与信息安全管理中心黑龙江分中心 Method for determining the ownership of Internet websites by means of a clustering algorithm
US11520831B2 (en) * 2020-06-09 2022-12-06 Servicenow, Inc. Accuracy metric for regular expression
US11483375B2 (en) * 2020-06-19 2022-10-25 Microsoft Technology Licensing, Llc Predictive model application for file upload blocking determinations
CN114510511A (zh) * 2020-11-17 2022-05-17 武汉斗鱼鱼乐网络科技有限公司 Method, apparatus, device, and medium for displaying reference relationships between project files and libraries

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567661A (zh) * 2010-12-31 2012-07-11 北京奇虎科技有限公司 Machine-learning-based program identification method and apparatus
CN102779249A (zh) * 2012-06-28 2012-11-14 奇智软件(北京)有限公司 Malicious program detection method and scanning engine
US20140137255A1 (en) * 2011-08-09 2014-05-15 Huawei Technologies Co., Ltd. Method, System, and Apparatus for Detecting Malicious Code

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6754389B1 (en) * 1999-12-01 2004-06-22 Koninklijke Philips Electronics N.V. Program classification using object tracking
CN1360267A (zh) * 2002-01-30 2002-07-24 北京大学 文件分类查找方法
US7461086B1 (en) * 2006-01-03 2008-12-02 Symantec Corporation Run-time application installation application layered system
US9942271B2 (en) * 2005-12-29 2018-04-10 Nextlabs, Inc. Information management system with two or more interactive enforcement points
CN101079851B (zh) 2007-07-09 2011-01-05 华为技术有限公司 邮件类型判断方法、装置及系统
US8713680B2 (en) * 2007-07-10 2014-04-29 Samsung Electronics Co., Ltd. Method and apparatus for modeling computer program behaviour for behavioural detection of malicious program
US20090037440A1 (en) * 2007-07-30 2009-02-05 Stefan Will Streaming Hierarchical Clustering
WO2012071989A1 (zh) 2010-11-29 2012-06-07 北京奇虎科技有限公司 基于机器学习的程序识别方法及装置
ES2755780T3 (es) * 2011-09-16 2020-04-23 Veracode Inc Análisis estático y de comportamiento automatizado mediante la utilización de un espacio aislado instrumentado y clasificación de aprendizaje automático para seguridad móvil
CN103902443B (zh) * 2012-12-26 2017-04-26 华为技术有限公司 一种程序运行性能分析方法及装置
US20140201208A1 (en) * 2013-01-15 2014-07-17 Corporation Symantec Classifying Samples Using Clustering
KR101500512B1 (ko) * 2013-05-15 2015-03-18 소프트캠프(주) 데이터 프로세싱 시스템 보안 장치와 보안방법
US10084817B2 (en) * 2013-09-11 2018-09-25 NSS Labs, Inc. Malware and exploit campaign detection system and method
GB2518452A (en) * 2013-09-24 2015-03-25 Ibm Method for file recovery and client server system
US9489514B2 (en) * 2013-10-11 2016-11-08 Verisign, Inc. Classifying malware by order of network behavior artifacts
CN105765528B (zh) * 2013-11-13 2019-09-24 微软技术许可有限责任公司 具有可配置原点定义的应用执行路径跟踪的方法、系统和介质
TWI528216B (zh) * 2014-04-30 2016-04-01 財團法人資訊工業策進會 隨選檢測惡意程式之方法、電子裝置、及使用者介面
CN105095296B (zh) * 2014-05-16 2020-03-17 小米科技有限责任公司 文件管理方法及装置
CN104392174B (zh) * 2014-10-23 2016-04-06 腾讯科技(深圳)有限公司 应用程序动态行为的特征向量的生成方法及装置
US10708296B2 (en) * 2015-03-16 2020-07-07 Threattrack Security, Inc. Malware detection based on training using automatic feature pruning with anomaly detection of execution graphs
US9787695B2 (en) * 2015-03-24 2017-10-10 Qualcomm Incorporated Methods and systems for identifying malware through differences in cloud vs. client behavior
US20170046510A1 (en) * 2015-08-14 2017-02-16 Qualcomm Incorporated Methods and Systems of Building Classifier Models in Computing Devices
US9992211B1 (en) * 2015-08-27 2018-06-05 Symantec Corporation Systems and methods for improving the classification accuracy of trustworthiness classifiers
US10437986B2 (en) * 2015-12-10 2019-10-08 AVAST Software s.r.o. Distance and method of indexing sandbox logs for mapping program behavior
US10536482B2 (en) * 2017-03-26 2020-01-14 Microsoft Technology Licensing, Llc Computer security attack detection using distribution departure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3306493A4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145113A (zh) * 2018-08-24 2019-01-04 北京桃花岛信息技术有限公司 Machine-learning-based method for predicting student poverty level
CN109145113B (zh) * 2018-08-24 2021-12-21 北京桃花岛信息技术有限公司 Machine-learning-based method for predicting student poverty level

Also Published As

Publication number Publication date
US20180189481A1 (en) 2018-07-05
US10762194B2 (en) 2020-09-01
EP3306493A1 (en) 2018-04-11
CN106997367B (zh) 2020-05-08
CN106997367A (zh) 2017-08-01
EP3306493B1 (en) 2021-08-11
EP3306493A4 (en) 2018-09-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16887739

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE