CN111444144B - File feature extraction method and device - Google Patents

Publication number
CN111444144B
Authority
CN
China
Prior art keywords
file
information
level
target
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010144181.0A
Other languages
Chinese (zh)
Other versions
CN111444144A (en)
Inventor
白敏
白子潘
汪列军
白皓文
刘爽
潘博文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qianxin Technology Group Co Ltd
Secworld Information Technology Beijing Co Ltd
Original Assignee
Qianxin Technology Group Co Ltd
Secworld Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qianxin Technology Group Co Ltd and Secworld Information Technology Beijing Co Ltd
Priority to CN202010144181.0A
Publication of CN111444144A
Application granted
Publication of CN111444144B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164 File meta data generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide a file feature extraction method and device. The method comprises: obtaining a target file to be analyzed, wherein the target file has a multi-level structure and the files in different levels have a subordinate relation; parsing the target file to obtain a hierarchical file information set, wherein the hierarchical file information set comprises the file information of each file in each level and the belonging relation of the files between levels; and determining a feature vector of the target file according to the hierarchical file information set. The method and device can parse the target file in depth to obtain the file information of the files in each level of the multi-level target file, and then determine the corresponding feature vector from that file information. The feature vector is used for feature matching on the target file to judge the file category, thereby achieving in-depth detection of the file and making it difficult for malicious identification information in the file to evade identification.

Description

File feature extraction method and device
Technical Field
The present invention relates to the field of information security detection technologies, and in particular, to a method and an apparatus for extracting file features.
Background
In malicious sample analysis, an unknown file is typically subjected to homology analysis to determine which APT group or which malware family it belongs to. However, to keep malicious samples from being detected, many means are adopted to evade identification by security software; for example, effective identification information is hidden in the inner layers of a file, so that detection of the outer layer alone cannot complete an effective analysis. At present, files are parsed only superficially, so the effective information of a file cannot be extracted, and the file can be neither characterized as malicious nor located.
Disclosure of Invention
Aiming at the problems existing in the prior art, the embodiment of the invention provides a file feature extraction method and device.
In a first aspect, an embodiment of the present invention provides a method for extracting a file feature, including:
obtaining a target file to be analyzed, wherein the target file has a multi-level structure and the files in different levels have a subordinate relation;
parsing the target file to obtain a hierarchical file information set, wherein the hierarchical file information set comprises the file information of each file in each level and the belonging relation of the files between levels;
and determining a feature vector of the target file according to the hierarchical file information set.
Further, the file information of each file in each level includes dynamic behavior information and static file information.
Further, parsing the target file to obtain the hierarchical file information set includes:
acquiring the files of the current to-be-processed level in the target file according to a preset level processing order;
parsing the execution items present in the files of the current to-be-processed level to obtain dynamic behavior information;
parsing the basic items present in the files of the current to-be-processed level to obtain static file information;
and recording the belonging relation between the files of the current to-be-processed level and the files of the adjacent processed level.
Further, parsing the execution items present in the files of the current to-be-processed level to obtain dynamic behavior information includes:
placing the execution items present in the files of the current to-be-processed level into a sandbox for execution, and analyzing the execution process to obtain the dynamic behavior information.
Further, determining the feature vector corresponding to the target file according to the hierarchical file information set includes:
performing numerical feature processing on the dynamic behavior information and the static file information of each file in each level to obtain a corresponding first feature set and second feature set;
and determining the feature vector corresponding to the target file according to the first feature set and second feature set of each file in all levels and the belonging relation of the files between levels.
Further, the numerical feature processing of the dynamic behavior information and the static file information includes:
converting the word information present in the dynamic behavior information and the static file information into numerical form;
extracting the values of the numerical information present in the dynamic behavior information and the static file information.
In a second aspect, an embodiment of the present invention provides a file feature extraction apparatus, including:
an acquisition module, configured to obtain a target file to be analyzed, wherein the target file has a multi-level structure and the files in different levels have a subordinate relation;
a parsing module, configured to parse the target file to obtain a hierarchical file information set, wherein the hierarchical file information set comprises the file information of each file in each level and the belonging relation of the files between levels;
and a processing module, configured to determine a feature vector of the target file according to the hierarchical file information set.
Further, the file information of each file in each level includes dynamic behavior information and static file information.
Further, the parsing module is specifically configured to:
acquire the files of the current to-be-processed level in the target file according to a preset level processing order;
parse the execution items present in the files of the current to-be-processed level to obtain dynamic behavior information;
parse the basic items present in the files of the current to-be-processed level to obtain static file information;
and record the belonging relation between the files of the current to-be-processed level and the files of the adjacent processed level.
Further, when parsing the execution items present in the files of the current to-be-processed level to obtain dynamic behavior information, the parsing module is specifically configured to:
place the execution items present in the files of the current to-be-processed level into a sandbox for execution, and analyze the execution process to obtain the dynamic behavior information.
Further, the processing module is specifically configured to:
perform numerical feature processing on the dynamic behavior information and the static file information of each file in each level to obtain a corresponding first feature set and second feature set;
and determine the feature vector corresponding to the target file according to the first feature set and second feature set of each file in all levels and the belonging relation of the files between levels.
Further, when performing numerical feature processing on the dynamic behavior information and the static file information, the processing module is specifically configured to:
convert the word information present in the dynamic behavior information and the static file information into numerical form;
extract the values of the numerical information present in the dynamic behavior information and the static file information.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the file feature extraction method described above when the program is executed.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a file feature extraction method as described above.
In a fifth aspect, embodiments of the present invention provide a computer program product comprising computer executable instructions, characterized in that the instructions, when executed, are adapted to carry out the steps of a file feature extraction method as described above.
The file feature extraction method and device provided by the embodiments of the invention can parse the target file in depth to obtain the file information of the files in each level of the multi-level target file, and then determine the corresponding feature vector from that file information. The feature vector is used for feature matching on the target file to judge the file category, thereby achieving in-depth detection of the file and making it difficult for malicious identification information in the file to evade identification.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a file feature extraction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a file hierarchy according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a file feature extraction device according to an embodiment of the present invention;
FIG. 4 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As noted in the Background, an unknown file in malicious sample analysis is typically subjected to homology analysis to determine which APT group or which malware family it belongs to. However, to keep malicious samples from being detected, many means are adopted to evade identification by security software; for example, effective identification information is hidden in the inner layers of a file, so that detection of the outer layer alone cannot complete an effective analysis. At present, files are parsed only superficially, so the effective information of a file cannot be extracted, and the file can be neither characterized as malicious nor located.
To this end, FIG. 1 shows a file feature extraction method according to an embodiment of the present invention, including:
S11, obtaining a target file to be analyzed, wherein the target file has a multi-level structure and the files in different levels have a subordinate relation;
S12, parsing the target file to obtain a hierarchical file information set, wherein the hierarchical file information set comprises the file information of each file in each level and the belonging relation of the files between levels;
S13, determining a feature vector of the target file according to the hierarchical file information set.
For steps S11 to S13, it should be noted that, in the embodiments of the present invention, the file types from which features can be extracted include, but are not limited to: Windows executable files, Office documents, Office compound documents, PDF files, ZIP archives, RAR archives, GZ archives, Rich Text Format files, email files, Linux executable files, Adobe Flash files, Windows shortcut files, HWP files, InPage files, and Android APK files. The method can therefore extract features from files of many different types, which facilitates the subsequent characterization and locating of files.
In the embodiments of the present invention, the target file is a file from which features are to be extracted and which needs to be judged as being a malicious sample or not.
The target file has a multi-level structure, and the files in different levels have a belonging relation.
For example, for a RAR archive, the archive itself is the first-level file, and the files inside it may be second-level files.
For example, for a Word file, the Word file is the first-level file, and a link file inserted into it may be a second-level file.
For example, for a Foxmail file, the Foxmail file is the first-level file, a RAR archive attached to it is a second-level file, and the files in that archive are third-level files.
FIG. 2 is a schematic diagram of a file hierarchy. Referring to FIG. 2:
The first-level file is AAA.RAR.
The second-level files are BBB.docx and CCC.exence, both of which belong to AAA.RAR; that is, AAA.RAR contains BBB.docx and CCC.exence.
The third-level file is DDD.ole, which belongs to BBB.docx; that is, BBB.docx contains DDD.ole.
The fourth-level files are EEE.dll and FFF.vbs, both of which belong to DDD.ole; that is, DDD.ole contains EEE.dll and FFF.vbs.
Therefore, in the embodiments of the present invention, each file in each level of the target file needs to be parsed to obtain the file information of each file in each level and the belonging relation of the files between levels, which are then combined to form the hierarchical file information set.
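The hierarchy described for FIG. 2 can be pictured as a small tree. The following is a minimal sketch only: the `FileNode` type and the `walk` helper are hypothetical names introduced for illustration and are not part of the disclosure; the file names are copied from the description.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class FileNode:
    """One file in the multi-level structure (hypothetical type)."""
    name: str
    children: List["FileNode"] = field(default_factory=list)


def walk(node: FileNode, level: int = 1,
         parent: Optional[str] = None) -> Iterator[Tuple[int, str, Optional[str]]]:
    """Yield (level, file name, containing file) for every file, depth first."""
    yield level, node.name, parent
    for child in node.children:
        yield from walk(child, level + 1, node.name)


# The hierarchy described for FIG. 2, with file names copied from the text.
fig2 = FileNode("AAA.RAR", [
    FileNode("BBB.docx", [
        FileNode("DDD.ole", [FileNode("EEE.dll"), FileNode("FFF.vbs")]),
    ]),
    FileNode("CCC.exence"),
])

relations = list(walk(fig2))
for level, name, parent in relations:
    print(level, name, "<-", parent)
```

Each tuple carries both the level of a file and the file it belongs to, which is the kind of relation the hierarchical file information set is said to record.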
In a further embodiment of the foregoing method embodiment, the file information of each file in each level includes dynamic behavior information and static file information.
The dynamic behavior information of a file is the information generated while the file is being executed. It includes, but is not limited to: execution sequence information, call number information, network behavior information, release code information, registry operation information, and startup code information.
For example, opening a Word file may trigger a script that turns the display into a blue screen. In this case, the script number, the script code, and the network behavior information of the script file are acquired.
The static file information of a file is the basic information that the file has whether or not it is executed, such as the file name, author name, file size, file type, hash value, creation time, and modification time.
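As an illustration of static file information, the sketch below collects a few of the listed items with the Python standard library. It is an assumption-laden example, not the patented implementation: the function name `static_file_info` and the dictionary keys are hypothetical, SHA-256 is chosen arbitrarily as the hash, the file type is taken from the extension only, and creation time is omitted because its availability differs between operating systems.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def static_file_info(path: str) -> Dict[str, Any]:
    """Collect basic information that a file has without being executed."""
    p = Path(path)
    stat = p.stat()
    return {
        "file_name": p.name,
        # Extension only; real type detection would inspect the content.
        "file_type": p.suffix.lstrip(".").lower(),
        "file_size": stat.st_size,  # bytes
        "sha256": hashlib.sha256(p.read_bytes()).hexdigest(),
        "modification_time": stat.st_mtime,  # seconds since the epoch
    }


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "BBB.docx")
    with open(sample, "wb") as fh:
        fh.write(b"not a real document")
    info = static_file_info(sample)

print(info["file_name"], info["file_type"], info["file_size"])
```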
The belonging relation of the files between levels is also recorded. For example, a compressed archive contains a Word file, a link file is inserted into the Word file, and a picture file is contained in the link file.
If the malicious sample has only one level, the hierarchical file information set contains only the dynamic behavior information and static file information of the file in the first level.
If the malicious sample has at least two levels, the hierarchical file information set contains the dynamic behavior information and static file information of the files in each level.
If the files in a level cannot be executed, those files have only static file information.
For step S13, it should be noted that, in the embodiments of the present invention, feature matching needs to be performed on the target file to determine which APT group or which malware family it belongs to. The feature vector corresponding to the target file is therefore determined from the obtained hierarchical file information set, and the correspondence between the target file and the feature vector can then be stored so that it can be looked up conveniently when the target file is judged later.
The file feature extraction method provided by the embodiments of the invention can parse the target file in depth to obtain the file information of the files in each level of the multi-level target file, and then determine the corresponding feature vector from that file information. The feature vector is used for feature matching on the target file to judge the file category, thereby achieving in-depth detection of the file and making it difficult for malicious identification information in the file to evade identification.
In a further embodiment of the foregoing method embodiment, parsing the target file to obtain the hierarchical file information set includes:
S121, acquiring the files of the current to-be-processed level in the target file according to a preset level processing order;
S122, parsing the execution items present in the files of the current to-be-processed level to obtain dynamic behavior information;
S123, parsing the basic items present in the files of the current to-be-processed level to obtain static file information;
S124, recording the belonging relation between the files of the current to-be-processed level and the files of the adjacent processed level.
For step S121, it should be noted that if the target file has multiple levels, each level needs to be parsed, so the parsing of all levels is completed in a preset level processing order, for example from the first level to the n-th level; that is, the files of each level are acquired in sequence from the higher levels to the lower levels. Each level is parsed according to steps S121 to S124.
When a level is parsed, the files of the current to-be-processed level in the target file are acquired first.
For steps S122 and S123, it should be noted that the execution items in a file are the information items loaded when the file is executed, such as sequence items, call items, network behavior items, and registry items. For example, a registry item yields registry information, and a network behavior item yields behavior information. Parsing the execution items yields the dynamic behavior information.
The basic items in a file are its fixed items, such as author items, time items, and type items. Parsing the basic items yields the static file information.
For step S124, it should be noted that after the dynamic behavior information and the static file information are acquired, the belonging relation between the files of the current to-be-processed level and the files of the adjacent processed level can be recorded; that is, the relation between a file of the current level and the already-processed files of the level immediately above or below it.
Parsing the files level by level processes the files of the hierarchy in order, so that no file is omitted and files are not lumped together. In addition, distinguishing dynamic information from static information yields comprehensive information about each file, which facilitates feature extraction.
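Steps S121 to S124 can be sketched as a level-order (breadth-first) traversal. This is a minimal illustration under stated assumptions: the nested-dictionary input, the `executable` flag, the function name `parse_levels`, and the two analyzer callbacks are all hypothetical; a real system would unpack archives and run a sandbox where the callbacks are invoked.

```python
from collections import deque
from typing import Any, Callable, Dict, List, Optional

FileDict = Dict[str, Any]


def parse_levels(
    root: FileDict,
    analyze_static: Callable[[FileDict], Dict[str, Any]],
    analyze_dynamic: Callable[[FileDict], Optional[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Visit files level by level, first level first, and build the info set."""
    info_set: List[Dict[str, Any]] = []
    queue = deque([(root, 1, None)])  # (file, level, name of containing file)
    while queue:
        f, level, parent = queue.popleft()       # S121: next file, in level order
        info_set.append({
            "name": f["name"],
            "level": level,
            "dynamic": analyze_dynamic(f),       # S122: None if nothing to execute
            "static": analyze_static(f),         # S123
            "parent": parent,                    # S124: belonging relation
        })
        for child in f.get("children", []):
            queue.append((child, level + 1, f["name"]))
    return info_set


# Hypothetical input; the "executable" flag is an assumption for the demo.
tree = {"name": "AAA.RAR", "children": [
    {"name": "BBB.docx", "children": [{"name": "DDD.ole"}]},
    {"name": "CCC.exence", "executable": True},
]}

info_set = parse_levels(
    tree,
    analyze_static=lambda f: {"name_length": len(f["name"])},
    analyze_dynamic=lambda f: {"ran_in_sandbox": True} if f.get("executable") else None,
)
for entry in info_set:
    print(entry["level"], entry["name"], "<-", entry["parent"])
```

Because a queue is used, every file of one level is handled before any file of the next level, which matches the preset order from the first level downward.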
In a further embodiment of the foregoing method embodiment, parsing the execution items present in the files of the current to-be-processed level to obtain dynamic behavior information specifically includes:
placing the execution items present in the files of the current to-be-processed level into a sandbox for execution, and analyzing the execution process to obtain the dynamic behavior information.
To execute the file in a relatively safe way, a secure environment may be provided; for example, the execution items of the files of the current level are executed in a sandbox to obtain the dynamic behavior information. A sandbox is a virtual system program that allows a browser or other program to run inside it, so that changes made during the run can be removed afterwards. It is an independent, isolated working environment and a tool for testing the behavior of untrusted files or applications.
In a further embodiment of the foregoing method embodiment, determining the feature vector corresponding to the target file according to the hierarchical file information set specifically includes:
S131, performing numerical feature processing on the dynamic behavior information and the static file information of each file in each level to obtain a corresponding first feature set and second feature set;
S132, determining the feature vector corresponding to the target file according to the first feature set and second feature set of each file in all levels and the belonging relation of the files between levels.
For steps S131 and S132, it should be noted that, to obtain the feature vector corresponding to the target file, numerical feature processing is performed on the dynamic behavior information and the static file information of each file in each level to obtain the corresponding first and second feature sets. The feature vector is then determined, according to a preset rule, from the first feature sets, the second feature sets, and the belonging relation of the files in all levels. For example, a 1×n feature vector may be used, where n is currently kept at 228.
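The description does not disclose the preset rule for combining the per-file feature sets. Purely as an illustration of how a fixed 1×n vector with n = 228 could be produced, the sketch below concatenates per-file features in level order and then zero-pads or truncates; the function name `to_feature_vector`, the layout of each per-file list, and the pad-or-truncate rule are all assumptions.

```python
from typing import List, Sequence

N_DIMS = 228  # vector length mentioned in the description


def to_feature_vector(per_file: Sequence[Sequence[float]],
                      n: int = N_DIMS) -> List[float]:
    """Concatenate per-file features, then zero-pad or truncate to length n.

    Each inner sequence is assumed to hold one file's numbers in a fixed
    order, e.g. [level, parent index, dynamic features..., static features...].
    """
    flat = [float(v) for features in per_file for v in features]
    flat = flat[:n]                          # truncate if too long
    return flat + [0.0] * (n - len(flat))    # pad with zeros if too short


# Two hypothetical files: [level, parent index, one dynamic, one static value].
per_file = [[1, 0, 3.0, 20.0], [2, 1, 0.0, 5.0]]
vector = to_feature_vector(per_file)
print(len(vector), vector[:8])
```

Keeping the length fixed means vectors from files with different numbers of levels can still be compared position by position during feature matching.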
In a further embodiment of the foregoing method embodiment, the numerical feature processing of the dynamic behavior information and the static file information specifically includes:
converting the word information present in the dynamic behavior information and the static file information into numerical form;
extracting the values of the numerical information present in the dynamic behavior information and the static file information.
In this regard, it should be noted that some information about a file is expressed in words or phrases, namely word information. For example, the author name "Li San" is information expressed in words. Other information is expressed numerically, namely numerical information. For example, in a file size of 20 KB, 20 is information expressed as a numerical value.
For word information, word frequency and word length statistics can be computed with a bag-of-words method to generate numerical features.
For numerical information, the corresponding values can be taken directly to generate numerical features.
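The two kinds of processing can be sketched as follows. This is an illustrative assumption rather than the disclosed algorithm: the function names, the choice of four statistics (token count, distinct tokens, highest word frequency, mean word length), and the regular expressions are hypothetical.

```python
import re
from collections import Counter
from typing import List


def word_features(text: str) -> List[float]:
    """Bag-of-words style statistics for word information.

    Returns [token count, distinct tokens, highest word frequency,
    mean word length]; all zeros when the text has no tokens.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return [0.0, 0.0, 0.0, 0.0]
    counts = Counter(tokens)
    mean_length = sum(len(t) for t in tokens) / len(tokens)
    return [float(len(tokens)), float(len(counts)),
            float(max(counts.values())), mean_length]


def numeric_feature(text: str) -> float:
    """Take the first number found in numerical information, or 0.0 if none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else 0.0


print(word_features("Li San"))   # word information, e.g. an author name
print(numeric_feature("20 kb"))  # numerical information, e.g. a file size
```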
In a further embodiment of the foregoing method embodiment, to ensure the accuracy of the feature vector, the dynamic behavior information and the static file information need to be normalized before numerical feature processing is performed on them. The normalization includes, but is not limited to: deleting invalid attribute information; and performing word segmentation, truncation, and similar operations on character strings, lists, and other values that do not meet the conditions.
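The conditions that trigger normalization are not specified in the description. The sketch below therefore assumes a simple rule set for illustration only: empty attributes count as invalid and are dropped, strings are split on whitespace, and any list longer than a hypothetical `MAX_TOKENS` limit is cut.

```python
from typing import Any, Dict

MAX_TOKENS = 16  # hypothetical limit; the description gives no value


def normalize(info: Dict[str, Any], max_tokens: int = MAX_TOKENS) -> Dict[str, Any]:
    """Drop empty attributes, split strings into words, and cut long lists."""
    cleaned: Dict[str, Any] = {}
    for key, value in info.items():
        if value in (None, "", [], {}):      # treat empty values as invalid
            continue
        if isinstance(value, str):
            value = value.split()[:max_tokens]   # word segmentation, then cutting
        elif isinstance(value, list):
            value = value[:max_tokens]           # cutting only
        cleaned[key] = value
    return cleaned


raw = {"author": "Li San", "company": "", "moddate": None,
       "stream_size": 20, "imports": ["a", "b", "c"]}
normalized = normalize(raw, max_tokens=2)
print(normalized)
```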
It should also be noted that, for all the embodiments described above, the content of the dynamic behavior information and static file information may differ between file types. The LNK and PDF types are given as examples below.
The LNK type fields are as follows:
stream_size: file size
show_hidden: whether a CMD window appears during execution (0 or 1)
fullpath: complete path
string_dat_name:
string_dat_relativepath:
string_dat_workingdir:
string_dat_instruments: parameters
string_dat_iconlocation: icon path
overlay_offset: offset of the appended data (present only in special samples)
overlay_size: size of the appended data (present only in special samples)
Exploid_info: the vulnerability name, if the file contains an exploit.
The PDF type fields are as follows:
author: author
company: company
creationdate: creation date
creator: creator
moddate: date of modification
producer: tool name.
The file feature extraction method provided by the foregoing embodiments can parse the target file in depth to obtain the file information of the files in each level of the multi-level target file, and then determine the corresponding feature vector from that file information. The feature vector is used for feature matching on the target file to judge the file category, thereby achieving in-depth detection of the file and making it difficult for malicious identification information in the file to evade identification.
FIG. 3 shows a file feature extraction device provided by an embodiment of the present invention, which includes an acquisition module 31, a parsing module 32, and a processing module 33, wherein:
the acquisition module 31 is configured to obtain a target file to be analyzed, wherein the target file has a multi-level structure and the files in different levels have a subordinate relation;
the parsing module 32 is configured to parse the target file to obtain a hierarchical file information set, wherein the hierarchical file information set comprises the file information of each file in each level and the belonging relation of the files between levels;
and the processing module 33 is configured to determine a feature vector of the target file according to the hierarchical file information set.
In a further embodiment of the foregoing device embodiment, the file information of each file in each level includes dynamic behavior information and static file information.
In a further embodiment of the foregoing device embodiment, the parsing module is specifically configured to:
acquire the files of the current to-be-processed level in the target file according to a preset level processing order;
parse the execution items present in the files of the current to-be-processed level to obtain dynamic behavior information;
parse the basic items present in the files of the current to-be-processed level to obtain static file information;
and record the belonging relation between the files of the current to-be-processed level and the files of the adjacent processed level.
In a further embodiment of the foregoing device embodiment, when parsing the execution items present in the files of the current to-be-processed level to obtain dynamic behavior information, the parsing module is specifically configured to:
place the execution items present in the files of the current to-be-processed level into a sandbox for execution, and analyze the execution process to obtain the dynamic behavior information.
In a further embodiment of the foregoing device embodiment, the processing module is specifically configured to:
perform numerical feature processing on the dynamic behavior information and the static file information of each file in each level to obtain a corresponding first feature set and second feature set;
and determine the feature vector corresponding to the target file according to the first feature set and second feature set of each file in all levels and the belonging relation of the files between levels.
In a further embodiment of the foregoing device embodiment, when performing numerical feature processing on the dynamic behavior information and the static file information, the processing module is specifically configured to:
convert the word information present in the dynamic behavior information and the static file information into numerical form;
extract the values of the numerical information present in the dynamic behavior information and the static file information.
Since the apparatus of this embodiment works on the same principle as the method of the above embodiments, the detailed explanation is not repeated here.
It should be noted that, in the embodiment of the present invention, the related functional modules may be implemented by a hardware processor.
The file feature extraction apparatus provided by the embodiment of the invention performs deep analysis of the target file to be analyzed, obtains the file information of the files in each level of the multi-level target file, and finally determines the corresponding feature vector from that file information. The feature vector is used for feature matching on the target file to judge the file type, thereby achieving deep detection of the file and making it difficult for malicious identifying information in the file to evade identification.
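The feature matching step is not detailed in the text. One common way to match a feature vector against known examples is nearest-reference matching by cosine similarity, sketched below as an assumption rather than as the method of the invention; the reference vectors, labels, and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_file_type(vector, references):
    """Return the label of the reference vector most similar to `vector`."""
    return max(
        references,
        key=lambda label: cosine_similarity(vector, references[label]),
    )


# Hypothetical reference vectors for two file types.
references = {
    "benign": [1.0, 0.0, 0.0],
    "malicious": [0.0, 1.0, 1.0],
}
label = match_file_type([0.1, 0.9, 0.8], references)
```

In practice a trained classifier could replace the similarity lookup; the sketch only shows how a feature vector can drive a file type decision.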
Fig. 4 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include: a processor 41, a communications interface 42, a memory 43 and a communication bus 44, wherein the processor 41, the communications interface 42 and the memory 43 communicate with one another through the communication bus 44. The processor 41 may call logic instructions in the memory 43 to perform the following method: obtaining a target file to be analyzed, wherein the target file is a file with a multi-level structure, and the files in different levels have subordinate relations; analyzing the target file to obtain a hierarchical file information set, wherein the hierarchical file information set comprises file information of each file in each level and the subordinate relations of the files between levels; and determining the feature vector of the target file according to the hierarchical file information set.
Further, the logic instructions in the memory 43 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the methods provided by the above embodiments, for example comprising: obtaining a target file to be analyzed, wherein the target file is a file with a multi-level structure, and the files in different levels have subordinate relations; analyzing the target file to obtain a hierarchical file information set, wherein the hierarchical file information set comprises file information of each file in each level and the subordinate relations of the files between levels; and determining the feature vector of the target file according to the hierarchical file information set.
Embodiments of the present invention also provide a computer program product comprising computer executable instructions which, when executed, perform the methods provided by the above embodiments, for example comprising: obtaining a target file to be analyzed, wherein the target file is a file with a multi-level structure, and the files in different levels have subordinate relations; analyzing the target file to obtain a hierarchical file information set, wherein the hierarchical file information set comprises file information of each file in each level and the subordinate relations of the files between levels; and determining the feature vector of the target file according to the hierarchical file information set.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A method for extracting file features, comprising:
obtaining a target file to be analyzed, wherein the target file is a file with a multi-level structure, and the files in different levels have subordinate relations;
analyzing the target file to obtain a hierarchical file information set, wherein the hierarchical file information set comprises file information of each file in each level and the subordinate relations of the files between levels, and wherein analyzing the target file to obtain the hierarchical file information set comprises:
acquiring the files of the current to-be-processed level in the target file according to a preset level processing order;
analyzing the execution items present in the files of the current to-be-processed level to obtain dynamic behavior information;
analyzing the basic items present in the files of the current to-be-processed level to obtain static file information;
recording the subordinate relation between the files of the current to-be-processed level and the files of the adjacent processed level;
determining a feature vector of the target file according to the hierarchical file information set, wherein determining the feature vector of the target file according to the hierarchical file information set comprises:
performing digital feature processing on the dynamic behavior information and the static file information of each file in each level, respectively, to obtain a corresponding first feature set and a corresponding second feature set;
and determining the feature vector of the target file according to the first feature set and the second feature set corresponding to each file in all levels and the subordinate relations of the files between levels.
2. The method of claim 1, wherein the file information of each file in each hierarchy includes dynamic behavior information and static file information.
3. The file feature extraction method according to claim 1, wherein analyzing the execution items present in the files of the current to-be-processed level to obtain dynamic behavior information comprises:
placing the execution items in the files of the current to-be-processed level into a sandbox for execution, and analyzing the execution process to obtain the dynamic behavior information.
4. The file feature extraction method according to claim 1, wherein the digital feature processing of the dynamic behavior information and the static file information comprises:
converting the word information present in the dynamic behavior information and the static file information into numbers;
and extracting the values of the numerical information present in the dynamic behavior information and the static file information.
5. A file feature extraction apparatus, comprising:
an acquisition module, configured to obtain a target file to be analyzed, wherein the target file is a file with a multi-level structure, and the files in different levels have subordinate relations;
an analysis module, configured to analyze the target file to obtain a hierarchical file information set, wherein the hierarchical file information set comprises file information of each file in each level and the subordinate relations of the files between levels, and wherein the analysis module is specifically configured to perform:
acquiring the files of the current to-be-processed level in the target file according to a preset level processing order;
analyzing the execution items present in the files of the current to-be-processed level to obtain dynamic behavior information;
analyzing the basic items present in the files of the current to-be-processed level to obtain static file information;
recording the subordinate relation between the files of the current to-be-processed level and the files of the adjacent processed level;
a processing module, configured to determine the feature vector of the target file according to the hierarchical file information set, wherein the processing module is specifically configured to perform:
performing digital feature processing on the dynamic behavior information and the static file information of each file in each level, respectively, to obtain a corresponding first feature set and a corresponding second feature set;
and determining the feature vector of the target file according to the first feature set and the second feature set corresponding to each file in all levels and the subordinate relations of the files between levels.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the file feature extraction method of any one of claims 1 to 4 when the program is executed by the processor.
7. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the file feature extraction method according to any of claims 1 to 4.
CN202010144181.0A 2020-03-04 2020-03-04 File feature extraction method and device Active CN111444144B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010144181.0A CN111444144B (en) 2020-03-04 2020-03-04 File feature extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010144181.0A CN111444144B (en) 2020-03-04 2020-03-04 File feature extraction method and device

Publications (2)

Publication Number Publication Date
CN111444144A (en) 2020-07-24
CN111444144B (en) 2023-07-25

Family

ID=71654013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010144181.0A Active CN111444144B (en) 2020-03-04 2020-03-04 File feature extraction method and device

Country Status (1)

Country Link
CN (1) CN111444144B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891786B (en) * 2024-03-15 2024-05-31 浙江研通信息科技有限公司 File path hooking method and system based on Monte Carlo algorithm

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010140373A (en) * 2008-12-15 2010-06-24 Fujitsu Ltd Method and device for detecting document group
CN106778241A (en) * 2016-11-28 2017-05-31 东软集团股份有限公司 The recognition methods of malicious file and device
CN107180192A (en) * 2017-05-09 2017-09-19 北京理工大学 Android malicious application detection method and system based on multi-feature fusion
KR101880686B1 (en) * 2018-02-28 2018-07-20 에스지에이솔루션즈 주식회사 A malware code detecting system based on AI(Artificial Intelligence) deep learning
CN109492395A (en) * 2018-10-31 2019-03-19 厦门安胜网络科技有限公司 A kind of method, apparatus and storage medium detecting rogue program
CN109710980A (en) * 2018-11-30 2019-05-03 深圳市嘉立创科技发展有限公司 Note Auditing processing method, device, computer equipment and storage medium
CN110210221A (en) * 2018-08-02 2019-09-06 腾讯科技(深圳)有限公司 A kind of documentation risk detection method and device
CN110287701A (en) * 2019-06-28 2019-09-27 深信服科技股份有限公司 A kind of malicious file detection method, device, system and associated component
CN110414220A (en) * 2019-06-28 2019-11-05 奇安信科技集团股份有限公司 Operation file extracting method and device during sandbox internal program Dynamic Execution
CN110647746A (en) * 2019-08-22 2020-01-03 成都网思科平科技有限公司 Malicious software detection method, system and storage medium
CN110807205A (en) * 2019-09-30 2020-02-18 奇安信科技集团股份有限公司 File security protection method and device
CN110826064A (en) * 2019-10-25 2020-02-21 腾讯科技(深圳)有限公司 Malicious file processing method and device, electronic device and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101388013A (en) * 2007-09-12 2009-03-18 日电(中国)有限公司 Method and system for clustering network files

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
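Hung-Min Sun. A Flexible Framework for Malicious Open XML Document Detection based on APT Attacks. IEEE. 2019, pp. 1005-1006. *
周可政 (Zhou Kezheng); 施勇 (Shi Yong); 薛质 (Xue Zhi). APT detection based on malicious PDF documents. 信息安全与通信保密 (Information Security and Communications Privacy). 2016, (No. 01), pp. 131-136. *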

Also Published As

Publication number Publication date
CN111444144A (en) 2020-07-24

Similar Documents

Publication Publication Date Title
Yakura et al. Malware analysis of imaged binary samples by convolutional neural network with attention mechanism
CN108763928B (en) Open source software vulnerability analysis method and device and storage medium
US11188650B2 (en) Detection of malware using feature hashing
CN108628751B (en) Useless dependency item detection method and device
RU2420791C1 (en) Method of associating previously unknown file with collection of files depending on degree of similarity
CN107563201B (en) Associated sample searching method and device based on machine learning and server
CN109255235B (en) Mobile application third-party library isolation method based on user state sandbox
CN111368289B (en) Malicious software detection method and device
CN113961768B (en) Sensitive word detection method and device, computer equipment and storage medium
WO2018070404A1 (en) Malware analysis device, malware analysis method, and storage medium having malware analysis program contained therein
CN111222137A (en) Program classification model training method, program classification method and device
CN108182363B (en) Detection method, system and storage medium of embedded office document
CN112632529A (en) Vulnerability identification method, device, storage medium and device
CN111444144B (en) File feature extraction method and device
US9646157B1 (en) Systems and methods for identifying repackaged files
CN111460447B (en) Malicious file detection method and device, electronic equipment and storage medium
CN112632528A (en) Threat information generation method, equipment, storage medium and device
US20060005161A1 (en) Method, system and program product for evaluating java software best practices across multiple vendors
CN115080114B (en) Application program transplanting processing method, device and medium
CN114610577A (en) Target resource locking method, device, equipment and medium
CN112099840A (en) Method and device for extracting features in application package
CN114491528A (en) Malicious software detection method, device and equipment
CN110377499B (en) Method and device for testing application program
CN115310082A (en) Information processing method, information processing device, electronic equipment and storage medium
US20220021703A1 (en) Phishing site detection device, phishing site detection method and phishing site detection program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088

Applicant after: Qianxin Technology Group Co.,Ltd.

Applicant after: Qianxin Wangshen information technology (Beijing) Co.,Ltd.

Address before: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088

Applicant before: Qianxin Technology Group Co.,Ltd.

Applicant before: LEGENDSEC INFORMATION TECHNOLOGY (BEIJING) Inc.

GR01 Patent grant