WO2021027831A1 - Malicious file detection method and apparatus, electronic device, and storage medium - Google Patents

Malicious file detection method and apparatus, electronic device, and storage medium

Info

Publication number
WO2021027831A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
target
behavior
file
encoding
Application number
PCT/CN2020/108614
Other languages
English (en)
French (fr)
Inventor
程强
Original Assignee
中兴通讯股份有限公司
Application filed by 中兴通讯股份有限公司
Publication of WO2021027831A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection

Definitions

  • the present disclosure relates to the field of network security technology.
  • APT Advanced Persistent Threat
  • This type of attack not only uses traditional viruses and Trojan horses as attack methods, but also uses social engineering methods such as emails to conduct "pilot attacks", sending users carefully constructed files that exploit zero-day (0Day) vulnerabilities. Once the user opens such a file, the vulnerability is triggered, the attack code is injected into the user's system, and subsequent operations such as downloading other viruses and Trojan horses are performed to facilitate long-term latent operation.
  • traditional firewalls and enterprise anti-virus software have very limited detection and protection capabilities for such malicious files or codes without signatures.
  • APT attack detection and defense technology has become a research hotspot in the new generation of network security.
  • the technical difficulty lies in how to quickly detect attacks that exploit unknown vulnerabilities.
  • a series of researches have been carried out at home and abroad, and a variety of methods have been proposed.
  • the representative one is the dynamic behavior analysis technology based on files or samples.
  • This technology mainly targets the malicious code implantation stage of an APT attack. It uses controllable environments such as sandboxes and virtual machines to dynamically analyze the behavior of suspicious sample files entering the protected system, identifies malicious behaviors and attack code, blocks the implantation of malicious code, and prevents subsequent damage.
  • This kind of technology can detect and protect the attack before it enters the network, so as to avoid the impact of the attack on the protected system.
  • the judging of the maliciousness of code files relies on the behavior feature library, which stores the malicious behavior features extracted after manual code analysis.
  • the update speed and accuracy of the feature library rules determine the success rate of malicious code detection.
  • the method for detecting malicious files includes: encoding the obtained application program interface (API) behaviors and API behavior parameters of the target file to obtain a target code set corresponding to the target file; vectorizing the target code set to obtain a target behavior vector; determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black and white sample set; and, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
  • API application program interface
  • the malicious file detection device includes: an encoding module for encoding the obtained API behaviors and API behavior parameters of the target file to obtain a target code set corresponding to the target file; a vectorization module for vectorizing the target code set to obtain a target behavior vector; and a determination module for determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black and white sample set, and, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
  • the electronic device includes a processor, a communication interface, a memory, and a communication bus.
  • the processor, the communication interface, and the memory communicate with each other through the bus.
  • the memory is used for storing a computer program; the processor is used to execute the program stored in the memory to implement the malicious file detection method described above.
  • Another aspect of the present disclosure provides a computer-readable storage medium in which a computer program is stored, and when the computer program is executed by a processor, the malicious file detection method described above is implemented.
  • Fig. 1 is a flowchart of a malicious file detection method according to an embodiment of the present disclosure.
  • Fig. 2 is a flowchart of a malicious file detection method according to another embodiment of the present disclosure.
  • Fig. 3 is a schematic block diagram of an electronic device according to an embodiment of the present disclosure.
  • Fig. 4 is a schematic block diagram of a malicious file detection device according to an embodiment of the present disclosure.
  • the malicious sample training data is a sequence formed by the malicious sample's calls to system application programming interface (API) functions.
  • Fig. 1 is a flowchart of a malicious file detection method according to an embodiment of the present disclosure.
  • the malicious file detection method is applied to an electronic device to detect malicious files such as viruses, Trojan horses, worms, and ransomware.
  • the process shown in FIG. 1 will be described in detail below.
  • Step S101 Encode the acquired API behaviors and API behavior parameters of the target file to obtain a target encoding set corresponding to the target file.
  • the API behavior may be, but is not limited to, starting a certain process, loading system DLL files, writing temporary files, or modifying the registry under Windows, Linux, Unix and other systems.
  • the API behavior parameter refers to a parameter included in a command, such as a directory path.
  • the target file type can be, but is not limited to, PE file, PDF file, text file, etc.
  • the API behavior of the same target file corresponds to the API behavior parameter one to one.
  • the external analysis engine can be, but is not limited to, a sandbox, a virtual machine, etc. After the external analysis engine runs the target file, the API behaviors and API behavior parameters of the target file are obtained, then encoded and combined in a unified dimension to obtain the target code set corresponding to the target file.
  • each API behavior and its corresponding API behavior parameter are encoded separately, and the resulting codes are then combined in a unified dimension.
  • the API behavior is encoded in hexadecimal with a preset code length, to obtain a first code set of the predetermined length.
  • vectorized conversion is performed on the API behavior parameters to obtain the second code set.
  • the codes in the second code set are converted into hexadecimal codes consistent with the coding format of the first code set, and the codes in the first code set are combined one-to-one with the converted codes of the second code set to obtain the normalized target code set.
  • hexadecimal coding is used for API behavior coding. It should be understood that in some other embodiments, binary, octal, or decimal coding may also be used.
  • if the code type of the first code set differs from that of the second code set, either the codes in the second code set are converted to the code type of the first code set, or the codes in the first code set are converted to the code type of the second code set. If both code sets already use the same code type, no conversion is required.
  • the API behavior parameters can be encoded using, but not limited to, hashing and other methods.
  • the encoding of API behavior parameters uses hash encoding.
  • taking a target file with one API behavior and one API behavior parameter as an example: the first code set corresponding to the API behavior contains the hexadecimal code 0200, and the second code set corresponding to the API behavior parameter contains the decimal code 67574613, which is converted into the hexadecimal code 4071B55; the code in the target code set obtained after combination is the hexadecimal code 02004071B55.
  • the path length of the API behavior parameter can also be preset.
  • when the API behavior parameter is encoded, if its path length exceeds the preset length, the path length is adjusted to the preset length, and then the directory is layered.
  • adjusting the path length of the API behavior parameter can be realized by adding a fixed tail parameter undefine.
  • the preset longest encoding path is c:/system
  • an API behavior parameter is c:/system/host/
  • the obtained code length can be unified, which facilitates the extraction of features, avoids the situation where the feature discrimination is not high due to the broad feature description, and improves the accuracy of subsequent malicious file detection.
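The encoding in step S101 can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: the function names `hash_param`, `combine`, and `encode_behaviors`, the 4-digit code width, and the use of a truncated SHA-1 digest as the parameter hash are all assumptions, since the text only says that the parameters are hash encoded. The worked example (0200 combined with 67574613) comes from the text above.

```python
import hashlib

API_CODE_WIDTH = 4  # assumed preset length of an API behavior code, in hex digits


def hash_param(param: str) -> int:
    """Hypothetical parameter hash: the first 4 bytes of SHA-1 as an integer."""
    digest = hashlib.sha1(param.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def combine(api_code_hex: str, param_code: int) -> str:
    """Convert the decimal parameter code to hex and append it to the API code."""
    if len(api_code_hex) != API_CODE_WIDTH:
        raise ValueError("API behavior code must have the preset length")
    return api_code_hex.upper() + format(param_code, "X")


def encode_behaviors(behaviors):
    """Encode (api_code_hex, parameter) pairs into an ordered target code set."""
    return [combine(code, hash_param(param)) for code, param in behaviors]


# Worked example from the text: 0200 + hex(67574613) = 0200 + 4071B55
print(combine("0200", 67574613))  # 02004071B55
```

Keeping the API code at a fixed width means the boundary between the behavior part and the parameter part of each combined code stays recoverable.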
  • Step S102 Perform vectorization processing on the target code set to obtain a target behavior vector.
  • a sample code set has been established after pre-coding the API behaviors and API behavior parameters of the black samples and white samples in the black and white sample set, and each code in the sample code set corresponds to a different weight.
  • the black sample includes at least one of viruses, Trojan horses, worms, and ransomware
  • the white sample files are normal files.
  • the target encoding set is {A1, A2, B1, A3, C1, A4}
  • the weight corresponding to the code A1 in the sample encoding set is a1
  • the weight corresponding to A2 is a2
  • the weight corresponding to A3 is a3
  • the weight corresponding to A4 is a4
  • code B1 and code C1 do not exist in the sample code set
  • the target behavior vector obtained is (a1, a2, 0, a3, 0, a4).
  • the weight assigned to the code that does not appear in the sample code set may also be other values, for example, the weight assigned to the code that does not appear in the sample code set may also be 1.
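The vectorization in step S102 amounts to a weight lookup. The sketch below assumes the sample code set is available as a dictionary from code to weight; the function name `vectorize` and the numeric weights standing in for a1 to a4 are illustrative only.

```python
def vectorize(code_set, weights, default_weight=0.0):
    """Map each code of the target code set to its weight in the sample code set.

    Codes that do not appear in the sample code set receive `default_weight`
    (0 in the example from the text; other values such as 1 are also allowed).
    """
    return [weights.get(code, default_weight) for code in code_set]


# Hypothetical numeric weights standing in for a1, a2, a3, a4
sample_weights = {"A1": 0.4, "A2": 0.3, "A3": 0.2, "A4": 0.1}
target_code_set = ["A1", "A2", "B1", "A3", "C1", "A4"]

target_vector = vectorize(target_code_set, sample_weights)
print(target_vector)  # [0.4, 0.3, 0.0, 0.2, 0.0, 0.1]
```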
  • Step S103 Determine whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black and white sample set.
  • the black and white sample set includes sample behavior vectors corresponding to black samples and sample behavior vectors corresponding to white samples.
  • Black samples include at least one of viruses, Trojan horses, worms, and ransomware
  • white sample files are normal files.
  • the black samples include viruses, Trojan horses, worms, and ransomware to ensure that various types of malicious files can be detected.
  • the sample behavior vectors are obtained from the black samples and white samples in the black and white sample set through API behavior and API behavior parameter encoding and vectorization, in the same way as the encoding and vectorization of the target file's API behaviors and API behavior parameters described above.
  • both the first distance and the second distance are average distances: the first average distance is calculated between the target behavior vector and the sample behavior vectors corresponding to all black samples in the black and white sample set, and the second average distance is calculated between the target behavior vector and the sample behavior vectors corresponding to all white samples in the black and white sample set.
  • the distance between the target behavior vector and a sample behavior vector can be calculated using, but not limited to, Euclidean distance, cosine similarity, etc. According to an embodiment of the present disclosure, the distance between the target behavior vector and the sample behavior vector is calculated using cosine similarity.
  • the distance between the target behavior vector Jx and the sample behavior vector Jk corresponding to a black sample can be expressed, using cosine similarity, as $d_k = \frac{J_x \cdot J_k}{\lVert J_x \rVert \, \lVert J_k \rVert}$. Calculating the distance between the target behavior vector and the sample behavior vector corresponding to each of the B black samples gives the distance list $[d_1, d_2, \ldots, d_B]$, and averaging the values in the distance list gives the first average distance between the target behavior vector and the sample behavior vectors corresponding to all black samples in the black and white sample set, which can be expressed as $\frac{1}{B}\sum_{k=1}^{B} d_k$.
  • the second average distance between the target behavior vector and the sample behavior vectors corresponding to all white samples in the black and white sample set can be obtained.
  • the first average distance is compared with the second average distance (with cosine similarity as the distance measure, a larger value indicates that the vectors are more alike). If the first average distance is less than the second average distance, the target file is determined to be a normal file, and the detection ends. If the first average distance is greater than or equal to the second average distance, the target file is determined to be a malicious file.
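The decision in step S103 can be sketched as below, assuming (as in the embodiment) that the "distance" is a cosine similarity and that all behavior vectors have already been brought to a common dimension. The function names and the zero-vector convention are assumptions made for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for empty behavior vectors
    return dot / (norm_u * norm_v)


def average_distance(target, samples):
    """Mean of the distance list [d_1, ..., d_B] between target and samples."""
    if not samples:
        raise ValueError("sample set must not be empty")
    return sum(cosine_similarity(target, s) for s in samples) / len(samples)


def is_malicious(target, black_vectors, white_vectors):
    """Malicious when the first average distance >= the second average distance."""
    first = average_distance(target, black_vectors)
    second = average_distance(target, white_vectors)
    return first >= second
```

For example, `is_malicious([1.0, 0.0], [[1.0, 0.0], [1.0, 1.0]], [[0.0, 1.0]])` compares a mean similarity of roughly 0.85 to the black samples against 0.0 to the white sample, so the target is flagged as malicious.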
  • Step S104 when the target file is a malicious file, determine the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to different types of black samples in the black and white sample set.
  • the malicious category of the target file can be determined according to the distance between the target behavior vector and the sample behavior vector corresponding to different types of black samples in the black and white sample set.
  • the third average distance between the target behavior vector and the sample behavior vectors corresponding to viruses is represented by $D_{\text{virus}}$
  • the third average distance between the target behavior vector and the sample behavior vectors corresponding to Trojan horses is represented by $D_{\text{trojan}}$
  • the third average distance between the target behavior vector and the sample behavior vectors corresponding to worms is represented by $D_{\text{worm}}$
  • the third average distance between the target behavior vector and the sample behavior vectors corresponding to ransomware is represented by $D_{\text{ransomware}}$.
  • these four third average distances are compared with one another and with s. Depending on the result, the target file is either classified as a new type of malicious file other than viruses, Trojan horses, worms, and ransomware, or determined to be a malicious file of one of the known categories, for example the virus type.
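A possible shape for the category decision in step S104 is sketched below. The rule used here is an assumption, not the rule stated in the disclosure: the category with the largest mean cosine similarity wins, and if even that value falls below a hypothetical threshold `s`, the file is labeled a new type. The category names, function names, and threshold values are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(target, black_by_category, s):
    """Return (label, per-category third average distances).

    ASSUMED rule: pick the category with the highest mean similarity;
    report "new type" when that highest value is still below threshold s.
    """
    third_avg = {
        category: sum(cosine_similarity(target, v) for v in vectors) / len(vectors)
        for category, vectors in black_by_category.items()
    }
    best = max(third_avg, key=third_avg.get)
    if third_avg[best] < s:
        return "new type", third_avg
    return best, third_avg


# Toy sample behavior vectors, one per category, for illustration only
black_samples = {
    "virus": [[1.0, 0.0, 0.0]],
    "trojan": [[0.0, 1.0, 0.0]],
    "worm": [[0.0, 0.0, 1.0]],
    "ransomware": [[0.0, 1.0, 1.0]],
}
label, distances = classify([0.9, 0.1, 0.0], black_samples, s=0.5)
print(label)  # virus
```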
  • Fig. 2 is a flowchart of a malicious file detection method according to another embodiment of the present disclosure. The process shown in FIG. 2 will be described in detail below.
  • Step S201 Obtain sample API behaviors and sample API behavior parameters of sample files in the training sample set.
  • the sample files include black sample files and white sample files
  • the black sample files include at least one of viruses, Trojan horses, worms, and ransomware
  • the white sample files are normal files.
  • the type of the sample file can be, but is not limited to, PE file, PDF file or text file, etc.
  • a training sample set needs to be established for detecting whether the target file is a malicious file and, if so, its malicious category. Specifically, the sample files in the training sample set are first run through the external analysis engine to obtain the sample API behaviors and sample API behavior parameters of each sample file.
  • the external analysis engine can be, but is not limited to, a sandbox, a virtual machine, etc.
  • each behavior includes a sample API behavior and its corresponding sample API behavior parameter
  • behaviors with the same sample API behavior and the same sample API behavior parameter can be merged to form a set without duplicates, which effectively avoids data redundancy and reduces the amount of calculation.
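Merging identical behaviors can be sketched with a counter, as below. Keeping the occurrence counts alongside the non-repetitive set is an assumption of this sketch (the counts are convenient for frequency-based weighting later); the function name and the behavior strings are purely illustrative.

```python
from collections import Counter


def merge_behaviors(behaviors):
    """Return the distinct (api, parameter) pairs in first-seen order,
    together with how often each pair occurred."""
    counts = Counter(behaviors)
    return list(counts), counts


# Illustrative behaviors observed for one sample file
observed = [
    ("CreateProcess", "c:/a"),
    ("WriteFile", "c:/tmp"),
    ("CreateProcess", "c:/a"),
]
distinct, counts = merge_behaviors(observed)
print(distinct)  # [('CreateProcess', 'c:/a'), ('WriteFile', 'c:/tmp')]
```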
  • Step S202 Encode the obtained sample API behaviors and sample API behavior parameters to obtain a sample encoding set corresponding to the training sample set.
  • hexadecimal encoding is performed on the corresponding sample API behavior for each sample file, and the encoding length is preset to obtain a third encoding set with a predetermined length.
  • the corresponding sample API behavior parameters are encoded to obtain the fourth encoding set.
  • the codes in the fourth code set are converted into hexadecimal codes consistent with the codes in the third code set, and the codes in the third code set are combined one-to-one with the converted codes of the fourth code set to obtain the normalized sample code set.
  • hexadecimal coding is used for the sample API behavior coding. It should be understood that in some other embodiments, binary, octal, or decimal coding may also be used.
  • if the code type of the fourth encoding set differs from that of the third encoding set, the codes in the fourth encoding set must likewise be converted to the code type of the third encoding set, or the codes in the third encoding set converted to the code type of the fourth encoding set.
  • the sample API behavior parameters can be encoded by, but not limited to, hashing and other methods. In the embodiments of the present disclosure, the sample API behavior parameters are encoded by hash encoding.
  • Step S203 Determine the weight corresponding to each code in the sample code set according to how frequently the code appears within the same sample file and across different sample files.
  • the weight corresponding to each code can be determined using, but not limited to, the TF-IDF algorithm, the TextRank algorithm, etc.
  • the TF-IDF algorithm is used. Specifically, the more frequently a code appears within the same sample file (that is, the more often the behavior corresponding to the code occurs in that file), the higher the weight assigned to it; the more different sample files a code appears in (that is, the more files in which the behavior corresponding to the code occurs), the lower the weight assigned to it.
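The weighting in step S203 can be illustrated with a textbook TF-IDF computation. The exact formula is not given in the text, so the variant below (relative term frequency multiplied by the natural logarithm of N divided by the document frequency) is an assumption, as are the function name and the toy codes.

```python
import math
from collections import Counter


def tfidf_weights(sample_code_lists):
    """Compute one {code: weight} dictionary per sample file.

    tf  = occurrences of the code in the file / total codes in the file
    idf = ln(number of files / number of files containing the code)
    """
    n_files = len(sample_code_lists)
    doc_freq = Counter()
    for codes in sample_code_lists:
        doc_freq.update(set(codes))  # count each code once per file

    weights = []
    for codes in sample_code_lists:
        term_freq = Counter(codes)
        total = len(codes)
        weights.append({
            code: (count / total) * math.log(n_files / doc_freq[code])
            for code, count in term_freq.items()
        })
    return weights


# Two toy sample files: code "A" occurs in both, "B" and "C" in one each
weights = tfidf_weights([["A", "A", "B"], ["A", "C"]])
```

In this toy input the code "A" appears in every file and therefore receives weight 0, while "B" and "C" each appear in only one file and receive positive weights, matching the behavior described above.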
  • Step S204 Perform vectorization processing on the sample code set corresponding to the sample file in the training sample set according to the weight corresponding to each code to obtain the sample behavior vector in the black and white sample set.
  • Step S205 Encoding the acquired API behavior and API behavior parameters of the target file to obtain a target encoding set corresponding to the target file.
  • Step S206 Perform vectorization processing on the target code set to obtain a target behavior vector.
  • Step S207 Determine whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black and white sample set.
  • Step S208 When the target file is a malicious file, determine the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vector corresponding to different types of black samples in the black and white sample set.
  • the malicious file detection method encodes and normalizes the API behavior and API behavior parameters of the target file, and then performs vector conversion to obtain the target behavior vector of the target file.
  • whether the target file is a malicious file is determined according to the distance between the target behavior vector and the sample behavior vectors in the black and white sample set, and when the target file is a malicious file, the malicious category of the target file is determined according to the distance between the target behavior vector and the sample behavior vectors of different types of black samples in the black and white sample set. Because the behavior characteristics are preserved and the richness of the training input is improved, detection accuracy can be improved and the false alarm rate of the machine learning model can be reduced when malicious files are detected.
  • the malicious file detection method implemented according to the present disclosure has good file type scalability. Compared with traditional solutions whose analysis models support only executable PE files, it also supports file types such as Word and PDF. In addition, compared with the higher complexity of deep-learning-network malicious file detection methods, it reduces the adjustment of weights and parameter values, and at the same time improves on behavior models that are based on behavior count statistics and depend on the sample distribution.
  • the malicious file detection method implemented according to the present disclosure has a good generalization effect against sample imbalance.
  • the malicious file detection method implemented according to the present disclosure can make the obtained code length uniform, facilitate the extraction of features, avoid the situation where the feature discrimination is not high due to the broad feature description, and further improve the accuracy of malicious file detection.
  • when the training sample set is established, behaviors with the same sample API behavior and the same sample API behavior parameter are merged to form a set without duplicates, which effectively avoids data redundancy and reduces the amount of calculation.
  • FIG. 3 is a schematic block diagram of an electronic device 100 according to an embodiment of the present disclosure.
  • the electronic device 100 includes a processor 110, and optionally an internal bus 120, a network interface 130, and a memory 140.
  • the memory 140 may include internal memory, such as a high-speed random access memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.
  • the electronic device 100 may also include hardware required by other services.
  • the processor 110, the network interface 130, and the memory 140 may be connected to each other through the internal bus 120, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one bidirectional arrow is used to indicate in FIG. 3, but it does not mean that there is only one bus or one type of bus.
  • the memory 140 is used to store programs.
  • the program may include program code, and the program code includes computer operation instructions.
  • the memory 140 may include an internal memory and a non-volatile memory, and provides instructions and data to the processor 110.
  • the processor 110 reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the malicious file detection device 150 at the logical level.
  • the processor 110 executes the program stored in the memory 140 and is specifically configured to perform the following operations: vectorizing the obtained API behaviors and API behavior parameters of the target file to obtain the target behavior vector corresponding to the target file; determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black and white sample set; and, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
  • the method performed by the malicious file detection apparatus 150 disclosed in the embodiment shown in FIG. 3 of the present disclosure may be applied to the processor 110 or implemented by the processor 110.
  • the processor 110 may be an integrated circuit chip with signal processing capabilities.
  • the steps of the foregoing method may be completed by an integrated logic circuit of hardware in the processor 110 or instructions in the form of software.
  • the aforementioned processor 110 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor, or any conventional processor or the like.
  • the steps of the method disclosed in combination with the embodiments of the present disclosure may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a storage medium well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers.
  • the storage medium is located in the memory 140, and the processor 110 reads the information in the memory 140, and completes the steps of the foregoing method in combination with its hardware.
  • the electronic device 100 can also execute the methods in FIGS. 1 and 2 and implement the functions of the malicious file detection apparatus 150 in the embodiments shown in FIGS. 1 and 2, which will not be repeated here.
  • the electronic device 100 of the present disclosure does not exclude other implementations, such as a logic device or a combination of software and hardware. That is, the execution body of the following processing flow is not limited to individual logic units; it can also be a hardware or logic device.
  • the present disclosure also proposes a computer-readable storage medium that stores one or more programs, the one or more programs including instructions which, when executed by a portable electronic device that includes multiple application programs, enable the portable electronic device to execute the method of the embodiments shown in FIG. 1 and FIG. 2, and specifically to perform the following operations: vectorizing the acquired API behaviors and API behavior parameters of the target file to obtain the corresponding target behavior vector; determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black and white sample set; and, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
  • FIG. 4 is a schematic block diagram of a malicious file detection device 150 according to an embodiment of the present disclosure.
  • the malicious file detection device 150 may include an encoding module 151, a vectorization module 152, and a determination module 153.
  • the encoding module 151 is configured to encode the acquired API behavior and API behavior parameters of the target file to obtain a target encoding set corresponding to the target file.
  • the encoding module 151 can be used to execute the above-mentioned step S101 or step S205.
  • the vectorization module 152 is used to perform vectorization processing on the target code set to obtain the target behavior vector.
  • the vectorization module 152 can be used to execute the above-mentioned step S102 or step S206.
  • the determining module 153 is configured to determine whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vector in the black and white sample set. And when the target file is a malicious file, the malicious category of the target file is determined according to the distance between the target behavior vector and the sample behavior vectors corresponding to different types of black samples in the black and white sample set.
  • the determining module 153 may be used to execute the above steps S103 and S104 or steps S207 and S208.
  • a typical implementation device is a computer.
  • the computer may be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

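The present disclosure provides a malicious file detection method and apparatus, an electronic device, and a storage medium. The malicious file detection method includes: encoding the API behaviors and API behavior parameters acquired from a target file to obtain a target code set corresponding to the target file; vectorizing the target code set to obtain a target behavior vector; determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in a black-and-white sample set; and, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set.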

Description

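Malicious file detection method and apparatus, electronic device, and storage medium
Technical Field
The present disclosure relates to the technical field of network security.
Background
Major network security incidents such as Operation Aurora, Stuxnet, Night Dragon, and the theft of RSA token seeds have brought to public attention a type of attack characterized by advanced techniques, long duration, and clearly defined targets, known internationally as the Advanced Persistent Threat (APT) attack. Such attacks not only use traditional viruses and Trojans but also mount a "pilot attack" through social engineering channels such as email, sending users carefully crafted files that exploit 0-day vulnerabilities. Once a user opens such a file, the vulnerability is triggered, attack code is injected into the user's system, and follow-up operations such as downloading further viruses and Trojans are carried out to support long-term covert operation. Traditional firewalls and enterprise antivirus software have very limited ability to detect and defend against such malicious files or code, which carry no known signature.
APT attack detection and defense has become a research focus of next-generation network security. The technical difficulty lies in how to quickly detect attacks that exploit unknown vulnerabilities. Researchers in China and abroad have carried out a series of studies and proposed a variety of methods, the most representative of which is dynamic behavior analysis based on files or samples. This technique mainly targets the malicious code implantation stage of an APT attack: it uses controlled environments such as sandboxes and virtual machines to dynamically analyze the behavior of suspicious sample files entering the protected system, identifies malicious behavior and attack code, blocks the implantation of malicious code, and prevents subsequent damage. It can detect and defend against an attack before the attack enters the network, so that the protected system is not affected. The determination of whether a code file is malicious relies on a behavior signature library that stores malicious behavior features extracted through manual code analysis; the update speed and accuracy of the library's rules determine the success rate of malicious code detection.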
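Summary
One aspect of the present disclosure provides a malicious file detection method, including: encoding the application programming interface (API) behaviors and API behavior parameters acquired from a target file to obtain a target code set corresponding to the target file; vectorizing the target code set to obtain a target behavior vector; determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in a black-and-white sample set; and, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set.
Another aspect of the present disclosure provides a malicious file detection apparatus, including: an encoding module configured to encode the API behaviors and API behavior parameters acquired from a target file to obtain a target code set corresponding to the target file; a vectorization module configured to vectorize the target code set to obtain a target behavior vector; and a determining module configured to determine whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in a black-and-white sample set and, when the target file is a malicious file, to determine the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set.
Another aspect of the present disclosure provides an electronic device including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the bus; the memory is configured to store a computer program; and the processor is configured to execute the program stored in the memory to implement the malicious file detection method described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the malicious file detection method described above.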
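Brief Description of the Drawings
The drawings described here provide a further understanding of the present disclosure and form part of it. The illustrative embodiments of the present disclosure and their descriptions serve to explain the present disclosure and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of a malicious file detection method according to an embodiment of the present disclosure.
FIG. 2 is a flowchart of a malicious file detection method according to another embodiment of the present disclosure.
FIG. 3 is a block diagram of an electronic device according to an embodiment of the present disclosure.
FIG. 4 is a block diagram of a malicious file detection apparatus according to an embodiment of the present disclosure.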
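Detailed Description
To make the objectives, technical solutions, and advantages of the present disclosure clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure are described in detail below with reference to the drawings.
Because malicious code mutates rapidly, machine learning has been tried as a way to meet practical detection needs: a model is trained on a large number of malicious samples to learn the behavior patterns of malware and thereby judge automatically whether software is malicious. However, traditional machine learning methods depend on the distribution of the samples, and an imbalanced sample set leads to poor detection accuracy. They are also highly sensitive to numerical data, yet have difficulty distinguishing behavior data features that carry semantics. The training data for malicious samples is the sequence of functions through which a malicious sample calls the system's application programming interface (API).
FIG. 1 is a flowchart of a malicious file detection method according to an embodiment of the present disclosure.
Referring to FIG. 1, the malicious file detection method according to an embodiment of the present disclosure is applied to an electronic device and is used to detect malicious files such as viruses, Trojans, worms, and ransomware. The flow shown in FIG. 1 is described in detail below.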
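Step S101: encode the API behaviors and API behavior parameters acquired from the target file to obtain a target code set corresponding to the target file.
According to an embodiment of the present disclosure, an API behavior may be, but is not limited to, starting a process, loading a system DLL file, writing a temporary file, or modifying the registry under a system such as Windows, Linux, or Unix. An API behavior parameter is a parameter contained in the command, such as a directory path. The type of the target file may be, but is not limited to, a PE file, a PDF file, or a text file. The API behaviors of a given target file correspond one-to-one to its API behavior parameters.
Before the API behaviors and API behavior parameters of the target file are encoded, the target file to be detected is first run by an external analysis engine, which may be, but is not limited to, a sandbox or a virtual machine. After the target file has been run, its API behaviors and API behavior parameters are acquired, encoded, and combined in a unified dimension to obtain the target code set corresponding to the target file.
During encoding, each API behavior and its corresponding API behavior parameter are encoded separately, and the resulting codes are then combined in a unified dimension. Specifically, the API behaviors are encoded in hexadecimal with a preset code length, yielding a first code set of predetermined length. At the same time, the API behavior parameters are converted by vectorization, yielding a second code set. The codes in the second code set are then converted to hexadecimal codes in the same format as those in the first code set, and the codes in the first code set are combined one-to-one with the converted codes in the second code set to obtain a normalized target code set.
According to an embodiment of the present disclosure, the API behaviors are encoded in hexadecimal; in other embodiments, binary, octal, or decimal encoding may also be used. If the codes in the first code set and the codes in the second code set use different encodings, either the codes in the second code set are converted to the encoding of the first code set, or the codes in the first code set are converted to the encoding of the second code set. If both code sets use the same encoding, no conversion is needed.
The API behavior parameters may be encoded by, but not limited to, hashing; in the embodiments of the present disclosure, hash encoding is used.
For ease of explanation, take a target file with one API behavior and one API behavior parameter. Suppose the first code set, corresponding to the API behavior, contains one hexadecimal code 0200, and the second code set, corresponding to the API behavior parameter, contains one decimal code 67574613. The code in the second code set can be converted to the hexadecimal code 4071B55, so the combined target code set contains one hexadecimal code, 02004071B55.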
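The encoding and combination just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the lookup table `API_CODES`, the function names, and the use of CRC32 as the parameter hash are all hypothetical choices made for the example. The disclosure itself only states that API behaviors receive a fixed-length hexadecimal code and that parameters are hash-encoded.

```python
import zlib

# Hypothetical table mapping API behaviors to fixed-length (4-digit) hex codes.
API_CODES = {"CreateProcess": "0100", "WriteFile": "0200", "RegSetValue": "0300"}

def encode_parameter(param: str) -> int:
    """Hash a behavior parameter to a decimal integer (CRC32 is an assumed choice)."""
    return zlib.crc32(param.encode("utf-8"))

def combine(api_code: str, param_code: int) -> str:
    """Convert the decimal parameter code to hex and append it to the API code."""
    return api_code + format(param_code, "X")

def encode_behaviors(behaviors):
    """Encode (api, parameter) pairs into a list of combined hex codes."""
    return [combine(API_CODES[api], encode_parameter(p)) for api, p in behaviors]

# Worked example from the text: 0200 combined with decimal 67574613.
print(combine("0200", 67574613))  # 02004071B55
```

Keeping the API code at a fixed width means the boundary between the behavior part and the parameter part of each combined code stays unambiguous.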
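According to an embodiment of the present disclosure, to improve detection accuracy, a path length for API behavior parameters may also be preset. When an API behavior parameter is encoded, if its path length exceeds the preset length, the path length is adjusted to the preset length before directory layering is performed. The adjustment can be implemented by appending the fixed tail parameter undefine.
For example, if the preset longest encoding path is c:/system and an API behavior parameter is c:/system/host/, the parameter can be adjusted to c:/system/undefine and then encoded. This makes the resulting code lengths uniform, facilitates feature extraction, avoids the low feature discrimination caused by overly broad feature descriptions, and improves the accuracy of subsequent malicious file detection.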
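A minimal sketch of this path adjustment is shown below. It assumes that path length is measured in directory levels and that over-long paths are cut at the preset depth before the tail is appended; the function name `normalize_path` and its parameters are hypothetical, and only the tail value `undefine` and the example paths come from the text.

```python
def normalize_path(path: str, max_levels: int = 2, tail: str = "undefine") -> str:
    """Cut a path to at most `max_levels` directory levels, marking the cut with `tail`."""
    # Split into directory levels, ignoring empty parts from trailing separators.
    levels = [part for part in path.replace("\\", "/").split("/") if part]
    if len(levels) <= max_levels:
        return "/".join(levels)
    return "/".join(levels[:max_levels] + [tail])

# Example from the text: preset longest path c:/system (two levels).
print(normalize_path("c:/system/host/"))  # c:/system/undefine
```

Paths at or below the preset depth pass through unchanged, so only overly specific paths are collapsed to a shared code.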
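Step S102: vectorize the target code set to obtain a target behavior vector.
According to an embodiment of the present disclosure, a sample code set has been built in advance by encoding the API behaviors and API behavior parameters of the black samples and white samples in the black-and-white sample set, and each code in the sample code set corresponds to a weight. The black samples include at least one of viruses, Trojans, worms, and ransomware, and the white samples are normal files.
When the target code set is vectorized, each code in the target code set is assigned the weight of the same code in the sample code set, and a code that does not appear in the sample code set is assigned a weight of 0. This yields the target behavior vector corresponding to the target code set.
For example, suppose the target code set is {A1, A2, B1, A3, C1, A4}, and in the sample code set the code A1 has weight a1, A2 has weight a2, A3 has weight a3, and A4 has weight a4, while the codes B1 and C1 do not exist in the sample code set. Vectorizing the target code set then gives the target behavior vector (a1, a2, 0, a3, 0, a4).
In other embodiments, a code that does not appear in the sample code set may be assigned some other value; for example, it may be assigned a weight of 1.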
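The weight lookup described in step S102 can be sketched as follows. The function name `vectorize` and the numeric weights are assumptions for illustration; the `default` argument covers both the weight of 0 used in the embodiment and the alternative value mentioned for other embodiments.

```python
def vectorize(target_codes, sample_weights, default=0.0):
    """Map each code in the target code set to its weight in the sample code set."""
    return [sample_weights.get(code, default) for code in target_codes]

# Example mirroring the text; the numeric weights are made up for illustration.
weights = {"A1": 0.4, "A2": 0.1, "A3": 0.3, "A4": 0.2}
target = ["A1", "A2", "B1", "A3", "C1", "A4"]
print(vectorize(target, weights))  # [0.4, 0.1, 0.0, 0.3, 0.0, 0.2]
```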
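Step S103: determine whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black-and-white sample set.
The black-and-white sample set includes the sample behavior vectors corresponding to black samples and the sample behavior vectors corresponding to white samples. The black samples include at least one of viruses, Trojans, worms, and ransomware, and the white samples are normal files. According to an embodiment of the present disclosure, the black samples include viruses, Trojans, worms, and ransomware, so that malicious files of all these types can be detected. The sample behavior vectors are obtained by encoding the API behaviors and API behavior parameters of the black and white samples and then vectorizing the result, following the same process as the encoding and vectorization of the target file described above.
To determine whether the target file is a malicious file, a first distance between the target behavior vector and the sample behavior vectors of all black samples in the black-and-white sample set is computed first, together with a second distance between the target behavior vector and the sample behavior vectors of all white samples. The first and second distances may be, but are not limited to, the average distance or the median of the individual distances. According to an embodiment of the present disclosure, both are taken as average distances; that is, a first average distance is computed between the target behavior vector and the sample behavior vectors of all black samples, and a second average distance is computed between the target behavior vector and the sample behavior vectors of all white samples.
The distance between the target behavior vector and a sample behavior vector may be computed by, but not limited to, Euclidean distance or cosine similarity. According to an embodiment of the present disclosure, cosine similarity is used.
Suppose the target behavior vector is Jx and the sample behavior vector of a given black sample is Jk; the distance between the target behavior vector Jx and the sample behavior vector Jk can then be expressed as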
Figure PCTCN2020108614-appb-000001
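Computing the distance between the target behavior vector and the sample behavior vector of every black sample gives a distance list [d_1, d_2, ..., d_B]. Averaging the values in the list gives the first average distance between the target behavior vector and the sample behavior vectors of all black samples in the black-and-white sample set; the first average distance can be expressed as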
Figure PCTCN2020108614-appb-000002
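In the same way, the second average distance between the target behavior vector and the sample behavior vectors of all white samples in the black-and-white sample set can be obtained.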
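The two expressions above appear here only as figure references. Since the embodiment states that cosine similarity is used, they presumably take the standard form below; this is a reconstruction under that assumption rather than a copy of the original figures. Here $B$ is the number of black samples and $d_k$ is the value computed for the $k$-th black sample, matching the list $[d_1, d_2, \ldots, d_B]$; the symbol $D_{\text{black}}$ is a label introduced for this sketch.

```latex
d(J_x, J_k) = \frac{J_x \cdot J_k}{\lVert J_x \rVert \, \lVert J_k \rVert},
\qquad
D_{\text{black}} = \frac{1}{B} \sum_{k=1}^{B} d_k
```

The second average distance would follow the same pattern, with the sum taken over the white samples instead.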
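The first average distance is then compared with the second average distance. If the first average distance is less than the second average distance, the target file is determined to be a normal file and detection ends. If the first average distance is greater than or equal to the second average distance, the target file is determined to be a malicious file.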
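The comparison in step S103 can be sketched as below. The sketch follows the rule exactly as the text states it and takes the "distance" to be cosine similarity, as in the embodiment, so a larger value means the two vectors are more alike. The function names are hypothetical, and all vectors are assumed to have the same length.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_score(target, samples):
    """Average cosine similarity between the target and a group of sample vectors."""
    return sum(cosine_similarity(target, s) for s in samples) / len(samples)

def is_malicious(target, black_vectors, white_vectors):
    """Apply the stated rule: malicious when the first average >= the second average."""
    first = mean_score(target, black_vectors)   # against all black samples
    second = mean_score(target, white_vectors)  # against all white samples
    return first >= second
```

For instance, `is_malicious([1, 0], [[1, 0], [2, 0]], [[0, 1]])` returns `True`, because the target points in the same direction as both black sample vectors and is orthogonal to the white one.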
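Step S104: when the target file is a malicious file, determine the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set.
If the target file corresponding to the target behavior vector is a malicious file, its malicious category can be determined from the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set.
According to an embodiment of the present disclosure, third average distances are first computed between the target behavior vector and the sample behavior vectors of each kind of black sample (virus, Trojan, worm, ransomware) in the black-and-white sample set. The third average distance to the virus sample behavior vectors is denoted Dα, that to the Trojan sample behavior vectors Dβ, that to the worm sample behavior vectors Dθ, and that to the ransomware sample behavior vectors Dμ. The third average distances Dα, Dβ, Dθ, and Dμ are each compared with a preset threshold. If Dα, Dβ, Dθ, and Dμ all exceed the threshold, the target behavior vector differs substantially from the sample behavior vectors of every kind of black sample, and the target file corresponding to the target behavior vector is classified as a malicious file of a new category other than virus, Trojan, worm, and ransomware. If one or more of Dα, Dβ, Dθ, and Dμ do not exceed the preset threshold, the malicious category of the black samples corresponding to the smallest of Dα, Dβ, Dθ, and Dμ is taken as the malicious category of the target file.
For example, let the threshold be s. If s < Dα < Dβ < Dθ < Dμ, the target file is classified as a malicious file of a new category other than virus, Trojan, worm, and ransomware. If Dα < Dβ < Dθ < Dμ < s, the target file is determined to be a malicious file of the virus category.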
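The category decision in step S104 can be sketched as follows. Because this rule picks the smallest average distance, the sketch uses Euclidean distance, which the text lists as one permitted option; that choice, the function names, and the category labels are assumptions for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify_category(target, black_by_category, threshold):
    """Return the closest malicious category, or 'new category' if none is close enough."""
    # Third average distance for each kind of black sample.
    averages = {
        name: sum(euclidean(target, s) for s in samples) / len(samples)
        for name, samples in black_by_category.items()
    }
    # Every average exceeds the threshold: the file matches no known category.
    if all(d > threshold for d in averages.values()):
        return "new category"
    return min(averages, key=averages.get)
```

With `{"virus": [[0, 0]], "trojan": [[3, 4]]}` and target `[0, 1]`, the average distances are 1.0 and about 4.24, so a threshold of 2 yields `"virus"` while a threshold of 0.5 yields `"new category"`.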
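FIG. 2 is a flowchart of a malicious file detection method according to another embodiment of the present disclosure. The flow shown in FIG. 2 is described in detail below.
Step S201: acquire the sample API behaviors and sample API behavior parameters of the sample files in the training sample set.
According to an embodiment of the present disclosure, the sample files include black sample files and white sample files. The black sample files include at least one of viruses, Trojans, worms, and ransomware, and the white sample files are normal files. The type of a sample file may be, but is not limited to, a PE file, a PDF file, or a text file.
Before the target file is detected, a training sample set must be established for determining whether the target file is malicious and what its malicious category is. Specifically, the sample files in the training sample set are first run by an external analysis engine, and the sample API behaviors and sample API behavior parameters of each sample file are acquired. The external analysis engine may be, but is not limited to, a sandbox or a virtual machine.
When the sample API behaviors and sample API behavior parameters of the sample files in the training sample set are acquired, identical behaviors may occur (a behavior comprises one sample API behavior and its corresponding sample API behavior parameter). Behaviors whose sample API behavior and sample API behavior parameter are both the same can be merged into a set without duplicates, which effectively avoids data redundancy and reduces computation.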
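Merging identical behaviors can be sketched in a few lines. The function name and the sample values are hypothetical; a behavior is modeled as an (API behavior, parameter) pair, and the first-seen order is kept.

```python
def dedupe_behaviors(behaviors):
    """Drop repeated (api, parameter) pairs while keeping first-seen order."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(behaviors))

observed = [("WriteFile", "c:/temp"), ("LoadDll", "kernel32"), ("WriteFile", "c:/temp")]
print(dedupe_behaviors(observed))  # [('WriteFile', 'c:/temp'), ('LoadDll', 'kernel32')]
```

Two behaviors count as identical only when both the API behavior and its parameter match, so the same API called with different parameters is kept as separate entries.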
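Step S202: encode the acquired sample API behaviors and sample API behavior parameters to obtain a sample code set corresponding to the training sample set.
According to an embodiment of the present disclosure, for each sample file the corresponding sample API behaviors are encoded in hexadecimal with a preset code length, yielding a third code set of predetermined length. At the same time, the corresponding sample API behavior parameters are encoded, yielding a fourth code set. The codes in the fourth code set are then converted to hexadecimal codes consistent with those in the third code set, and the codes in the third code set are combined one-to-one with the converted codes in the fourth code set to obtain a normalized sample code set.
According to an embodiment of the present disclosure, the sample API behaviors are encoded in hexadecimal; in other embodiments, binary, octal, or decimal encoding may also be used. With another base, if the codes in the third code set and the codes in the fourth code set use different encodings, either the codes in the fourth code set are converted to the encoding of the third code set, or the codes in the third code set are converted to the encoding of the fourth code set. The sample API behavior parameters may be encoded by, but not limited to, hashing; in the embodiments of the present disclosure, hash encoding is used.
Step S203: determine the weight of each code according to how frequently the code in the sample code set appears within the same sample file and across different sample files.
The weight of each code may be determined by, but not limited to, the TF-IDF algorithm or the TextRank algorithm; the embodiments of the present disclosure use TF-IDF. Specifically, the more frequently a code in the sample code set appears within the same sample file (that is, the more often the behavior corresponding to the code occurs in that file), the higher the weight it is given; the more frequently the same code appears across different sample files (that is, the more files in which the corresponding behavior occurs), the lower the weight it is given.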
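The weighting in step S203 can be illustrated with a plain TF-IDF computation. This is a textbook variant (term frequency times the natural log of the inverse document frequency) chosen as an assumption; the text does not specify the exact formula, nor how per-file weights are reduced to a single weight per code. The function name is hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(sample_code_lists):
    """Return one {code: weight} dict per sample file.

    sample_code_lists holds one list of codes per sample file.
    """
    n_files = len(sample_code_lists)
    # Document frequency: in how many files each code appears at least once.
    df = Counter(code for codes in sample_code_lists for code in set(codes))
    weights = []
    for codes in sample_code_lists:
        tf = Counter(codes)  # occurrences of each code within this file
        total = len(codes)
        weights.append(
            {c: (tf[c] / total) * math.log(n_files / df[c]) for c in tf}
        )
    return weights
```

A code that occurs in every sample file gets weight 0 under this variant, which matches the stated intent that behaviors common to many files should count for less.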
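Step S204: vectorize the sample code sets corresponding to the sample files in the training sample set according to the weight of each code, to obtain the sample behavior vectors in the black-and-white sample set.
Step S205: encode the API behaviors and API behavior parameters acquired from the target file to obtain a target code set corresponding to the target file.
Step S206: vectorize the target code set to obtain a target behavior vector.
Step S207: determine whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black-and-white sample set.
Step S208: when the target file is a malicious file, determine the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set.
The malicious file detection method according to the embodiments of the present disclosure encodes, normalizes, and combines the API behaviors and API behavior parameters of the target file and then converts them into a vector, obtaining the target behavior vector of the target file. Whether the target file is malicious is determined from the distance between the target behavior vector and the sample behavior vectors in the black-and-white sample set, and, when the target file is malicious, its malicious category is determined from the distance between the target behavior vector and the sample behavior vectors of the different kinds of black samples. Because behavior features are preserved and the training input is richer, detection accuracy improves and the false-positive rate of the machine learning model falls. The distance between the target behavior vector and the sample behavior vectors of each kind of black sample also makes it possible not only to tell whether the target file is malicious but also to identify its type and to discover new types of malicious files. Second, the method extends well to additional file types: whereas traditional solutions support only analysis models for executable PE files, the method also supports other file types such as Word and PDF. Third, compared with detection methods based on deep learning networks, which are relatively complex, the method requires less tuning of weights and parameter values, and it reduces the dependence on sample distribution seen in behavior models based on counting behaviors. The method therefore generalizes well to imbalanced samples. In addition, the method makes the resulting code lengths uniform, which facilitates feature extraction, avoids the low feature discrimination caused by overly broad feature descriptions, and further improves detection accuracy. Finally, when the training sample set is built, behaviors whose sample API behavior and sample API behavior parameter are both the same are merged into a set without duplicates, which effectively avoids data redundancy and reduces computation.
The foregoing describes specific embodiments of this specification. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.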
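FIG. 3 is a block diagram of an electronic device 100 according to an embodiment of the present disclosure. Referring to FIG. 3, at the hardware level the electronic device 100 includes a processor 110 and optionally also an internal bus 120, a network interface 130, and a memory 140. The memory 140 may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage. The electronic device 100 may of course also include hardware required by other services.
The processor 110, the network interface 130, and the memory 140 may be connected to one another through the internal bus 120, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one double-headed arrow is shown in FIG. 3, but this does not mean that there is only one bus or one type of bus.
The memory 140 is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory 140 may include internal memory and non-volatile memory, and provides instructions and data to the processor 110.
The processor 110 reads the corresponding computer program from the non-volatile memory into internal memory and then runs it, forming the malicious file detection apparatus 150 at the logical level.
The processor 110 executes the program stored in the memory 140 and is specifically configured to perform the following operations: converting the API behaviors and API behavior parameters acquired from the target file into vector form to obtain a target behavior vector corresponding to the target file; determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black-and-white sample set; and, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set.
The method performed by the malicious file detection apparatus 150 disclosed in the embodiment shown in FIG. 3 may be applied to, or implemented by, the processor 110. The processor 110 may be an integrated circuit chip with signal processing capability. During implementation, each step of the method may be completed by an integrated logic circuit of hardware in the processor 110 or by instructions in the form of software. The processor 110 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present disclosure may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may reside in a storage medium that is mature in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 140; the processor 110 reads the information in the memory 140 and completes the steps of the method in combination with its hardware.
The electronic device 100 can also execute the methods of FIG. 1 and FIG. 2 and implement the functions of the malicious file detection apparatus 150 in the embodiments shown in FIG. 1 and FIG. 2, which are not repeated here.
Of course, besides a software implementation, the electronic device 100 of the present disclosure does not exclude other implementations, such as logic devices or a combination of software and hardware. In other words, the entity that executes the processing flow is not limited to individual logic units and may also be hardware or a logic device.
The present disclosure further provides a computer-readable storage medium storing one or more programs that include instructions. When executed by a portable electronic device that includes multiple application programs, the instructions cause the portable electronic device to execute the methods of the embodiments shown in FIG. 1 and FIG. 2, and specifically to perform the following operations: converting the API behaviors and API behavior parameters acquired from the target file into vector form to obtain a target behavior vector corresponding to the target file; determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black-and-white sample set; and, when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set.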
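FIG. 4 is a block diagram of a malicious file detection apparatus 150 according to an embodiment of the present disclosure. Referring to FIG. 4, the malicious file detection apparatus 150 may include an encoding module 151, a vectorization module 152, and a determining module 153.
The encoding module 151 is configured to encode the API behaviors and API behavior parameters acquired from the target file to obtain a target code set corresponding to the target file.
The encoding module 151 may be used to execute step S101 or step S205 described above.
The vectorization module 152 is configured to vectorize the target code set to obtain a target behavior vector.
The vectorization module 152 may be used to execute step S102 or step S206 described above.
The determining module 153 is configured to determine whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black-and-white sample set and, when the target file is a malicious file, to determine the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set.
The determining module 153 may be used to execute steps S103 and S104 or steps S207 and S208 described above.
The above are merely embodiments of the present disclosure and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product with a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to the process, method, product, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, product, or device that includes the element.
The embodiments in this specification are described progressively. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described briefly; for related details, refer to the corresponding description of the method embodiments.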

Claims (13)

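  1. A malicious file detection method, comprising:
    encoding application programming interface (API) behaviors and API behavior parameters acquired from a target file to obtain a target code set corresponding to the target file;
    vectorizing the target code set to obtain a target behavior vector;
    determining whether the target file is a malicious file according to the distance between the target behavior vector and sample behavior vectors in a black-and-white sample set; and
    when the target file is a malicious file, determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set.
  2. The method according to claim 1, wherein encoding the API behaviors and API behavior parameters acquired from the target file to obtain the target code set corresponding to the target file comprises:
    encoding the API behaviors to obtain a first code set;
    encoding the API behavior parameters to obtain a second code set; and
    combining the first code set and the second code set in a unified dimension to obtain the normalized target code set.
  3. The method according to claim 2, wherein the API behavior parameter is a directory path, and encoding the API behavior parameters to obtain the second code set comprises:
    performing directory layering on the API behavior parameter; and
    encoding the directory-layered API behavior parameter to obtain the second code set,
    wherein, when the path length of the API behavior parameter exceeds a preset length, the path length of the API behavior parameter is adjusted to the preset length before the directory layering is performed.
  4. The method according to claim 2, wherein
    encoding the API behaviors to obtain the first code set comprises:
    encoding the API behaviors in hexadecimal to obtain the first code set with a predetermined code length,
    encoding the API behavior parameters to obtain the second code set comprises:
    hash-encoding the API behavior parameters to obtain the second code set, and
    combining the first code set and the second code set in a unified dimension to obtain the normalized target code set comprises:
    converting the codes in the second code set to hexadecimal codes, and combining the codes in the first code set one-to-one with the codes in the converted second code set to obtain the target code set.
  5. The method according to claim 1, wherein determining whether the target file is a malicious file according to the distance between the target behavior vector and the sample behavior vectors in the black-and-white sample set comprises:
    computing a first average distance between the target behavior vector and the sample behavior vectors corresponding to the black samples in the black-and-white sample set;
    computing a second average distance between the target behavior vector and the sample behavior vectors corresponding to the white samples in the black-and-white sample set; and
    when the first average distance is greater than or equal to the second average distance, determining that the target file is a malicious file.
  6. The method according to claim 1, wherein determining the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set comprises:
    computing third average distances between the target behavior vector and the sample behavior vectors corresponding to the different kinds of black samples in the black-and-white sample set;
    when there is a third average distance that does not exceed a preset threshold, selecting the malicious category of the black samples corresponding to the smallest of the third average distances as the malicious category of the target file; and
    when there is no third average distance that does not exceed the preset threshold, classifying the malicious category of the target file as a new malicious category.
  7. The method according to claim 1, further comprising:
    acquiring the API behaviors and the API behavior parameters produced when an external analysis engine runs the target file.
  8. The method according to claim 1, wherein the API behavior is loading a system DLL file, writing a temporary file, or modifying the registry.
  9. The method according to claim 1, further comprising:
    acquiring sample API behaviors and sample API behavior parameters of sample files in a training sample set, wherein the sample files include black sample files and white sample files, the black sample files include at least one of viruses, Trojans, worms, and ransomware, and the white sample files are normal files;
    encoding the acquired sample API behaviors and sample API behavior parameters to obtain a sample code set corresponding to the training sample set;
    determining the weight of each code according to how frequently the code in the sample code set appears within the same sample file and across different sample files; and
    vectorizing the sample code set according to the weight of each code to obtain the sample behavior vectors in the black-and-white sample set.
  10. The method according to claim 9, wherein the type of the sample file is a PE file, a PDF file, or a text file.
  11. A malicious file detection apparatus, comprising:
    an encoding module configured to encode API behaviors and API behavior parameters acquired from a target file to obtain a target code set corresponding to the target file;
    a vectorization module configured to vectorize the target code set to obtain a target behavior vector; and
    a determining module configured to determine whether the target file is a malicious file according to the distance between the target behavior vector and sample behavior vectors in a black-and-white sample set and, when the target file is a malicious file, to determine the malicious category of the target file according to the distance between the target behavior vector and the sample behavior vectors corresponding to different kinds of black samples in the black-and-white sample set.
  12. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the bus,
    the memory is configured to store a computer program; and
    the processor is configured to execute the program stored in the memory to implement the malicious file detection method according to any one of claims 1 to 10.
  13. A computer-readable storage medium storing a computer program that, when executed by a processor, implements the malicious file detection method according to any one of claims 1 to 10.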
PCT/CN2020/108614 2019-08-15 2020-08-12 Malicious file detection method and apparatus, electronic device, and storage medium WO2021027831A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910755713.1A CN112395612A (zh) 2019-08-15 2019-08-15 Malicious file detection method, apparatus, electronic device, and storage medium
CN201910755713.1 2019-08-15

Publications (1)

Publication Number Publication Date
WO2021027831A1 true WO2021027831A1 (zh) 2021-02-18

Family

ID=74570249

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/108614 WO2021027831A1 (zh) 2019-08-15 2020-08-12 Malicious file detection method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN112395612A (zh)
WO (1) WO2021027831A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861428A (zh) * 2023-09-04 2023-10-10 北京安天网络安全技术有限公司 Malicious detection method, apparatus, device, and medium based on associated files
CN116910756A (zh) * 2023-09-13 2023-10-20 北京安天网络安全技术有限公司 Method for detecting malicious PE files

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
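CN113343219B (zh) * 2021-05-31 2023-03-07 烟台中科网络技术研究所 Automatic and efficient method for detecting high-risk mobile applications
CN113449301A (zh) * 2021-06-22 2021-09-28 深信服科技股份有限公司 Sample detection method, apparatus, device, and computer-readable storage medium
CN114006766A (zh) * 2021-11-04 2022-02-01 杭州安恒信息安全技术有限公司 Network attack detection method and apparatus, electronic device, and readable storage medium
CN114297645B (zh) * 2021-12-03 2022-09-27 深圳市木浪云科技有限公司 Method, apparatus, and system for identifying ransomware families in a cloud backup system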

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8245295B2 (en) * 2007-07-10 2012-08-14 Samsung Electronics Co., Ltd. Apparatus and method for detection of malicious program using program behavior
CN104866763A (zh) * 2015-05-28 2015-08-26 天津大学 Permission-based hybrid detection method for Android malware
CN106960153A (zh) * 2016-01-12 2017-07-18 阿里巴巴集团控股有限公司 Method and apparatus for identifying virus type
US20180041536A1 (en) * 2016-08-02 2018-02-08 Invincea, Inc. Methods and apparatus for detecting and identifying malware by mapping feature data into a semantic space
CN109145605A (zh) * 2018-08-23 2019-01-04 北京理工大学 Android malware family clustering method based on the SinglePass algorithm

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
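CN116861428A (zh) * 2023-09-04 2023-10-10 北京安天网络安全技术有限公司 Malicious detection method, apparatus, device, and medium based on associated files
CN116861428B (zh) * 2023-09-04 2023-12-08 北京安天网络安全技术有限公司 Malicious detection method, apparatus, device, and medium based on associated files
CN116910756A (zh) * 2023-09-13 2023-10-20 北京安天网络安全技术有限公司 Method for detecting malicious PE files
CN116910756B (zh) * 2023-09-13 2024-01-23 北京安天网络安全技术有限公司 Method for detecting malicious PE files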

Also Published As

Publication number Publication date
CN112395612A (zh) 2021-02-23

Similar Documents

Publication Publication Date Title
WO2021027831A1 (zh) Malicious file detection method and apparatus, electronic device, and storage medium
RU2680738C1 (ru) Cascade classifier for computer security applications
Peiravian et al. Machine learning for android malware detection using permission and api calls
US20180183815A1 (en) System and method for detecting malware
WO2018086544A1 (zh) Security protection method, security protection apparatus, and computer storage medium
RU2614557C2 (ru) System and method for detecting malicious files on mobile devices
US10986103B2 (en) Signal tokens indicative of malware
Kapratwar et al. Static and dynamic analysis of android malware
US9798981B2 (en) Determining malware based on signal tokens
US10007786B1 (en) Systems and methods for detecting malware
Sundarkumar et al. Malware detection via API calls, topic models and machine learning
Zhao et al. Malicious executables classification based on behavioral factor analysis
US11882134B2 (en) Stateful rule generation for behavior based threat detection
Varma et al. Android mobile security by detecting and classification of malware based on permissions using machine learning algorithms
US11379581B2 (en) System and method for detection of malicious files
JP2020115320A (ja) System and method for detecting malicious files
Belaoued et al. A chi-square-based decision for real-time malware detection using PE-file features
Sun et al. An opcode sequences analysis method for unknown malware detection
Naz et al. Review of machine learning methods for windows malware detection
Du et al. A static Android malicious code detection method based on multi‐source fusion
Yerima et al. Android malware detection: An eigenspace analysis approach
Li et al. Deep learning algorithms for cyber security applications: A survey
Ndagi et al. Machine learning classification algorithms for adware in android devices: a comparative evaluation and analysis
EP3798885B1 (en) System and method for detection of malicious files
Andronio Heldroid: Fast and efficient linguistic-based ransomware detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20851944

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20851944

Country of ref document: EP

Kind code of ref document: A1
