WO2016201819A1 - Method and apparatus for detecting malicious file - Google Patents


Info

Publication number
WO2016201819A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
string
malicious
detected
url
Prior art date
Application number
PCT/CN2015/090707
Other languages
French (fr)
Chinese (zh)
Inventor
熊蜀光
冯侦探
曹德强
周晓波
耿志峰
白军辉
Original Assignee
安一恒通(北京)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 安一恒通(北京)科技有限公司
Publication of WO2016201819A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection

Definitions

  • the method and apparatus for detecting a malicious file provided by the present application obtain the uniform resource locator (URL) of a file to be detected, match the character strings contained in that URL against the character strings in a preset model, and determine whether the file is malicious based on the longest matched character string. No other information about the file needs to be obtained, which improves the efficiency of identifying malicious files.
  • the user can issue a request to download the corresponding file by clicking a hyperlink or a download address on the page displayed by the browser, or by clicking a hyperlink or entering a download address in a download application.
  • the electronic device can directly obtain the download address, and the download address can be regarded as the URL of the file to be detected. If the user clicks a hyperlink to the file to be downloaded, the electronic device can obtain the URL associated with the hyperlink through the browser or the download application; that URL is the URL of the file to be detected.
  • the present application provides an embodiment of an apparatus for detecting a malicious file, the apparatus embodiment corresponding to the method embodiment shown in FIG. Specifically, it can be applied to an electronic device.
  • the obtaining module 501 of the device 500 for detecting a malicious file may have a root
  • the URL for downloading the file to be detected is obtained according to the request of the user to download the file from the network.
  • the file to be detected may be a file downloaded from the network requested by the user.
  • each edge corresponds to a character string
  • each path from the root node corresponds to a character string.
  • the string for a path is formed by concatenating, in order, the strings corresponding to the edges in the path.
  • Each node stores the number or ratio of non-malicious files and malicious files that satisfy the path matching condition.
  • the path matching condition includes that the string corresponding to the path from the root node to the node is a prefix of a URL of the file.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Disclosed are a method and apparatus for detecting a malicious file. One particular embodiment of the method comprises: acquiring a uniform resource locator (URL) for downloading a file to be detected; matching character strings contained in the URL of the file to be detected with character strings in a pre-set model; and determining whether the file to be detected is a malicious file based on a longest character string matched in the pre-set model by the URL of the file to be detected. According to the embodiment, the efficiency of detecting the malicious file can be improved.

Description

Method and apparatus for detecting malicious files
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 201510346583.8, filed on June 19, 2015 and entitled "Method and apparatus for detecting malicious files", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computer technology, specifically to network information security, and in particular to a method and apparatus for detecting malicious files.
Background
When files are downloaded from the Internet, some download links are disguised so that they actually point to malicious files. Once downloaded to a user's computer, such malicious files (for example, documents containing viruses, worms, or Trojan horse programs that can perform malicious tasks on a computer system) may threaten the information security of network users.
At present, the static detection methods used by most antivirus applications first extract features such as the attribute information or contents of the file to be downloaded, and then match those features against a pre-trained model to determine whether the file is malicious. These methods must first obtain the relevant features of the file, and they cannot reach a verdict for files that lack obvious malicious characteristics, so identification efficiency is low.
Summary of the invention
The purpose of the present application is to propose an improved method and apparatus for detecting malicious files that solve the technical problems mentioned in the Background section above.
In one aspect, the present application provides a method for detecting a malicious file, the method comprising: acquiring a uniform resource locator (URL) for downloading a file to be detected; matching the character strings contained in the URL of the file to be detected against the character strings in a preset model; and determining whether the file to be detected is a malicious file based on the longest character string that the URL of the file to be detected matches in the preset model.
In some embodiments, the preset model includes a dictionary tree (trie) generated by training on URL samples of known malicious files and known non-malicious files.
In some embodiments, in the dictionary tree: each edge corresponds to a character string; each path starting from the root node corresponds to a character string formed by concatenating, in order, the character strings of the edges on the path; and each node stores the number or ratio of non-malicious files and malicious files that satisfy a path matching condition, wherein the path matching condition includes that the character string corresponding to the path from the root node to the node is a prefix of the URL of the file.
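The path matching condition can be illustrated with a small sketch. It is not the application's implementation: the helper name `counts_for_path` and the labelling of the four sample URLs (taken from the examples later in the description) are assumptions made for illustration.

```python
# Hypothetical labelled samples: (url, is_malicious). The URLs are the four
# example URLs used later in the description; the labels are assumed.
SAMPLES = [
    ("www.abc.com/hello.exe", False),
    ("www.ok.com/ok.exe", False),
    ("down.com/notepad.exe", False),
    ("www.ok.com/malware.exe", True),
]


def counts_for_path(path_string, samples=SAMPLES):
    """Return (non_malicious, malicious) counts of the samples whose URL
    starts with path_string, i.e. that satisfy the path matching condition."""
    non_malicious = 0
    malicious = 0
    for url, is_malicious in samples:
        if url.startswith(path_string):
            if is_malicious:
                malicious += 1
            else:
                non_malicious += 1
    return non_malicious, malicious


# The empty string plays the role of the root node: every URL matches it.
root_counts = counts_for_path("")
www_counts = counts_for_path("www.")
```

A node whose path string is "www." would therefore store the pair computed by `counts_for_path("www.")`, or a ratio derived from it.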
In some embodiments, determining whether the file to be detected is a malicious file based on the longest character string that its URL matches in the preset model comprises: obtaining the node reached by the longest character string in the preset model that matches the URL; reading the number or ratio recorded at that node; and determining whether the file to be detected is a malicious file based on the number or ratio.
In some embodiments, determining whether the file to be detected is a malicious file based on the number or ratio comprises: obtaining, according to the path matching condition, the ratio of malicious files to non-malicious files over all paths passing through the node reached by the longest character string, or calculating that ratio from the recorded numbers; judging whether the ratio is greater than a preset threshold; determining that the file to be detected is a malicious file when the ratio is greater than the preset threshold; and determining that the file to be detected is a non-malicious file when the ratio is not greater than the preset threshold.
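The threshold comparison described above might be sketched as follows. The function names, the default threshold value, and the handling of a node with zero non-malicious files are assumptions; the application only states that a ratio greater than a preset threshold means malicious.

```python
import math


def malicious_ratio(malicious, non_malicious):
    """Ratio of malicious to non-malicious files recorded at a node.

    Assumption: a node with no non-malicious files yields an infinite ratio
    when it has at least one malicious file, and 0.0 when it has none.
    """
    if non_malicious == 0:
        return math.inf if malicious > 0 else 0.0
    return malicious / non_malicious


def is_malicious(malicious, non_malicious, threshold=100.0):
    """Return True only when the ratio is strictly greater than the threshold.

    The default threshold is a placeholder, not a recommended value.
    """
    return malicious_ratio(malicious, non_malicious) > threshold
```

Because the comparison is strict, a ratio exactly equal to the threshold is treated as non-malicious, matching the "not greater than" branch in the text.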
In some embodiments, the dictionary tree is generated by training on the sample set as follows: performing string matching on the URLs contained in the sample set, and obtaining from the matching results all common prefix strings of those URLs; making each edge of the dictionary tree correspond to one common prefix string, so that each path starting from the root node corresponds to a character string formed by concatenating, in order, the common prefix strings of the edges on the path, and each path from the root node to a terminal node corresponds to one URL; and storing, at each node of the dictionary tree, the number or ratio of non-malicious files and malicious files that satisfy the path matching condition, wherein the path matching condition includes that the character string corresponding to the path from the root node to the node is a prefix of the URL of the file.
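One way to obtain such a tree is to insert the sample URLs one at a time and split an edge whenever a new URL shares only part of it. This incremental insertion is a standard compressed-trie technique substituted here for illustration; the application does not say the tree is built this way. The `Node` class, the function names, and the sample labels are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    non_malicious: int = 0
    malicious: int = 0
    edges: dict = field(default_factory=dict)  # edge string -> child Node


def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def _count(node, is_malicious):
    if is_malicious:
        node.malicious += 1
    else:
        node.non_malicious += 1


def insert(root, url, is_malicious):
    """Insert one labelled URL, updating the counts on every node it passes."""
    node, rest = root, url
    _count(node, is_malicious)
    while rest:
        for label, child in list(node.edges.items()):
            k = common_prefix_len(label, rest)
            if k == 0:
                continue
            if k < len(label):
                # Only part of the edge is shared: split it at position k.
                # The new middle node inherits the old child's counts.
                middle = Node(child.non_malicious, child.malicious,
                              {label[k:]: child})
                del node.edges[label]
                node.edges[label[:k]] = middle
                child = middle
            _count(child, is_malicious)
            node, rest = child, rest[k:]
            break
        else:
            # No edge shares a prefix with the remainder: add a new leaf.
            leaf = Node()
            _count(leaf, is_malicious)
            node.edges[rest] = leaf
            return


root = Node()
for sample_url, label in [
    ("www.abc.com/hello.exe", False),
    ("www.ok.com/ok.exe", False),
    ("down.com/notepad.exe", False),
    ("www.ok.com/malware.exe", True),
]:
    insert(root, sample_url, label)
```

With these four samples the root ends up with two edges, "www." and "down.com/notepad.exe", and the node below "www." carries the edges "abc.com/hello.exe" and "ok.com/".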
In some embodiments, the method further comprises: updating the preset model according to the result of determining whether the file to be detected is a malicious file.
In another aspect, the present application provides an apparatus for detecting a malicious file, the apparatus comprising: an acquisition module configured to acquire a uniform resource locator (URL) for downloading a file to be detected; a matching module configured to match the character strings contained in the URL of the file to be detected against the character strings in a preset model; and a determination module configured to determine whether the file to be detected is a malicious file based on the longest character string that the URL of the file to be detected matches in the preset model.
In some embodiments, the preset model includes a dictionary tree generated by training on URL samples of known malicious files and known non-malicious files.
In some embodiments, in the dictionary tree: each edge corresponds to a character string; each path starting from the root node corresponds to a character string formed by concatenating, in order, the character strings of the edges on the path; and each node stores the number or ratio of non-malicious files and malicious files that satisfy a path matching condition, wherein the path matching condition includes that the character string corresponding to the path from the root node to the node is a prefix of the URL of the file.
In some embodiments, the determination module comprises: an acquisition unit configured to obtain, according to the path matching condition, the node reached by the longest character string in the preset model that matches the URL; a reading unit configured to read the number or ratio recorded at the node reached by the longest character string; and a determination unit configured to judge whether the file to be detected is a malicious file based on the number or ratio.
In some embodiments, the determination unit comprises: a ratio acquisition subunit configured to obtain the ratio of malicious files to non-malicious files over all paths passing through the node reached by the longest character string, or to calculate that ratio from the recorded numbers; and a determination subunit configured to judge whether the ratio is greater than a preset threshold, to determine that the file to be detected is a malicious file when the ratio is greater than the preset threshold, and to determine that the file to be detected is a non-malicious file when the ratio is not greater than the preset threshold.
In some embodiments, the apparatus further comprises a dictionary tree generation module, which comprises: a string matching unit configured to perform string matching on the URLs contained in the sample set and to obtain from the matching results all common prefix strings of those URLs; and a dictionary tree generation unit configured to make each edge of the dictionary tree correspond to one common prefix string, so that each path starting from the root node corresponds to a character string formed by concatenating, in order, the common prefix strings of the edges on the path, and each path from the root node to a terminal node corresponds to one URL, and to store, at each node of the dictionary tree, the number or ratio of non-malicious files and malicious files that satisfy the path matching condition, wherein the path matching condition includes that the character string corresponding to the path from the root node to the node is a prefix of the URL of the file.
In some embodiments, the apparatus further comprises an update module configured to update the preset model according to the result of determining whether the file to be detected is a malicious file.
The method and apparatus for detecting a malicious file provided by the present application obtain the uniform resource locator (URL) of a file to be detected, match the character strings contained in that URL against the character strings in a preset model, and determine whether the file is malicious based on the longest matched character string. No other information about the file needs to be obtained, which improves the efficiency of identifying malicious files.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
FIG. 1 is a flowchart of one embodiment of the method for detecting a malicious file according to the present application;
FIG. 2 is a schematic diagram of a dictionary tree of the preset model according to the present application;
FIG. 3a is a schematic diagram of another dictionary tree of the preset model according to the present application;
FIG. 3b is a schematic diagram of an example of the dictionary tree shown in FIG. 3a after updating;
FIG. 4 is a schematic diagram of an application scenario of the method for detecting a malicious file according to the present application;
FIG. 5 is a schematic structural diagram of one embodiment of the apparatus for detecting a malicious file according to the present application.
Detailed description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention.
It should be noted that, provided they do not conflict, the embodiments in the present application and the features of those embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Refer to FIG. 1, which shows a flow 100 of one embodiment of the method for detecting a malicious file. This embodiment is described mainly as applied to various electronic devices on which download applications and/or browser applications can be installed, including but not limited to smartphones, smart watches, tablet computers, personal digital assistants, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, and desktop computers. The method for detecting a malicious file includes the following steps:
Step 101: acquire the URL for downloading the file to be detected.
In this embodiment, the electronic device may first acquire, according to a user's request to download a file from the network, the URL (Uniform Resource Locator) for downloading the file to be detected. Here, the file to be detected may be the file that the user has requested to download from the network.
A URL is a concise representation of the location of, and access method for, a resource available on the Internet; it is the address of a standard resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating where the file is located and how a browser or download application should handle it. A basic URL contains a scheme (also called a protocol), a server name (or IP address), a path, and a file name. A URL can be represented as a character string of letters, digits, and symbols, for example: http://www.sohu.com/.
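The URL components listed above can be seen by splitting a URL with Python's standard library. This is only an illustration of the terminology; the example URL, including its `downloads` path segment, is hypothetical and does not come from the application.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical download URL used only to show the parts of a URL.
parts = urlsplit("http://www.ok.com/downloads/ok.exe")

scheme = parts.scheme                      # scheme (protocol)
host = parts.hostname                      # server name or IP address
path = parts.path                          # path on the server
filename = posixpath.basename(parts.path)  # last path segment, the file name
```

The detection method itself treats the URL as a plain character string, so no such parsing is required before matching.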
When downloading a file from a server, the user can issue a request to download the file by clicking a hyperlink or a download address on a page displayed by a browser, or by clicking a hyperlink or entering a download address in a download application. If the download address of the file is known, the electronic device can obtain it directly, and the download address can be regarded as the URL of the file to be detected. If the user clicks a hyperlink to the file, the electronic device can obtain the URL associated with the hyperlink through the browser or the download application; that URL is the URL of the file to be detected.
Step 102: match the character strings contained in the URL of the file to be detected against the character strings in the preset model.
In this embodiment, the electronic device may then match the character strings contained in the URL of the file to be detected against the character strings in the preset model. The preset model may include character strings corresponding to the URLs of multiple known malicious files and character strings corresponding to the URLs of known non-malicious files. In some implementations, the URLs of malicious files and non-malicious files are collected manually. In other implementations, the electronic device first crawls files from multiple download sites and saves their URLs, then has a predetermined antivirus engine (for example, Dr.Web or Kaspersky) examine the files to determine whether each is malicious or non-malicious, thereby obtaining the URLs of multiple known malicious files and known non-malicious files. In practice, the electronic device may obtain these URLs in any other feasible manner, which is not limited by the present application.
The electronic device may store the URLs in the preset model individually (one URL per storage address), or may use string matching in advance to store the strings in the URLs in a tree structure (for example, a dictionary tree). Accordingly, the electronic device may match the URL of the file to be detected against the URLs in the preset model one by one, or may match it, in units of one or more characters, against the strings contained in the tree structure. String matching proceeds in order from the beginning of the string, and two strings match when the characters at every corresponding position are the same. If the character at the current position of the URL of the file to be detected does not match the character at the corresponding position of a URL in the preset model, the character string contained in the URL of the file to be detected is considered not to match that string in the preset model.
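Matching from the start of the string and stopping at the first mismatch can be sketched as below. The function name `matched_prefix` is hypothetical, and the URLs in the usage lines are borrowed from the examples given later in the description.

```python
def matched_prefix(candidate_url, known_url):
    """Return the leading part of candidate_url that agrees with known_url
    position by position, stopping at the first differing character."""
    n = 0
    for a, b in zip(candidate_url, known_url):
        if a != b:
            break
        n += 1
    return candidate_url[:n]


# Example strings taken from later in the description.
shared = matched_prefix("www.ok.com/malware.exe", "www.ok.com/ok.exe")
nothing = matched_prefix("down.com/notepad.exe", "www.ok.com/ok.exe")
```

An empty result means the two URLs differ at the very first character, so nothing matches.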
As an example, the electronic device may store the strings in the URLs in the form of the dictionary tree shown in FIG. 2. A dictionary tree, also called a word search tree (trie), can sort and store a large number of strings (though it is not limited to strings). Its advantage is that it uses the common prefixes of strings to reduce query time, minimizing unnecessary string comparisons and improving query efficiency. If a string consists of the consecutive leading characters of another string, it is a prefix of that other string: for example, "ac" is a prefix of "acm", and "abcd" is a prefix of "abcddfasf"; as a special case, "kdfa" is a prefix of "kdfa". In the example of FIG. 2, suppose the four known URLs are www.abc.com/hello.exe, www.ok.com/ok.exe, down.com/notepad.exe, and www.ok.com/malware.exe. The electronic device can use string matching to find the common prefixes among these four URLs and store each shared character in a single node of the dictionary tree. For instance, www.abc.com/hello.exe, www.ok.com/ok.exe, and www.ok.com/malware.exe share the characters "w", "w", "w", ".", so for these three URLs the characters "w", "w", "w", "." are stored in successive nodes of one subtree of the root node. The URL down.com/notepad.exe shares no characters with the other three, so its characters are stored in the nodes of a separate subtree of the root node. The three URLs www.abc.com/hello.exe, www.ok.com/ok.exe, and www.ok.com/malware.exe continue to be matched in the same way, and wherever their characters differ, the node is given multiple child nodes.
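A character-per-node dictionary tree of the kind described for FIG. 2 might be sketched with nested dictionaries. The representation and the function names are assumptions chosen for brevity; the figure itself is not reproduced here.

```python
def build_char_trie(urls):
    """Build a trie in which each node is a dict mapping one character
    to the child node reached by that character."""
    root = {}
    for url in urls:
        node = root
        for ch in url:
            # Reuse the child if this character is already stored here.
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Count the character nodes below (not including) the given node."""
    return sum(1 + count_nodes(child) for child in node.values())


URLS = [
    "www.abc.com/hello.exe",
    "www.ok.com/ok.exe",
    "down.com/notepad.exe",
    "www.ok.com/malware.exe",
]
trie = build_char_trie(URLS)

# Walk down the shared "www." prefix to the point where the URLs diverge.
after_www = trie["w"]["w"]["w"]["."]
```

The four URLs contain 80 characters in total, but the shared prefixes "www." and "www.ok.com/" are stored only once, so the trie needs fewer nodes than that.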
步骤103,基于待检测文件的URL在预设模型中匹配到的最长字符串,确定待检测文件是否为恶意文件。Step 103: Determine whether the file to be detected is a malicious file, based on the longest string matched in the preset model by the URL of the file to be detected.
在本实施例中,电子设备可以接着基于待检测文件的URL在预设模型中匹配到的最长字符串,确定出待检测文件是否为恶意文件。In this embodiment, the electronic device may then determine whether the file to be detected is a malicious file based on the longest character string matched in the preset model based on the URL of the file to be detected.
其中,待检测文件的URL在预设模型中匹配到的最长字符串,可以是和待检测文件的URL相匹配的字符最多的字符串,例如,预设模型中包括4个URL:www.abc.com/hello.exe、www.ok.com/ok.exe、down.com/notepad.exe、www.ok.com/malware.exe,当待检测文件的URL为www.ok.com/ok malware.exe时,待测文件的URL包含的字符串与预设模型中的字符串相匹配,可以将匹配到字符串“www.ok.com/ok”作为在预设模型中匹配到的最长字符串。在一些实现中,预设模型中的URL单独保存,电子设备可以将待检测文件的URL与预设模型中的URL逐个匹配,并根据与待检测文件的URL具有最长的相匹配字符串的URL所对应的文件类型作为待检测文件的类型。例如在前述的例子中,待检测文件的URL在预设模型中匹配到的最长字符串为“www.ok.com/ok”,对应的URL为 “www.ok.com/ok.exe”,则如果URL“www.ok.com/ok.exe”对应的文件为恶意文件,则电子设备可以确定待检测文件为恶意文件,如果“www.ok.com/ok.exe”对应的文件为非恶意文件,则电子设备可以确定待检测文件为非恶意文件。在另一些实现中,预设模型中的URL以图2所示的字典树形式储存,电子设备可以将待检测文件的URL包含的字符串与字典树中节点处的字符逐个匹配,并按照匹配到的最后一个字符所存储的节点的子树中包括的URL对应的恶意文件和非恶意文件的数量或比值确定待检测文件是否为恶意文件。如前述的例子中,待检测文件的URL“www.ok.com/ok malware.exe”,在图2所示的字典树中匹配到的最后一个字符为“www.ok.com/ok”中的最后一个字符“k”,而该字符对应的子树中只包括1个URL“www.ok.com/ok.exe”,如果URL“www.ok.com/ok.exe”对应的文件是非恶意文件,则电子设备可以根据该字符所存储在的节点对应的子树中所包括的恶意文件与非恶意文件的数量来确定待检测文件是否为恶意文件,例如可以根据恶意文件与非恶意文件的数量(如根据恶意文件在总文件数量中的比重0/(1+0)=0)确定待检测文件为非恶意文件;电子设备还可以根据该字符所存储在的节点对应的子树中所包括的恶意文件与非恶意文件的比值来确定待检测文件是否为恶意文件,例如恶意文件与非恶意文件的比值为0:1=0确定待检测文件为非恶意文件。实践中,电子设备可以预设恶意文件与非恶意文件的比值的阈值(例如可以是100:1),当恶意文件与非恶意文件的比值大于该阈值时,确定待检测文件为恶意文件,否则,确定待检测文件为非恶意文件。该阈值可以由人工根据经验设定,也可以根据对预设模型的验证样本集的判断准确率(例如是99%)训练确定。可选地,电子设备也可以预设非恶意文件与恶意文件的比值,并在该比值是否小于预设的非恶意文件与恶意文件的比值阈值时,确定待检测文件为恶意文件等,本申请对此不做限定。The longest string matched by the URL of the file to be detected in the preset model may be the string with the most characters matching the URL of the file to be detected. For example, the preset model includes four URLs: www. Abc.com/hello.exe, www.ok.com/ok.exe, down.com/notepad.exe, www.ok.com/malware.exe, when the URL of the file to be tested is www.ok.com/ok In malware.exe, the URL of the file to be tested contains a string matching the string in the preset model, and the matching string "www.ok.com/ok" can be matched as the most matched in the preset model. Long string. 
In some implementations, the URLs in the preset model are separately saved, and the electronic device may match the URL of the file to be detected with the URL in the preset model one by one, and have the longest matching string according to the URL of the file to be detected. The file type corresponding to the URL is the type of the file to be detected. For example, in the foregoing example, the longest character string matched to the URL of the file to be detected in the preset model is “www.ok.com/ok”, and the corresponding URL is "www.ok.com/ok.exe", if the file corresponding to the URL "www.ok.com/ok.exe" is a malicious file, the electronic device can determine that the file to be detected is a malicious file, if "www.ok The file corresponding to .com/ok.exe is a non-malicious file, and the electronic device can determine that the file to be detected is a non-malicious file. In other implementations, the URL in the preset model is stored in the form of a dictionary tree as shown in FIG. 2, and the electronic device may match the character string included in the URL of the file to be detected with the characters at the node in the dictionary tree one by one, and match the matching. The number or ratio of malicious files and non-malicious files corresponding to the URL included in the subtree of the node stored by the last character to the last character determines whether the file to be detected is a malicious file. 
As in the foregoing example, the URL of the file to be detected "www.ok.com/ok malware.exe", the last character matched in the dictionary tree shown in Figure 2 is "www.ok.com/ok" The last character "k", and the subtree corresponding to the character includes only one URL "www.ok.com/ok.exe", if the file corresponding to the URL "www.ok.com/ok.exe" is non- The malicious file can determine whether the file to be detected is a malicious file according to the number of malicious files and non-malicious files included in the subtree corresponding to the node in which the character is stored, for example, according to malicious files and non-malicious files. The number of the files (such as the proportion of malicious files in the total number of files 0 / (1 + 0) = 0) determines that the file to be detected is a non-malicious file; the electronic device can also be based on the subtree corresponding to the node in which the character is stored The ratio of the malicious file to the non-malicious file is included to determine whether the file to be detected is a malicious file. For example, the ratio of the malicious file to the non-malicious file is 0:1=0, and the file to be detected is a non-malicious file. In practice, the electronic device may preset a threshold of a ratio of a malicious file to a non-malicious file (for example, may be 100:1). When the ratio of the malicious file to the non-malicious file is greater than the threshold, the file to be detected is determined to be a malicious file, otherwise , to determine that the file to be detected is a non-malicious file. The threshold may be manually determined according to experience, or may be determined based on the judgment accuracy (for example, 99%) of the verification sample set of the preset model. 
Optionally, the electronic device may instead preset a threshold for the ratio of non-malicious files to malicious files, and determine that the file to be detected is a malicious file when that ratio is less than the preset threshold. The present application does not limit this.
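The first implementation above, in which the known URLs are stored separately, can be sketched as a longest-common-prefix scan. The following Python is a minimal illustration only, not the applicant's implementation: the function name `classify_by_longest_match`, the `KNOWN` dictionary, and the reading of "longest matching string" as the longest common prefix are all assumptions made for this example.

```python
from os.path import commonprefix  # character-wise common prefix of a list of strings

# Hypothetical sample model: URL -> True if the file is known to be malicious.
KNOWN = {
    "www.ok.com/ok.exe": False,
    "www.ok.com/malware.exe": True,
    "www.abc.com/hello.exe": False,
    "down.com/notepad.exe": False,
}


def classify_by_longest_match(url, known):
    """Return (is_malicious, matched_prefix) for the known URL that shares the
    longest common prefix with `url`, or (None, "") if no known URL matches."""
    best_url, best_len = None, 0
    for candidate in known:
        n = len(commonprefix([url, candidate]))
        if n > best_len:
            best_url, best_len = candidate, n
    if best_url is None:
        return None, ""
    return known[best_url], url[:best_len]


verdict, matched = classify_by_longest_match("www.ok.com/ok malware.exe", KNOWN)
```

With this sample data the scan matches "www.ok.com/ok" through the known URL "www.ok.com/ok.exe", so the file to be detected inherits that file's non-malicious label. A linear scan compares the input against every stored URL, which is one reason a tree-structured model may be preferable for large sample sets.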
In an optional implementation of this embodiment, when the URLs in the preset model are stored in the form of a dictionary tree, in order to save storage resources and improve matching efficiency, the dictionary tree may be organized as follows: each edge corresponds to a string; each path starting from the root node corresponds to a string formed by concatenating, in order, the strings of the edges on the path; and each node stores the number or ratio of non-malicious files and malicious files that satisfy a path matching condition. The path matching condition may include: the string corresponding to the path from the root node to the node is a prefix of the URL of the file. Optionally, the string corresponding to an edge of the dictionary tree may be recorded at the node connected to the far end of that edge; for example, the string corresponding to edge 3020 may be stored at node 3002.
As shown in FIG. 3a, the URLs of four known files include the URL of one malicious file, "www.ok.com/malware.exe", and the URLs of three non-malicious files, "www.abc.com/hello.exe", "www.ok.com/ok.exe" and "down.com/notepad.exe". The electronic device may record at the root node 3000 that the numbers of non-malicious files and malicious files are 3 and 1, respectively. Following the string matching method described above, the URL "down.com/notepad.exe" has no common prefix with the other three URLs, so an edge 3010 connected to the root node corresponds to the string "down.com/notepad.exe", and node 3001 at the other end of that edge records that the numbers of non-malicious files and malicious files are 1 and 0, respectively. The URLs "www.ok.com/malware.exe", "www.abc.com/hello.exe" and "www.ok.com/ok.exe" share the prefix string "www.", so another edge 3020 connected to the root node corresponds to the common prefix "www." of the three URLs, and node 3002 at the other end of that edge records that the numbers of non-malicious files and malicious files are 2 and 1, respectively. Next, the characters of "www.abc.com/hello.exe" that follow differ from those of the other two URLs, so an edge 3030 connected to the shared node 3002 corresponds to the string "abc.com/hello.exe", and node 3003 at the other end of edge 3030 records that the numbers of malicious files and non-malicious files are 0 and 1, respectively. An edge 3040 connected to node 3002 corresponds to the common string "ok.com/" of the other two URLs, and node 3004 at the other end of edge 3040 records that the numbers of malicious files and non-malicious files are 1 and 1, respectively. Then edge 3050 corresponds to the string "malware.exe", and the corresponding node 3005 records that the numbers of malicious files and non-malicious files are 1 and 0, respectively; likewise, edge 3060 corresponds to the string "ok.exe", and the corresponding node 3006 records that the numbers of malicious files and non-malicious files are 0 and 1, respectively. And so on, until the characters contained in the URLs of all known malicious and non-malicious files in the sample set are stored in the dictionary tree. Optionally, a node may instead record the ratio of non-malicious files to malicious files that satisfy the path matching condition; for example, root node 3000 records a ratio of 3:1.
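To make the structure of FIG. 3a concrete, the sketch below writes the tree out as nested Python dictionaries and checks the path matching condition: the counters at each node should equal the number of sample URLs that start with the node's path string. The dictionary layout and the names `node`, `SAMPLES`, `FIG_3A`, `satisfies_path_condition` and `leaf_paths` are illustrative assumptions; the counts are derived from the four sample URLs listed above.

```python
def node(benign, malicious, edges=None):
    """One tree node: file counters plus outgoing edges keyed by edge string."""
    return {"benign": benign, "malicious": malicious, "edges": edges or {}}


# The four sample URLs of FIG. 3a: URL -> True if the file is malicious.
SAMPLES = {
    "www.ok.com/malware.exe": True,
    "www.abc.com/hello.exe": False,
    "www.ok.com/ok.exe": False,
    "down.com/notepad.exe": False,
}

# Tree of FIG. 3a; comments give the reference numerals of the nodes.
FIG_3A = node(3, 1, {                        # root node 3000
    "down.com/notepad.exe": node(1, 0),      # node 3001
    "www.": node(2, 1, {                     # node 3002
        "abc.com/hello.exe": node(1, 0),     # node 3003
        "ok.com/": node(1, 1, {              # node 3004
            "malware.exe": node(0, 1),       # node 3005
            "ok.exe": node(1, 0),            # node 3006
        }),
    }),
})


def satisfies_path_condition(tree, samples, path=""):
    """True if every node counts exactly the samples whose URL starts with its path."""
    benign = sum(1 for u, bad in samples.items() if u.startswith(path) and not bad)
    malicious = sum(1 for u, bad in samples.items() if u.startswith(path) and bad)
    if (tree["benign"], tree["malicious"]) != (benign, malicious):
        return False
    return all(satisfies_path_condition(child, samples, path + label)
               for label, child in tree["edges"].items())


def leaf_paths(tree, path=""):
    """Concatenate edge strings along every root-to-leaf path."""
    if not tree["edges"]:
        return [path]
    return [p for label, child in tree["edges"].items()
            for p in leaf_paths(child, path + label)]
```

Each root-to-leaf path in this layout spells out one sample URL, and each counter summarizes the whole subtree below its node, so a lookup never needs to enumerate the subtree.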
In some implementations of this embodiment, when the URLs in the preset model are stored in the form of the dictionary tree shown in FIG. 3a, the electronic device may first obtain, according to the path matching condition described above, the node reached by the longest string in the preset model that matches the URL of the file to be detected; then read the number or ratio recorded at that node; and then determine, based on the number or ratio, whether the file to be detected is a malicious file. Optionally, the electronic device may directly obtain the ratio of malicious files to non-malicious files among all paths passing through the node reached by the longest string that the URL of the file to be detected matches in the preset model, or may calculate that ratio from the numbers recorded at that node. The electronic device then judges whether the ratio is greater than a preset threshold: when the ratio is greater than the preset threshold, the file to be detected is determined to be a malicious file; when the ratio is not greater than the preset threshold, the file to be detected is determined to be a non-malicious file. The threshold may be set manually based on experience, or may be determined by training against the judgment accuracy of the preset model on a verification sample set. In some cases the number of non-malicious files may be 0. When calculating the ratio of malicious files to non-malicious files, the number of non-malicious files may then be taken as the smallest non-zero decimal that the electronic device can compute, such as 0.0000001, or the ratio may be taken as the largest value that the electronic device can compute, such as 99999999. Those skilled in the art will understand that when the dictionary tree records the ratio of non-malicious files to malicious files, the above method of determining whether the file to be detected is a malicious file according to the ratio applies equally.
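The threshold test, including the zero-denominator case, might look like the following. This is a sketch under assumptions: the names `malicious_ratio` and `is_malicious` are invented here, the constants 99999999 and 100 are the example values from the text, and returning 0 for a node that has no files at all is a choice made for this example that the text does not address.

```python
MAX_RATIO = 99999999.0       # example "largest value" from the text
DEFAULT_THRESHOLD = 100.0    # example threshold of 100:1 from the text


def malicious_ratio(malicious, benign):
    """Ratio of malicious to non-malicious files, capped when benign is 0."""
    if benign == 0:
        # Assumption: an empty node (0 malicious, 0 benign) yields ratio 0.
        return MAX_RATIO if malicious > 0 else 0.0
    return malicious / benign


def is_malicious(malicious, benign, threshold=DEFAULT_THRESHOLD):
    """Malicious only when the ratio is strictly greater than the threshold."""
    return malicious_ratio(malicious, benign) > threshold
```

Because the comparison is strict, a node whose ratio is exactly 100:1 is still classified as non-malicious, matching the "not greater than" branch in the text.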
As an example, if the electronic device uses the dictionary tree shown in FIG. 3a as the preset model, the URL of the file to be detected may be matched by the following process. Suppose the electronic device obtains the URL "www.ok.com/ok malware.exe" for downloading the file to be detected. The electronic device then matches the string contained in the URL against the strings in the preset model, that is, the dictionary tree shown in FIG. 3a. First, the electronic device matches the string "www." corresponding to edge 3020 and reaches node 3002. Next, it matches the string "ok.com/" corresponding to edge 3040 and reaches node 3004. It then compares the remaining string "ok malware.exe" with the string "malware.exe" corresponding to edge 3050 and with the string "ok.exe" corresponding to edge 3060, and neither matches. The electronic device can therefore determine that the longest string matched by the URL "www.ok.com/ok malware.exe" in the dictionary tree shown in FIG. 3a is "www.ok.com/", formed by edges 3020 and 3040, and that the farthest node reached by this longest string is node 3004. The electronic device may then read that the numbers of malicious files and non-malicious files recorded at node 3004 are 1 and 1, respectively, and calculate that the ratio of malicious files to non-malicious files among the URLs whose strings pass through node 3004 is 1:1. Assuming the preset threshold for the ratio of malicious files to non-malicious files is 100:1, the ratio at node 3004 is less than the preset threshold, and the electronic device may determine that the file to be detected is a non-malicious file.
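The walk through FIG. 3a described above can be sketched as follows. The nested-dictionary tree, the helper names, and the rule that an edge is followed only when its whole string matches the start of the remaining URL are assumptions chosen to reproduce the worked example; the text does not spell out how a partially matching edge is handled.

```python
def node(benign, malicious, edges=None):
    """One tree node: file counters plus outgoing edges keyed by edge string."""
    return {"benign": benign, "malicious": malicious, "edges": edges or {}}


# Tree of FIG. 3a, built from the four sample URLs in the text.
FIG_3A = node(3, 1, {
    "down.com/notepad.exe": node(1, 0),
    "www.": node(2, 1, {
        "abc.com/hello.exe": node(1, 0),
        "ok.com/": node(1, 1, {
            "malware.exe": node(0, 1),
            "ok.exe": node(1, 0),
        }),
    }),
})


def longest_match(tree, url):
    """Follow edges whose full string prefixes the rest of `url`.

    Returns the deepest node reached and the string matched so far.
    """
    current, matched, rest = tree, "", url
    while True:
        for label, child in current["edges"].items():
            if rest.startswith(label):
                current, matched, rest = child, matched + label, rest[len(label):]
                break
        else:
            # No outgoing edge matches: this is the farthest node.
            return current, matched


reached, matched = longest_match(FIG_3A, "www.ok.com/ok malware.exe")
ratio = reached["malicious"] / reached["benign"]   # benign is 1 here, so no division by zero
verdict_malicious = ratio > 100.0                  # example threshold of 100:1
```

For the example URL the walk stops at the node for "www.ok.com/" (node 3004), whose counters are 1 malicious and 1 non-malicious, so the ratio 1:1 stays below the example threshold.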
In some implementations of this embodiment, after determining that the file to be detected is a malicious file or a non-malicious file, the electronic device may further update the preset model according to the result. In other words, the electronic device may store the URL of the file to be detected in the preset model and update the related content in the preset model, treating the file as a known malicious or non-malicious file. For example, in the above example that uses the dictionary tree shown in FIG. 3a as the preset model, the electronic device determines from the URL "www.ok.com/ok malware.exe" that the file to be detected is a non-malicious file. The electronic device may then use the URL "www.ok.com/ok malware.exe" as a known sample to update the dictionary tree in FIG. 3a, obtaining the updated dictionary tree shown in FIG. 3b. In FIG. 3b, the dictionary tree has new nodes 3007 and 3008; the string corresponding to edge 3060 is updated to "ok", the common string of "ok malware.exe" and "ok.exe"; the string corresponding to edge 3070 is "malware.exe"; and the string corresponding to edge 3080 is ".exe". Because the number of non-malicious files on the corresponding path increases by 1, the data of each node on that path is updated as well: for example, the number of non-malicious files at node 3000 is updated to 4, the number of non-malicious files at node 3002 is updated to 3, and so on.
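One way to perform such an update is a standard compressed-trie (radix tree) insertion that splits an edge at the point where the new URL diverges from it. The sketch below is an assumption about how FIG. 3b could be obtained, not a description of the applicant's code; `new_node`, `bump`, `common_len` and `insert` are names invented for this example.

```python
def new_node(benign=0, malicious=0):
    return {"benign": benign, "malicious": malicious, "edges": {}}


def bump(node, is_malicious):
    """Count one more file at this node."""
    node["malicious" if is_malicious else "benign"] += 1


def common_len(a, b):
    """Length of the common prefix of strings a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def insert(root, url, is_malicious):
    """Add one labeled URL, splitting an edge where the URL diverges from it."""
    current, rest = root, url
    bump(current, is_malicious)
    while rest:
        for label, child in list(current["edges"].items()):
            k = common_len(label, rest)
            if k == 0:
                continue
            if k < len(label):
                # Split: current -label[:k]-> mid -label[k:]-> child.
                mid = new_node(child["benign"], child["malicious"])
                mid["edges"][label[k:]] = child
                del current["edges"][label]
                current["edges"][label[:k]] = mid
                child = mid
            bump(child, is_malicious)
            current, rest = child, rest[k:]
            break
        else:
            # No edge shares a prefix with the remainder: add a new leaf.
            leaf = new_node()
            bump(leaf, is_malicious)
            current["edges"][rest] = leaf
            return


# Rebuild FIG. 3a from its four samples, then apply the update of FIG. 3b.
root = new_node()
for sample_url, bad in [("www.ok.com/malware.exe", True),
                        ("www.abc.com/hello.exe", False),
                        ("www.ok.com/ok.exe", False),
                        ("down.com/notepad.exe", False)]:
    insert(root, sample_url, bad)
insert(root, "www.ok.com/ok malware.exe", False)
```

After the update, the edge formerly labeled "ok.exe" is labeled "ok" and leads to a new inner node with two children. Taken literally, the remainder of the new URL after the shared "ok" is " malware.exe" with its leading space, because the example URL contains a space.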
In some implementations of this embodiment, the electronic device may generate the dictionary tree by training on a sample set consisting of URLs of known malicious files and URLs of known non-malicious files, as follows: performing string matching on the URLs included in the sample set, and obtaining, according to the matching result, all common prefix strings of the URLs included in the sample set; making each edge of the dictionary tree correspond to one common prefix string, so that each path starting from the root node corresponds to a string formed by concatenating, in order, the common prefix strings of the edges on the path, and each path from the root node to a terminal node corresponds to one URL; and storing, at each node of the dictionary tree, the number or ratio of non-malicious files and malicious files that satisfy the path matching condition. The path matching condition may include: the string corresponding to the path from the root node to the node is a prefix of the URL of the file. Here, a common prefix string may be part of the common prefix of URLs that share a prefix, such as the string "ok.com/" corresponding to edge 3040 in FIG. 3a; it may also be the string of one URL that does not match any other URL, such as the string "ok.exe" corresponding to edge 3060 or the string "down.com/notepad.exe" corresponding to edge 3010 in FIG. 3a.
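The training procedure above can also be sketched as a batch construction: group the remaining URL suffixes by first character, take the common prefix of each group as the edge string, and recurse. This is one possible reading of "obtaining all common prefix strings", offered as an assumption; the function name `build` and the dictionary layout are hypothetical.

```python
from os.path import commonprefix  # character-wise common prefix of a list of strings


def build(samples):
    """Build a counted prefix tree from (url_suffix, is_malicious) pairs."""
    tree = {
        "benign": sum(1 for _, bad in samples if not bad),
        "malicious": sum(1 for _, bad in samples if bad),
        "edges": {},
    }
    # Suffixes that start with the same character share a non-empty prefix.
    groups = {}
    for suffix, bad in samples:
        if suffix:  # an empty suffix means the URL ends at this node
            groups.setdefault(suffix[0], []).append((suffix, bad))
    for group in groups.values():
        label = commonprefix([suffix for suffix, _ in group])
        tree["edges"][label] = build([(suffix[len(label):], bad)
                                      for suffix, bad in group])
    return tree


SAMPLE_SET = [
    ("www.ok.com/malware.exe", True),
    ("www.abc.com/hello.exe", False),
    ("www.ok.com/ok.exe", False),
    ("down.com/notepad.exe", False),
]
tree = build(SAMPLE_SET)
```

Applied to the four sample URLs, this yields the edges and counters of FIG. 3a: "www." and "down.com/notepad.exe" under the root, then "abc.com/hello.exe" and "ok.com/", then "malware.exe" and "ok.exe".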
One application scenario of this embodiment is the process by which an electronic device with an antivirus application installed detects malicious files (the antivirus process). The antivirus application contains a pre-trained preset model. As shown in FIG. 4, at reference numeral 401, the user downloads a file on the electronic device by clicking the hyperlink or download address corresponding to the file. At this point, the antivirus application on the electronic device treats the file that the user wants to download as the file to be detected, and obtains the download address (URL) of the file to be detected or the URL associated with the hyperlink, as indicated by reference numeral 402. Next, as indicated by reference numeral 403, the antivirus application matches the string contained in the URL against the strings in the preset model. Then, as indicated by reference numeral 404, the antivirus application determines whether the file to be detected is a malicious file according to the longest string that the URL of the file to be detected matches in the preset model. If the file to be detected is a malicious file, then, as indicated by reference numeral 405, the antivirus application prompts the user that the file to be downloaded is malicious, or refuses to connect to the corresponding website. Otherwise, the electronic device downloads the file normally. This embodiment determines whether the file to be detected is a malicious file from its URL, which improves the efficiency of identifying malicious files.
With further reference to FIG. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for detecting a malicious file. This apparatus embodiment corresponds to the method embodiment shown in FIG. 1, and the apparatus may specifically be applied in an electronic device.
As shown in FIG. 5, the apparatus 500 for detecting a malicious file includes an obtaining module 501, a matching module 502 and a determining module 503. The obtaining module 501 may be configured to obtain a uniform resource locator (URL) for downloading a file to be detected; the matching module 502 may be configured to match the string contained in the URL of the file to be detected against the strings in a preset model; and the determining module 503 may be configured to determine, based on the longest string that the URL of the file to be detected matches in the preset model, whether the file to be detected is a malicious file.
In this embodiment, the obtaining module 501 of the apparatus 500 for detecting a malicious file may obtain the URL for downloading the file to be detected according to the user's request to download a file from the network. Here, the file to be detected may be the file that the user has requested to download from the network.
In this embodiment, the matching module 502 may then match the string contained in the URL of the file to be detected against the strings in the preset model. The preset model may include the URL strings of a plurality of known malicious files and non-malicious files. These URL strings may be stored separately in the preset model, or may be stored in a tree structure (for example, a dictionary tree). Correspondingly, the matching module 502 may perform string matching between the URL of the file to be detected and the URLs in the preset model one by one, or may perform string matching against the strings contained in the tree structure in units of one or more characters.
In this embodiment, the determining module 503 may then determine, based on the longest string that the URL of the file to be detected matches in the preset model, whether the file to be detected is a malicious file. In some implementations, the URLs in the preset model are stored separately and the matching module 502 may match the URL of the file to be detected against the URLs in the preset model one by one; the determining module 503 may then take the file type corresponding to the URL that shares the longest matching string with the URL of the file to be detected as the type of the file to be detected. In other implementations, the URLs in the preset model are stored in the form of the dictionary tree shown in FIG. 2 or FIG. 3a and the matching module 502 may match the string contained in the URL of the file to be detected against the characters at the nodes of the dictionary tree one by one; the determining module 503 may then determine whether the file to be detected is a malicious file according to the number or ratio of malicious files and non-malicious files corresponding to the URLs included in the subtree of the last matched character.
In some implementations of this embodiment, when the URLs in the preset model are stored in the form of the dictionary tree shown in FIG. 2 or FIG. 3a, in the dictionary tree: each edge corresponds to a string; each path starting from the root node corresponds to a string formed by concatenating, in order, the strings of the edges on the path; and each node stores the number or ratio of non-malicious files and malicious files that satisfy a path matching condition. The path matching condition includes: the string corresponding to the path from the root node to the node is a prefix of the URL of the file.
In some implementations of this embodiment, the determining module may include: an obtaining unit (not shown) configured to obtain the node reached by the longest string in the preset model that matches the URL; a reading unit (not shown) configured to read the number or ratio recorded at the node reached by the longest string; and a determining unit (not shown) configured to determine, based on the number or ratio, whether the file to be detected is a malicious file.
In some implementations of this embodiment, the determining unit may further include: a ratio obtaining subunit (not shown) configured to obtain the ratio of malicious files to non-malicious files among all paths passing through the node reached by the longest string, or to calculate that ratio from the number; and a determining subunit (not shown) configured to judge whether the ratio is greater than a preset threshold, to determine that the file to be detected is a malicious file when the ratio is greater than the preset threshold, and to determine that the file to be detected is a non-malicious file when the ratio is not greater than the preset threshold.
In some implementations of this embodiment, the apparatus 500 for detecting a malicious file may further include a dictionary tree generating module. The dictionary tree generating module may include: a string matching unit (not shown) configured to perform string matching on the URLs included in the sample set and to obtain, according to the matching result, all common prefix strings of the URLs included in the sample set; and a dictionary tree generating unit (not shown) configured to make each edge of the dictionary tree correspond to one common prefix string, so that each path starting from the root node corresponds to a string formed by concatenating, in order, the common prefix strings of the edges on the path, and each path from the root node to a terminal node corresponds to one URL, and to store, at each node of the dictionary tree, the number or ratio of non-malicious files and malicious files that satisfy the path matching condition, wherein the path matching condition includes: the string corresponding to the path from the root node to the node is a prefix of the URL of the file.
In some implementations of this embodiment, the apparatus 500 for detecting a malicious file may further include an update module (not shown) configured to update the preset model according to the result of determining whether the file to be detected is a malicious file. After the determining module 503 determines that the file to be detected is a malicious file or a non-malicious file, the update module may store the URL of the file to be detected in the preset model and update the related content in the preset model, treating the file as a known malicious or non-malicious file.
Those skilled in the art will understand that the apparatus 500 for detecting a malicious file also includes some other well-known structures, such as a processor and a memory. In order not to unnecessarily obscure the embodiments of the present disclosure, these well-known structures are not shown in FIG. 5.
The units or modules involved in the embodiments of the present application may be implemented in software or in hardware. The described modules or units may also be provided in a processor; for example, this may be described as: a processor including an obtaining module, a matching module and a determining module. The names of these modules do not, in some cases, constitute a limitation on the modules themselves; for example, the obtaining module may also be described as "a module configured to obtain a uniform resource locator (URL) for downloading a file to be detected".
In another aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium may be the computer-readable storage medium included in the apparatus described in the above embodiments, or may be a computer-readable storage medium that exists separately and is not assembled into a terminal. The computer-readable storage medium stores one or more programs, which are used by one or more processors to perform the method for detecting a malicious file described in the present application.
The above description is only a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (16)

  1. A method for detecting a malicious file, the method comprising:
    obtaining a uniform resource locator (URL) for downloading a file to be detected;
    matching a string contained in the URL of the file to be detected against strings in a preset model;
    determining, based on the longest string that the URL of the file to be detected matches in the preset model, whether the file to be detected is a malicious file.
  2. The method according to claim 1, wherein the preset model comprises a dictionary tree generated by training on a sample set, and the sample set comprises URLs of known malicious files and URLs of known non-malicious files.
  3. The method according to claim 2, wherein, in the dictionary tree:
    each edge corresponds to a string;
    each path starting from the root node corresponds to a string, the string of the path being formed by concatenating, in order, the strings corresponding to the edges on the path;
    each node stores the number or ratio of non-malicious files and malicious files that satisfy a path matching condition, wherein the path matching condition comprises: the string corresponding to the path from the root node to the node is a prefix of the URL of the file.
  4. The method according to claim 3, wherein determining, based on the longest string that the URL of the file to be detected matches in the preset model, whether the file to be detected is a malicious file comprises:
    obtaining, according to the path matching condition, the node reached by the longest string in the preset model that matches the URL of the file to be detected;
    reading the number or ratio recorded at the node reached by the longest string;
    determining, based on the number or ratio, whether the file to be detected is a malicious file.
  5. The method according to claim 4, wherein determining, based on the number or ratio, whether the file to be detected is a malicious file comprises:
    calculating, from the number, the ratio of malicious files to non-malicious files among all paths passing through the node reached by the longest string, or obtaining the ratio of malicious files to non-malicious files among all paths passing through the node reached by the longest string;
    judging whether the ratio is greater than a preset threshold;
    when the ratio is greater than the preset threshold, determining that the file to be detected is a malicious file;
    when the ratio is not greater than the preset threshold, determining that the file to be detected is a non-malicious file.
  6. 根据权利要求2-5任意一项所述的方法,其特征在于,所述字典树包括通过以下方法将所述样本集训练生成的字典树:The method according to any one of claims 2-5, wherein the dictionary tree comprises a dictionary tree trained by the sample set by:
    将所述样本集中所包含的URL进行字符串匹配,并根据匹配结果获取所述样本集包含的URL的所有公共前缀字符串;Performing string matching on the URLs included in the sample set, and acquiring all common prefix strings of the URLs included in the sample set according to the matching result;
    使所述字典树的每条边对应一个公共前缀字符串,每条从根节点出发的路径对应一个字符串,路径中的字符串由路径中的边对应的公共前缀字符串按顺序拼接而成,每条从根节点到达终端节点的路径对应一个URL;Each side of the dictionary tree corresponds to a common prefix string, and each path from the root node corresponds to a character string, and the string in the path is spliced in sequence by the common prefix string corresponding to the edge in the path. Each path from the root node to the terminal node corresponds to a URL;
    在所述字典树的每个节点存放满足路径匹配条件的非恶意文件和恶意文件的数量或比值,其中,所述路径匹配条件包括从根节点到该节点处的路径对应的字符串是文件的URL的前缀。Storing, in each node of the dictionary tree, a number or a ratio of non-malicious files and malicious files satisfying a path matching condition, wherein the path matching condition includes a string corresponding to a path from the root node to the node is a file The prefix of the URL.
  7. The method according to any one of claims 1-6, wherein the method further comprises:
    updating the preset model according to the result of determining whether the file to be detected is a malicious file.
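
The claims do not state how the preset model is updated. One plausible reading, sketched below under that assumption, is to feed the verdict back into the tree by incrementing the counts along the path of the newly classified URL. The dictionary-based node layout and the names `new_node` and `update_model` are hypothetical.

```python
def new_node():
    """Create an empty tree node (hypothetical layout)."""
    return {"children": {}, "malicious": 0, "benign": 0}


def update_model(root, url, is_malicious):
    """Add one classified URL to the tree, creating nodes as needed."""
    key = "malicious" if is_malicious else "benign"
    node = root
    node[key] += 1                    # the empty prefix matches every URL
    for ch in url:
        node = node["children"].setdefault(ch, new_node())
        node[key] += 1                # this path is a prefix of url


root = new_node()
update_model(root, "ab", True)    # a file just judged malicious
update_model(root, "ac", False)   # a file just judged non-malicious
```

A deployment would probably want to confirm a verdict before feeding it back, since an incorrect verdict would otherwise reinforce itself in the counts.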
  8. An apparatus for detecting a malicious file, wherein the apparatus comprises:
    an obtaining module configured to obtain a uniform resource locator (URL) of a file to be detected;
    a matching module configured to match strings contained in the URL of the file to be detected against strings in a preset model;
    a determining module configured to determine whether the file to be detected is a malicious file, based on the longest string in the preset model matched by the URL of the file to be detected.
  9. The apparatus according to claim 8, wherein the preset model comprises a dictionary tree generated by training on URL samples of known malicious files and non-malicious files.
  10. The apparatus according to claim 9, wherein, in the dictionary tree:
    each edge corresponds to one string;
    each path starting from the root node corresponds to one string, the string of a path being formed by concatenating, in order, the strings corresponding to the edges on the path;
    each node stores the number or ratio of non-malicious files and malicious files that satisfy a path matching condition, wherein the path matching condition comprises that the string corresponding to the path from the root node to that node is a prefix of the URL of the file.
  11. The apparatus according to claim 10, wherein the determining module comprises:
    an obtaining unit configured to obtain the node reached by the longest string in the preset model that matches the URL;
    a reading unit configured to read the number or ratio recorded at the node reached by the longest string;
    a determining unit configured to determine, based on the number or ratio, whether the file to be detected is a malicious file.
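
The three units of claim 11 (obtain the node reached by the longest match, read its counts, decide) can be sketched end to end. As a simplification, the tree below uses one character per edge instead of multi-character strings; the function names, the dictionary node layout, the `.example` URLs, and the zero-denominator rule are assumptions for illustration only.

```python
def new_node():
    return {"children": {}, "malicious": 0, "benign": 0}


def insert(root, url, is_malicious):
    """Record one training URL, counting it at every prefix node."""
    key = "malicious" if is_malicious else "benign"
    node = root
    node[key] += 1
    for ch in url:
        node = node["children"].setdefault(ch, new_node())
        node[key] += 1


def longest_match_node(root, url):
    """Obtaining unit: follow url as far as the tree allows."""
    node, matched = root, 0
    for ch in url:
        child = node["children"].get(ch)
        if child is None:
            break
        node, matched = child, matched + 1
    return node, url[:matched]


def is_malicious(root, url, threshold):
    """Reading and determining units: read the counts, compare the ratio."""
    node, _ = longest_match_node(root, url)
    mal, ben = node["malicious"], node["benign"]
    if ben == 0:
        return mal > 0   # assumption: zero non-malicious samples is undefined in the claims
    return mal / ben > threshold


root = new_node()
insert(root, "http://bad.example/a.exe", True)
insert(root, "http://bad.example/b.exe", True)
insert(root, "http://good.example/c.zip", False)
```

An unseen URL such as `http://bad.example/new.exe` matches only as far as `http://bad.example/`, so the verdict comes from the counts stored at that node rather than from an exact URL lookup.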
  12. The apparatus according to claim 11, wherein the determining unit comprises:
    a ratio obtaining subunit configured to obtain, according to the path matching condition, the ratio of malicious files to non-malicious files among all paths passing through the node reached by the longest string, or to calculate that ratio according to the numbers;
    a determining subunit configured to determine whether the ratio is greater than a preset threshold; and
    when the ratio is greater than the preset threshold, to determine that the file to be detected is a malicious file;
    when the ratio is not greater than the preset threshold, to determine that the file to be detected is a non-malicious file.
  13. The apparatus according to any one of claims 9-12, wherein the apparatus further comprises a dictionary tree generating module, the dictionary tree generating module comprising:
    a string matching unit configured to perform string matching on the URLs contained in the sample set, and to obtain, according to the matching result, all common prefix strings of the URLs contained in the sample set;
    a dictionary tree generating unit configured to make each edge of the dictionary tree correspond to one common prefix string, each path starting from the root node correspond to one string, the string of a path being formed by concatenating, in order, the common prefix strings corresponding to the edges on the path, and each path from the root node to a terminal node correspond to one URL, and to store, at each node of the dictionary tree, the number or ratio of non-malicious files and malicious files that satisfy a path matching condition, wherein the path matching condition comprises that the string corresponding to the path from the root node to that node is a prefix of the URL of the file.
  14. The apparatus according to any one of claims 8-13, wherein the apparatus further comprises an update module configured to update the preset model according to the result of determining whether the file to be detected is a malicious file.
  15. A device, comprising:
    one or more processors;
    a memory;
    one or more programs, the one or more programs being stored in the memory and, when executed by the one or more processors, performing the following:
    obtaining the uniform resource locator (URL) for downloading a file to be detected;
    matching strings contained in the URL of the file to be detected against strings in a preset model;
    determining whether the file to be detected is a malicious file, based on the longest string in the preset model matched by the URL of the file to be detected.
  16. A non-volatile computer storage medium storing one or more programs which, when executed by a device, cause the device to:
    obtain the uniform resource locator (URL) for downloading a file to be detected;
    match strings contained in the URL of the file to be detected against strings in a preset model;
    determine whether the file to be detected is a malicious file, based on the longest string in the preset model matched by the URL of the file to be detected.
PCT/CN2015/090707 2015-06-19 2015-09-25 Method and apparatus for detecting malicious file WO2016201819A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510346583.8A CN104933363B (en) 2015-06-19 2015-06-19 The method and apparatus for detecting malicious file
CN201510346583.8 2015-06-19

Publications (1)

Publication Number Publication Date
WO2016201819A1 true WO2016201819A1 (en) 2016-12-22

Family

ID=54120526

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/090707 WO2016201819A1 (en) 2015-06-19 2015-09-25 Method and apparatus for detecting malicious file

Country Status (2)

Country Link
CN (1) CN104933363B (en)
WO (1) WO2016201819A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177596A (en) * 2019-12-25 2020-05-19 微梦创科网络科技(中国)有限公司 URL (Uniform resource locator) request classification method and device based on LSTM (least Square TM) model
CN111614575A (en) * 2020-04-01 2020-09-01 宜通世纪科技股份有限公司 Deep packet inspection method, system and storage medium based on internet flow
CN111898046A (en) * 2020-07-16 2020-11-06 北京天空卫士网络安全技术有限公司 Redirection management method and device
CN113051565A (en) * 2021-03-16 2021-06-29 深信服科技股份有限公司 Malicious script detection method and device, equipment and storage medium
CN113312549A (en) * 2021-05-25 2021-08-27 北京天空卫士网络安全技术有限公司 Domain name processing method and device
CN117640259A (en) * 2024-01-25 2024-03-01 武汉思普崚技术有限公司 Script step-by-step detection method and device, electronic equipment and medium
CN117828382A (en) * 2024-02-26 2024-04-05 闪捷信息科技有限公司 Network interface clustering method and device based on URL

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933363B (en) * 2015-06-19 2018-09-11 安一恒通(北京)科技有限公司 The method and apparatus for detecting malicious file
CN107665164A (en) * 2016-07-29 2018-02-06 百度在线网络技术(北京)有限公司 Secure data detection method and device
CN106612283B (en) * 2016-12-29 2020-02-28 北京奇虎科技有限公司 Method and device for identifying source of downloaded file
CN107301334B (en) * 2017-06-28 2020-03-17 Oppo广东移动通信有限公司 Payment application program downloading protection method and device and mobile terminal
CN107563201B (en) * 2017-09-08 2021-01-29 北京奇宝科技有限公司 Associated sample searching method and device based on machine learning and server
CN109670163B (en) * 2017-10-17 2023-03-28 阿里巴巴集团控股有限公司 Information identification method, information recommendation method, template construction method and computing device
CN108040069A (en) * 2017-12-28 2018-05-15 成都数成科技有限公司 A kind of quick method for opening network data APMB package
CN110245330B (en) * 2018-03-09 2023-07-07 腾讯科技(深圳)有限公司 Character sequence matching method, preprocessing method and device for realizing matching
CN108549679B (en) * 2018-04-03 2022-03-25 国家计算机网络与信息安全管理中心 File extension fast matching method and device for URL analysis system
CN116827677A (en) * 2019-04-16 2023-09-29 北京嘀嘀无限科技发展有限公司 System and method for detecting anomalies
CN111046938B (en) * 2019-12-06 2020-12-01 邑客得(上海)信息技术有限公司 Network traffic classification and identification method and equipment based on character string multi-mode matching
CN116149669B (en) * 2023-04-14 2023-07-18 杭州安恒信息技术股份有限公司 Binary file-based software component analysis method, binary file-based software component analysis device and binary file-based medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819586A (en) * 2012-07-31 2012-12-12 北京网康科技有限公司 Uniform Resource Locator (URL) classifying method and equipment based on cache
CN103761478A (en) * 2014-01-07 2014-04-30 北京奇虎科技有限公司 Judging method and device of malicious files
US9027128B1 (en) * 2013-02-07 2015-05-05 Trend Micro Incorporated Automatic identification of malicious budget codes and compromised websites that are employed in phishing attacks
CN104933363A (en) * 2015-06-19 2015-09-23 安一恒通(北京)科技有限公司 Method and device for detecting malicious file

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104333558B (en) * 2014-11-17 2018-02-23 广州华多网络科技有限公司 A kind of network address detection method and network address detection means

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177596B (en) * 2019-12-25 2023-08-25 微梦创科网络科技(中国)有限公司 URL request classification method and device based on LSTM model
CN111177596A (en) * 2019-12-25 2020-05-19 微梦创科网络科技(中国)有限公司 URL (Uniform resource locator) request classification method and device based on LSTM (least Square TM) model
CN111614575A (en) * 2020-04-01 2020-09-01 宜通世纪科技股份有限公司 Deep packet inspection method, system and storage medium based on internet flow
CN111898046B (en) * 2020-07-16 2024-02-13 北京天空卫士网络安全技术有限公司 Method and device for redirection management
CN111898046A (en) * 2020-07-16 2020-11-06 北京天空卫士网络安全技术有限公司 Redirection management method and device
CN113051565A (en) * 2021-03-16 2021-06-29 深信服科技股份有限公司 Malicious script detection method and device, equipment and storage medium
CN113051565B (en) * 2021-03-16 2024-05-28 深信服科技股份有限公司 Malicious script detection method and device, equipment and storage medium
CN113312549B (en) * 2021-05-25 2024-01-26 北京天空卫士网络安全技术有限公司 Domain name processing method and device
CN113312549A (en) * 2021-05-25 2021-08-27 北京天空卫士网络安全技术有限公司 Domain name processing method and device
CN117640259A (en) * 2024-01-25 2024-03-01 武汉思普崚技术有限公司 Script step-by-step detection method and device, electronic equipment and medium
CN117640259B (en) * 2024-01-25 2024-06-04 武汉思普崚技术有限公司 Script step-by-step detection method and device, electronic equipment and medium
CN117828382A (en) * 2024-02-26 2024-04-05 闪捷信息科技有限公司 Network interface clustering method and device based on URL
CN117828382B (en) * 2024-02-26 2024-05-10 闪捷信息科技有限公司 Network interface clustering method and device based on URL

Also Published As

Publication number Publication date
CN104933363B (en) 2018-09-11
CN104933363A (en) 2015-09-23

Similar Documents

Publication Publication Date Title
WO2016201819A1 (en) Method and apparatus for detecting malicious file
US9614862B2 (en) System and method for webpage analysis
US10785246B2 (en) Mining attack vectors for black-box security testing
US10491618B2 (en) Method and apparatus for website scanning
US9304979B2 (en) Authorized syndicated descriptions of linked web content displayed with links in user-generated content
US10216848B2 (en) Method and system for recommending cloud websites based on terminal access statistics
WO2019085474A1 (en) Calculation engine implementing method, electronic device, and storage medium
WO2013044744A1 (en) Download resource providing method and device
US20140096241A1 (en) Cloud-assisted method and service for application security verification
US10706032B2 (en) Unsolicited bulk email detection using URL tree hashes
US9954880B2 (en) Protection via webpage manipulation
US11604843B2 (en) Method and system for generating phrase blacklist to prevent certain content from appearing in a search result in response to search queries
US9355250B2 (en) Method and system for rapidly scanning files
CN107239701B (en) Method and device for identifying malicious website
WO2015081848A1 (en) Socialized extended search method and corresponding device and system
CN107463844B (en) WEB Trojan horse detection method and system
WO2015109928A1 (en) Method, device and system for loading recommendation information and detecting url
BR112016010052B1 (en) PAGE OPERATION PROCESSING METHOD AND APPLIANCE, AND TERMINAL
US11036479B2 (en) Devices, systems, and methods of program identification, isolation, and profile attachment
US20210176274A1 (en) System and method for blocking phishing attempts in computer networks
CN107786529B (en) Website detection method, device and system
US9398041B2 (en) Identifying stored vulnerabilities in a web service
CN104361094A (en) Storage method and device for file in search result, and browser client
WO2017054731A1 (en) Method and device for processing hijacked browser
US20130339158A1 (en) Determining legitimate and malicious advertisements using advertising delivery sequences

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15895390

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11/04/2018)

122 Ep: pct application non-entry in european phase

Ref document number: 15895390

Country of ref document: EP

Kind code of ref document: A1