WO2016201819A1

WO2016201819A1 - Method and apparatus for detecting malicious file

Info

Publication number: WO2016201819A1
Application number: PCT/CN2015/090707
Authority: WO
Inventors: 熊蜀光; 冯侦探; 曹德强; 周晓波; 耿志峰; 白军辉
Original assignee: 安一恒通（北京）科技有限公司
Priority date: 2015-06-19
Filing date: 2015-09-25
Publication date: 2016-12-22
Also published as: CN104933363B; CN104933363A

Abstract

Disclosed are a method and apparatus for detecting a malicious file. One particular embodiment of the method comprises: acquiring a uniform resource locator (URL) for downloading a file to be detected; matching character strings contained in the URL of the file to be detected with character strings in a pre-set model; and determining whether the file to be detected is a malicious file based on a longest character string matched in the pre-set model by the URL of the file to be detected. According to the embodiment, the efficiency of detecting the malicious file can be improved.

Description

Method and apparatus for detecting malicious files

Cross-reference to related applications

The present application claims the priority of the Chinese Patent Application No. 201510346583.8, the entire disclosure of which is incorporated herein by reference.

Technical field

The present application relates to the field of computer technologies, specifically to the field of network information security technologies, and in particular to a method and apparatus for detecting malicious files.

Background

When files are downloaded from the Internet, some download links masquerade as legitimate links but actually point to malicious files. Such malicious files, for example files containing programs that can perform malicious tasks on a computer system, viruses, worms, or Trojan horses, are downloaded to the user's computer and may compromise the information security of the network user.

At present, in the static detection methods used by most antivirus applications, attributes of the file to be downloaded or features of its content are usually extracted first, and these features are then matched against a pre-trained model to determine whether the file is a malicious file. These methods must first obtain the relevant features of the file. For files that do not exhibit obvious malicious characteristics, they cannot determine whether the file is malicious, and the identification efficiency is low.

Summary of the invention

The purpose of the present application is to propose an improved method and apparatus for detecting malicious files to solve the technical problems mentioned in the background art above.

In one aspect, the present application provides a method for detecting a malicious file, the method comprising: obtaining a uniform resource locator (URL) for downloading a file to be detected; matching a character string included in the URL of the file to be detected with character strings in a preset model; and determining whether the file to be detected is a malicious file based on the longest character string matched in the preset model by the URL of the file to be detected.

In some embodiments, the preset model includes a dictionary tree generated by training with URL samples of known malicious files and known non-malicious files.

In some embodiments, in the dictionary tree: each edge corresponds to a character string; each path from the root node corresponds to a character string formed by splicing, in order, the character strings corresponding to the edges in the path; and each node stores the number or ratio of non-malicious files and malicious files that satisfy a path matching condition, wherein the path matching condition comprises that the character string corresponding to the path from the root node to the node is a prefix of the URL of the file.

In some embodiments, determining whether the file to be detected is a malicious file based on the longest character string matched in the preset model by the URL of the file to be detected comprises: acquiring, according to the path matching condition, the node reached by the longest character string in the preset model that matches the URL; reading the number or ratio recorded at the node reached by the longest character string; and determining whether the file to be detected is a malicious file based on the number or ratio.

In some embodiments, determining whether the file to be detected is a malicious file according to the number or ratio comprises: acquiring the ratio of malicious files to non-malicious files among all paths passing through the node reached by the longest character string, or calculating that ratio from the recorded numbers; determining whether the ratio is greater than a preset threshold; when the ratio is greater than the preset threshold, determining that the file to be detected is a malicious file; and when the ratio is not greater than the preset threshold, determining that the file to be detected is a non-malicious file.

In some embodiments, the dictionary tree is generated by training on a sample set as follows: string matching is performed on the URLs included in the sample set, and all common prefix strings of the URLs in the sample set are obtained according to the matching result; each edge of the dictionary tree corresponds to a common prefix string, each path from the root node corresponds to a character string formed by splicing, in order, the common prefix strings corresponding to the edges in the path, and each path from the root node to a terminal node corresponds to a URL; and each node of the dictionary tree stores the number or ratio of non-malicious files and malicious files that satisfy the path matching condition, wherein the path matching condition comprises that the character string corresponding to the path from the root node to the node is a prefix of the URL of the file.

In some embodiments, the method further comprises: updating the preset model according to a result of determining whether the file to be detected is a malicious file.

In another aspect, the present application provides an apparatus for detecting a malicious file, the apparatus comprising: an obtaining module configured to obtain a uniform resource locator (URL) for downloading a file to be detected; a matching module configured to match a character string included in the URL of the file to be detected with character strings in a preset model; and a determining module configured to determine whether the file to be detected is a malicious file based on the longest character string matched in the preset model by the URL of the file to be detected.

In some embodiments, in the dictionary tree: each edge corresponds to a character string; each path from the root node corresponds to a character string formed by splicing, in order, the character strings corresponding to the edges in the path; and each node stores the number or ratio of non-malicious files and malicious files that satisfy the path matching condition, wherein the path matching condition comprises that the character string corresponding to the path from the root node to the node is a prefix of the URL of the file.

In some embodiments, the determining module includes: an acquiring unit configured to acquire, according to the path matching condition, the node reached by the longest character string in the preset model that matches the URL; a reading unit configured to read the number or ratio recorded at the node reached by the longest character string; and a determining unit configured to determine, according to the number or ratio, whether the file to be detected is a malicious file.

In some embodiments, the determining unit includes: a ratio obtaining subunit configured to acquire the ratio of malicious files to non-malicious files among all paths passing through the node reached by the longest character string, or to calculate that ratio from the recorded numbers; and a determining subunit configured to determine whether the ratio is greater than a preset threshold, to determine that the file to be detected is a malicious file when the ratio is greater than the preset threshold, and to determine that the file to be detected is a non-malicious file when the ratio is not greater than the preset threshold.

In some embodiments, the apparatus further includes a dictionary tree generation module, the dictionary tree generation module including: a string matching unit configured to perform string matching on the URLs included in the sample set and to obtain, according to the matching result, all common prefix strings of the URLs included in the sample set; and a dictionary tree generating unit configured to make each edge of the dictionary tree correspond to a common prefix string, such that each path from the root node corresponds to a character string formed by splicing, in order, the common prefix strings corresponding to the edges in the path, and each path from the root node to a terminal node corresponds to a URL, and to store in each node of the dictionary tree the number or ratio of non-malicious files and malicious files that satisfy the path matching condition, wherein the path matching condition comprises that the character string corresponding to the path from the root node to the node is a prefix of the URL of the file.

In some embodiments, the apparatus further includes an update module configured to update the preset model based on a result of determining whether the file to be detected is a malicious file.

The method and apparatus for detecting a malicious file provided by the present application obtain the uniform resource locator (URL) of a file to be detected, match the character string included in that URL with the character strings in a preset model, and determine whether the file to be detected is a malicious file based on the longest matched character string. No other information about the file to be detected needs to be obtained, which improves the efficiency of identifying malicious files.

Drawings

Other features, objects, and advantages of the present application will become more apparent from the following detailed description made with reference to the accompanying drawings.

FIG. 1 is a flow diagram of one embodiment of a method of detecting a malicious file in accordance with the present application;

FIG. 2 is a schematic diagram of a dictionary tree of a preset model according to the present application;

FIG. 3a is a schematic diagram of another dictionary tree of a preset model in accordance with the present application;

FIG. 3b is a schematic diagram of an updated version of the dictionary tree shown in FIG. 3a;

FIG. 4 is a schematic diagram of an application scenario of a method for detecting a malicious file according to the present application;

FIG. 5 is a block diagram showing an embodiment of an apparatus for detecting a malicious file according to the present application.

Detailed description

The present application will be further described in detail below with reference to the accompanying drawings and embodiments. It is understood that the specific embodiments described herein are merely illustrative of the invention and do not limit it. It should also be noted that, for convenience of description, only the parts related to the invention are shown in the drawings.

It should be noted that the embodiments in the present application and the features in the embodiments may be combined with each other without conflict. The present application will be described in detail below with reference to the accompanying drawings.

Please refer to FIG. 1, which illustrates a flow 100 of one embodiment of a method of detecting a malicious file. This embodiment is mainly applied to various electronic devices that support download applications and/or browser applications, including but not limited to smart phones, smart watches, tablets, personal digital assistants, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like. The method for detecting malicious files includes the following steps:

Step 101: Obtain a URL for downloading a file to be detected.

In this embodiment, the electronic device may first obtain, in response to a user's request to download a file from the network, a URL (Uniform Resource Locator) for downloading the file to be detected, where the file to be detected may be the file that the user has requested to download from the network.

A uniform resource locator (URL) is a compact representation of the location and access method of a resource available on the Internet, and is the address of a standard resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating the location of the file and how a browser or download application should handle it. A basic URL contains the scheme (or protocol), the server name (or IP address), the path, and the file name. A URL can be represented by a string of letters, numbers, and symbols, for example: http://www.sohu.com/.
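The components listed above can be illustrated with a minimal Python sketch that uses the standard library's `urllib.parse.urlsplit`. The helper name `url_components` and the sample URL `http://www.ok.com/files/ok.exe` are assumptions made for illustration and do not appear in the application.

```python
from urllib.parse import urlsplit
import posixpath


def url_components(url):
    """Split a URL into scheme, server name, directory path and file name."""
    parts = urlsplit(url)
    # The path component uses "/" separators, so posixpath can split off the file name.
    directory, file_name = posixpath.split(parts.path)
    return {
        "scheme": parts.scheme,
        "server": parts.hostname,
        "path": directory,
        "file": file_name,
    }


# Hypothetical download URL, used only to show the four components.
info = url_components("http://www.ok.com/files/ok.exe")
```

Note that the sample URLs used later in this description omit the scheme, so the method itself treats the URL simply as a character string rather than as parsed components.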

When downloading a file from a server, the user can issue a request to download the file by clicking a hyperlink or a download address on a page displayed by the browser, or by clicking a hyperlink or entering a download address in a download application. If the download address of the file is known, the electronic device can obtain the download address directly, and the download address can be regarded as the URL of the file to be detected. If the user clicks a hyperlink to the file, the electronic device can obtain the URL associated with the hyperlink through the browser or the download application, and that URL is the URL of the file to be detected.

Step 102: Match the character string included in the URL of the file to be detected with the character string in the preset model.

In this embodiment, the electronic device may then match the character string included in the URL of the file to be detected with the character strings in the preset model. The preset model may include character strings corresponding to the URLs of a plurality of known malicious files and character strings corresponding to the URLs of known non-malicious files. In some implementations, the URLs of the malicious files and non-malicious files may be collected manually. In other implementations, the electronic device may first fetch files from multiple download sites and save their URLs, and then scan the files with a predetermined anti-virus engine (for example, Dr.Web or Kaspersky) to determine whether they are malicious or non-malicious, thereby obtaining the URLs of a plurality of known malicious files and of known non-malicious files. In practice, the electronic device may obtain these URLs in any other feasible manner, which is not limited in this application.

The electronic device may save each URL in the preset model separately (one URL per storage address), or may use string matching to pre-store the character strings of the URLs in a tree structure (for example, a dictionary tree). Correspondingly, the electronic device may match the URL of the file to be detected against the URLs of the preset model one by one, or may match it against the character strings held in the tree structure in units of one character or several characters. String matching proceeds in order from the beginning of the strings, and two strings match over the leading positions at which they have the same characters. If the first character of the URL of the file to be detected differs from the first character of every URL string in the preset model, the character string included in the URL of the file to be detected does not match any string in the preset model.

As an example, the electronic device may save the character strings of the URLs in the form of a dictionary tree as shown in FIG. 2. A dictionary tree, also known as a word search tree (trie), can sort and store a large number of strings (though it is not limited to strings). Its advantage is that it uses the common prefixes of strings to reduce query time and minimize unnecessary string comparisons, thereby improving query efficiency. If a string consists of the consecutive leading characters of another string, the former is a prefix of the latter: for example, "ac" is a prefix of "acm", and "abcd" is a prefix of "abcddfasf"; in particular, "kdfa" is a prefix of the string "kdfa" itself. In the example given in FIG. 2, the four known URLs are: www.abc.com/hello.exe, www.ok.com/ok.exe, down.com/notepad.exe, and www.ok.com/malware.exe. The electronic device can obtain the common prefixes of these four URLs by string matching and store each shared character in a single node of the dictionary tree. For example, www.abc.com/hello.exe, www.ok.com/ok.exe, and www.ok.com/malware.exe share the leading characters "w", "w", "w", and ".", so for these three URLs the characters "w", "w", "w", and "." are stored in turn on the nodes of one subtree of the root node. The URL down.com/notepad.exe shares no leading characters with the other three URLs, so its characters are stored in turn on the nodes of another subtree of the root node. By analogy, matching continues for the three URLs www.abc.com/hello.exe, www.ok.com/ok.exe, and www.ok.com/malware.exe, and wherever their characters differ, separate child nodes are created.
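The character-level dictionary tree of FIG. 2 can be sketched as nested Python dictionaries, one node per character. This is a minimal illustration under stated assumptions, not the implementation of the application: the names `build_char_trie` and `count_nodes` and the `"$end"` marker key are hypothetical.

```python
URLS = [
    "www.abc.com/hello.exe",
    "www.ok.com/ok.exe",
    "down.com/notepad.exe",
    "www.ok.com/malware.exe",
]


def build_char_trie(urls):
    """Build a trie as nested dicts; each key is one character of a URL."""
    root = {}
    for url in urls:
        node = root
        for ch in url:
            # Reuse the child if this character is already stored, else create it.
            node = node.setdefault(ch, {})
        node["$end"] = True  # hypothetical marker: a complete URL ends here
    return root


def count_nodes(node):
    """Count character nodes below `node`, ignoring the end marker."""
    return sum(1 + count_nodes(child) for key, child in node.items() if key != "$end")


trie = build_char_trie(URLS)
after_www = trie["w"]["w"]["w"]["."]  # node reached by the shared prefix "www."
```

Because shared prefixes such as "www." and "ok.com/" are stored only once, the trie holds fewer character nodes than the total number of characters in the four URLs.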

Step 103: Determine whether the file to be detected is a malicious file, based on the longest string matched in the preset model by the URL of the file to be detected.

In this embodiment, the electronic device may then determine whether the file to be detected is a malicious file based on the longest character string matched in the preset model by the URL of the file to be detected.

The longest character string matched in the preset model by the URL of the file to be detected may be the string that shares the most leading characters with that URL. For example, suppose the preset model includes four URLs: www.abc.com/hello.exe, www.ok.com/ok.exe, down.com/notepad.exe, and www.ok.com/malware.exe. When the URL of the file to be detected is www.ok.com/ok malware.exe, the URL contains strings that match strings in the preset model, and the matched string "www.ok.com/ok" is the longest string matched in the preset model. In some implementations, the URLs in the preset model are saved separately; the electronic device may match the URL of the file to be detected against the URLs in the preset model one by one and take the file type of the URL that yields the longest matching string as the type of the file to be detected. In the foregoing example, the longest string matched in the preset model is "www.ok.com/ok", and the corresponding URL is "www.ok.com/ok.exe". If the file corresponding to "www.ok.com/ok.exe" is a malicious file, the electronic device can determine that the file to be detected is a malicious file; if it is a non-malicious file, the electronic device can determine that the file to be detected is a non-malicious file.

In other implementations, the URLs in the preset model are stored in the form of a dictionary tree as shown in FIG. 2. The electronic device may match the characters of the URL of the file to be detected one by one against the characters at the nodes of the dictionary tree, and determine whether the file to be detected is a malicious file from the number or ratio of malicious files and non-malicious files corresponding to the URLs contained in the subtree of the node that stores the last matched character. In the foregoing example, for the URL "www.ok.com/ok malware.exe", the last character matched in the dictionary tree of FIG. 2 is the final "k" of "www.ok.com/ok", and the subtree of that node contains only one URL, "www.ok.com/ok.exe". If the file corresponding to "www.ok.com/ok.exe" is non-malicious, the electronic device can decide from the numbers of malicious and non-malicious files in that subtree: for example, the proportion of malicious files in the total number of files is 0 / (1 + 0) = 0, so the file to be detected is determined to be a non-malicious file. The electronic device can equally decide from the ratio of malicious files to non-malicious files in that subtree: the ratio is 0:1 = 0, so the file to be detected is a non-malicious file.

In practice, the electronic device may preset a threshold for the ratio of malicious files to non-malicious files (for example, 100:1). When the ratio of malicious files to non-malicious files is greater than the threshold, the file to be detected is determined to be a malicious file; otherwise, it is determined to be a non-malicious file. The threshold may be set manually from experience, or may be determined from the judgment accuracy (for example, 99%) achieved on a verification sample set of the preset model. Optionally, the electronic device may instead preset a threshold for the ratio of non-malicious files to malicious files and determine that the file to be detected is a malicious file when the ratio is less than that threshold. This application does not limit the choice.
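Both decision variants described above can be sketched in a few lines of Python. The sketch relies on one observation: the URLs in the subtree of the node reached by a matched prefix are exactly the sample URLs that start with that prefix, so no explicit tree is needed here. The function names are hypothetical, and the sample labels assume that only www.ok.com/malware.exe is malicious, as in the FIG. 3a example. The URL of the file to be detected is reproduced as printed in this text.

```python
# Sample URL -> True if the file is known to be malicious (assumed labels).
SAMPLES = {
    "www.abc.com/hello.exe": False,
    "www.ok.com/ok.exe": False,
    "down.com/notepad.exe": False,
    "www.ok.com/malware.exe": True,
}


def common_prefix_len(a, b):
    """Number of leading positions at which a and b hold the same character."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_match(url, samples):
    """Longest prefix of `url` that is also a prefix of some sample URL."""
    best = max(common_prefix_len(url, s) for s in samples)
    return url[:best]


def malicious_share(prefix, samples):
    """Proportion of malicious URLs among the samples that start with `prefix`."""
    labels = [bad for s, bad in samples.items() if s.startswith(prefix)]
    return sum(labels) / len(labels)


target = "www.ok.com/ok malware.exe"
prefix = longest_match(target, SAMPLES)
# Variant 1: take the label of the sample URL with the longest match.
best_url = max(SAMPLES, key=lambda s: common_prefix_len(target, s))
# Variant 2: use the counts in the subtree below the last matched character.
share = malicious_share(prefix, SAMPLES)
```

Under these assumptions `prefix` is "www.ok.com/ok", `best_url` is "www.ok.com/ok.exe", and `share` is 0 / (1 + 0) = 0, which agrees with the worked example in the text.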

In an optional implementation of this embodiment, when the URLs in the preset model are stored in a dictionary tree, then in order to save storage resources and improve matching efficiency, each edge in the dictionary tree may correspond to a character string; each path from the root node corresponds to a character string formed by splicing, in order, the strings corresponding to the edges in the path; and each node stores the number or ratio of non-malicious files and malicious files that satisfy the path matching condition. The path matching condition may include: the string corresponding to the path from the root node to the node is a prefix of the URL of the file. Optionally, the string corresponding to an edge of the dictionary tree may be recorded at the node at the far end of that edge.

As shown in FIG. 3a, the URLs of four known files include the URL of one malicious file, "www.ok.com/malware.exe", and the URLs of three non-malicious files, "www.abc.com/hello.exe", "www.ok.com/ok.exe", and "down.com/notepad.exe". The electronic device can record at the root node 3000 that the numbers of non-malicious files and malicious files are 3 and 1, respectively. Following the string matching method described above, the URL "down.com/notepad.exe" has no common prefix with the other three URLs, so the string "down.com/notepad.exe" corresponds to one edge 3010 of the root node, and the numbers of non-malicious files and malicious files recorded at node 3001 at the other end of that edge are 1 and 0, respectively. The URLs "www.ok.com/malware.exe", "www.abc.com/hello.exe", and "www.ok.com/ok.exe" share the prefix string "www.", so in the dictionary tree another edge 3020 of the root node corresponds to this common prefix "www.", and node 3002 at the other end of that edge records the numbers of non-malicious files and malicious files as 2 and 1, respectively.

Next, the URL "www.abc.com/hello.exe" differs from the other two URLs in the characters that follow "www.", so an edge 3030 connected to the common node 3002 corresponds to the string "abc.com/hello.exe", and node 3003 at the other end of edge 3030 records the numbers of malicious files and non-malicious files as 0 and 1, respectively. Another edge 3040 connected to node 3002 corresponds to the common string "ok.com/" of the other two URLs, and node 3004 at the other end of edge 3040 records the numbers of malicious files and non-malicious files as 1 and 1, respectively. Edge 3050 then corresponds to the string "malware.exe", and its node 3005 records the numbers of malicious files and non-malicious files as 1 and 0, respectively; similarly, edge 3060 corresponds to the string "ok.exe", and its node 3006 records the numbers of malicious files and non-malicious files as 0 and 1, respectively. By analogy, the characters contained in the URLs of all known malicious and non-malicious files in the sample set are stored in the dictionary tree. Optionally, the string corresponding to an edge may be stored at the node that the edge reaches; for example, the string corresponding to edge 3020 may be stored at node 3002. Optionally, the ratio of non-malicious files to malicious files that satisfy the path matching condition can be recorded at each node instead; for example, the ratio recorded at the root node 3000 is 3:1.
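A tree of the kind shown in FIG. 3a, with one string per edge and counts at every node, can be built incrementally by splitting an edge whenever a new URL shares only part of it. The following is a minimal sketch under stated assumptions; the class `Node` and the functions `insert` and `common_prefix_len` are hypothetical names, and the splitting strategy is one common way to build such a compressed tree, not necessarily the procedure used in the application.

```python
class Node:
    def __init__(self):
        self.edges = {}     # edge string -> child Node
        self.benign = 0     # non-malicious URLs having this path as a prefix
        self.malicious = 0  # malicious URLs having this path as a prefix

    def bump(self, is_malicious):
        if is_malicious:
            self.malicious += 1
        else:
            self.benign += 1


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def insert(root, url, is_malicious):
    """Add one labelled URL, updating the counts of every node on its path."""
    node, rest = root, url
    node.bump(is_malicious)
    while rest:
        for label, child in list(node.edges.items()):
            k = common_prefix_len(label, rest)
            if k == 0:
                continue
            if k < len(label):
                # Split the edge: node --label[:k]--> mid --label[k:]--> child.
                mid = Node()
                mid.benign, mid.malicious = child.benign, child.malicious
                mid.edges[label[k:]] = child
                del node.edges[label]
                node.edges[label[:k]] = mid
                child = mid
            child.bump(is_malicious)
            node, rest = child, rest[k:]
            break
        else:
            # No edge shares a prefix with the remainder: add a new leaf.
            leaf = Node()
            leaf.bump(is_malicious)
            node.edges[rest] = leaf
            return


root = Node()
insert(root, "www.abc.com/hello.exe", False)
insert(root, "www.ok.com/ok.exe", False)
insert(root, "down.com/notepad.exe", False)
insert(root, "www.ok.com/malware.exe", True)
```

After these four insertions the root has the edges "www." and "down.com/notepad.exe", and the counts at the nodes reached by "www." and "www.ok.com/" are (2 non-malicious, 1 malicious) and (1, 1), matching the description of nodes 3002 and 3004.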

In some implementations of this embodiment, when the URLs in the preset model are stored in the form of a dictionary tree as shown in FIG. 3a, the electronic device may first obtain, according to the path matching condition, the node reached by the longest string in the preset model that matches the URL of the file to be detected; then read the number or ratio recorded at that node; and then determine, based on that number or ratio, whether the file to be detected is a malicious file. Optionally, the electronic device may directly read the ratio of malicious files to non-malicious files among all paths passing through the node reached by the longest matched string, or may calculate that ratio from the numbers recorded at the node. It then determines whether the ratio is greater than the preset threshold: when the ratio is greater than the preset threshold, the file to be detected is determined to be a malicious file; when the ratio is not greater than the preset threshold, the file to be detected is determined to be a non-malicious file. The threshold may be set manually from experience, or may be determined from the judgment accuracy achieved on a verification sample set of the preset model. In some cases, the number of non-malicious files may be 0. When calculating the ratio of malicious files to non-malicious files in such a case, the number of non-malicious files may be taken as the smallest non-zero fraction that the electronic device can represent, such as 0.0000001, or the ratio of malicious files to non-malicious files may be taken as the maximum value that the electronic device can represent, such as 99999999.

Those skilled in the art will understand that when the ratio of non-malicious files to malicious files is recorded in the dictionary tree instead, the above method of determining from the ratio whether the file to be detected is a malicious file applies equally.
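The threshold comparison, including the case where no non-malicious file is recorded, can be sketched as follows. The function names are hypothetical. As a labelled substitution, the sketch uses `math.inf` in place of the "maximum value the device can represent" mentioned above, and it treats a node with no recorded files at all as having ratio 0, which is an assumption not stated in the text. The default threshold of 100 corresponds to the 100:1 example.

```python
import math


def malicious_ratio(malicious, benign):
    """Ratio of malicious to non-malicious counts recorded at a node."""
    if benign == 0:
        # Zero denominator: use +inf instead of a large constant such as 99999999.
        # Assumption: a node with no files at all is treated as ratio 0.
        return math.inf if malicious > 0 else 0.0
    return malicious / benign


def is_malicious(malicious, benign, threshold=100.0):
    """True only when the ratio is strictly greater than the preset threshold."""
    return malicious_ratio(malicious, benign) > threshold
```

With these definitions, `is_malicious(1, 1)` is false, `is_malicious(5, 0)` is true, and a ratio of exactly 100:1 is still classified as non-malicious because the comparison is strict.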

As an example, if the electronic device uses the dictionary tree shown in FIG. 3a as the preset model, the URL of the file to be detected may be matched by the following procedure. Assume that the electronic device obtains "www.ok.com/ok malware.exe" as the URL for downloading the file to be detected. The electronic device then matches the character string of this URL against the strings of the dictionary tree of FIG. 3a. First, the electronic device matches the string "www." corresponding to edge 3020 and reaches node 3002. Next, it matches the string "ok.com/" corresponding to edge 3040 and reaches node 3004. It then compares the remaining string "ok malware.exe" with the string "malware.exe" corresponding to edge 3050 and the string "ok.exe" corresponding to edge 3060, and neither matches. The electronic device can therefore determine that the longest string in the dictionary tree of FIG. 3a that matches the URL "www.ok.com/ok malware.exe" is "www.ok.com/", the concatenation of the strings corresponding to edges 3020 and 3040, and that the farthest node reached by this longest string is node 3004. The electronic device can then read the numbers of malicious files and non-malicious files recorded at node 3004, which are 1 and 1, respectively, and calculate that the ratio of malicious files to non-malicious files among the URLs passing through node 3004 is 1:1. If the threshold preset by the electronic device for the ratio of malicious files to non-malicious files is 100:1, the ratio at node 3004 is less than the preset threshold, and the electronic device can determine that the file to be detected is a non-malicious file.
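The walk-through above can be reproduced with a short Python sketch. The tree of FIG. 3a is written out by hand as nested dictionaries, using the node numbers from the text; the leaf counts are derived from the sample labels, with www.ok.com/malware.exe as the only malicious URL. The helper names `node` and `deepest_node` are hypothetical, and an edge is followed only when its whole string matches, as in the procedure described. The division assumes a non-zero non-malicious count, which holds at node 3004.

```python
def node(node_id, malicious, benign, edges=None):
    return {"id": node_id, "malicious": malicious, "benign": benign, "edges": edges or {}}


# Hand-written copy of the FIG. 3a tree: edge string -> child node.
FIG_3A = node(3000, 1, 3, {
    "down.com/notepad.exe": node(3001, 0, 1),
    "www.": node(3002, 1, 2, {
        "abc.com/hello.exe": node(3003, 0, 1),
        "ok.com/": node(3004, 1, 1, {
            "malware.exe": node(3005, 1, 0),
            "ok.exe": node(3006, 0, 1),
        }),
    }),
})


def deepest_node(root, url):
    """Follow edges whose whole string is a prefix of the unmatched part of the URL."""
    current, matched = root, ""
    while True:
        for label, child in current["edges"].items():
            if url.startswith(label, len(matched)):
                current, matched = child, matched + label
                break
        else:
            # No outgoing edge matches: this is the farthest node reached.
            return current, matched


reached, matched = deepest_node(FIG_3A, "www.ok.com/ok malware.exe")
ratio = reached["malicious"] / reached["benign"]
verdict = "malicious" if ratio > 100 else "non-malicious"
```

Here the search stops at node 3004 with the matched string "www.ok.com/", the ratio is 1:1, and the verdict under the 100:1 threshold is non-malicious, as in the text.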

In some implementations of this embodiment, after determining that the file to be detected is a malicious file or a non-malicious file, the electronic device may further update the preset model according to the result. In other words, the electronic device can store the URL of the file to be detected in the preset model as that of a known malicious or non-malicious file and update the related content of the preset model. For example, in the above example in which the dictionary tree of FIG. 3a is used as the preset model, the electronic device determines from the URL "www.ok.com/ok malware.exe" that the file to be detected is a non-malicious file. The electronic device can then update the dictionary tree of FIG. 3a with the URL "www.ok.com/ok malware.exe" as a known sample and obtain the updated dictionary tree shown in FIG. 3b. In FIG. 3b, the dictionary tree has new nodes 3007 and 3008; the string corresponding to edge 3060 is updated to "ok", the common string of "ok malware.exe" and "ok.exe"; the string corresponding to edge 3070 is " malware.exe"; and the string corresponding to edge 3080 is ".exe". Because the number of non-malicious files on the corresponding path increases by 1, the data of each node on that path is also updated: for example, the number of non-malicious files at node 3000 is updated to 4, the number of non-malicious files at node 3002 is updated to 3, and so on.

In some implementations of this embodiment, the electronic device may generate the dictionary tree by training on a sample set consisting of the URLs of known malicious files and the URLs of known non-malicious files, as follows: string matching is performed on the URLs included in the sample set, and all common prefix strings of those URLs are obtained according to the matching result; each edge of the dictionary tree corresponds to a common prefix string, each path from the root node corresponds to a character string formed by splicing, in order, the common prefix strings corresponding to the edges in the path, and each path from the root node to a terminal node corresponds to a URL; and each node in the dictionary tree stores the number or ratio of non-malicious files and malicious files that satisfy the path matching condition. The path matching condition may include: the string corresponding to the path from the root node to the node is a prefix of the URL of the file. Here, a common prefix string may be a part of the common prefix shared by several URLs, such as the string "ok.com/" corresponding to edge 3040 in FIG. 3a in the above example; or it may be the part of a URL that matches no other URL, such as the string "ok.exe" corresponding to edge 3060 or the string "down.com/notepad.exe" corresponding to edge 3010 in FIG. 3a.

An application scenario of this embodiment may be a process for detecting malicious files (an antivirus process) on an electronic device on which an antivirus application is installed, where the antivirus application includes a pre-trained preset model. As shown in FIG. 4, as indicated by reference numeral 401, the user downloads a file through the electronic device by clicking on the hyperlink or download address corresponding to the file to be downloaded. At this time, the antivirus application on the electronic device treats the file to be downloaded by the user as the file to be detected, and obtains the download address (URL) of the file to be detected or the URL associated with the hyperlink, as indicated by reference numeral 402. Next, as indicated by reference numeral 403, the antivirus application matches the string contained in the URL with the strings in the preset model. Then, as indicated by reference numeral 404, the antivirus application determines whether the file to be detected is a malicious file based on the longest string matched in the preset model by the URL of the file to be detected. If the file to be detected is a malicious file, as indicated by reference numeral 405, the antivirus application prompts the user that the file to be downloaded is a malicious file, or refuses to connect to the corresponding website. Otherwise, the electronic device downloads the file normally. In this embodiment, the URL of the file to be detected is used to determine whether the file is a malicious file, which improves the efficiency of identifying malicious files.
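The flow of FIG. 4 can be outlined in a few lines of Python. The function `handle_download`, its return values, and the stand-in `toy_classifier` are hypothetical and serve only to show where the URL-based decision sits relative to the download; a real antivirus application would use a trained preset model in place of the stand-in.

```python
from typing import Callable


def handle_download(url: str, is_malicious: Callable[[str], bool]) -> str:
    """Mirror steps 402-405: classify the URL, then warn or proceed."""
    if is_malicious(url):
        # Step 405: prompt the user (or refuse the connection).
        return "warn"
    # Otherwise the download continues normally.
    return "download"


# Hypothetical stand-in for steps 403-404; a real model would be trained.
def toy_classifier(url: str) -> bool:
    return url.startswith("www.bad.example/")
```

Because the decision uses only the URL, it can be made before any part of the file has been downloaded.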

With further reference to FIG. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for detecting a malicious file. This apparatus embodiment corresponds to the method embodiment described above and can be applied to an electronic device.

As shown in FIG. 5, the apparatus 500 for detecting a malicious file includes an obtaining module 501, a matching module 502, and a determining module 503. The obtaining module 501 may be configured to obtain a uniform resource locator (URL) for downloading the file to be detected; the matching module 502 may be configured to match the character string included in the URL of the file to be detected with the character strings in the preset model; and the determining module 503 may be configured to determine whether the file to be detected is a malicious file based on the longest string matched in the preset model by the URL of the file to be detected.

In this embodiment, the obtaining module 501 of the apparatus 500 for detecting a malicious file may obtain the URL for downloading the file to be detected according to the user's request to download a file from the network. Here, the file to be detected may be the file that the user requests to download from the network.

In this embodiment, the matching module 502 can then match the character string included in the URL of the file to be detected with the character strings in the preset model. The preset model may include the character strings of the URLs of a plurality of known malicious files and non-malicious files. These character strings may be saved separately in the preset model, or may be saved in the form of a tree structure (for example, a dictionary tree). Correspondingly, the matching module 502 may match the URL of the file to be detected against the URLs in the preset model one by one, or may match it against the strings contained in the tree structure in units of one character or multiple characters.

In this embodiment, the determining module 503 may then determine whether the file to be detected is a malicious file based on the longest string matched in the preset model by the URL of the file to be detected. In some implementations, the URLs in the preset model are saved separately; the matching module 502 can match the URL of the file to be detected against the URLs in the preset model one by one, and the determining module 503 can take the file type corresponding to the known URL that shares the longest matching string with the URL of the file to be detected as the type of the file to be detected. In other implementations, the URLs in the preset model are stored in the form of a dictionary tree as shown in FIG. 2 or FIG. 3a; the matching module 502 can match the characters of the URL of the file to be detected one by one against the characters at the nodes of the dictionary tree, and the determining module 503 can determine whether the file to be detected is a malicious file according to the number or ratio of malicious files and non-malicious files corresponding to the URLs contained in the subtree below the last matched character.
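For the implementation in which the known URLs are saved separately, a minimal Python sketch is given below. It assumes that the "matching string" is the common prefix of the two URLs, and it resolves ties by keeping the first known URL encountered; both choices, and the name `classify_by_longest_match`, are assumptions made for illustration.

```python
import os


def classify_by_longest_match(url, known):
    """known: list of (known_url, label) pairs.

    Returns the label of the known URL that shares the longest common
    prefix with `url`, or None when no known URL shares any prefix.
    """
    best_label, best_len = None, 0
    for known_url, label in known:
        # os.path.commonprefix compares strings character by character.
        n = len(os.path.commonprefix([url, known_url]))
        if n > best_len:
            best_label, best_len = label, n
    return best_label
```

This variant compares the URL against every stored URL, so its cost grows with the size of the model, whereas the dictionary tree variant only follows a single path from the root.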

In some implementations of this embodiment, when the URLs in the preset model are stored in the form of a dictionary tree as shown in FIG. 2 or FIG. 3a, in the dictionary tree: each edge corresponds to a character string; each path from the root node corresponds to a string, which is formed by concatenating, in order, the strings corresponding to the edges in the path; and each node stores the number or ratio of non-malicious files and malicious files that satisfy the path matching condition. The path matching condition includes that the string corresponding to the path from the root node to the node is a prefix of the URL of the file.

In some implementations of this embodiment, the determining module may include: an obtaining unit (not shown) configured to acquire the node reached by the longest string in the preset model that matches the URL; a reading unit (not shown) configured to read the number or ratio recorded in the node reached by the longest string; and a determining unit (not shown) configured to determine, based on the quantity or ratio, whether the file to be detected is a malicious file.
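The work of the obtaining unit and the reading unit can be sketched in Python as a walk down a tree whose edges carry strings. The nested-dictionary layout and the name `longest_match` are hypothetical. The sketch also assumes that when the URL diverges partway along an edge, the node below that edge is the one whose record is read, in line with the "subtree below the last matched character" wording above.

```python
import os


def longest_match(root, url):
    """Return (matched_prefix, node) for the deepest node the URL reaches."""
    node, matched = root, 0
    while matched < len(url):
        step = None
        for label, child in node["edges"].items():
            k = len(os.path.commonprefix([label, url[matched:]]))
            if k > 0:
                # At most one edge per node can share a first character.
                step = (k, len(label), child)
                break
        if step is None:
            break  # no edge continues the match
        k, label_len, node = step
        matched += k
        if k < label_len:
            break  # the URL diverged inside this edge
    return url[:matched], node


# Hypothetical model: counts of malicious ("mal") and non-malicious ("ben") URLs.
tree = {"mal": 1, "ben": 2, "edges": {
    "www.": {"mal": 1, "ben": 2, "edges": {
        "ok.com/": {"mal": 0, "ben": 2, "edges": {}},
        "bad.com/x.exe": {"mal": 1, "ben": 0, "edges": {}},
    }},
}}
```

With this hypothetical model, the URL "www.ok.com/new.exe" reaches the node below the edge "ok.com/", whose record shows two non-malicious files and no malicious files.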

In some implementations of this embodiment, the determining unit may further include: a ratio obtaining subunit (not shown) configured to obtain the ratio of malicious files to non-malicious files in all paths of the node reached by the longest string, or to calculate that ratio according to the quantity; and a determining subunit (not shown) configured to determine whether the ratio is greater than a preset threshold, to determine that the file to be detected is a malicious file when the ratio is greater than the preset threshold, and to determine that the file to be detected is a non-malicious file when the ratio is not greater than the preset threshold.
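The threshold comparison itself is short; a Python sketch follows. The default threshold of 1.0 and the handling of a node with no non-malicious files are assumptions made for the example, since the application leaves the threshold as a preset value and does not discuss the zero-denominator case.

```python
def ratio_exceeds_threshold(malicious: int, benign: int,
                            threshold: float = 1.0) -> bool:
    """True when the malicious/non-malicious ratio is greater than threshold."""
    if benign == 0:
        # Assumption: with no non-malicious samples, any malicious sample decides.
        return malicious > 0
    return malicious / benign > threshold
```

A ratio exactly equal to the threshold is treated as non-malicious, following the "not greater than" wording above.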

In some implementations of this embodiment, the apparatus 500 for detecting a malicious file may further include a dictionary tree generating module, which may include: a string matching unit (not shown) configured to perform string matching on the URLs included in the sample set and to obtain all common prefix strings of those URLs according to the matching result; and a dictionary tree generating unit (not shown) configured to make each edge of the dictionary tree correspond to a common prefix string, each path from the root node correspond to a string formed by concatenating, in order, the common prefix strings corresponding to the edges in the path, and each path from the root node to a terminal node correspond to a URL, and to store, in each node of the dictionary tree, the number or ratio of non-malicious files and malicious files satisfying the path matching condition, where the path matching condition includes that the string corresponding to the path from the root node to the node is a prefix of the URL of the file.

In some implementations of this embodiment, the apparatus 500 for detecting a malicious file may further include an update module (not shown) configured to update the preset model according to the result of determining whether the file to be detected is a malicious file. After the determining module 503 determines that the file to be detected is a malicious file or a non-malicious file, the update module may store the URL of the file to be detected in the preset model as the URL of a known malicious or non-malicious file, and update the related content in the preset model accordingly.

Those skilled in the art will appreciate that the apparatus 500 for detecting malicious files described above also includes other well-known structures, such as processors and memories, which are not shown in FIG. 5 in order not to unnecessarily obscure the embodiments of the present disclosure.

The units or modules involved in the embodiments of the present application may be implemented in software or in hardware. The described modules or units may also be provided in a processor; for example, it may be described as: a processor includes an obtaining module, a matching module, and a determining module. In some cases, the names of these modules do not constitute a limitation on the modules themselves. For example, the obtaining module may also be described as "a module configured to obtain a uniform resource locator (URL) for downloading a file to be detected".

In another aspect, the present application further provides a computer readable storage medium, which may be the computer readable storage medium included in the apparatus described in the foregoing embodiments, or may be a computer readable storage medium that exists separately and is not assembled into a terminal. The computer readable storage medium stores one or more programs that are used by one or more processors to perform the method for detecting malicious files described in the present application.

The above description is only a preferred embodiment of the present application and an explanation of the technical principles applied. It should be understood by those skilled in the art that the scope of the invention referred to in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.

Claims

A method for detecting a malicious file, the method comprising:

Obtaining a uniform resource locator URL for downloading the file to be detected;

Matching a character string included in the URL of the file to be detected with a character string in a preset model;

Determining whether the file to be detected is a malicious file, based on the longest character string matched in the preset model by the URL of the file to be detected.
The method of claim 1, wherein the preset model comprises a dictionary tree generated by training on a sample set, wherein the sample set includes URLs of known malicious files and URLs of known non-malicious files.
The method of claim 2, wherein in said dictionary tree:

Each edge corresponds to a string;

Each path from the root node corresponds to a string, and the string in the path is spliced in sequence by the string corresponding to the edge in the path;

Each node stores the number or ratio of non-malicious files and malicious files satisfying the path matching condition, wherein the path matching condition includes a string corresponding to the path from the root node to the node being a prefix of the URL of the file.
The method according to claim 3, wherein the determining, based on the longest character string matched in the preset model by the URL of the file to be detected, whether the file to be detected is a malicious file comprises:

Acquiring, according to the path matching condition, a node reached by the longest character string in the preset model that matches the URL of the file to be detected;

Reading the number or ratio recorded in the node reached by the longest string;

Determining, based on the quantity or the ratio, whether the file to be detected is a malicious file.
The method of claim 4, wherein the determining, based on the quantity or the ratio, whether the file to be detected is a malicious file comprises:

Calculating, according to the quantity, a ratio of malicious files to non-malicious files in all paths of the node reached by the longest string, or obtaining the ratio of malicious files to non-malicious files in all paths of the node reached by the longest string;

Determining whether the ratio is greater than a preset threshold;

When the ratio is greater than the preset threshold, determining that the file to be detected is a malicious file;

When the ratio is not greater than the preset threshold, determining that the file to be detected is a non-malicious file.
The method according to any one of claims 2-5, wherein the dictionary tree is generated by training on the sample set by:

Performing string matching on the URLs included in the sample set, and acquiring all common prefix strings of the URLs included in the sample set according to the matching result;

Making each edge of the dictionary tree correspond to a common prefix string, each path from the root node correspond to a character string formed by concatenating, in order, the common prefix strings corresponding to the edges in the path, and each path from the root node to a terminal node correspond to a URL;

Storing, in each node of the dictionary tree, a number or a ratio of non-malicious files and malicious files satisfying a path matching condition, wherein the path matching condition includes that the string corresponding to the path from the root node to the node is a prefix of the URL of the file.
The method of any of claims 1-6, wherein the method further comprises:

Updating the preset model according to a result of determining whether the file to be detected is a malicious file.
A device for detecting a malicious file, characterized in that the device comprises:

An obtaining module, configured to obtain a uniform resource locator URL for downloading the file to be detected;

a matching module, configured to match a character string included in the URL of the file to be detected with a character string in a preset model;

and a determining module, configured to determine whether the file to be detected is a malicious file, based on the longest string matched in the preset model by the URL of the file to be detected.
The apparatus of claim 8, wherein the preset model comprises a dictionary tree generated by training on a sample set of URLs of known malicious files and known non-malicious files.
The apparatus of claim 9 wherein in said dictionary tree:

Each edge corresponds to a string;

Each path from the root node corresponds to a string, and the string in the path is spliced in sequence by the string corresponding to the edge in the path;

Each node stores the number or ratio of non-malicious files and malicious files satisfying the path matching condition, wherein the path matching condition includes a string corresponding to the path from the root node to the node being a prefix of the URL of the file.
The apparatus according to claim 10, wherein the determining module comprises:

an obtaining unit, configured to acquire a node reached by the longest character string in the preset model that matches the URL;

a reading unit, configured to read the number or ratio recorded in the node reached by the longest string;

and a determining unit, configured to determine, according to the quantity or the ratio, whether the file to be detected is a malicious file.
The apparatus according to claim 11, wherein said determining unit comprises:

a ratio obtaining subunit, configured to obtain, according to the path matching condition, a ratio of malicious files to non-malicious files in all paths of the node reached by the longest string, or to calculate, according to the quantity, the ratio of malicious files to non-malicious files in all paths of the node reached by the longest string;

a determining subunit, configured to determine whether the ratio is greater than a preset threshold;

when the ratio is greater than the preset threshold, to determine that the file to be detected is a malicious file;

and when the ratio is not greater than the preset threshold, to determine that the file to be detected is a non-malicious file.
The device according to any one of claims 9 to 12, wherein the device further comprises a dictionary tree generating module, the dictionary tree generating module comprising:

a string matching unit configured to perform string matching on the URLs included in the sample set, and obtain all common prefix strings of the URLs included in the sample set according to the matching result;

a dictionary tree generating unit, configured to make each edge of the dictionary tree correspond to a common prefix string, each path from the root node correspond to a string formed by concatenating, in order, the common prefix strings corresponding to the edges in the path, and each path from the root node to a terminal node correspond to a URL, and to store, in each node of the dictionary tree, the number or ratio of non-malicious files and malicious files satisfying the path matching condition, wherein the path matching condition comprises that the string corresponding to the path from the root node to the node is a prefix of the URL of the file.
The apparatus according to any one of claims 8-13, wherein the apparatus further comprises an update module, the update module being configured to update the preset model according to a result of determining whether the file to be detected is a malicious file.
An apparatus, comprising:

One or more processors;

A memory; and

One or more programs stored in the memory which, when executed by the one or more processors, perform:

Obtaining a uniform resource locator URL for downloading the file to be detected;

Matching a character string included in the URL of the file to be detected with a character string in a preset model;

Determining whether the file to be detected is a malicious file, based on the longest character string matched in the preset model by the URL of the file to be detected.
A non-volatile computer storage medium storing one or more programs which, when executed by a device, cause the device to perform:

Obtaining a uniform resource locator URL for downloading the file to be detected;

Matching a character string included in the URL of the file to be detected with a character string in a preset model;

Determining whether the file to be detected is a malicious file, based on the longest character string matched in the preset model by the URL of the file to be detected.