WO2018077035A1 - Malicious resource address detection method and apparatus, and storage medium - Google Patents

Malicious resource address detection method and apparatus, and storage medium

Info

Publication number
WO2018077035A1
WO2018077035A1 PCT/CN2017/105796 CN2017105796W WO2018077035A1 WO 2018077035 A1 WO2018077035 A1 WO 2018077035A1 CN 2017105796 W CN2017105796 W CN 2017105796W WO 2018077035 A1 WO2018077035 A1 WO 2018077035A1
Authority
WO
WIPO (PCT)
Prior art keywords
resource address
detected
malicious
related attribute
address
Prior art date
Application number
PCT/CN2017/105796
Other languages
English (en)
French (fr)
Inventor
林全智
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018077035A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40 Network security protocols
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Definitions

  • the embodiments of the present invention relate to the field of network security technologies, and in particular, to a malicious resource address detection method and apparatus, and a storage medium.
  • the resource address is an identifier for indicating the location of the resource stored on the network, such as a URL (Uniform Resource Locator).
  • Embodiments of the present invention provide a method and device for detecting a malicious resource address, and a storage medium.
  • a method for detecting a malicious resource address comprising:
  • a malicious resource address detecting apparatus comprising:
  • at least one memory; and
  • at least one processor; wherein
  • the at least one memory stores at least one instruction module configured to be executed by the at least one processor; wherein
  • the at least one instruction module includes:
  • a data access module configured to obtain a resource address to be detected
  • a feature extraction module configured to acquire a character feature of the to-be-detected resource address, obtain a related attribute of the to-be-detected resource address, query the related attribute in a malicious related attribute library corresponding to the related attribute type to which the related attribute belongs, generate, according to the query result, a related attribute feature corresponding to the to-be-detected resource address, and combine the character feature and the related attribute feature to obtain a multi-dimensional feature; and
  • a detecting module configured to determine, according to the multi-dimensional feature, whether the to-be-detected resource address is a malicious resource address.
  • the feature extraction module is configured to obtain a malicious resource address type, where the malicious resource address type is the type of malicious resource address to be detected when examining the to-be-detected resource address; select, from the character features and the related attribute features, features that are adapted to the malicious resource address type; and combine the selected features to obtain the multi-dimensional feature.
  • the detecting module is configured to determine, by using a machine learning classifier, whether the to-be-detected resource address is a malicious resource address;
  • the device also includes:
  • a false negative or false positive collection module configured to collect malicious resource addresses that were missed or falsely reported when the machine learning classifier determined whether the to-be-detected resource address is a malicious resource address; and
  • a malicious related attribute library update module configured to acquire a related attribute of the missed or falsely reported malicious resource address, and to update, according to that related attribute, the malicious related attribute library corresponding to the related attribute type to which the related attribute belongs.
  • the apparatus further includes:
  • a machine learning classifier updating module configured to acquire a character feature of the missed or falsely reported malicious resource address; acquire a related attribute of the missed or falsely reported malicious resource address, query the related attribute in the malicious related attribute library corresponding to the related attribute type to which the related attribute belongs, and generate, according to the query result, a related attribute feature corresponding to the missed or falsely reported malicious resource address; combine the character feature and the related attribute feature of the missed or falsely reported malicious resource address to obtain a multi-dimensional feature of that address; and update the machine learning classifier according to that multi-dimensional feature.
  • the apparatus further includes:
  • a malicious resource address library management module configured to add the to-be-detected resource address to a malicious resource address library when the to-be-detected resource address is determined to be a malicious resource address, where the malicious resource address library is used to intercept resource access requests that target malicious resource addresses in the library.
  • a method for detecting a malicious resource address is performed by a server, and the method includes:
  • a computer readable storage medium having stored therein computer readable instructions or programs, the computer readable instructions or programs being executed by a processor to perform the aforementioned malicious resource address detection method.
  • In the method and device for detecting a malicious resource address and the storage medium, the character features of the to-be-detected resource address obtained by statistics and the related attribute features obtained by querying the malicious related attribute library are combined to form a multi-dimensional feature representing the to-be-detected resource address, and whether the to-be-detected resource address is a malicious resource address is then determined according to the multi-dimensional feature.
  • Because the technical solution combines the character features and the related attribute features of the to-be-detected resource address, it can detect malicious resource addresses more effectively, and with higher accuracy, than relying only on a web crawler to crawl the resource corresponding to the to-be-detected resource address.
  • FIG. 1 is an application environment diagram of a malicious resource address detection system in an embodiment
  • FIG. 2 is a schematic diagram showing the internal structure of a server in an embodiment
  • FIG. 3 is a schematic flowchart of a method for detecting a malicious resource address in an embodiment
  • FIG. 4 is a flow chart showing the steps of combining character features and related attribute features to obtain multi-dimensional features in one embodiment
  • FIG. 5 is a schematic flowchart of a step of updating a malicious related attribute database according to a malicious resource address that is missing or falsely reported in an embodiment
  • FIG. 6 is a flow chart showing the steps of updating a machine learning classifier according to a missed or falsely reported malicious resource address in an embodiment
  • FIG. 7 is a schematic flowchart of a method for detecting a malicious resource address in a specific application scenario
  • FIG. 8 is a structural block diagram of a malicious resource address detecting apparatus in an embodiment
  • FIG. 9 is a structural block diagram of a malicious resource address detecting apparatus in another embodiment.
  • FIG. 1 is a diagram of an application environment of a malicious resource address detection system in an embodiment.
  • the malicious resource address detecting system includes a terminal 110 and a server 120.
  • the terminal 110 can be configured to send the to-be-detected resource address to the server 120.
  • The server 120 may be configured to obtain the to-be-detected resource address sent by the terminal 110; obtain a character feature of the to-be-detected resource address; query whether a related attribute of the to-be-detected resource address belongs to the corresponding malicious related attribute library to obtain the corresponding related attribute feature; combine the character feature with the related attribute feature to obtain a multi-dimensional feature; and determine, according to the multi-dimensional feature, whether the to-be-detected resource address is a malicious resource address.
  • The server 120 is further configured to feed back to the terminal 110 the detection result indicating whether the to-be-detected resource address is a malicious resource address. That is, the server 120 can be used to execute the malicious resource address detecting method, so the server 120 can also function as a malicious resource address detecting device.
  • the server includes a processor, a non-volatile storage medium, an internal memory, and a network interface connected by a system bus.
  • The non-volatile storage medium of the server stores an operating system, a database, and the instruction module of a malicious resource address detecting device (i.e., an instruction module that executes the malicious resource address detecting method).
  • the database may include a malicious related attribute library, a malicious resource address library, a non-malicious resource address library, and a preset non-malicious resource address library.
  • the instruction module in the malicious resource address detecting apparatus is used to implement a malicious resource address detecting method applicable to the server.
  • the server's processor is used to provide computing and control capabilities that support the operation of the entire server.
  • The internal memory of the server provides an operating environment for the instruction module of the malicious resource address detecting device stored in the non-volatile storage medium. The internal memory can store computer-readable instructions that, when executed by the processor, cause the processor to execute a malicious resource address detecting method.
  • the network interface of the server is configured to communicate with an external terminal through a network connection, such as receiving a resource address to be detected sent by the terminal, and feeding back a malicious resource address detection result to the terminal.
  • The server can be implemented as a stand-alone server or as a server cluster consisting of multiple servers. It will be understood by those skilled in the art that the structure shown in FIG. 2 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the server to which the solution is applied. A specific server may include more or fewer components than shown in the figure, combine some components, or have a different component arrangement.
  • FIG. 3 is a schematic flowchart of a method for detecting a malicious resource address in an embodiment. This embodiment is mainly illustrated by the method applied to the server 120 in the malicious resource address detecting system in FIG. 1 described above.
  • the method for detecting a malicious resource address specifically includes the following steps:
  • The to-be-detected resource address is a resource address that needs to be checked to determine whether it is a malicious resource address.
  • the resource address is data that identifies the location of the resource in the network, such as a URL or a URI (Uniform Resource Identifier).
  • Resources are data that can be stored and transmitted on the network, such as web pages or network files.
  • A malicious resource address is a resource address linked to a malicious resource, such as a phishing website or a scam website; for example, it may be a URL linked to a phishing website or a scam website.
  • A phishing website is a website that impersonates a legitimate website and has malicious code implanted in it. When the malicious code is executed, sensitive information such as bank accounts and passwords can be collected. A scam website is a website that uses false claims to induce users to disclose sensitive information, such as a fake prize-winning website.
  • When the terminal is about to initiate a resource access request for a resource address (that is, when the user wants to access the resource corresponding to that address), the terminal may send the resource address to the server as the to-be-detected resource address, so that the server obtains the to-be-detected resource address.
  • the server can also actively collect the resource address as the resource address to be detected.
  • The to-be-detected resource address is a character string composed of a plurality of characters, and the server may perform statistical analysis on the characters constituting the to-be-detected resource address to obtain the character feature corresponding to it.
  • The statistical analysis may be performed on the characters in the to-be-detected resource address or on the words composed of those characters.
  • The characters constituting the to-be-detected resource address may be letters or symbols, such as "/", "?" or ".". If the to-be-detected resource address includes the standard prefix "http://", the characters may be counted either over the full address including the standard prefix or after the standard prefix has been removed.
  • The character feature includes one or a combination of the following: the total length of the to-be-detected resource address; the total number of words in the to-be-detected resource address; whether the to-be-detected resource address includes a preset suspicious keyword; the ratio of the length of the host address in the to-be-detected resource address to the total length of the to-be-detected resource address; and the KL divergence between the frequency of occurrence of characters in the to-be-detected resource address and the frequency of occurrence of the corresponding characters in the malicious resource address library.
  • the total length of the to-be-detected resource address may be the total number of characters included in the resource address to be detected.
  • The preset suspicious keyword is a preset word; when the word appears in the to-be-detected resource address, the probability that the address is a malicious resource address is greater than zero. Because a malicious resource address may mix in vocabulary similar to that of normal resource addresses, a character feature based on preset suspicious keywords can reflect the likelihood that the to-be-detected resource address is malicious.
  • the host address is the address of the device in the network where the resource is located, and is part of the resource address to be detected. Kullback–Leibler divergence, also known as relative entropy, is the quantity that describes the difference between two probability distributions.
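  • The KL divergence feature can be illustrated with a minimal sketch. The function names, the character alphabet, the add-one smoothing, and the sample library addresses (under the reserved example.test domain) are all assumptions made for illustration; the text does not specify how the character frequencies are estimated.

```python
import math
from collections import Counter

# Hypothetical alphabet of characters whose frequencies are compared.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789./-?:"

def char_distribution(strings, alphabet=ALPHABET, smoothing=1.0):
    """Estimate a smoothed character frequency distribution over `alphabet`."""
    counts = Counter(ch for s in strings for ch in s.lower() if ch in alphabet)
    total = sum(counts.values()) + smoothing * len(alphabet)
    # Add-one style smoothing (an assumption) keeps every probability nonzero.
    return {ch: (counts[ch] + smoothing) / total for ch in alphabet}

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for distributions on the same keys."""
    return sum(p[ch] * math.log(p[ch] / q[ch]) for ch in p)

# Distribution of the address under test versus a toy "malicious library".
address = "http://www.icloud-service-centre.com/ic/indexa.asp?b6mrhzlw"
library = ["http://login.example.test/verify", "http://prize.example.test/win?id=1"]
p = char_distribution([address])
q = char_distribution(library)
divergence = kl_divergence(p, q)
```

  • A smaller divergence means the character usage of the address is closer to that of the library; how the value is scaled before entering the feature vector is left open here.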
  • For example, suppose the to-be-detected resource address is "http://www.icloud-service-centre.com/ic/indexa.asp?b6mrhzlw".
  • The total length of the to-be-detected resource address can then be recorded as 59.
  • The host address is "http://www.icloud-service-centre.com", which has a length of 36 and contains the preset suspicious keyword "icloud".
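  • The character statistics in this example can be reproduced with a short sketch. The function name, the keyword list, and the rule for splitting words (runs of letters and digits) are assumptions; the host address is taken as scheme plus host so that its length matches the value of 36 given above.

```python
import re
from urllib.parse import urlsplit

# Hypothetical preset suspicious keywords.
SUSPICIOUS_KEYWORDS = {"icloud", "paypal"}

def character_features(url, keywords=SUSPICIOUS_KEYWORDS):
    """Compute simple character features of a resource address."""
    parts = urlsplit(url)
    # Host address including the standard prefix, e.g. "http://host".
    host_address = f"{parts.scheme}://{parts.netloc}"
    # Assumption: a "word" is a maximal run of letters and digits.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]
    lowered = url.lower()
    return {
        "total_length": len(url),
        "word_count": len(words),
        "has_suspicious_keyword": int(any(k in lowered for k in keywords)),
        "host_length": len(host_address),
        "host_ratio": len(host_address) / len(url),
    }

features = character_features(
    "http://www.icloud-service-centre.com/ic/indexa.asp?b6mrhzlw"
)
```

  • Counting with the standard prefix included is one of the two options the text allows; stripping "http://" first would reduce both lengths by 7.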
  • The server may obtain a related attribute of the to-be-detected resource address, query the malicious related attribute library corresponding to the related attribute type to which the related attribute belongs, determine whether the related attribute hits that library, and, according to whether the library is hit, generate the related attribute feature of the to-be-detected resource address for that related attribute type.
  • The foregoing S306 may also be described as: "acquiring the related attribute of the to-be-detected resource address, querying the related attribute in the malicious related attribute library corresponding to the related attribute type to which the related attribute belongs, and generating, according to the query result, the related attribute feature corresponding to the to-be-detected resource address." Malicious related attribute libraries can be cached in server memory to improve query efficiency.
  • the related attribute is an attribute related to the resource address to be detected.
  • The related attribute feature represents the result of querying whether a related attribute of the to-be-detected resource address belongs to the corresponding malicious related attribute library, and may specifically be a binarized value, such as 0 or 1.
  • There may be one or more related attribute types, and each related attribute type has a corresponding malicious related attribute library, which is the set of related attributes of that type possessed by known malicious resource addresses. Malicious related attribute libraries can be obtained by big data analysis of known malicious resource addresses.
  • the related attribute of the resource address to be detected may include a combination of one or more of propagation channel information, webpage template information, website registrant information, and internet protocol address of the resource address to be detected.
  • The propagation channel information of the to-be-detected resource address is information indicating the path along which the address was propagated. Specifically, it can be obtained by backtracking the propagation channel of the to-be-detected resource address. Since malicious resource addresses may be sent through certain specific tools, the propagation channel information can reflect, to some extent, the likelihood that the to-be-detected resource address is malicious.
  • the webpage template information is information indicating a webpage structure of a webpage corresponding to the resource address to be detected.
  • the webpage template information may be webpage data indicating a webpage structure, or may be a hash value generated based on webpage data representing a webpage structure.
  • Web page data representing a web page structure is, for example, the tags in a web page file or a DOM (Document Object Model) tree.
  • the website registrant information is the registrant information registered when the domain name of the resource address to be detected is registered.
  • the website registrant can be a company or an individual.
  • the website registrant information may include the name, code, and other registration information of the website registrant, or may be a hash value generated based on the website registrant's name, code, and other registration information.
  • The Internet Protocol address is abbreviated as IP address.
  • IP addresses are a scarce resource, and malicious resource addresses tend to cluster on particular IP addresses.
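  • The lookup of related attributes in per-type malicious related attribute libraries can be sketched as follows. The library contents, the attribute type names, and the function name are hypothetical; real libraries would be built by analysis of known malicious resource addresses, and attributes such as templates or registrants might be stored as hash values.

```python
# Hypothetical malicious related attribute libraries, one set per attribute type.
MALICIOUS_ATTRIBUTE_LIBRARIES = {
    "propagation_channel": {"bulk-mailer-x"},
    "page_template_hash": {"9f2c4e"},
    "registrant": {"registrant-0042"},
    "ip_address": {"203.0.113.7"},
}

# Fixed order in which the binary features are emitted.
ATTRIBUTE_TYPES = ["propagation_channel", "page_template_hash", "registrant", "ip_address"]

def related_attribute_features(attributes, libraries=MALICIOUS_ATTRIBUTE_LIBRARIES):
    """Return one 0/1 feature per attribute type: 1 if the attribute hits its library."""
    features = []
    for attribute_type in ATTRIBUTE_TYPES:
        value = attributes.get(attribute_type)
        hit = value is not None and value in libraries[attribute_type]
        features.append(1 if hit else 0)
    return features

example_features = related_attribute_features({
    "propagation_channel": "bulk-mailer-x",
    "registrant": "registrant-0042",
    "ip_address": "198.51.100.20",
})
```

  • Using in-memory sets keeps each lookup cheap, in line with the remark that the libraries can be cached in server memory; a missing attribute is treated here as a non-hit, which is a design choice rather than something the text prescribes.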
  • step S304 and step S306 can be performed simultaneously, or sequentially.
  • Step S304 can be performed before or after step S306.
  • The character features include one or more features, and the related attribute features also include one or more features.
  • the server may combine the character features and the related attribute features in turn according to a preset feature combination order to obtain a multi-dimensional feature.
  • Each dimension in a multi-dimensional feature represents a character feature or a related attribute feature.
  • the multi-dimensional feature can characterize the resource address to be detected.
  • For example, if the total length of the to-be-detected resource address is 53, that character feature can be recorded as 53; if the total number of words in the to-be-detected resource address is 13, that character feature can be recorded as 13; if the to-be-detected resource address includes a preset suspicious keyword, that character feature can be recorded as 1 (or as 0 if it does not); if the length of the host address in the to-be-detected resource address is 12, that character feature can be recorded as 12; the ratio of the length of the host address to the total length of the to-be-detected resource address can then be recorded as 12/53; and if the propagation channel information, the webpage template information, the website registrant information, and the Internet Protocol address of the to-be-detected resource address all hit the corresponding malicious related attribute libraries, these related attribute features can all be recorded as 1. These character features and related attribute features are then combined in order to form a feature vector [53, 13, 1, 12, 12/53, 1, 1,
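  • Combining the features in a preset order can be sketched as below. The feature names and the order list are assumptions chosen to mirror the values described in this example; the only point illustrated is that every address is mapped to a vector with the same dimensions in the same positions.

```python
# Hypothetical preset feature combination order.
FEATURE_ORDER = [
    "total_length", "word_count", "has_suspicious_keyword",
    "host_length", "host_ratio",
    "channel_hit", "template_hit", "registrant_hit", "ip_hit",
]

def combine_features(character_features, related_features, order=FEATURE_ORDER):
    """Merge character and related attribute features into one ordered vector."""
    merged = {**character_features, **related_features}
    missing = [name for name in order if name not in merged]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [merged[name] for name in order]

vector = combine_features(
    {"total_length": 53, "word_count": 13, "has_suspicious_keyword": 1,
     "host_length": 12, "host_ratio": 12 / 53},
    {"channel_hit": 1, "template_hit": 1, "registrant_hit": 1, "ip_hit": 1},
)
```

  • Raising an error on a missing feature, rather than silently defaulting to 0, is a choice made here so that training and detection cannot drift apart in feature layout.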
  • the server may determine, according to the multi-dimensional feature, whether the to-be-detected resource address is a malicious resource address.
  • The machine learning classifier is a trained machine learning algorithm model. Machine learning is abbreviated as ML.
  • The machine learning classifier can classify a multi-dimensional feature as belonging to either a malicious resource address or a non-malicious resource address.
  • a non-malicious resource address is a resource address that does not point to a malicious resource.
  • the machine learning classifier can use a SVM (Support Vector Machine) classifier, a Bayesian classifier, or a neural network model. In practice, the SVM classifier can achieve good results.
  • The server inputs the multi-dimensional feature into the pre-trained machine learning classifier, and the machine learning classifier operates on the multi-dimensional feature to output a malicious resource address detection result indicating whether the to-be-detected resource address is a malicious resource address.
  • The feature types and feature order of the multi-dimensional features used to train the machine learning classifier are consistent with those of the multi-dimensional feature used to determine whether the to-be-detected resource address is a malicious resource address.
  • Through the machine learning classifier, the server calculates, from the input multi-dimensional feature, the probability that the to-be-detected resource address represented by the multi-dimensional feature is a malicious resource address, and compares that probability with a condition threshold. If the probability is greater than or equal to the condition threshold, the machine learning classifier outputs a detection result indicating that the to-be-detected resource address is a malicious resource address; if the probability is less than the condition threshold, it outputs a detection result indicating that the to-be-detected resource address is a non-malicious resource address.
  • The condition threshold can be set between 0.8 and 0.98, for example 0.95.
  • the machine learning classifier can be represented as f(x):
  • x represents a multidimensional feature in vector form and is used to represent the resource address to be detected.
  • m indicates that the address is a malicious resource, for example, 1 may be taken;
  • n indicates that the address is a non-malicious resource, for example, 0 or -1 may be taken.
  • the function g() represents a logistic regression function.
  • q represents a conditional threshold, such as 0.8 to 0.98.
  • w^T x + b denotes a hyperplane that maximizes the margin between the two classes of multi-dimensional features of the training set in the feature space.
  • w denotes a normal vector, T denotes a transpose, and b denotes a coefficient.
  • w and b are obtained through training.
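  • The symbols defined above are consistent with a thresholded decision rule of the following form. This is a reconstruction offered as an assumption, with g taken to be the standard logistic (sigmoid) function and the "greater than or equal to" comparison taken from the threshold rule described earlier:

```latex
f(x) =
\begin{cases}
  m, & g\!\left(w^{T}x + b\right) \ge q \\
  n, & g\!\left(w^{T}x + b\right) < q
\end{cases}
\qquad\text{with}\qquad
g(z) = \frac{1}{1 + e^{-z}}
```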
  • The problem of finding w and b during training can be transformed into a convex quadratic programming problem in which a quantity determined by ||w|| is minimized, where ||w|| is the second-order norm of w.
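  • A minimal sketch of applying an already trained linear model with a condition threshold follows. The weights, bias, and function names are hypothetical, training is not shown, and the use of a logistic function on the linear score is an assumption based on the symbol definitions above.

```python
import math

def logistic(z):
    """Numerically stable logistic (sigmoid) function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def classify(x, w, b, q=0.95, m=1, n=0):
    """Return m (malicious) if g(w.x + b) >= q, otherwise n (non-malicious)."""
    if len(x) != len(w):
        raise ValueError("feature vector and weight vector differ in length")
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return m if logistic(score) >= q else n

# Hypothetical weights for a two-dimensional feature vector.
weights, bias = [2.0, 2.0], 0.0
flagged = classify([1.0, 1.0], weights, bias)   # logistic(4.0) is about 0.982
cleared = classify([0.0, 0.0], weights, bias)   # logistic(0.0) is 0.5
```

  • A high threshold such as 0.95 trades some missed detections for fewer false alarms, which fits the use of a separate feedback loop for missed and falsely reported addresses.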
  • The method for detecting a malicious resource address combines the character features of the to-be-detected resource address obtained by statistics with the related attribute features obtained by querying the malicious related attribute libraries to form a multi-dimensional feature representing the to-be-detected resource address, and then uses a machine learning classifier to classify the multi-dimensional feature, obtaining a detection result of whether the to-be-detected resource address is a malicious resource address.
  • Compared with relying only on a web crawler to crawl the resource corresponding to the to-be-detected resource address, the method can detect malicious resource addresses more effectively.
  • The method for detecting the malicious resource address further includes: determining whether the to-be-detected resource address is a non-malicious resource address or a suspicious resource address; and performing step S304 and step S306 when the to-be-detected resource address is a suspicious resource address.
  • The server may obtain related attribute features and/or character features of the to-be-detected resource address and input them into a filter classifier, which outputs a suspicious resource address detection result indicating whether the to-be-detected resource address is a suspicious resource address. The filter classifier may employ a Bayesian classifier, and preferably a decision tree classifier.
  • The server filters out the to-be-detected resource addresses that are determined to be non-suspicious, retains only those determined to be suspicious resource addresses, and then continues to perform steps S304, S306, S308, and S310 on the retained suspicious resource addresses to obtain a malicious resource address detection result.
  • The process of determining whether the to-be-detected resource address is a non-malicious resource address or a suspicious resource address is in effect a filtering process, so it may also be described as "filtering out the non-malicious resource addresses among the to-be-detected resource addresses". The addresses that remain after filtering can be regarded as suspected malicious addresses, and S304, S306, S308, and S310 can be performed on them.
  • A suspicious resource address is a resource address that has a certain probability of being a malicious resource address.
  • The decision tree classifier is used to filter out the to-be-detected resource addresses determined to be non-malicious. Because the decision tree classifier has high processing efficiency, it can filter the non-malicious resource addresses out of a large number of to-be-detected resource addresses, thereby reducing the load and improving the accuracy of detecting malicious resource addresses.
  • The training set of the decision tree classifier includes a malicious resource address library and a non-malicious resource address library. When training the decision tree classifier, more than one type of related attribute can be extracted for each resource address in the training set, each extracted related attribute can be queried to determine whether it belongs to the corresponding malicious related attribute library, and the resulting related attribute features can be used to train the decision tree classifier.
  • In this embodiment, filtering the to-be-detected resource addresses removes those that are not malicious resource addresses, which reduces the load and improves the accuracy of detecting malicious resource addresses.
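  • The two-stage arrangement, a cheap filter followed by the full classifier, can be sketched as follows. The function name and the stand-in predicates are hypothetical; in the described method the first stage would be a decision tree or Bayesian classifier and the second stage the classifier operating on the multi-dimensional feature.

```python
def detect_in_two_stages(addresses, is_suspicious, is_malicious):
    """Run the expensive classifier only on addresses the filter marks as suspicious."""
    results = {}
    for address in addresses:
        if not is_suspicious(address):
            # Filtered out as non-malicious; the full classifier is skipped.
            results[address] = False
            continue
        results[address] = bool(is_malicious(address))
    return results

# Toy stand-ins for the two classifiers, for illustration only.
full_classifier_calls = []

def toy_filter(address):
    return "login" in address or "prize" in address

def toy_classifier(address):
    full_classifier_calls.append(address)
    return "prize" in address

outcome = detect_in_two_stages(
    ["http://a.example.test/home", "http://b.example.test/login",
     "http://c.example.test/prize"],
    toy_filter,
    toy_classifier,
)
```

  • Only two of the three addresses reach the second stage here, which is the load reduction the filtering step is meant to provide.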
  • FIG. 4 is a schematic flow chart of step S308 in an embodiment. Referring to FIG. 4, the step S308 specifically includes the following steps:
  • The currently detected malicious resource address type is the type of malicious resource address that needs to be detected when the malicious resource address detection method is currently executed; that is, the type of malicious resource address to look for when examining the to-be-detected resource address.
  • Malicious resource addresses can be divided into different types, such as a phishing website type and a scam website type. The phishing website type can be further subdivided into a counterfeit shopping website type, a counterfeit bank website type, a type that counterfeits a specified official website, and so on. Different machine learning classifiers are trained for different malicious resource address types to detect malicious resource addresses.
  • The server may pre-store the correspondence between malicious resource address types and the features adapted to them, so that after the currently detected malicious resource address type is obtained, the features adapted to that type are selected from the character features and the related attribute features according to the correspondence.
  • the correspondence between the malicious resource address type and the matching feature can be set according to prior knowledge, or can be obtained by performing big data analysis on the known malicious resource address.
  • the server may combine the selected features in turn according to a preset feature combination order to obtain a multi-dimensional feature.
  • the server may further weight the features of each dimension in the multi-dimensional feature according to the inter-feature weight relationship adapted to the currently detected malicious resource address type.
  • the multi-dimensional feature is more suitable for the currently detected malicious resource address type by weighting processing.
  • the to-be-detected resource address includes the character feature of the preset suspicious keyword is not effective, and may be removed when the feature is selected.
  • whether the resource address to be detected includes the character characteristics of the preset suspicious keyword can play a good role, and the feature needs to be selected to form a multi-dimensional feature.
  • In this embodiment, malicious resource address types are subdivided. For a given type, the full set of features does not necessarily help in detecting malicious resource addresses and may even be counterproductive, so selecting the features adapted to the currently detected malicious resource address type allows malicious resource addresses to be detected more accurately and effectively.
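  • The type-specific selection and weighting described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the feature names, the type names (`phishing`, `lottery_scam`), and every weight are hypothetical, since the description does not list the actual correspondence table.

```python
# Preset feature combination order (hypothetical names).
FEATURE_ORDER = ["total_length", "word_count", "has_suspicious_keyword",
                 "host_length_ratio", "channel_hit", "template_hit",
                 "registrant_hit", "ip_hit"]

# Hypothetical correspondence: address type -> adapted features and their weights.
TYPE_TO_WEIGHTS = {
    "phishing": {"total_length": 1.0, "has_suspicious_keyword": 2.0,
                 "template_hit": 1.5, "registrant_hit": 1.0},
    # The suspicious-keyword feature is deliberately absent for lottery scams.
    "lottery_scam": {"total_length": 1.0, "channel_hit": 2.0, "ip_hit": 1.0},
}


def select_features(features, address_type):
    """Return the weighted multi-dimensional feature for one address type."""
    weights = TYPE_TO_WEIGHTS[address_type]
    # Walk the preset order so training and detection see the same dimensions.
    return [features[name] * weights[name]
            for name in FEATURE_ORDER if name in weights]


all_features = {"total_length": 59, "word_count": 10, "has_suspicious_keyword": 1,
                "host_length_ratio": 36 / 59, "channel_hit": 1, "template_hit": 0,
                "registrant_hit": 1, "ip_hit": 1}
phishing_vector = select_features(all_features, "phishing")
scam_vector = select_features(all_features, "lottery_scam")
```

  • Keeping the selection table separate from the extraction code means one extraction pass can feed several type-specific classifiers.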
  • In one embodiment, the malicious resource address detection method further includes a step of updating the malicious related attribute library according to missed (false negative) or falsely reported (false positive) malicious resource addresses. In this embodiment, a machine learning classifier may be used in step S310 to determine whether the resource address to be detected is a malicious resource address. Referring to FIG. 5, the step of updating the malicious related attribute library according to missed or falsely reported malicious resource addresses specifically includes the following steps:
  • A missed malicious resource address is an address that is in fact malicious but was judged non-malicious by the machine learning classifier; a falsely reported malicious resource address is an address that is in fact non-malicious but was judged malicious by the machine learning classifier.
  • Missed malicious resource addresses may be obtained through manual reports, or by cross-comparing the detection results that different machine learning classifiers produce for the same resource address to be detected.
  • For example, if the same resource address to be detected passes through machine learning classifiers A, B, and C, and the detection results are, in order, malicious resource address, non-malicious resource address, and non-malicious resource address, then the resource address can be treated as a malicious resource address missed by classifiers B and C.
  • Falsely reported malicious resource addresses can be obtained through manual appeals or manual checks.
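  • The cross-comparison in the A/B/C example can be sketched as below. The rule that a single "malicious" verdict is enough to mark the other classifiers as having missed the address is an assumption made for illustration; the description does not say how disagreements are confirmed.

```python
def find_missed(results):
    """results maps a classifier name to True (malicious) or False (non-malicious)
    for one and the same resource address. Returns the classifiers that are
    treated as having missed the address."""
    if not any(results.values()):
        return []  # nobody flagged it, so there is nothing to cross-compare
    return sorted(name for name, flagged in results.items() if not flagged)


# The example from the text: A says malicious, B and C say non-malicious.
missed_by = find_missed({"A": True, "B": False, "C": False})
```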
  • The server may collect the related attributes of the missed or falsely reported malicious resource addresses through big data analysis.
  • S506: Update the corresponding malicious related attribute library according to the collected related attributes. That is, according to each related attribute of the missed or falsely reported malicious resource address, update the malicious related attribute library corresponding to the related attribute type to which that attribute belongs.
  • For a missed malicious resource address, the server may add its collected related attributes to the corresponding malicious related attribute library.
  • For a falsely reported malicious resource address, the server may delete its related attributes from the corresponding malicious related attribute library.
  • In this embodiment, updating the malicious related attribute library with missed or falsely reported malicious resource addresses keeps later false negatives or false positives from spreading and improves the accuracy of detecting malicious resource addresses.
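  • A minimal sketch of this update, assuming each malicious related attribute library is held as an in-memory set keyed by attribute type. The attribute type names and sample values are hypothetical.

```python
def update_attribute_libraries(libraries, related_attributes, kind):
    """Apply one missed or falsely reported address to the libraries.

    libraries: attribute type -> set of attribute values of known malicious addresses.
    related_attributes: attribute type -> value collected for the address.
    kind: "missed" (false negative) or "false_positive".
    """
    for attribute_type, value in related_attributes.items():
        library = libraries.setdefault(attribute_type, set())
        if kind == "missed":
            library.add(value)       # the library should have contained it
        elif kind == "false_positive":
            library.discard(value)   # the library should not contain it
        else:
            raise ValueError(f"unknown kind: {kind!r}")
    return libraries


libraries = {"ip_address": {"198.51.100.9"}, "registrant": set()}
update_attribute_libraries(
    libraries,
    {"ip_address": "203.0.113.7", "registrant": "hash-of-registrant-A"},
    "missed",
)
update_attribute_libraries(libraries, {"ip_address": "198.51.100.9"}, "false_positive")
```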
  • In one embodiment, the malicious resource address detection method further includes a step of updating the machine learning classifier according to missed or falsely reported malicious resource addresses.
  • Referring to FIG. 6, the step of updating the machine learning classifier according to missed or falsely reported malicious resource addresses specifically includes the following steps:
  • The character features may include one or a combination of: the total length of the missed or falsely reported malicious resource address; the total number of words in it; whether it includes a preset suspicious keyword; the ratio of the length of its host address to its total length; and the KL divergence between the frequency of characters in it and the frequency of the corresponding characters in the malicious resource address library.
  • The server may obtain a related attribute of the missed or falsely reported malicious resource address, query the malicious related attribute library corresponding to the related attribute type to which the attribute belongs, determine whether the library is hit, and, according to that query result, generate the related attribute feature for that type of related attribute of the missed or falsely reported malicious resource address.
  • The server may combine the character features and the related attribute features one by one in a preset feature combination order to obtain the multi-dimensional feature.
  • The server may also obtain the malicious resource address type corresponding to the current machine learning classifier, and select from the character features and related attribute features those adapted to that type.
  • In this embodiment, when missed or falsely reported malicious resource addresses arise, the machine learning classifier is updated according to them, which improves the accuracy of malicious resource address detection after the update.
  • In one embodiment, the malicious resource address detection method further includes: when the resource address to be detected is determined to be a malicious resource address, adding it to the malicious resource address library, where the malicious resource address library is used to intercept resource access requests directed at malicious resource addresses in the library.
  • When the terminal initiates a resource access request for a resource address, it first queries whether the resource address belongs to the malicious resource address library. If so, it intercepts the resource access request; if not, it sends the resource access request.
  • The terminal may query whether a resource address belongs to the malicious resource address library either on the server or locally; the local malicious resource address library may be synchronized from the server periodically.
  • In this embodiment, the resource address to be detected is added to the malicious resource address library, so that resource access requests directed at malicious resource addresses in the library can be intercepted, which safeguards resource access.
  • In one embodiment, the malicious resource address detection method further includes: when the resource address to be detected is determined to be a malicious resource address and does not belong to the preset non-malicious resource address library, adding it to the malicious resource address library, where the malicious resource address library is used to intercept resource access requests directed at malicious resource addresses in the library.
  • After the resource address to be detected is determined to be a malicious resource address, the server may go on to determine whether it belongs to the preset non-malicious resource address library.
  • The preset non-malicious resource address library is a predefined set of non-malicious resource addresses used to guard against false positives. If the resource address to be detected belongs to this library, the server takes no action on it.
  • Otherwise, the server may add the resource address to be detected to the malicious resource address library, so that the detected malicious resource address can be used to intercept the corresponding resource access requests.
  • In this embodiment, the malicious resource addresses detected by the machine learning classifier may include false positives, and because the malicious resource address library is used to intercept resource access requests, a false positive may disrupt normal resource access. Adding the resource address to be detected to the malicious resource address library only when it does not belong to the preset non-malicious resource address library guards against false positives and keeps falsely reported addresses from affecting normal resource access.
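  • The two embodiments above (adding confirmed addresses to the malicious resource address library, and the terminal-side interception check) can be sketched together. Function names and the example addresses are hypothetical, and both libraries are modeled as plain in-memory sets.

```python
def record_verdict(address, judged_malicious, non_malicious_library, malicious_library):
    """Add the address to the malicious library unless the preset non-malicious
    library contains it. Returns True if the address was added."""
    if judged_malicious and address not in non_malicious_library:
        malicious_library.add(address)
        return True
    return False


def handle_access_request(address, malicious_library, send):
    """Terminal-side check: intercept listed addresses, send everything else."""
    if address in malicious_library:
        return "intercepted"
    send(address)
    return "sent"


non_malicious_library = {"http://bank.example/login"}
malicious_library = set()
# Suppose the classifier judged both addresses malicious; the first is a false positive.
record_verdict("http://bank.example/login", True, non_malicious_library, malicious_library)
record_verdict("http://scam.example/prize", True, non_malicious_library, malicious_library)

sent = []
outcomes = [handle_access_request(a, malicious_library, sent.append)
            for a in ("http://bank.example/login", "http://scam.example/prize")]
```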
  • In a specific application scenario, the server may use the malicious resource address library and the non-malicious resource address library as the training sample library, build the related attribute libraries from the related attributes of the malicious resource addresses in the malicious resource address library, generate character features and related attribute features for the resource addresses in the training samples, and, according to the corresponding malicious resource address type, select features from the generated character features and related attribute features to form multi-dimensional features.
  • The server trains the machine learning classifier with the multi-dimensional features corresponding to the resource addresses in the training samples.
  • The server receives incoming resource addresses to be detected, uses a decision tree classifier to filter out those that are non-malicious resource addresses, extracts character features and related attribute features from the resource addresses that remain after filtering, and, according to the currently detected malicious resource address type, selects features from the extracted character features and related attribute features to form the multi-dimensional feature.
  • The server inputs the multi-dimensional feature corresponding to the resource address to be detected into the machine learning classifier adapted to the currently detected malicious resource address type, and the machine learning classifier outputs a detection result indicating whether the resource address to be detected is a malicious resource address.
  • The server may apply false positive prevention to the malicious resource address detection result.
  • When the resource address to be detected is determined to be a malicious resource address, the server may determine whether it belongs to the preset non-malicious resource address library, and add it to the malicious resource address library if it does not.
  • When the resource address to be detected is determined to be a malicious resource address, the server may also determine whether a specified feature of the address meets a specified feature condition for non-malicious resource addresses, for example whether its search volume, click volume, or popularity exceeds a preset value; if the condition is not met, the server adds the resource address to the malicious resource address library.
  • The server may also identify falsely reported malicious resource addresses from manual appeals and missed malicious resource addresses from manual reports, and update the related attribute libraries and the machine learning classifier according to the missed and falsely reported addresses.
  • The server may also monitor malicious resource addresses with another machine learning classifier whose probability judgment condition is more relaxed than that of the classifier used for detection, for example another classifier whose condition threshold is lower than that of the detection classifier.
  • The condition threshold of this other machine learning classifier may be, for example, 0.5.
  • The accuracy with which this other classifier judges a resource address to be malicious is lower than that of the classifier used for detection, but its coverage of malicious resource addresses is higher than that of the detection classifier.
  • Monitoring malicious resource addresses with this other machine learning classifier uncovers more malicious resource addresses and thus safeguards the coverage of malicious resource address detection.
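  • One way to picture the relaxed monitoring classifier is sketched below. It assumes each classifier exposes a probability per address; the score dictionaries are invented sample data, and 0.95 and 0.5 are the example thresholds the description gives for the detection and monitoring classifiers.

```python
DETECTION_THRESHOLD = 0.95  # example value for the detection classifier
MONITOR_THRESHOLD = 0.5     # example value for the relaxed monitoring classifier


def flagged(probability, threshold):
    """A classifier reports 'malicious' when its probability reaches its threshold."""
    return probability >= threshold


def monitor_candidates(detection_scores, monitor_scores):
    """Addresses the relaxed classifier flags but the strict one does not.
    These are candidates for review, for example as possible missed addresses."""
    return sorted(
        address for address, probability in monitor_scores.items()
        if flagged(probability, MONITOR_THRESHOLD)
        and not flagged(detection_scores[address], DETECTION_THRESHOLD)
    )


# Invented sample scores for three addresses.
detection_scores = {"u1": 0.97, "u2": 0.70, "u3": 0.10}
monitor_scores = {"u1": 0.99, "u2": 0.64, "u3": 0.20}
candidates = monitor_candidates(detection_scores, monitor_scores)
```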
  • FIG. 8 is a structural block diagram of the instruction modules in a malicious resource address detection apparatus 800 (for example, a server) in one embodiment.
  • The malicious resource address detection apparatus 800 includes:
  • at least one memory (e.g., the non-volatile storage medium in FIG. 2); and
  • at least one processor; wherein
  • the at least one memory stores at least one instruction module configured to be executed by the at least one processor; wherein
  • the at least one instruction module includes:
  • a data access module 810, a feature extraction module 820, and a detection module 830.
  • The data access module 810 is configured to obtain the resource address to be detected.
  • The feature extraction module 820 is configured to obtain the character features of the resource address to be detected; to query whether the related attributes of the resource address belong to the corresponding malicious related attribute libraries and obtain the corresponding related attribute features; and to combine the character features and the related attribute features into a multi-dimensional feature. That is, it obtains the character features of the resource address to be detected, obtains the related attributes of the resource address, queries each related attribute in the malicious related attribute library corresponding to the related attribute type to which it belongs, generates the related attribute features of the resource address according to the query results, and combines the character features and the related attribute features to obtain the multi-dimensional feature.
  • The detection module 830 is configured to determine, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.
  • The malicious resource address detection apparatus 800 combines the statistically obtained character features of the resource address to be detected with the related attribute features obtained by querying the malicious related attribute libraries to form a multi-dimensional feature representing the resource address, and then uses a machine learning classifier to classify the multi-dimensional feature and obtain a detection result indicating whether the resource address is a malicious resource address.
  • Because it combines the character features of the resource address itself with its related attributes, this approach detects malicious resource addresses more effectively than approaches that rely solely on a web crawler fetching the resource behind the address.
  • FIG. 9 is a structural block diagram of the instruction modules in the malicious resource address detection apparatus 800 in another embodiment.
  • The instruction modules in the malicious resource address detection apparatus 800 further include a filtering module 840, configured to determine whether the resource address to be detected is a non-malicious resource address or a suspicious resource address, and to notify the feature extraction module 820 when it is a suspicious resource address. That is, the filtering module filters out the non-malicious resource addresses among the resource addresses to be detected.
  • The feature extraction module 820 is further configured to, when the resource address to be detected is a suspicious resource address, obtain its character features, query whether its related attributes belong to the corresponding malicious related attribute libraries, and obtain the corresponding related attribute features. That is, the resource addresses remaining after filtering are treated as suspicious resource addresses, and the steps of obtaining character features and related attributes are performed on them.
  • In this embodiment, filtering the resource addresses to be detected removes those that clearly are not malicious resource addresses, which reduces load and improves the accuracy of detecting malicious resource addresses.
  • In one embodiment, the character features include one or a combination of: the total length of the resource address to be detected; the total number of words in it; whether it includes a preset suspicious keyword; the ratio of the length of its host address to its total length; and the KL divergence between the frequency of characters in it and the frequency of the corresponding characters in the malicious resource address library.
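  • The KL divergence feature can be sketched as below. Several choices here are assumptions rather than details from the patent: the alphabet (lowercase letters and digits), add-one smoothing to avoid zero probabilities, the direction of the divergence (address relative to library), and the one-address stand-in for the malicious resource address library.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase + string.digits  # assumed character set


def char_distribution(texts):
    """Smoothed relative frequency of each alphabet character across the texts."""
    counts = Counter(ch for text in texts for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values()) + len(ALPHABET)  # add-one smoothing
    return {ch: (counts[ch] + 1) / total for ch in ALPHABET}


def kl_divergence(p, q):
    """D_KL(P || Q) = sum over characters of p * log(p / q)."""
    return sum(p[ch] * math.log(p[ch] / q[ch]) for ch in p)


# Stand-in library holding only the example address from the description.
library = ["http://www.icloud-service-centre.com/ic/indexa.asp?b6mrhzlw"]
reference = char_distribution(library)
candidate = char_distribution(["http://example.com/index.html"])
divergence = kl_divergence(candidate, reference)
```

  • A smaller divergence means the character usage of the address is closer to that of known malicious addresses.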
  • In one embodiment, the related attributes of the resource address to be detected include one or a combination of: propagation channel information, webpage template information, website registrant information, and the Internet Protocol address of the resource address.
  • The feature extraction module 820 is further configured to obtain the currently detected malicious resource address type, select from the character features and the related attribute features the features adapted to that type, and combine the selected features into the multi-dimensional feature. That is, the feature extraction module 820 obtains a malicious resource address type, which is the type of malicious resource address that is being checked for when the resource address to be detected is examined; selects from the character features and the related attribute features the features adapted to that type; and combines the selected features to obtain the multi-dimensional feature.
  • In this embodiment, malicious resource address types are subdivided. For a given type, the full set of features does not necessarily help in detecting malicious resource addresses and may even be counterproductive, so selecting the features adapted to the currently detected malicious resource address type allows malicious resource addresses to be detected more accurately and effectively.
  • The detection module 830 is further configured to use a machine learning classifier to determine, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.
  • The malicious resource address detection apparatus 800 further includes a missed or false report collection module 850 and a malicious related attribute library update module 860.
  • The missed or false report collection module 850 is configured to collect the malicious resource addresses that are missed or falsely reported when the machine learning classifier determines whether a resource address to be detected is a malicious resource address.
  • The malicious related attribute library update module 860 is configured to obtain the related attributes of the missed or falsely reported malicious resource addresses and update the corresponding malicious related attribute libraries according to the collected related attributes. That is, according to each related attribute of a missed or falsely reported malicious resource address, it updates the malicious related attribute library corresponding to the related attribute type to which that attribute belongs.
  • The malicious resource address detection apparatus 800 further includes a machine learning classifier update module 870, configured to obtain the character features of a missed or falsely reported malicious resource address; query whether the related attributes of that address belong to the corresponding malicious related attribute libraries and obtain the corresponding related attribute features; combine the character features and the related attribute features of that address into its multi-dimensional feature; and update the machine learning classifier according to that multi-dimensional feature.
  • That is, the machine learning classifier update module obtains the character features of the missed or falsely reported malicious resource address; obtains the related attributes of that address, queries each related attribute in the malicious related attribute library corresponding to the related attribute type to which it belongs, and generates the related attribute features of that address according to the query results; combines the character features and related attribute features of that address to obtain its multi-dimensional feature; and updates the machine learning classifier according to that multi-dimensional feature.
  • In this embodiment, when missed or falsely reported malicious resource addresses arise, the machine learning classifier is updated according to them, which improves the accuracy of malicious resource address detection after the update.
  • The malicious resource address detection apparatus 800 further includes a malicious resource address library management module 880, configured to add the resource address to be detected to the malicious resource address library when it is determined to be a malicious resource address, where the malicious resource address library is used to intercept resource access requests directed at malicious resource addresses in the library.
  • In this embodiment, the resource address to be detected is added to the malicious resource address library, so that resource access requests directed at malicious resource addresses in the library can be intercepted, which safeguards resource access.
  • The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or the like.
  • The embodiments of the present invention further provide a computer readable storage medium storing computer readable instructions or programs which, when executed by a processor, perform the foregoing malicious resource address detection method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Computer And Data Communications (AREA)

Abstract

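The present invention relates to a malicious resource address detection method and apparatus, and a storage medium. The method includes: obtaining a resource address to be detected; obtaining character features of the resource address to be detected; obtaining related attributes of the resource address to be detected, querying each related attribute in the malicious related attribute library corresponding to the related attribute type to which it belongs, and generating the related attribute features of the resource address to be detected according to the query results; combining the character features and the related attribute features to obtain a multi-dimensional feature; and determining, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.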

Description

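Malicious resource address detection method and apparatus, and storage medium
This application claims priority to Chinese Patent Application No. 201610978043.6, entitled "Malicious resource address detection method and apparatus" and filed with the Chinese Patent Office on October 31, 2016, which is incorporated herein by reference in its entirety.
Technical Field
The embodiments of the present invention relate to the field of network security technologies, and in particular, to a malicious resource address detection method and apparatus, and a storage medium.
Background
A resource address is an identifier that indicates the location of a resource stored on a network, such as a URL (Uniform Resource Locator). Once a resource is placed on the network, it can be conveniently accessed and shared through its resource address. However, resource addresses are also used by some people as a medium for illegal activities, linking to malicious resources that harm users, such as phishing websites or scam websites; such addresses are malicious resource addresses.
Summary
The embodiments of the present invention provide a malicious resource address detection method and apparatus, and a storage medium.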
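A malicious resource address detection method, the method including:
obtaining a resource address to be detected;
obtaining character features of the resource address to be detected;
obtaining related attributes of the resource address to be detected, querying each related attribute in the malicious related attribute library corresponding to the related attribute type to which it belongs, and generating the related attribute features of the resource address to be detected according to the query results;
combining the character features and the related attribute features to obtain a multi-dimensional feature; and
determining, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.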
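A malicious resource address detection apparatus, the apparatus including:
at least one memory; and
at least one processor; wherein
the at least one memory stores at least one instruction module configured to be executed by the at least one processor; wherein
the at least one instruction module includes:
a data access module, configured to obtain a resource address to be detected;
a feature extraction module, configured to obtain character features of the resource address to be detected, obtain related attributes of the resource address to be detected, query each related attribute in the malicious related attribute library corresponding to the related attribute type to which it belongs, generate the related attribute features of the resource address to be detected according to the query results, and combine the character features and the related attribute features to obtain a multi-dimensional feature; and
a detection module, configured to determine, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.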
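In some examples, the feature extraction module is configured to obtain a malicious resource address type, which is the type of malicious resource address that is being checked for when the resource address to be detected is examined; select from the character features and the related attribute features the features adapted to the malicious resource address type; and combine the selected features to obtain the multi-dimensional feature.
In some examples, the detection module is configured to use a machine learning classifier to determine whether the resource address to be detected is a malicious resource address;
the apparatus further includes:
a missed or false report collection module, configured to collect the malicious resource addresses that are missed or falsely reported when the machine learning classifier is used to determine whether the resource address to be detected is a malicious resource address; and
a malicious related attribute library update module, configured to obtain the related attributes of the missed or falsely reported malicious resource addresses and, according to each such related attribute, update the malicious related attribute library corresponding to the related attribute type to which the attribute belongs.
In some examples, the apparatus further includes:
a machine learning classifier update module, configured to obtain the character features of the missed or falsely reported malicious resource address; obtain the related attributes of the missed or falsely reported malicious resource address, query each related attribute in the malicious related attribute library corresponding to the related attribute type to which it belongs, and generate the related attribute features of the missed or falsely reported malicious resource address according to the query results; combine the character features and related attribute features of the missed or falsely reported malicious resource address to obtain its multi-dimensional feature; and update the machine learning classifier according to that multi-dimensional feature.
In some examples, the apparatus further includes:
a malicious resource address library management module, configured to add the resource address to be detected to a malicious resource address library when it is determined to be a malicious resource address, where the malicious resource address library is used to intercept resource access requests directed at malicious resource addresses in the library.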
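A malicious resource address detection method, performed by a server, the method including:
obtaining a resource address to be detected;
obtaining character features of the resource address to be detected;
obtaining related attributes of the resource address to be detected, querying each related attribute in the malicious related attribute library corresponding to the related attribute type to which it belongs, and generating the related attribute features of the resource address to be detected according to the query results;
combining the character features and the related attribute features to obtain a multi-dimensional feature; and
determining, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.
A computer readable storage medium storing computer readable instructions or programs which, when executed by a processor, perform the foregoing malicious resource address detection method.
The foregoing malicious resource address detection method and apparatus and storage medium combine the statistically obtained character features of the resource address to be detected with the related attribute features obtained by querying the malicious related attribute libraries to form a multi-dimensional feature representing the resource address, and then determine from the multi-dimensional feature whether the resource address is a malicious resource address. Because this technical solution combines the character features of the resource address itself with its related attribute features, it detects malicious resource addresses more effectively, and with higher accuracy, than approaches that rely solely on a web crawler fetching the resource behind the address.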
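Brief Description of the Drawings
FIG. 1 is a diagram of the application environment of a malicious resource address detection system in one embodiment;
FIG. 2 is a schematic diagram of the internal structure of a server in one embodiment;
FIG. 3 is a schematic flowchart of a malicious resource address detection method in one embodiment;
FIG. 4 is a schematic flowchart of the step of combining character features and related attribute features to obtain a multi-dimensional feature in one embodiment;
FIG. 5 is a schematic flowchart of the step of updating the malicious related attribute library according to missed or falsely reported malicious resource addresses in one embodiment;
FIG. 6 is a schematic flowchart of the step of updating the machine learning classifier according to missed or falsely reported malicious resource addresses in one embodiment;
FIG. 7 is a schematic flowchart of a malicious resource address detection method in a specific application scenario;
FIG. 8 is a structural block diagram of a malicious resource address detection apparatus in one embodiment;
FIG. 9 is a structural block diagram of a malicious resource address detection apparatus in another embodiment.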
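Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and are not intended to limit it.
FIG. 1 is a diagram of the application environment of a malicious resource address detection system in one embodiment. Referring to FIG. 1, the malicious resource address detection system includes a terminal 110 and a server 120. The terminal 110 may send a resource address to be detected to the server 120. The server 120 may obtain the resource address to be detected sent by the terminal 110; obtain the character features of the resource address; query whether the related attributes of the resource address belong to the corresponding malicious related attribute libraries to obtain the corresponding related attribute features; combine the character features and the related attribute features to obtain a multi-dimensional feature; and determine, according to the multi-dimensional feature, whether the resource address is a malicious resource address. The server 120 may also feed the detection result, indicating whether the resource address is a malicious resource address, back to the terminal 110. In other words, the server 120 can execute the malicious resource address detection method and can therefore also serve as a malicious resource address detection apparatus.
FIG. 2 is a schematic diagram of the internal structure of a server (or malicious resource address detection apparatus) in one embodiment. As shown in FIG. 2, the server includes a processor, a non-volatile storage medium, an internal memory, and a network interface connected through a system bus. The non-volatile storage medium of the server stores an operating system, a database, and the instruction modules of the malicious resource address detection apparatus (that is, the instruction modules that execute the malicious resource address detection method). The database may include the malicious related attribute libraries, the malicious resource address library, the non-malicious resource address library, and the preset non-malicious resource address library. The instruction modules of the malicious resource address detection apparatus implement a malicious resource address detection method suitable for the server. The processor of the server provides computing and control capabilities and supports the operation of the entire server. The internal memory of the server provides an environment for running the instruction modules of the malicious resource address detection apparatus stored in the non-volatile storage medium; the internal memory may store computer readable instructions that, when executed by the processor, cause the processor to perform a malicious resource address detection method. The network interface of the server is used to communicate with external terminals over a network connection, for example to receive a resource address to be detected sent by a terminal and to feed the detection result back to the terminal. The server may be implemented as a standalone server or as a server cluster composed of multiple servers. Those skilled in the art will understand that the structure shown in FIG. 2 is only a block diagram of the part of the structure related to the solution of this application and does not limit the server to which the solution is applied; a specific server may include more or fewer components than shown, combine certain components, or arrange the components differently.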
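FIG. 3 is a schematic flowchart of a malicious resource address detection method in one embodiment. This embodiment is described mainly with the method applied to the server 120 in the malicious resource address detection system of FIG. 1. Referring to FIG. 3, the malicious resource address detection method specifically includes the following steps:
S302: Obtain a resource address to be detected.
Here, the resource address to be detected is a resource address that needs to be checked for being a malicious resource address. A resource address is data that identifies the location of a resource on the network, such as a URL or a URI (Uniform Resource Identifier). A resource is data that can be stored and transmitted on the network, such as a webpage or a network file. A malicious resource address is a resource address that links to a malicious resource; malicious resources include phishing websites and scam websites, so a malicious resource address may be a URL that links to a phishing website or a scam website. A phishing website is a website that counterfeits another legitimate website and usually has malicious code implanted; when executed, the malicious code can collect sensitive user information such as bank account numbers and passwords. A scam website is a website that uses false claims to lead users into disclosing sensitive information, such as a lottery scam website.
Specifically, when the terminal initiates a resource access request for a resource address (that is, when a user wants to access the resource behind that address), the terminal may send the resource address to the server as the resource address to be detected, and the server obtains it. The server may also actively collect resource addresses as resource addresses to be detected.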
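S304: Obtain the character features of the resource address to be detected.
Here, the resource address to be detected is a string made up of a number of characters. The server may perform statistical analysis on the characters that make up the resource address to obtain the character features corresponding to it. The statistical analysis may target the characters in the resource address or the words formed by those characters. The characters may be letters or symbols, such as "/", "?" or ".". If the resource address includes the standard prefix "http://", the character features may be computed either with the standard prefix included or after the standard prefix has been removed from the resource address.
In one embodiment, the character features include one or a combination of: the total length of the resource address to be detected; the total number of words in it; whether it includes a preset suspicious keyword; the ratio of the length of its host address to its total length; and the KL divergence between the frequency of characters in it and the frequency of the corresponding characters in the malicious resource address library.
Here, the total length of the resource address to be detected may be the total number of characters it contains. A preset suspicious keyword is a predefined word; when the resource address includes that word, the probability that the resource address is a malicious resource address is greater than 0. Because malicious resource addresses may mix in words similar to those found in normal resource addresses, using preset suspicious keywords as a character feature reflects, to some extent, the likelihood that the resource address is malicious. The host address identifies the device on the network where the resource is located and is part of the resource address to be detected. KL divergence (Kullback-Leibler divergence), also called relative entropy, is a quantity that describes the difference between two probability distributions.
For example, suppose the resource address to be detected is "http://www.icloud-service-centre.com/ic/indexa.asp?b6mrhzlw". Its total length can be recorded as 59, its host address is "http://www.icloud-service-centre.com", the length of the host address is 36, and it contains the preset suspicious keyword "icloud".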
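The numbers in this example can be reproduced with a short sketch using Python's standard `urllib.parse`. The keyword set and the definition of a "word" as a run of letters and digits are assumptions; the description names only the keyword "icloud" and does not define word boundaries. As in the example, the "http://" prefix is counted as part of the host address.

```python
import re
from urllib.parse import urlsplit

SUSPICIOUS_KEYWORDS = {"icloud"}  # hypothetical preset list


def character_features(url):
    """Compute the simple character features of one resource address."""
    parts = urlsplit(url)
    # The example counts the scheme prefix as part of the host address.
    host_address = f"{parts.scheme}://{parts.netloc}"
    # Assumption: a word is a maximal run of ASCII letters and digits.
    words = re.findall(r"[A-Za-z0-9]+", url)
    return {
        "total_length": len(url),
        "word_count": len(words),
        "has_suspicious_keyword": int(
            any(word.lower() in SUSPICIOUS_KEYWORDS for word in words)
        ),
        "host_length": len(host_address),
        "host_length_ratio": len(host_address) / len(url),
    }


features = character_features(
    "http://www.icloud-service-centre.com/ic/indexa.asp?b6mrhzlw"
)
```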
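S306: Query whether the related attributes of the resource address to be detected belong to the corresponding malicious related attribute libraries, and obtain the corresponding related attribute features.
Specifically, the server may obtain a related attribute of the resource address to be detected, query the malicious related attribute library corresponding to the related attribute type to which the attribute belongs, determine whether the library is hit, and generate, according to that query result, the related attribute feature for that type of related attribute of the resource address. In other words, S306 can also be described as "obtaining the related attributes of the resource address to be detected, querying each related attribute in the malicious related attribute library corresponding to the related attribute type to which it belongs, and generating the related attribute features of the resource address to be detected according to the query results". The malicious related attribute libraries may be cached in server memory to improve query efficiency.
Here, a related attribute is an attribute associated with the resource address to be detected. A related attribute feature represents the result of querying whether a related attribute of the resource address belongs to the corresponding malicious related attribute library, and may be a binary value such as 0 or 1. There may be one or more related attribute types, and each related attribute type has its own malicious related attribute library, which is the set of related attributes of that type possessed by malicious resource addresses. The malicious related attribute libraries can be obtained by big data analysis of known malicious resource addresses.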
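In one embodiment, the related attributes of the resource address to be detected may include one or a combination of: propagation channel information, webpage template information, website registrant information, and the Internet Protocol address of the resource address.
The propagation channel information indicates how the resource address to be detected is spread; it can be obtained by tracing back the propagation path of the resource address. Because malicious resource addresses may be sent through certain specific tools, propagation channel information reflects, to some extent, the likelihood that the resource address is malicious.
The webpage template information describes the page structure of the webpage behind the resource address to be detected. It may be the webpage data that represents the page structure, or a hash value generated from that data. Webpage data representing page structure includes, for example, the tags in the webpage file or the DOM (Document Object Model) tree.
The website registrant information is the registrant information recorded when the domain name of the resource address to be detected was registered. The website registrant may be a company or an individual. The website registrant information may include the registrant's name, code, and other registration details, or a hash value generated from them. The Internet Protocol address is the IP address. IP addresses are a scarce resource, and malicious resource addresses tend to cluster on certain IP addresses.
Steps S304 and S306 may be executed at the same time or one after the other; step S304 may be executed before or after step S306.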
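Step S306 can be sketched as a set-membership lookup per attribute type, producing one binary feature each. The attribute type names, the library contents, and the sample attribute values below are all hypothetical; in-memory sets stand in for the libraries the description says may be cached in server memory.

```python
# Hypothetical names for the four related attribute types.
ATTRIBUTE_TYPES = ["propagation_channel", "page_template", "registrant", "ip_address"]


def related_attribute_features(attributes, libraries):
    """Return one 0/1 feature per attribute type: 1 if the address's attribute of
    that type hits the corresponding malicious related attribute library."""
    features = []
    for attribute_type in ATTRIBUTE_TYPES:
        value = attributes.get(attribute_type)          # None if not collected
        library = libraries.get(attribute_type, set())  # one library per type
        features.append(1 if value is not None and value in library else 0)
    return features


libraries = {
    "propagation_channel": {"bulk-sms-tool"},
    "page_template": {"template-hash-1"},
    "registrant": {"registrant-hash-9"},
    "ip_address": {"203.0.113.7"},
}
attributes = {
    "propagation_channel": "bulk-sms-tool",
    "page_template": "template-hash-2",
    "ip_address": "203.0.113.7",
}
features = related_attribute_features(attributes, libraries)
```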
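S308: Combine the character features and the related attribute features to obtain a multi-dimensional feature.
Specifically, the character features include one or more features, and the related attribute features also include one or more features. The server may combine the character features and the related attribute features one by one in a preset feature combination order to obtain the multi-dimensional feature. Each dimension of the multi-dimensional feature represents one character feature or one related attribute feature. The multi-dimensional feature represents the resource address to be detected.
For example, suppose the total length of the resource address to be detected is 53, so this character feature is recorded as 53; the total number of words in the resource address is 13, so this character feature is recorded as 13; the resource address includes a preset suspicious keyword, so this character feature is recorded as 1 (it would be recorded as 0 if the resource address did not include one); the length of the host address in the resource address is 12, so this character feature is recorded as 12; the ratio of the host address length to the total length of the resource address is 12/53, so this character feature is recorded as 12/53; and the propagation channel information, webpage template information, website registrant information, and Internet Protocol address of the resource address all hit the corresponding malicious related attribute libraries, so these related attribute features are all recorded as 1. Combining these character features and related attribute features in order gives the feature vector [53, 13, 1, 12, 12/53, 1, 1, 1, 1].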
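The combination step in this example amounts to reading the features out in a fixed order. A minimal sketch, with hypothetical feature names chosen to match the order used in the example:

```python
# Preset feature combination order; the names are illustrative only.
FEATURE_ORDER = [
    "total_length", "word_count", "has_suspicious_keyword",
    "host_length", "host_length_ratio",
    "channel_hit", "template_hit", "registrant_hit", "ip_hit",
]


def combine(character_features, related_attribute_features):
    """Concatenate both feature groups into one vector in the preset order."""
    merged = {**character_features, **related_attribute_features}
    missing = [name for name in FEATURE_ORDER if name not in merged]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [merged[name] for name in FEATURE_ORDER]


character = {"total_length": 53, "word_count": 13, "has_suspicious_keyword": 1,
             "host_length": 12, "host_length_ratio": 12 / 53}
related = {"channel_hit": 1, "template_hit": 1, "registrant_hit": 1, "ip_hit": 1}
vector = combine(character, related)
```

Using the same `FEATURE_ORDER` at training time and at detection time keeps the dimensions aligned, which the description requires of the classifier's inputs.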
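S310: Determine, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.
Specifically, the server may determine, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address. A machine learning (ML) classifier is a trained machine learning model. A machine learning classifier acquires its classification ability by learning from samples; the machine learning classifier in this embodiment assigns the resource address to be detected, as represented by its multi-dimensional feature, to one of two classes: malicious resource address or non-malicious resource address. A non-malicious resource address is a resource address that does not point to a malicious resource. The machine learning classifier may be an SVM (Support Vector Machine) classifier, a Bayesian classifier, a neural network model, or the like. In practice, an SVM classifier achieves good results.
Specifically, the server inputs the multi-dimensional feature into the pre-trained machine learning classifier, which performs its computation on the multi-dimensional feature and outputs a detection result indicating whether the resource address to be detected is a malicious resource address. The feature types and feature order of the multi-dimensional features used to train the machine learning classifier are the same as those of the multi-dimensional feature used when determining whether a resource address to be detected is a malicious resource address.
In one embodiment, the server uses the machine learning classifier to compute, from the input multi-dimensional feature, the probability that the resource address represented by the feature is a malicious resource address, and compares this probability with a condition threshold. If the probability is greater than or equal to the condition threshold, the machine learning classifier outputs a detection result indicating that the resource address is a malicious resource address; if the probability is less than the condition threshold, it outputs a detection result indicating that the resource address is a non-malicious resource address. The condition threshold may be set between 0.8 and 0.98, for example 0.95.
In one embodiment, the machine learning classifier can be expressed as f(x):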
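$$f(x) = \begin{cases} m, & g(w^{T}x + b) \ge q \\ n, & g(w^{T}x + b) < q \end{cases}$$ (reconstructed from the symbol definitions that follow; the original formula is image PCTCN2017105796-appb-000001)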
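Here, x is the multi-dimensional feature in vector form, representing the resource address to be detected. m denotes a judgment of malicious resource address and may take the value 1, for example; n denotes a judgment of non-malicious resource address and may take the value 0 or -1, for example. The function g() is the logistic regression function. q is the condition threshold, which may be 0.8 to 0.98, for example. w^T x + b is the hyperplane that maximizes the margin in feature space between the multi-dimensional features of the two classes in the training set. w is the normal vector, T denotes transposition, and b is a coefficient. w and b are obtained through training. Solving for w and b during training can be converted into a convex quadratic programming problem that minimizes ||w||, where ||w|| is the L2 norm of w.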
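The decision rule described by these symbols can be sketched as follows. The piecewise form of f(x) is inferred from the symbol definitions and the threshold rule of the previous embodiment, and g is taken to be the logistic (sigmoid) function; both are readings of the text rather than a reproduction of the formula image. The weights `w` and `b` below are invented for illustration and were not obtained by training.

```python
import math


def g(z):
    """Logistic function, written so that large |z| cannot overflow math.exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def f(x, w, b, q=0.95, m=1, n=0):
    """Return m (malicious) if g(w^T x + b) >= q, otherwise n (non-malicious)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return m if g(z) >= q else n


w, b = [1.0, 2.0], -1.0          # hypothetical parameters
strong = f([2.0, 1.5], w, b)     # z = 4.0, g(z) is about 0.982
weak = f([1.0, 0.5], w, b)       # z = 1.0, g(z) is about 0.731
```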
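The foregoing malicious resource address detection method combines the statistically obtained character features of the resource address to be detected with the related attribute features obtained by querying the malicious related attribute libraries to form a multi-dimensional feature representing the resource address, and then uses a machine learning classifier to classify the multi-dimensional feature and obtain a detection result indicating whether the resource address is a malicious resource address. Because it combines the character features of the resource address itself with its related attributes, the method detects malicious resource addresses more effectively than approaches that rely solely on a web crawler fetching the resource behind the address.
In one embodiment, before steps S304 and S306, the malicious resource address detection method further includes: determining whether the resource address to be detected is a non-malicious resource address or a suspicious resource address; and executing steps S304 and S306 when the resource address to be detected is a suspicious resource address.
Specifically, the server may obtain the related attribute features and/or character features of the resource address to be detected and input them into a filter classifier, which outputs a result indicating whether the resource address is a suspicious resource address. The filter classifier may be a Bayesian classifier and is preferably a decision tree classifier. The server filters out the resource addresses judged not to be suspicious and keeps only those judged to be suspicious, then continues with steps S304, S306, S308, and S310 on the suspicious ones to obtain the malicious resource address detection result.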
可理解的是,上述判断所述待检测资源地址为非恶意资源地址或可疑资源地址的过程,实际上是一个过滤的过程,因此该判断过程也可以描述成“滤除所述待检测资源地址中的非恶意资源地址”,过滤后剩余的待检测资源地址即可疑恶意地址,对这些可疑恶意地址执行S304、S306、S308以及S310即可。
其中,可疑资源地址是存在一定概率是恶意资源地址的资源地址。决策树分类器用于过滤掉确定为非恶意资源地址的待检测资源地址,且决策树分类器的处理效率很高,可以从数量庞大的待检测资源地址中过滤掉非恶意资源地址,减少负载,并提高检测恶意资源地址的准确率。决策树分类器的训练集包括恶意资源地址库和非恶意资源地址库,训练决策树分类器时可提取训练集中每个资源地址相应的多于一种类型的相关属性,并查询提取的相关属性是否属于相应的恶意相关属性库,得到相应的相关属性特征,从而根据该相关属性特征训练决策树分类器。
本实施例中,将待检测资源地址进行过滤,可以过滤掉明显不属于恶意资源地址的待检测资源地址,减少负载,并提高检测恶意资源地址的准确率。
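The two-stage flow (cheap filter first, full detection only on what remains) can be sketched as below. The functions `toy_filter` and `toy_detector` are hypothetical stand-ins for the filtering classifier and the main classifier; their rules are invented purely so the example runs.

```python
def detect_with_prefilter(addresses, is_suspicious, is_malicious):
    """Run the costly detector only on addresses the cheap filter keeps."""
    results = {}
    for address in addresses:
        if not is_suspicious(address):
            results[address] = False  # filtered out as non-malicious
        else:
            results[address] = is_malicious(address)
    return results


# Hypothetical stand-ins for the filtering classifier and the main classifier.
detector_calls = []

def toy_filter(address):
    return "login" in address

def toy_detector(address):
    detector_calls.append(address)
    return address.startswith("http://")

results = detect_with_prefilter(
    ["https://example.org/docs", "http://example.test/login"],
    toy_filter,
    toy_detector,
)
print(results)
print(detector_calls)  # only the address that passed the filter reached the detector
```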
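FIG. 4 is a schematic flowchart of step S308 in one embodiment. Referring to FIG. 4, step S308 specifically includes the following steps:
S402: Obtain the malicious resource address type currently being detected.
The malicious resource address type currently being detected is the type of malicious resource address that needs to be detected in the current run of the malicious resource address detection method, that is, the type to check for when detecting the resource address to be detected. Malicious resource addresses can be divided into different types, such as counterfeit websites and fraud websites. Counterfeit websites can be further subdivided into counterfeit shopping websites, counterfeit bank websites, counterfeit designated official websites, and so on. A separate machine learning classifier is trained for each malicious resource address type.
S404: Select, from the character features and the related attribute features, the features adapted to the malicious resource address type.
For different types of malicious resource addresses, different features contribute to detection to different degrees. The server may pre-store a correspondence between malicious resource address types and their adapted features, so that once the type currently being detected has been obtained, the server selects from the character features and related attribute features, according to this correspondence, the features adapted to that type. The correspondence may be set from prior knowledge or obtained through big data analysis of known malicious resource addresses.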
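S406: Combine the selected features to obtain the multi-dimensional feature.
Specifically, the server may combine the selected features one by one in a preset feature combination order to obtain the multi-dimensional feature. In one embodiment, the server may also weight the feature in each dimension of the multi-dimensional feature according to the inter-feature weight relationship adapted to the malicious resource address type currently being detected. The weighting makes the multi-dimensional feature better suited to the type currently being detected.
For example, for resource addresses of the lottery-prize fraud type, the character feature indicating whether the address contains a preset suspicious keyword contributes little and can be dropped during feature selection. For counterfeit websites, the same feature works well and should be selected as part of the multi-dimensional feature.
In this embodiment, malicious resource address types are subdivided. For any given type, not every feature in the full feature set necessarily helps detection, and some may even be counterproductive, so selecting the features adapted to the type currently being detected makes malicious resource address detection more accurate and effective.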
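Type-specific selection and weighting can be sketched with two lookup tables. All names, feature choices, and weights below are hypothetical; the application only says that such a correspondence and weight relationship may be pre-stored.

```python
# Hypothetical correspondence: malicious type -> ordered feature names.
FEATURES_BY_TYPE = {
    "counterfeit_site": ["total_length", "has_suspicious_keyword", "registrant_hit", "template_hit"],
    "lottery_scam": ["total_length", "channel_hit", "ip_hit"],  # keyword feature dropped
}

# Hypothetical per-type weights; features not listed keep weight 1.0.
WEIGHTS_BY_TYPE = {
    "counterfeit_site": {"has_suspicious_keyword": 2.0},
}


def select_features(all_features, malicious_type):
    """Pick the features adapted to the type, in preset order, and apply weights."""
    names = FEATURES_BY_TYPE[malicious_type]
    weights = WEIGHTS_BY_TYPE.get(malicious_type, {})
    return [all_features[name] * weights.get(name, 1.0) for name in names]


all_features = {
    "total_length": 53,
    "has_suspicious_keyword": 1,
    "registrant_hit": 1,
    "template_hit": 0,
    "channel_hit": 1,
    "ip_hit": 1,
}
print(select_features(all_features, "counterfeit_site"))
print(select_features(all_features, "lottery_scam"))
```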
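In one embodiment, the malicious resource address detection method further includes a step of updating the malicious related attribute libraries according to missed or falsely reported malicious resource addresses. In this embodiment, step S310 may use a machine learning classifier to determine whether the resource address to be detected is a malicious resource address. Referring to FIG. 5, the step of updating the malicious related attribute libraries specifically includes the following steps:
S502: Collect the malicious resource addresses that were missed or falsely reported when the machine learning classifier was used to determine whether resource addresses to be detected are malicious.
A missed malicious resource address is an address that is actually malicious but was judged non-malicious by the machine learning classifier (a false negative). A falsely reported malicious resource address is an address that is actually non-malicious but was judged malicious by the machine learning classifier (a false positive).
Specifically, missed malicious resource addresses may be obtained through manual reports, or by cross-comparing the detection results that different machine learning classifiers produce for the same resource address. For example, if the same resource address passes through machine learning classifiers A, B, and C, and the detection results are malicious, non-malicious, and non-malicious respectively, the address may be treated as a malicious resource address missed by classifiers B and C. Falsely reported malicious resource addresses may be obtained through manual appeals or manual inspection.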
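The cross-comparison in the A/B/C example can be sketched as follows. The function name is hypothetical, and the sketch follows the example literally by treating a single malicious verdict as sufficient; a real deployment would presumably confirm such candidates before acting on them.

```python
def classifiers_that_missed(results):
    """results maps classifier name -> True if it judged the address malicious.

    If at least one classifier flagged the address, every classifier that did
    not flag it is reported as having missed it.
    """
    if not any(results.values()):
        return []
    return sorted(name for name, flagged in results.items() if not flagged)


print(classifiers_that_missed({"A": True, "B": False, "C": False}))
```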
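S504: Obtain the related attributes of the missed or falsely reported malicious resource addresses. Specifically, the server may collect these related attributes through big data analysis.
S506: Update the corresponding malicious related attribute libraries according to the collected related attributes. That is, according to the related attributes of the missed or falsely reported malicious resource addresses, update the malicious related attribute library corresponding to the related attribute type to which each related attribute belongs.
Specifically, for a missed malicious resource address, the server may add the collected related attributes of that address to the corresponding malicious related attribute libraries. For a falsely reported malicious resource address, the server may delete the related attributes of that address from the corresponding malicious related attribute libraries.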
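The add-or-delete rule can be sketched with one set per related attribute type. The dictionary layout, the `kind` strings, and the sample values are assumptions made for illustration.

```python
def update_attribute_libraries(libraries, attributes, kind):
    """Update per-type malicious attribute libraries for one misreported address.

    libraries:  attribute type -> set of known-malicious values
    attributes: attribute type -> value observed for the misreported address
    kind:       "missed" (false negative) or "false_alarm" (false positive)
    """
    for attr_type, value in attributes.items():
        library = libraries.setdefault(attr_type, set())
        if kind == "missed":
            library.add(value)        # learn the attribute of a missed malicious address
        elif kind == "false_alarm":
            library.discard(value)    # stop treating a benign address's attribute as malicious
        else:
            raise ValueError(f"unknown kind: {kind!r}")
    return libraries


libraries = {"ip": {"203.0.113.7"}, "registrant": set()}
update_attribute_libraries(libraries, {"ip": "198.51.100.9", "registrant": "r-001"}, "missed")
update_attribute_libraries(libraries, {"ip": "203.0.113.7"}, "false_alarm")
print(libraries)
```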
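In this embodiment, updating the malicious related attribute libraries with missed or falsely reported malicious resource addresses prevents subsequent misses or false reports from spreading and improves the accuracy of malicious resource address detection.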
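In one embodiment, the malicious resource address detection method further includes a step of updating the machine learning classifier according to missed or falsely reported malicious resource addresses. Referring to FIG. 6, this step specifically includes the following steps:
S602: Obtain the character features of the missed or falsely reported malicious resource address.
The character features may include one or a combination of: the total length of the missed or falsely reported malicious resource address; the total number of words in the address; whether the address contains a preset suspicious keyword; the ratio of the length of the host address within the address to the total length of the address; and the KL divergence between the character frequencies in the address and the corresponding character frequencies in the malicious resource address library.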
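The KL divergence feature can be sketched as below. The character set, the add-one smoothing (used so that no probability is zero), and the direction D_KL(address || library) are all assumptions; this section does not specify them. The sample addresses are invented.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase + string.digits  # assumed character set


def char_distribution(text, alphabet=ALPHABET, smoothing=1.0):
    """Smoothed relative frequency of each alphabet character in text."""
    counts = Counter(text.lower())
    total = sum(counts[c] for c in alphabet) + smoothing * len(alphabet)
    return {c: (counts[c] + smoothing) / total for c in alphabet}


def kl_divergence(p, q):
    """D_KL(p || q) in nats; p and q must be defined on the same characters."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)


# Toy data: one address versus a tiny stand-in for a malicious address library.
address_dist = char_distribution("http://paypa1-login.example.test/verify")
library_dist = char_distribution("".join([
    "http://secure-login.example.test/account",
    "http://bank-verify.example.test/update",
]))
feature = kl_divergence(address_dist, library_dist)
print(feature)
```

A small value means the address's character usage resembles that of the library; the divergence is zero only when the two distributions are identical.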
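S604: Query whether the related attributes of the missed or falsely reported malicious resource address belong to the corresponding malicious related attribute libraries, and obtain the corresponding related attribute features. That is, obtain the related attributes of the missed or falsely reported malicious resource address, look up each related attribute in the malicious related attribute library corresponding to the related attribute type to which it belongs, and generate from the query results the related attribute features of the missed or falsely reported malicious resource address.
Specifically, the server may obtain the related attributes of the missed or falsely reported malicious resource address, query the malicious related attribute library corresponding to the type of each related attribute, determine whether the library is hit, and, according to the hit-or-miss query result, generate the related attribute feature for that type of related attribute of the address.
S606: Combine the character features of the missed or falsely reported malicious resource address with its related attribute features to obtain the corresponding multi-dimensional feature.
Specifically, the server may combine the character features and related attribute features one by one in a preset feature combination order to obtain the multi-dimensional feature. In one embodiment, the server may also obtain the malicious resource address type corresponding to the current machine learning classifier and select, from the character features and related attribute features, the features adapted to that type.
S608: Update the machine learning classifier according to the multi-dimensional feature corresponding to the missed or falsely reported malicious resource address.
In this embodiment, when missed or falsely reported malicious resource addresses occur, the machine learning classifier is updated according to them, which improves the accuracy of malicious resource address detection after the update.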
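One simple way to realize such an update is to add the misreported samples, with corrected labels, to the training set and retrain. This is only a sketch under that assumption; the application does not say whether the update is a full retrain or an incremental one, and `train` is a hypothetical callable standing in for whatever training routine is used.

```python
def corrected_samples(misreported):
    """Flip the classifier's wrong 0/1 verdict to recover the true label.

    misreported: list of (feature_vector, predicted_label) pairs.
    """
    return [(features, 1 - predicted) for features, predicted in misreported]


def update_classifier(train, training_set, misreported):
    """Retrain on the original samples plus the corrected misreported ones."""
    new_training_set = list(training_set) + corrected_samples(misreported)
    return train(new_training_set), new_training_set


# Toy usage: `len` stands in for a real training routine.
model, new_set = update_classifier(len, [([53, 1], 1)], [([40, 1], 0)])
print(model, new_set)
```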
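In one embodiment, the malicious resource address detection method further includes: when the resource address to be detected is determined to be a malicious resource address, adding it to a malicious resource address library, where the malicious resource address library is used to intercept resource access requests directed at the malicious resource addresses in the library.
Specifically, when a terminal initiates a resource access request based on a resource address, it first queries whether that resource address belongs to the malicious resource address library. If it does, the resource access request is intercepted; if it does not, the request is sent. The terminal may query the server or a local copy to determine whether a resource address belongs to the malicious resource address library; the local malicious resource address library may be synchronized periodically from the server.
In this embodiment, adding the resource address to be detected to the malicious resource address library makes it possible to intercept, based on the library, resource access requests directed at the malicious resource addresses in it, which keeps resource access secure.
In one embodiment, the malicious resource address detection method further includes: when the resource address to be detected is determined to be a malicious resource address and does not belong to a preset non-malicious resource address library, adding it to the malicious resource address library, where the malicious resource address library is used to intercept resource access requests directed at the malicious resource addresses in the library.
Specifically, when the server determines that the resource address to be detected is a malicious resource address, it may further determine whether the address belongs to the preset non-malicious resource address library. The preset non-malicious resource address library is a preset collection of non-malicious resource addresses used for false-positive prevention. If the address belongs to the preset non-malicious resource address library, the server does not process it further. If it does not, the server may add it to the malicious resource address library so that the detected malicious resource address can be used to intercept the corresponding resource access requests.
In this embodiment, because the classification accuracy of a machine learning classifier can hardly reach 100%, malicious resource addresses detected by the classifier may include false positives, and since the malicious resource address library is used to intercept resource access requests, a false positive may disrupt normal resource access. Adding a resource address to the malicious resource address library only when it does not belong to the preset non-malicious resource address library reduces such false positives and keeps falsely reported addresses from affecting normal resource access.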
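The guard described here amounts to one extra membership test before the address is added. In the sketch below the names `allowlist` and `blocklist` are hypothetical shorthand for the preset non-malicious resource address library and the malicious resource address library.

```python
def record_detection(address, judged_malicious, allowlist, blocklist):
    """Add the address to the blocklist only if it is not on the allowlist.

    Returns True when the address was added.
    """
    if judged_malicious and address not in allowlist:
        blocklist.add(address)
        return True
    return False


allowlist = {"https://example.org/"}
blocklist = set()
record_detection("https://example.org/", True, allowlist, blocklist)       # protected by the allowlist
record_detection("http://example.test/login", True, allowlist, blocklist)  # added
record_detection("http://example.test/about", False, allowlist, blocklist) # not judged malicious
print(blocklist)
```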
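The principle of the malicious resource address detection method is illustrated below with a specific application scenario. Referring to FIG. 7, the server may use the malicious resource address library and the non-malicious resource address library as the training sample library, build the related attribute libraries from the related attributes of the malicious resource addresses in the malicious resource address library, generate the character features and related attribute features of the resource addresses in the training samples, and select features from them according to the corresponding malicious resource address type to form multi-dimensional features. The server trains the machine learning classifier on the multi-dimensional features corresponding to the resource addresses in the training samples.
Further, the server receives incoming resource addresses to be detected and uses the decision tree classifier to filter out those that are non-malicious. For the addresses remaining after filtering, it extracts character features and related attribute features and selects features from them according to the malicious resource address type currently being detected to form multi-dimensional features. The server inputs the multi-dimensional feature corresponding to each resource address to be detected into the machine learning classifier adapted to the type currently being detected, and the classifier outputs a detection result indicating whether the address is a malicious resource address.
Still further, the server may apply false-positive prevention to the detection results. Specifically, when a resource address to be detected is determined to be malicious, the server may determine whether it belongs to the preset non-malicious resource address library and add it to the malicious resource address library if it does not. Alternatively, when a resource address to be detected is determined to be malicious, the server may determine whether specified features of the address satisfy the specified feature conditions of a non-malicious resource address, such as search volume, click volume, or popularity exceeding a preset value, and add the address to the malicious resource address library if they do not.
The server may also identify falsely reported malicious resource addresses from manual appeals and missed malicious resource addresses from manual reports, and update the related attribute libraries and the machine learning classifier according to the falsely reported and missed resource addresses. The server may additionally monitor malicious resource addresses with another machine learning classifier whose probability criterion is looser than that of the classifier used for detection, that is, one whose condition threshold is lower than that of the detection classifier, for example 0.5. This other classifier judges resource addresses to be malicious with lower accuracy than the detection classifier, but the coverage of malicious resource addresses it observes is higher than the coverage achieved by the detection classifier. Monitoring with this other classifier uncovers more malicious resource addresses and helps ensure detection coverage.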
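The role of the looser monitoring classifier can be sketched as below. For simplicity the sketch assumes both classifiers produce the same probability for an address and differ only in their thresholds; the application describes them as two separate classifiers, so this is a simplification, and the function name is hypothetical.

```python
def monitor_only_hits(probabilities, detect_threshold=0.95, monitor_threshold=0.5):
    """Addresses flagged by the looser monitor but not by the strict detector.

    probabilities: address -> estimated probability of being malicious.
    The returned addresses are candidates for review as possible misses.
    """
    return sorted(
        address
        for address, p in probabilities.items()
        if monitor_threshold <= p < detect_threshold
    )


scores = {"a.example.test": 0.97, "b.example.test": 0.70, "c.example.test": 0.20}
print(monitor_only_hits(scores))
```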
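FIG. 8 is a structural block diagram of the instruction modules in a malicious resource address detection apparatus 800 (for example, a server) in one embodiment. The malicious resource address detection apparatus 800 includes:
at least one memory (for example, the non-volatile storage medium in FIG. 2); and
at least one processor, where
the at least one memory stores at least one instruction module configured to be executed by the at least one processor.
Referring to FIG. 8, the at least one instruction module includes:
a data access module 810, a feature extraction module 820, and a detection module 830.
The data access module 810 is configured to obtain the resource address to be detected.
The feature extraction module 820 is configured to obtain the character features of the resource address to be detected; to query whether the related attributes of the resource address to be detected belong to the corresponding malicious related attribute libraries and obtain the corresponding related attribute features; and to combine the character features and related attribute features into a multi-dimensional feature. That is, it is configured to obtain the character features of the resource address to be detected, obtain the related attributes of the resource address to be detected, look up each related attribute in the malicious related attribute library corresponding to the related attribute type to which it belongs, generate from the query results the related attribute features of the resource address to be detected, and combine the character features and the related attribute features into a multi-dimensional feature.
The detection module 830 is configured to determine, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.
The malicious resource address detection apparatus 800 described above combines the character features obtained by statistics on the resource address to be detected with the related attribute features obtained by querying the malicious related attribute libraries, forming a multi-dimensional feature that represents the resource address to be detected. A machine learning classifier then classifies the multi-dimensional feature to produce a detection result indicating whether the resource address to be detected is a malicious resource address. Because the apparatus combines the character features of the address itself with the related attributes of the address, it detects malicious resource addresses more effectively than an approach that relies solely on a web crawler fetching the resource the address points to.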
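The division of work among modules 810, 820, and 830 can be pictured as three pluggable stages. The class below is a structural sketch only; its field names and the toy callables are hypothetical and say nothing about how the actual modules are implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class DetectionApparatus:
    get_addresses: Callable[[], Sequence[str]]          # role of data access module 810
    extract_features: Callable[[str], Sequence[float]]  # role of feature extraction module 820
    is_malicious: Callable[[Sequence[float]], bool]     # role of detection module 830

    def run(self) -> Dict[str, bool]:
        """Fetch addresses, extract features, and classify each address."""
        return {
            address: self.is_malicious(self.extract_features(address))
            for address in self.get_addresses()
        }


# Toy stages: the only feature is the address length; long addresses are flagged.
apparatus = DetectionApparatus(
    get_addresses=lambda: ["http://a.test/", "http://a.test/" + "x" * 60],
    extract_features=lambda address: [float(len(address))],
    is_malicious=lambda features: features[0] > 50,
)
print(apparatus.run())
```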
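FIG. 9 is a structural block diagram of the instruction modules in the malicious resource address detection apparatus 800 in another embodiment. Referring to FIG. 9, the instruction modules of the apparatus 800 further include a filtering module 840, configured to determine whether the resource address to be detected is a non-malicious resource address or a suspicious resource address, and to notify the feature extraction module 820 when it is a suspicious resource address. That is, the filtering module 840 is configured to filter out the non-malicious resource addresses among the resource addresses to be detected.
The feature extraction module 820 is further configured to, when the resource address to be detected is a suspicious resource address, obtain the character features of the address and query whether its related attributes belong to the corresponding malicious related attribute libraries to obtain the corresponding related attribute features. That is, it is configured to take the addresses remaining after filtering as suspicious resource addresses and to perform on them the steps of obtaining the character features and obtaining the related attributes of the resource address to be detected.
In this embodiment, filtering the resource addresses to be detected removes those that clearly are not malicious resource addresses, which reduces load and improves the accuracy of malicious resource address detection.
In one embodiment, the character features include one or a combination of: the total length of the resource address to be detected; the total number of words in the address; whether the address contains a preset suspicious keyword; the ratio of the length of the host address within the address to the total length of the address; and the KL divergence between the character frequencies in the address and the corresponding character frequencies in the malicious resource address library.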
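Several of these character features can be computed directly from the address string. The sketch below makes two assumptions not stated in this section: a "word" is any run of letters or digits, and the keyword list is a made-up example. The sample URL is also invented.

```python
import re
from urllib.parse import urlsplit

SUSPICIOUS_KEYWORDS = ("login", "verify", "bank")  # hypothetical keyword list


def char_features(url):
    """Return [total length, word count, keyword flag, host length, host/total ratio]."""
    host = urlsplit(url).hostname or ""
    words = [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]  # assumed word definition
    has_keyword = int(any(k in url.lower() for k in SUSPICIOUS_KEYWORDS))
    ratio = len(host) / len(url) if url else 0.0
    return [len(url), len(words), has_keyword, len(host), ratio]


url = "http://secure-bank.example.com/login?id=1"
features = char_features(url)
print(features)
```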
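In one embodiment, the related attributes of the resource address to be detected include one or a combination of: the propagation channel information, web page template information, website registrant information, and Internet Protocol (IP) address of the resource address to be detected.
In one embodiment, the feature extraction module 820 is further configured to obtain the malicious resource address type currently being detected, select from the character features and related attribute features the features adapted to that type, and combine the selected features into the multi-dimensional feature. That is, the feature extraction module 820 is configured to obtain the malicious resource address type, which is the type of malicious resource address to check for when detecting the resource address to be detected; select from the character features and the related attribute features the features adapted to the malicious resource address type; and combine the selected features into the multi-dimensional feature.
In this embodiment, malicious resource address types are subdivided. For any given type, not every feature in the full feature set necessarily helps detection, and some may even be counterproductive, so selecting the features adapted to the type currently being detected makes malicious resource address detection more accurate and effective.
In one embodiment, the detection module 830 is further configured to use a machine learning classifier to determine, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.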
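The malicious resource address detection apparatus 800 further includes a missed or false report collection module 850 and a malicious related attribute library update module 860.
The missed or false report collection module 850 is configured to collect the malicious resource addresses that were missed or falsely reported when the machine learning classifier was used to determine whether resource addresses to be detected are malicious.
The malicious related attribute library update module 860 is configured to obtain the related attributes of the missed or falsely reported malicious resource addresses and update the corresponding malicious related attribute libraries according to the collected related attributes. That is, the module 860 is configured to obtain the related attributes of the missed or falsely reported malicious resource addresses and, according to those related attributes, update the malicious related attribute library corresponding to the related attribute type to which each related attribute belongs.
In this embodiment, updating the malicious related attribute libraries with missed or falsely reported malicious resource addresses prevents subsequent misses or false reports from spreading and improves the accuracy of malicious resource address detection.
In one embodiment, the malicious resource address detection apparatus 800 further includes a machine learning classifier update module 870, configured to obtain the character features of the missed or falsely reported malicious resource address; query whether the related attributes of that address belong to the corresponding malicious related attribute libraries and obtain the corresponding related attribute features; combine the character features and the related attribute features of that address into the corresponding multi-dimensional feature; and update the machine learning classifier according to that multi-dimensional feature. That is, the machine learning classifier update module is configured to obtain the character features of the missed or falsely reported malicious resource address; obtain the related attributes of that address, look up each related attribute in the malicious related attribute library corresponding to the related attribute type to which it belongs, and generate from the query results the related attribute features of that address; combine the character features and related attribute features of that address into its multi-dimensional feature; and update the machine learning classifier according to that multi-dimensional feature.
In this embodiment, when missed or falsely reported malicious resource addresses occur, the machine learning classifier is updated according to them, which improves the accuracy of malicious resource address detection after the update.
In one embodiment, the malicious resource address detection apparatus 800 further includes a malicious resource address library management module 880, configured to add the resource address to be detected to the malicious resource address library when it is determined to be a malicious resource address, where the malicious resource address library is used to intercept resource access requests directed at the malicious resource addresses in the library.
In this embodiment, adding the resource address to be detected to the malicious resource address library makes it possible to intercept, based on the library, resource access requests directed at the malicious resource addresses in it, which keeps resource access secure.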
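A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium, and when executed it may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or the like.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-readable instructions or a program that, when executed by a processor, perform the malicious resource address detection method described above.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments have been described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make a number of variations and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention. The protection scope of this patent is therefore defined by the appended claims.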

Claims (15)

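  1. A malicious resource address detection method, the method comprising:
    obtaining a resource address to be detected;
    obtaining character features of the resource address to be detected;
    obtaining a related attribute of the resource address to be detected, querying for the related attribute in a malicious related attribute library corresponding to the related attribute type to which the related attribute belongs, and generating, according to the query result, a related attribute feature corresponding to the resource address to be detected;
    combining the character features and the related attribute feature to obtain a multi-dimensional feature; and
    determining, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.
  2. The method according to claim 1, wherein the method further comprises:
    filtering out non-malicious resource addresses among the resource addresses to be detected; and
    taking the resource addresses to be detected that remain after filtering as suspicious resource addresses, and performing, on the suspicious resource addresses, the steps of obtaining the character features of the resource address to be detected and obtaining the related attribute of the resource address to be detected.
  3. The method according to claim 1, wherein the character features comprise one or a combination of: the total length of the resource address to be detected; the total number of words in the resource address to be detected; whether the resource address to be detected contains a preset suspicious keyword; the ratio of the length of the host address in the resource address to be detected to the total length of the resource address to be detected; and the KL divergence between the character frequencies in the resource address to be detected and the corresponding character frequencies in a malicious resource address library.
  4. The method according to claim 1, wherein the related attribute of the resource address to be detected comprises one or a combination of: propagation channel information, web page template information, website registrant information, and an Internet Protocol address of the resource address to be detected.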
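  5. The method according to claim 1, wherein the step of combining the character features and the related attribute feature to obtain a multi-dimensional feature comprises:
    obtaining a malicious resource address type, the malicious resource address type being the type of malicious resource address that needs to be detected when detecting the malicious address to be detected;
    selecting, from the character features and the related attribute feature, features adapted to the malicious resource address type; and
    combining the selected features to obtain the multi-dimensional feature.
  6. The method according to claim 1, wherein determining, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address comprises: using a machine learning classifier to determine whether the resource address to be detected is a malicious resource address;
    and the method further comprises:
    collecting malicious resource addresses that were missed or falsely reported when the machine learning classifier was used to determine whether the resource address to be detected is a malicious resource address;
    obtaining a related attribute of the missed or falsely reported malicious resource address; and
    updating, according to the related attribute of the missed or falsely reported malicious resource address, the malicious related attribute library corresponding to the related attribute type to which the related attribute belongs.
  7. The method according to claim 6, wherein the method further comprises:
    obtaining character features of the missed or falsely reported malicious resource address;
    obtaining the related attribute of the missed or falsely reported malicious resource address, querying for the related attribute in the malicious related attribute library corresponding to the related attribute type to which the related attribute belongs, and generating, according to the query result, a related attribute feature corresponding to the missed or falsely reported malicious resource address;
    combining the character features and the related attribute feature of the missed or falsely reported malicious resource address to obtain a multi-dimensional feature of the missed or falsely reported malicious resource address; and
    updating the machine learning classifier according to the multi-dimensional feature corresponding to the missed or falsely reported malicious resource address.
  8. The method according to claim 1, wherein the method further comprises:
    when the resource address to be detected is determined to be a malicious resource address, adding the resource address to be detected to a malicious resource address library;
    wherein the malicious resource address library is used to intercept resource access requests directed at the malicious resource addresses in the malicious resource address library.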
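  9. A malicious resource address detection apparatus, comprising:
    at least one memory;
    at least one processor; wherein
    the at least one memory stores at least one instruction module configured to be executed by the at least one processor; wherein
    the at least one instruction module comprises:
    the apparatus comprises:
    a data access module, configured to obtain a resource address to be detected;
    a feature extraction module, configured to obtain character features of the resource address to be detected, obtain a related attribute of the resource address to be detected, query for the related attribute in a malicious related attribute library corresponding to the related attribute type to which the related attribute belongs, generate, according to the query result, a related attribute feature corresponding to the resource address to be detected, and combine the character features and the related attribute feature to obtain a multi-dimensional feature; and
    a detection module, configured to determine, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.
  10. The apparatus according to claim 9, wherein the apparatus further comprises:
    a filtering module, configured to filter out non-malicious resource addresses among the resource addresses to be detected;
    wherein the feature extraction module is configured to take the resource addresses to be detected that remain after filtering as suspicious resource addresses, and to perform, on the suspicious resource addresses, the steps of obtaining the character features of the resource address to be detected and obtaining the related attribute of the resource address to be detected.
  11. The apparatus according to claim 9, wherein the character features comprise one or a combination of: the total length of the resource address to be detected; the total number of words in the resource address to be detected; whether the resource address to be detected contains a preset suspicious keyword; the ratio of the length of the host address in the resource address to be detected to the total length of the resource address to be detected; and the KL divergence between the character frequencies in the resource address to be detected and the corresponding character frequencies in a malicious resource address library.
  12. The apparatus according to claim 9, wherein the related attribute of the resource address to be detected comprises one or a combination of: propagation channel information, web page template information, website registrant information, and an Internet Protocol address of the resource address to be detected.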
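  13. A malicious resource address detection method, performed by a server, the method comprising:
    obtaining a resource address to be detected;
    obtaining character features of the resource address to be detected;
    obtaining a related attribute of the resource address to be detected, querying for the related attribute in a malicious related attribute library corresponding to the related attribute type to which the related attribute belongs, and generating, according to the query result, a related attribute feature corresponding to the resource address to be detected;
    combining the character features and the related attribute feature to obtain a multi-dimensional feature; and
    determining, according to the multi-dimensional feature, whether the resource address to be detected is a malicious resource address.
  14. The method according to claim 13, wherein the method further comprises:
    filtering out non-malicious resource addresses among the resource addresses to be detected; and
    taking the resource addresses to be detected that remain after filtering as suspicious resource addresses, and performing, on the suspicious resource addresses, the steps of obtaining the character features of the resource address to be detected and obtaining the related attribute of the resource address to be detected.
  15. A computer-readable storage medium storing computer-readable instructions or a program that, when executed by a processor, perform the method according to any one of claims 1 to 8.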
PCT/CN2017/105796 2016-10-31 2017-10-12 Malicious resource address detection method and apparatus, and storage medium WO2018077035A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610978043.6 2016-10-31
CN201610978043.6A CN108023868B (zh) 2016-10-31 2016-10-31 Malicious resource address detection method and apparatus

Publications (1)

Publication Number Publication Date
WO2018077035A1 true WO2018077035A1 (zh) 2018-05-03

Family

ID=62024511

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/105796 WO2018077035A1 (zh) 2016-10-31 2017-10-12 Malicious resource address detection method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN108023868B (zh)
WO (1) WO2018077035A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111556042A (zh) * 2020-04-23 2020-08-18 杭州安恒信息技术股份有限公司 恶意url的检测方法、装置、计算机设备和存储介质
US11290479B2 (en) * 2018-08-11 2022-03-29 Rapid7, Inc. Determining insights in an electronic environment
CN116260660A (zh) * 2023-05-15 2023-06-13 杭州美创科技股份有限公司 网页木马后门识别方法及系统

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992969B (zh) * 2019-03-25 2023-03-21 腾讯科技(深圳)有限公司 一种恶意文件检测方法、装置及检测平台
CN110175278B (zh) * 2019-05-24 2022-02-25 新华三信息安全技术有限公司 网络爬虫的检测方法及装置
CN110765393A (zh) * 2019-09-17 2020-02-07 微梦创科网络科技(中国)有限公司 基于向量化和逻辑回归识别有害url的方法及装置
CN111177596B (zh) * 2019-12-25 2023-08-25 微梦创科网络科技(中国)有限公司 一种基于lstm模型的url请求分类方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120102545A1 (en) * 2010-10-20 2012-04-26 Mcafee, Inc. Method and system for protecting against unknown malicious activities by determining a reputation of a link
CN102739679A (zh) * 2012-06-29 2012-10-17 东南大学 一种基于url分类的钓鱼网站检测方法
CN103179095A (zh) * 2011-12-22 2013-06-26 阿里巴巴集团控股有限公司 一种检测钓鱼网站的方法及客户端装置
CN103685308A (zh) * 2013-12-25 2014-03-26 北京奇虎科技有限公司 一种钓鱼网页的检测方法及系统、客户端、服务器
CN104219230A (zh) * 2014-08-21 2014-12-17 腾讯科技(深圳)有限公司 识别恶意网站的方法及装置
CN104735074A (zh) * 2015-03-31 2015-06-24 江苏通付盾信息科技有限公司 一种恶意url检测方法及其实现系统

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101692639A (zh) * 2009-09-15 2010-04-07 西安交通大学 一种基于url的不良网页识别方法
CN102932348A (zh) * 2012-10-30 2013-02-13 常州大学 一种钓鱼网站的实时检测方法及系统
CN103475673B (zh) * 2013-09-30 2018-04-13 北京猎豹网络科技有限公司 钓鱼网站识别方法、装置及客户端
CN103491543A (zh) * 2013-09-30 2014-01-01 北京奇虎科技有限公司 通过无线终端检测恶意网址的方法、无线终端
CN104217160B (zh) * 2014-09-19 2017-11-28 中国科学院深圳先进技术研究院 一种中文钓鱼网站检测方法及系统
CN104899508B (zh) * 2015-06-17 2018-12-07 中国互联网络信息中心 一种多阶段钓鱼网站检测方法与系统

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11290479B2 (en) * 2018-08-11 2022-03-29 Rapid7, Inc. Determining insights in an electronic environment
CN111556042A (zh) * 2020-04-23 2020-08-18 杭州安恒信息技术股份有限公司 恶意url的检测方法、装置、计算机设备和存储介质
CN111556042B (zh) * 2020-04-23 2022-12-20 杭州安恒信息技术股份有限公司 恶意url的检测方法、装置、计算机设备和存储介质
CN116260660A (zh) * 2023-05-15 2023-06-13 杭州美创科技股份有限公司 网页木马后门识别方法及系统

Also Published As

Publication number Publication date
CN108023868B (zh) 2021-02-02
CN108023868A (zh) 2018-05-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17865199

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17865199

Country of ref document: EP

Kind code of ref document: A1