CN111414621B

CN111414621B - Malicious webpage file identification method and device

Info

Publication number: CN111414621B
Application number: CN202010221911.2A
Authority: CN
Inventors: 刘卓龙
Original assignee: Xiamen Wangsu Co Ltd
Current assignee: Xiamen Wangsu Co Ltd
Priority date: 2020-03-26
Filing date: 2020-03-26
Publication date: 2022-07-08
Anticipated expiration: 2040-03-26
Also published as: CN111414621A

Abstract

An embodiment of the invention provides a method and a device for identifying malicious webpage files. The method comprises: determining feature data of each dimension of a webpage file to be identified, where the dimensions comprise a combined dimension and a single dimension, the combined dimension comprises a plurality of feature data, and the single dimension has only one feature data; for the combined dimension, passing the plurality of feature data of the combined dimension through a first machine learning model to obtain fused feature data of the combined dimension; obtaining, through a rule engine, a preliminary identification result of whether the webpage file to be identified is a malicious webpage file; and passing the preliminary identification result, the fused feature data of the combined dimension, and the feature data of the single dimension through a second machine learning model to obtain a final identification result of whether the webpage file to be identified is a malicious webpage file. The method improves the accuracy of malicious webpage file identification and greatly improves the security of the computer environment.

Description

Malicious webpage file identification method and device

Technical Field

The present application relates to the field of network security technologies, and in particular, to a malicious webpage file identification method and apparatus.

Background

With the rapid development of communication networks, the internet has spread quickly and is now closely, even inseparably, tied to daily life. It has also brought network security problems that almost everyone faces. One example is the web script trojan (webshell), a command execution environment in the form of a malicious webpage file such as an asp, php, or jsp file, which can be used as a web backdoor. After an attacker intrudes into a website, the malicious webpage files are mixed with the normal webpage files in the website server directory, where they can disrupt the normal operation of the computer or steal private data. To protect the computer from such damage, malicious webpage files need to be identified so that corresponding measures can be taken to prevent them from damaging the computer or leaking information.

In the prior art, malicious webpage files are identified mainly by two methods. The first matches keywords by regular expression matching. This method has a low false alarm rate but a high missed detection rate: the false alarm rate can usually be kept within 1-3%, but the missed detection rate can reach 20% or even more than 30%. The second detects malicious webpage files with a machine learning algorithm, for example a Random Forest (RF) or a deep learning network. Although this method can reduce the missed detection rate to within 10%, its false alarm rate is higher, at about 5-10%. In addition, in practical applications non-malicious webpage files account for most of the samples, so a false alarm rate of 5-10% produces a large number of false alarms and makes subsequent screening difficult.

Therefore, there is a need for a method and an apparatus for identifying malicious webpage files that improve the accuracy of identifying malicious webpage files and increase the security of the computer environment.

Disclosure of Invention

An embodiment of the invention provides a method and a device for identifying malicious webpage files, which improve the accuracy of identifying malicious webpage files and increase the security of the computer environment.

In a first aspect, an embodiment of the present invention provides a method for identifying a malicious webpage file, where the method includes:

determining characteristic data of each dimension of a webpage file to be identified; the dimensions comprise a combined dimension and a single dimension; the combined dimension comprises a plurality of feature data; the single dimension has only one feature data; aiming at a combined dimension in each dimension, obtaining fusion feature data of the combined dimension by using a plurality of feature data of the combined dimension through a first machine learning model; obtaining a preliminary identification result of whether the webpage file to be identified is a malicious webpage file or not through a rule engine; and obtaining a final identification result of whether the webpage file to be identified is a malicious webpage file or not by passing the preliminary identification result, the fusion characteristic data of the combined dimension and the characteristic data of the single dimension through a second machine learning model.

With this method, the feature data of each dimension of the webpage file to be identified is determined, which yields the combined dimension and the single dimension, and the plurality of feature data of the combined dimension is passed through the first machine learning model to obtain the fused feature data of the combined dimension. This prevents the feature data of the single dimension from being drowned out during machine learning, and important feature data from being lost, because the feature data of the combined dimension would otherwise make up too large a share of all the feature data. The preliminary identification result obtained by the rule engine, the fused feature data, and the feature data of the single dimension are then passed through the second machine learning model to obtain the final identification result of whether the webpage file to be identified is a malicious webpage file. The method therefore not only prevents the feature data of the single dimension from being drowned out, but also combines two identification approaches, machine learning and a rule engine, which improves the accuracy of malicious webpage file identification and greatly increases the security of the computer environment.

In one possible design, the web page file to be identified is obtained by the following method:

acquiring a webpage file; determining that the webpage file contains a character string and the character string does not contain a high-risk keyword, and determining that the webpage file is the webpage file to be identified; the high-risk keywords are used for indicating that the webpage files are malicious webpage files.

With this method, a webpage file is acquired and is determined to be the webpage file to be identified if it contains a character string and the character string does not contain a high-risk keyword. Empty webpage files, non-empty webpage files that contain no character string (non-malicious), and webpage files whose character strings contain high-risk keywords (malicious) are thereby screened out. In other words, the easily identifiable webpage files are screened out by a simple and quick method, so that they are not mixed in with the webpage files that are hard to identify and do not occupy the rule engine and machine learning resources downstream. This reduces the load on the rule engine and the machine learning models and lowers the time cost.

In one possible design, determining that the string does not contain a high risk keyword includes:

decoding the webpage file to obtain a decoded text; extracting a code from the decoded text; determining that the high-risk keyword is not included in the code.

With this method, the webpage file is decoded to obtain a decoded text, which makes it convenient to judge subsequently whether the webpage file contains high-risk keywords. Code is then extracted from the decoded text, and it is determined that the code does not contain a high-risk keyword. This achieves a preliminary screening of the webpage files and determines the webpage files to be identified that need further identification. It avoids mixing malicious webpage files whose code contains high-risk keywords with the webpage files to be identified, which would increase the load on the subsequent rule engine and machine learning and raise the time cost.

In one possible design, further comprising:

if no code is extracted from the decoding text, determining that the webpage file is a non-malicious webpage file; and/or if the code is determined to contain the high-risk keywords, determining that the webpage file is a malicious webpage file.

With this method, non-malicious webpage files from whose decoded text no code can be extracted, and malicious webpage files whose code contains high-risk keywords, are screened out. The easily identifiable part of the webpage files can thus be classified simply and quickly as malicious or non-malicious.

In one possible design, the method further includes:

and if the webpage file does not contain the character string, determining that the webpage file is a non-malicious webpage file.

By adopting the method, the webpage file without the character string is determined as the non-malicious webpage file. Therefore, the identification of the non-malicious webpage files can be quickly and effectively realized.

In one possible design, determining whether the web page file contains a character string includes:

determining whether the file type of the webpage file is a text type according to the file header of the webpage file; if the file type is a text type, determining that the webpage file contains the character string; if the file type is not a text type, determining whether the webpage file has a recognizable encoding format; if the webpage file has a recognizable encoding format, determining that it contains the character string; if the webpage file is not of a text type and does not have a recognizable encoding format, searching the webpage file to determine whether it contains the character string.

With this method, whether the file type of the webpage file is a text type is determined from the file header, so the file type can be determined quickly. If the file is of a text type, it is determined to contain a character string, and whether it is a malicious webpage file or a webpage file to be identified can then be judged from the content of the character string. If the file is not of a text type, it is determined whether the file has a recognizable encoding format; if it does, the file is determined to contain a character string. If the file is not of a text type and has no recognizable encoding format, the file is searched for character strings. Checking the encoding format and searching for character strings in non-text files prevents malicious webpage files disguised as non-text files from harming the computer security environment, and improves the accuracy of webpage file identification. Furthermore, for a webpage file that contains a character string, whether or not it has a recognizable encoding format, whether it is a malicious webpage file can still be judged from the content of the character string; a webpage file with no recognizable encoding format and no character string is a non-malicious webpage file. This completes a simple and quick preliminary screening of the webpage files.

In one possible design, the rule engine is implemented by a first regular match;

determining that the character string does not contain high-risk keywords, including: determining that the character string does not contain high-risk keywords through second regular matching, wherein the complexity of the second regular matching is smaller than that of the first regular matching; extracting a code from the decoded text, comprising: extracting a code from the decoded text by a third regular match having a complexity less than the complexity of the first regular match.

With this method, the code is extracted by the third regular match, so that the second regular match can determine from the extracted code whether a high-risk keyword is present, and the webpage file to be identified can be determined simply and quickly. The first regular match then performs a finer matching on the webpage file to be identified to determine whether it is a malicious webpage file. Increasing the complexity of the regular matching layer by layer thus raises the accuracy of webpage file identification while keeping the time spent low.

In a second aspect, an embodiment of the present invention provides an apparatus for identifying a malicious web page file, where the apparatus includes:

the determining unit is used for determining feature data of each dimension of the webpage file to be identified; the dimensions comprise a combined dimension and a single dimension; the combined dimension comprises a plurality of feature data; the single dimension has only one feature data;

the processing unit is used for obtaining fusion feature data of the combined dimension by the aid of a first machine learning model according to the multiple feature data of the combined dimension aiming at the combined dimension in the dimensions; obtaining a preliminary identification result of whether the webpage file to be identified is a malicious webpage file or not through a rule engine;

and the processing unit is further used for obtaining a final identification result of whether the webpage file to be identified is a malicious webpage file or not by passing the preliminary identification result, the combined dimension fusion feature data and the single dimension feature data through a second machine learning model.

In one possible design, the web page file to be identified is obtained by: acquiring a webpage file; determining that the webpage file contains a character string and the character string does not contain a high-risk keyword, and determining that the webpage file is the webpage file to be identified; the high-risk keywords are used for indicating that the webpage files are malicious webpage files.

In one possible design, the processing unit is further configured to: if the webpage file does not contain the character string, determine that the webpage file is a non-malicious webpage file.

In one possible design, whether the webpage file contains a character string is determined by: determining whether the file type of the webpage file is a text type according to the file header of the webpage file; if the file type is a text type, determining that the webpage file contains the character string; if the file type is not a text type, determining whether the webpage file has a recognizable encoding format; if the webpage file has a recognizable encoding format, determining that it contains the character string; if the webpage file is not of a text type and does not have a recognizable encoding format, searching the webpage file to determine whether it contains the character string.

In a third aspect, an embodiment of the present application further provides a computing device, including: a memory for storing program instructions; a processor for calling program instructions stored in said memory to execute the method as described in the various possible designs of the first aspect according to the obtained program.

In a fourth aspect, embodiments of the present application further provide a computer-readable non-transitory storage medium, including computer-readable instructions, which, when read and executed by a computer, cause the computer to perform the method as set forth in the various possible designs of the first aspect.

These and other implementations of the present application will be more readily understood from the following description of the embodiments.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.

Fig. 1 is a schematic structural diagram of malicious web page file identification according to an embodiment of the present invention;

fig. 2 is a schematic flowchart of a method for identifying a malicious web page file according to an embodiment of the present invention;

fig. 3 is a schematic flowchart of a method for identifying malicious web files according to an embodiment of the present invention;

fig. 4 is a schematic flowchart of a method for identifying malicious web files according to an embodiment of the present invention;

fig. 5 is a schematic diagram of an apparatus for identifying a malicious web page file according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 shows a system architecture for identifying malicious webpage files according to an embodiment of the present invention. A webpage server 101 screens out the webpage files to be identified from the webpage files and sends each webpage file to be identified to a rule engine server 102, which identifies it to obtain a preliminary identification result. The webpage server 101 also sends the webpage file to be identified to a feature engineering server 103, which determines the plurality of feature data of the combined dimension and the feature data of the single dimension of the webpage file to be identified, and sends the plurality of feature data of the combined dimension to a feature fusion server 104. The feature fusion server 104 fuses the plurality of feature data of the combined dimension to obtain fused feature data. The feature data of the single dimension determined by the feature engineering server 103, the fused feature data obtained by the feature fusion server 104, and the preliminary identification result obtained by the rule engine server 102 are sent to a machine learning server 105, which performs machine learning on them to determine whether the webpage file to be identified is a malicious webpage file.
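
The data flow of Fig. 1 can be sketched as plain function composition. This is a minimal sketch under stated assumptions, not the patented implementation: each server is modelled as a hypothetical callable supplied by the caller, and the names `identify`, `rule_engine`, `feature_engineering`, `feature_fusion`, and `final_model` are introduced here only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type aliases; each callable stands in for one server of Fig. 1.
RuleEngine = Callable[[str], int]                   # rule engine server 102
FeatureEngineering = Callable[
    [str], Tuple[Dict[str, List[float]], Dict[str, float]]
]                                                   # feature engineering server 103
FeatureFusion = Callable[[List[float]], float]      # feature fusion server 104
FinalModel = Callable[[List[float]], int]           # machine learning server 105


def identify(
    web_file: str,
    rule_engine: RuleEngine,
    feature_engineering: FeatureEngineering,
    feature_fusion: FeatureFusion,
    final_model: FinalModel,
) -> int:
    """Return 1 (malicious) or 0 (non-malicious) for a file to be identified."""
    preliminary = rule_engine(web_file)
    # combined: dimension name -> several feature data; single: name -> one value
    combined, single = feature_engineering(web_file)
    # Each combined dimension is reduced to one fused value.
    fused = [feature_fusion(values) for _, values in sorted(combined.items())]
    single_values = [value for _, value in sorted(single.items())]
    return final_model([float(preliminary), *fused, *single_values])


# Usage with trivial stand-ins; all values below are arbitrary placeholders.
result = identify(
    "<?php echo 'hello'; ?>",
    rule_engine=lambda text: 0,
    feature_engineering=lambda text: (
        {"tfidf": [0.5, 0.3]},
        {"entropy": 0.2, "ioc": 0.067},
    ),
    feature_fusion=lambda values: 1.0,
    final_model=lambda features: int(features[0] == 1 and features[1] == 1),
)
```

Passing the components in as callables keeps the sketch independent of any particular model library; in a deployment the callables would wrap network calls to the respective servers.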

Based on this, an embodiment of the present application provides a method for identifying a malicious web file, as shown in fig. 2, including:

step 201, determining characteristic data of each dimension of a webpage file to be identified; the dimensions comprise a combined dimension and a single dimension; the combined dimension comprises a plurality of feature data; the single dimension has only one feature data;

Here, the webpage file to be identified is a webpage file for which it must be judged whether it is a malicious webpage file. Each webpage file to be identified has feature data in each dimension, for example in dimensions such as the IOC (index of coincidence), TF-IDF (term frequency-inverse document frequency), and information entropy. Feature data is data that describes a feature of the webpage file to be identified. For example, an information entropy of 0.2 means that the uncertainty of the information source of the webpage file to be identified is 0.2. An IOC of 0.067 means that the probability that two characters drawn at random from the webpage file to be identified are identical is 0.067. A term frequency (TF) of 0.07 for "chip" means that "chip" appears 7 times in the 100 total words of the webpage file to be identified. An inverse document frequency (IDF) of 4 for "chip" means that "chip" appears in 1,000 files out of 10,000,000 files. A TF of 0.03 for "cleaning" means that "cleaning" appears 3 times in the 100 total words of the webpage file to be identified. An IDF of 4 for "cleaning" means that "cleaning" appears in 1,000 files out of 10,000,000 files. A TF of 0.10 for "application" means that "application" appears 10 times in the 100 total words of the webpage file to be identified. An IDF of 1 for "application" means that "application" appears in 1,000,000 files out of 10,000,000 files.
The combined dimension comprises a plurality of feature data; that is, a plurality of feature data together describe one dimension of the webpage file to be identified. In the example above, the TF and IDF of "chip", the TF and IDF of "cleaning", and the TF and IDF of "application" together describe the TF-IDF dimension of the webpage file to be identified. The single dimension has only one feature data; that is, one feature data describes one dimension of the webpage file to be identified. In the example above, the information entropy of 0.2 describes the information entropy dimension of the webpage file to be identified, and the IOC of 0.067 describes its IOC dimension.
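
The feature data named in step 201 can be computed with a few lines of standard-library Python. The sketch below is illustrative only: the function names are hypothetical, the entropy is measured in bits per character, and the IDF uses a base-10 logarithm without smoothing, which is an assumption chosen because it reproduces the IDF of 4 for a term found in 1,000 of 10,000,000 files.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(text: str) -> float:
    """Single dimension: Shannon entropy in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def index_of_coincidence(text: str) -> float:
    """Single dimension: probability that two characters drawn at random
    (without replacement) from the text are identical."""
    n = len(text)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(text).values()) / (n * (n - 1))


def term_frequency(term: str, words: List[str]) -> float:
    """Share of the words in one file that equal `term`."""
    return words.count(term) / len(words) if words else 0.0


def inverse_document_frequency(total_files: int, files_with_term: int) -> float:
    """Assumed formula: log10(total files / files containing the term)."""
    return math.log10(total_files / files_with_term)


# "chip" appears 7 times in 100 words and in 1,000 of 10,000,000 files.
words = ["chip"] * 7 + ["other"] * 93
tf_chip = term_frequency("chip", words)
idf_chip = inverse_document_frequency(10_000_000, 1_000)
```

The two single dimensions each yield one number per file, whereas the TF-IDF dimension yields a TF and an IDF per vocabulary term, which is why it is treated as a combined dimension.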

Step 202, aiming at a combined dimension in the dimensions, obtaining fused feature data of the combined dimension by using a plurality of feature data of the combined dimension through a first machine learning model;

The plurality of feature data of the combined dimension of the webpage file to be identified is fused by the first machine learning model to obtain the fused feature data. The first machine learning model may be any of various machine learning algorithms, such as a Random Forest (RF) or a deep learning network. In the example above, the combined dimension TF-IDF of the webpage file to be identified comprises six feature data: the TF and IDF of "chip", the TF and IDF of "cleaning", and the TF and IDF of "application". From these six feature data, the TF-IDF of "chip" is 0.07 × 4 = 0.28, the TF-IDF of "cleaning" is 0.03 × 4 = 0.12, and the TF-IDF of "application" is 0.10 × 1 = 0.10, so the TF-IDF dimension feature of the webpage file to be identified is the three-dimensional vector [0.28, 0.12, 0.10]. The first machine learning model makes a prediction from this vector and outputs a prediction result of 0 or 1; assuming the prediction result is 1, the fused feature data is 1.
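
Step 202 can be illustrated by building the TF-IDF vector from the per-term TF and IDF values and reducing it to one fused value. This is a sketch under assumptions: `placeholder_first_model` is a hypothetical hand-written threshold standing in for a trained model such as a Random Forest, and its threshold of 0.4 is arbitrary.

```python
from typing import Callable, Dict, List, Tuple


def tfidf_vector(
    stats: Dict[str, Tuple[float, float]], vocabulary: List[str]
) -> List[float]:
    """Multiply TF by IDF for every term, in a fixed vocabulary order."""
    return [stats[term][0] * stats[term][1] for term in vocabulary]


def fuse(vector: List[float], first_model: Callable[[List[float]], int]) -> int:
    """Collapse the many feature data of the combined dimension into one
    fused value (0 or 1) by asking the first model for a prediction."""
    return int(first_model(vector))


def placeholder_first_model(vector: List[float]) -> int:
    # Hypothetical stand-in for a trained classifier; threshold is arbitrary.
    return 1 if sum(vector) > 0.4 else 0


# (TF, IDF) pairs taken from the example in the text.
stats = {
    "chip": (0.07, 4.0),
    "cleaning": (0.03, 4.0),
    "application": (0.10, 1.0),
}
vocabulary = ["chip", "cleaning", "application"]

vector = tfidf_vector(stats, vocabulary)
fused = fuse(vector, placeholder_first_model)
```

Because the second model later receives a single fused value instead of one entry per vocabulary term, the size of the vocabulary no longer determines how much weight the TF-IDF dimension carries relative to the single dimensions.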

Step 203, obtaining a preliminary identification result of whether the webpage file to be identified is a malicious webpage file through a rule engine;

Here, the rule engine is a method that can search for keywords and perform logical operations between them, for example regular expression matching, or string find functions combined with logical functions. The rule engine can thus make a preliminary judgment of whether the webpage file to be identified is a malicious webpage file and produce a preliminary identification result. For example, the preliminary identification result is 1 if the rule engine judges the webpage file to be identified to be a malicious webpage file, and 0 if it judges it to be a non-malicious webpage file.
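
A rule engine built from string find functions and logical functions, as mentioned above, might look like the following. The rule itself (code execution combined with reading request input) and the keyword lists are hypothetical examples chosen for illustration; the source does not specify which rules are used.

```python
def contains(text: str, keyword: str) -> bool:
    """Keyword search using the string find function."""
    return text.find(keyword) != -1


def rule_engine(code: str) -> int:
    """Return the preliminary identification result: 1 if the hypothetical
    rule fires (suspected malicious), otherwise 0."""
    lowered = code.lower()
    # Keyword group 1: dynamic code execution (illustrative list).
    executes_code = contains(lowered, "eval(") or contains(lowered, "assert(")
    # Keyword group 2: attacker-controlled request input (illustrative list).
    reads_request = contains(lowered, "$_post") or contains(lowered, "$_get")
    # Logical operation between the keyword groups.
    return 1 if (executes_code and reads_request) else 0


preliminary = rule_engine("<?php eval($_POST['cmd']); ?>")
```

Requiring both keyword groups rather than either one is what distinguishes this kind of rule from the simple high-risk keyword screening applied before step 201.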

Step 204, passing the preliminary identification result, the fused feature data of the combined dimension, and the feature data of the single dimension through a second machine learning model to obtain a final identification result of whether the webpage file to be identified is a malicious webpage file.

Here, the second machine learning model may likewise be any of various machine learning algorithms, such as a Random Forest (RF) or a deep learning network. The first machine learning model may be the same as or different from the second machine learning model. The preliminary identification result (1 or 0) from step 203, the fused feature data 1 of the combined dimension TF-IDF, the feature data 0.2 of the single dimension information entropy, and the feature data 0.067 of the single dimension IOC are passed through the second machine learning model to obtain the final identification result of whether the webpage file to be identified is a malicious webpage file.

With this method, the feature data of each dimension of the webpage file to be identified is determined, which yields the combined dimension and the single dimension, and the plurality of feature data of the combined dimension is passed through the first machine learning model to obtain the fused feature data of the combined dimension. This prevents the feature data of the single dimension from being drowned out during machine learning, and important feature data from being lost, because the feature data of the combined dimension would otherwise make up too large a share of all the feature data. The preliminary identification result obtained by the rule engine, the fused feature data, and the feature data of the single dimension are then passed through the second machine learning model to obtain the final identification result of whether the webpage file to be identified is a malicious webpage file. The method therefore not only prevents the feature data of the single dimension from being drowned out, but also combines two identification approaches, machine learning and a rule engine, which improves the accuracy of malicious webpage file identification and greatly increases the security of the computer environment.
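
The input to the second machine learning model in step 204 is a short vector with one entry per dimension. The sketch below assembles that vector and passes it to a stand-in model. `placeholder_second_model` is hypothetical: its weights and bias are hand-picked for illustration and would be learned from labelled samples in practice.

```python
import math
from typing import List, Sequence


def final_features(
    preliminary: int, fused: int, single: Sequence[float]
) -> List[float]:
    """One entry each for the rule engine result, the fused TF-IDF value,
    and every single-dimension feature (here: entropy, IOC)."""
    return [float(preliminary), float(fused), *single]


def placeholder_second_model(x: Sequence[float]) -> int:
    # Hypothetical logistic model; weights and bias are arbitrary, not trained.
    weights = [2.0, 2.0, 0.5, -1.0]
    bias = -2.5
    z = bias + sum(w * v for w, v in zip(weights, x))
    probability = 1.0 / (1.0 + math.exp(-z))
    return 1 if probability >= 0.5 else 0


# Values from the example: preliminary result 1, fused TF-IDF 1,
# information entropy 0.2, IOC 0.067.
features = final_features(1, 1, [0.2, 0.067])
final_result = placeholder_second_model(features)
```

With four inputs of comparable standing, the entropy and IOC features are not outnumbered by per-term TF-IDF entries, which is the effect the paragraph above describes.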

Based on the above method flows, an embodiment of the present application provides a method flow for identifying malicious web documents, as shown in fig. 3, before performing the method flow in fig. 2, the web documents may be subjected to corresponding screening preprocessing to obtain the web documents to be identified, where the method includes:

Step 301, acquiring a webpage file;

Here, for example, webpage file 1:

00001111

Webpage file 2:

GIF89a

<?php

$do='todo';

$$do=$_POST['dapeng'];

eval(`/**123**/`.$todo);

?>

Webpage file 3:

GIF89a

<?phpinfo()

$do='todo';

$$do=$_POST['dapeng'];

eval(`/**123**/`.$todo);

?>

Step 302, determining that the webpage file contains a character string and the character string does not contain a high-risk keyword, and determining that the webpage file is the webpage file to be identified; the high-risk keywords are used for indicating that the webpage files are malicious webpage files.

Here, high-risk keywords are keywords frequently used by malicious webpage files, such as phpinfo(), IFRAME, and trojan. A character string here means a text string, as opposed to a binary string. If the webpage file contains a character string and the character string does not contain a high-risk keyword, the screening of Fig. 3 cannot decide whether the webpage file is malicious; the webpage file is determined to be the webpage file to be identified and is further identified by the method flow of Fig. 2. For example, webpage file 2 in step 301 contains a character string, but the character string does not contain a high-risk keyword.

Determining that the character string does not contain a high-risk keyword comprises: decoding the webpage file to obtain a decoded text; extracting code from the decoded text; and determining that the code does not contain the high-risk keyword. That is, before judging whether the webpage file contains a high-risk keyword, the character string of the webpage file is decoded to obtain a decoded text, so that the corresponding rule engine can read the decoded text and extract the code, and it can then be determined that the extracted code does not contain a high-risk keyword.

If no code is extracted from the decoded text, the webpage file is determined to be a non-malicious webpage file; and/or, if the code is determined to contain a high-risk keyword, the webpage file is determined to be a malicious webpage file. That is, the character string of the webpage file is first decoded to obtain a decoded text; if the rule engine cannot read the decoded text and/or cannot extract code from it, the webpage file is determined to be a non-malicious webpage file. If the code extracted by the rule engine contains a high-risk keyword, the webpage file is a malicious webpage file; for example, webpage file 3 in step 301 contains the high-risk keyword phpinfo(), so it is very likely a malicious webpage file.

If the webpage file does not contain a character string, it is determined to be a non-malicious webpage file. That is, an empty webpage file or a webpage file containing no character string is a non-malicious webpage file; for example, webpage file 1 in step 301 contains no text character string and is therefore determined to be a non-malicious webpage file.
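
The three outcomes of the screening in Fig. 3 can be summarised as a small decision function. This is a sketch only: `Verdict`, `prescreen`, and the keyword tuple are hypothetical names, the keywords are the three examples given in the text, and plain substring tests are used here in place of the regular matching described later.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    NON_MALICIOUS = "non-malicious"
    MALICIOUS = "malicious"
    TO_BE_IDENTIFIED = "to be identified"


# Example high-risk keywords from the text, lowercased for comparison.
HIGH_RISK_KEYWORDS = ("phpinfo()", "iframe", "trojan")


def prescreen(decoded_text: Optional[str], code: Optional[str]) -> Verdict:
    """Screen one webpage file.

    decoded_text is None or empty when the file contains no character
    string; code is None or empty when no code could be extracted.
    """
    if not decoded_text:
        return Verdict.NON_MALICIOUS      # no character string
    if not code:
        return Verdict.NON_MALICIOUS      # no code extracted
    lowered = code.lower()
    if any(keyword in lowered for keyword in HIGH_RISK_KEYWORDS):
        return Verdict.MALICIOUS          # high-risk keyword present
    return Verdict.TO_BE_IDENTIFIED       # passed on to the flow of Fig. 2


verdict = prescreen("GIF89a <?php ... ?>", "$do='todo'; $$do=$_POST['dapeng'];")
```

Only files that receive the third verdict reach the rule engine and the machine learning models, which is how the screening reduces the downstream load.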

Whether the webpage file contains a character string is determined as follows: determining, according to the file header of the webpage file, whether the file type of the webpage file is a text type; if the file type is the text type, determining that the webpage file contains a character string; if the webpage file is not of the text type, determining whether the webpage file has a recognizable encoding format; if it has a recognizable encoding format, determining that the webpage file contains a character string; if the webpage file is not of the text type and does not have a recognizable encoding format, searching the webpage file to determine whether it contains a character string. That is, the webpage file can first be screened by determining whether it is a text file. If the webpage file is a text file, it contains a character string; the character string can subsequently be decoded, code extracted from the decoded text, and the extracted code checked for high-risk keywords, so as to judge whether the webpage file is a malicious webpage file or a webpage file to be identified. If the webpage file is a non-text file but has a recognizable encoding format, it is determined that the webpage file contains a character string; such a file is probably a malicious webpage file disguised as a non-malicious one. It is therefore decoded according to the corresponding encoding format, code is extracted from the decoded text, and the extracted code is checked for high-risk keywords, so as to judge whether the webpage file is a malicious webpage file or a webpage file to be identified.

If the webpage file is a non-text file and does not have a recognizable encoding format, it is further judged whether the webpage file contains a character string; if it does, the file is likewise probably a malicious webpage file disguised as a non-malicious one. Because the encoding format cannot be recognized, the file is decoded with a general-purpose character set or codec, such as Latin-1; code is then extracted from the decoded text and checked for high-risk keywords, so as to judge whether the webpage file is a malicious webpage file or a webpage file to be identified. Decoding a character string whose encoding format is recognizable with the corresponding codec is faster than decoding a character string whose encoding format is unrecognizable with a general-purpose character set or codec; for example, if the character string is ASCII-encoded, decoding it as ASCII is faster than decoding it with Latin-1 as the fallback for an unrecognized encoding. Therefore, to speed up identification of the webpage file, the encoding format of the character string is determined first.
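The string check and decoding fallback described above can be sketched roughly as follows. This is only an illustrative sketch: the helper names (`looks_like_text`, `detect_encoding`, `contains_string`, `decode_webpage`), the byte-order-mark table used as a stand-in for a "recognizable encoding format", the 512-byte header size, and the four-character run threshold are all assumptions and are not specified by the embodiment.

```python
import codecs
import re
from typing import Optional

# Assumed stand-in for a "recognizable encoding format": a byte-order mark.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)
# Assumed threshold: four or more printable ASCII bytes count as a character string.
_STRING_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def looks_like_text(data: bytes, header_size: int = 512) -> bool:
    """Header check (assumed heuristic): no control bytes except tab, LF, CR."""
    return all(
        (b >= 0x20 and b != 0x7F) or b in (0x09, 0x0A, 0x0D)
        for b in data[:header_size]
    )


def detect_encoding(data: bytes) -> Optional[str]:
    """Return a codec name if the file starts with a known byte-order mark."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def contains_string(data: bytes) -> bool:
    """Search raw bytes for a run of printable characters."""
    return _STRING_RUN.search(data) is not None


def decode_webpage(data: bytes) -> Optional[str]:
    """Return the decoded text, or None if the file contains no character string."""
    if not data:
        return None  # empty file: non-malicious
    encoding = detect_encoding(data)
    if looks_like_text(data):
        # Text file: try the detected codec, then UTF-8, then the Latin-1 fallback.
        for candidate in (encoding, "utf-8"):
            if candidate is None:
                continue
            try:
                return data.decode(candidate)
            except UnicodeDecodeError:
                pass
        return data.decode("latin-1")
    if encoding is not None:
        # Non-text file with a recognizable encoding format.
        return data.decode(encoding, errors="replace")
    if contains_string(data):
        # Non-text file, unknown encoding: general-purpose Latin-1 decoding.
        return data.decode("latin-1")
    return None  # no character string: non-malicious
```

Latin-1 maps every byte to a code point, so the fallback never raises; this is why it can serve as the general-purpose decoder for files whose encoding cannot be recognized.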

The rule engine is implemented by a first regular matching. Determining that the character string does not contain high-risk keywords includes: determining, through a second regular matching, that the character string does not contain high-risk keywords, wherein the complexity of the second regular matching is less than that of the first regular matching. Extracting code from the decoded text includes: extracting code from the decoded text through a third regular matching, wherein the complexity of the third regular matching is less than that of the first regular matching. That is, after the character string in the webpage file is decoded, the third regular matching may be performed to extract code from the decoded text, and the second regular matching is then performed to determine whether the extracted code contains high-risk keywords. If it does not, it cannot yet be determined whether the webpage file is a malicious or a non-malicious webpage file; the webpage file is then a webpage file to be identified, and the first regular matching is performed to further analyze whether the webpage file to be identified is a malicious webpage file. The third regular matching may be a regular expression responsible for extracting code; the second regular matching may be a simpler regular expression responsible for matching high-risk keywords; the first regular matching may be a more complex regular expression that not only matches high-risk keywords but can also evaluate the logical relationships between keywords, to determine whether the keywords, after the corresponding logical operation, form high-risk keywords or malicious statements, and thus whether the webpage file to be identified is a malicious webpage file.
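The three tiers of regular matching can be illustrated with a minimal sketch. The patterns below are hypothetical examples chosen for PHP files; the embodiment does not disclose its actual expressions, keyword lists, or function names such as `screen` and `rule_engine`.

```python
import re

# Third regular matching (hypothetical): extract code between PHP tags.
CODE_RE = re.compile(r"<\?php(.*?)(?:\?>|\Z)", re.DOTALL | re.IGNORECASE)

# Second regular matching (hypothetical keyword list): a simple keyword lookup.
HIGH_RISK_RE = re.compile(r"\b(?:c99shell|r57shell)\b", re.IGNORECASE)

# First regular matching (hypothetical rule): a keyword combined with a logical
# relationship, here an execution function whose argument is request input.
RULE_ENGINE_RE = re.compile(
    r"\b(?:eval|assert|system)\s*\(\s*[^)]*\$_(?:GET|POST|REQUEST|COOKIE)\b",
    re.IGNORECASE,
)


def extract_code(text: str) -> str:
    """Apply the third regular matching and join all extracted code fragments."""
    return "\n".join(CODE_RE.findall(text))


def screen(text: str) -> str:
    """Cheap pre-screening with the third and second regular matching."""
    code = extract_code(text)
    if not code.strip():
        return "non-malicious"       # no code could be extracted
    if HIGH_RISK_RE.search(code):
        return "malicious"           # a high-risk keyword is present
    return "to be identified"        # needs the rule engine and the models


def rule_engine(text: str) -> bool:
    """Apply the first regular matching to obtain a preliminary result."""
    return RULE_ENGINE_RE.search(extract_code(text)) is not None
```

Running the cheaper expressions first means the more expensive first regular matching is only applied to files that the pre-screening could not settle.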

Based on the method flows of fig. 2 and fig. 3, an embodiment of the present application provides a method flow for identifying a malicious web page file, as shown in fig. 4, including:

Step 401, acquiring a webpage file and judging whether the webpage file is a blank file; if so, executing step 406 to determine that the webpage file is a non-malicious webpage file; if not, executing step 402.

Step 402, determining whether the web page file is a text file, if so, executing step 403, and if not, executing step 404.

Step 403, the web page file is a text file and therefore contains a character string; determining the encoding format of the web page file, and executing step 407.

Step 404, the web page file is not a text file; judging whether the web page file has a recognizable encoding format; if the web page file does not have a recognizable encoding format, executing step 405; if the web page file has a recognizable encoding format, executing step 407.

Step 405, the web page file does not have a recognizable code format, and whether the web page file contains a character string is further judged; if not, go to step 406 to determine that the web page file is a non-malicious web page file; if the character string is included, step 407 is executed.

Step 407, correspondingly decoding the web page file, which is a text file, a non-text file with a recognizable encoding format, or a non-text file that contains a character string, to obtain a decoded text.

Step 408, extracting code from the decoded text through the third regular matching; if no code can be extracted, executing step 406 to determine that the web page file is a non-malicious web page file; if code can be extracted, executing step 409.

Step 409, determining whether the extracted code contains high-risk keywords through second regular matching; if yes, go to step 410 to determine that the web page file is a malicious web page file; if not, go to step 411.

Step 411, the above process cannot determine whether the web page file is a malicious web page file or a non-malicious web page file; determining that the web page file is a web page file to be identified, which needs to be further analyzed and judged.

Step 412, performing the first regular matching on the web page file to be identified to obtain a preliminary identification result.

Step 413, determining the plurality of feature data of the combined dimension and the feature data of the single dimension of the web page file to be identified.

Step 414, passing the plurality of feature data of the combined dimension of the web page file to be identified, obtained in step 413, through the first machine learning model to obtain fused feature data.

Step 415, extracting the feature data corresponding to each single dimension of the web page file to be identified.

Step 416, inputting the preliminary identification result obtained in step 412, the fused feature data obtained in step 414, and the feature data corresponding to each single dimension obtained in step 415 into the second machine learning model, which outputs the final result of whether the web page file to be identified is a malicious web page file.

It should be noted that the order of the above steps is not fixed; for example, step 413 and step 414 may be executed first, and then step 412 is executed.
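The screening portion of this flow (steps 401 to 411) can be summarized as a short control-flow sketch. The predicates are passed in as callables because the embodiment does not fix their implementation; the names `prescreen`, `Verdict`, and `demo`, and the simple stand-in predicates inside `demo`, are assumptions made only for illustration.

```python
import re
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    NON_MALICIOUS = "non-malicious"        # step 406
    MALICIOUS = "malicious"                # step 410
    TO_BE_IDENTIFIED = "to be identified"  # step 411


def prescreen(
    data: bytes,
    is_text: Callable[[bytes], bool],
    detect_encoding: Callable[[bytes], Optional[str]],
    has_string: Callable[[bytes], bool],
    extract_code: Callable[[str], str],
    has_high_risk_keyword: Callable[[str], bool],
) -> Verdict:
    if not data:                                   # step 401: blank file
        return Verdict.NON_MALICIOUS
    encoding = detect_encoding(data)               # steps 403 / 404
    if not is_text(data) and encoding is None:     # step 402 -> 404 -> 405
        if not has_string(data):
            return Verdict.NON_MALICIOUS
    # Step 407: decode; Latin-1 is the assumed general-purpose fallback.
    text = data.decode(encoding or "latin-1", errors="replace")
    code = extract_code(text)                      # step 408
    if not code.strip():
        return Verdict.NON_MALICIOUS
    if has_high_risk_keyword(code):                # step 409
        return Verdict.MALICIOUS
    return Verdict.TO_BE_IDENTIFIED                # step 411


def demo(data: bytes) -> Verdict:
    """Run prescreen with simple, hypothetical stand-in predicates."""
    def is_text(d: bytes) -> bool:
        return b"\x00" not in d[:512]

    return prescreen(
        data,
        is_text=is_text,
        detect_encoding=lambda d: "utf-8" if is_text(d) else None,
        has_string=lambda d: re.search(rb"[\x20-\x7e]{4,}", d) is not None,
        extract_code=lambda t: "".join(re.findall(r"<\?php(.*?)\?>", t, re.DOTALL)),
        has_high_risk_keyword=lambda c: "c99shell" in c.lower(),
    )
```

Only files that come back as `Verdict.TO_BE_IDENTIFIED` would proceed to steps 412 to 416.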

Based on the same concept, an embodiment of the present invention provides an apparatus for identifying a malicious web page file, and fig. 5 is a schematic diagram of the apparatus for identifying a malicious web page file provided in an embodiment of the present invention, as shown in fig. 5, the apparatus includes:

a determining unit 501, configured to determine feature data of each dimension of a to-be-identified web page file; the dimensions comprise a combined dimension and a single dimension; the combined dimension comprises a plurality of feature data; the single dimension has only one feature data;

a processing unit 502, configured to, for a combined dimension in the dimensions, obtain, through a first machine learning model, fused feature data of the combined dimension from a plurality of feature data of the combined dimension; and to obtain, through a rule engine, a preliminary identification result of whether the webpage file to be identified is a malicious webpage file;

the processing unit 502 is further configured to obtain a final recognition result of whether the to-be-recognized webpage file is a malicious webpage file by passing the preliminary recognition result, the fusion feature data of the combined dimensions, and the feature data of the single dimension through a second machine learning model.
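The data flow handled by the processing unit can be pictured with the following sketch. It only shows how the three inputs are assembled and passed from the first model to the second; the logistic form of both models, the weights, and the names `fuse`, `final_score`, and `classify` are hypothetical, since this passage does not state which model types are used, and in practice both models would be trained rather than hand-weighted.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fuse(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Stand-in first model: collapse the feature data of one combined
    dimension into a single fused value."""
    if len(features) != len(weights):
        raise ValueError("one weight per feature is required")
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def final_score(
    preliminary: bool,
    fused: Sequence[float],
    singles: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Stand-in second model: score the row made of the rule-engine result,
    the fused values, and the single-dimension feature data."""
    row = [1.0 if preliminary else 0.0, *fused, *singles]
    if len(row) != len(weights):
        raise ValueError("one weight per input column is required")
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


def classify(score: float, threshold: float = 0.5) -> bool:
    """Return True if the file is judged malicious (assumed threshold)."""
    return score >= threshold


# Hypothetical example: one combined dimension with two features,
# one single-dimension feature, and made-up weights.
fused_value = fuse([0.8, 0.3], weights=[1.2, 0.7], bias=-0.5)
score = final_score(True, [fused_value], [0.4], weights=[2.0, 1.5, 0.5], bias=-2.0)
is_malicious = classify(score)
```

Feeding the rule-engine result in as one more input column, rather than treating it as a final verdict, lets the second model weigh it against the learned features.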

In a possible design, the processing unit 502 is further configured to obtain the web page file to be identified by: acquiring a webpage file; determining that the webpage file contains a character string and the character string does not contain a high-risk keyword, and determining that the webpage file is the webpage file to be identified; the high-risk keywords are used for indicating that the webpage files are malicious webpage files.

In one possible design, the processing unit 502 is specifically configured to determine that the character string does not contain a high-risk keyword, including: decoding the webpage file to obtain a decoded text; extracting a code from the decoded text; determining that the high-risk keyword is not included in the code.

In a possible design, the processing unit 502 is further configured to determine that the web page file is a non-malicious web page file if no code is extracted from the decoded text; and/or if the code is determined to contain the high-risk keywords, determining that the webpage file is a malicious webpage file.

In a possible design, the processing unit 502 is further configured to determine that the web page file is a non-malicious web page file if it is determined that the web page file does not contain the character string.

In one possible design, the processing unit 502 is specifically configured to determine whether the web page file contains a character string by: determining, according to the file header of the webpage file, whether the file type of the webpage file is a text type; if the file type is the text type, determining that the webpage file contains a character string; if the webpage file is not of the text type, determining whether the webpage file has a recognizable encoding format; if it has a recognizable encoding format, determining that the webpage file contains a character string; if the webpage file is not of the text type and does not have a recognizable encoding format, searching the webpage file to determine whether it contains a character string.

In a possible design, the processing unit 502 is specifically configured to determine, through a second regular matching, that the character string does not contain a high-risk keyword, wherein the complexity of the second regular matching is less than that of the first regular matching; and to extract code from the decoded text through a third regular matching, wherein the complexity of the third regular matching is less than that of the first regular matching.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method for identifying malicious webpage files, which is characterized by comprising the following steps:

determining characteristic data of each dimension of a webpage file to be identified; the dimensions comprise a combined dimension and a single dimension; the combined dimension comprises a plurality of feature data; the single dimension has only one feature data;

aiming at a combined dimension in the dimensions, obtaining fusion feature data of the combined dimension by a plurality of feature data of the combined dimension through a first machine learning model;

obtaining a preliminary identification result of whether the webpage file to be identified is a malicious webpage file or not through a rule engine;

and obtaining a final identification result of whether the webpage file to be identified is a malicious webpage file or not by passing the preliminary identification result, the fusion characteristic data of the combined dimension and the characteristic data of the single dimension through a second machine learning model.

2. The method of claim 1, wherein the obtaining of the web page file to be identified comprises:

acquiring a webpage file;

determining that the webpage file contains a character string and the character string does not contain a high-risk keyword, and determining that the webpage file is the webpage file to be identified; the high-risk keywords are used for indicating that the webpage files are malicious webpage files.

3. The method of claim 2,

determining that the character string does not contain high-risk keywords, including:

decoding the webpage file to obtain a decoded text;

extracting a code from the decoded text;

determining that the high-risk keyword is not included in the code.

4. The method of claim 3, further comprising:

if no code is extracted from the decoded text, determining that the webpage file is a non-malicious webpage file; and/or

if the code is determined to contain the high-risk keywords, determining that the webpage file is a malicious webpage file.

5. The method of claim 3 or 4, further comprising:

6. The method of any one of claims 2 to 4, wherein determining whether the web page file contains a character string comprises:

determining whether the file type of the webpage file is a text type or not according to the file header of the webpage file; if the file type is the text type, determining that the character string is contained;

if the webpage file is not of the text type, determining whether the webpage file has a recognizable encoding format; if the encoding format is recognizable, determining that the character string is contained;

if the webpage file is not of the text type and does not have a recognizable encoding format, searching the webpage file to determine whether it contains the character string.

7. The method of claim 5, wherein the rule engine is implemented by a first regular matching;

determining that the character string does not contain high-risk keywords through second regular matching, wherein the complexity of the second regular matching is smaller than that of the first regular matching;

extracting a code from the decoded text, comprising:

extracting a code from the decoded text by a third regular match having a complexity less than the complexity of the first regular match.

8. An apparatus for identifying malicious web files, the apparatus comprising:

9. The apparatus of claim 8, wherein the web page file to be identified is obtained by:

acquiring a webpage file;

10. The apparatus as recited in claim 9, said processing unit to further:

11. The apparatus of claim 9, wherein determining whether the web page file contains a character string is performed by:

12. A computing device, comprising:

a memory for storing program instructions;

a processor for calling program instructions stored in said memory to perform the method of any of claims 1 to 7 in accordance with the obtained program.

13. A computer-readable non-transitory storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to perform the method of any one of claims 1 to 7.