CN113190847A - Confusion detection method, device, equipment and storage medium for script file - Google Patents

Publication number
CN113190847A
Authority
CN
China
Prior art keywords
script file
confusion
detection
different
script
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110401461.XA
Other languages
Chinese (zh)
Inventor
闫华
位凯志
古亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN202110401461.XA
Publication of CN113190847A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a confusion detection method, apparatus, device and storage medium for script files. The method comprises: obtaining M different confusion detection models and a script file to be detected; detecting the script file with each of the M different confusion detection models to obtain M detection results; and, when the M detection results are consistent, determining based on the detection results whether the script file is an obfuscated script file. Thus, the script file is detected by the M confusion detection models to obtain M detection results, and when the M detection results are consistent, the script file is determined to be an obfuscated script file or a non-obfuscated script file according to the detection results. Script files with consistent detection results can subsequently be used to train the confusion detection models repeatedly, so that a script file to be detected can be accurately classified as obfuscated or non-obfuscated. This addresses the prior-art problem that only a specific category of script file can be identified, that is, the problem of overfitting of the confusion detection model.

Description

Confusion detection method, device, equipment and storage medium for script file
Technical Field
The present application relates to computer security technologies, and in particular, to a method, an apparatus, a device, and a storage medium for detecting confusion of a script file.
Background
To evade detection and removal by firewalls and antivirus systems, hackers often transform malicious script files using obfuscation techniques. From the perspective of security attack and defense, an antivirus system needs to be able to determine whether a given script file has been processed by an obfuscation technique.
At present, the industry provides a method for identifying obfuscated JavaScript script files: JavaScript script file samples with obfuscated and non-obfuscated labels are obtained by using an obfuscation tool (such as a JavaScript Obfuscator) or by manual labeling, a script file confusion detection model is trained on these samples, and the trained model is then used to detect whether a JavaScript script file to be detected has been obfuscated. However, the labeled samples generated with an obfuscation tool in this scheme tend to cause the script file confusion detection model to overfit, that is, the model can only identify JavaScript script files of a specific kind.
Disclosure of Invention
In order to solve the foregoing technical problem, embodiments of the present application provide a method, an apparatus, a device, and a storage medium for detecting confusion of a script file.
The technical scheme of the application is realized as follows:
In a first aspect, a confusion detection method for a script file is provided, the method comprising:
obtaining M different confusion detection models and obtaining a first script file to be detected; wherein M is a positive integer greater than or equal to 2;
detecting the script file by using each confusion detection model in the M different confusion detection models to obtain M detection results;
and when the M detection results are consistent, determining, based on the detection results, whether the script file is an obfuscated script file.
In the foregoing solution, after determining whether the script file is an obfuscated script file based on the detection results, the method further includes: taking the detection result as label information of the script file, where the label information indicates whether the script file is an obfuscated script file.
In the foregoing solution, after taking the detection result as the label information of the script file, the method further includes: storing the script file and the corresponding label information in a script file set with label information; and retraining the M different confusion detection models by using the script file and the label information to obtain M retrained confusion detection models.
In the foregoing solution, the method further includes: when the M detection results are inconsistent, treating the script file as a script file to be detected again, to await the next detection.
In the foregoing solution, the M different confusion detection models are each generated by training based on different feature engineering.
In the foregoing solution, the M different confusion detection models are each generated by training based on different machine learning algorithms.
In the foregoing solution, the M different confusion detection models are each generated by training based on different feature engineering and different machine learning algorithms.
In the foregoing solution, the script file includes a JavaScript script file.
In a second aspect, an apparatus for detecting confusion of a script file is provided, the apparatus comprising:
an acquisition unit, configured to acquire M different confusion detection models and a first script file to be detected, where M is a positive integer greater than or equal to 2;
a detection unit, configured to detect the first script file by using each confusion detection model in the M different confusion detection models to obtain M detection results;
and a determining unit, configured to determine, based on the detection results, whether the script file is an obfuscated script file when the M detection results are consistent.
In a third aspect, an apparatus for detecting confusion of a script file is provided, including: a processor and a memory configured to store a computer program operable on the processor, wherein the processor is configured to perform the steps of the aforementioned method when executing the computer program.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the aforementioned method.
The application discloses a confusion detection method for script files, which comprises: acquiring M different confusion detection models and a script file to be detected, where M is a positive integer greater than or equal to 2; detecting the script file with each of the M different confusion detection models to obtain M detection results; and determining the label information of the script file by comparing the M detection results. Thus, the script file is detected by the M confusion detection models to obtain M detection results, and when the M detection results are consistent, the script file is determined to be an obfuscated script file or a non-obfuscated script file according to the detection results. Script files with consistent detection results can subsequently be used to train the confusion detection models repeatedly, so that a script file to be detected can be accurately classified as obfuscated or non-obfuscated. This addresses the prior-art problem that only a specific category of script file can be identified, that is, the problem of overfitting of the confusion detection model.
Drawings
FIG. 1 is a first flowchart illustrating a method for detecting confusion of a script file according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a first process of training 2 different confusion detection models in the embodiment of the present application;
FIG. 3 is a second flowchart illustrating a method for detecting confusion of a script file according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating a second process of training 2 different confusion detection models in the embodiment of the present application;
FIG. 5 is a schematic structural diagram of a confusion detection apparatus for a script file in the embodiment of the present application;
FIG. 6 is a schematic structural diagram of a confusion detection device for a script file in the embodiment of the present application.
Detailed Description
So that the manner in which the features and elements of the present embodiments can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings.
The confusion detection method of the present application detects obfuscation of script files in a semi-supervised learning manner. Specifically, M preset detection models are trained on script files with label information to obtain M confusion detection models; the M confusion detection models are then used to detect a large number of script files without label information, and the confusion detection models are trained repeatedly on those script files and the label information obtained for them.
FIG. 1 is a first flowchart of a confusion detection method for a script file in an embodiment of the present application. As shown in FIG. 1, the method may include the following steps:
Step 101: acquiring M different confusion detection models and acquiring a script file to be detected; wherein M is a positive integer greater than or equal to 2;
It should be noted that the script file to be detected is a script file without label information, that is, it has not been marked as an obfuscated script file or a non-obfuscated script file. Here, the script file may be a JavaScript script file.
It should be noted that the M confusion detection models are all trained detection models, and are used to detect whether the script file is an obfuscated script file.
In some embodiments, the method further includes training the confusion detection models. That is, before obtaining the M different confusion detection models, the method further includes: acquiring a script file set with label information, where the script file set includes at least one script file to be trained and label information of each script file to be trained; performing feature extraction on each script file to be trained in the script file set by using M feature extraction models to obtain M feature extraction results corresponding to each script file to be trained; and training M preset detection models based on the M feature extraction results and the label information to obtain the M different confusion detection models.
It should be noted that the M different confusion detection models may be generated by training with M feature extraction models constructed from different feature engineering and M preset detection models constructed from the same machine learning algorithm; or with M feature extraction models constructed from the same feature engineering and M preset detection models constructed from different machine learning algorithms; or with M feature extraction models constructed from different feature engineering and M preset detection models constructed from different machine learning algorithms.
The different feature engineering may be, for example, two or more of lexical features, grammatical features, black-and-white word features, and the like; the different machine learning algorithms may be, for example, two or more of a support vector machine algorithm, a boosting tree algorithm, a random forest algorithm, a deep learning algorithm, a neural network algorithm, and the like.
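The three configurations described above can be illustrated with a short Python sketch. The function name `build_model_specs` and the string labels for feature engineering and algorithms are hypothetical, chosen only for illustration; the source does not prescribe any particular representation.

```python
from typing import List, Tuple


def build_model_specs(features: List[str], algorithms: List[str]) -> List[Tuple[str, str]]:
    """Pair each feature engineering with a learning algorithm, one pair per model.

    A list with a single entry is reused for every model, which expresses
    "same feature engineering" or "same machine learning algorithm".
    """
    m = max(len(features), len(algorithms))
    if m < 2:
        raise ValueError("M must be at least 2")
    if len(features) == 1:
        features = features * m
    if len(algorithms) == 1:
        algorithms = algorithms * m
    if len(features) != len(algorithms):
        raise ValueError("need one feature engineering and one algorithm per model")
    specs = list(zip(features, algorithms))
    # The M models must be pairwise different, otherwise their agreement carries no information.
    if len(set(specs)) != m:
        raise ValueError("the M models must differ from one another")
    return specs
```

For example, `build_model_specs(["lexical", "black_white"], ["svm"])` describes two models that differ only in feature engineering, while passing two entries in both lists describes models that differ in both respects.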
Specifically, M feature extraction models are used for extracting features of each script file to be trained in the script file set to obtain M feature extraction results corresponding to each script file to be trained, then the M feature extraction results are input into corresponding M preset detection models to obtain predicted label information of the corresponding script file to be trained, parameters in the M preset detection models are adjusted respectively according to the predicted label information and the label information (namely real label information), and if the accuracy of the predicted label information is higher than or equal to the preset accuracy, M different confusion detection models are obtained. And if the accuracy of the predicted tag information is lower than the preset accuracy, continuously adjusting the parameters in the M preset detection models until the accuracy of the predicted tag information is higher than or equal to the preset accuracy.
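The "adjust parameters until the preset accuracy is reached" loop can be sketched for a single model as follows. This is a minimal illustration that assumes a simple perceptron as the preset detection model; the source does not fix the algorithm, and the names `train_until_accurate`, `preset_accuracy`, `max_epochs` and `lr` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def predict(weights: List[float], bias: float, x: Vector) -> int:
    """Return 1 (obfuscated) or 0 (non-obfuscated) for one feature vector."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0


def accuracy(weights: List[float], bias: float, xs: Sequence[Vector], ys: Sequence[int]) -> float:
    """Share of predicted labels that match the real labels."""
    correct = sum(predict(weights, bias, x) == y for x, y in zip(xs, ys))
    return correct / len(ys)


def train_until_accurate(
    xs: Sequence[Vector],
    ys: Sequence[int],
    preset_accuracy: float = 0.95,
    max_epochs: int = 1000,
    lr: float = 0.1,
) -> Tuple[List[float], float]:
    """Adjust parameters until accuracy >= preset_accuracy (or max_epochs is hit)."""
    weights = [0.0] * len(xs[0])
    bias = 0.0
    for _ in range(max_epochs):
        if accuracy(weights, bias, xs, ys) >= preset_accuracy:
            break  # accurate enough: stop adjusting parameters
        for x, y in zip(xs, ys):
            error = y - predict(weights, bias, x)
            if error:  # perceptron update on misclassified samples only
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
    return weights, bias
```

In the scheme described above, the same loop would be run once per model, each on its own feature extraction results. The `max_epochs` cap is an added safeguard for data that the chosen model cannot fit to the preset accuracy.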
The label information (that is, the real label information) of each script file to be trained may be assigned by manual labeling, and indicates whether the script file is an obfuscated script file. For example, the label information includes an obfuscated identifier or a non-obfuscated identifier; alternatively, the script file is indicated to be an obfuscated script file when the label information includes the obfuscated identifier, and a non-obfuscated script file when the label information does not include the obfuscated identifier.
Here, if the value of M is 2, the feature extraction models include, for example, a feature extraction model constructed from lexical features and a feature extraction model constructed from black and white words.
For the feature extraction model constructed from lexical features, lexical analysis is performed on each script file to be trained to generate a corresponding word stream, where each word stream contains words of various types, for example identifiers, assignment operators, word size, keywords and the like. The number of words of each type in each word stream is counted, and the counts are filled in according to the word type order of a preset first feature extraction result rule to form a row matrix of fixed dimension, which is the corresponding feature extraction result.
For the feature extraction model constructed from black and white words, black-and-white word analysis is performed on each script file to be trained to extract all black words and white words in it. The number of black words and the number of white words are counted separately and filled in according to the word type order of a preset second feature extraction result rule to form a 1-by-2 row matrix, which is the corresponding feature extraction result.
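The two feature extractions can be sketched in Python as below. This is a rough illustration, not the applicant's implementation: the regular-expression tokenizer is a simplification of JavaScript lexing, and the token type order `TOKEN_TYPES`, the keyword subset, and the `BLACK_WORDS` and `WHITE_WORDS` lists are all assumptions, since the source does not enumerate them.

```python
import re
from collections import Counter
from typing import List

# Hypothetical fixed order of word types (stands in for the "first feature extraction result rule").
TOKEN_TYPES = ["keyword", "identifier", "number", "string", "assignment", "other"]

# Illustrative subset of JavaScript keywords.
KEYWORDS = {"var", "let", "const", "function", "return", "if", "else", "for", "while", "new"}

# Simplified tokenizer: one named group per raw token class.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
    |(?P<number>\d+(?:\.\d+)?)
    |(?P<word>[A-Za-z_$][A-Za-z0-9_$]*)
    |(?P<assignment>(?<![=!<>])=(?![=>]))
    |(?P<other>\S)
    """,
    re.VERBOSE,
)

# Hypothetical word lists; a real system would curate these.
BLACK_WORDS = {"eval", "unescape", "fromCharCode"}
WHITE_WORDS = {"console", "document", "window"}


def lexical_features(source: str) -> List[int]:
    """Count words per type and lay the counts out in TOKEN_TYPES order."""
    counts: Counter = Counter()
    for match in TOKEN_PATTERN.finditer(source):
        kind = match.lastgroup
        if kind == "word":
            kind = "keyword" if match.group() in KEYWORDS else "identifier"
        counts[kind] += 1
    return [counts[t] for t in TOKEN_TYPES]


def black_white_features(source: str) -> List[int]:
    """Return the 1-by-2 row [number of black words, number of white words]."""
    words = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", source)
    black = sum(word in BLACK_WORDS for word in words)
    white = sum(word in WHITE_WORDS for word in words)
    return [black, white]
```

Because both functions always return rows of the same length regardless of the script, their outputs can be fed directly to a detection model that expects fixed-dimension input.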
Based on the above, the present application provides a method of training 2 different confusion detection models. FIG. 2 is a schematic diagram of a first process of training 2 different confusion detection models in the embodiment of the present application. As shown in FIG. 2, the specific steps are as follows:
Step 201: acquiring a script file set with label information;
Here, the script file set includes at least one script file to be trained and the label information of each script file to be trained. The label information indicates whether the script file is an obfuscated script file.
Step 202: performing lexical feature extraction on each script file to be trained in the script file set to obtain a first feature vector corresponding to each script file to be trained;
Lexical analysis is performed on each script file to be trained by using the feature extraction model constructed from lexical features, generating a word stream corresponding to each script file to be trained. Each word stream includes words of a plurality of types.
The number of words of each type in each word stream is then counted, and the counts are filled in according to the word type order of the preset first feature extraction result rule to form a row matrix of fixed dimension (that is, the first feature vector, or feature extraction result).
Step 203: training a model 1 based on the first feature vector and the label information corresponding to each script file to be trained;
Here, the model 1 is any one of the M preset detection models mentioned above.
Step 204: obtaining a trained model 1;
The model 1 in this step is the confusion detection model obtained by training the preset detection model in the previous step.
Step 205: performing black and white word extraction on each script file to be trained in the script file set to obtain a second feature vector corresponding to each script file to be trained;
Black-and-white word analysis is performed on each script file to be trained by using the black-and-white word extraction model, extracting all black words and white words in each script file to be trained.
The number of black words and the number of white words in each script file to be trained are then counted separately, and the counts are filled in according to the word type order of the preset second feature extraction result rule to form a 1-by-2 row matrix (that is, the second feature vector).
Step 206: training a model 2 based on the second feature vector and the label information corresponding to each script file to be trained;
Here, the model 2 is another one of the M preset detection models mentioned above.
It should be noted that the model 1 and the model 2 may be models composed of feature extraction models constructed from different feature engineering and preset detection models constructed from the same machine learning algorithm; or models composed of feature extraction models constructed from the same feature engineering and preset detection models constructed from different machine learning algorithms; or models composed of feature extraction models constructed from different feature engineering and preset detection models constructed from different machine learning algorithms.
Step 207: obtaining a trained model 2.
The model 2 in this step is the confusion detection model obtained by training the preset detection model in the previous step.
Step 102: detecting the script file by using each confusion detection model in the M different confusion detection models to obtain M detection results;
It should be noted that the detection result indicates whether the script file is an obfuscated script file.
Here, the detection results of different confusion detection models for the same script file may be the same or different.
Step 103: when the M detection results are consistent, determining, based on the detection results, whether the script file is an obfuscated script file.
That is, when every confusion detection model yields the same detection result for the script file, the type of the script file can be determined. If the detection result is obfuscated, the script file is determined to be an obfuscated script file; if the detection result is non-obfuscated, the script file is determined to be a non-obfuscated script file.
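The consistency check of step 103 amounts to a unanimity test over the M results. A minimal Python sketch, assuming each detection result is represented as a boolean (`True` for obfuscated); the function name `unanimous_verdict` is hypothetical:

```python
from typing import Optional, Sequence


def unanimous_verdict(results: Sequence[bool]) -> Optional[bool]:
    """Return the shared result if all M results agree, otherwise None."""
    if len(results) < 2:
        raise ValueError("M must be at least 2")
    first = results[0]
    if all(result == first for result in results):
        return first  # True: obfuscated, False: non-obfuscated
    return None  # inconsistent results: the type cannot be determined
```

Returning `None` rather than a majority vote reflects the requirement that all M results be consistent before the script file is classified.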
In some embodiments, after determining whether the script file is an obfuscated script file based on the detection results, the method further includes: taking the detection result as label information of the script file, where the label information indicates whether the script file is an obfuscated script file.
In some embodiments, after taking the detection result as the label information of the script file, the method further includes: storing the script file and the corresponding label information in a script file set with label information; and retraining the M different confusion detection models by using the script file and the label information to obtain M retrained confusion detection models.
In some embodiments, the method further includes: when the M detection results are inconsistent, treating the script file as a script file to be detected again, to await the next detection.
Here, the execution subject of steps 101 to 103 may be a processor of the confusion detection apparatus of the script file.
Thus, the script file is detected by the M confusion detection models to obtain M detection results, and when the M detection results are consistent, the script file is determined to be an obfuscated script file or a non-obfuscated script file according to the detection results. Script files with consistent detection results can subsequently be used to train the confusion detection models repeatedly, so that a script file to be detected can be accurately classified as obfuscated or non-obfuscated. This addresses the prior-art problem that only a specific category of script file can be identified, that is, the problem of overfitting of the confusion detection model.
FIG. 3 is a second flowchart of the confusion detection method for a script file in the embodiment of the present application. As shown in FIG. 3, the method may specifically include:
Step 301: acquiring M different confusion detection models and acquiring a script file to be detected; wherein M is a positive integer greater than or equal to 2;
It should be noted that the script file to be detected is a script file without label information, that is, it has not been marked as an obfuscated script file or a non-obfuscated script file. Here, the script file may be a JavaScript script file.
It should be noted that the M confusion detection models are all trained detection models, and are used to detect whether the script file is an obfuscated script file.
In some embodiments, the method further includes training the confusion detection models. That is, before obtaining the M different confusion detection models, the method further includes: acquiring a script file set with label information, where the script file set includes at least one script file to be trained and label information of each script file to be trained; performing feature extraction on each script file to be trained in the script file set by using M feature extraction models to obtain M feature extraction results corresponding to each script file to be trained; and training M preset detection models based on the M feature extraction results and the label information to obtain the M different confusion detection models.
In some embodiments, the M different confusion detection models are each generated by training based on different feature engineering.
In some embodiments, the M different confusion detection models are each generated by training based on different machine learning algorithms.
In some embodiments, the M different confusion detection models are each generated by training based on different feature engineering and different machine learning algorithms.
Step 302: detecting the script file by using each confusion detection model in the M different confusion detection models to obtain M detection results;
It should be noted that the detection result indicates whether the script file is an obfuscated script file.
Here, the detection results of different confusion detection models for the same script file may be the same or different.
Step 303: when the M detection results are consistent, determining, based on the detection results, whether the script file is an obfuscated script file;
That is, when every confusion detection model yields the same detection result for the script file, the type of the script file can be determined. If the detection result is obfuscated, the script file is determined to be an obfuscated script file; if the detection result is non-obfuscated, the script file is determined to be a non-obfuscated script file.
In some embodiments, after determining whether the script file is an obfuscated script file based on the detection results, the method further includes: taking the detection result as label information of the script file, where the label information indicates whether the script file is an obfuscated script file.
That is, if the detection result is obfuscated, the script file is determined to be an obfuscated script file and its label information is obfuscated; if the detection result is non-obfuscated, the script file is determined to be a non-obfuscated script file and its label information is non-obfuscated.
In some embodiments, after taking the detection result as the label information of the script file, the method further includes: storing the script file and the corresponding label information in a script file set with label information; and retraining the M different confusion detection models by using the script file and the label information to obtain M retrained confusion detection models.
That is, once the label information of the script file has been determined, the script file and its label information are stored in the script file set with label information, so that the M different confusion detection models can subsequently be trained on them.
In practical applications, after the label information of a plurality of script files has been determined, the M different confusion detection models are trained on those script files and their label information to obtain M trained confusion detection models. After the label information of the next batch of script files has been determined, the M confusion detection models trained in the previous round are trained again to obtain M new confusion detection models, and so on.
Step 304: when the M detection results are inconsistent, treating the script file as a script file to be detected again, to await the next detection.
That is, when the M detection results include both obfuscated and non-obfuscated results, it cannot be determined whether the script file is an obfuscated script file, and the script file needs to be treated again as a script file to be detected, to await the next detection.
Illustratively, if the value of M is 4, 4 different confusion detection models detect the script file to obtain 4 detection results. If these results include both obfuscated and non-obfuscated results, the 4 detection results are inconsistent and it cannot be determined whether the script file is an obfuscated script file. Only when the 4 detection results are all obfuscated or all non-obfuscated is the script file determined to be an obfuscated script file or a non-obfuscated script file, respectively.
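One round of steps 302 to 304 over a batch of unlabeled script files might look like the following Python sketch. It assumes each trained model is exposed as a callable returning `True` for obfuscated; the names `Detector` and `run_detection_round`, and the use of a dictionary as the labeled script file set, are illustrative choices rather than anything specified in the source.

```python
from typing import Callable, Dict, List, Sequence

Detector = Callable[[str], bool]  # hypothetical interface: True means "obfuscated"


def run_detection_round(
    detectors: Sequence[Detector],
    pending: Sequence[str],
    labeled: Dict[str, bool],
) -> List[str]:
    """Label scripts on which all detectors agree; return the rest for re-detection."""
    if len(detectors) < 2:
        raise ValueError("M must be at least 2")
    requeued: List[str] = []
    for script in pending:
        results = [detect(script) for detect in detectors]
        if all(results) or not any(results):
            labeled[script] = results[0]  # unanimous result becomes the label information
        else:
            requeued.append(script)  # mixed results: wait for the next detection
    return requeued
```

After each round, the entries added to `labeled` would be used to retrain the models, and the returned list would be detected again with the retrained models.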
Thus, the script file is detected by the M confusion detection models to obtain M detection results, and when the M detection results are consistent, the script file is determined to be an obfuscated script file or a non-obfuscated script file according to the detection results. Script files with consistent detection results can subsequently be used to train the confusion detection models repeatedly, so that a script file to be detected can be accurately classified as obfuscated or non-obfuscated. This addresses the prior-art problem that only a specific category of script file can be identified, that is, the problem of overfitting of the confusion detection model.
Based on the foregoing embodiment, fig. 4 is a schematic diagram of a second process of training 2 different confusion detection models in the embodiment of the present application, and as shown in fig. 4, the specific steps are as follows:
step 401: acquiring a plurality of script files without label information to form a script file set to be detected;
step 402: obtaining a model 1;
step 403: performing lexical feature extraction on each script file in the script file set to be detected to obtain a third feature vector corresponding to each script file;
step 404: taking the third feature vector corresponding to each script file as the input of the model 1, and outputting a first detection result;
step 405: obtaining a model 2;
step 406: extracting black and white words of each script file in the script file set to be detected to obtain a fourth feature vector corresponding to each script file;
step 407: taking the fourth feature vector corresponding to each script file as the input of the model 2, and outputting a second detection result;
step 408: judging whether the first detection result is consistent with the second detection result; if yes, go to step 409; if not, go to step 401;
Here, if the two detection results are inconsistent, the tag information of the corresponding script file cannot be determined, and that script file must again be treated as a script file to be detected.
If the two detection results are consistent, the tag information of the corresponding script file can be determined, and step 409 is executed next.
Step 409: determining label information of a corresponding script file, and storing the label information to a script file set with the label information;
step 410: iteratively training a model 1;
Specifically, each script file is detected by using model 1 to obtain corresponding predicted tag information, and the predicted tag information is compared with the real tag information. The ratio of the correctly predicted tag information to all the predicted tag information, namely the accuracy of the first predicted tag information, is calculated. If this accuracy is smaller than a set probability threshold, the parameters in model 1 are modified and the accuracy is calculated again. Once the accuracy is greater than or equal to the set probability threshold, the parameters in model 1 are no longer modified and the iteration terminates. Execution then continues with step 402.
Step 411: model 2 is iteratively trained.
Specifically, each script file is detected by using model 2 to obtain corresponding predicted tag information, and the predicted tag information is compared with the real tag information. The ratio of the correctly predicted tag information to all the predicted tag information, namely the accuracy of the second predicted tag information, is calculated. If this accuracy is smaller than a set probability threshold, the parameters in model 2 are modified and the accuracy is calculated again. Once the accuracy is greater than or equal to the set probability threshold, the parameters in model 2 are no longer modified and the iteration terminates. Execution then continues with step 405.
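The stopping rule of steps 410 and 411 can be sketched as follows. The text does not say how the parameters are modified, so this sketch assumes a stand-in model with a single numeric cutoff that is nudged by a fixed step; the names `accuracy` and `train_until_accurate` and the `max_rounds` guard are likewise assumptions added for illustration.

```python
from typing import Sequence, Tuple

# Assumption: each sample is (feature score, real label), True meaning obfuscated.
Sample = Tuple[float, bool]


def accuracy(cutoff: float, samples: Sequence[Sample]) -> float:
    """Ratio of correctly predicted labels to all predicted labels."""
    correct = sum((score > cutoff) == label for score, label in samples)
    return correct / len(samples)


def train_until_accurate(
    samples: Sequence[Sample],
    target: float,
    start: float = 0.0,
    step: float = 1.0,
    max_rounds: int = 1000,
) -> Tuple[float, float]:
    """Modify the cutoff until accuracy reaches the set threshold `target`."""
    cutoff = start
    for _ in range(max_rounds):
        acc = accuracy(cutoff, samples)
        if acc >= target:
            return cutoff, acc  # parameters are no longer modified
        cutoff += step  # assumed update rule; the source leaves this open
    raise RuntimeError("accuracy threshold not reached within max_rounds")
```

The `max_rounds` guard is not in the described flow; it is included only so that the sketch terminates when the threshold cannot be reached.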
Thus, the script file is detected by M confusion detection models to obtain M detection results, and when the M detection results are consistent, the script file is determined to be an obfuscated or a non-obfuscated script file according to those results. Script files with consistent detection results can subsequently be used to retrain the confusion detection models, so that a script file to be detected can be classified accurately as obfuscated or non-obfuscated. This addresses the problem in the prior art that only specific categories of script file can be determined, that is, the overfitting of the confusion detection model.
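One round of the Fig. 4 flow (detect with both models, keep only the agreed labels, then retrain on the enlarged labeled set) might look like the sketch below. The `CentroidModel` class is a deliberately tiny stand-in for a trainable confusion detection model, and the two feature functions are hypothetical; none of these names come from the application.

```python
from typing import Callable, List, Sequence, Tuple

Labeled = List[Tuple[str, bool]]  # (script source, True if obfuscated)


class CentroidModel:
    """Stand-in model: 1-D nearest-centroid classifier over one feature."""

    def __init__(self, feature: Callable[[str], float]) -> None:
        self.feature = feature
        self.pos_center = 1.0
        self.neg_center = 0.0

    def fit(self, labeled: Labeled) -> None:
        pos = [self.feature(s) for s, is_obf in labeled if is_obf]
        neg = [self.feature(s) for s, is_obf in labeled if not is_obf]
        if pos:
            self.pos_center = sum(pos) / len(pos)
        if neg:
            self.neg_center = sum(neg) / len(neg)

    def predict(self, script: str) -> bool:
        x = self.feature(script)
        return abs(x - self.pos_center) < abs(x - self.neg_center)


def script_length(script: str) -> float:
    return float(len(script))


def symbol_count(script: str) -> float:
    return float(sum(not c.isalnum() and not c.isspace() for c in script))


def labeling_round(
    models: Sequence[CentroidModel], labeled: Labeled, pending: Sequence[str]
) -> List[str]:
    """Label scripts the models agree on, retrain, and return the rest."""
    still_pending: List[str] = []
    for script in pending:
        votes = {model.predict(script) for model in models}
        if len(votes) == 1:
            labeled.append((script, votes.pop()))  # step 409
        else:
            still_pending.append(script)  # back to the set to be detected
    for model in models:
        model.fit(labeled)  # steps 410 and 411
    return still_pending
```

Calling `labeling_round` repeatedly on its own return value mirrors the loop back to step 401; whether the pending set ever empties depends on the models and data.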
An embodiment of the present application provides an apparatus for detecting confusion of a script file, where fig. 5 is a schematic structural diagram of the apparatus for detecting confusion of a script file in an embodiment of the present application, and as shown in fig. 5, the apparatus includes:
an obtaining unit 501, configured to obtain M different confusion detection models, and obtain a first script file to be detected; wherein M is a positive integer greater than or equal to 2;
a detecting unit 502, configured to detect the script file by using each confusion detection model of the M different confusion detection models, respectively, to obtain M detection results;
a determining unit 503, configured to determine whether the script file is an obfuscated script file based on the detection result when the M detection results are consistent.
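The division into units 501 to 503 can be pictured as three small classes. This is only a structural sketch under the assumption that a confusion detection model is a callable returning True for "obfuscated"; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Detector = Callable[[str], bool]  # assumption: True means obfuscated


@dataclass
class ObtainingUnit:
    """Unit 501: holds the M models and hands over the script to be detected."""

    detectors: Sequence[Detector]

    def obtain(self, script: str) -> Tuple[List[Detector], str]:
        if len(self.detectors) < 2:
            raise ValueError("M must be a positive integer greater than or equal to 2")
        return list(self.detectors), script


class DetectingUnit:
    """Unit 502: runs every model on the script to get M detection results."""

    def detect(self, detectors: Sequence[Detector], script: str) -> List[bool]:
        return [detector(script) for detector in detectors]


class DeterminingUnit:
    """Unit 503: yields a verdict only when the M results are consistent."""

    def determine(self, results: Sequence[bool]) -> Optional[bool]:
        return results[0] if len(set(results)) == 1 else None


def run_apparatus(obtaining: ObtainingUnit, script: str) -> Optional[bool]:
    detectors, script = obtaining.obtain(script)
    results = DetectingUnit().detect(detectors, script)
    return DeterminingUnit().determine(results)
```

A return value of None corresponds to the inconsistent case, in which the script file waits for the next detection.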
In some embodiments, after determining whether the script file is an obfuscated script file based on the detection result, the detection result is used as tag information of the script file; the tag information is used for indicating whether the script file is an obfuscated script file.
In some embodiments, after the detection result is used as the tag information of the script file, the script file and the corresponding tag information are stored in a script file set with tag information; and retraining the M different confusion detection models by using the script file and the label information to obtain M different confusion detection models after training.
In some embodiments, when the M detection results are inconsistent, the script file is again treated as a script file to be detected and waits for the next round of detection.
In some embodiments, the M different confusion detection models are generated based on different feature engineering training, respectively.
In some embodiments, the M different confusion detection models are generated based on different machine learning algorithm training, respectively.
In some embodiments, the M different confusion detection models are generated based on different feature engineering and different machine learning algorithm training, respectively.
In some embodiments, the script file comprises a JavaScript file.
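As an illustration of what "different feature engineering" could mean for JavaScript, the sketch below builds two feature views in the spirit of the lexical features and black and white words used for models 1 and 2 in the Fig. 4 flow. The specific statistics and both word lists are assumptions chosen for the example; the application does not enumerate them here.

```python
import re
from typing import List

# Hypothetical word lists; the source does not list its black and white words.
BLACK_WORDS = ("eval", "unescape", "fromCharCode", "atob")
WHITE_WORDS = ("function", "return", "document", "console")

IDENTIFIER = r"[A-Za-z_$][\w$]*"


def lexical_features(source: str) -> List[float]:
    """Assumed lexical view: token count, mean identifier length,
    longest line length, and share of symbol characters."""
    tokens = re.findall(IDENTIFIER + r"|\d+|\S", source)
    identifiers = [t for t in tokens if re.match(r"[A-Za-z_$]", t)]
    avg_ident_len = sum(map(len, identifiers)) / max(len(identifiers), 1)
    longest_line = max((len(line) for line in source.splitlines()), default=0)
    symbols = sum(not c.isalnum() and not c.isspace() for c in source)
    return [
        float(len(tokens)),
        avg_ident_len,
        float(longest_line),
        symbols / max(len(source), 1),
    ]


def word_features(source: str) -> List[int]:
    """Assumed black/white word view: occurrence count of each listed word."""
    words = re.findall(IDENTIFIER, source)
    return [words.count(w) for w in BLACK_WORDS + WHITE_WORDS]
```

Because the two views look at different properties of the same script, models trained on them are less likely to make the same mistake, which is what makes their agreement informative.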
Thus, the script file is detected by M different confusion detection models to obtain M detection results. When the M detection results are consistent, the type of the script file, namely obfuscated or non-obfuscated, is determined according to those results; when the M detection results are inconsistent, the script file waits to be detected again until its type is determined. The confusion detection method can determine the category of any script file, which addresses the problem in the prior art that only specific categories of script file can be determined, that is, the overfitting of existing confusion detection models.
An embodiment of the present application provides confusion detection equipment for a script file, where fig. 6 is a schematic structural diagram of the confusion detection equipment for the script file in the embodiment of the present application, and as shown in fig. 6, the confusion detection equipment includes: a processor 601 and a memory 602 configured to store computer programs executable on the processor;
wherein, the processor 601 is configured to execute the steps of the confusion detection method of the script file in the foregoing embodiments when running the computer program.
In practice, the components of the confusion detection equipment of the script file are coupled together by a bus system 603, as shown in FIG. 6. The bus system 603 enables connection and communication between these components. In addition to a data bus, the bus system 603 includes a power bus, a control bus, and a status signal bus. For clarity, however, all of these buses are labeled as bus system 603 in FIG. 6.
In practical applications, the processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices for implementing the above processor functions may be other devices, and the embodiments of the present application are not limited in particular.
The memory may be a volatile memory, such as a random-access memory (RAM); or a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory. The memory provides instructions and data to the processor.
The embodiment of the application also provides a computer readable storage medium for storing the computer program.
Optionally, the computer-readable storage medium may be applied to any one of the methods in the embodiments of the present application, and the computer program enables a computer to execute corresponding processes implemented by a processor in each method in the embodiments of the present application, which is not described herein again for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may be separately used as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit. Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. A method for detecting confusion in a script file, the method comprising:
acquiring M different confusion detection models and acquiring a script file to be detected; wherein M is a positive integer greater than or equal to 2;
detecting the script file by using each confusion detection model in the M different confusion detection models to obtain M detection results;
and when the M detection results are consistent, determining whether the script file is an obfuscated script file or not based on the detection results.
2. The method of claim 1, wherein after determining whether the script file is an obfuscated script file based on the detection result, the method further comprises:
taking the detection result as the label information of the script file; the tag information is used for indicating whether the script file is an obfuscated script file.
3. The method according to claim 2, wherein after the detection result is used as the tag information of the script file, the method further comprises:
storing the script file and the corresponding label information in a script file set with label information;
and retraining the M different confusion detection models by using the script file and the label information to obtain M different confusion detection models after training.
4. The method of claim 1, further comprising:
when the M detection results are inconsistent, treating the script file again as a script file to be detected, to wait for the next round of detection.
5. The method according to any one of claims 1 to 4,
the M different confusion detection models are generated based on different feature engineering training respectively.
6. The method according to any one of claims 1 to 4,
the M different confusion detection models are generated based on different machine learning algorithm training respectively.
7. The method according to any one of claims 1 to 4,
the M different confusion detection models are generated based on different feature engineering and different machine learning algorithm training respectively.
8. The method of claim 1,
the script file comprises a JavaScript file.
9. An apparatus for detecting confusion in a script file, the apparatus comprising:
an acquisition unit, configured to acquire M different confusion detection models and acquire a first script file to be detected; wherein M is a positive integer greater than or equal to 2;
a detection unit, configured to detect the first script file by using each confusion detection model in the M different confusion detection models to obtain M detection results;
and a determining unit, configured to determine whether the script file is an obfuscated script file based on the detection results when the M detection results are consistent.
10. Confusion detection equipment for a script file, the equipment comprising: a processor and a memory configured to store a computer program capable of running on the processor,
wherein the processor is configured to perform the steps of the method of any one of claims 1 to 8 when running the computer program.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202110401461.XA 2021-04-14 2021-04-14 Confusion detection method, device, equipment and storage medium for script file Pending CN113190847A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110401461.XA CN113190847A (en) 2021-04-14 2021-04-14 Confusion detection method, device, equipment and storage medium for script file

Publications (1)

Publication Number Publication Date
CN113190847A (en) 2021-07-30

Family

ID=76975756

Country Status (1)

Country Link
CN (1) CN113190847A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination