CN110198291B - Webpage backdoor detection method, device, terminal and storage medium - Google Patents


Info

Publication number
CN110198291B
CN110198291B
Authority
CN
China
Prior art keywords
detected
classification
sample set
black
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810226945.3A
Other languages
Chinese (zh)
Other versions
CN110198291A (en)
Inventor
张壮
董志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201810226945.3A priority Critical patent/CN110198291B/en
Publication of CN110198291A publication Critical patent/CN110198291A/en
Application granted granted Critical
Publication of CN110198291B publication Critical patent/CN110198291B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441Countermeasures against malicious traffic
    • H04L63/145Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/02Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Virology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a webpage backdoor detection method, apparatus, terminal and storage medium. The method comprises: acquiring a file to be detected; extracting features of the file to be detected; and inputting the features of the file to be detected into a support vector machine prediction model, which outputs the detection result. The prediction model is generated by: calculating word vectors of a black and white sample set; colliding the black and white sample sets to obtain classification features capable of classifying them; and expanding the classification features using the word vectors. The invention extracts features automatically, and the features carry semantic information, making them more objective and effective; the method requires little manual intervention, is efficient, and achieves a good level of precision in terms of detection rate and false alarm rate.

Description

Webpage backdoor detection method, device, terminal and storage medium
Technical Field
The invention relates to the technical field of information security, and in particular to a webpage backdoor detection method, apparatus, terminal and storage medium.
Background
As the name implies, in "webshell" the "web" refers to a web server and the "shell" is a script program written in a scripting language; a webshell is thus a management tool for the web server, with the authority to operate it, and is also called webadmin. Webshells are generally used by website administrators for website management, server management and similar purposes. Because a webshell is relatively powerful, however, it can upload and download files, view databases, and even invoke system-related commands on the server (such as creating users or modifying and deleting files), so it is often exploited by hackers: a hacker uploads a self-written webshell into the page directory of the web server through some upload method and then intrudes by accessing that page, or performs intrusion operations on the server directly through related tools connected to an inserted one-line ("one-sentence") backdoor.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a webpage backdoor detection method, apparatus, terminal and storage medium that achieve automatic feature selection and incorporate semantic information, so that the features are more objective and effective.
In order to solve the above technical problem, in a first aspect, the present invention provides a method for detecting a backdoor of a web page, including:
acquiring a file to be detected;
extracting the characteristics of the file to be detected;
inputting the characteristics of the file to be detected into a support vector machine prediction model, and outputting a detection result through the prediction model;
the generation method of the prediction model comprises the following steps:
calculating a word vector of the black and white sample set;
colliding the set of black and white samples to obtain classification features capable of classifying the set of black and white samples;
extending the classification features using the word vector.
In a second aspect, the present invention provides a web page backdoor detection apparatus, including:
the acquisition module is used for acquiring the file to be detected;
the extraction module is used for extracting the characteristics of the file to be detected;
the detection module is used for inputting the characteristics of the file to be detected into a support vector machine prediction model and outputting a detection result through the prediction model;
the prediction model generation module comprises a feature selection module used for selecting classification features from the sample set, and the feature selection module comprises:
the word vector calculation module is used for calculating word vectors of the black and white sample set;
a black and white sample collision module for colliding the black and white sample set to obtain a classification feature capable of classifying the black and white sample set;
a feature expansion module for expanding the classification features using the word vectors.
In a third aspect, the present invention provides a terminal, including: a processor and a memory, wherein the processor is configured to invoke and execute a program stored in the memory, and the memory is configured to store a program configured to:
acquiring a file to be detected;
extracting the characteristics of the file to be detected;
inputting the characteristics of the file to be detected into a support vector machine prediction model, and outputting a detection result through the prediction model;
the generation method of the prediction model comprises the following steps:
calculating a word vector of the black and white sample set;
colliding the set of black and white samples to obtain classification features capable of classifying the set of black and white samples;
extending the classification features using the word vector.
In a fourth aspect, the present invention provides a computer storage medium having computer-executable instructions stored therein, the computer-executable instructions being loaded by a processor and performing the steps of:
acquiring a file to be detected;
extracting the characteristics of the file to be detected;
inputting the characteristics of the file to be detected into a support vector machine prediction model, and outputting a detection result through the prediction model;
the generation method of the prediction model comprises the following steps:
calculating a word vector of the black and white sample set;
colliding the set of black and white samples to obtain classification features capable of classifying the set of black and white samples;
extending the classification features using the word vector.
The embodiment of the invention has the following beneficial effects:
the method detects the file to be detected by establishing a prediction model, wherein the establishment of the prediction model comprises the selection of classification characteristics, and specifically, the characteristics are selected by a word vector method. The invention can automatically realize the automatic extraction of the characteristics and has semantic information, so that the characteristics are more objective and effective; the webshell detection capability is improved, and the method is more effective for resisting deformation of the webshell; the method has the advantages of less manual intervention, high efficiency and certain precision in the aspects of detection rate and false alarm rate.
Drawings
FIG. 1 is a flowchart of a machine-learning-based detection method provided by an embodiment of the present invention;
FIG. 2 is a diagram of an application scenario provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a feature selection method provided by an embodiment of the present invention;
FIG. 4 is a flowchart of a model training method provided by an embodiment of the present invention;
FIG. 5 is a block diagram of a model training process provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a model generation method provided by an embodiment of the present invention;
FIG. 7 is a flowchart of a webpage backdoor detection method provided by an embodiment of the present invention;
FIG. 8 is a block diagram of a webpage backdoor detection process provided by an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a webpage backdoor detection apparatus provided by an embodiment of the present invention;
FIG. 10 is a block diagram of a prediction model generation module provided by an embodiment of the present invention;
FIG. 11 is a schematic diagram of a feature selection module provided by an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a model training module provided by an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a terminal provided by an embodiment of the present invention;
FIG. 14 is a graph of experimental results of sample testing provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a few embodiments of the invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The following explanations will be made first with respect to the terms involved in this specification:
webshell: a command execution environment in the form of a web page file such as asp, php, jsp, or cgi may also be referred to as a web page backdoor. After a hacker invades a website, the asp or php backdoor file and the normal webpage file in the WEB directory of the website server are mixed together, and then the asp or php backdoor can be accessed by using a browser to obtain a command execution environment, so that the purpose of controlling the website server is achieved.
Machine learning: a multi-field cross discipline relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The special research on how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer.
Deep learning: a family of machine learning methods based on representation learning of data. An observation (e.g., an image) can be represented in many ways, such as a vector of per-pixel intensity values, or more abstractly as a set of edges, regions of particular shapes, and so on. Some tasks (e.g., face recognition or facial expression recognition) are easier to learn from examples when suitable representations are used. The benefit of deep learning is that it replaces manual feature engineering with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction.
Word vector: through training, each word of a language is mapped to a short vector of fixed length ("short" here is relative to the "long" one-hot representation). Putting all of these vectors together forms a word vector space in which each vector is a point; by introducing a "distance" on this space, the lexical and semantic similarity between words can be judged from the distance between their vectors. Word vector techniques convert words into dense vectors, and similar words have similar word vectors. The two most common uses of word vectors are:
(1) used directly in the input layer of a neural network model;
(2) used to augment existing models as an auxiliary feature.
SVM (Support Vector Machine): a common discriminative method. In the field of machine learning, the SVM is a supervised learning model generally used for pattern recognition, classification and regression analysis. The main ideas of the Support Vector Machine (SVM) can be summarized in two points:
(1) It analyzes the linearly separable case; for the linearly inseparable case, a nonlinear mapping algorithm is used to transform the linearly inseparable samples of the low-dimensional input space into a high-dimensional feature space in which they become linearly separable, so that a linear algorithm can be applied in that space to analyze the nonlinear features of the samples.
(2) Based on the theory of structural risk minimization, it constructs the optimal separating hyperplane in the feature space, so that the learner is globally optimized and the expected risk over the whole sample space satisfies a certain upper bound with a certain probability.
An N-gram is a language model commonly used in large-vocabulary continuous speech recognition. It is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other word, so that the probability of a complete sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting, in a corpus, how often N words occur together. Binary Bi-grams and ternary Tri-grams are the most commonly used. The N-gram is a very important concept in natural language processing (NLP): based on a given corpus, N-grams can be used to predict or evaluate whether a sentence is reasonable. Another use of N-grams is to evaluate the degree of difference between two character strings.
An N-gram is also a concept in computational linguistics and probability theory, referring to a sequence of N items in a given piece of text or speech. The items can be syllables, letters, words or base pairs, and N-grams are usually taken from text or a corpus. For example, for a short Chinese phrase meaning "are you on vacation today", its Bi-grams are the overlapping pairs of adjacent characters of the phrase. The reason for building such a language model is the idea that, within the whole language environment, the probability of a sentence T is composed of the probabilities of the N items that make it up, as shown in the following formula:
$$P(T) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2)\cdots P(w_n \mid w_1 w_2 \cdots w_{n-1})$$
The above formula is difficult to apply directly in practice. The Markov model assumes that the occurrence of a word depends only on the word(s) immediately preceding it, which greatly simplifies the formula; for a Bi-gram model:
$$P(T) \approx P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2)\cdots P(w_n \mid w_{n-1})$$
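For illustration only, the following Python sketch shows character-level N-gram extraction as described above; the function name and sample string are examples and are not part of the patented method.

```python
# Illustrative only: character-level N-gram extraction; the function name and
# sample string are examples, not part of the patented method.
def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("eval($_POST", 2))  # Bi-grams: ['ev', 'va', 'al', ...]
print(char_ngrams("eval($_POST", 3))  # Tri-grams: ['eva', 'val', 'al(', ...]
```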
The difficulties in the webshell detection process are that the code can be written flexibly and has many possible variants; that part of the code of normal script applications is similar to webshell code; that features are difficult to extract; and that the requirements on the operating environment make virtual-machine detection difficult to some extent.
Prior-art methods for detecting webshells fall into two types: machine-learning methods and traditional methods. FIG. 1 shows a flow diagram of the machine-learning type of method, which includes obtaining script files, data preprocessing, feature extraction, model training and classification.
For machine-learning methods the overall process is the same; in general, feature extraction is the main difficulty and also has a great influence on the final classification result. Feature extraction in machine-learning approaches allows great flexibility, and there are two main types:
(1) manually extracting features, performing machine learning training, and then obtaining a model for classification prediction;
(2) compiling the script to obtain byte codes, carrying out N-gram feature extraction on the byte codes, and then carrying out feature learning and training.
Manual extraction is highly subjective and inefficient, and its detection effect on unknown samples is not necessarily good. The features extracted by the N-gram method are discrete word units without inherent semantic attributes; webshell-type samples, however, are by nature written in a scripting language, in which continuous semantic relationships exist between features, so such features lose a great deal of information and the robustness and accuracy of detection are reduced to a certain extent.
Conventional methods include the regular expression method and the virtual machine detection method, wherein:
(1) Regular expression method:
Advantages: fast, with high accuracy;
Disadvantages: against flexible and variable webshells, regular expressions have poor detection capability.
(2) Virtual machine detection method:
Advantages: it has a certain heuristic capability, and for different webshells only the relevant behaviors need to be detected;
Disadvantages: operating efficiency is low, and because a webshell has many parameters, the simulation cannot detect some of them, resulting in missed reports.
First of all, the webpage backdoor detection method of the invention can select features automatically, so that the features are more objective and effective, the method is more effective against webshell variants, manual intervention is reduced, and efficiency is high.
The embodiment of the invention can be applied to a scenario formed by the server 201 and the terminal 202 as shown in FIG. 2. The server 201 communicates with the terminal 202 over a network, which includes but is not limited to a wide area network, a metropolitan area network or a local area network; the terminal 202 includes but is not limited to a personal computer (PC), a mobile phone, a tablet computer, a webpage backdoor detection device, and the like. The webpage backdoor detection method of the invention may be executed by the server 201, by the terminal 202, or by the server 201 and the terminal 202 together. The terminal 202 may execute the webpage backdoor detection method of the embodiment itself, or the method may be executed by a client installed on the terminal.
It should be noted that the hardware environment shown in fig. 2 may further include other hardware modules according to requirements, and the web page backdoor detection method of the present invention may also be executed by other hardware modules, which is not limited in this embodiment.
The following describes a web page backdoor detection method according to the present invention with specific examples.
Please refer to FIG. 3, which shows a schematic flowchart of feature selection by deep learning. This process is mainly used for selecting classification features capable of distinguishing the different types in a sample set, and specifically includes:
s301, calculating word vectors of the black and white sample set.
First, a sample set needs to be obtained for training; the sample set comprises black samples and white samples, where the black samples are a set of webshells and the white samples are a set of normal files.
The calculating the word vector of the black and white sample set specifically includes:
s3011, segmenting the text of the black and white sample;
s3012, counting the word frequency of each word in the text;
s3013, performing Huffman coding according to the word frequency;
s3014, performing word vector training on the text according to the Huffman coding.
Word vectors of the sample set can be calculated with the word2vec method. word2vec is an efficient tool that represents words as real-valued vectors; using ideas from deep learning, it reduces the processing of text content, through training, to vector operations in a K-dimensional vector space, where similarity in the vector space can represent semantic similarity of the text. The word vectors output by word2vec can be used for many NLP tasks, such as clustering, synonym finding and part-of-speech analysis. Viewed differently, with words as features, word2vec maps the features into a K-dimensional vector space and can find deeper feature representations for text data. word2vec uses the Distributed Representation style of word vectors; its basic idea is to map each word, through training, to a K-dimensional real-valued vector (K is generally a hyper-parameter of the model) and to judge semantic similarity between words by the distance between their vectors (e.g., cosine similarity or Euclidean distance).
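As a rough sketch of steps S3011-S3014 (word segmentation, word-frequency counting, Huffman coding and word-vector training), the following uses the gensim library, whose hierarchical-softmax mode (hs=1) builds a Huffman tree from the word frequencies internally; the toy token lists stand in for segmented sample texts and are assumptions made for the example, not the patent's exact procedure.

```python
# Minimal sketch, assuming gensim >= 4.x; the toy token lists stand in for the
# segmented texts of the black and white samples.
from collections import Counter
from gensim.models import Word2Vec

samples = [
    ["eval", "base64_decode", "$_POST", "echo"],   # tokens from a black sample
    ["assert", "eval", "$_REQUEST"],               # tokens from a black sample
    ["echo", "include", "print", "header"],        # tokens from a white sample
]

# Step S3012: word frequencies (hierarchical softmax builds its Huffman tree
# from these frequencies internally).
word_freq = Counter(tok for doc in samples for tok in doc)

# Steps S3013-S3014: training with hs=1 uses Huffman-coded hierarchical softmax.
model = Word2Vec(sentences=samples, vector_size=100, window=5,
                 min_count=1, hs=1, sg=1, epochs=50)

print(word_freq["eval"], model.wv["eval"].shape)   # e.g. 2 (100,)
```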
S302, the black and white sample set is collided to obtain classification characteristics capable of classifying the black and white sample set.
The collision mentioned here specifically means: computing the difference set of the black samples and the white samples and finding the points where the two sets differ, so as to find classification features capable of distinguishing black samples from white samples.
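A minimal sketch of this collision (set-difference) idea, assuming each sample has already been reduced to a set of tokens; the token sets shown are made up for the example.

```python
# Illustrative only: "collision" as the difference between the token sets of
# black and white samples; the token sets shown are made up for the example.
def collide(black_token_sets, white_token_sets):
    black_vocab = set().union(*black_token_sets)
    white_vocab = set().union(*white_token_sets)
    # Tokens seen in webshell samples but never in normal files are candidate
    # classification features.
    return black_vocab - white_vocab

candidates = collide(
    [{"eval", "base64_decode", "echo"}, {"assert", "eval"}],
    [{"echo", "include", "print"}],
)
print(candidates)  # {'eval', 'base64_decode', 'assert'} (order may vary)
```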
S303, expanding the classification characteristics by using the word vectors.
Short texts have their own characteristics, mainly short length, sparse features, strong influence of noise, strong real-time requirements, and so on. The most important characteristic is feature sparsity: each text usually contains only a few dozen words or even fewer, so feature extraction yields only a small number of features. For this reason, the features can be expanded by a feature expansion method, such as synonym expansion.
The expanding the classification features using the word vectors specifically includes:
S3031, calculating the distance between the word vector corresponding to the classification feature and the other word vectors;
S3032, selecting words whose distance to the classification feature is smaller than a preset threshold as synonyms of the classification feature, and expanding the synonyms into the classification features.
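A sketch of steps S3031-S3032 using the gensim model from the earlier word-vector sketch; the distance threshold and `topn` value are assumed hyper-parameters, not values given in the patent.

```python
# Sketch of steps S3031-S3032; `model` is the gensim Word2Vec model from the
# word-vector sketch above, and the distance threshold is an assumed value.
def expand_features(model, features, max_distance=0.3, topn=20):
    expanded = set(features)
    for feat in features:
        if feat not in model.wv:
            continue
        for word, similarity in model.wv.most_similar(feat, topn=topn):
            if 1.0 - similarity < max_distance:   # cosine distance below threshold
                expanded.add(word)                # treat the word as a synonym
    return expanded
```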
Referring to FIG. 4, a flowchart of a model training method is shown; referring also to FIG. 5, the method mainly includes:
s401, performing feature extraction on the black sample and the white sample by adopting supervised learning.
S402, inputting the features into a support vector machine for training and learning.
A Support Vector Machine (SVM) solves the classification problem mainly through the principle of structural risk minimization: it separates the two classes of data with the maximum margin by means of an optimal classification hyperplane.
Let the sample set $S = \{(x_i, y_i),\, i = 1, \dots, n\}$ be linearly separable, where $x_i \in R^d$ and $y_i \in \{+1, -1\}$ denotes the class of $x_i$. $g(x) = w \cdot x + b$ is the general form of a linear discriminant function in $d$-dimensional space, and the corresponding classification plane equation is $w \cdot x + b = 0$. Normalizing $g(x)$ so that $|g(x)| \ge 1$ holds for both classes of samples, the classification margin is
$$\frac{2}{\lVert w \rVert}$$
so the larger $\lVert w \rVert$ is, the smaller the classification margin. For the classification plane to classify all samples correctly, the following must hold:
$$y_i\left[(w \cdot x_i) + b\right] - 1 \ge 0, \quad i = 1, 2, \dots, n \qquad (3)$$
A classification plane satisfying the above condition is called an optimal classification plane; finding it is equivalent to minimizing, subject to formula (3), the objective function
$$\varphi(w) = \tfrac{1}{2}\lVert w \rVert^{2} \qquad (4)$$
The hyperplanes $H_1$ and $H_2$ that are parallel to the optimal classification plane and pass through the training samples of the two classes closest to it satisfy the equality in formula (3); these training samples are called support vectors. For linearly inseparable samples, a penalty factor $C$ and slack variables $\xi_i$ are introduced, and (4) can be rewritten as
$$\varphi(w, \xi) = \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \qquad (5)$$
Lagrange multipliers $\alpha_i$ $(i = 1, 2, \dots, n)$ are then introduced, converting the Support Vector Machine (SVM) classification problem into a constrained quadratic optimization problem whose solution gives the optimal classification plane. The final solution is
$$w^{*} = \sum_{i=1}^{n} \alpha_i y_i x_i \qquad (6)$$
and the optimal classification function can be written as
$$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x) + b^{*}\Big) \qquad (7)$$
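By way of illustration only, the following Python sketch trains a support vector machine classifier in the sense of formulas (3)-(7) using scikit-learn; the toy token lists, the feature list and the count-based feature encoding are assumptions made for the example, not the patent's exact representation.

```python
# Illustrative training sketch, assuming scikit-learn; the count-based feature
# encoding and toy data are assumptions made for the example.
import numpy as np
from sklearn.svm import SVC

feature_list = ["eval", "base64_decode", "assert", "gzinflate"]  # illustrative

def to_vector(tokens):
    return [tokens.count(f) for f in feature_list]

black_docs = [["eval", "base64_decode", "echo"], ["assert", "eval", "eval"]]
white_docs = [["echo", "include", "print"], ["print", "include", "header"]]

X = np.array([to_vector(d) for d in black_docs + white_docs])
y = np.array([1, 1, 0, 0])        # 1 = webshell (black), 0 = normal (white)

clf = SVC(kernel="rbf", C=1.0)    # penalty factor C as in formula (5)
clf.fit(X, y)                     # the fitted classifier is the prediction model
```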
and S403, generating the prediction model.
One embodiment of the present invention provides a method for generating a prediction model, please refer to fig. 6, which includes:
s601, selecting classification features from the sample set by adopting deep learning.
Specifically, the selecting classification features from the sample set by using deep learning includes:
s6011, word vectors of a black and white sample set are calculated.
Segmenting the text of the black and white sample;
counting the word frequency of each word in the text;
performing Huffman coding according to the word frequency;
and performing word vector training on the text according to the Huffman coding.
Word vectors of the sample set can be calculated with the word2vec method. word2vec is an efficient tool that represents words as real-valued vectors; using ideas from deep learning, it reduces the processing of text content, through training, to vector operations in a K-dimensional vector space, and similarity in this vector space can be used to represent semantic similarity of the text.
S6012, colliding the black and white sample set to obtain classification features capable of classifying the black and white sample set.
The collision mentioned here specifically means: computing the difference set of the black samples and the white samples and finding the points where the two sets differ, so as to find classification features capable of distinguishing black samples from white samples.
S6013, the classification features are expanded by using the word vectors.
The steps specifically include:
calculating the distance between the word vector corresponding to the classification characteristic and other word vectors;
and selecting words with the distance from the classification features smaller than a preset threshold value as synonyms of the classification features, and expanding the synonyms into the classification features.
S602, using a Support Vector Machine (SVM) to train the model.
The model training by using a Support Vector Machine (SVM) specifically comprises:
performing feature extraction on the black sample and the white sample by adopting supervised learning;
inputting the features into a Support Vector Machine (SVM) for training and learning;
generating the predictive model.
An embodiment of the present invention provides a webpage backdoor detection method; refer to the flowchart of the method in FIG. 7 and also to FIG. 8, which shows the webpage backdoor detection process. The method includes:
s701, acquiring the file to be detected.
The source of the file to be detected can be a webpage server, and for webshell detection, the format of the file to be detected includes but is not limited to asp, php, jsp or cgi.
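As a simple illustration of this acquisition step, the sketch below walks a web directory and collects files with the named extensions; the web-root path is a placeholder, not a path specified by the patent.

```python
# Illustrative only: collecting candidate script files from a web directory;
# the web-root path is a placeholder.
import os

SCRIPT_EXTS = (".asp", ".php", ".jsp", ".cgi")

def collect_candidates(web_root="/var/www/html"):
    for dirpath, _, filenames in os.walk(web_root):
        for name in filenames:
            if name.lower().endswith(SCRIPT_EXTS):
                yield os.path.join(dirpath, name)
```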
S702, extracting the characteristics of the file to be detected.
And extracting corresponding features in the file to be detected according to the classification features.
And S703, inputting the characteristics of the file to be detected into a Support Vector Machine (SVM) prediction model, and outputting a detection result through the prediction model.
The features extracted from the file to be detected are input into the Support Vector Machine (SVM) prediction model, and the prediction model outputs the detection result, i.e., a judgment of whether the file to be detected is a webshell.
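A minimal sketch of steps S702-S703, assuming the trained classifier `clf` and the list of selected classification features `feature_list` from the training sketch above; the whitespace tokenization is a simplification and not the patent's feature-extraction procedure.

```python
# Sketch of steps S702-S703, assuming the trained `clf` and `feature_list`
# from the training sketch above; whitespace tokenization is a simplification.
def detect_webshell(path, clf, feature_list):
    with open(path, encoding="utf-8", errors="ignore") as fh:
        tokens = fh.read().split()
    vector = [[tokens.count(f) for f in feature_list]]
    return bool(clf.predict(vector)[0])   # True: the file is judged a webshell
```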
The generation method of the prediction model adopts the generation method in the above embodiment, and specifically includes:
and selecting classification features from the sample set by adopting deep learning, and performing model training by using a Support Vector Machine (SVM).
Further, the selecting classification features from the sample set by adopting deep learning comprises:
s301, calculating word vectors of the black and white sample set.
For the calculation of the word vectors, the text of the black and white samples can be segmented into words; the word frequency of each word in the text is counted; Huffman coding is performed according to the word frequencies; and word vector training is performed on the text according to the Huffman coding to obtain the word vectors.
S302, the black and white sample set is collided to obtain classification characteristics capable of classifying the black and white sample set.
The collision mentioned here specifically means: computing the difference set of the black samples and the white samples and finding the points where the two sets differ, so as to find classification features capable of distinguishing black samples from white samples.
And S303, performing feature expansion on the classification features by using word vectors.
After the word vectors are obtained, all of these vectors are put together to form a word vector space, and each vector is a point in that space; the distance measure between word vectors in this space can represent the "distance" between the corresponding two words, which reflects the similarity between the grammatical and semantic meanings of the two words. Synonyms of the classification features can be obtained as follows:
calculating the distance between the word vector corresponding to the classification characteristic and other word vectors;
and selecting words with the distance from the classification features smaller than a preset threshold value as synonyms of the classification features, and expanding the synonyms into the classification features.
The distance referred to here may be a cosine similarity or a Euclidean distance.
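For reference, plain NumPy versions of the two distance measures mentioned here are shown below; this is a generic sketch, not tied to any particular library's implementation.

```python
# Plain NumPy versions of the two distance measures mentioned above; not tied
# to any particular library's implementation.
import numpy as np

def cosine_distance(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    return float(np.linalg.norm(np.asarray(u, dtype=float) - np.asarray(v, dtype=float)))
```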
Wherein the model training using a Support Vector Machine (SVM) comprises:
s401, performing feature extraction on the black sample and the white sample by adopting supervised learning;
s402, inputting the features into a support vector machine for training and learning;
and S403, generating a prediction model.
It should be noted that the web page backdoor detection method further includes:
and when the file to be detected is confirmed to be webshell after detection, carrying out deep learning on the file to be detected and the sample set again so as to update the prediction model.
Referring to fig. 9, a web page backdoor detection apparatus is shown, which mainly includes:
a prediction model generation module 910 for generating a prediction model;
an obtaining module 920, configured to obtain a file to be detected;
an extracting module 930, configured to extract features of the file to be detected;
the detection module 940 is configured to input the features of the file to be detected into a Support Vector Machine (SVM) prediction model, and output a detection result through the prediction model;
Referring to FIG. 10, the prediction model generation module 910 includes a feature selection module 911 and a model training module 912.
the feature selection module 911 is configured to select a classification feature from a sample set, please refer to fig. 11, where the feature selection module 911 includes:
a word vector calculation module 9111, configured to calculate word vectors of the black and white sample set;
a black-and-white sample collision module 9112, configured to collide the black-and-white sample set to obtain a classification feature capable of classifying the black-and-white sample set;
a feature expansion module 9113 for expanding the classification features using the word vectors.
In particular, the feature augmentation module 9113 further comprises:
the distance calculation module is used for calculating the distance between the word vector corresponding to the classification characteristic and other word vectors;
and the synonym selecting module is used for selecting a word with the distance from the classification characteristic to the classification characteristic smaller than a preset threshold value as the synonym of the classification characteristic and expanding the synonym into the classification characteristic.
Referring to fig. 12, the model training module 912 includes:
the sample characteristic extraction module 9121 is used for performing characteristic extraction on the black sample and the white sample by adopting supervised learning;
a training module 9122, configured to input the features into a support vector machine for training and learning;
a generating module 9123, configured to generate a prediction model according to the training result.
The web page backdoor detection apparatus may further include an updating module 950, where the updating module 950 is configured to perform deep learning on the file to be detected and the sample set again to update the prediction model when the file to be detected is determined to be a webshell after being detected.
An embodiment of the present invention provides a terminal; please refer to FIG. 13. The terminal 1300 may be a computer or a mobile terminal, such as a mobile phone or a tablet computer. At least one application (APP) based on a client/server mechanism is installed in the terminal 1300, and the terminal can establish a communication connection with the server to which the APP belongs and send access requests to that server. The terminal is configured to implement the webpage backdoor detection method provided in the foregoing embodiments; specifically, the terminal may include the webpage backdoor detection apparatus provided in the foregoing embodiments.
The terminal 1300 includes, among other things, a processor 1310, a storage 1320, a memory 1330, a network interface 1340, a display 1350 and an input device 1360. An operating system is stored in the storage 1320 of the terminal 1300, and application programs are also stored in the storage 1320.
The processor 1310 executes various functional applications and data processing by running the applications and modules stored in the storage 1320. The processor 1310 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA) or another programmable logic device, etc.
The memory 1330 provides an environment for the operation of the applications in the storage 1320. The memory 1330 typically comprises semiconductor memory units, including Random Access Memory (RAM), Read Only Memory (ROM) and cache memory (CACHE), of which RAM is the most important. The memory 1330 is one of the important components of the terminal; it is the bridge for communicating with the processor 1310, and all programs are executed in the memory 1330.
Network interface 1340 is used for network communications with a server. The display 1350 may be a liquid crystal display or an electronic ink display, and the input device 1360 may be a touch layer covering the display, or may be a button, a trackball, or a touch pad provided on the terminal housing. The input device 1360 may accept input numeric or character information and generate signal inputs related to user settings and function control.
In a specific implementation process, the terminal can detect files on the server periodically or aperiodically, for example, detect and judge newly created web program files on the server.
An embodiment of the present invention further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executing the method steps in the embodiments shown in fig. 3 to 7, and a specific execution process may refer to specific descriptions of the embodiments shown in fig. 3 to 7, which are not described herein again.
In practical application, the method can be applied to a cloud mirror: the cloud mirror client judges whether newly created web program files on the server pose a suspicious risk, uploads a small number of suspected webshell files to the cloud for further detection by the cloud's machine-learning detection engine module, deletes the sample files in real time after detection is finished, and does not extract any private user data during the detection process.
An embodiment of the present invention performs an experiment on a detection result of the web page backdoor detection method provided in the above embodiment, where the experiment result is shown in fig. 14, where experiment parameters in the experiment process are as follows:
Test samples:
Black samples: 8,000; sources: operating mobile phone, friend exchange, and downloading from the network;
White samples: 1,000,000; source: normal web script files.
The experimental results show that, using the webshell file detection method provided by this embodiment, the detection rate for webshells is 98% and the false alarm rate is less than three per hundred thousand. The webpage backdoor detection method therefore has strong webpage backdoor detection capability and a good level of precision in terms of detection rate, false alarm rate and the like.
The invention uses deep learning to extract features and then uses a Support Vector Machine (SVM) to train the prediction model, so the overall amount of sample learning is large, the false alarm rate and missed report rate are low, and a certain heuristic capability is achieved.
The configurations shown in the present embodiment are only partial configurations related to the present application, and do not constitute a limitation on the devices to which the present application is applied, and a specific device may include more or less components than those shown, or combine some components, or have an arrangement of different components. It should be understood that the methods, apparatuses, and the like disclosed in the embodiments may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a division of one logic function, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or unit modules.
Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (6)

1. A webpage backdoor detection method is characterized by comprising the following steps:
acquiring a webpage script file to be detected; the language of the webpage script file to be detected is a script language;
calculating a word vector of the black and white sample set; solving a difference set of a black sample set and a white sample set, and finding out different points of the black sample set and the white sample set; finding a classification feature capable of distinguishing the black sample set from the white sample set based on the different points;
calculating the distance between the word vector corresponding to the classification characteristic and other word vectors; selecting words with the distance from the classification features smaller than a preset threshold value as synonyms of the classification features, and expanding the synonyms into the classification features;
extracting the characteristics of the webpage script file to be detected according to the classification characteristics; continuous semantic connection exists among the characteristics of the webpage script files to be detected;
inputting the characteristics of the webpage script file to be detected into a support vector machine prediction model, and outputting a detection result through the prediction model;
when the webpage script file to be detected is detected and then is confirmed to be webshell, selecting classification characteristics from the webpage script file to be detected and the sample set; performing feature extraction on the black sample and the white sample according to the classification features; inputting the features into the support vector machine prediction model for training and learning; and generating an updated support vector machine prediction model according to the training result.
2. The method according to claim 1, wherein the calculating the word vector of the black and white sample set specifically comprises:
acquiring the black and white sample, and segmenting words of the text of the black and white sample;
counting the word frequency of each word in the text;
performing Huffman coding according to the word frequency;
and performing word vector training on the text according to the Huffman coding.
3. A web page backdoor detection apparatus, comprising:
the acquisition module is used for acquiring the webpage script file to be detected; the language of the webpage script file to be detected is a script language;
the word vector calculation module is used for calculating word vectors of the black and white sample set;
the black and white sample collision module is used for solving a difference set of a black sample set and a white sample set and finding out different points of the black sample set and the white sample set; finding a classification feature capable of distinguishing the black sample set from the white sample set based on the different points;
the distance calculation module is used for calculating the distance between the word vector corresponding to the classification characteristic and other word vectors;
a synonym selecting module, configured to select a word whose distance from the classification feature is smaller than a preset threshold as a synonym of the classification feature, and expand the synonym into the classification feature;
the extraction module is used for extracting the characteristics of the webpage file to be detected according to the classification characteristics; continuous semantic connection exists among the characteristics of the webpage script files to be detected;
the detection module is used for inputting the characteristics of the webpage script file to be detected into a support vector machine prediction model and outputting a detection result through the prediction model;
the prediction model updating module is used for selecting classification characteristics from the webpage script file to be detected and the sample set when the webpage script file to be detected is determined to be webshell after detection; performing feature extraction on the black sample and the white sample according to the classification features; inputting the features into the support vector machine prediction model for training and learning; and generating an updated support vector machine prediction model according to the training result.
4. The apparatus of claim 3, wherein the word vector calculating module comprises:
the word segmentation module is used for acquiring the black and white sample and segmenting words of the text of the black and white sample;
the word frequency counting module is used for counting the word frequency of each word in the text;
the coding module is used for carrying out Huffman coding according to the word frequency;
and the word vector training module is used for carrying out word vector training on the text according to the Huffman coding.
5. A terminal, comprising:
a processor and a memory, wherein the processor is configured to invoke and execute a program stored in the memory, and the memory is configured to store a program configured to:
acquiring a webpage script file to be detected; the language of the webpage script file to be detected is a script language;
calculating a word vector of the black and white sample set; solving a difference set of a black sample set and a white sample set, and finding out different points of the black sample set and the white sample set; finding a classification feature capable of distinguishing the black sample set from the white sample set based on the different points;
calculating the distance between the word vector corresponding to the classification characteristic and other word vectors; selecting words with the distance from the classification features smaller than a preset threshold value as synonyms of the classification features, and expanding the synonyms into the classification features;
extracting the characteristics of the webpage script file to be detected according to the classification characteristics; continuous semantic connection exists among the characteristics of the webpage script files to be detected;
inputting the characteristics of the webpage script file to be detected into a support vector machine prediction model, and outputting a detection result through the prediction model;
when the webpage script file to be detected is detected and then is confirmed to be webshell, selecting classification characteristics from the webpage script file to be detected and the sample set; performing feature extraction on the black sample and the white sample according to the classification features; inputting the features into the support vector machine prediction model for training and learning; and generating an updated support vector machine prediction model according to the training result.
6. A computer storage medium having stored thereon computer-executable instructions, the computer-executable instructions being loaded by a processor and performing the steps of:
acquiring a webpage script file to be detected; the language of the webpage script file to be detected is a script language;
calculating a word vector of the black and white sample set; solving a difference set of a black sample set and a white sample set, and finding out different points of the black sample set and the white sample set; finding a classification feature capable of distinguishing the black sample set from the white sample set based on the different points;
calculating the distance between the word vector corresponding to the classification characteristic and other word vectors; selecting words with the distance from the classification features smaller than a preset threshold value as synonyms of the classification features, and expanding the synonyms into the classification features;
extracting the characteristics of the webpage script file to be detected according to the classification characteristics; continuous semantic connection exists among the characteristics of the webpage script files to be detected;
inputting the characteristics of the webpage script file to be detected into a support vector machine prediction model, and outputting a detection result through the prediction model;
when the webpage script file to be detected is detected and then is confirmed to be webshell, selecting classification characteristics from the webpage script file to be detected and the sample set; performing feature extraction on the black sample and the white sample according to the classification features; inputting the features into the support vector machine prediction model for training and learning; and generating an updated support vector machine prediction model according to the training result.
CN201810226945.3A 2018-03-15 2018-03-15 Webpage backdoor detection method, device, terminal and storage medium Active CN110198291B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810226945.3A CN110198291B (en) 2018-03-15 2018-03-15 Webpage backdoor detection method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810226945.3A CN110198291B (en) 2018-03-15 2018-03-15 Webpage backdoor detection method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN110198291A CN110198291A (en) 2019-09-03
CN110198291B true CN110198291B (en) 2022-02-18

Family

ID=67751079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810226945.3A Active CN110198291B (en) 2018-03-15 2018-03-15 Webpage backdoor detection method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN110198291B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242291A (en) * 2020-04-24 2020-06-05 支付宝(杭州)信息技术有限公司 Neural network backdoor attack detection method and device and electronic equipment
CN111371812B (en) * 2020-05-27 2020-09-01 腾讯科技(深圳)有限公司 Virus detection method, device and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102479298A (en) * 2010-11-29 2012-05-30 北京奇虎科技有限公司 Program identification method and device based on machine learning
CN103839006A (en) * 2010-11-29 2014-06-04 北京奇虎科技有限公司 Program identification method and device based on machine learning
CN105389379A (en) * 2015-11-20 2016-03-09 重庆邮电大学 Rubbish article classification method based on distributed feature representation of text
CN105975857A (en) * 2015-11-17 2016-09-28 武汉安天信息技术有限责任公司 Method and system for deducing malicious code rules based on in-depth learning method
CN106789871A (en) * 2016-11-10 2017-05-31 东软集团股份有限公司 Attack detection method, device, the network equipment and terminal device
KR20170140049A (en) * 2016-06-10 2017-12-20 주식회사 케이티 Method for detecting webshell, server and computer readable medium
CN107659570A (en) * 2017-09-29 2018-02-02 杭州安恒信息技术有限公司 Webshell detection methods and system based on machine learning and static and dynamic analysis

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107315954B (en) * 2016-04-27 2020-06-12 腾讯科技(深圳)有限公司 File type identification method and server
CN106503255B (en) * 2016-11-15 2020-05-12 科大讯飞股份有限公司 Method and system for automatically generating article based on description text

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102479298A (en) * 2010-11-29 2012-05-30 北京奇虎科技有限公司 Program identification method and device based on machine learning
CN103839006A (en) * 2010-11-29 2014-06-04 北京奇虎科技有限公司 Program identification method and device based on machine learning
CN105975857A (en) * 2015-11-17 2016-09-28 武汉安天信息技术有限责任公司 Method and system for deducing malicious code rules based on in-depth learning method
CN105389379A (en) * 2015-11-20 2016-03-09 重庆邮电大学 Rubbish article classification method based on distributed feature representation of text
KR20170140049A (en) * 2016-06-10 2017-12-20 주식회사 케이티 Method for detecting webshell, server and computer readable medium
CN106789871A (en) * 2016-11-10 2017-05-31 东软集团股份有限公司 Attack detection method, device, the network equipment and terminal device
CN107659570A (en) * 2017-09-29 2018-02-02 杭州安恒信息技术有限公司 Webshell detection methods and system based on machine learning and static and dynamic analysis

Also Published As

Publication number Publication date
CN110198291A (en) 2019-09-03

Similar Documents

Publication Publication Date Title
CN111967266B (en) Chinese named entity recognition system, model construction method, application and related equipment
CN112084337B (en) Training method of text classification model, text classification method and equipment
Wang et al. Application of convolutional neural network in natural language processing
WO2022007823A1 (en) Text data processing method and device
US9779085B2 (en) Multilingual embeddings for natural language processing
Wang et al. Common sense knowledge for handwritten chinese text recognition
CN111291195B (en) Data processing method, device, terminal and readable storage medium
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN110633577B (en) Text desensitization method and device
CN110705206A (en) Text information processing method and related device
Rizvi et al. Optical character recognition system for Nastalique Urdu-like script languages using supervised learning
CN110457585B (en) Negative text pushing method, device and system and computer equipment
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN113806493B (en) Entity relationship joint extraction method and device for Internet text data
CN117251551B (en) Natural language processing system and method based on large language model
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN113392179A (en) Text labeling method and device, electronic equipment and storage medium
CN114328919A (en) Text content classification method and device, electronic equipment and storage medium
CN110198291B (en) Webpage backdoor detection method, device, terminal and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN115730237B (en) Junk mail detection method, device, computer equipment and storage medium
CN115840817A (en) Information clustering processing method and device based on contrast learning and computer equipment
WO2021221535A1 (en) System and method for augmenting a training set for machine learning algorithms
CN115146589A (en) Text processing method, device, medium and electronic equipment
Haddad Cognitively Motivated Query Abstraction Model Based on Associative Root-Pattern Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant