CN111488452A - Webpage tampering detection method, detection system and related equipment - Google Patents
- Publication number
- CN111488452A (application CN201910075452.9A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- independent
- feature vector
- text
- semantic feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the invention provides a webpage tampering detection method, a detection system and related equipment, which are used for improving the detection rate of webpage tampering. The method provided by the embodiment of the invention comprises the following steps: extracting each independent text in the webpage to be detected, and calculating semantic feature vectors of each independent text; inputting semantic feature vectors corresponding to all independent texts into a first attention model, and determining a weight corresponding to each independent text and a hidden layer feature vector corresponding to each independent text; multiplying hidden layer feature vectors corresponding to all independent texts by corresponding weights, and then performing addition operation to obtain webpage feature vectors of the webpage to be detected; and inputting the webpage feature vector into a preset webpage classification model to identify whether the webpage to be detected is tampered.
Description
Technical Field
The invention relates to the field of network security detection, in particular to a webpage tampering detection method, a detection system and related equipment.
Background
Judging whether a webpage has been tampered with from the texts in the webpage is the mainstream technology in existing webpage tampering detection. The webpage texts are used as the objects of filtering, detection and modeling, so that illegal text information inserted into the target webpage by an attacker can be found and the webpage tampering is thereby detected.
At present, webpage tampering is mainly detected by keyword matching: whether a webpage has been tampered with is judged from the word frequency of the keywords that are hit. However, an attacker may embed illegal text in an inconspicuous position in the webpage, or may inject illegal text and illegal links into the head or the tail of the webpage. Because the tampering modes vary so much, it is difficult for the traditional method to extract real and effective statistical information and features, so the algorithm produces a large number of misjudgments.
Disclosure of Invention
The embodiment of the invention provides a webpage tampering detection method, a detection system and related equipment, which are used for improving the detection rate of webpage tampering.
A first aspect of an embodiment of the present invention provides a method for detecting webpage tampering, including:
extracting each independent text in the webpage to be detected, and calculating the semantic feature vector of each independent text;
inputting semantic feature vectors corresponding to all independent texts into a first attention model, and determining a weight corresponding to each independent text and a hidden layer feature vector corresponding to each independent text;
multiplying hidden layer feature vectors corresponding to all independent texts by corresponding weights, and then performing addition operation to obtain webpage feature vectors of the webpage to be detected;
and inputting the webpage feature vector into a preset webpage classification model to identify whether the webpage to be detected is tampered.
Optionally, as a possible implementation manner, in the embodiment of the present invention, the calculating the semantic feature vector of each independent text includes:
performing word segmentation processing on each independent text, wherein all words of one independent text form a corresponding word segmentation sequence;
generating word vectors of all the word segments based on a preset word vector model;
and calculating the semantic feature vector of each independent text according to the word vector of each word segment in the word segmentation sequence.
Optionally, as a possible implementation manner, in an embodiment of the present invention, the calculating the semantic feature vector of each independent text according to the word vector of each word segment in the word segmentation sequence includes:
performing a semantic feature vector operation, the semantic feature vector operation comprising: inputting the word vectors corresponding to all the word segments in a word segmentation sequence into a second attention model to obtain a weight corresponding to each word segment and a hidden layer feature vector corresponding to each word segment, multiplying the hidden layer feature vectors corresponding to all the word segments in the word segmentation sequence by the corresponding weights respectively, and then adding the products to obtain the semantic feature vector of the independent text corresponding to the word segmentation sequence;
and repeating the semantic feature vector operation to obtain the semantic feature vector of each independent text.
Optionally, as a possible implementation manner, in the embodiment of the present invention, the extracting each independent text in the web page to be detected includes:
and cutting the webpage to be detected into different independent texts according to the tags in the HTML source file, wherein the text between two different tags is used as one or more independent texts.
Optionally, as a possible implementation manner, in an embodiment of the present invention, the webpage classification model includes:
a logistic regression model, a multilayer perceptron model and a multilayer fully-connected neural network model.
A second aspect of the embodiments of the present invention provides a detection system, which is applied to webpage tampering detection, and includes:
the processing module is used for extracting each independent text in the webpage to be detected and calculating semantic feature vectors of each independent text;
the generating module is used for inputting the semantic feature vectors corresponding to all the independent texts into the first attention model, and determining the weight corresponding to each independent text and the hidden layer feature vector corresponding to each independent text;
the calculation module is used for multiplying hidden layer feature vectors corresponding to all independent texts by corresponding weights and then performing addition operation to obtain the webpage feature vectors of the webpage to be detected;
and the classification module is used for inputting the webpage feature vectors into a preset webpage classification model to identify whether the webpage to be detected is tampered.
Optionally, as a possible implementation manner, the processing module in the embodiment of the present invention includes:
the extraction unit is used for extracting each independent text in the webpage to be detected;
the word segmentation unit is used for performing word segmentation processing on each independent text, forming a corresponding word segmentation sequence by all the word segments of one independent text, and generating word vectors of all the word segments based on a preset word vector model;
and the calculating unit is used for calculating the semantic feature vector of each independent text according to the word vector of each word segment in the word segmentation sequence.
Optionally, as a possible implementation manner, the calculating unit in the embodiment of the present invention includes:
a calculating subunit, configured to perform a semantic feature vector operation, where the semantic feature vector operation includes: inputting the word vectors corresponding to all the word segments in a word segmentation sequence into a second attention model to obtain a weight corresponding to each word segment and a hidden layer feature vector corresponding to each word segment, multiplying the hidden layer feature vectors corresponding to all the word segments in the word segmentation sequence by the corresponding weights respectively, and then adding the products to obtain the semantic feature vector of the independent text corresponding to the word segmentation sequence;
and the control subunit is used for repeating the semantic feature vector operation to obtain the semantic feature vector of each independent text.
Optionally, as a possible implementation manner, in an embodiment of the present invention, the extracting unit includes:
and the extraction subunit is used for cutting the webpage to be detected into different independent texts according to the tags in the HTML source file, wherein the text between two different tags is used as one or more independent texts.
Optionally, as a possible implementation manner, in an embodiment of the present invention, the webpage classification model includes:
a logistic regression model, a multilayer perceptron model and a multilayer fully-connected neural network model.
A third aspect of an embodiment of the present invention provides a computer apparatus, characterized in that the computer apparatus includes a processor, and the processor is configured to implement the steps of the first aspect or of any one of the possible implementations of the first aspect when executing a computer program stored in a memory.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the first aspect or of any one of the possible implementations of the first aspect.
According to the technical scheme, the embodiment of the invention has the following advantages:
In the embodiment of the invention, the detection system can acquire a plurality of independent texts in the webpage to be detected and calculate the semantic feature vector of each independent text. Compared with the prior-art approach of detecting the text as a whole, the detection granularity of the embodiment of the invention is finer and the accuracy is higher. In addition, based on an attention mechanism, the detection system determines a weight and a hidden layer feature vector for each independent text, multiplies the hidden layer feature vectors by the corresponding weights and adds the products to obtain the webpage feature vector, and finally inputs the webpage feature vector into a preset webpage classification model to identify whether the webpage to be detected has been tampered with. Using the attention mechanism to highlight the weight of the key text in the webpage feature vector improves the detection rate of webpage tampering.
Drawings
Fig. 1 is a schematic diagram of an embodiment of a method for detecting webpage tampering in an embodiment of the present invention;
Fig. 2 is a schematic diagram of another embodiment of a method for detecting webpage tampering in an embodiment of the present invention;
Fig. 3 is a schematic flow chart of calculating the semantic feature vector of an independent text in a method for detecting webpage tampering according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an embodiment of a detection system in an embodiment of the present invention;
Fig. 5 is a schematic diagram of another embodiment of a detection system in an embodiment of the present invention;
Fig. 6 is a schematic diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a webpage tampering detection method, a detection system and related equipment, which are used for improving the detection rate of webpage tampering.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Existing work mostly treats webpage tampering as a text classification problem and solves it with mature text classification techniques from natural language processing, such as TF-IDF, naive Bayes and support vector machines. However, these methods typically suffer from at least one of the following problems:
The webpage tampering problem is obviously different from the ordinary text classification problem: an ordinary text classification model is aimed at orderly and regular phrases, sentences, paragraphs, articles and the like.
Webpage tampering takes various forms. An attacker may embed illegal text in an inconspicuous position in a webpage, or may inject a large amount of illegal text and illegal links into the head or tail of a webpage. Because the tampering modes vary so much, it is difficult for traditional methods to extract real and effective statistical information and features, so the algorithm produces a large number of misjudgments and missed judgments.
Aiming at the defects of the existing schemes, the invention designs a webpage tampering detection method based on an attention mechanism. The method obtains the semantic representation of the webpage at two levels, the independent text and the whole webpage, and uses the attention mechanism while obtaining these representations to pick out the more important content of the webpage, thereby improving the relevance ratio of the model.
For ease of understanding, a specific flow in the embodiment of the present invention is described below. Referring to Fig. 1, an embodiment of a method for detecting webpage tampering in the embodiment of the present invention may include:
101. Extracting each independent text in the webpage to be detected, and calculating the semantic feature vector of each independent text;
In order to overcome the above problems, in the embodiment of the invention the detection system acquires a plurality of independent texts in the webpage to be detected according to the layout of the webpage, and uses an attention mechanism to pick out the important independent texts in the webpage content, so that the detection rate of the model is improved.
Specifically, the detection system first needs to extract each independent text in the web page to be detected, and calculates the semantic feature vector of each independent text.
102. Inputting semantic feature vectors corresponding to all independent texts into a first attention model, and determining a weight corresponding to each independent text and a hidden layer feature vector corresponding to each independent text;
In the process of determining the webpage feature vector, a conventional implementation processes the semantic feature vectors of all independent texts without distinction, so some important information is often ignored and tampered webpages go undetected. The embodiment of the invention introduces an attention mechanism that weights each independent text so as to highlight the weight of the key text in the webpage feature vector: a first attention model is preset, the semantic feature vectors corresponding to all the independent texts are input into the first attention model, and the weight and the hidden layer feature vector corresponding to each independent text are determined, which improves the detection rate of webpage tampering. For example, a text whose semantic difference from its context exceeds a threshold may be given a relatively high weight, and a text at the edge of a webpage may also be given a relatively high weight. The specific weight assignment may be set according to the user's requirements and is not limited herein.
The attention mechanism is an information extraction method that can be used in fields such as computer vision and natural language processing. It can learn the importance of different content, so that more resources are devoted to the important area (namely the attention focus) in decision making and other useless information is suppressed. The attention model in the embodiment of the present invention may be implemented based on algorithms such as self-attention and multi-head attention, and the specific implementation manner is not limited herein.
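As an illustration of how an attention model can turn hidden layer feature vectors into weights, the following is a minimal sketch that scores each vector by its dot product with a context vector and normalises the scores with a softmax. This scoring form, the function names and the toy numbers are assumptions made for illustration; the embodiment does not fix a particular attention algorithm.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(hidden_vectors, context_vector):
    """Score each hidden vector against the context vector, then normalise."""
    scores = [
        sum(h_k * c_k for h_k, c_k in zip(hidden, context_vector))
        for hidden in hidden_vectors
    ]
    return softmax(scores)


# Three toy hidden vectors; the second aligns best with the context vector.
hidden = [[0.1, 0.0], [2.0, 1.0], [0.3, 0.2]]
context = [1.0, 1.0]  # hypothetical value; learned during training in practice
weights = attention_weights(hidden, context)
```

In this sketch the text whose hidden vector is most similar to the context vector receives the largest weight, which is the effect the paragraph above describes as focusing on the important area.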
103. Multiplying hidden layer feature vectors corresponding to all independent texts by corresponding weights, and then performing addition operation to obtain webpage feature vectors of the webpage to be detected;
After a weight has been assigned to each independent text, the hidden layer feature vectors corresponding to all the independent texts can be multiplied by the corresponding weights and the products added to obtain the webpage feature vector of the webpage to be detected.
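The multiply-then-add operation of step 103 can be sketched as follows. The function name and the toy values are illustrative assumptions; in the embodiment the weights and hidden layer feature vectors come from the first attention model.

```python
def weighted_sum(hidden_vectors, weights):
    """Multiply each hidden vector by its weight and add the results."""
    if len(hidden_vectors) != len(weights):
        raise ValueError("need exactly one weight per hidden vector")
    dim = len(hidden_vectors[0])
    result = [0.0] * dim
    for vector, weight in zip(hidden_vectors, weights):
        for k in range(dim):
            result[k] += weight * vector[k]
    return result


# Toy hidden layer feature vectors of three independent texts and their weights.
hidden_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.25, 0.25]
page_vector = weighted_sum(hidden_vectors, weights)
```

The webpage feature vector has the same dimension as each hidden layer feature vector, regardless of how many independent texts the webpage contains.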
104. And inputting the webpage feature vector into a preset webpage classification model to identify whether the webpage to be detected is tampered.
After the webpage feature vector of the webpage to be detected is obtained, the detection system can input the webpage feature vector into a preset webpage classification model to identify whether the webpage to be detected is tampered.
The webpage classification model is obtained by training on the feature vectors of webpages that were tampered with in the past. Specifically, the detection system can collect a large number of positive samples (feature vectors of tampered webpages) and negative samples (feature vectors of normal webpages) from the network as training samples, and the training samples are manually classified into two categories, namely tampered webpages and non-tampered webpages.
The specific training process is as follows:
Marking the webpage feature vector as X and the manually assigned label as Y;
Inputting the vector X and the label Y into a classifier model for training. For example, the vector X and the label Y are input into a logistic regression (LR) classifier model, the LR model calculates, according to a preset algorithm, the parameters required to map the vector X to the label Y, and the preset model lr is finally obtained. The model implements the mapping lr: X -> Y from the feature vector set X of the unknown text to the label set Y.
It is understood that the classifier model in this embodiment may be a logistic regression (LR) classifier, a multilayer perceptron model, a multilayer fully-connected neural network model, a support vector machine (SVM) classifier, a convolutional neural network (CNN) classifier, and the like, and is not limited herein.
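The training process above can be illustrated with a minimal logistic regression fitted by batch gradient descent. The optimiser, the hyperparameters and the four made-up feature vectors are assumptions for illustration only; the embodiment only states that the parameters are calculated according to a preset algorithm.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_lr(X, Y, epochs=500, learning_rate=0.5):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(X, Y):
            error = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b) - y
            for k in range(dim):
                grad_w[k] += error * x[k]
            grad_b += error
        w = [wk - learning_rate * g / len(X) for wk, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Return 1 (tampered) or 0 (not tampered) for one feature vector."""
    probability = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)
    return 1 if probability >= 0.5 else 0


# Toy, made-up webpage feature vectors X and manually assigned labels Y.
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
Y = [1, 1, 0, 0]
w, b = train_lr(X, Y)
predictions = [predict(w, b, x) for x in X]
```

The trained pair (w, b) plays the role of the model lr, and predict applies the mapping lr: X -> Y to a new webpage feature vector.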
In the embodiment of the invention, the detection system can acquire a plurality of independent texts in the webpage to be detected and calculate the semantic feature vector of each independent text. Compared with the prior-art approach of detecting the text as a whole, the detection granularity of the embodiment of the invention is finer and the accuracy is higher. In addition, based on an attention mechanism, the detection system determines a weight and a hidden layer feature vector for each independent text, multiplies the hidden layer feature vectors by the corresponding weights and adds the products to obtain the webpage feature vector, and finally inputs the webpage feature vector into a preset webpage classification model to identify whether the webpage to be detected has been tampered with. Using the attention mechanism to highlight the weight of the key text in the webpage feature vector improves the detection rate of webpage tampering.
On the basis of the embodiment shown in Fig. 1, the word segments in the word segmentation sequence of an independent text also differ in importance within that text. Distinguishing the importance of each word segment and highlighting the important ones can further improve the detection rate of webpage tampering. Referring to Fig. 2, another embodiment of a method for detecting webpage tampering according to the embodiment of the present invention includes:
201. Extracting each independent text in the webpage to be detected, and performing word segmentation processing on each independent text, wherein all words of one independent text form a corresponding word segmentation sequence;
Webpage text is composed of small texts that are irregular and differ in length. These texts may come from the title, hyperlinks, displayed content and the like of the webpage, and may also include noise such as HTML comments, so it is difficult for a traditional statistics-based algorithm to find truly effective statistical information and features in such scattered text.
Optionally, as a possible implementation manner, in the embodiment of the present invention, the webpage to be detected may be cut into different independent texts according to the tags in the HTML source file, where the text between two different tags is used as one or more independent texts.
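Cutting a webpage into independent texts by its tags can be sketched with the HTML parser in the Python standard library. The class name, the choice to skip script and style content, and the sample page are assumptions made for illustration; the embodiment does not prescribe a parser or a filtering rule.

```python
from html.parser import HTMLParser


class IndependentTextExtractor(HTMLParser):
    """Collect each run of text found between two tags as one independent text."""

    SKIPPED = {"script", "style"}  # assumption: code and CSS are not page text

    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.texts.append(text)


def extract_independent_texts(html):
    parser = IndependentTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


page = (
    "<html><head><title>City Library</title><style>p{color:red}</style></head>"
    "<body><!-- nav --><h1>Opening hours</h1>"
    "<p>Open daily.</p><a href='/x'>Contact us</a></body></html>"
)
texts = extract_independent_texts(page)
```

HTML comments are dropped here because the parser's default comment handler ignores them, which matches the treatment of comments as noise described above.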
After extracting each independent text in the web page to be detected, word segmentation processing can be performed on each independent text, all words of one independent text form a corresponding word segmentation sequence, and each independent text corresponds to one word segmentation sequence.
202. Generating word vectors of all the participles based on a preset word vector model;
After word segmentation has been performed on each independent text, word vectors of all the word segments can be generated based on a preset word vector model. The word vector model may be one trained with Word2vec, GloVe, fastText or a similar technology.
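Steps 201 and 202 can be illustrated with a toy segmenter and a lookup table. The regular-expression segmenter, the three-dimensional vectors and the zero vector for unknown words are all hypothetical stand-ins; a real system would use a proper word segmenter for the target language and a trained word vector model.

```python
import re

# Hypothetical toy word-vector table; a real one would come from a trained model.
WORD_VECTORS = {
    "open": [0.2, 0.1, 0.0],
    "daily": [0.0, 0.3, 0.1],
    "casino": [0.9, 0.0, 0.8],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumption: out-of-vocabulary words map to zeros


def segment(text):
    """Split on non-word characters; Chinese text needs a real segmenter."""
    return re.findall(r"\w+", text.lower())


def to_word_vectors(text):
    """Map one independent text to the word vectors of its word segments."""
    return [WORD_VECTORS.get(word, UNKNOWN) for word in segment(text)]


sequence = segment("Open daily, casino bonus!")
vectors = to_word_vectors("Open daily, casino bonus!")
```

Each independent text thus yields one word segmentation sequence and one word vector per word segment, in the same order.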
203. Performing semantic feature vector operation;
After the word vectors of the word segments in the word segmentation sequence are obtained, the semantic feature vector of each independent text can be calculated from them. Because the word segments in the word segmentation sequence of an independent text differ in importance within that text, distinguishing the importance of each word segment and highlighting the important ones can further improve the detection rate of webpage tampering.
The semantic feature vector of each independent text is obtained through the semantic feature vector operation, which specifically comprises the following steps: inputting the word vectors corresponding to all the word segments in a word segmentation sequence into a second attention model to obtain a weight corresponding to each word segment and a hidden layer feature vector corresponding to each word segment, multiplying the hidden layer feature vectors corresponding to all the word segments in the word segmentation sequence by the corresponding weights respectively, and then adding the products to obtain the semantic feature vector of the independent text corresponding to the word segmentation sequence.
For example, referring to Fig. 3, which is a schematic flow chart of calculating the semantic feature vector of an independent text according to an embodiment of the present invention, assume that a webpage has L independent texts, each containing T words. Given an independent text $s_2$, each word is first vectorized to obtain $w_{21}, w_{22}, \ldots, w_{2T}$. A bidirectional GRU (note that the model is not limited to a GRU, and other neural network models may be used) is then used to extract the contextual connections between the words, wherein
$$\overrightarrow{h}_{2t} = \overrightarrow{\mathrm{GRU}}(w_{2t}), \qquad \overleftarrow{h}_{2t} = \overleftarrow{\mathrm{GRU}}(w_{2t}), \qquad h_{2t} = \overrightarrow{h}_{2t} \oplus \overleftarrow{h}_{2t}, \qquad t = 1, \ldots, T,$$
where $\oplus$ indicates that the two vectors are concatenated. Note that the dimension of $h_{2t}$ need not be the same as the dimension of the word vector. At the same time, for each text $s_i$, the weight corresponding to each $h_{it}$ (denoted here as $\alpha_{it}$) is learned. Each weight is multiplied by the corresponding $h_{it}$ and the products are summed to obtain the vector representation of the sentence; for $s_2$ this gives $s_2 = \sum_{t=1}^{T} \alpha_{2t} h_{2t}$. The dimension of $s_2$ is the same as that of $h_{2t}$.
Similarly, the feature vector of the webpage can be obtained by the same method. The dimension of the webpage feature vector is not necessarily the same as that of $s_2$; it is instead the same as the output dimension obtained when $s_2$ passes through the bidirectional GRU.
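The example of Fig. 3 can be sketched in plain Python as a small bidirectional GRU followed by attention pooling. The parameters below are random and untrained, bias terms are omitted for brevity, and the weights are computed by a dot product with a context vector; all of these, together with the class and function names, are assumptions for illustration and not the patented implementation.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class GRUCell:
    """A GRU cell with random, untrained parameters (biases omitted)."""

    def __init__(self, input_dim, hidden_dim, rng):
        def matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        # One (W, U) pair each for the update gate, reset gate and candidate state.
        self.Wz, self.Uz = matrix(hidden_dim, input_dim), matrix(hidden_dim, hidden_dim)
        self.Wr, self.Ur = matrix(hidden_dim, input_dim), matrix(hidden_dim, hidden_dim)
        self.Wh, self.Uh = matrix(hidden_dim, input_dim), matrix(hidden_dim, hidden_dim)

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        gated = [r_k * h_k for r_k, h_k in zip(r, h)]
        candidate = [
            math.tanh(a + b)
            for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, gated))
        ]
        return [(1 - z_k) * h_k + z_k * c_k for z_k, h_k, c_k in zip(z, h, candidate)]

    def run(self, inputs):
        h, states = [0.0] * self.hidden_dim, []
        for x in inputs:
            h = self.step(x, h)
            states.append(h)
        return states


def encode_text(word_vectors, forward, backward, context):
    """Return the attention-pooled vector of one text and the word weights."""
    fwd = forward.run(word_vectors)
    bwd = backward.run(word_vectors[::-1])[::-1]  # read right to left, then realign
    hidden = [f + b for f, b in zip(fwd, bwd)]  # list concatenation of both directions
    scores = [sum(h_k * c_k for h_k, c_k in zip(h, context)) for h in hidden]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * h[k] for w, h in zip(weights, hidden)) for k in range(len(context))]
    return pooled, weights


rng = random.Random(0)
word_dim, hidden_dim = 3, 2
forward = GRUCell(word_dim, hidden_dim, rng)
backward = GRUCell(word_dim, hidden_dim, rng)
context = [rng.uniform(-1, 1) for _ in range(2 * hidden_dim)]  # hypothetical
w2 = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.1], [0.9, 0.0, 0.8]]  # toy word vectors of s2
s2, alpha = encode_text(w2, forward, backward, context)
```

Here the word vectors have dimension 3 while each concatenated hidden vector, and therefore the text vector, has dimension 4, which illustrates the remark that the two dimensions need not be the same.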
204. Repeating the semantic feature vector operation to obtain the semantic feature vector of each independent text;
205. Inputting semantic feature vectors corresponding to all independent texts into a first attention model, and determining a weight corresponding to each independent text and a hidden layer feature vector corresponding to each independent text;
206. Multiplying hidden layer feature vectors corresponding to all independent texts by corresponding weights, and then performing addition operation to obtain webpage feature vectors of the webpage to be detected;
207. Inputting the webpage feature vector into a preset webpage classification model to identify whether the webpage to be detected is tampered.
In the embodiment of the invention, the detection system can acquire a plurality of independent texts in the webpage to be detected and calculate the semantic feature vector of each independent text. Compared with the prior-art approach of detecting the text as a whole, the detection granularity of the embodiment of the invention is finer and the accuracy is higher. In addition, the attention mechanism is used to strengthen the weight of the keywords in the semantic feature vector of each independent text, so that the features of important keywords are highlighted and the detection rate of webpage tampering is improved. Furthermore, the detection system assigns a corresponding weight to each independent text based on the attention mechanism, multiplies the hidden layer feature vectors corresponding to all independent texts by the corresponding weights and adds the products to obtain the webpage feature vector of the webpage to be detected, and finally inputs the webpage feature vector into a preset webpage classification model to identify whether the webpage to be detected has been tampered with. Using the attention mechanism to highlight the weight of the key text in the webpage feature vector further improves the detection rate of webpage tampering.
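The two levels of attention described in this embodiment can be sketched together as follows. To keep the example short, the hidden layer feature vectors are taken to be the input vectors themselves and both context vectors are fixed by hand; these simplifications, and all names and values, are assumptions for illustration.

```python
import math


def attention_pool(vectors, context):
    """Softmax-score each vector against the context, return the weighted sum."""
    scores = [sum(v_k * c_k for v_k, c_k in zip(v, context)) for v in vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * v[k] for w, v in zip(weights, vectors))
        for k in range(len(vectors[0]))
    ]
    return pooled, weights


def page_vector(page, word_context, text_context):
    """`page` is a list of independent texts, each a list of word vectors."""
    # Second attention model: word vectors -> one vector per independent text.
    text_vectors = [attention_pool(words, word_context)[0] for words in page]
    # First attention model: text vectors -> one webpage feature vector.
    return attention_pool(text_vectors, text_context)


page = [
    [[0.1, 0.0], [0.2, 0.1]],              # an ordinary text
    [[0.1, 0.1]],                           # another ordinary text
    [[2.0, 2.0], [1.5, 2.5], [0.0, 0.1]],   # a text that stands out
]
vector, text_weights = page_vector(page, word_context=[1.0, 1.0], text_context=[1.0, 1.0])
```

In this toy page the third text dominates the webpage feature vector, which is the intended effect of highlighting key text before classification.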
It should be understood that, in various embodiments of the present invention, the sequence numbers of the above steps do not mean the execution sequence, and the execution sequence of each step should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The foregoing embodiments describe the method for detecting webpage tampering in the embodiment of the present invention. The detection system in the embodiment of the present invention is described below. Referring to Fig. 4, an embodiment of the detection system may include:
a processing module 401, configured to extract each independent text in the webpage to be detected, and calculate the semantic feature vector of each independent text;
a generating module 402, configured to input semantic feature vectors corresponding to all the independent texts into the first attention model, and determine a weight corresponding to each independent text and a hidden layer feature vector corresponding to each independent text;
a calculating module 403, configured to multiply hidden layer feature vectors corresponding to all the independent texts by the corresponding weights and then perform addition operation to obtain a webpage feature vector of the webpage to be detected;
a classification module 404, configured to input the webpage feature vector into a preset webpage classification model to identify whether tampering exists in the webpage to be detected.
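The cooperation of modules 401 to 404 can be sketched as four callables wired together in order. The class name, the type signatures and the stand-in functions are hypothetical and serve only to show the data flow between the modules.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class DetectionSystem:
    """Wires four callables together in the order of modules 401 to 404."""

    processing: Callable[[str], List[Vector]]  # 401: HTML -> text vectors
    generating: Callable[[List[Vector]], Tuple[List[float], List[Vector]]]  # 402
    calculating: Callable[[List[float], List[Vector]], Vector]  # 403
    classifying: Callable[[Vector], bool]  # 404: page vector -> tampered?

    def is_tampered(self, html: str) -> bool:
        text_vectors = self.processing(html)
        weights, hidden = self.generating(text_vectors)
        page_vector = self.calculating(weights, hidden)
        return self.classifying(page_vector)


def weighted_sum(weights, hidden):
    return [sum(w * h[k] for w, h in zip(weights, hidden)) for k in range(len(hidden[0]))]


# Stand-in modules with fixed outputs, used only to exercise the wiring.
system = DetectionSystem(
    processing=lambda html: [[1.0, 0.0], [0.0, 1.0]],
    generating=lambda vectors: ([0.5, 0.5], vectors),
    calculating=weighted_sum,
    classifying=lambda page_vector: page_vector[0] > 0.9,
)
result = system.is_tampered("<html></html>")
```

Keeping each module behind a plain callable means any of them, for example the classifier, can be replaced without touching the others.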
Optionally, as a possible implementation manner, referring to fig. 5, a processing module 401 in the embodiment of the present invention includes:
the extraction unit 4011 is configured to extract each independent text in the to-be-detected web page;
the word segmentation unit 4012 is configured to perform word segmentation processing on each independent text, where all the participles of one independent text form a corresponding participle sequence, and to generate word vectors of all the participles based on a preset word vector model;
the calculating unit 4013 is configured to calculate semantic feature vectors of the independent texts according to the word vectors of the participles in the participle sequence.
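The work of the word segmentation unit can be sketched as two small steps: split an independent text into a participle sequence, then look up a word vector for each participle. This is a simplified sketch under stated assumptions: the regular-expression tokenizer stands in for a real segmenter (Chinese text would need a dedicated word segmentation tool), and `WORD_VECTORS` is a hypothetical three-entry table standing in for a preset, trained word vector model.

```python
import re
from typing import Dict, List

Vector = List[float]

# Hypothetical preset word vector model: a tiny lookup table standing in
# for trained embeddings. The values are made up for illustration.
WORD_VECTORS: Dict[str, Vector] = {
    "online": [0.9, 0.1, 0.0],
    "casino": [0.8, 0.7, 0.1],
    "news": [0.1, 0.0, 0.9],
}
UNKNOWN: Vector = [0.0, 0.0, 0.0]  # assumption: unknown words map to a zero vector


def segment(text: str) -> List[str]:
    """Split one independent text into a participle (token) sequence."""
    return re.findall(r"\w+", text.lower())


def to_word_vectors(tokens: List[str]) -> List[Vector]:
    """Look up the word vector of every participle in the sequence."""
    return [WORD_VECTORS.get(token, UNKNOWN) for token in tokens]
```

For example, `to_word_vectors(segment("Online Casino"))` yields one vector per participle, in sequence order, which is the input expected by the calculating unit.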
Optionally, as a possible implementation manner, the calculating unit 4013 in the embodiment of the present invention includes:
the calculating subunit 40131 is configured to perform semantic feature vector operation, where the semantic feature vector operation includes: inputting word vectors corresponding to all the participles in a participle sequence into a second attention model to obtain a weight corresponding to each participle and a hidden layer feature vector corresponding to each participle, and multiplying the hidden layer feature vectors corresponding to all the participles in the participle sequence by the corresponding weights respectively and then performing addition operation to obtain a semantic feature vector of an independent text corresponding to the participle sequence;
and the control subunit 40132 is configured to repeat the semantic feature vector operation to obtain the semantic feature vector of each independent text.
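The semantic feature vector operation and its repetition can be sketched as follows. The text does not fix the internals of the second attention model here, so this sketch makes two explicit assumptions: the hidden layer feature vector of a participle is the element-wise `tanh` of its word vector, and the weight is a softmax over dot products with a context vector. The function names and the `context` parameter are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax, so the weights are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(word_vectors: List[Vector], context: Vector) -> Vector:
    """One semantic feature vector operation for one participle sequence."""
    dim = len(context)
    if not word_vectors:
        return [0.0] * dim  # assumption: an empty sequence yields a zero vector
    # Assumed hidden layer: element-wise tanh of each word vector.
    hidden = [[math.tanh(x) for x in vec] for vec in word_vectors]
    # Assumed scoring: dot product of each hidden vector with the context vector.
    scores = [sum(h * c for h, c in zip(vec, context)) for vec in hidden]
    weights = softmax(scores)
    # Multiply hidden vectors by their weights, then add them up.
    return [sum(w * vec[k] for w, vec in zip(weights, hidden)) for k in range(dim)]


def semantic_vectors(sequences: List[List[Vector]], context: Vector) -> List[Vector]:
    """Control step: repeat the operation for every independent text."""
    return [attention_pool(seq, context) for seq in sequences]
```

The output list keeps one vector per independent text, in order, so it can be passed directly to the text-level attention step.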
Optionally, as a possible implementation manner, the extraction unit 4011 in the embodiment of the present invention includes:
the extraction subunit 40111 is configured to cut the webpage to be detected into different independent texts according to the tags in the HTML source file, where the text between two different tags is used as one or more independent texts.
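Tag-based cutting can be sketched with the `html.parser` module from the Python standard library. This is one possible reading of the step, not the patented extractor: each run of text between two tags becomes one independent text, and the choice to skip the contents of `script` and `style` elements is an assumption added here. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class IndependentTextExtractor(HTMLParser):
    """Collect text runs that sit between tags, skipping script and style."""

    SKIPPED = {"script", "style"}  # assumption: code and CSS are not page text

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.texts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace and line breaks
        if text and self._skip_depth == 0:
            self.texts.append(text)


def extract_independent_texts(html: str) -> List[str]:
    """Cut an HTML source string into independent texts at tag boundaries."""
    parser = IndependentTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts
```

Because the cut happens at every tag boundary, inline markup such as `<b>` splits a sentence into several independent texts; a production extractor might merge such fragments, which the text leaves open with "one or more independent texts".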
Optionally, as a possible implementation manner, the webpage classification model in the embodiment of the present invention includes:
a logistic regression model, a multilayer perceptron model and a multilayer fully-connected neural network model.
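Of the listed classifiers, logistic regression is the simplest to illustrate. The sketch below shows only the forward pass on a webpage feature vector; the parameter values in the usage note are made up, the 0.5 threshold is an assumption, and a real model would obtain `weights` and `bias` from training on labeled pages.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tamper_probability(features: List[float], weights: List[float], bias: float) -> float:
    """Logistic regression forward pass on a webpage feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def is_tampered(features: List[float], weights: List[float], bias: float,
                threshold: float = 0.5) -> bool:
    """Label the page as tampered when the probability reaches the threshold."""
    return tamper_probability(features, weights, bias) >= threshold
```

With the illustrative parameters `weights=[1.5, -0.5]` and `bias=-1.0`, the feature vector `[2.0, 1.0]` gives a linear score of 1.5 and is labeled as tampered, while `[0.0, 0.0]` gives a score of -1.0 and is not. A multilayer perceptron or fully-connected network would replace the single linear score with stacked layers but keep the same input and output.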
In the embodiment of the present invention, the detection system acquires a plurality of independent texts from the webpage to be detected and calculates a semantic feature vector for each independent text. Compared with prior-art approaches that examine the webpage text as a whole, the detection granularity is finer and the accuracy is higher. In addition, an attention mechanism strengthens the weight of keywords within the semantic feature vector of each independent text, so that the features of important keywords are highlighted and the detection rate of webpage tampering is improved. Furthermore, the detection system assigns a weight to each independent text based on the attention mechanism, multiplies the hidden layer feature vector of each independent text by its weight, sums the products to obtain the webpage feature vector of the webpage to be detected, and finally inputs the webpage feature vector into a preset webpage classification model to identify whether the webpage has been tampered with. Because the attention mechanism highlights the weight of key texts in the webpage feature vector, the detection rate of webpage tampering is further improved.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The detection system in the embodiment of the present invention is described above from the perspective of the modular functional entity, and the computer apparatus in the embodiment of the present invention is described below from the perspective of hardware processing:
For convenience of description, fig. 6 shows only the portion related to the embodiment of the present invention; for specific technical details that are not disclosed here, please refer to the method portion of the embodiment of the present invention. The computer device 6 is generally a computer device with high processing capability, such as a server.
Referring to fig. 6, the computer device 6 includes: a power supply 610, a memory 620, a processor 630, a wired or wireless network interface 640, and computer programs stored in the memory and executable on the processor. The processor, when executing the computer program, implements the steps in the above-described embodiments of the web page tampering detection method, such as steps 101 to 104 shown in fig. 1. Alternatively, the processor, when executing the computer program, implements the functions of each module or unit in the above-described device embodiments.
In some embodiments of the present invention, the processor is specifically configured to implement the following steps:
extracting each independent text in the webpage to be detected, and calculating semantic feature vectors of each independent text;
inputting semantic feature vectors corresponding to all independent texts into a first attention model, and determining a weight corresponding to each independent text and a hidden layer feature vector corresponding to each independent text;
multiplying hidden layer feature vectors corresponding to all independent texts by corresponding weights, and then performing addition operation to obtain webpage feature vectors of the webpage to be detected;
and inputting the webpage feature vector into a preset webpage classification model to identify whether the webpage to be detected is tampered.
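The four steps above can be summarized in formulas. The symbols below are introduced here for illustration and do not come from the patent text; in particular, the softmax scoring in the second group is an assumption in the style of hierarchical attention networks, since this passage does not specify how the first attention model computes its weights.

```latex
% Notation (illustrative):
%   s_i       semantic feature vector of the i-th independent text, i = 1..n
%   h_i       hidden layer feature vector of the i-th independent text
%   \alpha_i  weight assigned to the i-th independent text
%   v         webpage feature vector
%   f         preset webpage classification model
\begin{align}
  \{(\alpha_i, h_i)\}_{i=1}^{n} &= \mathrm{Attention}_1(s_1, \dots, s_n) \\
  v &= \sum_{i=1}^{n} \alpha_i \, h_i \\
  \hat{y} &= f(v)
\end{align}
% Assumed scoring (W, b and the context vector u_c would be learned):
\begin{align}
  u_i = \tanh(W h_i + b), \qquad
  \alpha_i = \frac{\exp(u_i^{\top} u_c)}{\sum_{j=1}^{n} \exp(u_j^{\top} u_c)}
\end{align}
```

As a small worked case, with two independent texts, weights $\alpha = (0.25, 0.75)$ and hidden layer vectors $h_1 = (1, 0)$, $h_2 = (0, 1)$, the webpage feature vector is $v = 0.25\,h_1 + 0.75\,h_2 = (0.25, 0.75)$, so the second text dominates the representation that reaches the classifier.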
Optionally, in some embodiments of the present invention, the processor may be further configured to implement the following steps:
performing word segmentation processing on each independent text, wherein all words of one independent text form a corresponding word segmentation sequence;
generating word vectors of all the participles based on a preset word vector model;
and calculating the semantic feature vector of each independent text according to the word vector of each participle in the participle sequence.
Optionally, in some embodiments of the present invention, the processor may be further configured to implement the following steps:
performing semantic feature vector operation, wherein the semantic feature vector operation comprises the following steps: inputting word vectors corresponding to all the participles in a participle sequence into a second attention model to obtain a weight corresponding to each participle and a hidden layer feature vector corresponding to each participle, and multiplying the hidden layer feature vectors corresponding to all the participles in the participle sequence by the corresponding weights respectively and then performing addition operation to obtain a semantic feature vector of an independent text corresponding to the participle sequence;
and repeating the semantic feature vector operation to obtain the semantic feature vector of each independent text.
Optionally, in some embodiments of the present invention, the processor may be further configured to cut the webpage to be detected into different independent texts according to the tags in the HTML source file, where the text between two different tags is used as one or more independent texts.
Optionally, in some embodiments of the present invention, the webpage classification model includes:
a logistic regression model, a multilayer perceptron model and a multilayer fully-connected neural network model.
The computer device 6 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. Illustratively, a computer program may be partitioned into one or more modules/units, which are stored in a memory and executed by a processor. One or more modules/units may be a series of computer program instruction segments capable of performing certain functions, the instruction segments being used to describe the execution of a computer program in a computer device.
Those skilled in the art will appreciate that the configuration shown in fig. 6 does not constitute a limitation of the computer device 6; the computer device 6 may comprise more or fewer components than those shown, may combine some components, or may use a different arrangement of components. For example, the computer device may further comprise input-output devices, buses, etc.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the entire computer device through various interfaces and lines.
The memory may be used to store computer programs and/or modules, and the processor implements the various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the computer device (such as audio data, a phonebook, etc.), and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another solid-state storage device.
The present invention also provides a computer-readable storage medium having a computer program stored thereon, which when executed by a processor, performs the steps of:
extracting each independent text in the webpage to be detected, and calculating semantic feature vectors of each independent text;
inputting semantic feature vectors corresponding to all independent texts into a first attention model, and determining a weight corresponding to each independent text and a hidden layer feature vector corresponding to each independent text;
multiplying hidden layer feature vectors corresponding to all independent texts by corresponding weights, and then performing addition operation to obtain webpage feature vectors of the webpage to be detected;
and inputting the webpage feature vector into a preset webpage classification model to identify whether the webpage to be detected is tampered.
Optionally, in some embodiments of the present invention, the computer program, when executed by the processor, may further implement the following steps:
performing word segmentation processing on each independent text, wherein all words of one independent text form a corresponding word segmentation sequence;
generating word vectors of all the participles based on a preset word vector model;
and calculating the semantic feature vector of each independent text according to the word vector of each participle in the participle sequence.
Optionally, in some embodiments of the present invention, the computer program, when executed by the processor, may further implement the following steps:
performing semantic feature vector operation, wherein the semantic feature vector operation comprises the following steps: inputting word vectors corresponding to all the participles in a participle sequence into a second attention model to obtain a weight corresponding to each participle and a hidden layer feature vector corresponding to each participle, and multiplying the hidden layer feature vectors corresponding to all the participles in the participle sequence by the corresponding weights respectively and then performing addition operation to obtain a semantic feature vector of an independent text corresponding to the participle sequence;
and repeating the semantic feature vector operation to obtain the semantic feature vector of each independent text.
Optionally, in some embodiments of the present invention, the computer program, when executed by the processor, may further cut the webpage to be detected into different independent texts according to the tags in the HTML source file, where the text between two different tags is used as one or more independent texts.
Optionally, in some embodiments of the present invention, the webpage classification model includes:
a logistic regression model, a multilayer perceptron model and a multilayer fully-connected neural network model.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For instance, the division into units is merely a logical division, and an actual implementation may divide them differently: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A webpage tampering detection method is characterized by comprising the following steps:
extracting each independent text in the webpage to be detected, and calculating semantic feature vectors of each independent text;
inputting semantic feature vectors corresponding to all independent texts into a first attention model, and determining a weight corresponding to each independent text and a hidden layer feature vector corresponding to each independent text;
multiplying hidden layer feature vectors corresponding to all independent texts by corresponding weights, and then performing addition operation to obtain webpage feature vectors of the webpage to be detected;
and inputting the webpage feature vector into a preset webpage classification model to identify whether the webpage to be detected is tampered.
2. The method of claim 1, wherein computing the semantic feature vector for each independent text comprises:
performing word segmentation processing on each independent text, wherein all words of one independent text form a corresponding word segmentation sequence;
generating word vectors of all the participles based on a preset word vector model;
and calculating the semantic feature vector of each independent text according to the word vector of each participle in the participle sequence.
3. The method of claim 2, wherein computing the semantic feature vector of each independent text from the word vector of each participle in the participle sequence comprises:
performing semantic feature vector operations, the semantic feature vector operations comprising: inputting word vectors corresponding to all the participles in a participle sequence into a second attention model to obtain a weight corresponding to each participle and a hidden layer feature vector corresponding to each participle, and multiplying the hidden layer feature vectors corresponding to all the participles in the participle sequence by the corresponding weights respectively and then performing addition operation to obtain a semantic feature vector of an independent text corresponding to the participle sequence;
and repeating the semantic feature vector operation to obtain the semantic feature vector of each independent text.
4. The method according to any one of claims 1 to 3, wherein the extracting each independent text in the web page to be detected comprises:
and cutting the webpage to be detected into different independent texts according to the tags in the HTML source file, wherein the text between two different tags is used as one or more independent texts.
5. The method of claim 4, wherein the web page classification model comprises:
a logistic regression model, a multilayer perceptron model and a multilayer fully-connected neural network model.
6. A detection system for detecting webpage tampering, comprising:
the processing module is used for extracting each independent text in the webpage to be detected and calculating semantic feature vectors of each independent text;
the generating module is used for inputting the semantic feature vectors corresponding to all the independent texts into the first attention model, and determining the weight corresponding to each independent text and the hidden layer feature vector corresponding to each independent text;
the calculation module is used for multiplying hidden layer feature vectors corresponding to all independent texts by corresponding weights and then performing addition operation to obtain the webpage feature vectors of the webpage to be detected;
and the classification module is used for inputting the webpage feature vectors into a preset webpage classification model to identify whether the webpage to be detected is tampered.
7. The detection system of claim 6, wherein the processing module comprises:
the extraction unit is used for extracting each independent text in the webpage to be detected;
the word segmentation unit is used for performing word segmentation processing on each independent text, forming a corresponding word segmentation sequence by all the word segments of one independent text, and generating word vectors of all the word segments based on a preset word vector model;
and the calculating unit is used for calculating the semantic feature vector of each independent text according to the word vector of each participle in the participle sequence.
8. The detection system according to claim 7, wherein the calculation unit comprises:
a computing subunit, configured to perform a semantic feature vector operation, where the semantic feature vector operation includes: inputting word vectors corresponding to all the participles in a participle sequence into a second attention model to obtain a weight corresponding to each participle and a hidden layer feature vector corresponding to each participle, and multiplying the hidden layer feature vectors corresponding to all the participles in the participle sequence by the corresponding weights respectively and then performing addition operation to obtain a semantic feature vector of an independent text corresponding to the participle sequence;
and the control subunit is used for repeating the semantic feature vector operation to obtain the semantic feature vector of each independent text.
9. A computer arrangement, characterized in that the computer arrangement comprises a processor for implementing the steps of the method according to any one of claims 1 to 5 when executing a computer program stored in a memory.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when executed by a processor implementing the steps of the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910075452.9A CN111488452A (en) | 2019-01-25 | 2019-01-25 | Webpage tampering detection method, detection system and related equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111488452A true CN111488452A (en) | 2020-08-04 |
Family
ID=71812276
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201910075452.9A (CN111488452A) | Pending | 2019-01-25 | 2019-01-25 | Webpage tampering detection method, detection system and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111488452A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111967063A (en) * | 2020-09-02 | 2020-11-20 | 开普云信息科技股份有限公司 | Data tampering monitoring and identifying method and device based on multi-dimensional analysis, electronic equipment and storage medium thereof |
CN112528190A (en) * | 2020-12-23 | 2021-03-19 | 中移(杭州)信息技术有限公司 | Web page tampering judgment method and device based on fragmentation structure and content and storage medium |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6990628B1 (en) * | 1999-06-14 | 2006-01-24 | Yahoo! Inc. | Method and apparatus for measuring similarity among electronic documents |
CN104217160A (en) * | 2014-09-19 | 2014-12-17 | 中国科学院深圳先进技术研究院 | Method and system for detecting Chinese phishing website |
CN105718577A (en) * | 2016-01-22 | 2016-06-29 | 中国互联网络信息中心 | Method and system for automatically detecting phishing aiming at added domain name |
CN107437038A (en) * | 2017-08-07 | 2017-12-05 | 深信服科技股份有限公司 | A kind of detection method and device of webpage tamper |
CN107566391A (en) * | 2017-09-20 | 2018-01-09 | 上海斗象信息科技有限公司 | Domain identification plus the method for the topic identification structure machine learning model detection dark chain of webpage |
CN108090098A (en) * | 2016-11-22 | 2018-05-29 | 科大讯飞股份有限公司 | A kind of text handling method and device |
JP2018097468A (en) * | 2016-12-09 | 2018-06-21 | 日本電信電話株式会社 | Sentence classification learning device, sentence classification device, sentence classification learning method and sentence classification learning program |
CN108595717A (en) * | 2018-05-18 | 2018-09-28 | 北京慧闻科技发展有限公司 | For the data processing method of text classification, data processing equipment and electronic equipment |
CN108683666A (en) * | 2018-05-16 | 2018-10-19 | 新华三信息安全技术有限公司 | A kind of web page identification method and device |
CN108763384A (en) * | 2018-05-18 | 2018-11-06 | 北京慧闻科技发展有限公司 | For the data processing method of text classification, data processing equipment and electronic equipment |
CN109165529A (en) * | 2018-08-14 | 2019-01-08 | 杭州安恒信息技术股份有限公司 | A kind of dark chain altering detecting method, device and computer readable storage medium |
Non-Patent Citations (2)
Title |
---|
ZICHAO YANG ET AL: "Hierarchical Attention Networks for Document Classification", 《PROCEEDINGS OF NAACL-HLT 2016》 * |
ZICHAO YANG ET AL: "Hierarchical Attention Networks for Document Classification", 《PROCEEDINGS OF NAACL-HLT 2016》, 17 June 2016 (2016-06-17), pages 1480, XP055539296, DOI: 10.18653/v1/N16-1174 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200804 |