CN112073360B

CN112073360B - Detection method, device, terminal equipment and medium for hypertext transmission data

Info

Publication number: CN112073360B
Application number: CN201911157089.1A
Authority: CN
Inventors: 安宇飞; 陈剑勇; 林秋镇; 陈飞; 李坚强
Original assignee: Shenzhen University
Current assignee: Shenzhen University
Priority date: 2019-11-22
Filing date: 2019-11-22
Publication date: 2022-12-20
Anticipated expiration: 2039-11-22
Also published as: CN112073360A

Abstract

The application is applicable to the field of computer security, and provides a detection method, a device, terminal equipment and a medium for hypertext transfer data, wherein the detection method comprises the following steps: receiving hypertext transfer data; clustering the hypertext transmission data and preset training data, and dividing the hypertext transmission data into normal data and data to be detected according to a clustering result; training a first classifier using the normal data and the training data; and classifying the data to be detected by adopting the first classifier, and dividing the data to be detected into normal data and abnormal data. By the method, new abnormity can be found in time, and the misjudgment rate of data is reduced.

Description

Detection method, device, terminal equipment and medium for hypertext transmission data

Technical Field

The application belongs to the field of computer security, and particularly relates to a method and a device for detecting hypertext transfer data, a terminal device and a medium.

Background

The Hypertext Transfer Protocol (HTTP) is widely used on the Internet. As an application-layer protocol for distributed, collaborative, hypermedia information systems, HTTP has become a general-purpose transport protocol, and a large amount of data is transmitted over it. However, HTTP request data often carries real intrusion attempts, especially in non-static HTTP requests, which involve dynamic parameters entered by the user.

Existing methods for detecting abnormal HTTP data fall mainly into two types. The first type detects abnormal HTTP data by pattern matching or statistical analysis, but it cannot find new abnormal data because the characteristics of new abnormal data are not known in advance. The second type is anomaly detection based on machine learning, which is effective at finding novel anomalies but easily produces a high false alarm rate because of problems such as non-uniform training samples.

Disclosure of Invention

The embodiments of the application provide a detection method, a device, terminal equipment and a medium for hypertext transfer data, which can solve the problems that existing methods for detecting anomalies in hypertext transfer data have a high false positive rate and cannot effectively detect some of the anomalies to be detected.

In a first aspect, an embodiment of the present application provides a method for detecting hypertext transfer data, including:

receiving hypertext transfer data;

clustering the hypertext transmission data and preset training data, and dividing the hypertext transmission data into normal data and data to be detected according to a clustering result;

training a first classifier using the normal data and the training data;

and classifying the data to be detected by adopting the first classifier, and dividing the data to be detected into normal data and abnormal data.

In a second aspect, an embodiment of the present application provides an apparatus for detecting hypertext transfer data, including:

the receiving module is used for receiving the hypertext transmission data;

the clustering module is used for clustering the hypertext transmission data and preset training data and dividing the hypertext transmission data into normal data and data to be detected according to a clustering result;

a training module for training a first classifier using the normal data and the training data;

and the classification module is used for classifying the data to be detected by adopting the first classifier and dividing the data to be detected into normal data and abnormal data.

In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the method of any one of the above first aspects when executing the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method of any one of the first aspect.

In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the method for detecting hypertext transfer data according to any one of the above first aspects.

Compared with the prior art, the embodiments of the application have the following advantages. Training data is set in advance, and the training data is normal data. When a certain amount of HTTP data has been received, the training data and the HTTP data are clustered with a preset clustering algorithm, which divides the HTTP data and the training data into several data classes. The HTTP data is then divided into normal data and data to be detected according to whether each class in the clustering result contains training data. A first classifier is trained with the normal data and the training data, so that the trained first classifier can judge whether data is normal according to the characteristics of the training data and the normal data. The data to be detected is then classified with the trained first classifier: if the data to be detected has the characteristics of normal data, the first classifier assigns it to normal data; otherwise it is abnormal data. Updating the data in time helps to find new anomalies. During detection, the data is judged twice, once by clustering and once by classification, which reduces the misjudgment rate. When detecting abnormal data, the trained first classifier can also be tested with the normal data and the training data; if the amount of misjudged data in the test result is larger than a preset threshold, a second classifier can be trained with the normal data and the training data, and the first classifier and the second classifier are then each used to classify the data to be detected. Only if both classifiers judge the data to be abnormal is the data to be detected identified as abnormal data, which further reduces the false alarm rate of HTTP data detection.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a method for detecting hypertext transfer data according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of a detection method for hypertext transfer data according to a second embodiment of the present application;

Fig. 3 is a schematic flowchart of a detection method for hypertext transfer data according to a third embodiment of the present application;

Fig. 4 is a schematic flowchart of a detection method for hypertext transfer data according to a fourth embodiment of the present application;

Fig. 5 is a schematic flowchart of a method for detecting hypertext transfer data according to a fifth embodiment of the present application;

Fig. 6 is a schematic diagram of a classification process provided in the fifth embodiment of the present application;

Fig. 7 is a system block diagram of a method for detecting hypertext transfer data according to a sixth embodiment of the present application;

Fig. 8 is a schematic diagram of an apparatus for detecting hypertext transfer data according to a seventh embodiment of the present application;

Fig. 9 is a schematic structural diagram of a terminal device according to an eighth embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".

Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used only to distinguish between descriptions and are not to be understood as indicating or implying relative importance.

Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.

Fig. 1 is a schematic flowchart of a method for detecting hypertext transfer data according to an embodiment of the present application, where as shown in fig. 1, the method includes the following steps:

S101, receiving hypertext transmission data;

The method for detecting hypertext transfer data provided in this embodiment is suitable for a server, a gateway device, or other terminal devices that need to detect and intercept abnormal data, and is not particularly limited in this embodiment.

Specifically, when Internet access takes place, the server sends an HTTP request to the network through the router; on receiving the HTTP data, the router can store it in the traffic memory.

S102, clustering the hypertext transmission data and preset training data, and dividing the hypertext transmission data into normal data and data to be detected according to a clustering result;

The preset training data may be normal HTTP data set in advance, and this training data serves as the reference against which HTTP data is judged to be normal or not.

Specifically, the received and stored HTTP data is processed at intervals. The HTTP data and the training data are clustered with a clustering algorithm, such as the BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) clustering algorithm. The clustering algorithm divides the HTTP data and the training data into a plurality of data classes, completing a preliminary classification of the data. In the clustering result, if a data class contains training data, the HTTP data in that data class is identified as normal data; if a data class does not contain training data, the HTTP data in that data class is identified as data to be detected.
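The class-membership rule of S102 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: it assumes the clustering step has already produced one cluster label per item, and the function name `split_by_cluster` and the flag list `is_training` are hypothetical.

```python
from collections import defaultdict


def split_by_cluster(labels, is_training):
    """Return (normal_idx, pending_idx) for the received HTTP items.

    labels[i]      -- cluster id assigned to item i by the clustering step
    is_training[i] -- True if item i is preset training data
    """
    # Record which clusters contain at least one training item.
    has_training = defaultdict(bool)
    for label, train in zip(labels, is_training):
        has_training[label] = has_training[label] or train

    normal, pending = [], []
    for i, (label, train) in enumerate(zip(labels, is_training)):
        if train:
            continue  # training items are references, not items being judged
        (normal if has_training[label] else pending).append(i)
    return normal, pending


# Illustrative grouping: six received items (indices 0-5) and two training
# items (indices 6-7); clusters 0 and 1 each contain one training item.
labels = [0, 1, 2, 2, 3, 3, 0, 1]
is_training = [False] * 6 + [True] * 2
normal_idx, pending_idx = split_by_cluster(labels, is_training)
```

Items that share a cluster with training data are treated as normal; everything else is only marked as pending, not as abnormal, which is what leaves room for the second detection stage.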

S103, training a first classifier by adopting the normal data and the training data;

The first classifier may include a one-class support vector machine (OC-SVM) classifier for classifying the HTTP data into normal data and abnormal data.

Specifically, because the training data is also normal data, the normal data and the training data are loaded into the OC-SVM classifier as samples, and the classifier is trained on the characteristics of these samples to obtain a basis for judging whether data is normal. After the OC-SVM classifier has been trained with the normal data and the training data, it can judge whether HTTP data is normal data.

And S104, classifying the data to be detected by adopting the first classifier, and dividing the data to be detected into normal data and abnormal data.

Specifically, a first classifier is adopted to classify data to be detected, and if the first classifier detects that the data to be detected has characteristics conforming to normal data, the data to be detected is identified as the normal data; otherwise, it is identified as anomalous data.

In this embodiment, the HTTP data is classified twice, first by a clustering algorithm and then by a classification algorithm, which amounts to detecting the data twice. Normal data and data to be detected are distinguished in the clustering process, and the data to be detected is examined further in the classification process, which can reduce the false alarm rate of abnormal data detection. In addition, the classifier is trained during detection using normal data from the received data, so that new anomalies can be found in time.

Fig. 2 is a schematic flowchart of a method for detecting hypertext transfer data according to a second embodiment of the present application, and as shown in fig. 2, the method includes the following steps:

S201, receiving hypertext transmission data;

Specifically, the server sends HTTP data to the network through the router to perform various Internet access behaviors, and the router can store the HTTP data in the traffic memory when it receives the data.

S202, dividing the hypertext transmission data into a plurality of data fragments by taking a domain as a unit;

Specifically, features need to be extracted before the HTTP data and the training data can be clustered. HTTP data is generally relatively long, so when it is processed it can be divided into individual data fragments using a domain as the basic unit. For example, the selected domains may include, but are not limited to, the following 11 categories: (1) URL, (2) Accept-Language, (3) User-Agent, (4) Cookie, (5) Client-IP, (6) X-Forwarded-For, (7) Referer, (8) Content-Type, (9) Origin, (10) Via and (11) Req_data.

S203, if the domain corresponding to the data fragment belongs to the target domain, identifying the data fragment as an effective data fragment;

In the process of dividing the data into fragments, if a data fragment corresponds to one of the 11 categories of domains above, the data fragment may be identified as a valid data fragment. During fragmentation a dictionary may be created in which the key names (keys) are the domain names listed in S202 and the key values are the data contained in those domains. Each piece of HTTP data is fragmented using a domain as the basic unit, the data fragments of the corresponding domains are stored in the dictionary, and the other, redundant data is deleted.

In the detection process, preset training data is needed, wherein the training data are normal HTTP data, can be the whole HTTP data directly, and are fragmented together with the HTTP data; or directly using the normal data fragments corresponding to each domain as training data.
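The fragmentation of S202 and S203 can be sketched as below. This is a sketch under stated assumptions, not the patented implementation: the function name `fragment_request`, the lower-cased key names, and the choice to take the request target as `url` and the message body as `req_data` are all assumptions, and the sample request is hypothetical.

```python
# Target domains taken from the 11 categories listed in S202 (lower-cased keys
# are an assumption made for this sketch).
TARGET_DOMAINS = {
    "url", "accept-language", "user-agent", "cookie", "client-ip",
    "x-forwarded-for", "referer", "content-type", "origin", "via", "req_data",
}


def fragment_request(raw):
    """Split one raw HTTP request into a dict keyed by target domain name."""
    head, _, body = raw.replace("\r\n", "\n").partition("\n\n")
    lines = [ln for ln in head.split("\n") if ln.strip()]
    fragments = {}
    if not lines:
        return fragments

    # Request line: METHOD SP request-target SP HTTP-version
    parts = lines[0].split()
    if len(parts) >= 2:
        fragments["url"] = parts[1]

    # Header lines: keep only the target domains, drop the rest as redundant.
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        key = name.strip().lower()
        if sep and key in TARGET_DOMAINS:
            fragments[key] = value.strip()

    if body.strip():
        fragments["req_data"] = body.strip()
    return fragments


# Hypothetical request used only for illustration.
sample = (
    "POST /login?id=1 HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: demo/1.0\r\n"
    "Pragma: no-cache\r\n"
    "X-Forwarded-For: 203.0.113.7\r\n"
    "\r\n"
    "name=alice"
)
fragments = fragment_request(sample)
```

Headers outside the target set, such as `Pragma` in the sample, never reach the dictionary, which is how the redundant data is discarded.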

By way of example, consider the following HTTP data:

GET http://zfxxgk.beijing.gov.cn/myq11SH00/jgsz12j/mybm_list.shtml HTTP/1.1

User-Agent: Mozilla/5.0(Windows NT 6.2;Win64;x64)AppleWebKit/537.36(KHTML,like Gecko)

Pragma: no-cache

Cache-control: no-cache

Accept: */*

Accept-Encoding: gzip

Accept-Charset: utf-8,utf-8;q=0.5,*;q=0.5

Accept-Language: zh-CN,en,*

Host: zfxxgk.beijing.gov.cn

X-Forwarded-For: 59.44.201.158

Referer: http://zfxxgk.beijing.gov.cn/myq11SH00/jgsz12j/mybm_list.shtml

Cookie: __jsluid=3d1fe437aaec9a5107593c15b1f7faa8

If-Modified-Since: Wed, 21 Mar 2018 09:42:58 GMT

Connection: close

Data fragmentation is performed using the above method, with the following result:

{"accept-language":"zh-CN,en,*",

"x-forwarded-for":"59.44.201.158",

"user-agent":"Mozilla/5.0(Windows NT 6.2;Win64;x64)AppleWebKit/537.36(KHTML,like Gecko)",

"url":"/default/xhtml/myqzf/js/page.js",

"host":"zfxxgk.beijing.gov.cn",

"cookie":"__jsluid=3d1fe437aaec9a5107593c15b1f7faa8",

"referer":"http://zfxxgk.beijing.gov.cn/myq11SH00/jgsz12j/mybm_list.shtml"}

S204, regularizing the effective data fragments, and converting the effective data fragments into identification data;

Specifically, the data fragments are generalized. For example, letters in a data fragment that are unrelated to the database may be replaced by the letter A, all digits in the data fragment may be replaced by the letter B, letters that are related to the database may be replaced by the letter C, and all symbols may be retained. This converts the data fragment into a signature (sig).
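The generalization of S204 might look like the following sketch. Several points are assumptions rather than statements of the source: "letters related to the database" is interpreted as words from a small, hypothetical SQL keyword list; each run of letters or digits is collapsed into a single A, B or C, which is consistent with the sig "A://A.A.A.A.A" used as an example in S205; and the name `to_signature` is hypothetical.

```python
import re

# Hypothetical list of database-related words; the source does not enumerate them.
SQL_KEYWORDS = {"select", "union", "insert", "update", "delete",
                "drop", "from", "where", "or", "and"}

# A token is a run of ASCII letters, a run of ASCII digits, or any single character.
_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|.", re.DOTALL)


def to_signature(fragment):
    """Generalize one data fragment into a signature (sig)."""
    out = []
    for tok in _TOKEN.findall(fragment):
        if tok.isascii() and tok.isalpha():
            # Database-related word -> C, any other run of letters -> A.
            out.append("C" if tok.lower() in SQL_KEYWORDS else "A")
        elif tok.isascii() and tok.isdigit():
            out.append("B")  # run of digits -> B
        else:
            out.append(tok)  # symbols are retained unchanged
    return "".join(out)


sig_url = to_signature("http://a.b.c.d.e")
sig_injection = to_signature("id=1 or 1=1")
```

With these assumptions, an ordinary URL and an injection-like parameter produce visibly different signatures, which is what the later clustering step relies on.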

S205, vectorizing the identification data to obtain a digital vector corresponding to the identification data;

Specifically, a sig can contain 36 different characters. After 2-gram (bigram) analysis of these characters, a bag of words with 36 × 36 = 1296 entries is obtained, and each sig can be converted into a digital vector by means of this bag of words. The digital vector is the feature extracted from the data fragment.

Taking the sig "A://A.A.A.A.A" as an example, 2-gram analysis yields the subsequences "A:", ":/", "//", "/A", "A." and ".A", that is, a set of 6 distinct subsequences drawn from a sequence of 12 2-grams. The number of occurrences of each subsequence is counted to obtain its frequency of occurrence. The subsequences are then compared against the bag of words: for each element of the bag of words, if it appears in the set of subsequences, the value at the corresponding position of the digital vector is the frequency of occurrence of that element in the set of subsequences; otherwise it is 0. In this way the sig is converted into a digital vector.
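The 2-gram vectorization of S205 can be sketched as follows. The source does not list the 36 characters; this sketch assumes they are A, B and C plus the 32 ASCII punctuation characters and the space, and it assumes that "frequency" means the count of a 2-gram divided by the total number of 2-grams in the sig. The names `ALPHABET`, `VOCAB`, `INDEX` and `sig_to_vector` are hypothetical.

```python
import string
from collections import Counter
from itertools import product

# Assumed alphabet: 3 letters + 32 punctuation characters + space = 36 characters.
ALPHABET = "ABC" + string.punctuation + " "
VOCAB = ["".join(pair) for pair in product(ALPHABET, repeat=2)]  # 36 * 36 = 1296
INDEX = {gram: i for i, gram in enumerate(VOCAB)}


def sig_to_vector(sig):
    """Convert a sig into a 1296-dimensional 2-gram frequency vector."""
    grams = [sig[i:i + 2] for i in range(len(sig) - 1)]
    counts = Counter(grams)
    total = len(grams)
    vec = [0.0] * len(VOCAB)
    for gram, n in counts.items():
        if gram in INDEX:  # 2-grams outside the assumed alphabet are ignored
            vec[INDEX[gram]] = n / total  # relative frequency (assumption)
    return vec


vec = sig_to_vector("A://A.A.A.A.A")
nonzero = [VOCAB[i] for i, v in enumerate(vec) if v > 0]
```

For the example sig, only 6 of the 1296 positions are non-zero, so the vectors are sparse; raw counts could be used instead of relative frequencies if that is what the clustering step expects.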

S206, clustering the hypertext transmission data and the training data according to the digital vectors corresponding to the effective data fragments to obtain a plurality of different data classes;

Each data class includes one or more pieces of HTTP data, and the data in each data class has similar characteristics.

Specifically, a clustering algorithm such as the BIRCH clustering algorithm is used for clustering. The digital vectors of the data fragments of the HTTP data and of the training data are input into the algorithm, and the clustering algorithm groups data fragments with similar characteristics into one class.

S207, if the training data is contained in the target data class, identifying the hypertext transmission data in the target data class as normal data, wherein the target data class is any one of the data classes;

The clustering algorithm groups the data fragments into different data classes according to their characteristics, and the data in each data class has similar characteristics. Therefore, if a data class contains training data, the other data in that class has characteristics similar to the training data, which is normal data, and the HTTP data in that data class is normal data.

S208, if the training data is not contained in the target data class, identifying the hypertext transmission data in the data class as data to be detected;

For a target data class that does not contain training data, directly identifying its HTTP data as abnormal data could lead to a high false positive rate. The data therefore needs to be examined again and is identified as data to be detected.

S209, training a first classifier by adopting the normal data and the training data;

Specifically, the normal data and the training data are input into the first classifier as normal data samples to obtain a first classifier model. The trained first classifier identifies normal data according to the characteristics of the data.

S210, classifying the data to be detected by adopting the first classifier, and dividing the data to be detected into normal data and abnormal data.

The data to be detected is input into the first classifier. If the data to be detected matches the characteristics of normal data, the first classifier identifies it as normal data; otherwise, the first classifier identifies it as abnormal data.

In the embodiment, the data is processed after being fragmented, and the detection is performed by using the effective data fragments, so that the influence of redundant data is avoided; the data are classified and detected twice, so that the misjudgment rate is reduced; meanwhile, the received normal data and the training data are used as sample data together, and new abnormity can be detected in time.

Fig. 3 is a schematic flowchart of a method for detecting hypertext transfer data according to a third embodiment of the present application, where as shown in fig. 3, the method includes the following steps:

S301, receiving hypertext transmission data;

S302, clustering the hypertext transmission data and preset training data, and dividing the hypertext transmission data into normal data and data to be detected according to a clustering result;

S301 to S302 in this embodiment are similar to S101 to S102 in the first embodiment; the two may be referred to each other, and the details are not repeated here.

S303, extracting characteristic information of the normal data and the training data;

When the classifier is trained with the normal data and the training data, the features of the normal data and the training data need to be extracted. Specifically, feature extraction can be performed with the Distributed Memory model of Paragraph Vectors (Doc2Vec). The Doc2Vec model learns features from a corpus in an unsupervised manner and outputs a feature vector of fixed length; the output feature vector is the extracted feature information.

S304, training a preset single-classification support vector machine classifier by using the normal data and the feature information of the training data to obtain a first classifier;

The feature information of the normal data and the training data is loaded into a one-class support vector machine (OC-SVM) classifier as normal data samples. The samples are mapped into a high-dimensional feature space, and a hyperplane is sought in that space that separates the mapped samples from the origin. The hyperplane is chosen so that its distance from the origin in the feature space is maximal while all data points are separated from the origin; finding such a hyperplane yields the OC-SVM classifier model. The OC-SVM classifier model obtained after training is used as the first classifier.
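For reference, the hyperplane search described above is commonly written as the following optimization problem. This is the standard one-class SVM formulation with slack variables; the source does not state which formulation is used, so the symbols here are assumptions: $x_1,\dots,x_n$ are the normal data samples, $\Phi$ is the feature map, $w$ and $\rho$ define the hyperplane, $\xi_i$ are slack variables, and $\nu \in (0, 1]$ bounds the fraction of samples allowed to fall on the origin side.

```latex
\min_{w,\,\xi,\,\rho}\quad \frac{1}{2}\lVert w\rVert^{2}
  + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho
\qquad\text{subject to}\qquad
\langle w, \Phi(x_i)\rangle \ge \rho - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n

f(x) = \operatorname{sgn}\bigl(\langle w, \Phi(x)\rangle - \rho\bigr)
```

The distance from the origin to the hyperplane is $\rho / \lVert w\rVert$, so minimizing $\lVert w\rVert$ while maximizing $\rho$ pushes the hyperplane away from the origin. A sample with $f(x) = +1$ is treated as normal data and one with $f(x) = -1$ as abnormal data.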

S305, testing the normal data by adopting the first classifier, and counting the data volume of misjudged data in the testing process;

The normal data here includes both the normal data identified during the clustering process and the training data.

Specifically, the trained OC-SVM classifier is used to classify the normal data. If a piece of normal data is recognized as abnormal data, it is identified as misjudged data, and the amount of misjudged data is counted.

S306, if the data volume of the misjudged data is smaller than a preset threshold value, classifying the data to be detected by using the first classifier, and dividing the data to be detected into normal data and abnormal data;

If the data volume of the misjudged data is smaller than the preset threshold, the test error of the trained OC-SVM classifier is within the preset range, and the classifier can classify data with the expected accuracy. The OC-SVM classifier can therefore be used on its own to classify the data to be detected.

Before the data to be detected is classified by adopting the OC-SVM classifier, the feature information of the data to be detected also needs to be extracted. For example, a Doc2Vec model may be used to extract feature information of the data to be detected.

The OC-SVM classifier can divide the data to be detected into normal data and abnormal data according to the characteristic information of the data to be detected.

S307, if the data volume of the misjudged data is larger than or equal to the preset threshold, training an improved support vector data description classifier with the misjudged data to obtain a second classifier;

If the data volume of the misjudged data is larger than or equal to the preset threshold, the classification performance of the trained OC-SVM classifier has not reached the expected level. To detect the data to be detected more reliably, an improved support vector data description (improved SVDD) classifier may additionally be used. The improved SVDD classifier is trained with the misjudged data, and the trained improved SVDD classifier is used as the second classifier.

S308, classifying the data to be detected by adopting the first classifier and the second classifier respectively to obtain a first classification result and a second classification result;

Specifically, the data to be detected is loaded into the first classifier and the second classifier respectively. The result of classifying the data to be detected with the first classifier is the first classification result, and the result of classifying it with the second classifier is the second classification result.

S309, if the first classification result and the second classification result are both abnormal data, identifying the data to be detected as abnormal data, and otherwise, identifying the data to be detected as normal data.

For data to be detected, if the first classification result is abnormal data and the second classification result is also abnormal data, the data to be detected is identified as the abnormal data; and if the first classification result or the second classification result of the data to be detected is normal data, the data to be detected is identified as normal data.
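The decision logic of S305 to S309 can be sketched as follows. This is an illustrative sketch only: the classifiers are represented by plain callables that return True for "normal", `train_second` is a hypothetical stand-in for training the improved SVDD classifier, and the toy numeric classifiers in the usage lines have nothing to do with real HTTP features.

```python
def classify_pending(pending, normal_samples, first_is_normal,
                     train_second, threshold):
    """Label each pending item 'normal' or 'abnormal'.

    first_is_normal -- callable, True if the first classifier accepts an item
    train_second    -- callable taking the misjudged items and returning a
                       second callable classifier (hypothetical interface)
    threshold       -- preset threshold on the amount of misjudged data
    """
    # S305: test the first classifier on data known to be normal.
    misjudged = [x for x in normal_samples if not first_is_normal(x)]

    # S306: few misjudgments, so the first classifier is used on its own.
    if len(misjudged) < threshold:
        return ["normal" if first_is_normal(x) else "abnormal" for x in pending]

    # S307: otherwise train a second classifier with the misjudged data.
    second_is_normal = train_second(misjudged)

    # S308-S309: abnormal only if both classifiers say abnormal.
    return [
        "abnormal"
        if (not first_is_normal(x) and not second_is_normal(x))
        else "normal"
        for x in pending
    ]


# Toy stand-ins: the first classifier accepts values below 5; the second
# accepts anything up to the largest misjudged value.
first = lambda x: x < 5
make_second = lambda bad: (lambda x: x <= max(bad))

joint = classify_pending([3, 6, 9], [1, 2, 6, 7], first, make_second, threshold=2)
single = classify_pending([3, 6, 9], [1, 2, 6, 7], first, make_second, threshold=3)
```

Requiring agreement between both classifiers before flagging an item trades some sensitivity for a lower false alarm rate, which matches the stated aim of the embodiment.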

The above process is illustrated by way of example as follows:

Suppose there are input data a, b, c, d, e and f, and training data g and h. After data preprocessing, cluster detection is performed in the clustering process. Here a and g are grouped into one class, b and h into one class, c and d into one class, and e and f into one class. According to the clustering result, a and b are judged to be normal data, while c, d, e and f are data to be detected. An OC-SVM classifier model is trained with the data a, b, g and h, the model is tested on a, b, g and h, and the number of misjudged data is counted. Joint classification detection is then completed, and the final judgment is that c and d are normal data and e and f are abnormal data.

In the embodiment, the classification effect of the first classifier is tested in the classification process, and when the effect of the first classifier is lower than expected, the first classifier and the second classifier are adopted to classify the data to be detected simultaneously, so that the possibility that the data to be detected is misjudged is reduced; meanwhile, training data and received partial normal data are used in the training process of the classifier, which is equivalent to updating the detection standard in time, and is beneficial to finding out new abnormity.

Fig. 4 is a schematic flowchart of a method for detecting hypertext transfer data according to a fourth embodiment of the present application, and as shown in fig. 4, the method includes the following steps:

S401, receiving hypertext transmission data;

S402, clustering the hypertext transmission data and preset training data, and dividing the hypertext transmission data into normal data and data to be detected according to a clustering result;

S403, training a first classifier by using the normal data and the training data;

S404, classifying the data to be detected by using the first classifier, and dividing the data to be detected into normal data and abnormal data;

S401 to S404 in this embodiment are similar to S101 to S104 in the first embodiment; the two may be referred to each other, and the details are not repeated here.

S405, updating a data interception rule of the firewall by adopting the abnormal data;

Specifically, when abnormal data is detected, it may be added to a blacklist of the firewall to improve the defense and detection capabilities of the firewall.

S406, adding the normal data into the training data.

Specifically, when the data amount of the normal data is larger than a threshold S1, the data is allowed to be added to the training data. The training data can be kept in a training pool. Because the capacity of the training pool is limited, and to ensure that the amount of data in the pool does not exceed the storage limit, the data in the training pool needs to be clustered when its amount is greater than S2. Using the BIRCH clustering algorithm, the number of classes in the clustering result is specified to be less than S2, and one piece of data is randomly selected from each class to form a new set of training data that replaces all data in the training pool. The values of S1 and S2 can be set according to the storage space or the detection requirements.
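The training pool update of S406 can be sketched as below. This is a minimal sketch under assumptions: `cluster_fn` is a hypothetical stand-in for the BIRCH step that returns the pool partitioned into at most a given number of groups, the names `update_pool` and `by_tens` are illustrative, and the toy grouping function has no relation to BIRCH itself.

```python
import random


def update_pool(pool, new_normal, s1, s2, cluster_fn, rng=random):
    """Return the updated training pool.

    s1 -- new normal data is added only if its amount exceeds s1
    s2 -- the pool is condensed once it holds more than s2 items
    cluster_fn(items, max_groups) -- hypothetical clustering step returning
        a list of groups (lists) with at most max_groups groups
    """
    if len(new_normal) > s1:
        pool = pool + list(new_normal)

    if len(pool) > s2:
        # Fewer than s2 classes, then one random representative per class.
        groups = cluster_fn(pool, s2 - 1)
        pool = [rng.choice(group) for group in groups if group]
    return pool


def by_tens(items, max_groups):
    """Toy grouping used only for this example: bucket integers by tens."""
    buckets = {}
    for x in items:
        buckets.setdefault((x // 10) % max_groups, []).append(x)
    return list(buckets.values())


condensed = update_pool([1, 2, 11], [12, 21, 22], s1=2, s2=4, cluster_fn=by_tens)
unchanged = update_pool([1, 2], [3], s1=2, s2=4, cluster_fn=by_tens)
```

Keeping one representative per class bounds the pool size below S2 while preserving the variety of normal patterns seen so far.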

In the embodiment, after HTTP data detection is finished, abnormal data is added into a rule base of the firewall, so that the defense and detection capabilities of the firewall are improved; and the training data is updated in time by using normal data, so that new abnormity can be found in time.

Fig. 5 is a schematic flowchart of a method for detecting hypertext transfer data according to the fifth embodiment of the present application, as shown in fig. 5, where the method for detecting hypertext transfer data includes a data preprocessing process, a clustering process, and a classification process.

Data preprocessing is carried out on the input data and the training data, and clustering is then performed with the BIRCH clustering algorithm to obtain normal data and data to be detected. Features of the normal data are then extracted, and a classifier is trained on them. After the features of the data to be detected are extracted, the trained classifier classifies the data to be detected according to these features, dividing it into normal data and abnormal data. In this way, after the two stages of clustering and classification, the input data is finally divided into normal data and abnormal data, and abnormal data in the HTTP data can be detected. The input data is the HTTP data to be detected, and the training data is normal data.

The preprocessing process includes data fragmentation. During data fragmentation, certain types of domains can be selected as the dividing units, and each domain has corresponding training data. The input data is divided into data fragments in units of domains; the data fragments that belong to a selected domain are kept as effective data fragments, and the data fragments that do not belong to a selected domain are deleted as redundant data. The effective data fragments are regularized and converted into identifiers, and the identifiers are vectorized to obtain feature vectors. The training data may also be whole, unfragmented normal data. In that case, the training data goes through the data preprocessing process together with the input data; if the training data is already fragmented normal data corresponding to each domain, the feature vectors of the training data are extracted.
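The preprocessing steps described above can be sketched in Python as follows. This is a minimal illustration and not the patented implementation: it assumes that a "domain" is a component of the request URL, that regularization replaces runs of letters and digits with the placeholder identifiers A and N, and that vectorization counts identifiers over a small fixed vocabulary. The names TARGET_DOMAINS, VOCAB, fragment, effective_fragments, regularize, vectorize and preprocess are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Hypothetical choice of target domains; the source does not list them.
TARGET_DOMAINS = {"path", "query"}
# Hypothetical identifier vocabulary; the last vector slot counts everything else.
VOCAB = ["A", "N", "/", "=", "&", "'", "<", ">"]


def fragment(url):
    """Split a request URL into fragments keyed by domain (assumed: URL parts)."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.netloc,
            "path": parts.path, "query": parts.query}


def effective_fragments(fragments, targets=TARGET_DOMAINS):
    """Keep fragments whose domain is a target domain; discard the rest."""
    return {d: v for d, v in fragments.items() if d in targets}


def regularize(text):
    """Replace each run of letters with 'A' and each run of digits with 'N'."""
    return re.sub(r"[A-Za-z]+|[0-9]+",
                  lambda m: "A" if m.group()[0].isalpha() else "N", text)


def vectorize(identifiers):
    """Count each vocabulary symbol; unknown symbols go to the final slot."""
    vec = [0] * (len(VOCAB) + 1)
    for ch in identifiers:
        vec[VOCAB.index(ch) if ch in VOCAB else len(VOCAB)] += 1
    return vec


def preprocess(url):
    """Return one feature vector per effective fragment of the request."""
    kept = effective_fragments(fragment(url))
    return {d: vectorize(regularize(v)) for d, v in kept.items()}
```

Under these assumptions, preprocess("http://example.com/item/42?id=7&name=bob") keeps only the path and query fragments, turns them into the identifier strings "/A/N" and "A=N&A=A", and returns one count vector for each.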

In the clustering process, the BIRCH clustering algorithm clusters the data fragments according to the feature vectors of the effective data fragments. The clustering algorithm divides the data into different data classes according to these feature vectors. If a data class contains normal data (that is, training data), the input data in that data class is identified as normal data; otherwise, the input data is identified as data to be detected.
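The labelling rule applied to the clustering result can be sketched as follows. The sketch assumes that cluster labels have already been produced by a clustering step such as BIRCH over the combined input and training vectors; it does not perform the clustering itself. The function name split_by_cluster and its arguments are hypothetical.

```python
def split_by_cluster(labels, is_training):
    """Split input records into normal ones and ones still to be detected.

    labels: cluster label of every record (input and training combined).
    is_training: True where the record comes from the training data.
    Returns two lists of indices into the combined records.
    """
    # Clusters that contain at least one training (known normal) record.
    training_clusters = {lab for lab, t in zip(labels, is_training) if t}

    normal, to_detect = [], []
    for idx, (lab, t) in enumerate(zip(labels, is_training)):
        if t:
            continue  # training records are not classified
        if lab in training_clusters:
            normal.append(idx)
        else:
            to_detect.append(idx)
    return normal, to_detect
```

For example, with labels [0, 0, 1, 0, 2] where only the first two records are training data, the input record at index 3 shares cluster 0 with the training data and is treated as normal, while the records at indices 2 and 4 are left for the classification stage.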

Fig. 6 is a schematic diagram of the classification process provided in the fifth embodiment of the present application. In the classification process, as shown in fig. 5, a Doc2Vec model is used to extract features from the normal data and the data to be detected obtained by clustering, and a One-Class HYBRID classifier is used to classify the data to be detected. The steps of the One-Class HYBRID classifier may be specifically as shown in fig. 6:

The OC-SVM classifier is obtained by training with normal data, and the trained classifier is then tested with the normal data. During testing, some normal data may be misjudged as abnormal data, and the number N of misjudged data items is counted.

If N is greater than or equal to the threshold T, an improved SVDD classifier is trained with the misjudged data, the data to be detected is classified by the OC-SVM classifier and the improved SVDD classifier respectively, and the final classification result is the joint classification result of the two classifiers. The joint rule is as follows: data judged to be abnormal by both classifiers is abnormal data; otherwise, the data is judged to be normal data. If N is smaller than the threshold T, the OC-SVM classifier alone is used to classify the data to be detected, and the final classification result is the classification result of the OC-SVM classifier. Finally, the received HTTP data is divided into normal data and abnormal data.
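The decision logic of the two-classifier scheme can be sketched as follows. The sketch does not implement an OC-SVM or an improved SVDD; both are represented by plain callables, and the toy stand-ins at the end exist only to make the example runnable. The names hybrid_classify, first_is_abnormal, train_second, toy_first and toy_train_second are hypothetical.

```python
def hybrid_classify(pending, normal, first_is_abnormal, train_second, threshold_t):
    """Classify pending samples with one or two one-class classifiers.

    first_is_abnormal: callable standing in for the trained OC-SVM.
    train_second: callable that takes the misjudged samples and returns a
        callable standing in for the improved SVDD classifier.
    """
    # Test the first classifier on normal data and collect its false alarms.
    misjudged = [x for x in normal if first_is_abnormal(x)]

    if len(misjudged) >= threshold_t:
        second_is_abnormal = train_second(misjudged)
        # Joint rule: abnormal only when both classifiers say abnormal.
        return ["abnormal" if first_is_abnormal(x) and second_is_abnormal(x)
                else "normal" for x in pending]

    # Too few false alarms: the first classifier decides alone.
    return ["abnormal" if first_is_abnormal(x) else "normal" for x in pending]


# Toy one-dimensional stand-ins, for illustration only.
def toy_first(x):
    return x > 5


def toy_train_second(misjudged):
    center = sum(misjudged) / len(misjudged)
    radius = max(abs(m - center) for m in misjudged) + 1.0
    return lambda x: abs(x - center) > radius


joint = hybrid_classify([3, 6.5, 20], [1, 2, 6, 7], toy_first, toy_train_second, 2)
single = hybrid_classify([3, 6.5, 20], [1, 2, 6, 7], toy_first, toy_train_second, 3)
```

With T = 2 the two misjudged normal samples trigger the joint rule, so the sample 6.5 is rescued as normal because the second classifier recognizes it. With T = 3 the first classifier decides alone and the same sample is reported as abnormal.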

In this embodiment, two classifiers are used for performing combined classification detection, which is beneficial to reducing the misjudgment rate of data.

Fig. 7 is a system block diagram of a method for detecting hypertext transfer data according to a sixth embodiment of the present application. As shown in fig. 7, the system for detecting anomalies in hypertext transfer data may include a server, a traffic memory, a training pool, a detection module, a filtering module, and a firewall.

The server sends HTTP data to the network through the router to perform various internet access behaviors.

The traffic memory is used for storing the input HTTP data. The HTTP data sent by the server may be stored in the traffic memory; the data in the traffic memory may be transmitted to the detection module at intervals for detection, and after the detection is completed, the HTTP data for that interval may be cleared.
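The behaviour of the traffic memory can be sketched as a small buffer. This is an illustrative sketch only: the class name TrafficMemory, its methods, and the use of a caller-supplied clock value are assumptions that do not come from the source.

```python
class TrafficMemory:
    """Buffer HTTP records and hand them to detection once per interval."""

    def __init__(self, interval_seconds, start_time=0.0):
        self.interval = interval_seconds
        self.last_sent = start_time
        self.records = []

    def store(self, record):
        """Store one HTTP record sent by the server."""
        self.records.append(record)

    def due_batch(self, now):
        """Return a copy of the buffered records if the interval has elapsed."""
        if now - self.last_sent < self.interval or not self.records:
            return []
        self.last_sent = now
        return list(self.records)

    def clear_detected(self, batch):
        """Drop the records of a batch once detection on it has finished."""
        del self.records[:len(batch)]
```

Records that arrive while a batch is being examined stay in the buffer, because clear_detected removes only as many leading records as the finished batch contained.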

The training pool stores the normal data of each domain, and the detection module can use the data in the training pool for training while the system is running. Meanwhile, the data in the training pool can be continuously updated according to the requirements of the filtering module.

The detection module is used for dividing data into normal data and abnormal data. The HTTP data in the traffic memory and the training data in the training pool enter the detection module together. In the detection module, the data is first preprocessed; clustering detection is then performed to divide the HTTP data into normal data and data to be detected; classification detection is then performed on the data to be detected; and normal data and abnormal data are finally obtained. The normal data can be used for updating the training pool after passing through the filtering module, and the abnormal data can be used for improving the defense capability of the firewall.

The filtering module is used for updating the training data. While the system is running, the training pool is continuously updated to improve the detection effect. To ensure the accuracy of the detection result while the training pool is being updated, and to avoid generating a large number of false alarms during continuous operation, newly added normal data needs to be screened. The normal data is allowed to be added to the training pool only when the data amount of the normal data is larger than the threshold S1. Because the capacity of the training pool is limited, the data in the training pool needs to be clustered whenever the amount of data in the pool is greater than S2, so that the pool does not exceed the limit of the storage space. Using the BIRCH clustering algorithm, the number of classes in the clustering result is specified to be less than S2, and one piece of data is randomly selected from each class in the clustering result to form a new set of training data that replaces all data in the training pool.
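The update rule of the filtering module can be sketched as follows. The sketch makes two assumptions that the source does not state: the "data amount" compared with S1 is taken to be the size of the newly detected batch of normal data, and the clustering step is passed in as a callable cluster_fn(records, k) that returns one label per record, where a BIRCH implementation would be plugged in. The names update_pool, cluster_fn and round_robin are hypothetical.

```python
import random
from collections import defaultdict


def update_pool(pool, normal_batch, s1, s2, cluster_fn, rng=random):
    """Return the new training pool after screening and, if needed, compression."""
    # Screening: only a sufficiently large batch of normal data is admitted.
    if len(normal_batch) <= s1:
        return list(pool)
    merged = list(pool) + list(normal_batch)

    # Capacity check: compress only when the pool has grown beyond S2.
    if len(merged) <= s2:
        return merged

    # Cluster into fewer than S2 classes (here: at most S2 - 1).
    labels = cluster_fn(merged, s2 - 1)
    by_cluster = defaultdict(list)
    for record, label in zip(merged, labels):
        by_cluster[label].append(record)

    # One randomly chosen representative per class replaces the whole pool.
    return [rng.choice(members) for members in by_cluster.values()]


# Stand-in for a real clustering step, for illustration only.
def round_robin(records, k):
    return [i % k for i in range(len(records))]


small = update_pool(["a", "b"], ["c"], s1=2, s2=4, cluster_fn=round_robin)
large = update_pool(["a", "b"], ["c", "d", "e"], s1=2, s2=4,
                    cluster_fn=round_robin, rng=random.Random(0))
```

In the first call the batch is too small and the pool is unchanged. In the second call the merged pool of five records exceeds S2 = 4, so it is reduced to one representative for each of the three classes.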

The firewall is used for intercepting abnormal data to guarantee the safety of the system. The abnormal data obtained from the final detection result can be added to the rule base of the firewall to improve the defense and detection capabilities of the firewall.

Fig. 8 is a schematic diagram of an apparatus for detecting hypertext transfer data according to a seventh embodiment of the present application, and as shown in fig. 8, the apparatus may include the following modules:

the receiving module is used for receiving the hypertext transmission data;

The device further comprises a preprocessing module, and the preprocessing module can comprise the following sub-modules:

the data fragmentation submodule is used for dividing the hypertext transmission data into a plurality of data fragments by taking a domain as a unit;

the effective data fragment identification submodule is used for identifying the data fragment as an effective data fragment if the domain corresponding to the data fragment belongs to a target domain;

the conversion identification sub-module is used for carrying out regularization processing on the effective data fragments and converting the effective data fragments into identification data;

and the vectorization sub-module is used for vectorizing the identification data to obtain a digital vector corresponding to the identification data.

The clustering module may specifically include the following sub-modules:

the data class acquisition submodule is used for clustering the hypertext transmission data and the training data according to the digital vectors corresponding to the effective data fragments to obtain a plurality of different data classes;

a normal data identification sub-module, configured to identify the hypertext transfer data in a target data class as normal data if the target data class includes the training data, where the target data class is any one of the data classes;

and the abnormal data identification sub-module is used for identifying the hypertext transmission data in the target data class as the data to be detected if the training data is not contained in the target data class.

The training module may specifically include the following sub-modules:

the characteristic extraction submodule is used for extracting the characteristic information of the normal data and the training data;

the first classifier training sub-module is used for training a preset single-classification support vector machine classifier by adopting the normal data and the feature information of the training data to obtain a first classifier;

and the testing sub-module is used for testing the normal data by adopting the first classifier and counting the data volume of misjudged data in the testing process.

The classification module may specifically include the following sub-modules:

the first classifier classification submodule is used for classifying the data to be detected by adopting the first classifier if the data volume of the misjudged data is smaller than a preset threshold value, and dividing the data to be detected into normal data and abnormal data;

and the joint classification submodule is used for, if the data volume of the misjudged data is larger than or equal to the preset threshold, training an improved support vector data description classifier with the misjudged data to obtain a second classifier, classifying the data to be detected by adopting the first classifier and the second classifier, and dividing the data to be detected into normal data and abnormal data.

The joint classification submodule may specifically include the following units:

a classification result obtaining unit, configured to classify the data to be detected by using the first classifier and the second classifier respectively, and obtain a first classification result and a second classification result;

and the data identification unit is used for identifying the data to be detected as abnormal data if the first classification result and the second classification result are both abnormal data, and otherwise, identifying the data to be detected as normal data.

The device also comprises the following modules:

the firewall rule updating module is used for updating the data interception rule of the firewall by adopting the abnormal data;

and the training data updating module is used for adding the normal data into the training data.

Fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 9, the terminal device 9 of this embodiment includes: at least one processor 90 (only one shown in fig. 9), a memory 91, and a computer program 92 stored in the memory 91 and operable on the at least one processor 90, the processor 90 implementing the steps in any of the various embodiments of the method for detecting hypertext transfer data described above when executing the computer program 92.

The terminal device 9 may be a desktop computer, a notebook, a palm computer, a cloud server, or another computing device. The terminal device may include, but is not limited to, a processor 90 and a memory 91. Those skilled in the art will appreciate that fig. 9 is only an example of the terminal device 9 and does not constitute a limitation on the terminal device 9, which may include more or fewer components than those shown in the drawing, may combine some components, or may use different components; for example, it may further include an input/output device, a network access device, and the like.

The processor 90 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 91 may in some embodiments be an internal storage unit of the terminal device 9, such as a hard disk or a memory of the terminal device 9. The memory 91 may in other embodiments be an external storage device of the terminal device 9, such as a plug-in hard disk provided on the terminal device 9, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, and the like. Further, the memory 91 may include both an internal storage unit and an external storage device of the terminal device 9. The memory 91 is used for storing an operating system, an application program, a boot loader, data, and other programs, such as the program code of the computer program. The memory 91 may also be used to temporarily store data that has been output or is to be output.

It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.

It should be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional units and modules is only used for illustration, and in practical applications, the above function distribution may be performed by different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the above described functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.

The embodiments of the present application further provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to implement the steps in the above method embodiments.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which implements the steps of the method embodiments described above when executed by a processor. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present application, and they should be construed as being included in the present application.

Claims

1. A method for detecting hypertext transfer data, comprising:

receiving hypertext transfer data;

training a first classifier using the normal data and the training data;

classifying the data to be detected by adopting the first classifier, and dividing the data to be detected into normal data and abnormal data;

the method further comprises the following steps:

testing the normal data by adopting the first classifier, and counting the data volume of misjudged data in the testing process;

if the data quantity of the misjudged data is larger than or equal to a preset threshold value, training an improved support vector data description classifier with the misjudged data to obtain a second classifier;

classifying the data to be detected by adopting the first classifier and the second classifier respectively to obtain a first classification result and a second classification result;

and if the first classification result and the second classification result are both abnormal data, identifying the data to be detected as the abnormal data, otherwise, identifying the data to be detected as the normal data.

2. The method of claim 1, wherein before clustering the hypertext transfer data with the predetermined training data and dividing the hypertext transfer data into normal data and data to be detected according to a clustering result, further comprising:

dividing the hypertext transfer data into a plurality of data fragments by taking a domain as a unit;

if the domain corresponding to the data fragment belongs to the target domain, identifying the data fragment as an effective data fragment;

regularizing the effective data fragments, and converting the effective data fragments into identification data;

vectorizing the identification data to obtain a digital vector corresponding to the identification data.

3. The method of claim 2, wherein the clustering the hypertext transfer data with the predetermined training data and the dividing the hypertext transfer data into normal data and data to be detected according to the clustering result comprises:

clustering the hypertext transmission data and the training data according to the digital vectors corresponding to the effective data fragments to obtain a plurality of different data classes;

if the training data is contained in the target data class, identifying the hypertext transmission data in the target data class as normal data, wherein the target data class is any one of the data classes;

and if the target data class does not contain the training data, identifying the hypertext transmission data in the data class as data to be detected.

4. The method of claim 1, wherein said training a first classifier using said normal data and said training data comprises:

extracting characteristic information of the normal data and the training data;

and training a preset single-classification support vector machine classifier by using the normal data and the characteristic information of the training data to obtain a first classifier.

5. The method as claimed in claim 4, wherein said classifying the data to be detected by using the first classifier, and dividing the data to be detected into normal data and abnormal data comprises:

and if the data volume of the misjudged data is smaller than a preset threshold value, classifying the data to be detected by adopting the first classifier, and dividing the data to be detected into normal data and abnormal data.

6. The method of claim 1, further comprising:

updating the data interception rule of the firewall by adopting the abnormal data;

adding the normal data to the training data.

7. An apparatus for detecting hypertext transfer data, comprising:

the receiving module is used for receiving the hypertext transmission data;

the classification module is used for classifying the data to be detected by adopting the first classifier and dividing the data to be detected into normal data and abnormal data;

the training module is further configured to:

the classification module is further configured to:

if the data quantity of the misjudged data is larger than or equal to a preset threshold value, training an improved support vector data description classifier with the misjudged data to obtain a second classifier;

8. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 6 when executing the computer program.

9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.