CN111639277A

CN111639277A - Automated extraction method of machine learning sample set and computer-readable storage medium

Info

Publication number: CN111639277A
Application number: CN202010440435.3A
Authority: CN
Inventors: 陈建勇; 范渊
Original assignee: Hangzhou Dbappsecurity Technology Co Ltd
Current assignee: DBAPPSecurity Co Ltd; Hangzhou Dbappsecurity Technology Co Ltd
Priority date: 2020-05-22
Filing date: 2020-05-22
Publication date: 2020-09-08

Abstract

The present application relates to a method, a computer device and a computer-readable storage medium for automated extraction of a machine learning sample set. The automated extraction method of the machine learning sample set comprises the following steps: extracting an access data set from a website access log according to unit time; extracting, from the access data set, target access data whose combination of source IP address and request URL is distinct within each unit time; and marking the target access data as normal access data and storing the normal access data in a machine learning sample set. The present application thereby solves the problem that the preparation process of machine learning samples is inefficient, and improves the preparation efficiency of machine learning samples.

Description

Automated extraction method of machine learning sample set and computer-readable storage medium

Technical Field

The present application relates to the field of data processing, and in particular, to a method, a computer device, and a computer-readable storage medium for automated extraction of a set of machine learning samples.

Background

As WEB applications become increasingly abundant, WEB servers, with their powerful computing capacity, processing performance and the high value of the data they hold, are becoming the main target of attacks. Security incidents such as SQL injection, web page tampering and web page trojan implantation occur frequently.

Users such as enterprises generally adopt firewalls as the first line of defense of their security systems. In practice, however, firewalls have various shortcomings, which led to the emergence of the Web application protection system (Web Application Firewall, WAF for short). The WAF represents an emerging class of information security technology that addresses Web application security problems which traditional devices such as firewalls cannot handle. Unlike a traditional firewall, the WAF works at the application layer and therefore has inherent technical advantages for Web application protection. Based on a deep understanding of Web application services and logic, the WAF inspects and verifies the content of the various requests coming from Web application clients, ensures that the requests are secure and legitimate, and blocks illegitimate requests in real time, thereby effectively protecting websites.

At present, WAFs rely mainly on rule-based protection. Rule-based protection provides security rules for various Web applications; the WAF manufacturer maintains this rule base and updates it from time to time, and the user can inspect applications comprehensively according to the rules. However, rule-based protection often produces false positives and false negatives: because it essentially matches website traffic against known characteristics, both are inevitable.

Using a machine learning model in the WAF to learn from the access logs, and then using the trained model to predict whether an access request is secure, is an effective way to discover security risks whose characteristics are not yet known.

However, a machine learning model is usually trained by a supervised learning method, which requires a large number of samples, typically from several thousand to tens of thousands. Taking the simplest binary classification model as an example, the samples may consist of only positive samples, only negative samples, or both positive and negative samples. For each sample, the class label is usually judged manually and annotated manually.

At present, no effective solution has been proposed in the related art for the problem that manually judging the classification label of each sample and manually annotating it consumes a great deal of manpower and makes the preparation of machine learning samples inefficient.

Disclosure of Invention

The embodiment of the application provides an automatic extraction method of a machine learning sample set, computer equipment and a computer readable storage medium, so as to at least solve the problem of low efficiency of a preparation process of a machine learning sample in the related art.

In a first aspect, an embodiment of the present application provides an automated extraction method for a machine learning sample set, including: extracting an access data set from a website access log according to unit time; extracting target access data with different source IP addresses and request URLs in each unit time from the access data set; and marking the target access data as normal access data and storing the normal access data in a machine learning sample set.

In some of these embodiments, extracting the access data set from the website access log according to unit time includes: extracting the access data within one unit time from the website access log, and taking the access data within that unit time as the access data set.

In some of these embodiments, extracting the access data set from the website access log according to unit time includes: extracting access data from the website access log, and dividing the access data according to the unit time to obtain a plurality of access data sets.

In some embodiments, after extracting the access data set from the website access log according to unit time, the method further comprises: screening out the access data with a preset characteristic from the access data set.

In some embodiments, the preset characteristic includes HTTP access data with an HTTP response code of 404 or 5XX; screening out the access data having the preset characteristic from the access data set comprises: extracting the HTTP response code of each piece of access data in the access data set; and deleting the access data with an HTTP response code of 404 or 5XX from the access data set.

In some embodiments, the preset characteristic includes that the access object is a static file, wherein the static file includes at least one of: pictures, CSS pages, JS pages.

In some of these embodiments, the unit time is 8 hours, 12 hours, or 24 hours.

In some of these embodiments, after marking the target access data as normal access data and storing in a machine learning sample set, the method further comprises: judging whether the number of samples in the machine learning sample set reaches a preset number threshold value or not; under the condition that the number of the samples in the machine learning sample set is judged to be less than a preset number threshold value, continuously extracting a new access data set from the website access log according to the unit time, and extracting new target access data with different source IP addresses and request URLs in each unit time from the new access data set; and marking the new target access data as normal access data and storing the normal access data in the machine learning sample set.

In a second aspect, the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the automatic extraction method of the machine learning sample set according to the first aspect when executing the computer program.

In a third aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for automatically extracting a machine learning sample set according to the first aspect.

Compared with the related art, the automatic extraction method, the computer device and the computer-readable storage medium for the machine learning sample set provided by the embodiment of the application extract the access data set from the website access log according to unit time; target access data with different source IP addresses and request URLs in each unit time are extracted from the access data set; the target access data are marked as normal access data and stored in the machine learning sample set, so that the problem of low efficiency of the preparation process of the machine learning samples is solved, and the preparation efficiency of the machine learning samples is improved.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

Fig. 1 is a flowchart of an automated extraction method of a machine learning sample set according to an embodiment of the present application;

Fig. 2 is a flowchart of an automated extraction method of a machine learning sample set according to a preferred embodiment of the present application;

Fig. 3 is a hardware structure diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments provided in the present application without any inventive work are within the scope of the present application.

It is obvious that the drawings in the following description are only some examples or embodiments of the present application; a person of ordinary skill in the art can also apply the present application to other similar scenarios according to these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.

Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The use of the terms "including," "comprising," "having," and any variations thereof herein, is meant to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more.

Interpretation of terms:

Machine learning: machine learning is an interdisciplinary field that draws on probability theory, statistics, approximation theory and the theory of complex algorithms. It uses the computer as a tool, aims to simulate the way humans learn in real time, and organizes existing content into knowledge structures so as to effectively improve learning efficiency. Machine learning includes shallow learning, such as support vector machines and random forests, and deep learning, such as convolutional neural networks, robust learning networks, generative adversarial networks, recurrent neural networks, and the like.

Sample set: in fields such as machine learning and pattern recognition, samples may be divided into one or more independent sets. According to their classification labels (in binary classification), samples can be divided into a positive sample set and a negative sample set; according to their purpose, they can be divided into a training set, a validation set, a test set and the like. The training set is used to train the model, the validation set is used to determine the network structure or the parameters that control model complexity, and the test set is used to verify the model performance. One typical division is that the training set accounts for 50% of the total samples and the validation set and the test set account for 25% each, all three being randomly drawn from the sample set.
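As a minimal illustration of the 50%/25%/25% division mentioned above, the following Python sketch randomly partitions a list of samples. The function name split_samples and the fixed random seed are assumptions made for this example only and are not part of the described method.

```python
import random


def split_samples(samples, seed=0):
    """Randomly split samples into 50% training, 25% validation and 25% test."""
    shuffled = list(samples)
    # A fixed seed is an assumption made here so that the split is repeatable.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) // 2
    n_val = len(shuffled) // 4
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train_set, val_set, test_set = split_samples(range(100))
print(len(train_set), len(val_set), len(test_set))  # prints "50 25 25"
```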

Label: in machine learning, the classes into which samples are classified are called labels. In a typical binary classification scenario, the classification labels include a positive label and a negative label. For example, in an access request detection scenario, access requests may be divided into secure requests (positive label) and insecure requests (negative label).

Supervised learning: supervised learning is a machine learning task that infers a function from labeled training data. The training data consists of a set of training examples. In supervised learning, each example consists of an input object (usually a vector) and a desired output value (also called a supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function that can be used to map new examples. An optimal solution would allow the algorithm to correctly determine the class labels of unseen examples. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.

The embodiment provides an automatic extraction method of a machine learning sample set. Fig. 1 is a flowchart of an automated extraction method of a machine learning sample set according to an embodiment of the present application, and as shown in fig. 1, the flowchart includes the following steps:

Step S101, an access data set is extracted from a website access log according to unit time.

The website access log in this embodiment may be an online access log obtained directly and in real time from a Web server or an auditing device, or an offline access log obtained from a database. The unit time can be set as required, according to the following principle: the more access data is available for extraction, the longer the unit time can be set; the more samples the sample set needs to contain, the shorter the unit time can be set. The unit time length is therefore a compromise between the amount of access data available for extraction and the number of samples that the sample set needs to contain. In general, the unit time length may be 24 hours, 12 hours, or 8 hours.
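The division of access data by unit time can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the timestamp layout in TIME_FORMAT, the function name split_by_unit_time, and the choice to count windows from the earliest timestamp (rather than from calendar boundaries) are all hypothetical and not specified by the present application.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIME_FORMAT = "%d/%b/%Y:%H:%M:%S"  # assumed layout of the access time field


def split_by_unit_time(entries, unit=timedelta(hours=24)):
    """Group (time_string, payload) pairs into consecutive unit-time sets.

    Windows are counted from the earliest timestamp in the input, which is
    an assumption; empty windows are skipped.
    """
    parsed = sorted(
        ((datetime.strptime(ts, TIME_FORMAT), payload) for ts, payload in entries),
        key=lambda item: item[0],
    )
    if not parsed:
        return []
    start = parsed[0][0]
    windows = defaultdict(list)
    for when, payload in parsed:
        # timedelta // timedelta yields the integer index of the window
        windows[(when - start) // unit].append(payload)
    return [windows[index] for index in sorted(windows)]


entries = [
    ("03/Aug/2016:00:11:23", "a"),
    ("03/Aug/2016:09:00:00", "b"),
    ("04/Aug/2016:01:00:00", "c"),
]
print(split_by_unit_time(entries))                      # [['a', 'b'], ['c']]
print(split_by_unit_time(entries, timedelta(hours=8)))  # [['a'], ['b'], ['c']]
```

A shorter unit time yields more access data sets from the same log, which matches the trade-off described above.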

The access data set acquired in step S101 generally includes a plurality of pieces of access data, each piece of access data including at least: HTTP response code, request URL, access source IP address, access time, and the like. If the access passed through an access proxy, the access source IP address can be obtained from the X-Forwarded-For field in the HTTP request header of the access data.
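One possible in-memory representation of a piece of access data, together with the lookup of the source IP address behind a proxy, is sketched below in Python. The class name AccessRecord, the function name resolve_source_ip and the choice of the first address in X-Forwarded-For are assumptions for illustration. Note that X-Forwarded-For is supplied by the client side and can be forged, so a real deployment would normally trust it only when it comes from a known proxy.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AccessRecord:
    """Hypothetical container for the fields listed above."""
    source_ip: str
    url: str
    status: int
    time: datetime


def resolve_source_ip(peer_ip, headers):
    """Return the first address in X-Forwarded-For if present, else the peer address."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive
        if name.lower() == "x-forwarded-for" and value.strip():
            return value.split(",")[0].strip()
    return peer_ip


record = AccessRecord(
    source_ip=resolve_source_ip("10.0.0.1", {"X-Forwarded-For": "183.149.70.160, 10.0.0.2"}),
    url="/AdminCP/SubSystem/PanelMenuWeb.jsp",
    status=200,
    time=datetime(2016, 8, 3, 0, 11, 23),
)
print(record.source_ip)  # prints "183.149.70.160"
```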

Step S102, target access data with different source IP addresses and request URLs in each unit time is extracted from the access data set.

In step S102, target access data is extracted from each access data set divided according to unit time, and the extracted target access data is deduplicated using the source IP address and the request URL as a joint key. That is, access data with the same source IP address and the same request URL is removed from the extracted target access data, so that for access requests from the same IP address to the same URL, only one piece of access data per unit time is retained in the extracted target access data as normal access traffic of the website.
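The deduplication on the joint key described above can be sketched in Python as follows. The dictionary field names source_ip and url and the function name extract_target_access_data are assumptions for this example; keeping the first record of each pair, rather than any other one, is also an assumption.

```python
def extract_target_access_data(access_data_set):
    """Keep one record per (source IP, request URL) pair within one unit-time set."""
    seen = set()
    target = []
    for record in access_data_set:
        key = (record["source_ip"], record["url"])  # joint key
        if key not in seen:
            seen.add(key)
            target.append(record)  # the first occurrence is the one retained
    return target


one_unit_time = [
    {"source_ip": "183.149.70.160", "url": "/AdminCP/SubSystem/PanelMenuWeb.jsp", "time": "00:11:23"},
    {"source_ip": "115.202.252.19", "url": "/AdminCP/SubSystem/PanelMenuWeb.jsp", "time": "00:11:23"},
    {"source_ip": "115.202.252.19", "url": "/AdminCP/SubSystem/PanelMenuWeb.jsp", "time": "00:11:30"},
]
target_data = extract_target_access_data(one_unit_time)
print(len(target_data))  # prints "2"
```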

Step S103, the target access data is marked as normal access data and stored in a machine learning sample set.

In step S103, the extracted target access data is given the label of normal access data, yielding positive samples; the positive samples are stored in the machine learning sample set, which finally yields a sample library for machine learning.
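Step S103 can be sketched in Python as shown below. The label value "normal", the use of an in-memory list as the sample set and the function name mark_and_store are assumptions for illustration; an actual sample library could equally be a file or a database.

```python
NORMAL_LABEL = "normal"  # hypothetical label value for positive samples


def mark_and_store(target_access_data, sample_set):
    """Label each record as normal access data and append it to the sample set."""
    for record in target_access_data:
        # Copy the record so that the extracted data itself is left unchanged.
        sample_set.append({**record, "label": NORMAL_LABEL})
    return sample_set


sample_set = mark_and_store(
    [{"source_ip": "183.149.70.160", "url": "/AdminCP/SubSystem/PanelMenuWeb.jsp"}],
    [],
)
print(sample_set[0]["label"])  # prints "normal"
```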

In a network under normal use, it is unlikely that all access sources are attackers. For a website, it is almost impossible for the access requests initiated by most source IP addresses to be attacks over a long period; that would mean abnormal traffic exceeding normal traffic for a long time, so that the value carried by the website traffic would be far lower than the cost of the attack traffic, which is unrealistic. In this embodiment, after deduplication by source IP address and access URL, the attacks are diluted by the large number of access logs accumulated over a long period, so the number of attacks in the extracted log samples is extremely small. It is precisely on this principle that, in this embodiment, the samples in the sample set extracted from the website access log can be regarded as normal access data and marked as positive samples, yielding a sample set that can be used to train a machine learning model.

Through the steps, the problem of low efficiency of the preparation process of the machine learning sample is solved, and the preparation efficiency of the machine learning sample is improved.

When the access data in the website access log is extracted, each piece of access data can be traversed one by one in order of access time. To make it convenient to extract the target access data for each unit time, step S101 may extract the access data of one unit time from the website access log at a time and take it as the access data set. Alternatively, step S101 may extract access data covering a longer period from the website access log at a time and divide it into a plurality of access data sets according to unit time, after which step S102 extracts target access data from each access data set independently.
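The first variant above, in which one unit time of access data is taken from the log at a time, can be sketched as a Python generator over records that are already sorted by access time. The function name iter_unit_time_sets, the time field name and the choice to start the first window at the first record are assumptions for illustration.

```python
from datetime import datetime, timedelta


def iter_unit_time_sets(records, unit=timedelta(hours=24)):
    """Yield one access data set per unit time from records sorted by access time."""
    current = []
    window_end = None
    for record in records:
        when = record["time"]
        if window_end is None:
            window_end = when + unit  # the first window starts at the first record
        if when >= window_end:
            yield current  # the current window is complete
            current = []
            while when >= window_end:
                window_end += unit  # skip over windows that contain no records
        current.append(record)
    if current:
        yield current


t0 = datetime(2016, 8, 3, 0, 11, 23)
records = [{"time": t0 + timedelta(hours=h)} for h in (0, 1, 25, 80)]
sizes = [len(s) for s in iter_unit_time_sets(records)]
print(sizes)  # prints "[2, 1, 1]"
```

Because the generator hands out each window only once and in order, the same raw data is not extracted twice.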

Since the amount of target access data extracted from each access data set is uncertain, in order to obtain a machine learning sample set with a preset number of samples, it may be determined after step S103 whether the number of samples in the machine learning sample set has reached a preset number threshold; if it has not, steps S101 to S103 are executed again, until the number of samples in the machine learning sample set reaches the preset number threshold. In each cycle, the access data set extracted in step S101 should overlap as little as possible with the access data sets extracted in previous cycles; that is, for the same raw data, access data sets are extracted only in sequence, so that the same raw data is not extracted multiple times.
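The repeat-until-threshold cycle described above can be sketched in Python as follows. The function name build_sample_set, the dictionary field names and the label value are assumptions for illustration; the input is assumed to be an iterable that yields each unit-time access data set once and in sequence.

```python
def build_sample_set(unit_time_sets, threshold):
    """Repeat steps S101 to S103 until the sample set reaches the threshold."""
    sample_set = []
    for access_data_set in unit_time_sets:  # S101: next set, never re-read
        seen = set()
        for record in access_data_set:      # S102: deduplicate on (IP, URL)
            key = (record["source_ip"], record["url"])
            if key in seen:
                continue
            seen.add(key)
            sample_set.append({**record, "label": "normal"})  # S103
        if len(sample_set) >= threshold:    # checked after each full cycle
            break
    return sample_set


day = [
    {"source_ip": "1.1.1.1", "url": "/a"},
    {"source_ip": "1.1.1.1", "url": "/a"},
    {"source_ip": "2.2.2.2", "url": "/a"},
]
samples = build_sample_set([day, day, day], threshold=3)
print(len(samples))  # prints "4"
```

In this sketch the threshold is checked only after a whole unit time has been processed, so the final sample set may slightly exceed the threshold.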

Some access data in the extracted access data set has no bearing on the security detection of access requests, for example HTTP access data with an HTTP response code of 404 or 5XX, or accesses to static files. The access data in the access data set can therefore be preprocessed according to preset characteristics, and the access data that has no bearing on the security detection of access requests can be screened out (that is, removed). The preset characteristics include, but are not limited to, HTTP access data with an HTTP response code of 404 or 5XX, and access data whose access object is a static file. The static file includes, but is not limited to, at least one of the following: pictures, CSS pages, JS pages, etc.

The HTTP response code is also called the HTTP status code. Response code 404 indicates that the server cannot find the requested web page. Response code 5XX denotes a class of HTTP response codes. For example, response code 500 indicates that the server encountered an internal error and cannot complete the request; response code 502 indicates that the server, acting as a gateway or proxy, received an invalid response from an upstream server; response code 503 indicates that the server is currently unavailable; response code 504 indicates that the server, acting as a gateway or proxy, did not receive a timely response from an upstream server; response code 505 indicates that the server does not support the HTTP protocol version used in the request.

In some embodiments, the preset characteristic includes HTTP access data with an HTTP response code of 404 or 5XX; screening out the access data having the preset characteristic from the access data set comprises: extracting the HTTP response code of each piece of access data in the access data set; and deleting the access data with an HTTP response code of 404 or 5XX from the access data set.
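The screening by preset characteristics can be sketched in Python as follows. The list of file extensions in STATIC_EXTENSIONS, the dictionary field names and the function names are assumptions for illustration; the present application names only pictures, CSS pages and JS pages as examples of static files.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed extensions for pictures, CSS pages and JS pages.
STATIC_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".ico", ".css", ".js"}


def has_preset_characteristic(record):
    """True for 404 or 5XX responses and for requests whose object is a static file."""
    status = record["status"]
    if status == 404 or 500 <= status <= 599:
        return True
    path = urlsplit(record["url"]).path  # drops the query string, if any
    return posixpath.splitext(path)[1].lower() in STATIC_EXTENSIONS


def screen_out(access_data_set):
    """Return the access data set without the records having a preset characteristic."""
    return [r for r in access_data_set if not has_preset_characteristic(r)]


access_data_set = [
    {"url": "/AdminCP/SubSystem/PanelMenuWeb.jsp", "status": 200},
    {"url": "/AdminCP/generate/tpl_89143.jsp", "status": 404},
    {"url": "/static/app.js?v=3", "status": 200},
    {"url": "/index.jsp", "status": 503},
]
kept = screen_out(access_data_set)
print([r["url"] for r in kept])  # prints "['/AdminCP/SubSystem/PanelMenuWeb.jsp']"
```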

In this embodiment, the access log of the website is deduplicated by access source IP address and access URL: when the same IP address accesses the same URL, only one access log entry per unit time is recorded in the normal traffic sample library of the website, so the sample library can be built automatically from a large volume of historical access logs. Massive amounts of normal traffic data of the website can thus be extracted automatically, so that whichever machine learning model is trained, the heavy workload of manual labeling is no longer a concern.

The present application is described and illustrated below by means of a preferred embodiment. In this embodiment, several pieces of access data of the website "www.test.com" are taken as an example. The unit time is set to 24 hours according to an empirical value, and the threshold for the number of training samples is 100,000 pieces of access data.

Fig. 2 is a flowchart of an automated extraction method of a machine learning sample set according to a preferred embodiment of the present application, and as shown in fig. 2, the flowchart includes the following steps:

Step S201, the access log is traversed and basic data is extracted.

Four pieces of access data are input in total. The source IP address, request URL, HTTP response code and access time extracted by traversing each piece of access data are as follows:

Source IP	Request URL	Status code	Access time
183.149.70.160	/AdminCP/SubSystem/PanelMenuWeb.jsp	200	03/Aug/2016:00:11:23
115.202.252.19	/AdminCP/SubSystem/PanelMenuWeb.jsp	200	03/Aug/2016:00:11:23
115.202.252.19	/AdminCP/SubSystem/PanelMenuWeb.jsp	200	03/Aug/2016:00:11:30
127.0.0.1	/AdminCP/generate/tpl_89143.jsp	404	03/Aug/2016:00:11:46

Step S202, the access data meeting the conditions is selected through screening and calculation.

The access data that does not meet the conditions is screened out. The 4th access log entry is removed because its status code is 404; the 3rd entry is removed because it has the same access source IP address and request URL as the 2nd entry, and the two requests are only 7 seconds apart (00:11:23 and 00:11:30), which is less than the unit time of 24 hours. After this operation, the following 2 samples meet the conditions:

Source IP	Request URL	Status code	Access time
183.149.70.160	/AdminCP/SubSystem/PanelMenuWeb.jsp	200	03/Aug/2016:00:11:23
115.202.252.19	/AdminCP/SubSystem/PanelMenuWeb.jsp	200	03/Aug/2016:00:11:23

Step S203, a normal traffic sample library is generated.

After the above operations, 2 pieces of access data qualify for entry into the normal traffic sample library. Since this number is only 2 and the threshold for the number of training samples (100,000) has not been reached, the access log must continue to be traversed until the threshold is reached.
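The preferred embodiment can be reproduced with the short Python sketch below, which applies steps S201 to S203 to the four access log entries listed above. The variable names, the timestamp layout in TIME_FORMAT and the label value are assumptions for illustration; all four entries fall within a single 24-hour unit time, so no division into several access data sets is needed here.

```python
from datetime import datetime, timedelta

UNIT_TIME = timedelta(hours=24)
SAMPLE_THRESHOLD = 100_000
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S"  # assumed layout of the access time field

# S201: basic data extracted from the access log (source IP, URL, status, time)
ACCESS_LOG = [
    ("183.149.70.160", "/AdminCP/SubSystem/PanelMenuWeb.jsp", 200, "03/Aug/2016:00:11:23"),
    ("115.202.252.19", "/AdminCP/SubSystem/PanelMenuWeb.jsp", 200, "03/Aug/2016:00:11:23"),
    ("115.202.252.19", "/AdminCP/SubSystem/PanelMenuWeb.jsp", 200, "03/Aug/2016:00:11:30"),
    ("127.0.0.1", "/AdminCP/generate/tpl_89143.jsp", 404, "03/Aug/2016:00:11:46"),
]

times = [datetime.strptime(row[3], TIME_FORMAT) for row in ACCESS_LOG]
same_unit_time = max(times) - min(times) < UNIT_TIME  # all rows share one unit time

samples, seen = [], set()
for source_ip, url, status, access_time in ACCESS_LOG:
    if status == 404 or 500 <= status <= 599:
        continue  # S202: screen out 404 and 5XX responses (4th entry)
    if (source_ip, url) in seen:
        continue  # S202: same IP and URL already kept in this unit time (3rd entry)
    seen.add((source_ip, url))
    # S203: store the record as a positive sample
    samples.append({"source_ip": source_ip, "url": url, "time": access_time, "label": "normal"})

needs_more_data = len(samples) < SAMPLE_THRESHOLD
print(len(samples), needs_more_data)  # prints "2 True"
```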

In addition, the automatic extraction method of the machine learning sample set described in conjunction with fig. 1 in the embodiment of the present application may be implemented by a computer device. Fig. 3 is a hardware structure diagram of a computer device according to an embodiment of the present application.

The computer device may comprise a processor 31 and a memory 32 in which computer program instructions are stored.

Specifically, the processor 31 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement the embodiments of the present application.

The memory 32 may include mass storage for data or instructions. By way of example, and not limitation, the memory 32 may include a Hard Disk Drive (HDD), a floppy disk drive, a Solid State Drive (SSD), flash memory, an optical disk, a magneto-optical disk, magnetic tape, or a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 32 may include removable or non-removable (or fixed) media, where appropriate. The memory 32 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 32 is a non-volatile memory. In certain embodiments, the memory 32 includes Read-Only Memory (ROM) and Random Access Memory (RAM). Where appropriate, the ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these. Where appropriate, the RAM may be Static Random-Access Memory (SRAM) or Dynamic Random-Access Memory (DRAM), where the DRAM may be Fast Page Mode DRAM (FPM DRAM), Extended Data Out DRAM (EDO DRAM), Synchronous DRAM (SDRAM), and the like.

The memory 32 may be used to store or cache various data files to be processed and/or communicated, as well as computer program instructions executed by the processor 31.

The processor 31 may implement any of the above-described embodiments of the automated extraction method of a machine-learned sample set by reading and executing computer program instructions stored in the memory 32.

In some of these embodiments, the computer device may also include a communication interface 33 and a bus 30. As shown in fig. 3, the processor 31, the memory 32, and the communication interface 33 are connected via the bus 30 to complete mutual communication.

The communication interface 33 is used for implementing communication between modules, devices, units and/or equipment in the embodiments of the present application. The communication interface 33 may also carry out data communication with other components, such as external devices, image/data acquisition devices, databases, external storage, and image/data processing workstations.

The bus 30 comprises hardware, software, or both, coupling the components of the computer device to each other. The bus 30 includes, but is not limited to, at least one of: a data bus, an address bus, a control bus, an expansion bus, and a local bus. By way of example, and not limitation, the bus 30 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front-Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable bus, or a combination of two or more of these. The bus 30 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.

In addition, in combination with the automated extraction method of the machine learning sample set in the foregoing embodiments, an embodiment of the present application may provide a computer-readable storage medium for implementing the method. The computer-readable storage medium stores computer program instructions; when executed by a processor, the computer program instructions implement the automated extraction method of a machine learning sample set of any of the above embodiments.

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. An automatic extraction method of a machine learning sample set is characterized by comprising the following steps:

extracting an access data set from a website access log according to unit time;

extracting target access data with different source IP addresses and request URLs in each unit time from the access data set;

and marking the target access data as normal access data and storing the normal access data in a machine learning sample set.

2. The method of claim 1, wherein extracting access data sets from the website access log by unit time comprises:

and extracting the access data in one unit time from the website access log, and taking the access data in one unit time as the access data set.

3. The method of claim 1, wherein extracting access data sets from the website access log by unit time comprises:

and extracting access data from the website access log, and segmenting the access data according to the unit time to obtain a plurality of access data sets.

4. The method of claim 1, wherein after extracting the access data set from the website access log in units of time, the method further comprises:

and screening out the access data with preset characteristics from the access data set.

5. The method of claim 4, wherein the preset characteristic includes HTTP access data with an HTTP response code of 404 or 5XX; screening out access data having the preset characteristic from the access data set comprises:

extracting an HTTP response code of the access data in the access data set;

the access data with the HTTP response code of 404 or 5XX is deleted from the access data set.

6. The method of claim 4, wherein the preset characteristic comprises that the access object is a static file, wherein the static file comprises at least one of the following: pictures, CSS pages, JS pages.

7. The method according to any one of claims 1 to 6, wherein the unit time is 8 hours, 12 hours or 24 hours.

8. The method of any of claims 1-6, wherein after marking the target access data as normal access data and storing in a machine learning sample set, the method further comprises:

judging whether the number of samples in the machine learning sample set reaches a preset number threshold value or not;

under the condition that the number of the samples in the machine learning sample set is judged to be less than a preset number threshold value, continuously extracting a new access data set from the website access log according to the unit time, and extracting new target access data with different source IP addresses and request URLs in each unit time from the new access data set; and marking the new target access data as normal access data and storing the normal access data in the machine learning sample set.

9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor when executing the computer program implements the method of automated extraction of a set of machine learning samples of any one of claims 1 to 8.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of automated extraction of a set of machine learning samples according to any one of claims 1 to 8.