CN112839010B - Method, system, device and medium for marking samples - Google Patents


Publication number
CN112839010B
Authority
CN
China
Prior art keywords
pattern
url
address
parameter
dangerous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911158382.XA
Other languages
Chinese (zh)
Other versions
CN112839010A (en)
Inventor
潘廷珅
丛磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shuan Xinyun Information Technology Co ltd
Original Assignee
Beijing Shuan Xinyun Information Technology Co ltd

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method, a system, a device and a medium for marking samples. The method comprises the following steps: acquiring, for each url_pattern, the page browsing amount within a first preset time period and the number of de-duplicated IP addresses accessing the url_pattern, so as to determine dangerous url_patterns; and acquiring a time parameter and a page browsing parameter of each IP address accessing each dangerous url_pattern within a second preset time period, thereby determining the IP addresses with abnormal access and marking them as positive samples. The method marks positive samples with high accuracy, reduced labor cost, high efficiency and good universality, and makes it convenient to screen positive sample data for a machine learning model.

Description

Method, system, device and medium for marking samples
Technical Field
The present invention relates to the field of web network security, and in particular, to a method, system, device, and medium for marking a sample.
Background
With the continuous development of network applications, the internet plays an increasingly important role in people's daily work and life. The continuous development of internet technology also introduces more security risks into the network. Malicious access from malicious IP addresses can easily paralyze a network server, seriously degrade the service quality of a network service provider, and in turn affect its users.
To prevent malicious access from malicious IP addresses, the prior art uses network anomaly visitor detection techniques to identify malicious IP addresses. That is, data mining is performed on Web logs: users' historical Web access logs are used for modeling, a profile of each user is constructed, abnormal user behaviors are analyzed from the Web logs with a machine learning model, and malicious IP addresses are determined.
Machine learning models operate in either a supervised learning mode or an unsupervised learning mode. Because of its inherent limitations, unsupervised learning suffers from low accuracy and poor interpretability, and cannot accurately identify malicious IP addresses. To accurately identify a malicious IP address among a plurality of IP addresses, a machine learning model needs to be trained in a supervised learning mode.
The labeled samples used in the supervised learning mode of existing machine learning models are determined by technical experts through manual marking. The sample marking process therefore has high labor cost and low efficiency, and depends heavily on the experience of the person doing the marking. Because the samples are marked manually, the supervised learning mode also suffers from high maintenance cost and poor universality.
Disclosure of Invention
In order to solve the technical problems, the invention provides a method, a system, equipment and a medium for marking a sample.
The method for marking samples provided by the invention comprises the following steps: acquiring, for each url_pattern, the page browsing amount within a first preset time period and the number of de-duplicated IP addresses accessing the url_pattern;
determining dangerous url_patterns according to the page browsing amount and the number of de-duplicated IP addresses corresponding to each url_pattern;
acquiring a time parameter and a page browsing parameter of each IP address accessing each dangerous url_pattern within a second preset time period;
and determining the IP addresses with abnormal access according to the time parameter and the page browsing parameter corresponding to each dangerous url_pattern, and marking the IP addresses with abnormal access as positive samples.
The method also has the following characteristic: determining the dangerous url_patterns according to the page browsing amount and the number of de-duplicated IP addresses corresponding to each url_pattern includes:
calculating an attacked parameter of each url_pattern according to the page browsing amount and the number of de-duplicated IP addresses corresponding to that url_pattern;
and determining the dangerous url_patterns according to the attacked parameters of the url_patterns.
The method also has the following characteristic: calculating the attacked parameter of each url_pattern according to the page browsing amount and the number of de-duplicated IP addresses corresponding to that url_pattern includes:
calculating the ratio of the page browsing amount corresponding to the url_pattern to the number of de-duplicated IP addresses corresponding to the url_pattern, and taking the ratio as the attacked parameter of the url_pattern.
The method also has the following characteristic: determining the dangerous url_patterns according to the attacked parameters of the url_patterns includes:
sorting the calculated attacked parameters of all url_patterns in descending order, and determining the url_patterns whose attacked parameters rank in the top N as the dangerous url_patterns;
or,
determining, among the calculated attacked parameters of all url_patterns, the url_patterns whose attacked parameter is greater than or equal to a first preset value as the dangerous url_patterns.
The method also has the following characteristic: determining the IP addresses with abnormal access according to the time parameter and the page browsing parameter corresponding to each dangerous url_pattern includes:
determining an IP address whose time parameter is greater than or equal to a second preset value and whose page browsing parameter is greater than or equal to a third preset value as an IP address with abnormal access.
The method also has the following characteristic: the time parameter of an IP address is the ratio of the duration for which the IP address accesses the dangerous url_pattern to the total duration of all access operations performed by the IP address within the second preset time period;
and/or,
the page browsing parameter of an IP address is the ratio of the number of times the IP address accesses the dangerous url_pattern to the total number of page views performed by the IP address within the second preset time period.
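Written as formulas, the three ratios and the marking condition described above read as follows. The symbols are introduced here only for illustration and do not appear in the original text: $\mathrm{PV}(u)$ is the page browsing amount of url_pattern $u$, $N_{\mathrm{IP}}(u)$ is its number of de-duplicated IP addresses, $t_{i,u}$ and $n_{i,u}$ are the access duration and access count of IP address $i$ on dangerous url_pattern $u$, $t_i$ and $n_i$ are the totals for that IP address over the second preset time period, and $\theta_2$, $\theta_3$ are the second and third preset values.

```latex
\begin{align*}
A(u)   &= \frac{\mathrm{PV}(u)}{N_{\mathrm{IP}}(u)}
        && \text{(attacked parameter)} \\
T(i,u) &= \frac{t_{i,u}}{t_i}
        && \text{(time parameter)} \\
V(i,u) &= \frac{n_{i,u}}{n_i}
        && \text{(page browsing parameter)} \\
i \text{ is marked positive} &\iff T(i,u) \ge \theta_2 \ \text{and}\ V(i,u) \ge \theta_3
\end{align*}
```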
The present invention provides a system for marking samples, comprising: an acquisition unit, configured to acquire, for each url_pattern, the page browsing amount within a first preset time period and the number of de-duplicated IP addresses accessing the url_pattern;
a computing unit, configured to determine dangerous url_patterns according to the page browsing amount and the number of de-duplicated IP addresses corresponding to each url_pattern;
the acquisition unit is further configured to acquire a time parameter and a page browsing parameter of each IP address accessing each dangerous url_pattern within a second preset time period;
the computing unit is further configured to determine the IP addresses with abnormal access according to the time parameter and the page browsing parameter corresponding to each dangerous url_pattern;
and a marking unit, configured to mark the IP addresses with abnormal access as positive samples.
The system also has the following characteristic: the computing unit is further configured to calculate an attacked parameter of each url_pattern according to the page browsing amount and the number of de-duplicated IP addresses corresponding to that url_pattern;
and to determine the dangerous url_patterns according to the attacked parameters of the url_patterns.
The system also has the following characteristic: the computing unit is further configured to calculate the ratio of the page browsing amount corresponding to the url_pattern to the number of de-duplicated IP addresses corresponding to the url_pattern, and to use the ratio as the attacked parameter of the url_pattern.
The system also has the following characteristic: the computing unit is further configured to sort the calculated attacked parameters of all url_patterns in descending order, and to determine the url_patterns whose attacked parameters rank in the top N as the dangerous url_patterns;
or,
to determine, among the calculated attacked parameters of all url_patterns, the url_patterns whose attacked parameter is greater than or equal to a first preset value as the dangerous url_patterns.
The system also has the following characteristic: the computing unit is further configured to determine an IP address whose time parameter is greater than or equal to a second preset value and whose page browsing parameter is greater than or equal to a third preset value as an IP address with abnormal access.
The system also has the following characteristic: the time parameter of an IP address is the ratio of the duration for which the IP address accesses the dangerous url_pattern to the total duration of all access operations performed by the IP address within the second preset time period;
and/or,
the page browsing parameter of an IP address is the ratio of the number of times the IP address accesses the dangerous url_pattern to the total number of page views performed by the IP address within the second preset time period.
The transmission device provided by the invention comprises: a transceiver, a memory, a processor;
the transceiver is used for receiving and transmitting messages;
the memory is used for storing instructions and data;
the processor is configured to read the instructions and data stored in the memory to perform the method of marking samples as described above.
The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method of marking a sample as described above.
With the method for marking samples of the invention, the IP addresses with abnormal access are identified by calculation from the page browsing amount of each url_pattern and the number of de-duplicated IP addresses accessing it, and are marked as positive samples, which ensures the accuracy of the positive samples. The method marks positive samples with high accuracy, little influence from human factors, high efficiency and good universality, and makes it convenient to screen positive sample data for a machine learning model.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a first flowchart of a method of marking a sample in an exemplary embodiment;
FIG. 2 is a second flowchart of a method of marking a sample in an exemplary embodiment;
FIG. 3 is a third flowchart of a method of marking a sample in an exemplary embodiment;
FIG. 4 is a fourth flowchart of a method of marking a sample in an exemplary embodiment;
FIG. 5 is a fifth flowchart of a method of marking a sample in an exemplary embodiment;
FIG. 6 is a schematic diagram of a system for marking a sample in an exemplary embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be arbitrarily combined with each other.
The application provides a method for marking samples, which is used to mark the positive samples used by a machine learning model in the web security field. That is, the IP addresses with abnormal access are found among the plurality of IP addresses accessing a domain name and are marked as positive samples, so that positive sample data can be screened for the machine learning model. The method automates the marking of positive samples: no manual participation is required, marking errors caused by insufficient human experience are avoided, and both marking efficiency and marking accuracy are improved.
As shown in fig. 1, in an exemplary embodiment, a method of marking a sample in the web security domain includes:
S1, acquiring, for each url_pattern, the page browsing amount within a first preset time period and the number of de-duplicated IP addresses accessing the url_pattern;
S2, determining dangerous url_patterns according to the page browsing amount and the number of de-duplicated IP addresses corresponding to each url_pattern;
S3, acquiring a time parameter and a page browsing parameter of each IP address accessing each dangerous url_pattern within a second preset time period;
S4, determining the IP addresses with abnormal access according to the time parameter and the page browsing parameter corresponding to each dangerous url_pattern, and marking the IP addresses with abnormal access as positive samples.
Steps S1 to S4 all belong to the preprocessing stage of the supervised learning mode of a machine learning model in the web security field. Marking the positive samples required by the model with this method during preprocessing allows them to be obtained quickly and accurately. It should be noted that the supervised learning mode of a machine learning model in the web security field requires both positive and negative samples. In one application scenario, for example, a machine learning model is built to identify malicious IP addresses during domain name access; in the supervised learning mode of this model, a positive sample is an IP address that has been determined to be malicious, and a negative sample is a normal IP address. The method for marking samples in the invention identifies, from the many IP addresses accessing a domain name, the IP addresses with abnormal behavior, namely the malicious IP addresses, and takes them as positive samples.
The url_pattern used in steps S1 to S4 refers to a family of access paths described with wildcards; it can also be understood as a wildcard expression that matches similar URLs. For example, the two URLs www.hello/1.Com and www.hello/2.Com can both be represented by a single pattern that begins with www.hello/ and replaces the differing part with a wildcard, so both URLs are classified under one url_pattern when they are used to access the domain name.
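A minimal sketch of how URLs might be grouped under wildcard patterns is given below. The pattern list, the `*` wildcard syntax and the helper name `to_url_pattern` are assumptions made for illustration; the text does not specify how patterns are written or matched.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern list; the "*" syntax is an assumption, not from the source.
URL_PATTERNS = ["www.hello/*", "www.shop/item/*"]


def to_url_pattern(url, patterns=URL_PATTERNS):
    """Return the first wildcard pattern that matches url, else the url itself."""
    for pattern in patterns:
        if fnmatchcase(url, pattern):
            return pattern
    # No pattern matched: treat the URL as its own one-element pattern.
    return url


urls = ["www.hello/1.Com", "www.hello/2.Com", "www.other/x"]
grouped = [to_url_pattern(u) for u in urls]
```

Here the first two URLs fall under the same hypothetical pattern, while the third is left as its own group.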
The number of de-duplicated IP addresses accessing a url_pattern is the number of distinct IP addresses that accessed it. For example, if one IP address accesses the same url_pattern 100 times, the number of de-duplicated IP addresses is 1. As another example, if a first IP address accesses a domain name 20 times and a second IP address accesses the same domain name 80 times, the number of de-duplicated IP addresses is 2. A dangerous url_pattern can be understood as a url_pattern that is vulnerable to attack in practice, namely one that is accessed heavily and over a long time by a small number of visitors. Such an access profile indicates that the url_pattern is under attack, and focusing on it allows malicious IP addresses to be determined more accurately.
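The two counts gathered in step S1 can be sketched as a single pass over access records. The record shape `(ip, url_pattern)`, the function name `aggregate` and the sample addresses are assumptions for illustration; the data reproduces the example of 20 and 80 accesses from two IP addresses.

```python
from collections import defaultdict


def aggregate(records):
    """Count page views and distinct IPs per url_pattern.

    records: iterable of (ip, url_pattern) pairs, one pair per page view
    (an assumed log format). Returns {url_pattern: (page_views, distinct_ips)}.
    """
    page_views = defaultdict(int)
    ips = defaultdict(set)
    for ip, url_pattern in records:
        page_views[url_pattern] += 1
        ips[url_pattern].add(ip)  # a set de-duplicates repeated visits
    return {p: (page_views[p], len(ips[p])) for p in page_views}


# First IP accesses 20 times, second IP accesses 80 times.
records = [("10.0.0.1", "/login/*")] * 20 + [("10.0.0.2", "/login/*")] * 80
stats = aggregate(records)
```

In this sketch `stats["/login/*"]` holds 100 page views and 2 de-duplicated IP addresses.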
In step S1, the duration of the first preset time period is set according to specific requirements and is not specifically limited in this embodiment; for example, it may be 1 hour or 1 day. In step S3, the duration of the second preset time period is likewise set according to specific requirements and is not specifically limited in this embodiment; for example, it may be 10 minutes or half an hour.
In step S3, the time parameter of an IP address is the ratio of the duration for which the IP address accesses the dangerous url_pattern to the total duration of all access operations performed by the IP address within the second preset time period. For example, when the second preset time period is 24 hours, and within those 24 hours one IP address accesses the dangerous url_pattern for 100 minutes while all of its access operations (including access to the dangerous url_pattern and to all other URLs) last 400 minutes, the time parameter of the IP address is 100 divided by 400, i.e. 0.25. The page browsing parameter of an IP address is the ratio of the number of times the IP address accesses the dangerous url_pattern to the total number of page views performed by the IP address within the second preset time period. For example, when the second preset time period is half an hour, and within that half hour one IP address accesses the dangerous url_pattern 200 times while it performs 400 access operations in total (including access to the dangerous url_pattern and to all other URLs), the page browsing parameter of the IP address is 200 divided by 400, i.e. 0.5.
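The two ratios from step S3 can be sketched as follows, using the figures from the worked examples (100 of 400 minutes, 200 of 400 accesses). The function names and the choice to return 0.0 for an IP address with no recorded activity are assumptions; the text does not say how that case is handled.

```python
def safe_ratio(part, total):
    """Return part / total, or 0.0 when total is zero (assumed convention)."""
    return part / total if total else 0.0


def time_parameter(minutes_on_dangerous, minutes_total):
    """Share of an IP's access time spent on one dangerous url_pattern."""
    return safe_ratio(minutes_on_dangerous, minutes_total)


def page_browsing_parameter(views_of_dangerous, views_total):
    """Share of an IP's page views that hit one dangerous url_pattern."""
    return safe_ratio(views_of_dangerous, views_total)


t = time_parameter(100, 400)           # 0.25
v = page_browsing_parameter(200, 400)  # 0.5
```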
As shown in fig. 2, in one exemplary embodiment, a method of marking a sample includes:
S1, acquiring, for each url_pattern, the page browsing amount within a first preset time period and the number of de-duplicated IP addresses accessing the url_pattern;
S21, calculating an attacked parameter of each url_pattern according to the page browsing amount and the number of de-duplicated IP addresses corresponding to that url_pattern;
S22, determining dangerous url_patterns according to the attacked parameters of the url_patterns;
S3, acquiring a time parameter and a page browsing parameter of each IP address accessing each dangerous url_pattern within a second preset time period;
S4, determining the IP addresses with abnormal access according to the time parameter and the page browsing parameter corresponding to each dangerous url_pattern, and marking the IP addresses with abnormal access as positive samples.
The attacked parameter of a url_pattern calculated in step S21 is a specific value or other quantifiable data. In step S22, this quantifiable data is compared with an evaluation value to decide whether the url_pattern is a dangerous url_pattern, that is, a url_pattern vulnerable to attack.
As shown in fig. 3, in one exemplary embodiment, a method of marking a sample includes:
S1, acquiring, for each url_pattern, the page browsing amount within a first preset time period and the number of de-duplicated IP addresses accessing the url_pattern;
S211, calculating the ratio of the page browsing amount corresponding to the url_pattern to the number of de-duplicated IP addresses corresponding to the url_pattern, and taking the ratio as the attacked parameter of the url_pattern;
S221, sorting the calculated attacked parameters of all url_patterns in descending order, and determining the url_patterns whose attacked parameters rank in the top N as dangerous url_patterns;
S3, acquiring a time parameter and a page browsing parameter of each IP address accessing each dangerous url_pattern within a second preset time period;
S4, determining the IP addresses with abnormal access according to the time parameter and the page browsing parameter corresponding to each dangerous url_pattern, and marking the IP addresses with abnormal access as positive samples.
In step S211, the attacked parameter of a url_pattern is calculated as follows. For example, if the page browsing amount corresponding to the url_pattern is 1000 and the number of de-duplicated IP addresses corresponding to the url_pattern is 10, the ratio is 1000 divided by 10, so the attacked parameter of the url_pattern is 100. As another example, if the page browsing amount corresponding to the url_pattern is 1000 and the number of de-duplicated IP addresses is 500, the ratio is 1000 divided by 500, so the attacked parameter of the url_pattern is 2.
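The ratio in step S211 can be sketched in a few lines, reproducing the two worked examples (1000 views from 10 IP addresses, and 1000 views from 500 IP addresses). The function name and the decision to raise an error for a non-positive IP count are assumptions for illustration.

```python
def attacked_parameter(page_views, distinct_ips):
    """Average number of page views per de-duplicated IP for one url_pattern."""
    if distinct_ips <= 0:
        # Assumed guard: a url_pattern with views always has at least one IP.
        raise ValueError("distinct_ips must be positive")
    return page_views / distinct_ips


concentrated = attacked_parameter(1000, 10)   # 100.0: few IPs, many views
spread_out = attacked_parameter(1000, 500)    # 2.0: many IPs, few views each
```

A high value means the traffic to the url_pattern is concentrated in a small number of IP addresses, which is the profile described earlier for a dangerous url_pattern.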
Different criteria can be adopted according to the actual situation to judge, from the attacked parameters, which url_patterns are dangerous. In this embodiment, step S221 sorts the calculated attacked parameters of all url_patterns in descending order and determines the url_patterns whose attacked parameters rank in the top N as dangerous url_patterns. For example, suppose the attacked parameters of 10 url_patterns are calculated as 3, 8, 7, 9, 10, 20, 15, 2, 1 and 24. Arranged from largest to smallest, they are 24, 20, 15, 10, 9, 8, 7, 3, 2, 1. The value of N is determined by the specific case; for example, N may be 5 or 8. When N is 5, the url_patterns corresponding to the attacked parameters 24, 20, 15, 10 and 9 are dangerous url_patterns, that is, url_patterns vulnerable to attack.
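The top-N selection of step S221 can be sketched with the ten example values above and N = 5. The pattern names `pattern_0` to `pattern_9` and the function name are hypothetical placeholders; tie-breaking between equal values is not specified in the text and here simply follows input order.

```python
def top_n_dangerous(attacked, n):
    """Return the n url_patterns with the largest attacked parameters."""
    ranked = sorted(attacked.items(), key=lambda item: item[1], reverse=True)
    return [pattern for pattern, _ in ranked[:n]]


values = [3, 8, 7, 9, 10, 20, 15, 2, 1, 24]
attacked = {f"pattern_{i}": value for i, value in enumerate(values)}

dangerous = top_n_dangerous(attacked, 5)
dangerous_values = [attacked[p] for p in dangerous]  # [24, 20, 15, 10, 9]
```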
As shown in fig. 4, in one exemplary embodiment, a method of marking a sample includes:
S1, acquiring, for each url_pattern, the page browsing amount within a first preset time period and the number of de-duplicated IP addresses accessing the url_pattern;
S211, calculating the ratio of the page browsing amount corresponding to the url_pattern to the number of de-duplicated IP addresses corresponding to the url_pattern, and taking the ratio as the attacked parameter of the url_pattern;
S222, determining, among the calculated attacked parameters of all url_patterns, the url_patterns whose attacked parameter is greater than or equal to a first preset value as dangerous url_patterns;
S3, acquiring a time parameter and a page browsing parameter of each IP address accessing each dangerous url_pattern within a second preset time period;
S4, determining the IP addresses with abnormal access according to the time parameter and the page browsing parameter corresponding to each dangerous url_pattern, and marking the IP addresses with abnormal access as positive samples.
In step S222, in a specific implementation, suppose the attacked parameters of 10 url_patterns are calculated as 3, 8, 7, 9, 10, 20, 15, 2, 1 and 24. The first preset value may be set according to the specific situation and is not specifically limited in this embodiment; for example, it may be 10 or 6. When the first preset value is 10, the attacked parameters 24, 20, 15 and 10 are greater than or equal to the first preset value. The url_patterns corresponding to these attacked parameters are determined to be dangerous url_patterns, that is, url_patterns vulnerable to attack.
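The threshold variant of step S222 can be sketched with the same ten example values and a first preset value of 10. The pattern names and the function name are hypothetical placeholders used only for illustration.

```python
def dangerous_by_threshold(attacked, first_preset_value):
    """Return url_patterns whose attacked parameter is >= the preset value."""
    return {p for p, value in attacked.items() if value >= first_preset_value}


values = [3, 8, 7, 9, 10, 20, 15, 2, 1, 24]
attacked = {f"pattern_{i}": value for i, value in enumerate(values)}

dangerous = dangerous_by_threshold(attacked, 10)
dangerous_values = sorted(attacked[p] for p in dangerous)  # [10, 15, 20, 24]
```

Unlike the top-N rule, the number of dangerous url_patterns here is not fixed in advance: lowering the preset value to 6 would select seven of the ten patterns.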
As shown in fig. 5, in one exemplary embodiment, a method of marking a sample includes:
S1, acquiring, for each url_pattern, the page browsing amount within a first preset time period and the number of de-duplicated IP addresses accessing the url_pattern;
S211, calculating the ratio of the page browsing amount corresponding to the url_pattern to the number of de-duplicated IP addresses corresponding to the url_pattern, and taking the ratio as the attacked parameter of the url_pattern;
S223, sorting the calculated attacked parameters of all url_patterns in descending order, and determining the url_patterns whose attacked parameters rank in the top N as dangerous url_patterns; or,
determining, among the calculated attacked parameters of all url_patterns, the url_patterns whose attacked parameter is greater than or equal to the first preset value as dangerous url_patterns;
S3, acquiring a time parameter and a page browsing parameter of each IP address accessing each dangerous url_pattern within a second preset time period;
S41, determining an IP address whose time parameter is greater than or equal to a second preset value and whose page browsing parameter is greater than or equal to a third preset value as an IP address with abnormal access, and marking the IP address with abnormal access as a positive sample.
In step S41, the second preset value and the third preset value are determined according to the specific situation and are not limited in this embodiment; for example, the second preset value may be 0.9 and the third preset value may be 0.8. In that case, when the time parameter of an IP address accessing a dangerous url_pattern is greater than or equal to 0.9 and its page browsing parameter is greater than or equal to 0.8, the IP address is determined to be an IP address with abnormal access and may be marked as a positive sample for the subsequent machine learning model.
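The decision rule of step S41 can be sketched as a filter over per-IP parameters, using the example thresholds 0.9 and 0.8. The input shape, the function name and the sample IP addresses with their parameter values are made up for illustration.

```python
def mark_positive_samples(ip_params, second_preset_value=0.9, third_preset_value=0.8):
    """Return the IPs whose two parameters both reach their preset values.

    ip_params: {ip: (time_parameter, page_browsing_parameter)} for one
    dangerous url_pattern (an assumed input shape).
    """
    return {
        ip
        for ip, (time_param, browse_param) in ip_params.items()
        if time_param >= second_preset_value and browse_param >= third_preset_value
    }


ip_params = {
    "10.0.0.1": (0.95, 0.85),  # both conditions met
    "10.0.0.2": (0.95, 0.50),  # page browsing parameter too low
    "10.0.0.3": (0.20, 0.90),  # time parameter too low
    "10.0.0.4": (0.90, 0.80),  # exactly at both thresholds, still included
}
positives = mark_positive_samples(ip_params)
```

Both conditions must hold at once, so an IP address that only spends most of its time on the dangerous url_pattern, or only sends most of its requests there, is not marked.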
The method can accurately mark the malicious IP addresses from a plurality of IP addresses, and marks the malicious IP addresses as positive samples, so that the marking efficiency is high, and the marking accuracy is high. Because the automatic mode is adopted to mark the positive sample, the human participation is reduced, the problem of marking errors caused by the influence of human experiences is greatly reduced, and the labor cost is reduced.
As shown in FIG. 6, the present invention further provides a system for marking samples, for implementing the method for marking samples shown in FIG. 1, the system comprising:
an acquisition unit, configured to acquire, for each url_pattern, the page browsing amount within a first preset time period and the number of de-duplicated IP addresses accessing the url_pattern;
a computing unit, configured to determine dangerous url_patterns according to the page browsing amount and the number of de-duplicated IP addresses corresponding to each url_pattern;
the acquisition unit is further configured to acquire a time parameter and a page browsing parameter of each IP address accessing each dangerous url_pattern within a second preset time period;
the computing unit is further configured to determine the IP addresses with abnormal access according to the time parameter and the page browsing parameter corresponding to each dangerous url_pattern;
and a marking unit, configured to mark the IP addresses with abnormal access as positive samples.
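One way the three units could fit together is sketched below. This is an illustrative arrangement under several assumptions, not the structure shown in FIG. 6: the class and method names, the `Visit` record with a per-visit duration in seconds, the use of the threshold rule for dangerous url_patterns, and the simplification that the same visits serve as both the first and the second preset time period.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Visit:
    ip: str
    url_pattern: str
    seconds: float  # assumed per-visit duration


class AcquisitionUnit:
    def __init__(self, visits):
        self.visits = list(visits)

    def pattern_stats(self):
        """Return {url_pattern: (page_views, distinct_ip_count)}."""
        views, ips = defaultdict(int), defaultdict(set)
        for v in self.visits:
            views[v.url_pattern] += 1
            ips[v.url_pattern].add(v.ip)
        return {p: (views[p], len(ips[p])) for p in views}

    def ip_parameters(self, dangerous):
        """Return {(ip, url_pattern): (time_parameter, page_browsing_parameter)}."""
        total_time, total_views = defaultdict(float), defaultdict(int)
        hit_time, hit_views = defaultdict(float), defaultdict(int)
        for v in self.visits:
            total_time[v.ip] += v.seconds
            total_views[v.ip] += 1
            if v.url_pattern in dangerous:
                hit_time[(v.ip, v.url_pattern)] += v.seconds
                hit_views[(v.ip, v.url_pattern)] += 1
        return {
            key: (
                hit_time[key] / total_time[key[0]] if total_time[key[0]] else 0.0,
                hit_views[key] / total_views[key[0]],
            )
            for key in hit_views
        }


class ComputingUnit:
    def dangerous_patterns(self, stats, first_preset_value):
        return {p for p, (pv, n_ip) in stats.items() if pv / n_ip >= first_preset_value}

    def abnormal_ips(self, params, second_preset_value, third_preset_value):
        return {
            ip
            for (ip, _), (t, v) in params.items()
            if t >= second_preset_value and v >= third_preset_value
        }


class MarkingUnit:
    def mark(self, ips):
        return {ip: 1 for ip in ips}  # 1 denotes a positive sample


visits = [Visit("10.0.0.1", "/login/*", 1.0)] * 50 + [
    Visit("10.0.0.2", "/login/*", 1.0),
    Visit("10.0.0.2", "/home", 9.0),
]
acquisition, computing, marking = AcquisitionUnit(visits), ComputingUnit(), MarkingUnit()
dangerous = computing.dangerous_patterns(acquisition.pattern_stats(), 10)
labels = marking.mark(computing.abnormal_ips(acquisition.ip_parameters(dangerous), 0.9, 0.8))
```

In this made-up data set, only the IP address that spends all of its time and requests on the dangerous url_pattern ends up labelled.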
Further, in executing the method shown in fig. 2, the computing unit of the present invention is further configured to calculate the attacked parameter of each url_pattern according to the page browsing amount corresponding to each url_pattern and the number of IP addresses after duplication removal, and further determine the dangerous url_pattern according to the attacked parameter of the url_pattern.
Further, in the process of executing the method shown in fig. 3, the calculation unit in the present invention is further configured to calculate the ratio between the page view amount corresponding to a url_pattern and the number of de-duplicated IP addresses corresponding to that url_pattern, and to use the ratio as the attacked parameter of the url_pattern. The calculation unit is also configured to sort the calculated attacked parameters of all url_patterns in descending order of value, and to determine the url_patterns whose attacked parameters rank in the top N as dangerous url_patterns.
When executing step S222 of the method in fig. 4, the calculation unit in the present invention is further configured to determine, among all the calculated attacked parameters of the url_patterns, each url_pattern whose attacked parameter is greater than or equal to the first preset value as a dangerous url_pattern.
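The attacked parameter and the two alternative selection rules (top N, or a first preset value) can be sketched as follows. The function names, the sample counts, and the values N = 2 and 100.0 are hypothetical and chosen only to make the example concrete.

```python
from typing import Dict, List, Set, Tuple


def attacked_parameters(stats: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """stats maps url_pattern -> (page_views, de-duplicated IP count)."""
    return {
        pattern: page_views / unique_ips
        for pattern, (page_views, unique_ips) in stats.items()
        if unique_ips > 0  # a pattern with no visitors has no defined ratio
    }


def top_n_dangerous(params: Dict[str, float], n: int) -> List[str]:
    """Rule 1: the url_patterns whose attacked parameters rank in the top N."""
    ranked = sorted(params, key=lambda p: params[p], reverse=True)
    return ranked[:n]


def threshold_dangerous(params: Dict[str, float], first_preset: float) -> Set[str]:
    """Rule 2: the url_patterns whose attacked parameter >= the first preset value."""
    return {p for p, value in params.items() if value >= first_preset}


# Hypothetical counts for a first preset time period.
stats = {
    "/login": (5000, 10),     # ratio 500.0: few IPs, very many requests
    "/search": (1200, 400),   # ratio 3.0
    "/home": (900, 600),      # ratio 1.5
    "/api/pay": (800, 4),     # ratio 200.0
}
params = attacked_parameters(stats)
by_rank = top_n_dangerous(params, n=2)
by_threshold = threshold_dangerous(params, first_preset=100.0)
```

A high ratio means a small number of distinct IP addresses account for a large number of page views, which is why both rules select the same two url_patterns in this example.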
Further, in executing the method step S41 in fig. 5, the calculating unit in the present invention is further configured to determine, as the IP address where the abnormal access occurs, the IP address where the time parameter is greater than or equal to the second preset value and the page view parameter is greater than or equal to the third preset value.
In addition, the invention also discloses a transmission device, which comprises: a transceiver, a memory, a processor; the transceiver is used for receiving and transmitting the message; the memory is used for storing instructions and data; the processor is used for reading the instructions and data stored in the memory to perform the method of marking the sample.
The invention also discloses a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the method for marking the sample is realized when the program is executed by a processor.
The embodiments described above may be implemented alone or in various combinations, and all such variations are within the scope of the present invention.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the methods described above may be implemented by a program that instructs associated hardware, and the program may be stored on a computer readable storage medium such as a read-only memory, a magnetic or optical disk, etc. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits, and accordingly, each module/unit in the above embodiments may be implemented in hardware or may be implemented in a software functional module. The present invention is not limited to any specific form of combination of hardware and software.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of additional identical elements in an article or apparatus that comprises the element.
The above embodiments are only for illustrating the technical scheme of the present invention, not for limiting the same, and the present invention is described in detail with reference to the preferred embodiments. It will be understood by those skilled in the art that various modifications and equivalent substitutions may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention, and the present invention is intended to be covered by the scope of the appended claims.

Claims (12)

1. A method for marking a sample, characterized by comprising:
acquiring the page browsing quantity of each url_pattern in a first preset time period, and the number of de-duplicated IP addresses accessing the url_pattern;
determining dangerous url_pattern according to the page browsing amount corresponding to each url_pattern and the number of the IP addresses after the duplication removal;
acquiring a time parameter and a page browsing parameter of each IP address accessing each dangerous url_pattern in a second preset time period;
determining an IP address with abnormal access according to the time parameter and the page browsing parameter corresponding to each dangerous url_pattern, and marking the IP address with abnormal access as a positive sample;
the determining, according to the time parameter and the page browsing parameter corresponding to each dangerous url_pattern, the IP address where the abnormal access occurs includes:
determining the IP address with the time parameter being greater than or equal to a second preset value and the page browsing parameter being greater than or equal to a third preset value as the IP address with abnormal access;
wherein the time parameter of the IP address is the ratio between the duration for which the IP address accesses the dangerous url_pattern and the duration of all access operations performed by the IP address, and the page browsing parameter of the IP address is the ratio between the number of times the IP address accesses the dangerous url_pattern and the number of times the IP address performs all page browsing.
2. The method of marking samples as claimed in claim 1, wherein said determining dangerous url_pattern based on said page view amount and said number of de-duplicated IP addresses corresponding to each url_pattern comprises:
calculating the attacked parameter of each url_pattern according to the page browsing amount corresponding to each url_pattern and the number of the IP addresses after the duplication removal;
and determining the dangerous url_pattern according to the attacked parameters of the url_pattern.
3. The method of marking samples as claimed in claim 2, wherein said calculating an attacked parameter of each url_pattern based on the page view amount and the number of IP addresses after de-duplication corresponding to each url_pattern comprises:
and calculating the ratio between the page browsing amount corresponding to the url_pattern and the number of the de-duplicated IP addresses corresponding to the url_pattern, and taking the ratio as the attacked parameter of the url_pattern.
4. The method of marking samples as claimed in claim 2, wherein said determining dangerous url_pattern based on said url_pattern's attacked parameters comprises:
sorting all the calculated attacked parameters of the url_patterns in descending order of value, and determining the url_patterns whose attacked parameters rank in the top N as the dangerous url_patterns;
or,
determining, among all the calculated attacked parameters of the url_patterns, each url_pattern whose attacked parameter is greater than or equal to a first preset value as a dangerous url_pattern.
5. The method of marking samples according to any one of claims 1 to 4, wherein the time parameter of the IP address is a ratio between a duration of the IP address accessing the dangerous url_pattern and a duration of the IP address performing all access operation within a second preset period of time;
and/or,
and the page browsing parameter of the IP address is the ratio of the number of times the IP address accesses the dangerous url_pattern to the number of times the IP address performs all page browsing within a second preset time period.
6. A system for marking a sample, the system comprising:
an acquisition unit, configured to acquire the page browsing quantity of each url_pattern in a first preset time period and the number of de-duplicated IP addresses accessing the url_pattern;
a calculation unit, configured to determine dangerous url_patterns according to the page browsing quantity corresponding to each url_pattern and the number of de-duplicated IP addresses;
the acquisition unit being further configured to acquire a time parameter and a page browsing parameter of each IP address accessing each dangerous url_pattern in a second preset time period, where the time parameter of the IP address is the ratio between the duration for which the IP address accesses the dangerous url_pattern and the duration of all access operations performed by the IP address, and the page browsing parameter of the IP address is the ratio between the number of times the IP address accesses the dangerous url_pattern and the number of times the IP address performs all page browsing;
the computing unit is further configured to determine, according to the time parameter and the page browsing parameter corresponding to each dangerous url_pattern, an IP address where the time parameter is greater than or equal to a second preset value and the page browsing parameter is greater than or equal to a third preset value as an IP address where abnormal access occurs;
and the marking unit is used for marking the IP address with abnormal access as a positive sample.
7. The system for marking samples according to claim 6, wherein
the calculation unit is further configured to calculate an attacked parameter of each url_pattern according to the page browsing amount corresponding to each url_pattern and the number of de-duplicated IP addresses;
and to determine the dangerous url_pattern according to the attacked parameters of the url_pattern.
8. The system for marking samples according to claim 7, wherein the calculating unit is further configured to calculate a ratio between a page view amount corresponding to the url_pattern and a number of IP addresses after de-duplication corresponding to the url_pattern, and use the ratio as an attacked parameter of the url_pattern.
9. The system for marking samples according to claim 7, wherein the calculation unit is further configured to sort all of the calculated attacked parameters of the url_patterns in descending order of value, and to determine the url_patterns whose attacked parameters rank in the top N as the dangerous url_patterns;
or,
the calculation unit is further configured to determine, among all the calculated attacked parameters of the url_patterns, each url_pattern whose attacked parameter is greater than or equal to a first preset value as a dangerous url_pattern.
10. The system for marking samples according to any one of claims 6 to 9, wherein the time parameter of the IP address is a ratio between a duration of access of the IP address to the dangerous url_pattern and a duration of a total access operation of the IP address in a second preset period of time;
and/or,
and the page browsing parameter of the IP address is the ratio of the number of times the IP address accesses the dangerous url_pattern to the number of times the IP address performs all page browsing within a second preset time period.
11. A transmission apparatus, characterized in that the transmission apparatus comprises: a transceiver, a memory, a processor;
the transceiver is used for receiving and transmitting messages;
the memory is used for storing instructions and data;
the processor is configured to read instructions and data stored in the memory to perform the method of marking a sample as claimed in any one of claims 1 to 5.
12. A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements a method of marking a sample according to any of claims 1 to 5.
CN201911158382.XA 2019-11-22 2019-11-22 Method, system, device and medium for marking samples Active CN112839010B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911158382.XA CN112839010B (en) 2019-11-22 2019-11-22 Method, system, device and medium for marking samples

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911158382.XA CN112839010B (en) 2019-11-22 2019-11-22 Method, system, device and medium for marking samples

Publications (2)

Publication Number Publication Date
CN112839010A CN112839010A (en) 2021-05-25
CN112839010B CN112839010B (en) 2023-08-04

Family

ID=75922605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911158382.XA Active CN112839010B (en) 2019-11-22 2019-11-22 Method, system, device and medium for marking samples

Country Status (1)

Country Link
CN (1) CN112839010B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105939361A (en) * 2016-06-23 2016-09-14 杭州迪普科技有限公司 Method and device for defensing CC (Challenge Collapsar) attack
CN108206802A (en) * 2016-12-16 2018-06-26 华为技术有限公司 The method and apparatus for detecting webpage back door
CN109729094A (en) * 2019-01-24 2019-05-07 中国平安人寿保险股份有限公司 Malicious attack detection method, system, computer installation and readable storage medium storing program for executing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI610196B (en) * 2016-12-05 2018-01-01 財團法人資訊工業策進會 Network attack pattern determination apparatus, determination method, and computer program product thereof

Also Published As

Publication number Publication date
CN112839010A (en) 2021-05-25

Similar Documents

Publication Publication Date Title
CN108092962B (en) Malicious URL detection method and device
CN106557695B (en) A kind of malicious application detection method and system
CN112839014B (en) Method, system, equipment and medium for establishing abnormal visitor identification model
US20080301090A1 (en) Detection of abnormal user click activity in a search results page
CN110602045A (en) Malicious webpage identification method based on feature fusion and machine learning
CN107231382A (en) A kind of Cyberthreat method for situation assessment and equipment
KR102047929B1 (en) Method of web site verification
US20110231415A1 (en) Web page searching system and method using access time and frequency
CN113176968A (en) Safety test method, device and storage medium based on interface parameter classification
KR20110037578A (en) The integration security monitoring system and method thereof
CN112839010B (en) Method, system, device and medium for marking samples
CN116361529B (en) Crawler monitoring method and device, electronic equipment and storage medium
CN113296992A (en) Method, device, equipment and storage medium for determining abnormal reason
CN117254983A (en) Method, device, equipment and storage medium for detecting fraud-related websites
CN112598326A (en) Model iteration method and device, electronic equipment and storage medium
CN110808947A (en) Automatic vulnerability quantitative evaluation method and system
CN112182441A (en) Method and device for detecting violation data
CN113822684B (en) Black-birth user identification model training method and device, electronic equipment and storage medium
CN115587017A (en) Data processing method and device, electronic equipment and storage medium
Frasier A note on the use of multiple linear regression in molecular ecology
KR100622129B1 (en) Dynamically changing web page defacement validation system and method
CN110941709B (en) Information screening method and device, electronic equipment and readable storage medium
CN107239704A (en) Malicious web pages find method and device
CN113992390A (en) Phishing website detection method and device and storage medium
CN110134594B (en) Function test method and device for application comprising account name and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant