CN114301713A - Risk access detection model training method, risk access detection method and risk access detection device

Info

Publication number: CN114301713A
Application number: CN202111680480.7A
Authority: CN
Inventors: 曹世伟
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2022-04-08

Abstract

The present disclosure provides a training method for a risk access detection model, a risk access detection method and apparatus, an electronic device, and a computer-readable storage medium, which can be applied to the technical field of information security and to the financial field. The training method for the risk access detection model comprises: acquiring an initial training sample data set, wherein each initial training sample in the initial training sample data set comprises preprocessed initial access behavior information and label information corresponding to the initial access behavior information; processing the initial training sample data set by using a preset sampling method to obtain a training sample data set, wherein each training sample in the training sample data set comprises sampled access behavior information and the label information; training a plurality of candidate risk access detection models respectively by using the training sample data set to obtain a plurality of trained risk access detection models; and determining a target risk access detection model from the plurality of trained risk access detection models.

Description

Risk access detection model training method, risk access detection method and risk access detection device

Technical Field

The present disclosure relates to the field of information security technologies and finance, and more particularly, to a method and an apparatus for training a risk access detection model, a risk access detection method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.

Background

With the development of computer technology and internet technology, network security risks keep rising. Current network security technology mainly relies on firewalls, identity authentication, data encryption and the like.

In implementing the disclosed concept, the inventors found at least the following problems in the related art: because websites differ in their security performance and in the maintenance level of their administrators, and because hacking techniques advance day by day, more and more websites suffer illegal intrusion attacks of varying severity; in addition, existing security technologies cannot meet users' network security requirements in terms of platform compatibility, protocol adaptability, multi-interface support and the like.

Disclosure of Invention

In view of the above, the present disclosure provides a training method of a risk access detection model, a risk access detection method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.

According to a first aspect of the present disclosure, there is provided a training method of a risk access detection model, including:

acquiring an initial training sample data set, wherein an initial training sample in the initial training sample data set comprises preprocessed initial access behavior information and label information corresponding to the initial access behavior information;

processing the initial training sample data set by using a preset sampling method to obtain a training sample data set, wherein training samples in the training sample data set comprise sampled access behavior information and the label information;

respectively training a plurality of candidate risk access detection models by using the training sample data set to obtain a plurality of trained risk access detection models; and

determining a target risk access detection model from the plurality of trained risk access detection models.

According to an embodiment of the present disclosure, the acquiring an initial training sample data set includes:

acquiring a plurality of log texts, wherein each log text in the plurality of log texts comprises original access behavior information of a user, the original access behavior information is used for representing the access behavior of the user, and the original access behavior information comprises data of a plurality of attribute categories;

screening the original access behavior information to obtain access behavior information to be processed, wherein the number of attribute categories in the access behavior information to be processed is smaller than the number of attribute categories in the original access behavior information;

preprocessing the to-be-processed access behavior information to obtain the initial access behavior information, wherein the preprocessing comprises normalization processing; and

generating the training sample data set according to the initial access behavior information, wherein the label information is used for representing whether the access behavior corresponding to the initial access behavior information is a normal access behavior or an illegal access behavior.

According to the embodiment of the present disclosure, the attribute categories include any two or more of an access request mode, an access resource, an accessor query parameter, a server port number, a protocol state, a protocol sub-state, a system state, an access duration, an access date, an access time, a server address, an access user name, an access user address, and a user agent.

According to an embodiment of the present disclosure, the preset sampling method includes one or more of a random undersampling algorithm, a Tomek link algorithm, a K-means clustering algorithm, a one-sided selection algorithm, and a synthetic minority oversampling algorithm.

According to the embodiment of the disclosure, the candidate risk access detection model includes any one of a candidate risk access detection model based on a decision tree classification algorithm, a candidate risk access detection model based on a nearest-neighbor classification algorithm, and a candidate risk access detection model based on a naive Bayes classification algorithm.

According to an embodiment of the present disclosure, the training a plurality of candidate risk access detection models respectively by using the training sample data set to obtain a plurality of trained risk access detection models includes:

dividing the training sample data set into a training set and a test set;

training the candidate risk access detection models by using the training set to respectively generate a plurality of risk access detection models to be tested;

testing the plurality of to-be-tested risk access detection models by using the test set to obtain the trained risk access detection models and corresponding test results, wherein the test results are used for representing the detection accuracy of the corresponding to-be-tested risk access detection models.

According to an embodiment of the present disclosure, the determining the target risk access detection model from the plurality of trained risk access detection models includes:

determining a target sampling method from a plurality of preset sampling methods according to the test result; and

determining the target risk access detection model from the plurality of trained risk access detection models according to the test result.

According to a second aspect of the present disclosure, there is provided a risk access detection method, including:

acquiring a log text, wherein the log text comprises original access behavior information;

preprocessing the original access behavior information to obtain preprocessed initial access behavior information;

sampling the initial access behavior information by using a target sampling method to obtain to-be-detected access behavior information; and

inputting the to-be-detected access behavior information into a risk access detection model and outputting a risk access detection result, wherein the risk access detection model is trained by the training method described above.

According to an embodiment of the present disclosure, the risk access detection method further includes:

in response to the detection result indicating that the access behavior of the user is an illegal access behavior, storing the to-be-detected access behavior information corresponding to the detection result together with label information indicating that the to-be-detected access behavior information corresponds to an illegal access behavior.

According to a third aspect of the present disclosure, there is provided a training apparatus for a risk access detection model, including:

a first obtaining module, configured to obtain an initial training sample data set, wherein an initial training sample in the initial training sample data set includes preprocessed initial access behavior information and label information corresponding to the initial access behavior information;

a first processing module, configured to process the initial training sample data set by using a preset sampling method to obtain a training sample data set, wherein training samples in the training sample data set include sampled access behavior information and the label information;

a training module, configured to train a plurality of candidate risk access detection models respectively by using the training sample data set to obtain a plurality of trained risk access detection models; and

a determining module, configured to determine a target risk access detection model from the plurality of trained risk access detection models.

According to a fourth aspect of the present disclosure, there is provided a risk access detection apparatus, comprising:

a second obtaining module, configured to obtain a log text, wherein the log text includes original access behavior information;

a preprocessing module, configured to preprocess the original access behavior information to obtain preprocessed initial access behavior information;

a second processing module, configured to sample the initial access behavior information by using a target sampling method to obtain to-be-detected access behavior information; and

a detection module, configured to input the to-be-detected access behavior information into a risk access detection model and output a risk access detection result, wherein the risk access detection model is trained by the training method described above.

According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:

one or more processors;

a memory configured to store one or more instructions,

wherein the one or more instructions, when executed by the one or more processors, cause the one or more processors to implement the method as described above.

According to a sixth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement the method as described above.

According to a seventh aspect of the present disclosure, there is provided a computer program product comprising computer executable instructions for implementing the method as described above when executed.

According to the embodiment of the disclosure, an initial training sample data set is processed by using a preset sampling method to obtain a training sample data set comprising sampled access behavior information and label information; a plurality of candidate risk access detection models are then trained respectively, and a target risk access detection model is determined. These technical means at least partially solve the technical problem that the security measures in the related art cannot meet users' network security requirements. As a result, risk access behavior can be detected accurately when access behavior information is examined with the risk access detection model, which improves the accuracy of risk access detection and reduces the cost of manual detection.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:

FIG. 1 schematically illustrates a system architecture to which the training method of a risk access detection model and the risk access detection method may be applied according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of a method of training a risk access detection model according to an embodiment of the present disclosure;

FIG. 3 schematically shows a flowchart of a method of obtaining an initial training sample data set according to an embodiment of the present disclosure;

FIG. 4 schematically illustrates a flow diagram of a method of deriving a plurality of trained risk access detection models, in accordance with an embodiment of the present disclosure;

FIG. 5 schematically illustrates a flow chart of a risk access detection method according to an embodiment of the present disclosure;

FIG. 6 schematically shows a block diagram of a training apparatus of a risk access detection model according to an embodiment of the present disclosure;

FIG. 7 schematically shows a block diagram of a risk access detection apparatus according to an embodiment of the present disclosure; and

FIG. 8 schematically shows a block diagram of an electronic device adapted to implement the training method of a risk access detection model and the risk access detection method according to an embodiment of the present disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).

In the technical solution of the present disclosure, the acquisition, storage, application and the like of the personal information of the users involved all comply with the relevant laws and regulations, necessary security measures are taken, and public order and good morals are not violated.

In the technical scheme of the disclosure, before the personal information of the user is acquired or collected, the authorization or the consent of the user is acquired.

With the development of computer technology and internet technology, network security risks keep rising, and ensuring the security of network information is becoming more and more important.

At present, technologies such as firewalls, identity authentication and data encryption are mainly adopted to secure network information. However, because websites differ in security performance and maintenance level, and because hacking techniques keep advancing, more and more websites suffer illegal intrusion attacks of varying severity, such as website Trojans and SQL injection (Structured Query Language injection); in addition, existing security technologies have many shortcomings in standardization, integrity, practicability and the like, and, for example, cannot meet users' network security requirements in terms of platform compatibility, protocol adaptability and multi-interface support.

In the process of implementing the disclosed concept, the inventors found that in practice the minority class is often the focus of research and attention, for example the small share of abnormal data in medical diagnosis, the few abnormal points in radar detection, and the few illegal visitors in website intrusion detection. The IIS (Internet Information Services) log of a website records the behavior information of each visit in real time, and illegal access behavior may be hidden in that information. It is therefore particularly important to sample, analyze and learn from the existing data set so as to find the relations and rules implicit in the data, to classify the data so that the minority class is identified and separated from the majority class, and to apply those relations or rules to new data for classification prediction.

In order to at least partially solve the technical problems in the related art, the present disclosure provides a training method for a risk access detection model, a risk access detection method and apparatus, an electronic device, and a computer-readable storage medium, which may be applied to the technical field of information security and to the financial field. The training method for the risk access detection model comprises: acquiring an initial training sample data set, wherein each initial training sample in the initial training sample data set comprises preprocessed initial access behavior information and label information corresponding to the initial access behavior information; processing the initial training sample data set by using a preset sampling method to obtain a training sample data set, wherein each training sample in the training sample data set comprises sampled access behavior information and the label information; training a plurality of candidate risk access detection models respectively by using the training sample data set to obtain a plurality of trained risk access detection models; and determining a target risk access detection model from the plurality of trained risk access detection models.

It should be noted that the training method of the risk access detection model, the risk access detection method and the risk access detection device provided by the embodiments of the present disclosure may be used in the technical field of information security and in the financial field, for example, to improve the security performance of a bank website. They may also be used in any field other than the technical field of information security and the financial field, such as unbalanced data processing. Their application field is not limited.

Fig. 1 schematically shows a system architecture to which the training method of a risk access detection model and the risk access detection method may be applied according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be applied to other devices, systems, environments or scenarios.

As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a shopping-like application, a web browser application, a search-like application, an instant messaging tool, a mailbox client, and/or social platform software, etc. (by way of example only).

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (for example only) that supports websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data such as user requests, store the processing result in a local database, and feed the processing result (for example, a webpage, information, or data obtained or generated according to the user request) back to the terminal device.

It should be noted that the training method of the risk access detection model and the risk access detection method provided by the embodiments of the present disclosure may generally be executed by the server 105. Accordingly, the training device of the risk access detection model and the risk access detection device provided by the embodiments of the present disclosure may generally be disposed in the server 105. The training method of the risk access detection model and the risk access detection method provided by the embodiments of the present disclosure may also be executed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the training device of the risk access detection model and the risk access detection device provided in the embodiments of the present disclosure may also be disposed in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Alternatively, the training method of the risk access detection model and the risk access detection method provided by the embodiments of the present disclosure may also be executed by the terminal device 101, 102, or 103, or by another terminal device different from the terminal device 101, 102, or 103. Correspondingly, the training device of the risk access detection model and the risk access detection device provided by the embodiments of the present disclosure may also be disposed in the terminal device 101, 102, or 103, or in another terminal device different from the terminal device 101, 102, or 103.

For example, the log text may be originally stored in any one of the terminal devices 101, 102, or 103 (for example, but not limited to, the terminal device 101), or stored on an external storage device and imported into the terminal device 101. Then, the terminal device 101 may locally perform the training method of the risk access detection model and the risk access detection method provided in the embodiments of the present disclosure, or send the log text to another terminal device, a server, or a server cluster, so that the other terminal device, server, or server cluster that receives the log text performs the training method of the risk access detection model and the risk access detection method provided in the embodiments of the present disclosure.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Fig. 2 schematically shows a flow chart of a method of training a risk access detection model according to an embodiment of the present disclosure.

As shown in fig. 2, the training method of the risk access detection model includes operations S201 to S204.

In operation S201, an initial training sample data set is obtained, where an initial training sample in the initial training sample data set includes preprocessed initial access behavior information and label information corresponding to the initial access behavior information.

According to an embodiment of the present disclosure, the label information may be used to characterize whether the access behavior corresponding to the initial access behavior information is a normal access behavior or an illegal access behavior.

In operation S202, the initial training sample data set is processed by using a preset sampling method to obtain a training sample data set, where a training sample in the training sample data set includes sampled access behavior information and the label information.

According to the embodiment of the disclosure, the data included in the initial access behavior information generally belongs to unbalanced data, that is, the normal access data and the illegal access data may exhibit imbalance in number, and the normal access data may be regarded as a majority class and the illegal access data may be regarded as a minority class.
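The imbalance described above, and the simplest way of correcting it, can be illustrated as follows. This is a minimal sketch rather than the disclosed implementation: the function name `random_undersample`, the toy data, and the encoding 0 = normal access, 1 = illegal access are assumptions made only for illustration.

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until every class has as many samples as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())  # size of the minority class
    kept = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        kept.extend(rng.sample(idx, target))  # keep `target` random samples of this class
    kept.sort()  # preserve the original ordering of the retained samples
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Hypothetical toy data: 8 normal accesses (0) and 2 illegal accesses (1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_undersample(X, y)
```

After the call, both classes contain the same number of samples, at the cost of discarding part of the majority class.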

In operation S203, a plurality of candidate risk access detection models are respectively trained by using the training sample data set, so as to obtain a plurality of trained risk access detection models.

According to an embodiment of the present disclosure, the plurality of candidate risk access detection models may be models constructed based on different classification algorithms.

In operation S204, a target risk access detection model is determined from the plurality of trained risk access detection models.

The method shown in fig. 2 is further described with reference to fig. 3-5 in conjunction with specific embodiments.

According to the embodiment of the disclosure, the preset sampling method comprises one or more of a random undersampling algorithm, a Tomek link algorithm, a K-means clustering algorithm, a one-sided selection algorithm and a synthetic minority oversampling algorithm.

According to the embodiments of the present disclosure, for example, a combination of the Synthetic Minority Oversampling Technique (SMOTE) and the Tomek link algorithm (SMOTE-Tomek), a combination of the SMOTE algorithm and the K-means clustering algorithm (K-Means), or a combination of the SMOTE algorithm and the One-Sided Selection algorithm (OSS) may be selected as the preset sampling method.
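One such combination, SMOTE followed by Tomek link removal, might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the disclosed implementation: the function names, the choice of k, the toy data, and the decision to drop only the majority-class member of each Tomek link (other variants drop both members) are all assumptions.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

def smote_tomek(X, y, minority_label=1, seed=0):
    """Oversample the minority class to parity, then drop the majority member of each Tomek link."""
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)  # majority count minus minority count
    new_points = smote(minority, n_new, seed=seed)
    X2 = X + new_points
    y2 = y + [minority_label] * len(new_points)
    drop = {i if y2[i] != minority_label else j for i, j in tomek_links(X2, y2)}
    keep = [i for i in range(len(X2)) if i not in drop]
    return [X2[i] for i in keep], [y2[i] for i in keep]

# Hypothetical toy data: 0 = normal access, 1 = illegal access.
X = [[0.0], [1.0], [2.0], [3.0], [3.4], [8.0], [9.0]]
y = [0, 0, 0, 0, 1, 1, 1]
X_res, y_res = smote_tomek(X, y)
```

The oversampling step adds minority samples until the classes are balanced, and the cleaning step then removes majority samples that sit directly on the class boundary.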

According to the embodiment of the disclosure, the candidate risk access detection model comprises any one of a candidate risk access detection model based on a decision tree classification algorithm, a candidate risk access detection model based on a nearest-neighbor classification algorithm and a candidate risk access detection model based on a naive Bayes classification algorithm.

According to the embodiment of the disclosure, the data set can be balanced by using different sampling methods to obtain the training sample data sets, and the candidate risk access detection models built on different classification algorithms are then trained respectively on those data sets. In this way, the target sampling method and the target risk access detection model with the best detection performance can be determined from the free combinations of the different sampling algorithms and the different classification algorithms, which improves the accuracy of risk access detection.
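The selection over free combinations of sampling methods and classifiers can be sketched as a simple grid search. Everything below is hypothetical and deliberately tiny: the two stand-in samplers, the two stand-in classifiers (a one-dimensional nearest-neighbor rule and a majority-class baseline), the toy data, and the use of plain accuracy as the test result are assumptions for illustration, not the models or metrics of the disclosure.

```python
from collections import Counter
from itertools import product

def no_sampling(X, y):
    return X, y

def undersample_first(X, y):
    """Keep the first n samples of each class, n being the smallest class size."""
    n = min(Counter(y).values())
    seen = Counter()
    X_out, y_out = [], []
    for x, label in zip(X, y):
        if seen[label] < n:
            seen[label] += 1
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out

def fit_nearest_neighbor(X, y):
    """Training returns a predict function: label of the closest training point (1-D toy)."""
    def predict(x):
        i = min(range(len(X)), key=lambda i: abs(X[i][0] - x[0]))
        return y[i]
    return predict

def fit_majority(X, y):
    """Baseline that always predicts the most frequent training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label

def select_best(samplers, models, train, test):
    """Evaluate every (sampler, model) pair on the test set; return the best pair and all scores."""
    (X_train, y_train), (X_test, y_test) = train, test
    scores = {}
    for (s_name, sampler), (m_name, fit) in product(samplers.items(), models.items()):
        predict = fit(*sampler(X_train, y_train))
        hits = sum(predict(x) == label for x, label in zip(X_test, y_test))
        scores[(s_name, m_name)] = hits / len(y_test)
    return max(scores, key=scores.get), scores

samplers = {"none": no_sampling, "undersample": undersample_first}
models = {"nearest_neighbor": fit_nearest_neighbor, "majority": fit_majority}
train = ([[0.0], [1.0], [2.0], [3.0], [9.0]], [0, 0, 0, 0, 1])
test = ([[0.5], [8.0], [10.0]], [0, 1, 1])
best, scores = select_best(samplers, models, train, test)
```

Here `best` names the winning (sampling method, classifier) pair, which corresponds to the target sampling method and the target model. On strongly unbalanced test data a metric other than plain accuracy may be more informative.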

Fig. 3 schematically shows a flowchart of a method of acquiring an initial training sample data set according to an embodiment of the present disclosure.

As shown in fig. 3, the method for obtaining an initial training sample data set includes operations S301 to S304.

In operation S301, a plurality of log texts are obtained, where each log text in the plurality of log texts includes original access behavior information of a user, the original access behavior information is used to characterize an access behavior of the user, and the original access behavior information includes data of a plurality of attribute categories.

According to the embodiment of the present disclosure, Microsoft SQL Server 2000 (a relational database management system) may be installed and run on a server; the data source and the database name required for import or export by DTS (Data Transformation Services) are set, a data transformation task is created, and the DTS package is run so that IIS (Internet Information Services) log texts are acquired from the website server by using DTS.

According to the embodiment of the disclosure, a collection log database can be established on the server and used for storing the obtained log text.

According to the embodiment of the disclosure, the attribute category includes any two or more of an access request mode, an access resource, an accessor query parameter, a server port number, a protocol state, a protocol sub-state, a system state, an access duration, an access date, an access time, a server address, an access user name, an access user address, and a user agent.

According to the embodiment of the disclosure, the access request mode can be expressed as cs-method and can include GET, POST, and HEAD.

According to an embodiment of the present disclosure, an access resource may be denoted as cs-uri-stem and may include the URI (Uniform Resource Identifier) of the accessed page.

According to embodiments of the present disclosure, the accessor query parameter may be denoted as cs-uri-query and may include the URI query that the visitor is attempting to perform.

According to an embodiment of the present disclosure, the protocol status may be denoted sc-status, and may for example indicate a successful communication at a value of 200.

According to embodiments of the present disclosure, the access user name may be denoted as cs-username, which may be denoted by "-" when the access user is anonymous, for example.

According to embodiments of the present disclosure, the user agent may be denoted as cs(User-Agent), and may include, for example, the client browser, the operating system, and the like.

According to an embodiment of the disclosure, the server port number may be represented as s-port, the protocol sub-state as sc-substatus, the system state as sc-win32-status, the access duration as time-taken, the access date as date, the access time as time, the server address as s-ip, and the access user address as c-ip.
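By way of a non-limiting illustration, the attribute categories listed above can be read out of an IIS log text with a few lines of Python. This is a minimal sketch, not the DTS-based acquisition described in the disclosure: the sample log lines, the function name `parse_iis_log`, and the choice of fields in the `#Fields:` directive are assumptions made for the example. IIS writes the field names in lowercase (for example cs-uri-stem), which the sketch follows.

```python
# Hypothetical sample log text (illustrative values only). The "#Fields:"
# directive names the attribute categories present in each record.
SAMPLE_LOG = """#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2021-12-30 08:15:02 10.0.0.5 GET /index.html - 80 - 192.168.1.20 Mozilla/5.0 200 0 0 15
2021-12-30 08:15:09 10.0.0.5 POST /login.aspx id=1 80 - 192.168.1.77 curl/7.68.0 404 0 2 3
"""


def parse_iis_log(text):
    """Turn a W3C-style log text into a list of {field name: value} records."""
    fields = []
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # Remember the column names announced by the directive.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            continue  # other directives (#Software, #Version, ...) are skipped
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed records instead of guessing
        records.append(dict(zip(fields, values)))
    return records


records = parse_iis_log(SAMPLE_LOG)
```

Each record is a dictionary keyed by attribute category, so `records[0]["cs-method"]` gives the access request mode of the first log entry.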

In operation S302, the original access behavior information is screened to obtain to-be-processed access behavior information, where the number of attribute categories in the to-be-processed access behavior information is smaller than the number of attribute categories in the original access behavior information.

According to the embodiment of the disclosure, the acquired log texts can be rearranged according to the required attribute categories, and the information of illegal access behaviors, which accounts for only a small proportion of the log texts, is separated out and used as the key to risk access detection.
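As an illustrative sketch of the screening in operation S302, a record can be reduced to a smaller set of attribute categories as follows. The list `REQUIRED_ATTRIBUTES` and the function name `screen_record` are assumptions made for the example; the disclosure does not fix which subset is retained.

```python
# Hypothetical choice of required attribute categories (an assumption).
REQUIRED_ATTRIBUTES = ["cs-method", "cs-uri-stem", "cs-uri-query", "sc-status", "time-taken"]


def screen_record(record, required=REQUIRED_ATTRIBUTES):
    """Keep only the required attribute categories, in the required order."""
    return {name: record[name] for name in required if name in record}


# Illustrative original access behavior information with 14 attribute categories.
original = {
    "date": "2021-12-30", "time": "08:15:02", "s-ip": "10.0.0.5",
    "cs-method": "GET", "cs-uri-stem": "/index.html", "cs-uri-query": "-",
    "s-port": "80", "cs-username": "-", "c-ip": "192.168.1.20",
    "cs(User-Agent)": "Mozilla/5.0", "sc-status": "200",
    "sc-substatus": "0", "sc-win32-status": "0", "time-taken": "15",
}

screened = screen_record(original)
```

The screened record contains fewer attribute categories than the original record, as operation S302 requires.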

In operation S303, the access behavior information to be processed is preprocessed to obtain initial access behavior information, where the preprocessing includes normalization processing.

According to the embodiment of the disclosure, normalization processing can be performed on the access behavior information to be processed, so that the information of each attribute category is in the same order of magnitude in value, and thus, comprehensive comparison and analysis can be performed subsequently.

According to the embodiment of the disclosure, a 0-1 normalization method can be adopted to normalize the access behavior information to be processed, and the operation can be performed according to the following formula (1) to obtain a normalized value z, where z is between 0 and 1.

$$z = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{1}$$

Where x represents the access behavior information to be processed, max(x) represents the maximum value in the access behavior information to be processed, and min(x) represents the minimum value in the access behavior information to be processed.

According to the embodiment of the disclosure, because the information of different attribute categories differs greatly in value, if the access behavior information to be processed is used directly for training, the contours of the cost function of the resulting model will be flat and elongated; through normalization, each attribute can be treated to the same degree.
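The 0-1 normalization described above can be sketched for a single attribute category as follows. The function name `min_max_normalize` and the sample values are assumptions made for the example; mapping a constant column to 0.0 is likewise an assumption, since the disclosure does not address the case where max(x) equals min(x).

```python
def min_max_normalize(values):
    """Map each value x to (x - min) / (max - min), so results lie in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no information, map to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative access durations for three log records.
durations = [15, 3, 27]
normalized = min_max_normalize(durations)
```

Applying the same function to every retained attribute category places all of them on the same scale before training.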

According to an embodiment of the present disclosure, the initial access behavior information after the normalization process may be stored in a process log database.

In operation S304, a training sample data set is generated according to the initial access behavior information, where the tag information is used to represent that the access behavior corresponding to the initial access behavior information belongs to a normal access behavior or an illegal access behavior.

According to an embodiment of the present disclosure, for example, when the tag information is 0, it is characterized that the access behavior belongs to a normal access behavior; when the label information is 1, the access behavior is characterized to belong to illegal access behaviors.

According to the embodiment of the disclosure, the original access behavior information in the log text is screened and preprocessed, and the training sample data set is generated according to the obtained initial access behavior information, so that the required attribute categories can be selected from the access behavior information and the information of the corresponding attribute categories is mapped to between 0 and 1, whereby the information of each required attribute category is treated to the same degree.

FIG. 4 schematically illustrates a flow chart of a method of deriving a plurality of trained risk access detection models, according to an embodiment of the present disclosure.

As shown in FIG. 4, the method of deriving a plurality of trained risk access detection models includes operations S401-S403.

In operation S401, a training sample data set is divided into a training set and a test set.

In operation S402, a plurality of candidate risk access detection models are trained using a training set, and a plurality of to-be-tested risk access detection models are respectively generated.

In operation S403, the plurality of to-be-tested risk access detection models are trained by using the test set, so as to obtain a plurality of trained risk access detection models and corresponding test results, where the test results are used to characterize the detection accuracy of the corresponding to-be-tested risk access detection models.

According to the embodiment of the disclosure, the candidate risk access detection models can be tested by using ten-fold cross-validation, the test results can be displayed, and the trained candidate risk access detection models can be stored in a model database.
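By way of illustration only, ten-fold cross-validation can be sketched in plain Python as follows. The sketch uses a one-nearest-neighbor classifier as a stand-in for a candidate model; the function names, the toy feature vectors, and the fixed random seed are assumptions made for the example, and the sketch assumes there are at least as many samples as folds.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training sample closest to the query (1-NN)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], query))
    return train_y[best]


def cross_validate(features, labels, k=10, seed=0):
    """Return the mean accuracy over k folds, each fold held out once."""
    accuracies = []
    for held_out in k_fold_indices(len(features), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(features) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        correct = sum(
            nearest_neighbor_predict(train_x, train_y, features[i]) == labels[i]
            for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Toy normalized samples: label 0 (normal) near (0.1, 0.1), label 1 (illegal)
# near (0.9, 0.9). These values are illustrative, not measured data.
features = [[0.1 + 0.01 * i, 0.1] for i in range(10)] + [[0.9 - 0.01 * i, 0.9] for i in range(10)]
labels = [0] * 10 + [1] * 10
mean_accuracy = cross_validate(features, labels, k=10)
```

Running the same procedure once per candidate model and per preset sampling method yields one test result for each combination.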

According to an embodiment of the present disclosure, determining a target risk access detection model from a plurality of trained risk access detection models comprises:

determining a target sampling method from a plurality of preset sampling methods according to the test results; and determining the target risk access detection model from the trained risk access detection models according to the test results.

According to the embodiment of the disclosure, the trained risk access detection model and the corresponding test result can be obtained by dividing the training sample data set into the training set and the test set, so that the determination of the target risk access detection model according to the test result can be realized, and the accuracy of risk access detection can be improved.
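One possible way to determine the target sampling method and the target model from the test results is to take the combination with the highest detection accuracy. The following is a minimal sketch under that assumption; the disclosure does not state the selection rule in this section, and the accuracy values, dictionary keys, and function name `select_target` are placeholders invented for the example rather than reported results.

```python
def select_target(test_results):
    """Pick the (sampling method, model) pair with the highest accuracy.

    test_results maps (sampling_method, model_name) -> detection accuracy.
    """
    (sampling, model), accuracy = max(test_results.items(), key=lambda item: item[1])
    return sampling, model, accuracy


# Placeholder test results (made-up numbers used only to exercise the code).
test_results = {
    ("random_undersampling", "decision_tree"): 0.91,
    ("random_undersampling", "naive_bayes"): 0.88,
    ("smote", "decision_tree"): 0.95,
    ("smote", "nearest_neighbor"): 0.93,
}

target_sampling, target_model, best_accuracy = select_target(test_results)
```

If several combinations tie, `max` returns the first one encountered, so a deployment would need its own tie-breaking rule.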

Fig. 5 schematically shows a flow chart of a risk access detection method according to an embodiment of the present disclosure.

As shown in fig. 5, the risk access detection method includes operations S501 to S504.

In operation S501, a log text is obtained, where the log text includes original access behavior information.

In operation S502, the original access behavior information is preprocessed to obtain preprocessed initial access behavior information.

According to embodiments of the present disclosure, the preprocessing may include normalization processing, data cleansing, and the like.

In operation S503, the initial access behavior information is sampled by using a target sampling method, so as to obtain the access behavior information to be detected.

In operation S504, the access behavior information to be detected is input into the risk access detection model, and a risk access detection result is output, where the risk access detection model is obtained by training according to the above training method of the risk access detection model.

According to the embodiment of the disclosure, the information of the access behavior to be detected after preprocessing and sampling processing can be input into the risk access detection model for detection, the detection result can represent that the access behavior belongs to normal access or illegal access, and the detection result of each log text can be displayed.

According to an embodiment of the present disclosure, the risk access detection method further includes: in response to the detection result indicating that the access behavior of the user is an illegal access behavior, storing the access behavior information to be detected corresponding to the detection result and the label information indicating that the access behavior information to be detected corresponds to an illegal access behavior.

According to the embodiment of the disclosure, the label information representing the illegal access behavior and the access behavior information to be detected corresponding to the illegal access behavior can be added into the test set, so that the characteristics of the illegal access behavior can be enriched.
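The detection and storage steps described above can be sketched as follows. The label values follow the convention stated earlier (0 for normal, 1 for illegal); the function `detect_and_store`, the in-memory list used as the store, and the threshold-based `toy_model` are assumptions made for the example and stand in for the trained risk access detection model and the database.

```python
NORMAL = 0
ILLEGAL = 1


def detect_and_store(samples, predict, illegal_store):
    """Classify each sample and keep the illegal ones together with their label."""
    results = []
    for sample in samples:
        label = predict(sample)
        results.append(label)
        if label == ILLEGAL:
            # Store the access behavior information and its label so that it
            # can later enrich the sample set.
            illegal_store.append((sample, ILLEGAL))
    return results


def toy_model(sample):
    """Stand-in model: flags a sample whose first normalized feature exceeds 0.8."""
    return ILLEGAL if sample[0] > 0.8 else NORMAL


store = []
results = detect_and_store([[0.1, 0.2], [0.95, 0.4], [0.3, 0.9]], toy_model, store)
```

Only the second sample is flagged, so the store ends up holding that sample paired with the illegal label.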

According to the embodiment of the disclosure, the original access behavior information in the log text is preprocessed and sampled to obtain the access behavior information to be detected, and a risk access detection result is obtained by using a risk access detection model. By the technical means, risk access detection on user access can be realized, so that the safety performance of a website is improved, and the network environment is optimized.

Fig. 6 schematically shows a block diagram of a training apparatus of a risk access detection model according to an embodiment of the present disclosure.

As shown in fig. 6, the risk access detection model training apparatus 600 includes: a first acquisition module 601, a first processing module 602, a training module 603, and a determination module 604.

A first obtaining module 601, configured to obtain an initial training sample data set, where an initial training sample in the initial training sample data set includes preprocessed initial access behavior information and tag information corresponding to the initial access behavior information.

The first processing module 602 is configured to process an initial training sample data set by using a preset sampling method to obtain a training sample data set, where a training sample in the training sample data set includes access behavior information and tag information that are subjected to sampling processing.

The training module 603 is configured to train a plurality of candidate risk access detection models respectively by using a training sample data set, so as to obtain a plurality of trained risk access detection models.

A determining module 604 for determining a target risk access detection model from the plurality of trained risk access detection models.

According to an embodiment of the present disclosure, the first obtaining module 601 includes: an acquisition unit, a screening unit, a preprocessing unit, and a generating unit.

The acquisition unit is configured to acquire a plurality of log texts, where each log text in the plurality of log texts includes original access behavior information of a user, the original access behavior information is used to characterize the access behavior of the user, and the original access behavior information includes data of a plurality of attribute categories.

The screening unit is configured to screen the original access behavior information to obtain to-be-processed access behavior information, where the number of attribute categories in the to-be-processed access behavior information is smaller than the number of attribute categories in the original access behavior information.

The preprocessing unit is configured to preprocess the to-be-processed access behavior information to obtain initial access behavior information, where the preprocessing includes normalization processing.

The generating unit is configured to generate a training sample data set according to the initial access behavior information, where the label information is used to represent that the access behavior corresponding to the initial access behavior information belongs to a normal access behavior or an illegal access behavior.

According to an embodiment of the present disclosure, the training module 603 includes: a dividing unit, a first training unit, and a second training unit.

The dividing unit is configured to divide the training sample data set into a training set and a test set.

The first training unit is configured to train the plurality of candidate risk access detection models by using the training set, and to respectively generate a plurality of to-be-tested risk access detection models.

The second training unit is configured to train the plurality of to-be-tested risk access detection models by using the test set to obtain the plurality of trained risk access detection models and corresponding test results, where the test results are used to characterize the detection accuracy of the corresponding to-be-tested risk access detection models.

According to an embodiment of the present disclosure, the determining module 604 includes: a first determining unit and a second determining unit.

The first determining unit is configured to determine a target sampling method from a plurality of preset sampling methods according to the test results.

The second determining unit is configured to determine the target risk access detection model from the trained risk access detection models according to the test results.

Fig. 7 schematically shows a block diagram of a risk access detection apparatus according to an embodiment of the present disclosure.

As shown in fig. 7, the risk access detection apparatus 700 includes: a second obtaining module 701, a preprocessing module 702, a second processing module 703 and a detecting module 704.

A second obtaining module 701, configured to obtain a log text, where the log text includes original access behavior information.

The preprocessing module 702 is configured to preprocess the original access behavior information to obtain preprocessed initial access behavior information.

The second processing module 703 is configured to perform sampling processing on the initial access behavior information by using a target sampling method, so as to obtain to-be-detected access behavior information.

The detection module 704 is configured to input the access behavior information to be detected into the risk access detection model and output a risk access detection result, where the risk access detection model is obtained by training according to the above training method of the risk access detection model.

According to an embodiment of the present disclosure, the risk access detection apparatus 700 further includes a storage module.

The storage module is configured to, in response to the detection result indicating that the access behavior of the user is an illegal access behavior, store the access behavior information to be detected corresponding to the detection result and the label information indicating that the access behavior information to be detected corresponds to an illegal access behavior.

Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.

For example, any plurality of the first obtaining module 601, the first processing module 602, the training module 603, the determining module 604, the second obtaining module 701, the preprocessing module 702, the second processing module 703 and the detecting module 704 may be combined and implemented in one module/unit/sub-unit, or any one of the modules/units/sub-units may be split into a plurality of modules/units/sub-units. Alternatively, at least part of the functionality of one or more of these modules/units/sub-units may be combined with at least part of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to an embodiment of the present disclosure, at least one of the first obtaining module 601, the first processing module 602, the training module 603, the determining module 604, the second obtaining module 701, the preprocessing module 702, the second processing module 703 and the detecting module 704 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented in any one of software, hardware, and firmware, or in a suitable combination of any of them. Alternatively, at least one of the first obtaining module 601, the first processing module 602, the training module 603, the determining module 604, the second obtaining module 701, the preprocessing module 702, the second processing module 703 and the detecting module 704 may be at least partially implemented as a computer program module, which when executed, may perform a corresponding function.

It should be noted that, in the embodiment of the present disclosure, the training device portion of the risk access detection model corresponds to the training method portion of the risk access detection model in the embodiment of the present disclosure, and the description of the training device portion of the risk access detection model specifically refers to the training method portion of the risk access detection model, which is not described herein again. The risk access detection device part in the embodiment of the present disclosure corresponds to the risk access detection method part in the embodiment of the present disclosure, and the description of the risk access detection device part specifically refers to the risk access detection method part, which is not described herein again.

Fig. 8 schematically shows a block diagram of an electronic device adapted to implement a risk access detection model training method, a risk access detection method according to an embodiment of the present disclosure. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 801 may also include onboard memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing different actions of the method flows according to embodiments of the present disclosure.

In the RAM 803, various programs and data necessary for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or RAM 803. Note that the programs may also be stored in one or more memories other than the ROM 802 and RAM 803. The processor 801 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.

According to an embodiment of the present disclosure, the electronic device 800 may further include an input/output (I/O) interface 805, which is also connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read out therefrom is installed into the storage section 808 as necessary.

According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program, when executed by the processor 801, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.

According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 802 and/or RAM 803 described above and/or one or more memories other than the ROM 802 and RAM 803.

Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method provided by the embodiments of the present disclosure, when the computer program product is run on an electronic device, the program code being configured to cause the electronic device to implement the method for training a risk access detection model and the method for risk access detection provided by the embodiments of the present disclosure.

The computer program, when executed by the processor 801, performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.

In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of a signal on a network medium, distributed, downloaded and installed via communication section 809, and/or installed from removable media 811. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.

In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high level procedural and/or object oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure can be combined and/or incorporated in various ways, even if such combinations or incorporations are not expressly recited in the present disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or incorporated in various ways without departing from the spirit or teaching of the present disclosure. All such combinations and/or incorporations are within the scope of the present disclosure.

The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims

1. A training method of a risk access detection model comprises the following steps:

determining a target risk visit detection model from the plurality of trained risk visit detection models.

2. The method of claim 1, wherein said obtaining an initial training sample data set comprises:

obtaining a plurality of log texts, wherein each log text in the plurality of log texts comprises original access behavior information of a user, the original access behavior information is used for representing the access behavior of the user, and the original access behavior information comprises data of a plurality of attribute categories;

screening the original access behavior information to obtain to-be-processed access behavior information, wherein the number of the attribute categories in the to-be-processed access behavior information is smaller than the number of the attribute categories in the original access behavior information;

and generating the training sample data set according to the initial access behavior information, wherein the label information is used for representing that the access behavior corresponding to the initial access behavior information belongs to a normal access behavior or an illegal access behavior.

3. The method of claim 2, wherein the attribute categories include any two or more of access request mode, access resource, visitor query parameter, server port number, protocol state, protocol sub-state, system state, access duration, access date, access time, server address, access user name, access user address, and user agent.

4. The method of any one of claims 1 to 3, wherein the preset sampling method comprises one or more of a random undersampling algorithm, a Tomek links algorithm, a mean clustering algorithm, a one-sided selection algorithm, and a synthetic minority oversampling algorithm.

5. The method according to any one of claims 1 to 3, wherein the candidate risk access detection models comprise any one of a candidate risk access detection model based on a decision tree classification algorithm, a candidate risk access detection model based on a nearest-neighbor classification algorithm, and a candidate risk access detection model based on a naive Bayesian classification algorithm.

6. The method of claim 1, wherein the training a plurality of candidate risk access detection models separately using the training sample data set, resulting in a plurality of trained risk access detection models comprises:

dividing the training sample data set into a training set and a test set;

training a plurality of to-be-tested risk access detection models by using the test set to obtain a plurality of trained risk access detection models and corresponding test results, wherein the test results are used for representing the detection accuracy of the corresponding to-be-tested risk access detection models.

7. The method of claim 6, wherein the determining a target risk visit detection model from a plurality of trained risk visit detection models comprises:

determining the target risk visit detection model from the plurality of trained risk visit detection models according to the test result.

8. A risk access detection method, comprising:

inputting the information of the visit behavior to be detected into a risk visit detection model, and outputting a risk visit detection result, wherein the risk visit detection model is obtained by training the risk visit detection model according to any one of claims 1 to 7.

9. The method of claim 8, further comprising:

in response to the detection result indicating that the access behavior of the user is an illegal access behavior, storing the to-be-detected access behavior information corresponding to the detection result together with label information indicating that the to-be-detected access behavior information represents an illegal access behavior.
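The storing step of claim 9 can be sketched as below, assuming the stored records are later reused as labeled samples. Everything named here is hypothetical: `detect_and_record`, the `ILLEGAL` label value, the in-memory list used as storage, and the port-based `toy_predict` rule that stands in for a trained risk access detection model.

```python
ILLEGAL = 1  # hypothetical label value for an illegal (risk) access
NORMAL = 0   # hypothetical label value for a normal access


def detect_and_record(predict, access_info, flagged_store):
    """Run detection and, if the access is flagged as illegal, store it with its label."""
    result = predict(access_info)
    if result == ILLEGAL:
        flagged_store.append((access_info, ILLEGAL))
    return result


def toy_predict(info):
    # Hypothetical rule standing in for a trained model: flag unusual server ports.
    return ILLEGAL if info["server_port"] not in (80, 443) else NORMAL


flagged = []
detect_and_record(toy_predict, {"server_port": 443, "method": "GET"}, flagged)
detect_and_record(toy_predict, {"server_port": 8081, "method": "POST"}, flagged)
```

Only the second request ends up in `flagged`, paired with the illegal-access label.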

10. A training apparatus for a risk access detection model, comprising:

a first acquisition module configured to acquire an initial training sample data set, wherein an initial training sample in the initial training sample data set comprises preprocessed initial access behavior information and label information corresponding to the initial access behavior information;

a determination module configured to determine a target risk access detection model from the plurality of trained risk access detection models.

11. A risk access detection device comprising:

a detection module configured to input to-be-detected access behavior information into a risk access detection model and output a risk access detection result, wherein the risk access detection model is obtained by training with the training method of any one of claims 1 to 7.

12. An electronic device, comprising:

one or more processors;

a memory configured to store one or more instructions,

wherein the one or more instructions, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.

13. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 9.

14. A computer program product comprising computer-executable instructions which, when executed, implement the method of any one of claims 1 to 9.